pyspark.RDD.histogram#
- RDD.histogram(buckets)[source]#
- Compute a histogram using the provided buckets. The buckets are all open to the right except for the last which is closed. e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50], which means 1<=x<10, 10<=x<20, 20<=x<=50. And on the input of 1 and 50 we would have a histogram of 1,0,1. - If your histogram is evenly spaced (e.g. [0, 10, 20, 30]), this can be switched from an O(log n) insertion to O(1) per element (where n is the number of buckets). - Buckets must be sorted, not contain any duplicates, and have at least two elements. - If buckets is a number, it will generate buckets which are evenly spaced between the minimum and maximum of the RDD. For example, if the min value is 0 and the max is 100, given buckets as 2, the resulting buckets will be [0,50) [50,100]. buckets must be at least 1. An exception is raised if the RDD contains infinity. If the elements in the RDD do not vary (max == min), a single bucket will be used. - New in version 1.2.0. - Parameters
- bucketsint, or list, or tuple
- if buckets is a number, it computes a histogram of the data using buckets number of buckets evenly, otherwise, buckets is the provided buckets to bin the data. 
 
- Returns
- tuple
- a tuple of buckets and histogram 
 
 - See also - Examples - >>> rdd = sc.parallelize(range(51)) >>> rdd.histogram(2) ([0, 25, 50], [25, 26]) >>> rdd.histogram([0, 5, 25, 50]) ([0, 5, 25, 50], [5, 20, 26]) >>> rdd.histogram([0, 15, 30, 45, 60]) # evenly spaced buckets ([0, 15, 30, 45, 60], [15, 15, 15, 6]) >>> rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"]) >>> rdd.histogram(("a", "b", "c")) (('a', 'b', 'c'), [2, 2])