Josh - 07 Mar 2014

Lately I've been writing a lot of Spark jobs that perform statistical analysis on datasets. One thing I didn't realize right away is that RDDs have built-in support for basic statistical functions such as mean, variance, sample variance, and standard deviation.

These operations are available on RDDs of Double:

    import org.apache.spark.SparkContext._ // implicit conversions in here

    val myRDD = newRDD().map { _.toDouble }
    myRDD.mean
    myRDD.sampleVariance // divides by n-1
    myRDD.sampleStdev    // divides by n-1

Getting It All At Once

If you're interested in calling multiple stats functions at the same time, it's a better idea to get them all in a single pass, since each individual call runs its own job over the data. Spark provides the stats method in DoubleRDDFunctions for that; it also returns the total count of the RDD.

Histograms

Means and standard deviation are a decent starting point when you're looking at a new dataset;
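As a minimal sketch of the single-pass approach from "Getting It All At Once" above: calling stats on an RDD of Double returns a StatCounter, from which the count, mean, and both flavors of variance and standard deviation can be read without touching the data again. The local master, the app name, the StatsSketch object, and the sample values are assumptions made up for illustration; they are not from the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // implicit conversions (needed on older Spark versions)

// Hypothetical driver object, for illustration only
object StatsSketch {
  def main(args: Array[String]): Unit = {
    // Assumed configuration: run locally with a made-up app name
    val conf = new SparkConf().setMaster("local[*]").setAppName("stats-sketch")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data standing in for a real dataset
      val myRDD = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0))

      // A single pass over the RDD; the result is a StatCounter
      val s = myRDD.stats()

      println(s"count          = ${s.count}")
      println(s"mean           = ${s.mean}")
      println(s"variance       = ${s.variance}")       // divides by n
      println(s"stdev          = ${s.stdev}")          // divides by n
      println(s"sampleVariance = ${s.sampleVariance}") // divides by n-1
      println(s"sampleStdev    = ${s.sampleStdev}")    // divides by n-1
    } finally {
      sc.stop()
    }
  }
}
```

Reading several values off the one StatCounter is what saves the extra passes; calling myRDD.mean and then myRDD.sampleVariance separately would scan the data twice.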
Read full article from Statistics With Spark