Statistics With Spark



Statistics With Spark

Josh - 07 Mar 2014 Lately I've been writing a lot of Spark Jobs that perform some statistical analysis on datasets. One of the things I didn't realize right away - is that RDD's have built in support for basic statistic functions like mean, variance, sample variance, standard deviation. These operations are avaible on RDD's of Double import org.apache.spark.SparkContext._ // implicit conversions in here val myRDD = newRDD().map { _.toDouble } myRDD.mean myRDD.sampleVariance // divides by n-1 myRDD.sampleStdDev // divides by n-1 Getting It All At Once If you're interested in calling multiple stats functions at the same time, it's a better idea to get them all in a single pass. Spark provides the stats method in DoubleRDDFunctions for that; it also provides the total count of the RDD as well. Histograms Means and standard deviation are a decent starting point when you're looking at a new dataset;

Read full article from Statistics With Spark


No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts