Histogram in Spark (1) | Big Data Analytics with Spark




Spark's DoubleRDDFunctions provides a histogram function for RDD[Double]. However, there is no histogram function for RDD[String]. Here is a quick exercise for building one. We will use an immutable Map in this exercise.

Create a dummy RDD[String] and apply the aggregate method to calculate the histogram:

scala> val d=sc.parallelize((1 to 10).map(_ % 3).map("val"+_.toString))
scala> d.aggregate(Map[String,Int]())(
     | (m,c)=>m.updated(c,m.getOrElse(c,0)+1),
     | (m,n)=>(m /: n){case (map,(k,v))=>map.updated(k,v+map.getOrElse(k,0))}
     | )

The first function passed to aggregate adds one element to the histogram of its partition. The second function merges two maps. We can define the merge as a standalone Scala function:

scala> def mapadd[T](m:Map[T,Int],n:Map[T,Int])={
     | (m /: n){case (map,(k,v))=>map.updated(k,v+map.getOrElse(k,0))}
     | }

It combines the histograms computed on different partitions:

scala> mapadd(Map("a"->1,"b"->2),Map("a"->2,"c"->1))
res3: scala.collection.immutable.Map[String,Int] = Map(a -> 3, b -> 2, c -> 1)

Using mapadd, we can rewrite the aggregate step:

scala> d.
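To see how the two functions passed to aggregate cooperate, the same logic can be tried in plain Scala without a Spark cluster. The sketch below is only an illustration under assumptions: the object name HistogramSketch, the helper addOne, and the partitions argument (a Seq of Seqs standing in for the partitions of an RDD[String]) are hypothetical and not part of the original post. It uses foldLeft, which is the spelled-out form of the /: operator.

```scala
// Plain-Scala sketch of the histogram aggregation; no Spark required.
// HistogramSketch, addOne and `partitions` are hypothetical names for illustration.
object HistogramSketch {
  // Plays the role of aggregate's first function:
  // add one element to a partition-local histogram.
  def addOne[T](m: Map[T, Int], c: T): Map[T, Int] =
    m.updated(c, m.getOrElse(c, 0) + 1)

  // Plays the role of aggregate's second function:
  // merge two histograms by summing the counts key by key.
  def mapadd[T](m: Map[T, Int], n: Map[T, Int]): Map[T, Int] =
    n.foldLeft(m) { case (acc, (k, v)) => acc.updated(k, v + acc.getOrElse(k, 0)) }

  // Build one histogram per partition, then merge the partial results.
  def histogram[T](partitions: Seq[Seq[T]]): Map[T, Int] =
    partitions
      .map(p => p.foldLeft(Map.empty[T, Int])((m, c) => addOne(m, c)))
      .foldLeft(Map.empty[T, Int])((m, n) => mapadd(m, n))

  def main(args: Array[String]): Unit = {
    // Same dummy data as in the post, split into chunks of 4 to mimic partitions.
    val data = (1 to 10).map(_ % 3).map("val" + _.toString)
    val partitions = data.grouped(4).toSeq
    println(histogram(partitions))
  }
}
```

For the dummy data, the counts come out as 3 for val0, 4 for val1 and 3 for val2, no matter how the elements are split into partitions, because summing counts key by key is associative and commutative.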


