java - Producing a sorted wordcount with Spark - Code Review Stack Exchange



java - Producing a sorted wordcount with Spark - Code Review Stack Exchange

My method using Java 8

As addendum I'll show how I would identify your problem in question and show you how I would do it.

Input: An input file, consisting of words. Output: A list of the words sorted by frequency in which they occur.

Map<String, Long> occurenceMap = Files.readAllLines(Paths.get("myFile.txt"))          .stream()          .flatMap(line -> Arrays.stream(line.split(" ")))          .collect(Collectors.groupingBy(i -> i, Collectors.counting()));  List<String> sortedWords = occurenceMap.entrySet()          .stream()          .sorted(Comparator.comparing((Map.Entry<String, Long> entry) -> entry.getValue()).reversed())          .map(Map.Entry::getKey)          .collect(Collectors.toList());

This will do the following steps:

  1. Read all lines into a List<String> (care with large files!)
  2. Turn it into a Stream<String>.
  3. Turn that into a Stream<String> by flat mapping every String to a Stream<String> splitting on the blanks.
  4. Collect all elements into a Map<String, Long> grouping by the identity (i -> i) and using as downstream Collectors.counting() such that the map-value will be its count.
  5. Get a Set<Map.Entry<String, Long>> from the map.
  6. Turn it into a Stream<Map.Entry<String, Long>>.
  7. Sort by the reverse order of the value of the entry.
  8. Map the results to a Stream<String>, you lose the frequency information here.
  9. Collect the stream into a List<String>.

Beware that the line .sorted(Comparator.comparing((Map.Entry<String, Long> entry) -> entry.getValue()).reversed()) should really be .sorted(Comparator.comparing(Map.Entry::getValue).reversed(), but type inference is having issues with that and for some reason it will not compile.

I hope the Java 8 way can give you interesting insights.


Read full article from java - Producing a sorted wordcount with Spark - Code Review Stack Exchange


No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts