Top K or K-most frequent words in a document - Algorithms and Problem Solving




Top K or K-most frequent words in a document

Given a document (or stream) of words, find the top k most frequent words in the document (or stream).

For example, let stream = "aa bb cc bb bb cc dd dd ee ff ee dd aa ee". The word frequencies are {"dd"=3, "ee"=3, "ff"=1, "aa"=2, "bb"=3, "cc"=2}, so the top 3 most frequent words are {"dd", "ee", "bb"}.

One quick solution is to count the frequency of each word, create a pair object holding the word and its frequency, and sort the array of pairs by frequency in descending order. Then take the first k pairs from the sorted array. This is an O(n lg n) solution.
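The sort-based approach can be sketched roughly as follows. This is only a minimal illustration, not the original post's code: the class name `TopKBySorting`, the method `topK`, and the choice to split on whitespace are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration only.
class TopKBySorting {

    // Returns up to k words, ordered from most to least frequent.
    static List<String> topK(String text, int k) {
        // Count how often each word occurs (assumes whitespace-separated words).
        Map<String, Integer> freq = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                freq.merge(word, 1, Integer::sum);
            }
        }

        // Sort all word-frequency pairs by descending frequency.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>(freq.entrySet());
        pairs.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        // Take the first k pairs (or fewer if there are not enough distinct words).
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, pairs.size()); i++) {
            result.add(pairs.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        String stream = "aa bb cc bb bb cc dd dd ee ff ee dd aa ee";
        // Prints "bb", "dd" and "ee"; the order among ties is not guaranteed.
        System.out.println(topK(stream, 3));
    }
}
```

Words with equal frequency come out in whatever order the hash map yields them, so the sketch makes no promise about tie-breaking.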

O(n lg k) solution with O(n) space
But we can improve this solution. Note that we are only concerned with the top k elements. Sorting the array orders all n elements, which is unnecessary when we only need the first k. Any idea popping in? Yes, we can use a min heap of size k to keep the top k most frequent words. We also need a hashmap to keep the frequency of each word.

  1. Calculate the frequency of every word and store it in a hashmap from the word to its frequency.
  2. Start adding pair objects of word and frequency into a min heap, using the frequency as the key for the min heap.
  3. If the heap is full (it already holds k pairs), compare the new word's frequency with that of the minimum element (the top of the heap). Only if the new frequency is greater, remove the top from the heap and add the new word-frequency pair.
  4. Once we have scanned all the words in the map and the heap is properly updated, the elements contained in the min heap are the top k most frequent words.

The above idea can be implemented simply with the Java PriorityQueue (or with a generic min heap, such as the one I implemented in a previous post). This solution takes O(n lg k) time and O(n) space.
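The original listing is not included here, so the following is a rough sketch of the four steps above using `java.util.PriorityQueue`, not the author's code. The class name `TopKWithMinHeap`, the method `topK`, and the whitespace splitting are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper class for illustration only.
class TopKWithMinHeap {

    // Returns up to k most frequent words, in ascending order of frequency.
    static List<String> topK(String text, int k) {
        List<String> result = new ArrayList<>();
        if (k <= 0) {
            return result;
        }

        // Step 1: count the frequency of each word (assumes whitespace-separated words).
        Map<String, Integer> freq = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                freq.merge(word, 1, Integer::sum);
            }
        }

        // Step 2: min heap keyed on frequency, so the least frequent pair sits on top.
        Comparator<Map.Entry<String, Integer>> byFrequency =
                (a, b) -> Integer.compare(a.getValue(), b.getValue());
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(k, byFrequency);

        for (Map.Entry<String, Integer> entry : freq.entrySet()) {
            if (heap.size() < k) {
                heap.offer(entry);
            } else if (entry.getValue() > heap.peek().getValue()) {
                // Step 3: heap is full and the new word beats the current minimum.
                heap.poll();
                heap.offer(entry);
            }
        }

        // Step 4: whatever remains in the heap is the top k.
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        String stream = "aa bb cc bb bb cc dd dd ee ff ee dd aa ee";
        // Prints "bb", "dd" and "ee"; the order among ties is not guaranteed.
        System.out.println(topK(stream, 3));
    }
}
```

Because the heap never holds more than k pairs, each insertion or removal costs O(lg k), while the hashmap accounts for the O(n) space.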



