Find K most frequent words in a really huge file



Problem: Given an extremely large file (one that may not fit even on several computers' hard disks), find the top K most frequent words in that file.

Solution (Case 1): If the file were a regular one, the following steps would give the top K most frequent words:
/***************************************************************************
 1) Load the given file into RAM.
 2) Split the big string into words using white-space and punctuation marks as separators.
 3) Put each word into a hash-map where the word is the key and its frequency is the value.
 4) Find the top K most frequent words by using a min-heap, ordered on frequency.
    (Note: A min-heap is required here, not a max-heap.)
    Heap usage gives N*logK performance.
***************************************************************************/
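A minimal sketch of steps 2-4 in Java, assuming the whole file has already been loaded into a single String. The class and method names are illustrative, not from the article:

import java.util.*;

public class TopKWords {

    // Split into words, count with a hash-map, then keep only the K most
    // frequent words using a min-heap of size K.
    public static List<String> topK(String text, int k) {
        Map<String, Integer> freq = new HashMap<>();
        for (String word : text.split("[\\s\\p{Punct}]+"))
            if (!word.isEmpty())
                freq.merge(word, 1, Integer::sum);

        // Min-heap ordered by frequency: the root is the least frequent of
        // the current top K, so evicting a loser costs only O(log K).
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.<String, Integer>comparingByValue());
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            heap.offer(e);
            if (heap.size() > k)
                heap.poll();            // drop the current minimum
        }

        List<String> result = new ArrayList<>();
        while (!heap.isEmpty())
            result.add(heap.poll().getKey());
        Collections.reverse(result);    // most frequent first
        return result;
    }

    public static void main(String[] args) {
        System.out.println(topK("a b, a c. a b", 2)); // prints [a, b]
    }
}

Because the heap never grows past K elements, each offer/poll costs O(log K), giving the N*logK bound mentioned above instead of the N*logN a full sort would need.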

File not fitting in RAM

If the file is so large that it cannot be loaded into the RAM of one machine, it can be read in chunks: multiple threads load small parts of the file into memory and apply the above algorithm, merging their counts into one shared map.
The assumption here is that the entire dictionary of words in the file is still small enough to fit in RAM.
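A sketch of that chunked pass, assuming the dictionary (but not the file) fits in RAM: the file is streamed line by line so only a small window of it is in memory, and parallel workers merge counts into one concurrent map.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Stream;

public class ChunkedCount {

    // Stream the file lazily; the concurrent map holds only the
    // (assumed RAM-sized) dictionary of word counts, never the file.
    public static ConcurrentHashMap<String, LongAdder> count(Path file) throws IOException {
        ConcurrentHashMap<String, LongAdder> freq = new ConcurrentHashMap<>();
        try (Stream<String> lines = Files.lines(file)) {
            lines.parallel()    // several worker threads share one map
                 .flatMap(line -> Arrays.stream(line.split("[\\s\\p{Punct}]+")))
                 .filter(w -> !w.isEmpty())
                 .forEach(w -> freq.computeIfAbsent(w, x -> new LongAdder()).increment());
        }
        return freq;
    }
}

The min-heap step from the previous sketch then runs over freq.entrySet() exactly as before.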


Dictionary not fitting in RAM

Now if the dictionary itself is so huge that it cannot fit in one machine's RAM, then this problem cannot be solved on a single machine.

Multiple machines are required, each counting word frequencies after loading its piece of the file.
But if the file is simply divided equally among the machines, the results of the individual machines must then be merged,
because the same word may be present in different pieces of the file.

Worse, the full dictionary would then be required on each machine, which is again impossible due to the dictionary's huge size.


This can be solved by using distributed hashing.
Use word.hashCode() % numProcessingMachines to get the number of the machine that will process that word.
A good hash function distributes the words evenly across the processing machines, and the routing can even be done in parallel on chunks of the file.
Each machine then receives roughly an equal number of words and holds only its own chunk of the dictionary, as in the sketch below.
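A sketch of the routing step; only the hash-modulo computation comes from the article, and how the word is then sent to that machine is left out:

// Math.floorMod is used because word.hashCode() may be negative in Java,
// and a plain % would then yield a negative machine index.
static int machineFor(String word, int numProcessingMachines) {
    return Math.floorMod(word.hashCode(), numProcessingMachines);
}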

After this step, we need the K most frequent words, whose frequencies are already present on the processing machines.
Note that the hashing guarantees all occurrences of a word land on the same machine, so each machine's local counts are global counts for its words. It is therefore safe to compute a local top K on each machine (using the min-heap approach above) and merge only those candidate lists.
Take 2 machines' candidate lists at a time and find their combined K most frequent words by using a max-heap over the 2K candidates.
numProcessingMachines/2 tasks can do this in parallel.
Then keep merging the resulting top-K lists, 2 at a time, until only one list is left, as sketched below.
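A sketch of one pairwise merge, assuming each machine has already produced its local top K as (word, frequency) pairs:

import java.util.*;

public class MergeTopK {

    // Merge two local top-K lists into one. Hash partitioning guarantees no
    // word appears in both lists, so picking the K largest frequencies out
    // of the 2K candidates gives the combined top K.
    static List<Map.Entry<String, Long>> merge(List<Map.Entry<String, Long>> a,
                                               List<Map.Entry<String, Long>> b,
                                               int k) {
        // Max-heap over all 2K candidates, highest frequency on top.
        PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(
                Map.Entry.<String, Long>comparingByValue().reversed());
        heap.addAll(a);
        heap.addAll(b);

        List<Map.Entry<String, Long>> top = new ArrayList<>();
        for (int i = 0; i < k && !heap.isEmpty(); i++)
            top.add(heap.poll());
        return top;
    }
}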

After the final merge, you have the K most frequent words of this extremely huge file, even though neither the file nor its dictionary fits on a single machine.

Read full article from Find K most frequent word in a really huge file - PrismoSkills

