You have a billion urls, where each is a huge page. How do you detect the duplicate documents? | Hello World




Here are two observations we can make:
1. Every page is huge, so loading them all into memory is not feasible. Instead, we compute a hash value of the content of each page and compare hashes.
2. There are a billion URLs, so comparing every page to every other page is O(n^2), which is too slow. Instead, we build a hash table of the hash values we have already seen. If a page's hash value is already in the table, we can assume the two pages have the same content and throw one of them out.
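The two observations above can be sketched in a few lines. This is a minimal single-machine version, assuming the pages fit through memory one at a time; the function name `dedupe` and MD5 as the content hash are illustrative choices, not part of the original question.

```python
import hashlib

def dedupe(pages):
    """Keep only URLs whose page content has not been seen before.

    `pages` is an iterable of (url, content) pairs. MD5 is used here
    purely for illustration; any uniform content hash would do.
    """
    seen = set()          # hash values of content we have already visited
    unique_urls = []
    for url, content in pages:
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        if digest not in seen:    # first time we see this content
            seen.add(digest)
            unique_urls.append(url)
    return unique_urls

# a.com and c.com serve identical content, so c.com is thrown out
print(dedupe([("a.com", "hello"), ("b.com", "world"), ("c.com", "hello")]))
```

Each page is hashed once and each hash is checked in O(1), so the whole pass is linear rather than O(n^2).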

So we end up with a list of URLs pointing to the unique pages. But wait: if each hash value is 4 bytes and there are a billion URLs, the hash values alone need 4 GB of memory, plus another few gigabytes for the URLs themselves, so it is very likely that all of the data cannot fit on one machine.
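The back-of-envelope arithmetic from the paragraph above, written out:

```python
# One 4-byte hash value per page, one billion pages.
num_pages = 1_000_000_000
hash_bytes = 4
total_gb = num_pages * hash_bytes / 10**9
print(total_gb)  # 4.0 GB just for the hash values, before storing any URLs
```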

What do we do? Here are a few options.

1. We could split this up into multiple files. We would have to deal with the file loading and unloading, which is painful.

2. We could hash to disk. Size wouldn't be a problem, but access time might: a hash table on disk would require a random-access read for each check and a write to store each viewed URL. Each operation could take milliseconds waiting for seek and rotational latencies, though elevator (disk-scheduling) algorithms could eliminate random bouncing from track to track.

3. Or, we could split this up across machines and deal with network latency. Let's go with this solution and assume we have n machines.
First, we hash the document to get a hash value v.
v % n tells us which machine this document's hash table lives on.
v / n is the key within the hash table on that machine.
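The sharded lookup can be sketched as follows, using a list of Python dicts to stand in for the n machines' hash tables (the function name `place` and the dict-based tables are illustrative, not part of the original scheme):

```python
def place(v, url, machines):
    """Record hash value v in the sharded tables; return True if it is a duplicate."""
    n = len(machines)
    table = machines[v % n]   # v % n picks which machine holds this value
    key = v // n              # v / n is the key within that machine's table
    if key in table:
        return True           # same content hash already seen somewhere
    table[key] = url
    return False

machines = [dict() for _ in range(4)]
print(place(13, "a.com", machines))  # False: first time this content is seen
print(place(13, "b.com", machines))  # True: duplicate content detected
```

Splitting v into (v % n, v // n) means each machine stores only the quotient, so no bits of the hash are wasted and the billion values are spread roughly evenly across the n tables.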

