How to detect duplicate documents in billions of urls | Runhe Tian Coding Practice
Based on the above two observations, we can derive the following algorithm:
- Iterate through the pages and compute the hash value of each one.
- Check if the hash value is already in the hash table. If it is, throw out the url as a duplicate. If it is not, keep the url and insert its hash into the hash table, as sketched below.
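A minimal sketch of this loop in Python, assuming `pages` is a hypothetical iterable of `(url, content)` pairs and using `crc32` as a stand-in for the four-byte page hash:

```python
import zlib

def unique_urls(pages):
    """Return the urls whose page content has not been seen before."""
    seen = set()          # hash table of page hashes seen so far
    unique = []
    for url, content in pages:
        h = zlib.crc32(content.encode("utf-8"))  # four-byte hash of the page
        if h in seen:
            continue      # duplicate page: throw the url out
        seen.add(h)       # keep the url and record its hash
        unique.append(url)
    return unique
```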
This algorithm will provide us with a list of unique urls. But wait, can this fit on one computer?
- How much space does each page take up in the hash table?
- Each page hashes to a four byte value.
- Each url is an average of 30 characters, so that's another 30 bytes at least.
- Each url takes up roughly 34 bytes.
- 34 bytes * 1 billion = 31.6 gigabytes. We're going to have trouble holding that all in memory!
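A quick check of that estimate (reading "gigabytes" as GiB, i.e. 2^30 bytes):

```python
bytes_per_entry = 4 + 30            # four-byte hash + ~30-byte url
total_urls = 1_000_000_000
total_bytes = bytes_per_entry * total_urls
print(total_bytes / 2**30)          # ~31.66 -> roughly 31.6 GiB of memory
```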
What do we do?
- We could split this up into files. We'll have to deal with the file loading / unloading—ugh.
- We could hash to disk. Size wouldn't be a problem, but access time might. A hash table on disk would require a random access read for each check and a write to store each viewed url. This could take milliseconds waiting for seek and rotational latencies. Elevator algorithms could eliminate random bouncing from track to track.
- Or, we could split this up across machines, and deal with network latency. Let's go with this solution, and assume we have n machines.
- First, we hash the document to get a hash value v.
- v % n tells us which machine this document's hash table can be found on.
- v / n is the value in the hash table that is located on its machine.
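A single-process sketch of that sharding scheme, with in-memory sets standing in for the per-machine hash tables; `N_MACHINES`, `machine_tables`, and `is_duplicate` are illustrative names, and `crc32` again stands in for the page hash:

```python
import zlib

N_MACHINES = 4   # hypothetical cluster size n

# One hash table per machine; in a real system each of these would live
# on a separate host reached over the network.
machine_tables = [set() for _ in range(N_MACHINES)]

def is_duplicate(content):
    """Route the page's hash to its machine and check/update that table."""
    v = zlib.crc32(content.encode("utf-8"))   # hash value v
    machine = v % N_MACHINES                  # v % n picks the machine
    key = v // N_MACHINES                     # v / n is the key stored there
    table = machine_tables[machine]
    if key in table:
        return True                           # seen before: duplicate
    table.add(key)
    return False
```

The point of storing v / n rather than v itself is that the machine index already encodes v % n, so each machine's table only needs the remaining bits of the hash.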