All About Programming: Scalability and Memory Limits 6 | 吉祥三宝的各种practice记录

You have 10 billion URLs. How do you detect the duplicate documents? In this case, assume that "duplicate" means that the URLs are identical.

Analysis:

First, we compute how much memory is needed to store 10 billion URLs:

Assume that a URL is 100 characters long and that each character takes 4 bytes, so each URL takes 100*4 = 400 bytes.

10 billion URLs take about 10*2^30*100*4 bytes, which is approximately 4*2^40 bytes, or about 4 TB.
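The estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the constant and function names are made up for illustration, and the 100-character length and 4 bytes per character are the assumptions stated in the text, not measured values.

```python
# Assumptions taken from the text: "10 billion" is rounded to 10 * 2^30,
# a URL is 100 characters long, and each character takes 4 bytes.
URL_COUNT = 10 * 2**30
CHARS_PER_URL = 100
BYTES_PER_CHAR = 4

def total_bytes(url_count=URL_COUNT, chars=CHARS_PER_URL, width=BYTES_PER_CHAR):
    """Total storage needed for all URLs, in bytes."""
    return url_count * chars * width

def to_tib(n_bytes):
    """Convert bytes to units of 2^40 bytes."""
    return n_bytes / 2**40

print(f"{to_tib(total_bytes()):.2f} TB")  # prints "3.91 TB", i.e. roughly 4 TB
```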

Assume that we have only 4 GB of memory. We can divide the URLs into 4000 files. To decide which file a URL is stored in, we use the following hash function:

fileId = hash(u) % 4000, where u is the URL string and hash() is the sum of the ASCII codes of all characters in u.

With this function, identical URLs always get the same hash value and therefore end up in the same file, so a duplicate pair is never split across two files. The average file size is 1 GB (4 TB / 4000), which fits in memory.
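The partitioning step could be sketched roughly as follows. The function names, the part-NNNN.txt file naming, and the batching are assumptions made for illustration; only the ASCII-sum hash and the 4000-file split come from the text. Note that the ASCII sum is a weak hash (for example, any two strings with the same characters in a different order collide), so real file sizes may be less even than the 1 GB average suggests.

```python
import os
from collections import defaultdict

NUM_FILES = 4000  # number of partitions, as in the text

def ascii_sum_hash(url: str) -> int:
    """Sum of the character codes of the URL (the hash described in the text)."""
    return sum(ord(c) for c in url)

def file_id(url: str, num_files: int = NUM_FILES) -> int:
    """Index of the file this URL belongs to."""
    return ascii_sum_hash(url) % num_files

def partition(urls, out_dir, num_files=NUM_FILES, batch_size=100_000):
    """Append each URL to out_dir/part-<fileId>.txt.

    URLs are buffered and written in batches so that we never hold
    thousands of file handles open at once.
    """
    os.makedirs(out_dir, exist_ok=True)
    buffers = defaultdict(list)

    def flush():
        for fid, lines in buffers.items():
            path = os.path.join(out_dir, f"part-{fid:04d}.txt")
            with open(path, "a", encoding="utf-8") as f:
                f.writelines(line + "\n" for line in lines)
        buffers.clear()

    pending = 0
    for url in urls:
        buffers[file_id(url, num_files)].append(url)
        pending += 1
        if pending >= batch_size:
            flush()
            pending = 0
    flush()  # write whatever is left in the buffers
```

Because the input is consumed as an iterator, partition() can be fed a stream of URLs that is far larger than memory.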

Then we can load one file at a time into memory and use a hash map to detect the duplicate URLs in it, file by file.
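A minimal sketch of the per-file check is shown below. It assumes each file holds one URL per line, and it uses a Python set (a hash table) in place of a hash map since only membership matters here; the function names are hypothetical.

```python
def find_duplicates(lines):
    """Return the set of URLs that occur more than once in `lines`."""
    seen = set()        # every URL encountered so far
    duplicates = set()  # URLs encountered at least twice
    for line in lines:
        url = line.rstrip("\n")
        if url in seen:
            duplicates.add(url)
        else:
            seen.add(url)
    return duplicates

def duplicates_in_file(path):
    """Run the duplicate check on one partition file (one URL per line)."""
    with open(path, encoding="utf-8") as f:
        return find_duplicates(f)
```

Running duplicates_in_file() on each of the 4000 files in turn, and collecting the results, covers the whole data set while only one file's URLs are held in memory at a time.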

If we use 4000 machines instead of 4000 files, the advantage is that we can process the files in parallel. The disadvantage is that, with that many machines, we cannot assume that every machine is failure-free, so we have to handle machine failures.
