[Design] Big Data - Fuzzy Search url (Bloom Filter) - Shuatiblog.com



[Design] Big Data - Fuzzy Search url (Bloom Filter) - Shuatiblog.com

给你A,B两个文件,各存放50亿条URL,每条URL占用64字节,内存限制是4G,让你找出A,B文件共同的URL。如果是三个乃至n个文件呢?

Bloom Filter

自从Burton Bloom在70年代提出Bloom Filter之后,Bloom Filter就被广泛用于拼写检查,数据库系统中。。。可以实现数据字典,进行数据的判重,或者集合求交集

基本原理及要点

An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution.

很明显这个过程并不保证查找的结果是100%正确的。同时也不支持删除一个已经插入的关键字,因为该关键字对应的位会牵动到其他的关键字。

所以一个简单的改进就是 counting Bloom filter,用一个counter数组代替位数组,就可以支持删除了。

Error rate

m: length of BF array (in bits) n: number of input elements k: number of hash functions 

A Bloom filter with 1% error and an optimal value of k, in contrast, requires only about 9.6 bits per element (means m = 9.6 x n).

Usage

Bloom Filter可以用来实现数据字典,进行数据的判重,或者集合求交集.

Solution

Of course we can always use 【分治+trie树/hash+小顶堆】 standard solution, but for Fuzzy search, BF is the best.

4G = 232 大概是40亿 x 8大概是340亿bit,n = 50亿,如果按出错率0.01算需要的大概是480亿个bit。现在可用的是340亿,相差并不多,这样可能会使出错率上升些。


Read full article from [Design] Big Data - Fuzzy Search url (Bloom Filter) - Shuatiblog.com


No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts