Counting Unique Mobile App Users with HyperLogLog | Mawazo



Counting Unique Mobile App Users with HyperLogLog | Mawazo

Cardinality is the number of unique items in a list. In a naive implementation, cardinality can be estimated using memory proportional to cardinality size.However, when the cardinality is very high (e.g., IP address, phone number), such a naive approach is not pragmatic. Various approximate probabilistic algorithms can be used for cardinality estimation.

HyperLogLog is based on analyzing some bit patterns in the hashed value of an item. We look at the length of the sequence of most significant zero bits. The maximum length among such zero bit sequences from all the hashed values is indicative of the unique item count.

In reality, to improve the quality of the result, multiple independent hash functions are used and the longest sequence zero bits resulting from each hash function is used to produce the final averaged unique count value.

Instead of using multiple hash functions, we will use a technique called stochastic averaging. We set aside a sequence of significant bits of the hash for buckets. From the remaining bits we find the sequence of most significant zero bits. For example, to use 256 buckets we use the most significant 8 bits for buckets and the remaining 24 bits to find the sequence of most significant zero bits. For each bucket, we maintain a count of maximum length of sequence of zero bits.


Read full article from Counting Unique Mobile App Users with HyperLogLog | Mawazo


No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts