Counting Unique Mobile App Users with HyperLogLog | Mawazo
Cardinality is the number of unique items in a list. In a naive implementation, cardinality can be estimated using memory proportional to cardinality size.However, when the cardinality is very high (e.g., IP address, phone number), such a naive approach is not pragmatic. Various approximate probabilistic algorithms can be used for cardinality estimation.
HyperLogLog is based on analyzing some bit patterns in the hashed value of an item. We look at the length of the sequence of most significant zero bits. The maximum length among such zero bit sequences from all the hashed values is indicative of the unique item count.
In reality, to improve the quality of the result, multiple independent hash functions are used and the longest sequence zero bits resulting from each hash function is used to produce the final averaged unique count value.
Instead of using multiple hash functions, we will use a technique called stochastic averaging. We set aside a sequence of significant bits of the hash for buckets. From the remaining bits we find the sequence of most significant zero bits. For example, to use 256 buckets we use the most significant 8 bits for buckets and the remaining 24 bits to find the sequence of most significant zero bits. For each bucket, we maintain a count of maximum length of sequence of zero bits.
Read full article from Counting Unique Mobile App Users with HyperLogLog | Mawazo
No comments:
Post a Comment