Thanks for pointing this out. I was thinking about using 128-bit CityHash to generate an ID (hash) for billions of documents to be stored in a DB. The intent is content deduplication with a very low chance of accidental collision. The SipHash implementation in Guava produces 64-bit output, so it has a relatively high probability of collision for my corpus size. Is there a 128-bit implementation of SipHash? Otherwise, 128-bit Murmur3 seems to be a better choice for now.
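To put rough numbers on "relatively high", here is a small sketch of the standard birthday-bound approximation, p ≈ 1 - exp(-n(n-1) / 2^(b+1)), for n documents and a b-bit hash. It assumes the hash output is uniformly distributed and that nobody is crafting inputs on purpose; the function name and the corpus sizes are only illustrative, not taken from the thread.

```python
import math


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate chance of at least one accidental collision.

    Birthday-bound estimate, assuming a uniformly distributed hash:
    p ~= 1 - exp(-n * (n - 1) / 2 / 2**bits)
    """
    pairs = num_items * (num_items - 1) / 2      # number of distinct pairs
    expected_collisions = pairs / 2.0 ** hash_bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-expected_collisions)


if __name__ == "__main__":
    # Illustrative corpus sizes (assumptions, "billions" in the question).
    for n in (10**9, 10**10):
        for bits in (64, 128):
            p = collision_probability(n, bits)
            print(f"n={n:.0e} bits={bits}: p ~= {p:.3g}")
```

Under these assumptions, a billion documents with a 64-bit hash already gives a collision chance of a few percent, while with 128 bits the chance is negligible for any realistic corpus.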
How bad are collisions in your case? Note that it's pretty easy to deliberately generate any number of collisions for 128-bit Murmur3, since it is not a cryptographic hash. OTOH accidental collisions should practically never happen at 128 bits.
If speed is not very important I'd go for SHA-1, as Git does. I guess that storing the hash in the DB takes much more time than the hashing, so there's not much point in optimizing hard.
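A minimal sketch of what SHA-1 content IDs for deduplication could look like, using Python's standard hashlib rather than Guava. The function names and the in-memory set are hypothetical stand-ins; in a real setup the ID would more likely be a unique key in the DB.

```python
import hashlib
from typing import Iterable, List


def content_id(data: bytes) -> str:
    """Return the SHA-1 digest of the document bytes as 40 hex characters."""
    return hashlib.sha1(data).hexdigest()


def deduplicate(documents: Iterable[bytes]) -> List[bytes]:
    """Keep the first occurrence of each distinct document, in input order."""
    seen_ids = set()   # stand-in for a unique index in the DB (assumption)
    unique_docs = []
    for doc in documents:
        doc_id = content_id(doc)
        if doc_id not in seen_ids:
            seen_ids.add(doc_id)
            unique_docs.append(doc)
    return unique_docs


if __name__ == "__main__":
    docs = [b"first document", b"second document", b"first document"]
    for doc in deduplicate(docs):
        print(content_id(doc), doc)
```

The digest is 160 bits, so it costs a bit more storage per row than a 128-bit hash, which seems a fair trade if the DB write dominates the cost anyway.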
Read full article from CityHash in Guava - Google Groups