Then, I read about using rolling hashes to find natural breakpoints in a file based on the shape of the data. The idea is pretty cool…I’ll find patterns in the data that I can use to break the file up! Then, I’ll have a list of smallish chunks, each of which is comparable and independently storable. We can tune for long lists of small chunks or short lists of big chunks.
A rolling hash is one where you pick a window size…let’s say 64 bytes…and then hash every 64-byte-long segment in the file. I don’t mean a hash for bytes 0-63, 64-127, 128-191, and so on…but for 0-63, 1-64, 2-65, and so on, one window per starting offset. I know that sounds like something that’s going to burn a hole in your processor, but here’s the surprise…it goes pretty quickly with the right hash function, because each window’s hash can be computed from the previous one in constant time. More on that later.
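To show what that constant-time update looks like, here’s a minimal sketch in Python using a Rabin-Karp style polynomial hash. The base and modulus are illustrative choices of mine, not values from the article:

```python
WINDOW = 64          # window size in bytes, as in the example above
BASE = 257           # polynomial base (an arbitrary choice)
MOD = (1 << 61) - 1  # a large prime modulus (an arbitrary choice)

def rolling_hashes(data: bytes):
    """Yield (offset, hash) for every WINDOW-byte segment of data."""
    if len(data) < WINDOW:
        return
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    h = 0
    for b in data[:WINDOW]:           # hash the first window the slow way
        h = (h * BASE + b) % MOD
    yield 0, h
    # Roll: subtract the departing byte, shift, add the arriving byte.
    # Each step is O(1), so hashing every window is O(n) overall.
    for i in range(1, len(data) - WINDOW + 1):
        h = ((h - data[i - 1] * top) * BASE + data[i + WINDOW - 1]) % MOD
        yield i, h
```

The rolling update is the whole trick: instead of rehashing 64 bytes at every offset, we adjust the previous hash for the one byte leaving the window and the one byte entering it.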
Setting aside the apparent implausibility for a second, what does this hash get us? Well…if it’s a reasonably good hash function, we’ll get well-distributed hash values for each chunk of data we look at. Assuming we do get random-looking hash values, we can take a regular subset of those hashes…say, every value whose low bits are all zero…and arbitrarily declare that those hashes “qualify” as sentinel hashes, cutting the file wherever one appears.
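To make that concrete, here’s a sketch of picking sentinels and turning them into chunk boundaries, reusing rolling_hashes and WINDOW from the sketch above. The mask targets an average chunk size of about 8 KiB, and the cut-just-past-the-window rule is likewise an illustrative assumption on my part:

```python
AVG_BITS = 13               # ~8 KiB average chunk size (an assumption)
MASK = (1 << AVG_BITS) - 1  # a window "qualifies" when these bits are zero

def chunk_boundaries(data: bytes):
    """Return end offsets of chunks, cutting after each sentinel window."""
    cuts = []
    for offset, h in rolling_hashes(data):
        if h & MASK == 0:                 # roughly 1 window in 2**AVG_BITS
            cuts.append(offset + WINDOW)  # cut just past the sentinel window
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))            # whatever is left is the final chunk
    return cuts
```

Because a boundary depends only on the bytes inside its own window, inserting or deleting data early in the file shifts everything after it, but the downstream sentinels still match the same byte patterns…so only the chunks near the edit actually change. (A real implementation would also enforce minimum and maximum chunk sizes so runs of qualifying windows don’t produce tiny chunks.)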
Read full article from Using a rolling hash to break up binary files - CodeProject