Using a rolling hash to break up binary files - CodeProject



Then, I read about using rolling hashes to find natural breakpoints in a file based on the shape of the data.  The idea is pretty cool…I’ll find patterns in the data that I can use to break the file up!  Then, I’ll have a list of smallish chunks, each of which is comparable and independently storable.  We can vary between long lists of small chunks or short lists of big chunks.

A rolling hash is one where you pick a window size…let’s say 64 bytes…and then hash every 64-byte-long segment in the file.  I don't mean a hash for 0-63, 64-127, 128-191, etc…but for 0-63, 1-64, 2-65, etc.  I know that sounds like something that’s going to burn a hole in your processor, but here’s the surprise…it goes pretty quick with the right hash function.  More on that later.

Skipping the apparent implausibility for a second, what does this hash get us?  Well…if it’s a reasonably good hash function, we’ll get well distributed hash values for each chunk of data we look at.  Assuming we do get random-looking hash values, we can take a regular subset of those hashes and arbitrarily declare that these hashes “qualify” as sentinel hashes.


Read full article from Using a rolling hash to break up binary files - CodeProject


No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts