Run-length encoding (part 3)



Run-length encoding (part 3)

In Run-length encoding (part 1), I presented the various run-length encoding formats. In part 2, I discussed the coding of the counters. In this third part, I want to discuss the ordering of the elements.

Indeed, the compression efficiency of run-length encoding depends on the ordering. For example, the sequence aaabbb is far more compressible than the sequence ababab.

Reordering sequences

Text, images, and sound are stored on disk as strings. Intuitively, we may think that strings cannot be reordered without losing information. But this intuition is wrong!

You can reorder strings without losing any information, as long as the reordering is invertible. The Burrows–Wheeler transform—invented in 1994—is an invertible transform which tends to generate streams of repeated characters. It is used by the open source bzip2 software. This invertible reordering trick works so well that bzip2 is "within 10% to 15% of the best available techniques, whilst being around twice as fast at compression and six times faster at decompression."

Ordering sets

Sometimes we want to compress sets of elements. For example, the ordering of the rows in a relational database is semantically irrelevant. Similarly, in a phone directory, we are not concerned about the order of the entries. Of course, we are trained to expect entries to be sorted lexicographically starting from the last names:

Last name First name
Abeck John
Becket Patricia
Smith John
Tucker Patricia

Such sequences of tuples can be compressed column-wise: compress each column with run-length encoding. Reordering the rows so as to maximize compression is NP-hard by reduction from the Hamiltonian path problem. Moreover, it can be reduced to an instance of the traveling salesman problem.

Fortunately, a simple and efficient heuristic is available. Reorder the columns in increasing cardinality: put the columns having the fewest number of distinct values first. Next, sort the table lexicographically.

Applying this heuristic, our previous table becomes:

First name Last name
John Abeck
John Smith
Patricia Becket
Patricia Tucker

Read full article from Run-length encoding (part 3)


No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts