The Compression Pipeline
These are my notes on implementing each stage of the compression pipeline.
BWT - Burrows-Wheeler Transform
This is the clever part. BWT rearranges the input so that similar characters cluster together. The Wikipedia explanation made sense in theory but actually implementing it took a few tries.
The basic idea:
- Generate all rotations of the string
- Sort them
- Take the last column
For "banana" you get "nnbaaa" - notice how the as and ns grouped together. That’s the whole point.
The SA-IS rabbit hole
The naive BWT implementation sorted all rotations with qsort. Works fine, but each comparison can scan up to n characters, so the total cost is quadratic or worse - brutal for anything over a few KB.
I ended up implementing SA-IS (Suffix Array by Induced Sorting). It’s O(n) but the algorithm is… not simple. The key insight is that you can categorize suffixes as L-type or S-type based on their neighbors, then sort the “LMS” (leftmost S-type) suffixes first and induce the rest.
Took me a while to get the bucket indexing right. Off-by-one errors everywhere.
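The classification step described above can be sketched in a few lines. This is an illustrative fragment, not the code in sais.c - it uses the usual convention that the final position is S-type (standing in for the appended sentinel):

```c
#include <stddef.h>

/* L/S classification sketch. A suffix is S-type if it is
   lexicographically smaller than the suffix starting one position to
   its right, L-type otherwise; equal leading characters inherit the
   type from the right, so we scan right-to-left in one pass. */
void classify_suffixes(const unsigned char *s, size_t n,
                       char *type /* out: 'S' or 'L', length n */) {
    type[n - 1] = 'S';                       /* sentinel convention */
    for (size_t i = n - 1; i-- > 0; ) {
        if (s[i] < s[i + 1])      type[i] = 'S';
        else if (s[i] > s[i + 1]) type[i] = 'L';
        else                      type[i] = type[i + 1];
    }
}

/* An LMS (leftmost S-type) position: S-type with an L-type left neighbor. */
int is_lms(const char *type, size_t i) {
    return i > 0 && type[i] == 'S' && type[i - 1] == 'L';
}
```

For "banana" this gives the pattern L S L S L S, with LMS positions at the three a's - those are the suffixes SA-IS sorts first before inducing the rest.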
The implementation lives in sais.c. It’s about 500 lines and I had to debug it with print statements for days.
MTF - Move-to-Front
This one’s simple. Keep a list of all 256 byte values. For each input byte, output its position in the list, then move that byte to the front.
#include <stdint.h>   /* uint8_t */
#include <string.h>   /* memmove, size_t */

void mtf_encode(const uint8_t *input, uint8_t *output, size_t size) {
    uint8_t table[256];
    for (int i = 0; i < 256; i++) table[i] = (uint8_t)i;
    for (size_t i = 0; i < size; i++) {
        uint8_t c = input[i];
        int pos = 0;
        while (table[pos] != c) pos++;        /* find c's current position */
        output[i] = (uint8_t)pos;
        memmove(table + 1, table, pos);       /* shift earlier entries back */
        table[0] = c;                         /* move c to the front */
    }
}
After BWT, similar characters are clustered. MTF turns those clusters into runs of zeros (position 0 = “same as last time”). That’s what RLE will compress.
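The inverse is symmetric - a sketch, assuming the same 256-entry table convention as mtf_encode (the name mtf_decode is mine; the notes don't show the decoder):

```c
#include <stdint.h>
#include <string.h>

/* Move-to-front decode sketch: each input byte is a position in the
   table; emit the byte found there, then move it to the front, which
   keeps the table in lockstep with the encoder's. */
void mtf_decode(const uint8_t *input, uint8_t *output, size_t size) {
    uint8_t table[256];
    for (int i = 0; i < 256; i++) table[i] = (uint8_t)i;
    for (size_t i = 0; i < size; i++) {
        uint8_t pos = input[i];
        uint8_t c = table[pos];
        output[i] = c;
        memmove(table + 1, table, pos);  /* shift, then reinstall at front */
        table[0] = c;
    }
}
```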
RLE - Run-Length Encoding
First version was buggy. The issue was handling the escape byte - if your data contains 0xFE, you need to escape it too.
Current approach:
- 0xFE followed by a value and a count means “repeat this byte N times”
- 0xFE 0xFE means a literal 0xFE
Had some off-by-one issues with the count encoding too. 0 means 3 repeats (the minimum worth encoding), 255 means 258 repeats.
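The scheme above can be sketched as follows. This is one plausible reading, not the packrat code: runs of the escape byte itself are emitted as escaped literals so the decoder never mistakes them for run headers.

```c
#include <stdint.h>
#include <stddef.h>

#define RLE_ESC 0xFE

/* RLE sketch matching the notes: a run of 3..258 identical bytes
   becomes ESC, value, count-3 (so count byte 0 => 3 repeats,
   255 => 258); a lone 0xFE becomes ESC ESC; everything else is a
   plain literal. Caller must size `out` for worst-case expansion. */
void rle_encode(const uint8_t *in, size_t size, uint8_t *out, size_t *out_size) {
    size_t o = 0;
    for (size_t i = 0; i < size; ) {
        size_t run = 1;
        while (i + run < size && in[i + run] == in[i] && run < 258) run++;
        if (run >= 3 && in[i] != RLE_ESC) {
            out[o++] = RLE_ESC;
            out[o++] = in[i];
            out[o++] = (uint8_t)(run - 3);   /* 0 means 3, 255 means 258 */
            i += run;
        } else {
            if (in[i] == RLE_ESC) out[o++] = RLE_ESC;  /* escape 0xFE */
            out[o++] = in[i];
            i++;
        }
    }
    *out_size = o;
}
```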
Huffman coding
Standard canonical Huffman. Build a tree from frequencies, extract code lengths, generate codes from lengths.
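The “generate codes from lengths” step can be sketched like this (the function name is mine, not packrat's; tree building and length extraction are assumed done):

```c
#include <stdint.h>
#include <stddef.h>

/* Canonical code assignment sketch: visit symbols in (length, symbol)
   order, handing out consecutive code values at each length, then
   left-shift to step down to the next length. The result is the
   unique canonical prefix code for the given lengths. */
void canonical_codes(const uint8_t *len, size_t nsyms, uint32_t *code) {
    uint8_t max_len = 0;
    for (size_t s = 0; s < nsyms; s++)
        if (len[s] > max_len) max_len = len[s];
    uint32_t next = 0;
    for (uint8_t l = 1; l <= max_len; l++) {
        for (size_t s = 0; s < nsyms; s++)
            if (len[s] == l)
                code[s] = next++;   /* consecutive codes at this length */
        next <<= 1;                 /* descend one level in the tree */
    }
}
```

For lengths {2, 1, 3, 3} this yields codes 10, 0, 110, 111 - only the lengths need to be stored, which is why the canonical form is worth the trouble.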
The trick is limiting code length to 32 bits. Deep trees happen with skewed frequencies. Had to implement code length limiting which required reading a paper on optimal binary trees with length restrictions.
The decoder is table-driven. Precompute a lookup table indexed by the first N bits, then handle longer codes with a slower path.
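The fast path can be sketched as a table fill. Illustrative only, assuming MSB-first canonical codes - every code of length l ≤ N owns a contiguous block of 2^(N-l) slots, and longer codes would be routed to the slow path (omitted here):

```c
#include <stdint.h>
#include <stddef.h>

#define TBITS 4   /* demo table width; real decoders use more bits */

typedef struct { uint16_t sym; uint8_t len; } dec_entry;

/* One-level decode table sketch: index by the next TBITS bits of the
   stream. A code of length l <= TBITS is replicated across every
   entry whose top l bits match it; the entry records the symbol and
   how many bits to actually consume. */
void build_decode_table(const uint32_t *code, const uint8_t *len,
                        size_t nsyms, dec_entry *table /* 1 << TBITS */) {
    for (size_t s = 0; s < nsyms; s++) {
        if (len[s] == 0 || len[s] > TBITS) continue;   /* unused / slow path */
        uint32_t base = code[s] << (TBITS - len[s]);
        for (uint32_t j = 0; j < (1u << (TBITS - len[s])); j++) {
            table[base + j].sym = (uint16_t)s;
            table[base + j].len = len[s];
        }
    }
}
```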
Putting it together
The pipeline in packrat.c:
bwt_index = bwt_encode(input, bwt_out, size);
mtf_encode(bwt_out, mtf_out, size);
rle_encode(mtf_out, size, rle_out, &rle_size);
padding = huffman_encode(rle_out, rle_size, codes, huff_out, &huff_size);
Decompression is the reverse. Each step is invertible.
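The least obvious inverse is the BWT's. Here is a sketch of the standard LF-mapping decode, assuming the primary index is the one bwt_encode returns as bwt_index; this is illustrative, not the actual packrat decoder:

```c
#include <stdint.h>
#include <stddef.h>

/* Inverse BWT sketch: the first column is the last column sorted, and
   LF[i] maps row i to the row whose rotation is one step behind it.
   Walking LF from the primary row emits the text back to front.
   Demo-sized LF array; allocate dynamically for real inputs. */
void bwt_inverse(const uint8_t *last, size_t n, size_t primary,
                 uint8_t *out) {
    size_t count[256] = {0}, C[256];
    for (size_t i = 0; i < n; i++) count[last[i]]++;
    size_t sum = 0;
    for (int c = 0; c < 256; c++) { C[c] = sum; sum += count[c]; }
    size_t LF[4096];
    size_t seen[256] = {0};
    for (size_t i = 0; i < n; i++)
        LF[i] = C[last[i]] + seen[last[i]]++;   /* rank among equal chars */
    size_t row = primary;
    for (size_t k = n; k-- > 0; ) {
        out[k] = last[row];
        row = LF[row];
    }
}
```

Feeding it "nnbaaa" with primary index 3 gives back "banana" - the last column plus one integer really is enough.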
Things I learned
- SA-IS is beautiful but hard to debug
- Off-by-one errors in compression are catastrophic - data comes out as garbage
- The BWT is reversible even though it looks like it shouldn’t be (you only need the last column and the primary index)
- Huffman code length limiting is its own research topic