JV Compression Tool
Try compressing text using my Huffman compression library.
Built from scratch in Python using a custom Huffman tree, heap, and header format.
Input limit: about 250 KB. Decompression takes base64 text as input.
Why the samples behave the way they do
Tiny input → often gets bigger
The compressed output always includes a header with padding information and the data needed to rebuild the tree during decompression. For very small inputs, this overhead dominates, so the compressed output can be larger than the original.
Takeaway: Compression has a startup cost.
Highly repetitive → big wins
When one or a few symbols dominate the input, Huffman assigns them very short codes. This lowers the average bits per symbol and results in strong compression.
Takeaway: Skewed frequency distributions work best.
Uniform / random → little or no win
If symbols appear with roughly equal frequency, the Huffman codes all end up close to the same length, much like a fixed-width encoding. There is then little redundancy to remove, and the header overhead can make the output larger than the input.
Takeaway: High entropy resists compression.
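One way to make "entropy" concrete is to compute the Shannon entropy of the byte frequencies, which is a lower bound on the average code length any symbol-by-symbol code (Huffman included) can reach. The sketch below is illustrative and not taken from the library; the function name entropy_bits_per_byte is an assumption.

```python
import math
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    n = len(data)
    # Each distinct byte contributes -p * log2(p), where p is its relative frequency.
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

uniform = bytes(range(256)) * 4      # every byte value equally likely
skewed = b"a" * 990 + b"b" * 10      # one symbol dominates

print(f"{entropy_bits_per_byte(uniform):.2f}")  # 8.00: nothing to gain over raw bytes
print(f"{entropy_bits_per_byte(skewed):.2f}")   # about 0.08
```

Note that Huffman coding spends at least one bit per symbol, so on the skewed sample it cannot reach the entropy bound, but it still shrinks each byte to roughly one bit.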
Large & skewed → best case
With enough input data, the header cost becomes negligible. Strong frequency imbalance then drives excellent compression ratios.
Takeaway: Size helps, but structure matters more.
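A rough back-of-the-envelope model shows how the header cost fades as the input grows. The numbers here are assumptions for illustration only: a 28-byte header and an average of 2 bits per symbol are hypothetical values, not measurements of this tool.

```python
import math

def compressed_ratio(n_bytes: int, avg_bits_per_symbol: float, header_bytes: int) -> float:
    """Estimated compressed size divided by original size (lower is better)."""
    body_bytes = math.ceil(n_bytes * avg_bits_per_symbol / 8)
    return (header_bytes + body_bytes) / n_bytes

# Same code quality, same header, very different outcomes:
print(compressed_ratio(10, 2.0, 28))       # 3.1: the header dominates, output grows
print(compressed_ratio(100_000, 2.0, 28))  # 0.25028: the header is negligible
```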
How this implementation works
These notes describe how this implementation of Huffman coding works in practice.
Frequency counting
The compressor starts by counting how often each byte value appears in the input. These frequencies fully determine the structure of the Huffman tree and the final codes.
Common symbols receive shorter codes, while rare symbols receive longer ones.
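The counting step can be sketched with the standard library's collections.Counter. This is a minimal illustration rather than the library's actual code, and the function name count_frequencies is an assumption.

```python
from collections import Counter

def count_frequencies(data: bytes) -> Counter:
    """Map each byte value (0-255) to the number of times it occurs."""
    # Iterating over a bytes object yields integers, so the keys are byte values.
    return Counter(data)

freqs = count_frequencies(b"abracadabra")
print(freqs[ord("a")])       # 5
print(freqs.most_common(1))  # [(97, 5)]: byte 97 ('a') is the most frequent
```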
Nodes & tree construction
Each symbol begins as a leaf node containing its frequency. Nodes are merged using a min-heap, repeatedly combining the two lowest-weight nodes into a new internal node.
This continues until a single Huffman tree remains, defining a prefix-free code for all symbols.
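The merge loop described above can be sketched with Python's heapq module. The library uses its own heap and node types, so treat this as an approximation of the idea: the name build_codes, the tuple-based nodes, and the tie-breaking counter are all assumptions made for the example.

```python
import heapq
from collections import Counter
from itertools import count

def build_codes(data: bytes) -> dict[int, str]:
    """Return a prefix-free code (byte value -> bit string) for data."""
    freqs = Counter(data)
    if not freqs:
        return {}
    tiebreak = count()  # unique sequence number so equal weights never compare nodes
    # Heap entries are (weight, tiebreak, node); a node is an int leaf
    # or a (left, right) tuple for an internal node.
    heap = [(weight, next(tiebreak), symbol) for symbol, weight in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Combine the two lowest-weight nodes into a new internal node.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))

    codes: dict[int, str] = {}

    def walk(node, prefix: str) -> None:
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"  # a single-symbol input still needs one bit

    walk(heap[0][2], "")
    return codes

codes = build_codes(b"abracadabra")
print(len(codes[ord("a")]))  # 1: the most common byte gets the shortest code
```

Different tie-breaking rules can produce different trees for the same input, but every valid Huffman tree gives the same total encoded length (23 bits for this sample).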
For a visual explanation, see the interactive Huffman tree example on W3Schools.
Header structure & overhead
The compressed output includes a header containing padding length information and the data required to reconstruct the Huffman tree during decompression. This makes decompression deterministic and self-contained, but it also introduces fixed overhead, which is especially noticeable for small inputs.
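To give a feel for where the overhead comes from, here is a sketch of bit padding plus one possible header layout. The real header format of this library is not documented here, so the layout below (1 byte of padding length, a 2-byte symbol count, then a 5-byte record per distinct symbol) and the names pack_bits and build_header are purely hypothetical.

```python
import struct

def pack_bits(bits: str) -> tuple[bytes, int]:
    """Pad a '0'/'1' string to a whole number of bytes; return (payload, pad_len)."""
    pad_len = (8 - len(bits) % 8) % 8
    padded = bits + "0" * pad_len
    payload = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return payload, pad_len

def build_header(freqs: dict[int, int], pad_len: int) -> bytes:
    """Hypothetical layout: pad_len (1 byte), symbol count (2 bytes),
    then a (symbol: 1 byte, frequency: 4 bytes) record per distinct byte."""
    header = struct.pack(">BH", pad_len, len(freqs))
    for symbol, freq in sorted(freqs.items()):
        header += struct.pack(">BI", symbol, freq)
    return header

freqs = {97: 5, 98: 2, 114: 2, 99: 1, 100: 1}  # byte counts for b"abracadabra"
payload, pad_len = pack_bits("0" * 23)  # stand-in for a 23-bit encoded body
header = build_header(freqs, pad_len)
print(len(header), len(payload))  # 28 3
```

Under these assumptions, the 11-byte input "abracadabra" would need 31 bytes once compressed, because the 28-byte header outweighs the 3-byte body. Storing frequencies lets the decompressor rebuild the same tree; more compact formats are possible, but some header cost always remains.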