What is Huffman Coding?
Huffman Coding is a lossless data compression algorithm. The idea is to assign variable-length codes to input characters, with lengths based on the frequencies of corresponding characters.
The most frequent character gets the smallest code and the least frequent character gets the largest code. This variable-length coding is created using a Huffman Tree, which is built using a greedy approach.
Visual Example
A: 1, B: 00, C: 01. The most frequent character 'A' has the shortest code.
Algorithm Strategy
1. Calculate Frequencies
Count how often each character appears in the data. Create a leaf node for each character and build a min-heap of all leaf nodes.
2. Build Tree
Extract two nodes with the minimum frequency from the min-heap. Create a new internal node with a frequency equal to the sum of the two nodes' frequencies. Insert this new node back into the min-heap. Repeat until one node remains.
3. Generate Codes
Traverse the constructed tree from root to leaves. Assign '0' for a left branch and '1' for a right branch. The sequence of bits from root to a leaf node is the code for that character.
Complexity Analysis
Where N is the number of unique characters. Extracting minimum from the priority queue takes O(log N) and it is done 2*(N-1) times.
Time Complexity: O(N log N)
Key Takeaways
- Huffman Coding generates prefix codes, meaning no code is a prefix of another code. This ensures unambiguous decoding.
- It is optimally efficient for character-by-character coding when character frequencies are known.