.. _Huffman:
.. |--| unicode:: U+2013 .. en dash
.. |---| unicode:: U+2014 .. em dash, trimming surrounding whitespace
:trim:
.. This file is part of the OpenDSA eTextbook project. See
.. http://algoviz.org/OpenDSA for more details.
.. Copyright (c) 2012-2016 by the OpenDSA Project Contributors, and
.. distributed under an MIT open source license.
.. avmetadata::
:author: Cliff Shaffer
:requires: BST
:satisfies: Huffman coding tree
:topic: BinTreeImp
.. odsalink:: DataStructures/huffman.css
.. odsalink:: AV/Binary/huffmanCON.css
.. index:: ! Huffman
Huffman Coding Trees
====================
Huffman Coding Trees
--------------------
One can often gain an improvement in space requirements in exchange
for a penalty in running time.
There are many situations where this is a desirable tradeoff.
A typical example is storing files on disk.
If the files are not actively used, the owner might wish to compress
them to save space. Later, they can be uncompressed for use, which
costs some time, but only once.
We often represent a set of items in a computer program by assigning a
unique code to each item.
For example, the standard :term:`ASCII coding scheme` assigns a
unique eight-bit value to each character.
It takes a certain minimum number of bits to provide enough unique
codes so that we have a different one for each character.
For example, it takes :math:`\left\lceil \log 128 \right\rceil`
or seven bits to provide the 128 unique codes needed
to represent the 128 symbols of the ASCII character set. [#]_
The requirement for :math:`\left\lceil \log n \right\rceil` bits to
represent :math:`n` unique code values assumes that all codes will be
the same length, as are ASCII codes.
These are called :term:`fixed-length codes`.
If all characters were used equally often, then a fixed-length coding
scheme would be the most space-efficient method.
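As a quick check of the arithmetic, here is a tiny Java helper
(hypothetical, not part of the book's code base) that computes
:math:`\left\lceil \log n \right\rceil` with integer arithmetic:

.. code-block:: java

   // Bits needed to give each of n items a distinct fixed-length
   // code: the ceiling of log2(n), computed without floating point.
   static int bitsNeeded(int n) {
     return (n <= 1) ? 0 : 32 - Integer.numberOfLeadingZeros(n - 1);
   }

For example, ``bitsNeeded(128)`` returns 7, matching the seven bits
cited above, and ``bitsNeeded(26)`` returns 5, which will come up
again when we discuss coding the 26 letters of the alphabet.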
However, you are probably aware that not all characters are used
equally often in many applications.
For example, the various letters in an English language document have
greatly different frequencies of use.
Table :num:`Table #Freq` shows the relative frequencies of the
letters of the alphabet. From this table we can see that the letter
'E' appears about 60 times more often than the letter 'Z'. In normal
ASCII, the words "DEED" and "MUCK" require the same amount of space
(four bytes). It would seem that words such as "DEED", which are
composed of relatively common letters, should be storable in less
space than words such as "MUCK", which are composed of relatively
uncommon letters.
.. _Freq:
.. topic:: Table
Relative frequencies for the 26 letters of the
alphabet as they appear in a selected set of English
documents. "Frequency" represents the expected frequency of occurrence
per 1000 letters, ignoring case.
.. math::

   \begin{array}{c|c|c|c}
   \hline
   \textbf{Letter}&\textbf{Frequency}&\textbf{Letter}&\textbf{Frequency}\\
   \hline
   \mathrm{A} & 77 & \mathrm{N} & 67\\
   \mathrm{B} & 17 & \mathrm{O} & 67\\
   \mathrm{C} & 32 & \mathrm{P} & 20\\
   \mathrm{D} & 42 & \mathrm{Q} & 5\\
   \mathrm{E} & 120 & \mathrm{R} & 59\\
   \mathrm{F} & 24 & \mathrm{S} & 67\\
   \mathrm{G} & 17 & \mathrm{T} & 85\\
   \mathrm{H} & 50 & \mathrm{U} & 37\\
   \mathrm{I} & 76 & \mathrm{V} & 12\\
   \mathrm{J} & 4 & \mathrm{W} & 22\\
   \mathrm{K} & 7 & \mathrm{X} & 4\\
   \mathrm{L} & 42 & \mathrm{Y} & 22\\
   \mathrm{M} & 24 & \mathrm{Z} & 2\\
   \hline
   \end{array}
If some characters are used more frequently than others, is it
possible to take advantage of this fact and somehow assign them
shorter codes?
The price could be that other characters require longer codes, but
this might be worthwhile if such characters appear rarely enough.
This concept is at the heart of file compression techniques in
common use today.
The next section presents one such approach to assigning
:term:`variable-length codes`,
called :term:`Huffman coding`.
While it is not commonly used in its simplest form for file
compression (there are better methods), Huffman coding gives the
flavor of such coding schemes.
One motivation for studying Huffman coding is that it provides our
first opportunity to see a type of tree structure referred to as a
:term:`search trie`.
.. [#] ASCII coding actually uses eight bits per character.
   Seven bits are used to represent the 128 codes of the ASCII
   character set.
   The eighth bit serves as a :term:`parity` bit, which can be used
   to check for a transmission error in the character.
Building Huffman Coding Trees
-----------------------------
Huffman coding assigns codes to characters such that the length of the
code depends on the relative frequency or :term:`weight` of the
corresponding character.
Thus, it is a variable-length code.
If the estimated frequencies for letters match the actual frequency
found in an encoded message, then the length of that message will
typically be less than if a fixed-length code had been used.
The Huffman code for each letter is derived from a full binary tree
called the :term:`Huffman coding tree`, or simply the
:term:`Huffman tree`.
Each leaf of the Huffman tree corresponds to a letter, and we
define the weight of the leaf node to be the weight (frequency) of its
associated letter.
The goal is to build a tree with the
:term:`minimum external path weight`.
Define the :term:`weighted path length` of a leaf to be its weight
times its depth.
The binary tree with minimum external path weight is the one with the
minimum sum of weighted path lengths for the given set of leaves.
A letter with high weight should have low depth, so that it will count
the least against the total path length.
As a result, another letter might be pushed deeper in the tree if it
has less weight.
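In symbols, if leaf :math:`i` has weight :math:`w_i` and depth
:math:`d_i`, then the goal is to choose the tree shape that minimizes
the external path weight

.. math::

   \sum_{i=1}^{n} w_i d_i.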
The process of building the Huffman tree for :math:`n` letters is
quite simple.
First, create a collection of :math:`n` initial Huffman trees,
each of which is a single leaf node containing one of the letters.
Put the :math:`n` partial trees onto a priority queue
organized by weight (frequency).
Next, remove the first two trees (the ones with lowest weight) from
the priority queue.
Join these two trees together to create a new tree whose root has the
two trees as children, and whose weight is the sum of the weights of
the two trees.
Put this new tree back into the priority queue.
This process is repeated until all of the partial Huffman trees have
been combined into one.
.. _FreqExamp:
.. topic:: Table
The relative frequencies for eight selected letters.
.. math::

   \begin{array}{|c|cccccccc|}
   \hline
   \textbf{Letter} & \mathrm{C} & \mathrm{D} & \mathrm{E} & \mathrm{K} & \mathrm{L} & \mathrm{M} & \mathrm{U} & \mathrm{Z}\\
   \textbf{Frequency} & 32 & 42 & 120 & 7 & 42 & 24 & 37 & 2\\
   \hline
   \end{array}
The following slideshow illustrates the Huffman tree
construction process for the eight letters of
Table :num:`Table #FreqExamp`. [#]_
.. inlineav:: huffmanBuildCON ss
:long_name: Huffman Coding Tree Slideshow: Build
:points: 0.0
:required: False
:threshold: 1.0
:output: show
Here is the implementation for Huffman tree nodes.
.. codeinclude:: Binary/Huffman
:tag: HuffmanNode
This implementation is similar to
a typical class hierarchy
for implementing full binary trees.
There is an abstract :term:`base class`, named ``HuffNode``, and two
:term:`subclasses `, named ``LeafNode`` and ``IntlNode``.
This implementation reflects the fact that leaf and
internal nodes contain distinctly different information.
Here is the implementation for the Huffman Tree class.
.. codeinclude:: Binary/Huffman
:tag: HuffmanTree
Here is the implementation for the tree-building process.
.. codeinclude:: Binary/Huffman
:tag: HuffmanTreeBuild
``buildTree`` takes as input ``fl``, the min-heap of partial
Huffman trees, which initially are single leaf nodes as shown in Step
1 of the slideshow above.
The body of ``buildTree`` consists mainly of a ``for``
loop. On each iteration of the ``for`` loop, the first two partial
trees (the two with lowest weight) are taken off the heap and placed
in variables ``temp1`` and ``temp2``.
A new tree (``temp3``) is created such that its left and right
subtrees are ``temp1`` and ``temp2``, respectively, and its weight is
the sum of their weights.
Finally, ``temp3`` is placed back into ``fl``.
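To make the process concrete, here is a minimal, self-contained Java
sketch of the same idea. It is not the codeinclude source above: the
class names are simplified, and it uses ``java.util.PriorityQueue``
in place of the book's min-heap.

.. code-block:: java

   import java.util.PriorityQueue;

   // Simplified stand-in for the HuffNode/LeafNode/IntlNode
   // hierarchy: leaves carry a letter, internal nodes only a weight.
   class Node implements Comparable<Node> {
     final int weight;       // total frequency of this subtree
     final char letter;      // meaningful only for leaves
     final Node left, right; // both null for a leaf

     Node(char letter, int weight) {           // leaf constructor
       this.weight = weight; this.letter = letter;
       this.left = null; this.right = null;
     }
     Node(int weight, Node left, Node right) { // internal constructor
       this.weight = weight; this.letter = '\0';
       this.left = left; this.right = right;
     }
     boolean isLeaf() { return left == null; }
     public int compareTo(Node other) {
       return Integer.compare(weight, other.weight);
     }
   }

   class HuffmanBuild {
     // Repeatedly join the two lightest partial trees until only
     // the completed Huffman tree remains on the queue.
     static Node buildTree(PriorityQueue<Node> pq) {
       while (pq.size() > 1) {
         Node temp1 = pq.poll();  // lightest partial tree
         Node temp2 = pq.poll();  // second lightest
         pq.add(new Node(temp1.weight + temp2.weight, temp1, temp2));
       }
       return pq.poll();
     }
   }

Seeding the queue with one leaf per letter of
Table :num:`Table #FreqExamp` and calling ``buildTree`` reproduces
the tree from the slideshow, up to tie-breaking among subtrees of
equal weight.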
.. [#] To keep things simple, these examples for building Huffman
   trees use a :term:`sorted list` to keep the partial Huffman trees
   ordered by frequency.
   But a real implementation would use a :term:`heap` to implement a
   :term:`priority queue` keyed by the frequencies.
Assigning and Using Huffman Codes (1)
-------------------------------------
Once the Huffman tree has been constructed, it is an easy matter to
assign codes to individual letters.
Beginning at the root, we assign either a '0' or a '1' to each edge in
the tree. '0' is assigned to edges connecting a node with its left
child, and '1' to edges connecting a node with its right child.
This process is illustrated by the following slideshow.
.. inlineav:: huffmanLabelCON ss
:long_name: Huffman Coding Tree Slideshow: Label Edges
:points: 0.0
:required: False
:threshold: 1.0
:output: show
Assigning and Using Huffman Codes (2)
-------------------------------------
Now that we have seen how the edges correspond to bits in the code,
it is a simple matter to generate the code for each letter (since
each letter corresponds to a leaf node in the tree).
.. inlineav:: huffmanCodesCON ss
:long_name: Huffman Coding Tree Slideshow: Setting Codes
:points: 0.0
:required: False
:threshold: 1.0
:output: show
Now that we have a code for each letter,
encoding a text message is done by replacing each letter of the
message with its binary code.
A lookup table can be used for this purpose.
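Continuing the sketch from the previous section (and assuming its
hypothetical ``Node`` class), the code table can be built by one
recursive walk that appends '0' on each left branch and '1' on each
right branch, after which encoding is a series of table lookups:

.. code-block:: java

   import java.util.HashMap;
   import java.util.Map;

   class HuffmanCodes {
     // Record the root-to-leaf path ('0' = left, '1' = right)
     // as each letter's code.
     static Map<Character, String> buildTable(Node root) {
       Map<Character, String> table = new HashMap<>();
       assign(root, "", table);
       return table;
     }
     private static void assign(Node n, String prefix,
                                Map<Character, String> table) {
       if (n.isLeaf()) {
         table.put(n.letter, prefix);
       } else {
         assign(n.left, prefix + "0", table);
         assign(n.right, prefix + "1", table);
       }
     }

     // Encode a message by concatenating each letter's code.
     static String encode(String msg, Map<Character, String> table) {
       StringBuilder bits = new StringBuilder();
       for (char c : msg.toCharArray()) {
         bits.append(table.get(c));
       }
       return bits.toString();
     }
   }

With codes matching the tree shown in the slideshow,
``encode("DEED", table)`` produces eight bits while
``encode("MUCK", table)`` produces eighteen, the counts discussed
later in this chapter.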
Decoding
--------
A set of codes is said to meet the :term:`prefix property` if no
code in the set is the prefix of another.
The prefix property guarantees that there will be no ambiguity in how
a bit string is decoded.
In other words, once we reach the last bit of a code during
the decoding process, we know which letter it is the code for.
Huffman codes certainly have the prefix property because any prefix
for a code would correspond to an internal node, while all codes
correspond to leaf nodes.
When we decode a character using the Huffman coding tree, we follow a
path through the tree dictated by the bits in the code string.
Each '0' bit indicates a left branch while each '1' bit indicates a
right branch.
The following slideshow shows an example for how to decode a message
by traversing the tree appropriately.
.. inlineav:: huffmanDecodeCON ss
:long_name: Huffman Coding Tree Slideshow: Decoding
:points: 0.0
:required: False
:threshold: 1.0
:output: show
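In the same sketch (again assuming the hypothetical ``Node`` class),
decoding is a loop that walks from the root, emits a letter at each
leaf, and restarts at the root:

.. code-block:: java

   class HuffmanDecode {
     // Follow the bits from the root: '0' goes left, '1' goes
     // right. Reaching a leaf completes one letter.
     static String decode(String bits, Node root) {
       StringBuilder out = new StringBuilder();
       Node n = root;
       for (char bit : bits.toCharArray()) {
         n = (bit == '0') ? n.left : n.right;
         if (n.isLeaf()) {
           out.append(n.letter);
           n = root;  // start decoding the next letter
         }
       }
       return out.toString();
     }
   }

The prefix property is exactly what makes this loop unambiguous: no
valid code's path ever passes through another letter's leaf.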
Decoding Practice
-----------------
.. avembed:: Exercises/Binary/HuffmanDecodePRO.html ka
:module: Huffman
:long_name: Huffman Decoding Proficiency Exercise
:points: 1.0
:required: True
:threshold: 5.0
:exer_opts: JOP-lang=en&JXOP-code=java
How efficient is Huffman coding?
--------------------------------
In theory, Huffman coding is an optimal coding method whenever the
true frequencies are known, and the frequency of a letter is
independent of the context of that letter in the message.
In practice, the frequencies of letters in an English text document do
change depending on context.
For example, while E is the most commonly used letter of the alphabet
in English documents, T is more common as the first letter of a
word.
This is why most commercial compression utilities do not use Huffman
coding as their primary coding method, but instead use techniques that
take advantage of the context for the letters.
Another factor that affects the compression efficiency of Huffman
coding is the relative frequencies of the letters.
Some frequency patterns will save no space as compared to fixed-length
codes; others can result in great compression.
In general, Huffman coding does better when there is large variation
in the frequencies of letters.
.. topic:: Example
In the particular case of the frequencies shown in
Table :num:`Table #FreqExamp`, we can determine the expected savings from
Huffman coding if the actual frequencies of a coded message match the
expected frequencies.
Because the sum of the frequencies is 306 and E has frequency 120,
we expect it to appear 120 times in a message containing 306
letters.
An actual message might or might not meet this expectation.
Letters D, L, and U have code lengths of three,
and together are expected to appear 121 times in 306 letters.
Letter C has a code length of four, and is expected to appear 32
times in 306 letters.
Letter M has a code length of five, and is expected to appear
24 times in 306 letters.
Finally, letters K and Z have code lengths of six,
and together are expected to appear only 9 times in 306 letters.
The average expected cost per character is simply the sum of
the cost for each character (:math:`c_i`) times the probability of
its occurring (:math:`p_i`), or
:math:`c_1 p_1 + c_2 p_2 + \cdots + c_n p_n.`
This can be reorganized as
:math:`\frac{c_1 f_1 + c_2 f_2 + \cdots + c_n f_n}{f_T}`,
where :math:`f_i` is the (relative) frequency of letter
:math:`i` and :math:`f_T` is the total for all letter frequencies.
For this set of frequencies, the expected cost per letter is
:math:`[(1 \times 120) + (3 \times 121) + (4 \times 32) + (5 \times 24) + (6 \times 9)]/306 = 785/306 \approx 2.57.`
A fixed-length code for these eight characters would require
:math:`\log 8 = 3` bits per letter as opposed to about 2.57 bits
per letter for Huffman coding.
Thus, Huffman coding is expected to save about 14% for this set of
letters.
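The arithmetic is easy to check mechanically. The following snippet
(frequencies from Table :num:`Table #FreqExamp`, code lengths read
off the tree built earlier) recomputes the expected cost:

.. code-block:: java

   public class ExpectedCost {
     public static void main(String[] args) {
       // Letters E, D, L, U, C, M, K, Z.
       int[] freq = {120, 42, 42, 37, 32, 24, 7, 2};
       int[] len  = {  1,  3,  3,  3,  4,  5, 6, 6};
       int totalBits = 0, totalFreq = 0;
       for (int i = 0; i < freq.length; i++) {
         totalBits += freq[i] * len[i];  // weighted path length
         totalFreq += freq[i];
       }
       // Prints 785/306, about 2.57 bits per letter.
       System.out.println((double) totalBits / totalFreq);
     }
   }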
Huffman coding for all ASCII symbols should do better than this
example.
The letters of Table :num:`Table #FreqExamp` are atypical in that there
are too many common letters compared to the number of rare letters.
Huffman coding for all 26 letters would yield an expected
cost of 4.29 bits per letter.
The equivalent fixed-length code would require five bits.
This is somewhat unfair to fixed-length coding because there is
actually room for 32 codes in five bits, but only 26 letters.
More generally, Huffman coding of a typical text file
will save around 40% over ASCII coding if we charge ASCII coding at
eight bits per character.
Huffman coding for a binary file (such as a compiled executable) would
have a very different set of distribution frequencies and so would
have a different space savings.
Most commercial compression programs use two or three coding schemes
to adjust to different types of files.
In the decoding example, "DEED" was coded in eight bits, a saving of
33% over the twelve bits required by a fixed-length coding.
However, "MUCK" would require eighteen bits, more space than required
by the corresponding fixed-length coding.
The problem is that "MUCK" is composed of letters that are not
expected to occur often.
If the message does not match the expected frequencies of the letters,
then the length of the encoding will not be as expected either.
.. odsascript:: DataStructures/huffman.js
.. odsascript:: AV/Binary/huffmanBuildCON.js
.. odsascript:: AV/Binary/huffmanLabelCON.js
.. odsascript:: AV/Binary/huffmanCodesCON.js
.. odsascript:: AV/Binary/huffmanDecodeCON.js