This section provides practice in the use of list structure and data
abstraction to manipulate sets and trees. The application is to
methods for representing data as sequences of ones and zeros (bits).
For example, the
ASCII standard code used to represent text in
computers encodes each
character as a sequence of seven bits. Using
seven bits allows us to distinguish $2^7$, or
128, possible different characters. In general, if we want to distinguish
$n$ different symbols, we will need to use
$\log_2 n$ bits per symbol. If all our messages
are made up of the eight symbols A, B, C, D, E, F, G, and H, we can choose
a code with three bits per character, for example

A 000
B 001
C 010
D 011
E 100
F 101
G 110
H 111

With this code, the message

BACADAEAFABBAAAGAH

is encoded as the string of 54 bits

001000010000011000100000101000001001000000000110000111
Codes such as ASCII and the A-through-H code above are known as
fixed-length codes, because they represent each symbol in the
message with the same number of bits. It is sometimes advantageous to use
variable-length codes, in which different symbols may be
represented by different numbers of bits. For example,
Morse code does not use the same number of dots and dashes for each letter
of the alphabet. In particular, E, the most frequent letter, is represented
by a single dot. In general, if our messages are such that some symbols
appear very frequently and some very rarely, we can encode data more
efficiently (i.e., using fewer bits per message) if we assign shorter
codes to the frequent symbols. Consider the following alternative code for
the letters A through H:
A 0
B 100
C 1010
D 1011
E 1100
F 1101
G 1110
H 1111
With this code, the same message as above is encoded as the string
100010100101101100011010100100000111001111
This string contains 42 bits, so it saves more than 20% in space in
comparison with the fixed-length code shown above.
One of the difficulties of using a variable-length code is knowing
when you have reached the end of a symbol in reading a sequence of
zeros and ones. Morse code solves this problem by using a special
separator code (in this case, a pause) after the sequence of
dots and dashes for each letter. Another solution is to design the
code in such a way that no complete code for any symbol is the
beginning (or prefix) of the code for another symbol. Such a
code is called a
prefix code. In the example above, A is encoded by 0 and B is
encoded by 100, so no other symbol can have a code that begins with 0 or
with 100.
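The prefix property is easy to check mechanically. The following sketch tests whether a set of code words forms a prefix code; the table and the helper name `is_prefix_code` are illustrative, not from the text.

```javascript
// The variable-length A-through-H code from the table above.
const codes = ["0", "100", "1010", "1011", "1100", "1101", "1110", "1111"];

// A code is a prefix code if no complete code word is the
// beginning (prefix) of another code word.
function is_prefix_code(code_words) {
    for (const a of code_words) {
        for (const b of code_words) {
            if (a !== b && b.startsWith(a)) {
                return false;  // a is a prefix of b: decoding would be ambiguous
            }
        }
    }
    return true;
}

console.log(is_prefix_code(codes));             // true
console.log(is_prefix_code(["0", "01", "11"])); // false: "0" is a prefix of "01"
```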
In general, we can attain significant savings if we use
variable-length prefix codes that take advantage of the relative
frequencies of the symbols in the messages to be encoded. One
particular scheme for doing this is called the Huffman encoding
method, after its discoverer,
David Huffman. A Huffman code can be represented as a
binary tree whose leaves are the symbols that are encoded. At each
non-leaf node of the tree there is a set containing all the symbols in the
leaves that lie below the node. In addition, each symbol at a leaf is
assigned a weight (which is its relative frequency), and each non-leaf
node contains a weight that is the sum of all the weights of the leaves
lying below it. The weights are not used in the encoding or the decoding
process. We will see below how they are used to help construct the tree.
Figure 2.27 shows the Huffman tree for the
A-through-H code given above. The weights at the leaves indicate that the
tree was designed for messages in which A appears with relative frequency
8, B with relative frequency 3, and the other letters each with relative
frequency 1.
Given a Huffman tree, we can find the encoding of any symbol by
starting at the root and moving down until we reach the leaf that
holds the symbol. Each time we move down a left branch we add a 0 to
the code, and each time we move down a right branch we add a 1. (We
decide which branch to follow by testing to see which branch either is
the leaf node for the symbol or contains the symbol in its set.) For
example, starting from the root of the tree in
figure 2.27, we arrive at the leaf for D by
following a right branch, then a left branch, then a right branch, then a
right branch; hence, the code for D is 1011.
To decode a bit sequence using a Huffman tree, we begin at the root
and use the successive zeros and ones of the bit sequence to determine
whether to move down the left or the right branch. Each time we come
to a leaf, we have generated a new symbol in the message, at which
point we start over from the root of the tree to find the next symbol.
For example, suppose we are given the tree above and the sequence
10001010. Starting at the root, we move down the right branch (since
the first bit of the string is 1), then down the left branch (since
the second bit is 0), then down the left branch (since the third bit
is also 0). This brings us to the leaf for B, so the first
symbol of the decoded message is B. Now we start again at the root,
and we make a left move because the next bit in the string is 0.
This brings us to the leaf for A. Then we start again at the root
with the rest of the string 1010, so we move right, left, right, left and
reach C. Thus, the entire message is BAC.
Generating Huffman trees
Given an alphabet of symbols and their relative frequencies,
how do we construct the best code? (In other words, which
tree will encode messages with the fewest bits?) Huffman gave an algorithm
for doing this and showed that the resulting code is indeed the best
variable-length code for messages where the relative frequency of the
symbols matches the frequencies with which the code was constructed.
We will not prove this optimality of Huffman codes here, but we will
show how Huffman trees are constructed.
The algorithm for generating a Huffman tree is very simple. The idea
is to arrange the tree so that the symbols with the lowest frequency
appear farthest away from the root. Begin with the set of leaf nodes,
containing symbols and their frequencies, as determined by the initial data
from which the code is to be constructed. Now find two leaves with
the lowest weights and merge them to produce a node that has these
two nodes as its left and right branches. The weight of the new node
is the sum of the two weights. Remove the two leaves from the
original set and replace them by this new node. Now continue this
process. At each step, merge two nodes with the smallest weights,
removing them from the set and replacing them with a node that has
these two as its left and right branches. The process stops when
there is only one node left, which is the root of the entire tree.
Here is how the Huffman tree of figure 2.27 was
generated:
Initial leaves $\{$(A 8) (B 3) (C 1) (D 1) (E 1) (F 1) (G 1) (H 1)$\}$
Merge $\{$(A 8) (B 3) ($\{$C D$\}$ 2) (E 1) (F 1) (G 1) (H 1)$\}$
Merge $\{$(A 8) (B 3) ($\{$C D$\}$ 2) ($\{$E F$\}$ 2) (G 1) (H 1)$\}$
Merge $\{$(A 8) (B 3) ($\{$C D$\}$ 2) ($\{$E F$\}$ 2) ($\{$G H$\}$ 2)$\}$
Merge $\{$(A 8) (B 3) ($\{$C D$\}$ 2) ($\{$E F G H$\}$ 4)$\}$
Merge $\{$(A 8) ($\{$B C D$\}$ 5) ($\{$E F G H$\}$ 4)$\}$
Merge $\{$(A 8) ($\{$B C D E F G H$\}$ 9)$\}$
Final merge $\{$($\{$A B C D E F G H$\}$ 17)$\}$
The algorithm does not always specify a unique tree, because there may
not be unique smallest-weight nodes at each step. Also, the choice of
the order in which the two nodes are merged (i.e., which will be the
right branch and which will be the left branch) is arbitrary.
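The merging process described above can be sketched in plain JavaScript. This uses ordinary arrays and object nodes rather than the book's list representation, and the names (`build_huffman`, `code_table`) are mine; it illustrates the algorithm, not the representation developed below.

```javascript
// Sketch of Huffman tree generation: repeatedly merge the two
// lowest-weight nodes until a single root remains.
function build_huffman(freqs) {            // freqs: { symbol: weight, ... }
    let nodes = Object.entries(freqs).map(([symbol, weight]) => ({ symbol, weight }));
    while (nodes.length > 1) {
        nodes.sort((a, b) => a.weight - b.weight);  // smallest weights first
        const [left, right] = nodes.splice(0, 2);   // take the two smallest
        nodes.push({ left, right, weight: left.weight + right.weight });
    }
    return nodes[0];                                // root of the entire tree
}

// Read codes off the tree: a left branch adds a 0, a right branch a 1.
function code_table(tree, prefix = "", table = {}) {
    if (tree.symbol !== undefined) {
        table[tree.symbol] = prefix === "" ? "0" : prefix;
    } else {
        code_table(tree.left, prefix + "0", table);
        code_table(tree.right, prefix + "1", table);
    }
    return table;
}

const table = code_table(build_huffman({ A: 8, B: 3, C: 1, D: 1,
                                         E: 1, F: 1, G: 1, H: 1 }));
console.log(table.A.length);  // 1: the most frequent symbol gets the shortest code
console.log(table.B.length);  // 3
```

Because ties and left/right order are broken arbitrarily, the exact bit patterns may differ from the table above, but the code lengths match it.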
In the exercises below we will work with a system that uses
Huffman trees to encode and decode messages and generates Huffman
trees according to the algorithm outlined above. We will begin by
discussing how trees are represented.
Leaves of the tree are represented by a list consisting of the
string "leaf",
the symbol at the leaf, and the weight:
function make_leaf(symbol, weight) {
return list("leaf", symbol, weight);
}
function is_leaf(object) {
return head(object) === "leaf";
}
function symbol_leaf(x) { return head(tail(x)); }
function weight_leaf(x) { return head(tail(tail(x))); }
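A quick check of the leaf representation. Since we are outside the Source language environment here, the block defines minimal stand-ins for the list operations `pair`, `head`, `tail`, and `list` before using the constructors and selectors from the text.

```javascript
// Minimal stand-ins for the Source list operations.
const pair = (h, t) => [h, t];
const head = p => p[0];
const tail = p => p[1];
const list = (...xs) => xs.reduceRight((rest, x) => pair(x, rest), null);

// Leaf representation from the text.
function make_leaf(symbol, weight) {
    return list("leaf", symbol, weight);
}
function is_leaf(object) {
    return head(object) === "leaf";
}
function symbol_leaf(x) { return head(tail(x)); }
function weight_leaf(x) { return head(tail(tail(x))); }

const leaf = make_leaf("A", 8);
console.log(is_leaf(leaf));      // true
console.log(symbol_leaf(leaf));  // "A"
console.log(weight_leaf(leaf));  // 8
```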
A general tree will be a list of
a string "code_tree",
a left branch, a right branch, a set
of symbols, and a weight. The set of symbols will be simply a list of
the symbols, rather than some more sophisticated set representation.
When we make a tree by merging two nodes, we obtain the weight of the
tree as the sum of the weights of the nodes, and the set of symbols as
the union of the sets of symbols for the nodes. Since our symbol sets are
represented as lists, we can form the union by using the
append
function
we defined in section 2.2.1:
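The constructor itself does not appear in this excerpt; the following is a sketch consistent with the selectors given just below (which are repeated here, along with stand-ins for the Source list operations, to make the block self-contained).

```javascript
// Minimal stand-ins for the Source list operations.
const pair = (h, t) => [h, t];
const head = p => p[0];
const tail = p => p[1];
const is_null = p => p === null;
const list = (...xs) => xs.reduceRight((rest, x) => pair(x, rest), null);
const append = (xs, ys) => is_null(xs) ? ys : pair(head(xs), append(tail(xs), ys));

// Leaf representation and selectors from the text.
function make_leaf(symbol, weight) { return list("leaf", symbol, weight); }
function is_leaf(object) { return head(object) === "leaf"; }
function symbol_leaf(x) { return head(tail(x)); }
function weight_leaf(x) { return head(tail(tail(x))); }
function symbols(tree) {
    return is_leaf(tree)
           ? list(symbol_leaf(tree))
           : head(tail(tail(tail(tree))));
}
function weight(tree) {
    return is_leaf(tree)
           ? weight_leaf(tree)
           : head(tail(tail(tail(tail(tree)))));
}

// Sketch of the constructor: a general tree is a list of the string
// "code_tree", the left branch, the right branch, the union (append)
// of the branches' symbol sets, and the sum of their weights.
function make_code_tree(left, right) {
    return list("code_tree",
                left,
                right,
                append(symbols(left), symbols(right)),
                weight(left) + weight(right));
}

const t = make_code_tree(make_leaf("A", 4), make_leaf("B", 2));
console.log(weight(t));  // 6
```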
If we make a tree in this way, we have the following selectors:
function left_branch(tree) { return head(tail(tree)); }
function right_branch(tree) { return head(tail(tail(tree))); }
function symbols(tree) {
return is_leaf(tree)
? list(symbol_leaf(tree))
: head(tail(tail(tail(tree))));
}
function weight(tree) {
return is_leaf(tree)
? weight_leaf(tree)
: head(tail(tail(tail(tail(tree)))));
}
The
functions
symbols and
weight must do something slightly different
depending on whether they are called with a leaf or a general tree.
These are simple examples of
generic
functions
(functions
that can handle more than one kind of data), which we will have much more
to say about in sections 2.4
and 2.5.
function decode(bits, tree) {
function decode_1(bits, current_branch) {
if (is_null(bits)) {
return null;
} else {
const next_branch = choose_branch(head(bits),
current_branch);
return is_leaf(next_branch)
? pair(symbol_leaf(next_branch),
decode_1(tail(bits), tree))
: decode_1(tail(bits), next_branch);
}
}
return decode_1(bits, tree);
}
function choose_branch(bit, branch) {
return bit === 0
? left_branch(branch)
: bit === 1
? right_branch(branch)
: error(bit, "bad bit -- choose_branch");
}
The
function
decode_1
takes two arguments: the list of remaining bits and the current position in
the tree. It keeps moving down the tree, choosing a left or
a right branch according to whether the next bit in the list is a zero or a
one. (This is done with the
function
choose_branch.)
When it reaches a leaf, it returns the symbol at that leaf as the next
symbol in the message by
adjoining
it to the result of decoding the rest of the message,
starting at the root of the tree.
Note the error check in the final clause of
choose_branch,
which complains if the
function
finds something other than a zero or a one in the input data.
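The decoding walk can be exercised on a small hand-built tree. The block below repeats decode and choose_branch from the text (with stand-ins for the Source list operations); the sample tree, which gives the codes A = 0, B = 10, C = 110, D = 111, is my own and omits the symbol sets and weights, since decode never consults them.

```javascript
// Minimal stand-ins for the Source list operations.
const pair = (h, t) => [h, t];
const head = p => p[0];
const tail = p => p[1];
const is_null = p => p === null;
const list = (...xs) => xs.reduceRight((rest, x) => pair(x, rest), null);
const error = (val, msg) => { throw new Error(msg + ": " + val); };

// Leaf and tree accessors from the text.
function make_leaf(symbol, weight) { return list("leaf", symbol, weight); }
function is_leaf(object) { return head(object) === "leaf"; }
function symbol_leaf(x) { return head(tail(x)); }
function left_branch(tree) { return head(tail(tree)); }
function right_branch(tree) { return head(tail(tail(tree))); }

// decode and choose_branch from the text.
function decode(bits, tree) {
    function decode_1(bits, current_branch) {
        if (is_null(bits)) {
            return null;
        } else {
            const next_branch = choose_branch(head(bits), current_branch);
            return is_leaf(next_branch)
                   ? pair(symbol_leaf(next_branch), decode_1(tail(bits), tree))
                   : decode_1(tail(bits), next_branch);
        }
    }
    return decode_1(bits, tree);
}
function choose_branch(bit, branch) {
    return bit === 0
           ? left_branch(branch)
           : bit === 1
           ? right_branch(branch)
           : error(bit, "bad bit -- choose_branch");
}

// Hand-built tree: A = 0, B = 10, C = 110, D = 111.
const node = (left, right) => list("code_tree", left, right);
const sample_tree =
    node(make_leaf("A", 4),
         node(make_leaf("B", 2),
              node(make_leaf("C", 1), make_leaf("D", 1))));

// Decoding 1 0 0 1 1 0: 10 -> B, 0 -> A, 110 -> C.
let result = decode(list(1, 0, 0, 1, 1, 0), sample_tree);
let message = "";
while (!is_null(result)) { message = message + head(result); result = tail(result); }
console.log(message);  // "BAC"
```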
In our representation of trees, each non-leaf node contains a set of
symbols, which we have represented as a simple list. However, the
tree-generating algorithm discussed above requires that we also work
with sets of leaves and trees, successively merging the two smallest
items. Since we will be required to repeatedly find the smallest item
in a set, it is convenient to use an ordered representation for this
kind of set.
We will represent a set of leaves and trees as a list of elements,
arranged in increasing order of weight.
The following
adjoin_set
function
for constructing sets is similar to the one
described in exercise 2.61; however, items
are compared by their weights, and the element being added to the set is
never already in it.
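The code for adjoin_set is not included in this excerpt; here is a sketch that follows the description above (insertion into a list ordered by increasing weight), together with a sketch of make_leaf_set, which the text later refers to. Both are written against the leaf representation from the text, with stand-ins for the Source list operations.

```javascript
// Minimal stand-ins for the Source list operations.
const pair = (h, t) => [h, t];
const head = p => p[0];
const tail = p => p[1];
const is_null = p => p === null;
const list = (...xs) => xs.reduceRight((rest, x) => pair(x, rest), null);

// Leaf representation and weight selector from the text.
function make_leaf(symbol, weight) { return list("leaf", symbol, weight); }
function is_leaf(object) { return head(object) === "leaf"; }
function weight_leaf(x) { return head(tail(tail(x))); }
function weight(tree) {
    return is_leaf(tree)
           ? weight_leaf(tree)
           : head(tail(tail(tail(tail(tree)))));
}

// Sketch of adjoin_set: insert x into a list kept in increasing order
// of weight. Unlike exercise 2.61, comparison is by weight, and x is
// assumed not to be in the set already.
function adjoin_set(x, set) {
    return is_null(set)
           ? list(x)
           : weight(x) < weight(head(set))
           ? pair(x, set)
           : pair(head(set), adjoin_set(x, tail(set)));
}

// Sketch of make_leaf_set: turn a list of (symbol, frequency) pairs
// into an ordered set of leaves.
function make_leaf_set(pairs) {
    if (is_null(pairs)) {
        return null;
    } else {
        const first_pair = head(pairs);
        return adjoin_set(make_leaf(head(first_pair), head(tail(first_pair))),
                          make_leaf_set(tail(pairs)));
    }
}

const leaves = make_leaf_set(list(list("A", 8), list("B", 3), list("C", 1)));
console.log(weight(head(leaves)));  // 1: the lightest leaf comes first
```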
The function encode_symbol,
which you must write, returns the list of bits that encodes a given symbol
according to
a given tree.
You should design
encode_symbol
so that it signals an error if the symbol is not in the tree at all.
Test your
function
by encoding the result you obtained in
exercise 2.67 with the sample tree and
seeing whether it is the same as the original sample message.
The following
function
takes as its argument a list of symbol-frequency pairs (where no symbol
appears in more than one pair) and generates a Huffman encoding tree
according to the Huffman algorithm.
function generate_huffman_tree(pairs) {
return successive_merge(make_leaf_set(pairs));
}
The function make_leaf_set
that transforms the list of pairs into an ordered set of leaves is
given above. Write the function
successive_merge using
make_code_tree to successively
merge the smallest-weight elements of the set until there is only one
element left, which is the desired Huffman tree.
(This
function
is slightly tricky, but not really complicated. If you find yourself
designing a complex
function,
then you are almost certainly doing something wrong. You can take
significant advantage of the fact that we are using an ordered set
representation.)
The following eight-symbol alphabet with associated relative
frequencies was designed to efficiently encode the lyrics of 1950s
rock songs. (Note that the symbols of an
alphabet need not be individual letters.)
How many bits are required for the encoding? What is the smallest number
of bits that would be needed to encode this song if we used a fixed-length
code for the eight-symbol alphabet?
We have an alphabet of $n = 8$ symbols, and a
message of $m = 36$ symbols. Then the minimum
number of bits to encode a specific symbol using a fixed-length code is
$\lceil \log_2{n} \rceil = 3$. Thus the minimum
number of bits to encode all the lyrics is
$m \lceil\log_2{n}\rceil = 36 \times 3 = 108$.
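The arithmetic in this answer can be checked directly:

```javascript
// Fixed-length cost: ceil(log2 n) bits per symbol, times m symbols.
const n = 8;    // alphabet size
const m = 36;   // message length in symbols
const bits_per_symbol = Math.ceil(Math.log2(n));
console.log(bits_per_symbol);      // 3
console.log(m * bits_per_symbol);  // 108
```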
Suppose we have a Huffman tree for an alphabet of
$n$ symbols, and that the relative frequencies
of the symbols are 1, 2, 4, …,
$2^{n-1}$. Sketch the tree for
$n$=5; for $n$=10.
In such a tree (for general $n$) how many bits
are required to encode the most frequent symbol? the least frequent symbol?
The tree will be unbalanced, similar to the tree given in
figure 2.26. Encoding the most
frequent symbol requires one bit, whereas
$n - 1$ bits are required to encode the
least frequent symbol.
Consider the encoding
function
that you designed in exercise 2.68. What
is the
order of growth in the number of steps needed to encode a symbol?
Be sure to include the number of steps needed to search the symbol list at
each node encountered. To answer this question in general is difficult.
Consider the special case where the relative frequencies of the
$n$ symbols are as described in
exercise 2.71, and give the order of
growth (as a function of $n$) of the number of
steps needed to encode the most frequent and least frequent symbols in the
alphabet.
Consider the special case described in exercise 2.71. Encoding the most
frequent symbol takes a single step down the tree, with one linear search
through the root's symbol list, so it requires $O(n)$ steps. Encoding the
least frequent symbol follows a path of length $n - 1$, and at each step we
do a linear search in a symbol list of length
$n, n-1, \ldots, 1$; in the worst case this takes
$O(n \times n / 2) = O(n^2)$ steps.