2.3.3 Example: Representing Sets - SICP Comparison Edition

Permalink copied!

2.3.3

In the previous examples we built representations for two kinds of compound data objects: rational numbers and algebraic expressions. In one of these examples we had the choice of simplifying (reducing) the expressions at either construction time or selection time, but other than that the choice of a representation for these structures in terms of lists was straightforward. When we turn to the representation of sets, the choice of a representation is not so obvious. Indeed, there are a number of possible representations, and they differ significantly from one another in several ways.

Informally, a set is simply a collection of distinct objects. To give a more precise definition we can employ the method of data abstraction. That is, we define set by specifying the operations that are to be used on sets. These are union_set, intersection_set, is_element_of_set, and adjoin_set. The function is_element_of_set is a predicate that determines whether a given element is a member of a set. The function adjoin_set takes an object and a set as arguments and returns a set that contains the elements of the original set and also the adjoined element. The function union_set computes the union of two sets, which is the set containing each element that appears in either argument. The function intersection_set computes the intersection of two sets, which is the set containing only elements that appear in both arguments. From the viewpoint of data abstraction, we are free to design any representation that implements these operations in a way consistent with the interpretations given above.

[1]

Sets as unordered lists
順序なしリストとしての集合

One way to represent a set is as a list of its elements in which no element appears more than once. The empty set is represented by the empty list. In this representation, is_element_of_set is similar to the function member of section 2.3.1. It uses equal instead of === so that the set elements need not be just numbers or strings:

function is_element_of_set(x, set) {
    return is_null(set) 
           ? false
           : equal(x, head(set))
           ? true
           : is_element_of_set(x, tail(set));
}

Using this, we can write adjoin_set. If the object to be adjoined is already in the set, we just return the set. Otherwise, we use pair to add the object to the list that represents the set:

function adjoin_set(x, set) {
    return is_element_of_set(x, set)
           ? set
           : pair(x, set);
}

For intersection_set we can use a recursive strategy. If we know how to form the intersection of set2 and the tail of set1, we only need to decide whether to include the head of set1 in this. But this depends on whether head(set1) is also in set2. Here is the resulting function:

function intersection_set(set1, set2) {
    return is_null(set1) || is_null(set2)
           ? null
           : is_element_of_set(head(set1), set2)
           ? pair(head(set1), intersection_set(tail(set1), set2))
           : intersection_set(tail(set1), set2);
}

In designing a representation, one of the issues we should be concerned with is efficiency. Consider the number of steps required by our set operations. Since they all use is_element_of_set, the speed of this operation has a major impact on the efficiency of the set implementation as a whole. Now, in order to check whether an object is a member of a set, is_element_of_set may have to scan the entire set. (In the worst case, the object turns out not to be in the set.) Hence, if the set has $n$ elements, is_element_of_set might take up to $n$ steps. Thus, the number of steps required grows as $\Theta(n)$. The number of steps required by adjoin_set, which uses this operation, also grows as $\Theta(n)$. For intersection_set, which does an is_element_of_set check for each element of set1, the number of steps required grows as the product of the sizes of the sets involved, or $\Theta(n^{2})$ for two sets of size $n$. The same will be true of union_set.

Exercise 2.59

Implement the union_set operation for the unordered-list representation of sets.

function union_set(set1, set2) {
    return is_null(set1)
           ? set2
           : adjoin_set(head(set1), 
                        union_set(tail(set1), set2));
}

Exercise 2.60

We specified that a set would be represented as a list with no duplicates. Now suppose we allow duplicates. For instance, the set $\{1,2,3\}$ could be represented as the list list(2, 3, 2, 1, 3, 2, 2). Design functions is_element_of_set, adjoin_set, union_set, and intersection_set that operate on this representation. How does the efficiency of each compare with the corresponding function for the non-duplicate representation? Are there applications for which you would use this representation in preference to the non-duplicate one?

The functions is_element_of_set and intersection_set remain unchanged. Here is the new implementation of adjoin_set and union_set.

function adjoin_set(x, set) {
    return pair(x, set);
}
function union_set(set1, set2) {
    return append(set1, set2);
}

In the version with no duplicates, the required number of steps for is_element_of_set and adjoin_set has an order of growth of $O(n)$, where $n$ is the number of element occurrences in the given representation, and the required number of steps for intersection_set and union_set has an order of growth of $O(n m)$, where $n$ is the number of element occurrences in the representation of the first set and $m$ is the number of element occurrences in the representation of the second set. In the version that allows duplicates, the number of steps for adjoin_set shrinks to $O(n)$, and the number of steps for union_set shrinks to $O(n)$. However, note that the number of element occurrences may be much larger in the second version, because many duplicates may accumulate. For applications where duplicate elements are rare, the version that allows duplicates is preferrable.

Sets as ordered lists
順序付きリストとしての集合

One way to speed up our set operations is to change the representation so that the set elements are listed in increasing order. To do this, we need some way to compare two objects so that we can say which is bigger. For example, we could compare strings lexicographically, or we could agree on some method for assigning a unique number to an object and then compare the elements by comparing the corresponding numbers. To keep our discussion simple, we will consider only the case where the set elements are numbers, so that we can compare elements using > and <. We will represent a set of numbers by listing its elements in increasing order. Whereas our first representation above allowed us to represent the set $\{1,3,6,10\}$ by listing the elements in any order, our new representation allows only the list list(1, 3, 6, 10).

One advantage of ordering shows up in is_element_of_set: In checking for the presence of an item, we no longer have to scan the entire set. If we reach a set element that is larger than the item we are looking for, then we know that the item is not in the set:

function is_element_of_set(x, set) { return is_null(set) ? false : x === head(set) ? true : x < head(set) ? false : // $\texttt{x > head(set)}$ is_element_of_set(x, tail(set)); }

How many steps does this save? In the worst case, the item we are looking for may be the largest one in the set, so the number of steps is the same as for the unordered representation. On the other hand, if we search for items of many different sizes we can expect that sometimes we will be able to stop searching at a point near the beginning of the list and that other times we will still need to examine most of the list. On the average we should expect to have to examine about half of the items in the set. Thus, the average number of steps required will be about $n/2$. This is still $\Theta(n)$ growth, but it does save us, on the average, a factor of 2 in number of steps over the previous implementation.

We obtain a more impressive speedup with intersection_set. In the unordered representation this operation required $\Theta(n^2)$ steps, because we performed a complete scan of set2 for each element of set1. But with the ordered representation, we can use a more clever method. Begin by comparing the initial elements, x1 and x2, of the two sets. If x1 equals x2, then that gives an element of the intersection, and the rest of the intersection is the intersection of the tails of the two sets. Suppose, however, that x1 is less than x2. Since x2 is the smallest element in set2, we can immediately conclude that x1 cannot appear anywhere in set2 and hence is not in the intersection. Hence, the intersection is equal to the intersection of set2 with the tail of set1. Similarly, if x2 is less than x1, then the intersection is given by the intersection of set1 with the tail of set2. Here is the function:

function intersection_set(set1, set2) { if (is_null(set1) || is_null(set2)) { return null; } else { const x1 = head(set1); const x2 = head(set2); return x1 === x2 ? pair(x1, intersection_set(tail(set1), tail(set2))) : x1 < x2 ? intersection_set(tail(set1), set2) : // $\texttt{x2 < x1}$ intersection_set(set1, tail(set2)); } }

To estimate the number of steps required by this process, observe that at each step we reduce the intersection problem to computing intersections of smaller sets—removing the first element from set1 or set2 or both. Thus, the number of steps required is at most the sum of the sizes of set1 and set2, rather than the product of the sizes as with the unordered representation. This is $\Theta(n)$ growth rather than $\Theta(n^2)$—a considerable speedup, even for sets of moderate size.

Exercise 2.61

Give an implementation of adjoin_set using the ordered representation. By analogy with is_element_of_set show how to take advantage of the ordering to produce a function that requires on the average about half as many steps as with the unordered representation.

function adjoin_set(x, set) {
    return is_null(set)
           ? list(x)
           : x === head(set) 
           ? set
           : x < head(set)
           ? pair(x, set)
           : pair(head(set), 
                  adjoin_set(x, tail(set)));
}

Exercise 2.62

Give a $\Theta(n)$ implementation of union_set for sets represented as ordered lists.

function union_set(set1, set2) {
    if (is_null(set1)) {
        return set2;
    } else if (is_null(set2)) {
        return set1;
    } else {
      const x1 = head(set1);
      const x2 = head(set2);
      return x1 === x2
             ? pair(x1, union_set(tail(set1),
                                  tail(set2)))
             : x1 < x2 
             ? pair(x1, union_set(tail(set1), set2))
             : pair(x2, union_set(set1, tail(set2)));
    }
}

union_set(
   adjoin_set(10, adjoin_set(20, adjoin_set(30, null))),
   adjoin_set(10, adjoin_set(15, adjoin_set(20, null))));

Sets as binary trees
二分木としての集合

We can do better than the ordered-list representation by arranging the set elements in the form of a tree. Each node of the tree holds one element of the set, called the entry at that node, and a link to each of two other (possibly empty) nodes. The left link points to elements smaller than the one at the node, and the right link to elements greater than the one at the node. Figure 2.25 shows some trees that represent the set $\{1,3,5,7,9,11\}$. The same set may be represented by a tree in a number of different ways. The only thing we require for a valid representation is that all elements in the left subtree be smaller than the node entry and that all elements in the right subtree be larger.

Figure 2.25

Various binary trees that represent the set $\{ 1,3,5,7,9,11 \}$.

The advantage of the tree representation is this: Suppose we want to check whether a number $x$ is contained in a set. We begin by comparing $x$ with the entry in the top node. If $x$ is less than this, we know that we need only search the left subtree; if $x$ is greater, we need only search the right subtree. Now, if the tree is balanced, each of these subtrees will be about half the size of the original. Thus, in one step we have reduced the problem of searching a tree of size $n$ to searching a tree of size $n/2$. Since the size of the tree is halved at each step, we should expect that the number of steps needed to search a tree of size $n$ grows as $\Theta(\log n)$.

[2]

For large sets, this will be a significant speedup over the previous representations.

We can represent trees by using lists. Each node will be a list of three items: the entry at the node, the left subtree, and the right subtree. A left or a right subtree of the empty list will indicate that there is no subtree connected there. We can describe this representation by the following functions:

[3]

function entry(tree) { return head(tree); }

function left_branch(tree) { return head(tail(tree)); }

function right_branch(tree) { return head(tail(tail(tree))); }

function make_tree(entry, left, right) { 
    return list(entry, left, right);
}

Now we can write is_element_of_set using the strategy described above:

function is_element_of_set(x, set) { return is_null(set) ? false : x === entry(set) ? true : x < entry(set) ? is_element_of_set(x, left_branch(set)) : // $\texttt{x > entry(set)}$ is_element_of_set(x, right_branch(set)); }

Adjoining an item to a set is implemented similarly and also requires $\Theta(\log n)$ steps. To adjoin an item x, we compare x with the node entry to determine whether x should be added to the right or to the left branch, and having adjoined x to the appropriate branch we piece this newly constructed branch together with the original entry and the other branch. If x is equal to the entry, we just return the node. If we are asked to adjoin x to an empty tree, we generate a tree that has x as the entry and empty right and left branches. Here is the function:

function adjoin_set(x, set) { return is_null(set) ? make_tree(x, null, null) : x === entry(set) ? set : x < entry(set) ? make_tree(entry(set), adjoin_set(x, left_branch(set)), right_branch(set)) : // $\texttt{x > entry(set)}$ make_tree(entry(set), left_branch(set), adjoin_set(x, right_branch(set))); }

The above claim that searching the tree can be performed in a logarithmic number of steps rests on the assumption that the tree is balanced, i.e., that the left and the right subtree of every tree have approximately the same number of elements, so that each subtree contains about half the elements of its parent. But how can we be certain that the trees we construct will be balanced? Even if we start with a balanced tree, adding elements with adjoin_set may produce an unbalanced result. Since the position of a newly adjoined element depends on how the element compares with the items already in the set, we can expect that if we add elements randomly the tree will tend to be balanced on the average. But this is not a guarantee. For example, if we start with an empty set and adjoin the numbers 1 through 7 in sequence we end up with the highly unbalanced tree shown in figure 2.26. In this tree all the left subtrees are empty, so it has no advantage over a simple ordered list. One way to solve this problem is to define an operation that transforms an arbitrary tree into a balanced tree with the same elements. Then we can perform this transformation after every few adjoin_set operations to keep our set in balance. There are also other ways to solve this problem, most of which involve designing new data structures for which searching and insertion both can be done in $\Theta(\log n)$ steps.

[4]

Figure 2.26

Unbalanced tree produced by adjoining 1 through 7 in sequence.

Exercise 2.63

Each of the following two functions converts a binary tree to a list.

function tree_to_list_1(tree) {
    return is_null(tree)
           ? null
           : append(tree_to_list_1(left_branch(tree)),
                    pair(entry(tree),
                         tree_to_list_1(right_branch(tree))));
}

function tree_to_list_2(tree) {
    function copy_to_list(tree, result_list) {
        return is_null(tree)
               ? result_list
               : copy_to_list(left_branch(tree),
                              pair(entry(tree),
                                   copy_to_list(right_branch(tree),
                                                result_list)));
    }
    return copy_to_list(tree, null);
}

Do the two functions produce the same result for every tree? If not, how do the results differ? What lists do the two functions produce for the trees in figure 2.25?
2つの関数はすべてのツリーに対して同じ結果を生成しますか？そうでなければ、結果はどう異なりますか？図 2.25のツリーに対して、2つの関数はどのようなリストを生成しますか？
Do the two functions have the same order of growth in the number of steps required to convert a balanced tree with $n$ elements to a list? If not, which one grows more slowly?
$n$ 個の要素を持つ平衡ツリーをリストに変換するのに必要なステップ数について、2つの関数は同じ増加のオーダーを持ちますか？そうでなければ、どちらがよりゆっくり増加しますか？

The two procedures produce the same results. For the trees in figure 2.25, the result will always be list(1, 3, 5, 7, 9, 11).
2つの関数は同じ結果を生成します。図 2.25のツリーに対して、結果は常に list(1, 3, 5, 7, 9, 11) になります。
A balanced tree with $n$ elements has a height of $O(\log{n})$ and $O(n)$ nodes. To convert the tree into a list using function tree_to_list_1, we call tree_to_list_1 $O(n)$ times. We call append at each node of the tree, but at each level, we apply append with a combined $O(n)$ elements in the first arguments. Thus, the run time of tree_to_list_1 has an order of growth of $O(n\log{n})$. Instead of append, the function tree_to_list_2 gets away with calling pair at each node, and thus tree_to_list_2 has an order of growth of $O(n)$.
$n$ 個の要素を持つ平衡ツリーの高さは $O(\log{n})$ で、$O(n)$ 個のノードがあります。関数 tree_to_list_1 を使ってツリーをリストに変換するには、tree_to_list_1 を $O(n)$ 回呼び出します。ツリーの各ノードで append を呼び出しますが、各レベルで第1引数の要素は合計 $O(n)$ 個です。したがって、tree_to_list_1 の実行時間の増加のオーダーは $O(n\log{n})$ です。関数 tree_to_list_2 は append の代わりに各ノードで pair を呼び出すだけなので、tree_to_list_2 の増加のオーダーは $O(n)$ です。

Exercise 2.64

The following function list_to_tree converts an ordered list to a balanced binary tree. The helper function partial_tree takes as arguments an integer $n$ and list of at least $n$ elements and constructs a balanced tree containing the first $n$ elements of the list. The result returned by partial_tree is a pair (formed with pair) whose head is the constructed tree and whose tail is the list of elements not included in the tree.

function list_to_tree(elements) {
    return head(partial_tree(elements, length(elements)));
}
function partial_tree(elts, n) {
    if (n === 0) {
        return pair(null, elts);
    } else {
        const left_size = math_floor((n - 1) / 2);
        const left_result = partial_tree(elts, left_size);
        const left_tree = head(left_result);
        const non_left_elts = tail(left_result);
        const right_size = n - (left_size + 1);
        const this_entry = head(non_left_elts);
        const right_result = partial_tree(tail(non_left_elts), right_size);
        const right_tree = head(right_result);
        const remaining_elts = tail(right_result);
        return pair(make_tree(this_entry, left_tree, right_tree),
                    remaining_elts);
    }
}

Write a short paragraph explaining as clearly as you can how partial_tree works. Draw the tree produced by list_to_tree for the list list(1, 3, 5, 7, 9, 11).
partial_tree がどのように動作するかを、できるだけ明確に短い段落で説明しなさい。リスト list(1, 3, 5, 7, 9, 11) に対して list_to_tree が生成するツリーを描きなさい。
What is the order of growth in the number of steps required by list_to_tree to convert a list of $n$ elements?
$n$ 個の要素のリストを変換するのに list_to_tree が必要とするステップ数の増加のオーダーは何ですか？

The function partial_tree(elts, n) returns a pair whose head is a balanced tree for the first $\lfloor (n - 1) / 2 \rfloor$ elements of elts, and whose tail is the list containing the remaining elements of elts. It works by calling itself recursively, to construct the left subtree and right subtree, and then makes the tree, and the required return pair. Thus, the overall function list_to_tree just needs to call partial_tree with the given list and its length, and return the head of the result.
The tree for list(1, 3, 5, 7, 9, 11) is the tree on the right in figure 2.25.
関数 partial_tree(elts, n) は、elts の最初の $\lfloor (n - 1) / 2 \rfloor$ 個の要素に対する平衡ツリーを head とし、elts の残りの要素を含むリストを tail とするペアを返します。自身を再帰的に呼び出して左の部分木と右の部分木を構成し、それからツリーと必要な戻り値のペアを作ります。したがって、全体の関数 list_to_tree は partial_tree を与えられたリストとその長さで呼び出し、結果の head を返すだけで済みます。
list(1, 3, 5, 7, 9, 11) に対するツリーは、図 2.25の右のツリーです。
The order of growth for the run time of function list_to_tree is $O(n)$ because for every node of the result tree, only a constant amount of work is needed.
関数 list_to_tree の実行時間の増加のオーダーは $O(n)$ です。結果のツリーの各ノードに対して、必要な作業は定数量のみだからです。

Exercise 2.65

Use the results of exercises 2.63 and 2.64 to give $\Theta(n)$ implementations of

union_set and intersection_set

for sets implemented as (balanced) binary trees.

[5]

function union_set_as_tree(set1, set2) {
    const list1 = tree_to_list_2(set1);
    const list2 = tree_to_list_2(set2);
    return list_to_tree(union_set(list1, list2));
}

function intersection_set_as_tree(set1, set2) {
    const list1=tree_to_list_2(set1);
    const list2=tree_to_list_2(set2);
    return list_to_tree(intersection_set(list1, list2));
}

Sets and information retrieval
集合と情報検索

We have examined options for using lists to represent sets and have seen how the choice of representation for a data object can have a large impact on the performance of the programs that use the data. Another reason for concentrating on sets is that the techniques discussed here appear again and again in applications involving information retrieval.

Consider a data base containing a large number of individual records, such as the personnel files for a company or the transactions in an accounting system. A typical data-management system spends a large amount of time accessing or modifying the data in the records and therefore requires an efficient method for accessing records. This is done by identifying a part of each record to serve as an identifying key. A key can be anything that uniquely identifies the record. For a personnel file, it might be an employee's ID number. For an accounting system, it might be a transaction number. Whatever the key is, when we define the record as a data structure we should include a key selector function that retrieves the key associated with a given record.

Now we represent the data base as a set of records. To locate the record with a given key we use a function lookup, which takes as arguments a key and a data base and which returns the record that has that key, or false if there is no such record. The function lookup is implemented in almost the same way as is_element_of_set. For example, if the set of records is implemented as an unordered list, we could use

function lookup(given_key, set_of_records) {
    return is_null(set_of_records)
           ? false
           : equal(given_key, key(head(set_of_records)))
           ? head(set_of_records)
           : lookup(given_key, tail(set_of_records));
}

Of course, there are better ways to represent large sets than as unordered lists. Information-retrieval systems in which records have to be randomly accessed are typically implemented by a tree-based method, such as the binary-tree representation discussed previously. In designing such a system the methodology of data abstraction can be a great help. The designer can create an initial implementation using a simple, straightforward representation such as unordered lists. This will be unsuitable for the eventual system, but it can be useful in providing a quick and dirty data base with which to test the rest of the system. Later on, the data representation can be modified to be more sophisticated. If the data base is accessed in terms of abstract selectors and constructors, this change in representation will not require any changes to the rest of the system.

Exercise 2.66

Implement the lookup function for the case where the set of records is structured as a binary tree, ordered by the numerical values of the keys.

function lookup(given_key, tree_of_records) {
    if (is_null(tree_of_records)) {
        return null;
    } else {
        const this_entry = entry(tree_of_records);
        const this_key = key(this_entry);
        return given_key === this_key 
               ? this_entry
               : given_key < this_key
               ? lookup(given_key, 
                        left_branch(tree_of_records))
               : lookup(given_key, 
                        right_branch(tree_of_records));
    }
}

[1]

If we want to be more formal, we can specify consistent with the interpretations given above to mean that the operations satisfy a collection of rules such as these:

For any set S and any object x, is_element_of_set(x, adjoin_set(x, S)) is true (informally: Adjoining an object to a set produces a set that contains the object).
For any sets S and T and any object x, is_element_of_set(x, union_set(S, T)) is equal to is_element_of_set(x, S) || is_element_of_set(x, T) (informally: The elements of union_set(S, T) are the elements that are in S or in T).
For any object x, is_element_of_set(x, null) is false (informally: No object is an element of the empty set).

[2]

Halving the size of the problem at each step is the distinguishing characteristic of logarithmic growth, as we saw with the fast-exponentiation algorithm of section 1.2.4 and the half-interval search method of section 1.3.3.

[3]

We are representing sets in terms of trees, and trees in terms of lists—in effect, a data abstraction built upon a data abstraction. We can regard the functions entry, left_branch, right_branch, and make_tree as a way of isolating the abstraction of a binary tree from the particular way we might wish to represent such a tree in terms of list structure.

[4]

Examples of such structures include B-trees and red-black trees. There is a large literature on data structures devoted to this problem. See Cormen, Leiserson, Rivest, and Stein 2022.

[5] Exercises 2.63–2.65 are due to Paul Hilfinger.

< Previous

Next >

2.3.3

Example: Representing Sets

Sets as unordered lists順序なしリストとしての集合

Sets as ordered lists順序付きリストとしての集合

Sets as binary trees二分木としての集合

Sets and information retrieval集合と情報検索

Sets as unordered lists
順序なしリストとしての集合

Sets as ordered lists
順序付きリストとしての集合

Sets as binary trees
二分木としての集合

Sets and information retrieval
集合と情報検索