5.5.1 Structure of the Compiler - SICP Comparison Edition

Permalink copied!

In section 4.1.7 we modified our original metacircular interpreter to separate analysis from execution. We analyzed each component to produce an execution function that took an environment as argument and performed the required operations. In our compiler, we will do essentially the same analysis. Instead of producing execution functions, however, we will generate sequences of instructions to be run by our register machine.

The function compile is the top-level dispatch in the compiler. It corresponds to the evaluate function of section 4.1.1, the analyze function of section 4.1.7, and the eval_dispatch entry point of the explicit-control-evaluator in section 5.4.1. The compiler, like the interpreters, uses the component-syntax functions defined in section 4.1.2.[1] The function compile performs a case analysis on the syntactic type of the component to be compiled. For each type of component, it dispatches to a specialized code generator:

function compile(component, target, linkage) {
    return is_literal(component)
           ? compile_literal(component, target, linkage)
           : is_name(component)
           ? compile_name(component, target, linkage)
           : is_application(component)
           ? compile_application(component, target, linkage)
           : is_operator_combination(component)
           ? compile(operator_combination_to_application(component),
                     target, linkage)
           : is_conditional(component)
           ? compile_conditional(component, target, linkage)
           : is_lambda_expression(component)
           ? compile_lambda_expression(component, target, linkage)
           : is_sequence(component)
           ? compile_sequence(sequence_statements(component),
                              target, linkage)
           : is_block(component)
           ? compile_block(component, target, linkage)
           : is_return_statement(component)
           ? compile_return_statement(component, target, linkage)
           : is_function_declaration(component)
           ? compile(function_decl_to_constant_decl(component),
                     target, linkage)
           : is_declaration(component)
           ? compile_declaration(component, target, linkage)
           : is_assignment(component)
           ? compile_assignment(component, target, linkage)
           : error(component, "unknown component type -- compile");
}

Targets and linkages

The function compile and the code generators that it calls take two arguments in addition to the component to compile. There is a target, which specifies the register in which the compiled code is to return the value of the component. There is also a linkage descriptor, which describes how the code resulting from the compilation of the component should proceed when it has finished its execution. The linkage descriptor can require the code to do one of the following three things:

proceed to the next instruction in sequence (this is specified by the linkage descriptor "next"),
jump to the current value of the continue register as part of returning from a function call (this is specified by the linkage descriptor "return"), or
jump to a named entry point (this is specified by using the designated label as the linkage descriptor).

For example, compiling the literal 5 with a target of the val register and a linkage of "next" should produce the instruction

assign("val", constant(5))

Compiling the same expression with a linkage of "return" should produce the instructions

assign("val", constant(5)),
go_to(reg("continue"))

In the first case, execution will continue with the next instruction in the sequence. In the second case, we will jump to whatever entry point is stored in the continue register. In both cases, the value of the expression will be placed into the target val register. Our compiler uses the "return" linkage when compiling the return expression of a return statement. Just as in the explicit-control evaluator, returning from a function call happens in three steps:

reverting the stack to the marker and restoring continue (which holds a continuation set up at the beginning of the function call)
computing the return value and placing it in val
jumping to the entry point in continue

Compilation of a return statement explicitly generates code for reverting the stack and restoring continue. The return expression is compiled with target val and linkage "return" so that the generated code for computing the return value places the return value in val and ends by jumping to continue.

Instruction sequences and stack usage

Each code generator returns an instruction sequence containing the object code it has generated for the component. Code generation for a compound component is accomplished by combining the output from simpler code generators for subcomponents, just as evaluation of a compound component is accomplished by evaluating the subcomponents.

The simplest method for combining instruction sequences is a function called append_instruction_sequences, which takes as arguments two instruction sequences that are to be executed sequentially. It appends them and returns the combined sequence. That is, if $seq_1$ and $seq_2$ are sequences of instructions, then evaluating append_instruction_sequences($seq$$_1$, $seq$$_2$) produces the sequence $seq$$_1$ $seq$$_2$

Whenever registers might need to be saved, the compiler's code generators use preserving, which is a more subtle method for combining instruction sequences. The function preserving takes three arguments: a set of registers and two instruction sequences that are to be executed sequentially. It appends the sequences in such a way that the contents of each register in the set is preserved over the execution of the first sequence, if this is needed for the execution of the second sequence. That is, if the first sequence modifies the register and the second sequence actually needs the register's original contents, then preserving wraps a save and a restore of the register around the first sequence before appending the sequences. Otherwise, preserving simply returns the appended instruction sequences. Thus, for example, preserving(list($reg$$_1$, $reg$$_2$), $seq$$_1$, $seq$$_2$) produces one of the following four sequences of instructions, depending on how $seq$$_1$ and $seq$$_2$ use $reg$$_1$ and $reg$$_2$:

\[ \begin{array}{l|l|l|l} \textit{seq}_1 & \texttt{save(}\textit{reg}_1\texttt{),} & \texttt{save(}\textit{reg}_2\texttt{),} & \texttt{save(}\textit{reg}_2\texttt{),} \\ \textit{seq}_2 & \textit{seq}_1 & \textit{seq}_1 & \texttt{save(}\textit{reg}_1\texttt{),} \\ & \texttt{restore(}\textit{reg}_1\texttt{),} & \texttt{restore(}\textit{reg}_2\texttt{),} & \textit{seq}_1 \\ & \textit{seq}_2 & \textit{seq}_2 & \texttt{restore(}\textit{reg}_1\texttt{),} \\ & & & \texttt{restore(}\textit{reg}_2\texttt{),} \\ & & & \textit{seq}_2 \end{array} \]

By using preserving to combine instruction sequences the compiler avoids unnecessary stack operations. This also isolates the details of whether or not to generate save and restore instructions within the preserving function, separating them from the concerns that arise in writing each of the individual code generators. In fact no save or restore instructions are explicitly produced by the code generators, except that the code for calling a function saves continue and the code for returning from a function restores it: These corresponding save and restore instructions are explicitly generated by different calls to compile, not as a matched pair by preserving (as we will see in section 5.5.3).

In principle, we could represent an instruction sequence simply as a list of instructions. The function append_instruction_sequences could then combine instruction sequences by performing an ordinary list append. However, preserving would then be a complex operation, because it would have to analyze each instruction sequence to determine how the sequence uses its registers. The function preserving would be inefficient as well as complex, because it would have to analyze each of its instruction sequence arguments, even though these sequences might themselves have been constructed by calls to preserving, in which case their parts would have already been analyzed. To avoid such repetitious analysis we will associate with each instruction sequence some information about its register use. When we construct a basic instruction sequence we will provide this information explicitly, and the functions that combine instruction sequences will derive register-use information for the combined sequence from the information associated with the sequences being combined.

An instruction sequence will contain three pieces of information:

the set of registers that must be initialized before the instructions in the sequence are executed (these registers are said to be needed by the sequence),
the set of registers whose values are modified by the instructions in the sequence, and
the actual instructions in the sequence.

We will represent an instruction sequence as a list of its three parts. The constructor for instruction sequences is thus

function make_instruction_sequence(needs, modifies, instructions) {
    return list(needs, modifies, instructions);
}

For example, the two-instruction sequence that looks up the value of the symbol "x" in the current environment, assigns the result to val, and then proceeds to the continuation, requires registers env and continue to have been initialized, and modifies register val. This sequence would therefore be constructed as

make_instruction_sequence(list("env", "continue"), list("val"),
    list(assign("val",
                list(op("lookup_symbol_value"), constant("x"),
                     reg("env"))),
         go_to(reg("continue"))));

The functions for combining instruction sequences are shown in section 5.5.4.

Exercise 5.31 In evaluating a function application, the explicit-control evaluator always saves and restores the env register around the evaluation of the function expression, saves and restores env around the evaluation of each argument expression (except the final one), saves and restores argl around the evaluation of each argument expression, and saves and restores fun around the evaluation of the argument-expression sequence. For each of the following applications, say which of these save and restore operations are superfluous and thus could be eliminated by the compiler's preserving mechanism:

f("x", "y")

f()("x", "y")

f(g("x"), y)

f(g("x"), "y")

There is currently no solution available for this exercise. This textbook adaptation is a community effort. Do consider contributing by providing a solution for this exercise, using a Pull Request in Github.

Exercise 5.32 Using the preserving mechanism, the compiler will avoid saving and restoring env around the evaluation of the function expression of an application in the case where the function expression is a name. We could also build such optimizations into the evaluator. Indeed, the explicit-control evaluator of section 5.4 already performs a similar optimization, by treating applications with no arguments as a special case.

Extend the explicit-control evaluator to recognize as a separate class of components applications whose function expression is a name, and to take advantage of this fact in evaluating such components.
Alyssa P. Hacker suggests that by extending the evaluator to recognize more and more special cases we could incorporate all the compiler's optimizations, and that this would eliminate the advantage of compilation altogether. What do you think of this idea?

[1] Notice, however, that our compiler is a JavaScript program, and the syntax functions that it uses to manipulate expressions are the actual JavaScript functions used with the metacircular evaluator. For the explicit-control evaluator, in contrast, we assumed that equivalent syntax operations were available as operations for the register machine. (Of course, when we simulated the register machine in JavaScript, we used the actual JavaScript functions in our register machine simulation.)

< Previous

Next >

5.5.1 Structure of the Compiler