Architecture of the algol2MMIX compiler

 Prev   Top   Next 

The algol2MMIX compiler was designed with the goal of achieving the bootstrap in mind. Although a complete bootstrap has not been achieved, the general organization of the compiler is probably not to be blamed: in particular, splitting the compiler in different modules for lexical analysis, parsing, semantic analysis and code generation , made it possible to obtain partial bootstrap results, indicating that the part of the compiler needing more careful debugging to get a full bootstrap is the semantic analyzer (specifically, the way data structures are maintained and handled).

When implementing a compiler in the same language as the source language, without access to any other compiler for the given language, two approaches may be taken: either to write the new compiler in another language for which a working compiler exists, and then translate (in a more or less automated fashion) the new compiler into the source language, or to do it the other way around: write the new compiler in a stylized subset of the source language in such a way that it is possible to (semi-)automatically convert it into another programming language which has already been implemented.
In our case, the choice for the "other" language went to the C programming language, mainly because it was reasonably similar to both the source language (since C is a descendent of Algol68, from which Algol68Nix has been derived) and the target language (the semantics of MMIX assembly is often given in term of equivalent C code). As for which of the above two "roads" to follow, a hybrid approach was taken: the new compiler has been written in a subset of the intersection of Algol68Nix and C. Then, taking advantage of the macro facilities provided by the C language, the code was automatically converted into C for testing and debugging, and semiautomatically into algol68Nix to obtain the final working version of the compiler in its own source language.

The Compiler front-end


The compiler front-end is split into the two modules nix-scanner.a68 and nix-parser.a68.
The first module does not require special comments: it performs a pretty straightforward conversion of the source file from a linear sequence of characters to a linear sequence of tokens.


The second phase produces a non-linear representation of the program known as Abstract Syntax Tree (AST), where each node stands for a syntactical entity with a concrete semantic meaning. For example, there are nodes for representing IF constructs and identifiers, whereas there is no need to represent the characters that made up the identifier by themselves: they have already been bundled up into a unique token by the scanner, and are just stored as a semantic attribute in the node corresponding to the identifier.

The parser implemented for the algol2MMIX works by recursive descent. The AST corresponding to a given sequence of tokens is built in a top-down fashion: at the beginning the parser creates a node to represent the whole program (which, in our case, is simply an identity declaration of the main procedure), and then it recursively creates nodes to represent the syntactic entities that constitutes a program. When the parser sees a sequence of tokens that correspond to one of the deepest node, it can tell that it is arrived to a leaf: it then fills the node with the information given by the token, and return from the recursive call. At the end of this process, the whole tree has been constructed, and the sequence of tokens will be completely exhausted.

To avoid exponential backup, the recursive descent is driven by the next token in the sequence. In a few cases this is not enough, so that more tokens of lookahead are necessary, or the grammar for the language needs to be accommodated. Two cases where such restrictions were deemed necessary for the algol68Nix language are procedure calls and array subscripting, as described in another part of this report.

A last remark about the recursive-descent parser implemented regards the data structure used to maintain the semantic attributes associated with each construct: instead of keeping a separated table usually referred to as the Symbol Table, all the relevant information are directly stored within each node of the tree. This leads to some amount of space waste (particularly given the absence of UNION's in the implemented sublanguage), but is conceptually clearer and allows for a simple way to handle input/output of ASTs.

The Compiler back-end

Semantic Analysis

The back-end of the compiler is implemented in the file nix-analyzer.a68. It takes in input (procedure readast) the AST produced by the front-end (which was dumped by the parser in standard output), and produces the corresponding MMIX code (procedures generatemms and issuecode ).

The main semantic tasks carried out by the back-end are type-checking (procedure annotatemode) and coercion inference (here the term coercion inference is used to indicate the process of computing all the coercions that must be performed to convert the actual mode of a unit into the mode required in a given context; see procedures annotatecontext , admissiblemodes, iscoercibleto).

As a by-product of these activities, two tables are constructed: one to verify the well-formedness of (recursive) user-defined modes (that is also useful to check for "mode equality"; see usermodetable and procedures gatherusermodes, modecmp), and another to maintain all the information about the blocks introduced in the program (see blocktable and procedure gatherblock).

Code Generation

To avoid doing I/O of all the structures computed during the semantic analysis, the code generation phase is contained in the same file nix-analyzer.a68. Conceptually, however, the two phases are sharply separated: the only interaction between the two modules consists in that the code generation phase uses the parsetree, the table of user-defined modes and the block table constructed by the semantic analyzer, once semantic analysis is over.

The code generation phase (procedure generatemms) starts generating a standard assembly preamble (procedure issuemmsheader) that sets up the Data Segment (that will contain the Stack of alive activation records), the Pool Segment (that will be used as the Heap for the allocation of objects with a dynamic lifetime), and the Text Segment (for the actual code). Then, the code generator recursively translates (procedure issuecode) each Algol68Nix construct (represented by a node of the AST) into the corresponding sequence of MMIX assembly instructions. For greater modularity, the procedure issuecode acts as dispatcher, inspecting the actual type type of the construct at hand, and then forwarding the call to a procedure of the form issuetype.

Among those ad hoc procedures, an interesting one is the procedure issueproccall, that implements the calling sequence designed for algol2MMIX (see figure).

Calling Sequence in Algol68Nix

The calling sequence, depicted in the figure above, has the caller doing almost all the work; if we let f be the procedure about to call another procedure g, then the caller (i.e. the procedure f) performs the following steps:

  1. saves the current value of Stack Pointer SP in a temporary register;
  2. pushes on the stack all the temporary registers it has been using so far --- notice that in this way the old value of SP will now be at the top of the stack; see part b) of the figure;
  3. evaluates all the actual parameters to be passed to g --- notice that Algol68Nix uses pass-by-value calling semantics;
  4. builds the activation record for g: this in turn can be broken-up in four sub-steps:
    1. first, f sets the dynamic link (i.e. the pointer to the base of its own activation record ) in the second octa of g's activation record;
    2. then, f sets the static link (i.e. the pointer to the base of the activation record of the most recent incarnation of the procedure h in whose body the procedure g was declared) in the third octa of g's activation record;
    3. afterwards, f places the values of the actual parameters starting from the fourth octa of g's activation record, aligning such values according to their mode;
    4. finally, the return address (i.e. the address, within the code for the procedure f, of the instruction following the call to g) is copied into the first octa of g's activation record (see part c) of the figure);
  5. upon return from the call to g, the temporary registers are restored from the stack.

The only duty left to the callee (i.e. the procedure g) is to "clean up" after it has finished its computation: in particular, g should move the Stack Pointer SP to the cell currently pointed by the Frame Pointer FP, and the Frame Pointer FP back to the base of f's activation record (using the dynamic link), right before jumping back to the return address provided in the first octa of its own activation record; additionally, in the case g has a non-VOID return type, the computed value must be left in a special register, specifically reserved for this purpose.

Last modified: 08/30/02
Antonio Nicolosi