Start Lecture #1
I start at Chapter 0 so that when we get to chapter 1, the numbering will agree with the text.
There is a web site for the course. You can find it from my home page listed above.
The course text is Aho, Lam, Sethi, and Ullman: Compilers: Principles, Techniques, and Tools, second edition.
It is known as the Dragon Book, due to the cover picture.
Your grade will be a function of your final exam and laboratory assignments (see below). I am not yet sure of the exact weightings for each lab and the final, but the final will be roughly half the grade (very likely between 40% and 60%).
I use the upper left board for lab/homework assignments and announcements. I should never erase that board. If you see me start to erase an announcement, please let me know.
I try very hard to remember to write all announcements on the upper left board and I am normally successful. If, during class, you see that I have forgotten to record something, please let me know. HOWEVER, if I forgot and no one reminds me, the assignment has still been given.
I make a distinction between homeworks and labs.
Homeworks are numbered by the class in which they are assigned. So any homework given today is homework #1. Even if I do not give homework today, the homework assigned next class will be homework #2. Unless I explicitly state otherwise, all homework assignments can be found in the class notes. So the homework present in the notes for lecture #n is homework #n (even if I inadvertently forgot to write it on the upper left board).
You may solve lab assignments on any system you wish, but ...
"I sent it ... I never received it" debate. Thank you.
Good methods for obtaining help include
You may write your lab in Java, C, or C++. Other languages may be possible, but please ask in advance. I need to ensure that the TA is comfortable with the language.
The rules for incompletes and grade changes are set by the school and not the department or individual faculty member. The rules set by GSAS state:
The assignment of the grade Incomplete Pass(IP) or Incomplete Fail(IF) is at the discretion of the instructor. If an incomplete grade is not changed to a permanent grade by the instructor within one year of the beginning of the course, Incomplete Pass(IP) lapses to No Credit(N), and Incomplete Fail(IF) lapses to Failure(F).
Permanent grades may not be changed unless the original grade resulted from a clerical error.
I do not assume you have had a compiler course as an undergraduate, and I do not assume you have had experience developing/maintaining a compiler.
If you have already had a compiler class, this course is probably not appropriate. For example, if you can explain the following concepts/terms, the course is probably too elementary for you.
I do assume you are an experienced programmer. There will be non-trivial programming assignments during this course. Indeed, you will write a compiler for a simple programming language.
I also assume that you have at least a passing familiarity with assembler language. In particular, your compiler may need to produce assembler language, but probably it will produce an intermediate language consisting of 3-address code. We will also be using addressing modes found in typical assemblers. We will not, however, write significant assembly-language programs.
The CS policy on academic integrity, which applies to all graduate courses in the department, can be found here.
I present this snippet from chapter 2 here (it appears where it belongs as well), since it is self-contained and needed for lab number 1, which I wish to assign today.
When performing a depth-first tree traversal, it is clear in what order the leaves are to be visited, namely left to right. In contrast there are several choices as to when to visit an interior (i.e. non-leaf) node. The traversal can visit an interior node before visiting any of its children, in between visiting its children, or after visiting all of its children.
I do not like the book's pseudocode as I feel the names chosen confuse the traversal with visiting the nodes. I prefer the pseudocode below, which uses the following conventions.
    traverse (n : treeNode)
        if leaf(n)                         -- visit leaves once; base of recursion
            visit(n)
        else                               -- interior node, at least 1 child
            -- visit(n)                    -- visit node PRE visiting any children
            traverse(first child)          -- recursive call
            while (more children remain)   -- excluding first child
                -- visit(n)                -- visit node IN-between visiting children
                traverse (next child)      -- recursive call
            -- visit(n)                    -- visit node POST visiting all children
Note the following properties
Although the in-between visit "works" for any tree, we will, like everyone else, reserve the name "inorder traversal" for binary trees. In the case of binary search trees (everything in the left subtree is smaller than the root of that subtree, which in turn is smaller than everything in the corresponding right subtree) an inorder traversal prints the values of the nodes in (numerical) order.
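The pseudocode above can be sketched in runnable form. The following is a minimal Python version, not the book's code; the names TreeNode, traverse, and labels, and the string parameter order that selects which of the three commented-out visits is enabled, are my own assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TreeNode:
    label: str
    children: List["TreeNode"] = field(default_factory=list)


def traverse(n: TreeNode, order: str, visit: Callable[[TreeNode], None]) -> None:
    """Depth-first traversal; order is 'pre', 'in', or 'post'."""
    if not n.children:              # leaf: visit once; base of recursion
        visit(n)
        return
    if order == "pre":              # visit node PRE visiting any children
        visit(n)
    traverse(n.children[0], order, visit)
    for child in n.children[1:]:    # remaining children, excluding the first
        if order == "in":           # visit node IN-between visiting children
            visit(n)
        traverse(child, order, visit)
    if order == "post":             # visit node POST visiting all children
        visit(n)


def labels(root: TreeNode, order: str) -> List[str]:
    """Collect the labels in the order the nodes are visited."""
    out: List[str] = []
    traverse(root, order, lambda node: out.append(node.label))
    return out


# A small binary search tree: 4 at the root, 2 and 6 below it, then 1 3 5 7.
bst = TreeNode("4", [TreeNode("2", [TreeNode("1"), TreeNode("3")]),
                     TreeNode("6", [TreeNode("5"), TreeNode("7")])])
```

With this tree, labels(bst, "in") lists the values in numerical order, as claimed for binary search trees.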
Do the Euler-tour traversal for the tree in the notes and then for a binary tree.
Lab 1 assigned. See the home page.
Homework Read chapter 1.
A compiler is a translator from one language, the input or source language, to another language, the output or target language.
Often, but not always, the target language is an assembler language or the machine language for a computer processor.
Note that using a compiler requires a two step process to run a program.
This should be compared with an interpreter, which accepts the source language program and the appropriate input, and itself produces the program output.
Sometimes both compilation and interpretation are used.
For example, consider typical Java implementations. The (Java) source code is translated (i.e., compiled) into bytecodes, the machine language for a virtual machine, the Java Virtual Machine or JVM. Then an interpreter of the JVM (itself normally called a JVM) accepts the bytecodes and the appropriate input, and produces the output. This technique was quite popular in academia, with the Pascal programming language and P-code.
Homework: 1, 2, 4
Remark: Unless otherwise stated, homeworks are from the book and specifically from the end of the second-level section we are discussing. Even more specifically, we are in section 1.1, so you are to do the first, second, and fourth problem at the end of section 1.1. These three problems are numbered 1.1.1, 1.1.2, and 1.1.4 in the book.
End of Remark
For large programs, the compiler is actually part of a multistep tool chain
[preprocessor] → [compiler] → [assembler] → [linker] → [loader]
We will be primarily focused on the second element of the chain, the compiler. Our target language will be assembly language. I give a very short description of the other components, including some historical comments.
Preprocessors are normally fairly simple, as in the C language, providing primarily the ability to include files and expand macros. There are exceptions, however. IBM's PL/I, another Algol-like language, had quite an extensive preprocessor, which made available at preprocessor time much of the PL/I language itself (e.g., loops and I believe procedure calls).
Some preprocessors essentially augment the base language with extensions. One could consider them as compilers in their own right, having as source this augmented language (say Fortran augmented with statements for multiprocessor execution in the guise of Fortran comments) and as target the original base language (in this case Fortran). The preprocessor inserts procedure calls to implement the extensions at runtime.
Assembly code is a mnemonic version of machine code in which names, rather than binary values, are used for machine instructions and memory addresses.
Some processors have fairly regular operations and as a result assembly code for them can be fairly natural and not too hard to understand. Other processors, in particular Intel's x86 line, have, let us charitably say, more "interesting" instructions with certain registers used for certain things.
My laptop has one of these latter processors (Pentium 4) so my gcc compiler produces code that from a pedagogical viewpoint is less than ideal. If you have a Mac with a PPC processor (the newest Macs are x86), your assembly language is cleaner. NYU's ACF features Sun computers with SPARC processors, which also have regular instruction sets.
No matter what the assembly language is, an assembler needs to assign memory locations to symbols (called identifiers) and use the numeric location address in the target machine language produced. Of course the same address must be used for all occurrences of a given identifier and two different identifiers must (normally) be assigned two different locations.
The conceptually simplest way to accomplish this is to make two passes over the input (read it once, then read it again from the beginning). During the first pass, each time a new identifier is encountered, an address is assigned and the pair (identifier, address) is stored in a symbol table. During the second pass, whenever an identifier is encountered, its address is looked up in the symbol table and this value is used in the generated machine instruction.
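The two-pass idea can be sketched as follows. This is a minimal Python illustration under stated assumptions, not a real assembler: the toy source format (an optional "label:" prefix, then an opcode and blank-separated operands), the rule that every line occupies exactly one word, and the names first_pass and second_pass are all hypothetical.

```python
def first_pass(lines):
    """Pass 1: assign an address to each label and build the symbol table."""
    symtab = {}
    address = 0
    for line in lines:
        if ":" in line:
            label = line.split(":", 1)[0].strip()
            symtab[label] = address
        address += 1                # assumption: each line occupies one word
    return symtab


def second_pass(lines, symtab):
    """Pass 2: replace each identifier by its address from the symbol table."""
    out = []
    for line in lines:
        op, *operands = line.split(":", 1)[-1].split()
        # Operands that are not symbols (registers, literals) are kept as is.
        out.append((op, [symtab.get(x, x) for x in operands]))
    return out


program = [
    "start: LD R1 x",
    "       ADD R1 y",
    "       JMP end",
    "x:     WORD 5",
    "y:     WORD 7",
    "end:   HALT",
]

symtab = first_pass(program)
code = second_pass(program, symtab)
```

Note that JMP end refers to a label defined later in the file; this causes no difficulty because the whole symbol table exists before the second pass begins.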
Linkers, a.k.a. linkage editors, combine the output of the assembler for several different compilations. That is, the horizontal line of the diagram above should really be a collection of lines converging on the linker. The linker has another input, namely libraries, but to the linker the libraries look like other programs compiled and assembled. The two primary tasks of the linker are relocating relative addresses and resolving external references.
The assembler processes one file at a time. Thus the symbol table produced while processing file A is independent of the symbols defined in file B, and conversely. Thus, it is likely that the same address will be used for different symbols in each program. The technical term is that the (local) addresses in the symbol table for file A are relative to file A; they must be relocated by the linker. This is accomplished by adding the starting address of file A (which in turn is the sum of the lengths of all the files processed previously in this run) to the relative address.
Assume procedure f, in file A, and procedure g, in file B, are compiled (and assembled) separately. Assume also that f invokes g. Since the compiler and assembler do not see g when processing f, it appears impossible for procedure f to know where in memory to find g.
The solution is for the compiler to indicate in the output of the file A compilation that the address of g is needed. This is called a use of g. When processing file B, the compiler outputs the (relative) address of g. This is called the definition of g. The assembler passes this information to the linker.
The simplest linker technique is to again make two passes. During the first pass, the linker records in its external symbol table (a table of external symbols, not a symbol table that is stored externally) all the definitions encountered. During the second pass, every use can be resolved by access to the table.
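Both linker tasks, relocation and resolving external references, can be sketched together. The following Python is a minimal illustration under assumptions of my own: each assembled module is a dictionary with hypothetical keys code (a list of words), defs (name to relative address), uses (name to the list of word positions needing that address), and relocs (positions holding a module-relative address).

```python
def link(modules):
    """Two-pass linker sketch: returns (memory image, external symbol table)."""
    # Pass 1: compute each module's starting address and record every
    # definition, already relocated, in the external symbol table.
    external = {}
    bases = []
    base = 0
    for m in modules:
        bases.append(base)
        for name, rel in m["defs"].items():
            external[name] = base + rel
        base += len(m["code"])      # next module starts after this one

    # Pass 2: relocate relative addresses and resolve every use.
    image = []
    for m, base in zip(modules, bases):
        code = list(m["code"])
        for site in m["relocs"]:
            code[site] += base      # relative address + module start
        for name, sites in m["uses"].items():
            for site in sites:
                code[site] = external[name]
        image.extend(code)
    return image, external


# File A defines f and calls g; file B defines g and refers to its own word 2.
file_a = {"code": ["call", None, "halt"], "defs": {"f": 0},
          "uses": {"g": [1]}, "relocs": []}
file_b = {"code": ["load", 2, "data"], "defs": {"g": 0},
          "uses": {}, "relocs": [1]}

image, external = link([file_a, file_b])
```

Here file B is placed at address 3, so g becomes 3 and B's relative address 2 becomes 5.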
I cover the linker in more detail when I teach 2250, OS Design. You can find my class notes for OS Design starting at my home page.
After the linker has done its work, the resulting executable file can be loaded by the operating system into memory. The details are OS dependent. With early single-user operating systems all programs were loaded at a fixed address (say 0) and the loader simply copied the file to memory. Today it is much more complicated since (parts of) many programs reside in memory at the same time. Hence the compiler/assembler/linker cannot know the real location for an identifier. Indeed, this real location can change. More information is given in any OS course.
Modern compilers contain two (large) parts, each of which is often subdivided. These two parts are the front end, shown in green on the right, and the back end, shown in pink.
The front end analyzes the source program, determines its constituent parts, and constructs an intermediate representation of the program. Typically the front end is independent of the target language.
The back end synthesizes the target program from the intermediate representation produced by the front end. Typically the back end is independent of the source language.
This front/back division very much reduces the work for a compiling system that can handle several (N) source languages and several (M) target languages. Instead of NM compilers, we need N front ends and M back ends. For gcc (originally an abbreviation of GNU C Compiler, but now of GNU Compiler Collection), N=7 and M~30, so the savings are considerable: roughly 37 components instead of roughly 210 compilers.
Compiler-like applications also use analysis and synthesis. Some examples include
The front and back end are themselves each divided into multiple phases. Conceptually, the input to each phase is the output of the previous. Sometimes a phase changes the representation of the input. For example, the lexical analyzer converts a character stream input into a token stream output. Sometimes the representation is unchanged. For example, the machine-dependent optimizer transforms target-machine code into (hopefully improved) target-machine code.
The diagram is definitely not drawn to scale, in terms of effort or lines of code. In practice, the optimizers dominate.
Conceptually, there are three phases of analysis with the output of one phase the input of the next. Each of these phases changes the representation of the program being compiled. The phases are called lexical analysis or scanning, which transforms the program from a string of characters to a string of tokens; syntax analysis or parsing, which transforms the program into some kind of syntax tree; and semantic analysis, which decorates the tree with semantic information.
Note that the above classification is conceptual; in practice more efficient representations may be used. For example, instead of having all the information about the program in the tree, tree nodes may point to symbol table entries. Thus the information about the variable counter is stored once and pointed to at each occurrence.
The character stream input is grouped into meaningful units called lexemes, which are then mapped into tokens, the latter constituting the output of the lexical analyzer. For example, any one of the following C statements
    x3 = y + 3;
    x3 = y + 3 ;
    x3 =y+ 3 ;
but not
    x 3 = y + 3;
would be grouped into the lexemes x3, =, y, +, 3, and ;.
A token is a <token-name,attribute-value> pair. For example
"this kind of 3" is stored. Another possibility is to have a separate
Note that non-significant blanks are normally removed during scanning. In C, most blanks are non-significant. That does not mean the blanks are unnecessary. Consider
    int x;
    intx;
The blank between int and x is clearly necessary, but it does not become part of any token. Blanks inside strings are an exception; they are part of the token (or more likely the table entry pointed to by the second component of the token).
Note that we can define identifiers, numbers, and the various symbols and punctuation without using recursion (compare with parsing below).
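The grouping into lexemes and tokens can be sketched with regular expressions, with no recursion needed. This Python fragment is an illustration only; the token names id and number, the choice to store an identifier's attribute as its 1-based position in a symbol table, and the function name scan are my assumptions, and the scanner knows only the symbols =, +, and ; used in the example.

```python
import re

# One alternative per kind of lexeme; leading blanks are skipped.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<id>[A-Za-z_]\w*)|(?P<number>\d+)|(?P<sym>[=+;]))")


def scan(text):
    """Return (tokens, symbol table) for a statement such as 'x3 = y + 3;'."""
    symtab = []                     # identifier lexemes, in order of appearance
    tokens = []                     # <token-name, attribute-value> pairs
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"illegal character at position {pos}")
        pos = m.end()
        if m.lastgroup == "id":
            lexeme = m.group("id")
            if lexeme not in symtab:
                symtab.append(lexeme)
            tokens.append(("id", symtab.index(lexeme) + 1))
        elif m.lastgroup == "number":
            tokens.append(("number", m.group("number")))
        else:
            tokens.append((m.group("sym"), None))   # symbol needs no attribute
    return tokens, symtab
```

The differently spaced versions of the statement produce the same token stream, whereas x 3 = y + 3; yields a different one because x and 3 become separate lexemes.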
Parsing involves a further grouping in which tokens are grouped into grammatical phrases, which are often represented in a parse tree. For example
    x3 = y + 3;
would be parsed into the tree on the right.
This parsing would result from a grammar containing rules such as
    asst-stmt → id = expr ;
    expr → number
         | id
         | expr + expr
Note the recursive definition of expression (expr). Note also the hierarchical decomposition in the figure on the right.
The division between scanning and parsing is somewhat arbitrary, in that some tasks can be accomplished by either. However, if a recursive definition is involved, it is considered parsing not scanning.
Often we utilize a simpler tree called the syntax tree with operators as interior nodes and operands as the children of the operator. The syntax tree on the right corresponds to the parse tree above it. We expand on this point later.
(Technical point.) The syntax tree shown represents an assignment expression not an assignment statement. In C an assignment statement includes the trailing semicolon. That is, in C (unlike in Algol) the semicolon is a statement terminator not a statement separator.
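A parser for the two rules above can be sketched in a few lines. This Python version is a simplification and not the method developed later in the course: the rule expr → expr + expr is ambiguous, so the sketch assumes + is left associative and handles it with a loop. The names lexemes and parse_asst_stmt are hypothetical, and the result is a syntax tree written as nested tuples with the operator first.

```python
import re


def lexemes(text):
    """Split the input into identifiers, numbers, and the symbols = + ;"""
    return re.findall(r"[A-Za-z_]\w*|\d+|[=+;]", text)


def parse_asst_stmt(text):
    """Parse 'id = expr ;' and return a syntax tree as nested tuples."""
    toks = lexemes(text)
    pos = 0

    def expect(pred, what):
        nonlocal pos
        if pos >= len(toks) or not pred(toks[pos]):
            raise SyntaxError(f"expected {what} at token {pos}")
        pos += 1
        return toks[pos - 1]

    def operand():                  # expr -> number | id
        return expect(lambda t: t[0].isalnum() or t[0] == "_", "number or id")

    target = expect(lambda t: t[0].isalpha() or t[0] == "_", "id")
    expect(lambda t: t == "=", "'='")
    tree = operand()
    while pos < len(toks) and toks[pos] == "+":     # expr -> expr + expr
        pos += 1
        tree = ("+", tree, operand())               # group to the left
    expect(lambda t: t == ";", "';'")
    if pos != len(toks):
        raise SyntaxError("unexpected tokens after ';'")
    return ("=", target, tree)
```

For x3 = y + 3; the result has = at the root with children x3 and the + subtree, i.e., operators as interior nodes and operands as children.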
There is more to a front end than simply syntax. The compiler needs semantic information, e.g., the types (integer, real, pointer to array of integers, etc.) of the objects involved. This enables checking for semantic errors and inserting type conversion where necessary.
For example, if y was declared to be a real and x3 an integer, we need to insert (unary, i.e., one operand) conversion operators inttoreal and realtoint as shown on the right.
In this class we will use three-address code for our intermediate language; another possibility that is used is some kind of syntax tree.
Many compilers internally generate intermediate code for an idealized machine. For example, the intermediate code generated would assume that the target has an unlimited number of registers and that any register can be used for any operation. You can also think of this as a machine with no registers, but which permits operations to be directly performed on memory locations. Another common assumption is that machine operations take (up to) three operands, two source and one target. With these assumptions of a machine with an unlimited number of registers and instructions with three operands, one generates three-address code by walking the semantic tree. Our example C instruction would produce
Our example C instruction would produce
    temp1 = inttoreal(3)
    temp2 = id2 + temp1
    temp3 = realtoint(temp2)
    id1 = temp3
We see that three-address code can include instructions with fewer than 3 operands.
Sometimes three-address code is called quadruples because one can view the previous code sequence as
    inttoreal temp1 3     --
    add       temp2 id2   temp1
    realtoint temp3 temp2 --
    assign    id1   temp3 --
Each "quad" has the form
    operation target source1 source2
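Walking the semantic tree to produce quads can be sketched as a postorder traversal that invents a new temporary for each interior node. This Python is a minimal illustration; the tuple encoding of the tree (operator first, then operands) and the name gen_quads are my assumptions, and unused source fields are simply omitted rather than written as --.

```python
def gen_quads(tree):
    """Generate (operation, target, source1[, source2]) quads from a tree."""
    quads = []
    counter = 0

    def walk(node):
        nonlocal counter
        if isinstance(node, str):       # leaf: an identifier or a constant
            return node
        op, *args = node
        sources = [walk(a) for a in args]   # children first (postorder)
        counter += 1
        temp = f"temp{counter}"             # unlimited supply of temporaries
        quads.append((op, temp, *sources))
        return temp

    op, target, expr = tree
    if op != "assign":
        raise ValueError("expected an assignment at the root")
    quads.append(("assign", target, walk(expr)))
    return quads


# x3 = y + 3 with x3 an integer (id1) and y a real (id2).
semantic_tree = ("assign", "id1",
                 ("realtoint", ("add", "id2", ("inttoreal", "3"))))
quads = gen_quads(semantic_tree)
```

For the example tree this yields the same four quads as in the table above.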
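This is a very serious subject, one that we will not really do justice to in this introductory course. Some optimizations are fairly easy to see. Since 3 is a constant, the compiler can perform the inttoreal conversion itself, and the last two quads can be combined. The result is
    add       temp2 id2   3.0
    realtoint id1   temp2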
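The two easy optimizations on the example, converting the constant at compile time and eliminating the final copy, can be sketched on quads held as tuples. This Python is an illustration only, not a general optimizer: it assumes each temporary is used exactly once (true for code generated from a tree), and the name optimize and the tuple layout (operation, target, sources) are hypothetical.

```python
def optimize(quads):
    """Fold inttoreal of a constant, then merge 'tempN = ...; x = tempN'."""
    # Step 1: do the int-to-real conversion of integer constants now.
    consts = {}
    folded = []
    for op, target, *src in quads:
        src = [consts.get(s, s) for s in src]   # substitute folded constants
        if op == "inttoreal" and src[0].isdigit():
            consts[target] = src[0] + ".0"      # e.g. 3 becomes 3.0
        else:
            folded.append((op, target, *src))

    # Step 2: if a quad computes a temporary that the very next quad merely
    # copies, compute directly into the final target instead.
    out = []
    for quad in folded:
        op, target, *src = quad
        if (op == "assign" and out and src[0].startswith("temp")
                and out[-1][1] == src[0]):
            prev = out.pop()
            out.append((prev[0], target, *prev[2:]))
        else:
            out.append(quad)
    return out


original = [
    ("inttoreal", "temp1", "3"),
    ("add", "temp2", "id2", "temp1"),
    ("realtoint", "temp3", "temp2"),
    ("assign", "id1", "temp3"),
]
improved = optimize(original)
```

On the four quads of the running example this leaves two quads, an add with the constant 3.0 and a realtoint directly into id1.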
In addition to optimizations performed on the intermediate code, further optimizations can be performed on the machine code by the machine-dependent back end.
Modern processors have only a limited number of registers. Although some processors, such as the x86, can perform operations directly on memory locations, we will for now assume only register operations. Some processors (e.g., the MIPS architecture) use three-address instructions. We follow this model. Other processors permit only two addresses; the result overwrites one of the sources. Using three-address instructions restricted to registers (except for load and store instructions, which naturally must also reference memory), code something like the following would be produced for our example, after first assigning memory locations to id1 and id2.
    LD   R1, id2
    ADDF R1, R1, #3.0   // add float
    RTOI R2, R1         // real to int
    ST   id1, R2
The symbol table stores information about program variables that will be used across phases. Typically, this includes type information and storage locations.
A possible point of confusion: the storage location does not give the location where the compiler has stored the variable. Instead, it gives the location where the compiled program will store the variable.
Logically each phase is viewed as a separate pass, i.e., a program that reads input and produces output for the next phase. The phases thus form a pipeline. In practice some phases are combined into a pass.
For example one could have the entire front end as one pass.
The term pass is used to indicate that the entire input is read during this activity. So two passes means that the input is read twice. A grayed out (optional) portion of the notes above discusses 2-pass approaches for both assemblers and linkers. If we implement each phase separately and possibly use multiple passes for some of them, the compiler will perform a large number of I/O operations, an expensive undertaking.
As a result, techniques have been developed to reduce the number of passes. We will see in the next chapter how to combine the scanner, parser, and semantic analyzer into one phase. Consider the parser. When it needs to input the next token, rather than reading the input file (presumably produced by the scanner), the parser calls the scanner. At selected points during the production of the syntax tree, the parser calls the intermediate-code generator, which performs semantic analysis as well as generating a portion of the intermediate code.
For pedagogical reasons, we will not be employing this technique. That is, to ease the programming and understanding, we will use a compiler design that performs more I/O than necessary. Naturally, production compilers do not do this. Thus your compiler will consist of separate programs for the scanner, parser, and semantic analyzer / intermediate code generator. Indeed, these will likely be labs 2, 3, and 4.
One problem with combining phases, or with implementing a single phase in one pass, is that it appears that an internal form of the entire program being compiled will need to be stored in memory. This problem arises because the downstream phase may need, early in its execution, information that the upstream phase produces only late in its execution. This motivates the use of symbol tables and a two pass approach in which the symbol table is produced during the first pass and used during the second pass. However, a clever one-pass approach is often possible.
Consider an assembler (or linker). The good case is when a symbol definition precedes all its uses so that the symbol table contains the value of the symbol prior to that value being needed. Now consider the harder case of one or more uses preceding the definition. When a not-yet-defined symbol is first used, an entry is placed in the symbol table, pointing to this use and indicating that the definition has not yet appeared. Further uses of the same symbol attach their addresses to a linked list of undefined uses of this symbol. When the definition is finally seen, the value is placed in the symbol table, and the linked list is traversed inserting the value in all previously encountered uses. Subsequent uses of the symbol will find its definition in the table.
This technique is called backpatching.
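Backpatching can be sketched for a toy one-pass assembler. The Python below is an illustration under assumptions of my own: each source line is an optional "label:" followed by an opcode and at most one operand, the operand is always a symbol, each line occupies one word, and a Python list stands in for the linked list of undefined uses. The name assemble_one_pass is hypothetical.

```python
def assemble_one_pass(lines):
    """Return a list of [opcode, address-or-None] words, reading input once."""
    symtab = {}     # defined symbols: name -> address
    pending = {}    # undefined symbols: name -> addresses of words awaiting it
    words = []
    for address, line in enumerate(lines):
        if ":" in line:                         # a definition
            label, line = line.split(":", 1)
            label = label.strip()
            symtab[label] = address
            for site in pending.pop(label, []):
                words[site][1] = address        # backpatch each earlier use
        op, *operand = line.split()
        word = [op, None]
        if operand:                             # a use
            name = operand[0]
            if name in symtab:
                word[1] = symtab[name]          # good case: already defined
            else:
                pending.setdefault(name, []).append(address)
        words.append(word)
    if pending:
        raise ValueError(f"undefined symbols: {sorted(pending)}")
    return words


program = ["start: LD x", "JMP end", "x: WORD", "JMP start", "end: HALT"]
words = assemble_one_pass(program)
```

In this program the uses of x and end precede their definitions and are filled in when the definitions appear, while the use of start follows its definition and is resolved immediately.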
Originally, compilers were written
from scratch, but now the
situation is quite different.
A number of tools are available to ease the burden.
We will mention tools that generate scanners and parsers.
This will involve us in some theory: regular expressions for
scanners and context-free grammars for parsers.
These techniques are fairly successful.
One drawback can be that they do not execute as fast as
hand-crafted scanners and parsers.
We will also see tools for syntax-directed translation and automatic code generation. The automation in these cases is not as complete.
Finally, there is the large area of optimization.
This is not automated; however, a basic component of optimization is
data-flow analysis (how values are transmitted between
parts of a program) and there are tools to help with this task.
Pedagogically, a problem with using the tools is that your effort shifts from understanding how a compiler works to how the tool is used. So instead of studying regular expressions and finite state automata, you study the flex man pages and user's guide.
As you have doubtless noticed, not all programming efforts produce correct programs. If the input to the compiler is not a legal source language program, errors must be detected and reported. It is often much easier to detect that the program is not legal (e.g., the parser reaches a point where the next token cannot legally occur) than to deduce what is the actual error (which may have occurred earlier). It is even harder to reliably deduce what the intended correct program should be.
Skipped. Assumed knowledge (only one page).
High performance compilers (i.e., the code generated performs well) are crucial for the adoption of new language concepts and computer architectures. Also important is the resource utilization of the compiler itself.
Modern compilers are large. On my laptop the compressed source of gcc is 38MB so uncompressed it must be about 100MB.
We will encounter several aspects of computer science during the course. Some, e.g., trees, I'm sure you already know well. Other, more theoretical aspects, such as nondeterministic finite automata, may be new.
We will do very little optimization. That topic is typically the subject of a second compiler course. Considerable theory has been developed for optimization, but sadly we will see essentially none of it. We can, however, appreciate the pragmatic requirements.
For 50+ years some computers have had multiple processors. The challenge with such multiprocessors is to program them effectively so that all the processors are utilized. Recently, multiprocessors have become commodity items, with multiple processors (cores) on a single chip. Major research efforts have led to improvements in
All machines have a limited number of registers, which can be accessed much faster than central memory. All but the simplest compilers devote effort to using this scarce resource effectively. Modern processors have several levels of caches and advanced compilers produce code designed to utilize the caches well.
RISC computers have comparatively simple instructions; complicated instructions require several RISC instructions. A CISC, Complex Instruction Set Computer, contains both complex and simple instructions. A sequence of CISC instructions would be a larger sequence of RISC instructions. Advanced optimizations are able to find commonality in this larger sequence and lower the total number of instructions. The CISC Intel x86 processor line 8086/80286/80386/... had a major implementation change with the 686 (a.k.a. Pentium Pro). In this processor, the CISC instructions were decomposed into RISC instructions by the processor itself. Currently, code for x86 processors normally achieves highest performance when the (optimizing) compiler emits primarily simple instructions.
A great variety has emerged. Compilers are produced before the processors are fabricated. Indeed, compilation plus simulated execution of the generated machine code is used to evaluate proposed designs.
This means translating from one machine language to another. Companies changing processors sometimes use binary translation to execute legacy code on new machines. Apple did this when converting from Motorola CISC processors to the PowerPC (and again, with Rosetta, when converting from the PowerPC to x86).
An alternative to binary translation is to have the new processor execute programs in both the new and old instruction set. Intel had the Itanium processor also execute x86 code. Digital Equipment Corp (DEC) had their VAX processor also execute PDP-11 instructions. Apple does not produce processors, so it needed binary translation for the Motorola→PowerPC transition.
With the recent dominance of x86 processors, binary translators from x86 have been developed so that other microprocessors can be used to execute x86 software.
In the old days integrated circuits were designed by hand. For example, the NYU Ultracomputer research group in the 1980s designed a VLSI chip for rapid interprocessor coordination. The design software we used essentially let you paint. You painted blue lines where you wanted metal, green for polysilicon, etc. Where certain colors crossed, a transistor appeared.
Current microprocessors are much too complicated to permit such a low-level approach. Instead, designers write in a high-level description language, which is compiled down to the specific layout.
The optimization of database queries and transactions is quite a serious subject.
Instead of simulating a processor design on many inputs, it may be faster to compile the design first into a lower-level representation and then execute the compiled version.
Dataflow techniques developed for optimizing code are also useful for finding errors. Here correctness (finding all errors and only errors) is not a requirement, which is a good thing since that problem is undecidable.
Techniques developed to check for type correctness (we will see some of these) can be extended to find other errors such as using an uninitialized variable.
As mentioned above, optimizations have been developed to eliminate unnecessary bounds checking for languages like Ada and Java that perform the checks automatically. Similar techniques can help find potential buffer overflow errors that can be a serious security threat.
Languages (e.g., Java) with garbage collection cannot have memory leaks (failure to free no longer accessible memory). Compilation techniques can help to find these leaks in languages like C that do not have garbage collection.
Remark: You should be able to do the exercises in this section (but they are not assigned).
Homework: Read chapter 2.
The goal of this chapter is to implement a very simple compiler. Really we are just going as far as the intermediate code, i.e., the front end. Nonetheless, the output, i.e., the intermediate code, does look somewhat like assembly language.
How is this possible to do in just one chapter?
What is the rest of the book about?
In this chapter there is
The material will be presented too fast for full understanding. Starting in chapter 3, we slow down and explain everything.
Sometimes in chapter 2, we only show some of the possibilities (typically omitting the hard cases) and don't even mention the omissions. Again, this is corrected in the remainder of the course.
A weakness of my teaching style is that I spend too long on chapters like this. I will try not to make that mistake this semester, but I have said that before.
We will be looking at the front end, i.e., the analysis portion of a compiler.
The syntax describes the form of a program in a given language, while the semantics describes the meaning of that program. We will learn the standard context-free grammar and will use the standard BNF (Backus-Naur Form) to describe the syntax.
We will learn syntax-directed translation, where the grammar does more than specify the syntax. We augment the grammar with attributes and use this to guide the entire front end.
The front end discussed in this chapter has as source language
infix expressions consisting of digits, +, and -.
The target language is
postfix expressions with the same components.
The compiler will convert
7+4-5 to 74+5-.
Actually, our simple compiler will handle a few other operators as well.
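The infix-to-postfix translation for single digits separated by + and - can be sketched directly. This Python is a minimal illustration, not the translator built in the chapter; the name infix_to_postfix is hypothetical and only the two operators + and - are handled.

```python
DIGITS = "0123456789"


def infix_to_postfix(s):
    """Translate e.g. '7+4-5' into '74+5-'; operators associate to the left."""
    def digit(i):
        if i >= len(s) or s[i] not in DIGITS:
            raise SyntaxError(f"digit expected at position {i}")
        return s[i]

    out = digit(0)                  # the first operand
    i = 1
    while i < len(s):
        op = s[i]
        if op not in "+-":
            raise SyntaxError(f"operator expected at position {i}")
        out += digit(i + 1) + op    # emit the operand, then its operator
        i += 2
    return out
```

Each operator is emitted immediately after its right operand, which is exactly what makes the output postfix.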
We will tokenize the input (i.e., write a scanner), model the syntax of the source, and let this syntax direct the translation all the way to three-address code, our intermediate language. This will be done right in the next two chapters.
A context-free grammar (CFG) consists of a set of terminals, a set of nonterminals, a set of productions, and a start symbol. For example,
    Terminals: 0 1 2 3 4 5 6 7 8 9 + -
    Nonterminals: list digit
    Productions:
        list → list + digit
        list → list - digit
        list → digit
        digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    Start symbol: list
We use | to indicate that a nonterminal has multiple possible right-hand sides. So
    A → B | C
is simply shorthand for
    A → B
    A → C
If no start symbol is specifically designated, the LHS of the first production is the start symbol.
Watch how we can generate the string 7+4-5 beginning with the start symbol, applying productions, and stopping when no productions are possible (we have only terminals).
    list → list - digit
         → list - 5
         → list + digit - 5
         → list + 4 - 5
         → digit + 4 - 5
         → 7 + 4 - 5
This process of applying productions, starting with the start symbol and ending when only terminals are present is called a derivation and we say that the final string has been derived from the initial string (in this case the start symbol).
The set of all strings derivable from the start symbol is the language generated by the CFG.
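The derivation of 7+4-5 shown above can be reproduced mechanically. This Python sketch is an illustration for the list/digit grammar only; the name derive is hypothetical. It always expands the rightmost nonterminal, which is the order used in the example, and raises an error for strings not in the language.

```python
def derive(s):
    """Return the sentential forms of a derivation of s from 'list'."""
    digits, ops = s[0::2], s[1::2]
    if (len(s) % 2 == 0
            or any(d not in "0123456789" for d in digits)
            or any(o not in "+-" for o in ops)):
        raise ValueError(f"{s!r} is not in the language of the grammar")

    tail = []                       # terminals already produced on the right
    steps = [["list"]]
    # Work from the last operator back to the first.
    for op, d in zip(reversed(ops), reversed(digits[1:])):
        steps.append(["list", op, "digit"] + tail)   # list -> list op digit
        tail = [op, d] + tail
        steps.append(["list"] + tail)                # digit -> d
    steps.append(["digit"] + tail)                   # list -> digit
    steps.append([digits[0]] + tail)                 # digit -> first digit
    return [" ".join(form) for form in steps]
```

Calling derive("7+4-5") returns the seven sentential forms of the derivation above, from list to 7 + 4 - 5.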