Compilers Lecture #3

The purpose of lexical analysis is to convert a sequence of characters (the source) into a sequence of tokens. A lexeme is the sequence of characters comprising a single token.

Note that (following the book) we are going out of order. In reality, the lexer operates on the input and the resulting token sequence is the input to the parser. The reason we were able to produce the translator in the previous section without a lexer is that all the tokens were just one character (that is why we had just single digits).

Strictly speaking, you never need a lexer: anything a lexer can do, a parser can do. But lexers are smaller and, for software engineering and other reasons, are normally used.

2.6.1: Removal of White space and comments

2.6.2: Reading ahead

Consider distinguishing x<y from x<=y. After reading the <, we must read another character. If it is y, we have found our token (<). However, we must unread the y so that, when asked for the next token, we start at y. If at most one extra character ever needs to be examined, a single char variable suffices as the pushback buffer. A more general solution is discussed in the next chapter (Lexical Analysis).
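The read-ahead-then-unread step can be sketched in C as follows. This is only an illustration: the token codes, the struct input type, and the function names are hypothetical, and the input is taken from a string rather than a file so that unreading is just a cursor decrement.

```c
#include <stddef.h>

/* Hypothetical token codes for this sketch (not from the notes). */
enum { TOK_LT = 256, TOK_LE, TOK_EOF, TOK_OTHER };

/* Input is a string plus a cursor; "unreading" moves the cursor back. */
struct input {
    const char *text;
    size_t pos;
};

static int next_char(struct input *in) {
    char c = in->text[in->pos];
    if (c == '\0') return -1;        /* end of input; cursor does not advance */
    in->pos++;
    return (unsigned char)c;
}

static void unread_char(struct input *in, int c) {
    if (c != -1 && in->pos > 0) in->pos--;   /* nothing to unread at end of input */
}

/* Return the next token; only < and <= are distinguished here. */
int next_token(struct input *in) {
    int c = next_char(in);
    if (c == -1) return TOK_EOF;
    if (c == '<') {
        int d = next_char(in);       /* read ahead one character */
        if (d == '=') return TOK_LE;
        unread_char(in, d);          /* not part of this token: push it back */
        return TOK_LT;
    }
    return TOK_OTHER;                /* any other single character */
}
```

On the input <y the first call returns TOK_LT and leaves the cursor on y, so the next call starts there, which is exactly the configuration the text asks for.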

2.6.3: Constants

This chapter considers only numerical integer constants. They are computed one digit at a time using the formula value = 10 * value + digit, starting from value = 0.

The value of the constant can be considered the attribute of the token named num. Alternatively, the attribute can be a pointer/index into the symbol table entry for the number (or into a numbers table).
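A minimal C sketch of computing a constant one digit at a time is given below. The function name scan_number and its len out-parameter are assumptions made for illustration, and overflow is ignored.

```c
#include <ctype.h>

/* Scan a run of decimal digits starting at s and return its value.
   *len receives the number of characters consumed.
   Overflow is ignored in this sketch. */
int scan_number(const char *s, int *len) {
    int value = 0;
    int i = 0;
    while (isdigit((unsigned char)s[i])) {
        value = 10 * value + (s[i] - '0');   /* shift one decimal place, add digit */
        i++;
    }
    *len = i;
    return value;
}
```

The returned value is what would become the attribute of the num token; the first non-digit character is left unconsumed for the next token.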

2.6.4: Recognizing identifiers and keywords

The C statement
sum = sum + x;
contains 6 tokens. The scanner (aka lexer; aka lexical analyzer) will convert the input into
id = id + id ;
(id standing for identifier).
Although there are three id tokens, the first and second represent the lexeme sum; the third represents x. These two different lexemes must be distinguished.

A related distinction occurs with language keywords, for example then, which are syntactically the same as identifiers. The symbol table is used to accomplish both distinctions. We assume (as do most modern languages) that the keywords are reserved, i.e., cannot be used as program variables. Then we simply initialize the symbol table to contain all these reserved words and mark them as keywords. When the lexer encounters a would-be identifier and searches the symbol table, it finds out that the string is actually a keyword.
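The preloading idea might look like this in C. The routine names insert and lookup follow the notes, but their signatures, the token codes, the table size, and the helpers init_keywords and classify are assumptions for this sketch; a linear search stands in for whatever lookup structure a real lexer would use.

```c
#include <string.h>

/* Hypothetical token codes and table size, chosen for this sketch. */
enum { ID = 256, THEN = 257, IF = 258 };
#define MAXSYMS 100

struct entry { const char *lexeme; int token; };
static struct entry symtable[MAXSYMS];
static int nsyms = 0;

/* Return the index of lexeme s, or -1 if absent. */
int lookup(const char *s) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(symtable[i].lexeme, s) == 0) return i;
    return -1;
}

/* Add (s, token) and return its index, or -1 if the table is full.
   The string is not copied. */
int insert(const char *s, int token) {
    if (nsyms == MAXSYMS) return -1;
    symtable[nsyms].lexeme = s;
    symtable[nsyms].token = token;
    return nsyms++;
}

/* Preload the reserved words before any input is scanned. */
void init_keywords(void) {
    insert("then", THEN);
    insert("if", IF);
}

/* Classify a would-be identifier: keyword token if reserved, else ID. */
int classify(const char *s) {
    int i = lookup(s);
    if (i < 0) i = insert(s, ID);
    return i < 0 ? ID : symtable[i].token;
}
```

After init_keywords(), classify("then") yields the keyword token while classify("thenewvalue") yields ID, because the whole lexeme is looked up, not a prefix of it.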

As mentioned previously, care must be taken when one lexeme is a proper prefix of another. Consider
x<y versus x<=y
When the < is read, the scanner needs to read another character to see if it is an =. But if that second character is y, the current token is < and the y must be pushed back onto the input stream so that the configuration is the same after scanning < as it is after scanning <=.

Also consider then versus thenewvalue: one is a keyword and the other an id.

2.6.5: A lexical analyzer

A Java program is given. The book, but not the course, assumes knowledge of Java.

Since the scanner converts digits into num's we can shorten the grammar above. Here is the shortened version before the elimination of left recursion. Note that the value attribute of a num is its numerical value.

The factor() procedure follows the familiar recursive descent pattern: Find a production with factor as LHS and lookahead in FIRST, then do what the RHS says. That is, call the procedures corresponding to the nonterminals, match the terminals, and execute the semantic actions.

2.7: Incorporating a symbol table

The symbol table is an important data structure for the entire compiler. One example of its use is that the semantic actions or rules associated with declarations set the type field of the symbol table entry. Subsequent semantic actions or rules associated with expression evaluation use this type information. For the simple infix to postfix translator (which is typeless), the table is primarily used to store and retrieve <lexeme,token> pairs.

2.7.1: Symbol Table per Scope

There is a serious issue here involving scope. We will learn that lexers are based on regular expressions; whereas parsers are based on the stronger, but more expensive, context-free grammars. Regular expressions are not powerful enough to handle nested scopes. So, if the language you are compiling supports nested scopes, the lexer can only construct the <lexeme,token> pairs. The parser converts these pairs into a true symbol table that reflects the nested scopes. If the language is flat, the scanner can produce the symbol table.

The idea for a language with nested scopes is that, when entering a block, a new symbol table is created. Each such table points to the one immediately outer. This structure supports the most-closely nested rule for symbols: a symbol is in the scope of the most-closely nested declaration. This gives rise to a tree of tables.
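A chain of per-scope tables can be sketched in C as below. The names env_new, env_put, and env_get, and the fixed per-scope capacity, are hypothetical choices for this illustration; the book's own version is in Java.

```c
#include <stdlib.h>
#include <string.h>

#define MAXPER 16   /* hypothetical per-scope capacity for this sketch */

struct env {
    struct env *outer;               /* enclosing scope, or NULL */
    const char *name[MAXPER];
    const char *type[MAXPER];
    int n;
};

/* Enter a block: create a table that points to the enclosing one. */
struct env *env_new(struct env *outer) {
    struct env *e = calloc(1, sizeof *e);
    if (e) e->outer = outer;
    return e;
}

/* Declare name in the current scope only. Returns 0 on success. */
int env_put(struct env *e, const char *name, const char *type) {
    if (e->n == MAXPER) return -1;
    e->name[e->n] = name;
    e->type[e->n] = type;
    e->n++;
    return 0;
}

/* Look up name, innermost scope first; NULL if undeclared. */
const char *env_get(const struct env *e, const char *name) {
    for (; e != NULL; e = e->outer)
        for (int i = e->n - 1; i >= 0; i--)
            if (strcmp(e->name[i], name) == 0) return e->type[i];
    return NULL;
}
```

Because the search walks outward from the current table, an inner declaration of y hides an outer one, which is the most-closely nested rule.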

Interface

Reserved keywords

Simply insert them into the symbol table prior to examining any input. Then they can be found when used correctly and, since their corresponding token will not be id, any use of them where an identifier is required can be flagged. For example, a lexer for a C-like language would have insert(int) performed prior to scanning the input.

2.7.2: The Use of Symbol Tables

Below is the grammar for a stripped down example showing nested scopes. The language consists just of nested blocks, a weird mixture of C- and Ada-style declarations (specifically, type colon identifier), and trivial statements consisting of just an identifier.

To show that we have correctly parsed the input and obtained its meaning (i.e., performed semantic analysis), we present a translation scheme that digests the declarations and translates the statements so that the above example becomes

The translation scheme, slightly modified from the book page 90, is shown in the Production/Action table at the end of these notes. First a formatting comment.

This translation scheme looks weird, but is actually a good idea (of the authors): it reconciles the two goals of respecting the ordering and nonetheless having the actions all in one column.

Recall that the placement of the actions within the RHS of the production is significant. The parse tree with its actions is processed in a depth first manner so that the actions are performed in left to right order. Thus an action is executed after all the subtrees rooted by parts of the RHS to the left of the action and is executed before all the subtrees rooted by parts of the RHS to the right of the action.

Consider the first production. We want the action to be executed before processing block. Thus the action must precede block in the RHS. But we want the actions in the right column. So we split the RHS over several lines and place an action in the rightmost column of the line that puts it in the right order.

The second production has some semantic actions to be performed at the start of the block, and others to be performed at the bottom.

To fully understand the details, you must read the book; but we can see how it works. A new Env initializes a new symbol table; top.put inserts into the symbol table of the current environment top; top.get retrieves from that symbol table.

In some sense the star of the show is the simple production
factor → id
together with its semantic actions. These actions look up the identifier in the (correct!) symbol table and print out the type name.

Hard Question: I prefer Ada-style declarations, which are of the form
identifier : type
What problem would have occurred had I done so here and how does one solve that problem?
Answer: Both a stmt and a decl would start with id, and thus the FIRST sets would not be disjoint. The result would be that we need more than one token of lookahead to see where the decls end and the stmts begin. The fix is to left-factor the grammar, as we will learn in chapter 4.

2.8: Intermediate Code Generation

2.8.1: Two kinds of Intermediate Representations

Another (but less common) name for parse tree is concrete syntax tree. Similarly another (also less common) name for syntax tree is abstract syntax tree.

Very roughly speaking, (abstract) syntax trees are parse trees reduced to their essential components, and three address code looks like assembler without the concept of registers.

2.8.2: Construction of (Abstract) Syntax Trees

Syntax Trees for Statements

The book has an SDD on page 94 for several statements. The part for while reads
stmt → while ( expr ) stmt₁ { stmt.n = new While(expr.n, stmt₁.n); }
The n attribute gives the syntax tree node.

Representing Blocks in Syntax Trees

Syntax trees for Expressions

When parsing we need to distinguish between + and * to ensure that 3+4*5 is parsed correctly, reflecting the higher precedence of *. However, once parsed, the precedence is reflected in the tree itself (the node for + has the node for * as a child). The rest of the compiler treats + and * largely the same, so it is common to use the same node label, say OP, for both of them. So we see
term → term₁ * factor { term.n = new Op('*', term₁.n, factor.n); }

Note, however, that the SDD (Figure 2.39) essentially constructs both the parse tree and the syntax tree. The latter is constructed as the attributes in the former.
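To make the single-label idea concrete, here is a small C sketch of syntax tree nodes in which every binary operator uses the same node type. The struct layout and the names new_leaf, new_op, and eval are hypothetical; new_op plays the role of new Op(...) in the production above.

```c
#include <stdlib.h>

/* One node type serves every binary operator; the operator itself is
   only a label. Leaves (op == 0) carry a value. Names are hypothetical. */
struct node {
    char op;                 /* '+', '*', ... or 0 for a leaf */
    int value;               /* meaningful only for leaves */
    struct node *left, *right;
};

struct node *new_leaf(int value) {
    struct node *n = calloc(1, sizeof *n);
    if (n) n->value = value;
    return n;
}

/* C analogue of new Op('*', term1.n, factor.n). */
struct node *new_op(char op, struct node *left, struct node *right) {
    struct node *n = calloc(1, sizeof *n);
    if (n) { n->op = op; n->left = left; n->right = right; }
    return n;
}

/* Evaluate the tree; precedence is already encoded in its shape. */
int eval(const struct node *n) {
    if (n->op == 0) return n->value;
    int l = eval(n->left), r = eval(n->right);
    return n->op == '+' ? l + r : l * r;    /* only + and * in this sketch */
}
```

For 3+4*5 the parser would build new_op('+', new_leaf(3), new_op('*', new_leaf(4), new_leaf(5))); the * node is a child of the + node, so no precedence information is needed after parsing.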

2.8.3: Static Checking

Static checking refers to checks performed during compilation, whereas dynamic checking refers to those performed at run time. Examples of static checks include

Remark: This is from 1e.

Implementation

Probably the simplest would be

      struct symtableType {
        char lexeme[BIGNUMBER];
        int  token;
      } symtable[ANOTHERBIGNUMBER];

The space inefficiency of having a fixed size entry for all lexemes is poor, so the authors use a (standard) technique of concatenating all the strings into one big string and storing pointers to the beginning of each of the substrings.
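The one-big-string technique can be sketched in C roughly as follows. The array names, sizes, and the exact signature of insert are assumptions made for this illustration rather than the authors' code.

```c
#include <string.h>

#define POOLSIZE 1000   /* hypothetical sizes for this sketch */
#define SYMMAX   100

static char   lexemes[POOLSIZE];  /* all lexemes, each NUL-terminated, back to back */
static size_t lastchar = 0;       /* first free position in lexemes */

struct symentry {
    char *lexptr;                 /* points into lexemes */
    int   token;
};
static struct symentry symtable[SYMMAX];
static int lastentry = 0;

/* Store s in the pool and record (s, token). Returns the entry index,
   or -1 if either the pool or the table is full. */
int insert(const char *s, int token) {
    size_t len = strlen(s) + 1;              /* include the terminating NUL */
    if (lastentry == SYMMAX || lastchar + len > POOLSIZE) return -1;
    symtable[lastentry].lexptr = &lexemes[lastchar];
    symtable[lastentry].token = token;
    memcpy(&lexemes[lastchar], s, len);
    lastchar += len;
    return lastentry++;
}
```

Each entry now costs one pointer plus the exact length of its lexeme, instead of a fixed BIGNUMBER bytes per entry.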

2.8: Abstract stack machines

One form of intermediate representation is to assume that the target machine is a simple stack machine (explained very soon). Then the front end of the compiler translates the source language into instructions for this stack machine and the back end translates stack machine instructions into instructions for the real target machine.

We use a very simple stack machine

Separate instruction and data memories (no self modifying code).
Arithmetic performed on elements of the stack, rather than on registers or arbitrary locations in (data) memory.
Very simple instructions
- Arithmetic (we just do integer for now).
- Stack manipulation, push, pop, dup, a few others.
- Control flow.
The machine itself has two hidden registers tos (top of stack) and pc (program counter) that are manipulated by instruction execution but are not explicitly mentioned in the instructions. Similar to implementing an abstract data type.

Arithmetic instructions

An instruction for each simple op (e.g., add, mul).
Complicated ops (e.g., sqrt) require several instructions.
We assume an instruction exists for each of the ops we use.
The instruction consumes one or two operands from the tos and places the result on the tos.

L-values and R-values

Consider Q := Z; or A[f(x)+B*D] := g(B+C*h(x,y));. (I am using [] for array reference and () for function call.)

From a macroscopic view, executing either of these assignments has three components.

Note the differences between L-values, quantities that can appear on the LHS of an assignment, and R-values, quantities that can appear only on the RHS.

Type Checking

These checks ensure that the types of the operands are those expected by the operator. In addition to flagging errors, this activity includes

2.8.4: Three-Address Code

These are primitive instructions that have one operator and (up to) three operands, all of which are addresses. One address is the destination, which receives the result of the operation; the other two addresses are the sources of the values to be operated on.

Perhaps the clearest way to illustrate the (up to) three address nature of the instructions is to write them as quadruples or quads.
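As a rough illustration of the quad form, the C sketch below stores one operator and three addresses per instruction. The field names, the quad_format helper, and the temporary name t1 are assumptions for this example; addresses are represented simply as names.

```c
#include <stdio.h>

/* A quadruple: operator, two source addresses, one destination.
   Addresses are plain names here; the field names are hypothetical. */
struct quad {
    char op;
    const char *src1, *src2, *dst;
};

/* Format q as "dst = src1 op src2" into buf; returns snprintf's result. */
int quad_format(const struct quad *q, char *buf, size_t size) {
    return snprintf(buf, size, "%s = %s %c %s", q->dst, q->src1, q->op, q->src2);
}
```

Under these assumptions the source statement x = y + z * w would become the two quads {'*', "z", "w", "t1"} and {'+', "y", "t1", "x"}, each naming at most three addresses.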

Translating Statements

We do this and the next section much more slowly and in much more detail later in the course.

The constructor for an IF node is called with nodes for the expression and statement. These are saved.

When the entire tree is constructed, the main code calls gen() of the root. As we see for IF, gen of a node invokes gen of the children.

Translating Expressions

Better Code for Expressions

So-called optimization (the result is far from optimal) is a huge subject that we barely touch. Here are a few very simple examples. We will cover these since they are local optimizations, that is, they occur within a single basic block (a sequence of statements that execute without any jumps).

Remark: From 1e.

Stack manipulation

push v	push v (onto stack)
rvalue l	push contents of (location) l
lvalue l	push address of l
pop	pop
:=	r-value on tos put into the location specified by l-value 2nd on the stack; both are popped
copy	duplicate the top of stack

Translating expressions

Machine instructions to evaluate an expression mimic the postfix form of the expression. That is, we generate code to evaluate the left operand, then code to evaluate the right operand, and finally the code to evaluate the operation itself.

For example y := 7 * xx + 6 * (z + w) becomes

      lvalue y
      push 7
      rvalue xx
      *
      push 6
      rvalue z
      rvalue w
      +
      *
      +
      :=
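To check that this instruction sequence really computes the assignment, here is a tiny C interpreter for just the instructions used above. The opcode encoding, the struct instr layout, and the fixed stack size are assumptions for this sketch, and no bounds checking is done; tos and pc mirror the hidden registers described earlier.

```c
/* Opcodes for a hypothetical encoding of the instructions used above. */
enum op { PUSH, RVALUE, LVALUE, ADD, MUL, ASSIGN };

struct instr { enum op op; int arg; };   /* arg: constant or data address */

/* Run n instructions against data memory; returns the final stack depth.
   No bounds checking is done in this sketch. */
int run(const struct instr *code, int n, int *data) {
    int stack[64];
    int tos = 0;                         /* number of elements on the stack */
    for (int pc = 0; pc < n; pc++) {
        int a = code[pc].arg;
        switch (code[pc].op) {
        case PUSH:   stack[tos++] = a;        break;
        case RVALUE: stack[tos++] = data[a];  break;   /* contents of location */
        case LVALUE: stack[tos++] = a;        break;   /* address of location */
        case ADD:    tos--; stack[tos - 1] += stack[tos]; break;
        case MUL:    tos--; stack[tos - 1] *= stack[tos]; break;
        case ASSIGN: data[stack[tos - 2]] = stack[tos - 1]; /* r-value on top */
                     tos -= 2;           break;             /* l-value below  */
        }
    }
    return tos;
}
```

With y, xx, z, w at data addresses 0 to 3 and xx=2, z=3, w=4, running the eleven instructions leaves 7*2 + 6*(3+4) = 56 in y and an empty stack.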

To say this more formally we define two attributes. For any nonterminal, the attribute t gives its translation and for the terminal id, the attribute lexeme gives its string representation.

Assuming we have already given the semantic rules for expr (i.e., assuming that the annotation expr.t is known to contain the translation for expr) then the semantic rule for the assignment statement is

      stmt → id := expr
        { stmt.t := 'lvalue' || id.lexeme || expr.t || ':=' }

Control flow

There are several ways of specifying conditional and unconditional jumps. We choose the following 5 instructions. The simplifying assumption is that the abstract machine supports symbolic labels. The back end of the compiler would have to translate this into machine instructions for the actual computer, e.g. absolute or relative jumps (jump 3450 or jump +500).

label l	target of jumps to l; has no other effect
goto l	jump to label l
gofalse l	pop stack; jump to l if value is false
gotrue l	pop stack; jump to l if value is true
halt	stop execution

Translating (if-then) statements

Fairly simple. Generate a new label using the assumed function newlabel(), which we sometimes write without the (), and use it. The semantic rule for an if statement is simply

      stmt → if expr then stmt₁ { out := newlabel();
                               stmt.t := expr.t || 'gofalse' out || stmt₁.t || 'label' out }

Emitting a translation

Rewriting the above as a semantic action (rather than a rule) we get the following, where emit() is a function that prints its arguments in whatever form is required for the abstract machine (e.g., it deals with line length limits, required whitespace, etc).

      stmt → if
             expr      { out := newlabel; emit('gofalse', out); }
             then
             stmt₁     { emit('label', out) }

Don't forget that expr is itself a nonterminal. So by the time we reach out:=newlabel, we will have already parsed expr and thus will have done any associated actions, such as emit()'ing instructions. These instructions will have left a boolean on the tos. It is this boolean that is tested by the emitted gofalse.

More precisely, the action written to the right of expr will be the third child of stmt in the tree. Since a postorder traversal visits the children in order, the second child expr will have been visited (just) prior to visiting the action.

Pseudocode for stmt (fig 2.34)

Look how simple it is! Don't forget that the FIRST sets for the productions having stmt as LHS are disjoint!

      procedure stmt
        integer test, out;
        if lookahead = id then       // first set is {id} for assignment
          emit('lvalue', tokenval);  // pushes lvalue of lhs
          match(id);                 // move past the lhs
          match(':=');               // move past the :=
          expr;                      // pushes rvalue of rhs on tos
          emit(':=');                // do the assignment (Omitted in book)
        else if lookahead = 'if' then
          match('if');               // move past the if
          expr;                      // pushes boolean on tos
          out := newlabel();
          emit('gofalse', out);      // out is integer, emit makes a legal label
          match('then');             // move past the then
          stmt;                      // recursive call
          emit('label', out);        // emit again makes out legal
        else if ...                  // while, repeat/do, etc
        else error();
      end stmt;

2.9: Putting the techniques together

Full code for a simple infix to postfix translator. This uses the concepts developed in 2.5-2.7 (it does not use the abstract stack machine material from 2.8). Note that the intermediate language we produced in 2.5-2.7, i.e., the attribute .t or the result of the semantic actions, is essentially the final output desired. Hence we just need the front end.

Description

The grammar with semantic actions is as follows. All the actions come at the end since we are generating postfix. This is not always the case.

       start → list eof
        list → expr ; list
        list →  ε                   // would normally use | as below
        expr → expr + term      { print('+') }
             | expr - term      { print('-'); }
             | term
        term → term * factor    { print('*') }
             | term / factor    { print('/') }
             | term div factor  { print('DIV') }
             | term mod factor  { print('MOD') }
             | factor
      factor → ( expr )
             | id               { print(id.lexeme) }
             | num              { print(num.value) }

Eliminate left recursion to get

            start → list eof
             list → expr ; list
                  | ε
             expr → term moreterms
        moreterms → + term { print('+') } moreterms
                  | - term { print('-') } moreterms
                  | ε
             term → factor morefactors
      morefactors → * factor { print('*') } morefactors
                  | / factor { print('/') } morefactors
                  | div factor { print('DIV') } morefactors
                  | mod factor { print('MOD') } morefactors
                  | ε
           factor → ( expr )
                  | id               { print(id.lexeme) }
                  | num              { print(num.value) }

Show A+B; on board starting with start.

`Lexer.c`

Contains lexan(), the lexical analyzer, which is called by the parser to obtain the next token. The attribute value is assigned to tokenval and white space is stripped.

lexeme	token	attribute value

white space
sequence of digits	NUM	numeric value
div	DIV
mod	MOD
other seq of a letter then letters and digits	ID	index into symbol table
eof char	DONE
other char	that char	NONE

`Parser.c`

Using a recursive descent technique, one writes routines for each nonterminal in the grammar. In fact the book combines term and morefactors into one routine.

    void term(void) {
      int t;
      factor();
      // now we should call morefactors(), but instead code it inline
      while (1)                // morefactors nonterminal is right recursive
         switch (lookahead) {  // lookahead set by match()
         case '*': case '/': case DIV: case MOD: // all the same
            t = lookahead;     // needed for emit() below
            match(lookahead);  // skip over the operator
            factor();          // see grammar for morefactors
            emit(t,NONE);
            continue;          // inside a switch, continue applies to the enclosing while
         default:              // the epsilon production
            return;
         }
    }

Other nonterminals similar.

`Emitter.c`

The routine emit().

`Symbol.c` and `init.c`

The insert(s,t) and lookup(s) routines described previously are in symbol.c. The routine init() preloads the symbol table with the defined keywords.

`Error.c`

Does almost nothing. The only help is that the line number, calculated by lexan(), is printed.

Two Questions

Chapter 3: Lexical Analysis

Note that the speed (of the lexer not of the code generated by the compiler) and error reporting/correction are typically much better for a handwritten lexer. As a result most production-level compiler projects write their own lexers.

3.1: The Role of the Lexical Analyzer

The lexer is called by the parser when the latter is ready to process another token.

The lexer also might do some housekeeping such as eliminating whitespace and comments. Some call these tasks scanning, but others use the term scanner for the entire lexical analyzer.

After the lexer, individual characters are no longer examined by the compiler; instead tokens (the output of the lexer) are used.

3.1.1: Lexical Analysis Versus Parsing

Why separate lexical analysis from parsing? The reasons are basically software engineering concerns.

3.1.2: Tokens, Patterns, and Lexemes

3.1.3: Attributes for Tokens

For tokens corresponding to keywords, attributes are not needed since the name of the token tells everything. But consider the token corresponding to integer constants. Just knowing that we have a constant is not enough; subsequent stages of the compiler need to know the value of the constant. Similarly, for the token identifier we need to distinguish one identifier from another. The normal method is for the attribute to specify the symbol table entry for this identifier.

We really shouldn't say symbol table. As mentioned above if the language has scoping (nested blocks) the lexer can't construct the symbol table, but just makes a table of <lexeme,token> pairs, which the parser later converts into a proper symbol table (or tree of tables).

3.1.4: Lexical Errors

We saw in this movie an example where parsing got stuck because we reduced the wrong part of the input string. We also learned about FIRST sets that enabled us to determine which production to apply when we are operating left to right on the input. For predictive parsers the FIRST sets for a given nonterminal are disjoint and so we know which production to apply. In general the FIRST sets might not be disjoint so we have to try all the productions whose FIRST set contains the lookahead symbol.

All the above assumed that the input was error free, i.e. that the source was a sentence in the language. What should we do when the input is erroneous and we get to a point where no production can be applied?

In many cases this is up to the parser to detect/repair. Sometimes, however, the lexer is stuck because there are no patterns that match the input at this point.

The simplest solution is to abort the compilation stating that the program is wrong, perhaps giving the line number and location where the lexer and/or parser could not proceed.

We would like to do better and at least find other errors. We could perhaps skip input up to a point where we can begin anew (e.g. after a statement ending semicolon), or perhaps make a small change to the input around lookahead so that we can proceed.
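The skip-ahead idea can be sketched in C as below. The function name skip_to_statement_end is hypothetical, the source is assumed to be a NUL-terminated string, and a real compiler would also have to cope with semicolons inside strings or comments.

```c
#include <stddef.h>

/* Return the index just past the next ';' at or after pos, or the index of
   the terminating NUL if there is none. Scanning can resume from there. */
size_t skip_to_statement_end(const char *src, size_t pos) {
    while (src[pos] != '\0' && src[pos] != ';') pos++;
    return src[pos] == ';' ? pos + 1 : pos;
}
```

If the lexer gets stuck on the @ in "a @ b; c = 1;", resuming at the returned index lets it continue with the second statement and possibly report further errors.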

3.2: Input Buffering

Determining the next lexeme often requires reading the input beyond the end of that lexeme. For example, to determine the end of an identifier normally requires reading the first whitespace character after it. Also just reading > does not determine the lexeme as it could also be >=. When you determine the current lexeme, the characters you read beyond it may need to be read again to determine the next lexeme.

3.2.1: Buffer Pairs

The book illustrates the standard programming technique of using two (sizable) buffers to solve this problem.

3.2.2: Sentinels

A useful programming improvement that combines testing for the end of a buffer with determining the character read.
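A simplified, single-buffer C sketch of the sentinel idea follows; it is not the book's two-buffer code. The function names and the choice of '\0' as the sentinel are assumptions, and the buffer is assumed to have room for one extra character at buf[size].

```c
#include <ctype.h>
#include <stddef.h>

#define SENTINEL '\0'   /* assumed never to occur in the source text */

/* Store the sentinel just past the size valid characters. */
void install_sentinel(char *buf, size_t size) {
    buf[size] = SENTINEL;
}

/* Without a sentinel: two tests per character (end of buffer, then class). */
size_t ident_len_checked(const char *buf, size_t size) {
    size_t i = 0;
    while (i < size && isalnum((unsigned char)buf[i])) i++;
    return i;
}

/* With the sentinel installed: one test per character. The sentinel is
   not alphanumeric, so it stops the loop by itself. */
size_t ident_len_sentinel(const char *buf) {
    size_t i = 0;
    while (isalnum((unsigned char)buf[i])) i++;
    return i;
}
```

The second loop never compares an index against the buffer size; reaching the end of the buffer shows up as just another character that fails the test.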

3.3: Specification of Tokens

The chapter turns formal and, in some sense, the course begins. The book is fairly careful about finite vs infinite sets and also uses (without a definition!) the notion of a countable set. (A countable set is either a finite set or one whose elements can be put into one to one correspondence with the positive integers. That is, it is a set whose elements can be counted. The set of rational numbers, i.e., fractions in lowest terms, is countable; the set of real numbers is uncountable, because it is strictly bigger, i.e., it cannot be counted.) We should be careful to distinguish the empty set φ from the empty string ε. Formal language theory is a beautiful subject, but I shall suppress my urge to do it right and try to go easy on the formalism.

3.3.1: Strings and Languages

Example: {0,1}, presumably φ (uninteresting), {a,b,c} (typical for exam questions), ascii, unicode, latin-1.

Definition: A string over an alphabet is a finite sequence of symbols from that alphabet. Strings are often called words or sentences.

Example: Strings over {0,1}: ε, 0, 1, 111010. Strings over ascii: ε, sysy, the string consisting of 3 blanks.

Definition: The length of a string is the number of symbols (counting duplicates) in the string.

Definition: A language over an alphabet is a countable set of strings over the alphabet.

Example: The set of all grammatical English sentences with five, eight, or twelve words is a language over ascii.

Definition: The concatenation of strings s and t is the string formed by appending the string t to s. It is written st.

We view concatenation as a product (see Monoid in wikipedia http://en.wikipedia.org/wiki/Monoid). It is thus natural to define s⁰=ε and sⁱ⁺¹=sⁱs.

More string terminology

A prefix of a string is a portion starting from the beginning and a suffix is a portion ending at the end. More formally,

Definitions: A proper prefix of s is a prefix of s other than ε and s itself. Similarly, proper suffixes and proper substrings of s do not include ε and s.

Definition: A subsequence of s is formed by deleting (possibly zero) positions from s. We say positions rather than characters since s may for example contain 5 occurrences of the character Q and we only want to delete a certain 3 of them.
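The definition of subsequence translates directly into a short C check. The function name is_subsequence is an illustrative choice; the greedy left-to-right matching it uses is one standard way to decide the question.

```c
/* Return 1 if t is a subsequence of s, i.e. t can be obtained from s by
   deleting zero or more positions; otherwise return 0. */
int is_subsequence(const char *t, const char *s) {
    while (*t != '\0') {
        while (*s != '\0' && *s != *t) s++;   /* skip (delete) positions of s */
        if (*s == '\0') return 0;             /* ran out of s before matching t */
        s++;                                  /* matched one position */
        t++;
    }
    return 1;
}
```

Note that order matters: "ba" is not a subsequence of "ab" even though both characters occur, because deleting positions cannot reorder the remaining ones.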

3.3.2: Operations on Languages

Definition: The union of L and M, written L ∪ M, is the set-theoretic union, i.e., it consists of all words (strings) in either L or M (or both).

Example: Let the alphabet be ascii. The union of {Grammatical English sentences with one, three, or five words} with {Grammatical English sentences with two or four words} is {Grammatical English sentences with five or fewer words}.

Definition: The concatenation of L and M is the set of all strings st, where s is a string of L and t is a string of M.

We again view concatenation as a product and write LM for the concatenation of L and M.

Examples: Let the alphabet A={a,b,c,1,2,3}. The concatenation of the languages L={a,b,c} and M={1,2} is LM={a1,a2,b1,b2,c1,c2}. The concatenation of {aa,b,c} and {1,2,ε} is {aa1,aa2,aa,b1,b2,b,c1,c2,c}.

Definition: As with strings, it is natural to define powers of a language L.
L⁰={ε}, which is not φ.
Lⁱ⁺¹=LⁱL.

Definition: The (Kleene) closure of L, denoted L^* is
L⁰ ∪ L¹ ∪ L² ...

Example: Let both the alphabet and the language L be {0,1,2,3,4,5,6,7,8,9}. More formally, I would say the alphabet A is {0,1,2,3,4,5,6,7,8,9} and the language L is the set of all strings of length one over A. L⁺ gives all unsigned integers, but with some ugly versions. It has 3, 03, 000003.
{0} ∪ ( {1,2,3,4,5,6,7,8,9} ({0,1,2,3,4,5,6,7,8,9}^* ) ) seems better.
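Membership in the improved language is easy to test by hand, as this C sketch shows. The function name is_canonical_unsigned is hypothetical; the three cases follow the union and concatenation in the expression above.

```c
/* Return 1 if s is in {0} ∪ {1..9}{0..9}*, else 0. */
int is_canonical_unsigned(const char *s) {
    if (s[0] == '0') return s[1] == '\0';       /* the string "0" alone */
    if (s[0] < '1' || s[0] > '9') return 0;     /* must start with 1..9 (rejects "") */
    for (int i = 1; s[i] != '\0'; i++)          /* then any number of digits */
        if (s[i] < '0' || s[i] > '9') return 0;
    return 1;
}
```

The ugly versions 03 and 000003 are rejected, while 3 and 30 are accepted.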

In these notes I may write * for ^* and + for ⁺, but that is strictly speaking wrong and I will not do it on the board or on exams or on lab assignments.

Example: {a,b}* is {ε,a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,...}.
{a,b}+ is {a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,...}.
{ε,a,b}* is {ε,a,b,aa,ab,ba,bb,...}.
{ε,a,b}+ is the same as {ε,a,b}*.

The book gives other examples based on L={letters} and D={digits}, which you should read.

3.3.3: Regular Expressions

The idea is that the regular expressions over an alphabet consist of
the alphabet, and expressions using union, concatenation, and *,
but it takes more words to say it right. For example, I didn't include (). Note that (A ∪ B)* is definitely not A* ∪ B* (* does not distribute over ∪) so we need the parentheses.

The book's definition includes many () and is more complicated than I think is necessary. However, it has the crucial advantages of being correct and precise.

I will try a slightly different approach, but note again that there is nothing wrong with the book's approach (which appears in both first and second editions, essentially unchanged).

Definition: The regular expressions and associated languages over an alphabet consist of

Parentheses, if present, control the order of operations. Without parentheses the following precedence rules apply.

The postfix unary operator * has the highest precedence. The book mentions that it is left associative. (I don't see how a postfix unary operator can be right associative or how a prefix unary operator such as unary minus could be left associative.)

The book gives various algebraic laws (e.g., associativity) concerning these operators.

Examples
Let the alphabet A={a,b,c}. Write a regular expression representing the language consisting of all words with

Production                     Action

program →                      { top = null; }
           block

block   →  {                   { saved = top;
                                 top = new Env(top);
                                 print("{ "); }
           decls stmts }       { top = saved;
                                 print("} "); }

decls   →  decls decl
        |  ε

decl    →  type id ;           { s = new Symbol;
                                 s.type = type.lexeme;
                                 top.put(id.lexeme, s); }

stmts   →  stmts stmt
        |  ε

stmt    →  block
        |  factor ;            { print("; "); }

factor  →  id                  { s = top.get(id.lexeme);
                                 print(s.type); }

2.6: Lexical Analysis