Compilers

Start Lecture #12

Remark: Lab 4 assigned.

6.5: Type Checking

Type Checking includes several aspects.

The language comes with a type system, i.e., a set of rules saying what types can appear where.
The compiler assigns a type expression to parts of the source program.
The compiler checks that the type usage in the program conforms to the type system for the language.

All type checking could be done at run time: The compiler generates code to do the checks. Some languages have very weak typing; for example, variables can change their type during execution. Often these languages need run-time checks. Examples include Lisp, SNOBOL, and APL.

A sound type system guarantees that all checks can be performed prior to execution. This does not mean that a given compiler will make all the necessary checks.

An implementation is strongly typed if compiled programs are guaranteed to run without type errors.

6.5.1: Rules for Type Checking

There are two forms of type checking.

We will learn type synthesis where the types of parts are used to infer the type of the whole. For example, integer+real=real.
Type inference is very slick. The type of a construct is determined from usage. This permits languages like ML to check types even though names need not be declared.

We consider type checking for expressions. Checking statements is very similar. View the statement as a function having its components as arguments and returning void. The SDDs below (and for lab 4) contain type checking (and coercions) for assignment statements.

6.5.2: Type Conversions

A very strict type system would do no automatic conversion. Instead it would offer functions for the programmer to explicitly convert between selected types. Then either the program has compatible types or is in error.

However, we will consider a more liberal approach in which the language permits certain implicit conversions that the compiler is to supply. This is called type coercion. Explicit conversions supplied by the programmer are called casts.

We continue to work primarily with the two basic types used in lab 3, namely integer and real, and postulate a unary function denoted (real) that converts an integer into the real having the same value. Nonetheless, we do consider the more general case where there are multiple types some of which have coercions (often called widening). For example in C/Java, int can be widened to long, which in turn can be widened to float as shown in the figure to the right.

Mathematically the hierarchy on the right is a partially ordered set (poset) in which each pair of elements has a least upper bound (LUB). For many binary operators (all the arithmetic ones we are considering, but not exponentiation) the two operands are converted to the LUB. So adding a short to a char requires both to be converted to an int. Adding a byte to a float requires the byte to be converted to a float (the float remains a float and is not converted).

Checking and Coercing Types for Basic Arithmetic

The steps for addition, subtraction, multiplication, and division are all essentially the same: Convert each operand, if necessary, to the LUB of the two operand types and then perform the arithmetic on the (converted or original) values. Note that conversion often requires the generation of code.

Two functions are convenient.

LUB(t1,t2) returns the type that is the LUB of the two given types. It signals an error if there is no LUB, for example if one of the types is an array.
widen(a,t,w,newcode,newaddr). Given an address a of type t, and a (hopefully) wider type w, produce the instructions newcode needed so that the address newaddr holds the value of a converted to type w.

LUB is simple: just look at the type lattice. If one of the type arguments is not in the lattice, signal an error; otherwise find the lowest common ancestor.
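
The lowest-common-ancestor lookup can be sketched in Python as follows. This is only an illustration: the table WIDER assumes a Java-like widening chain (byte, short, int, long, float, double, with char joining at int), and the names WIDER, ancestors, and lub are made up for this sketch rather than taken from the lab code.

```python
# Hypothetical widening hierarchy: each type maps to the next wider type.
# Assumed (Java-like) chain: byte -> short -> int -> long -> float -> double,
# with char -> int joining the chain at int.
WIDER = {
    "byte": "short",
    "short": "int",
    "char": "int",
    "int": "long",
    "long": "float",
    "float": "double",
    "double": None,
}


def ancestors(t):
    """Return t followed by every type wider than t, narrowest first."""
    if t not in WIDER:
        raise TypeError(f"type {t!r} is not in the widening hierarchy")
    chain = []
    while t is not None:
        chain.append(t)
        t = WIDER[t]
    return chain


def lub(t1, t2):
    """Least upper bound: the narrowest type that both t1 and t2 widen to."""
    wider_than_t2 = set(ancestors(t2))   # raises TypeError if t2 is unknown
    for t in ancestors(t1):              # walk upward from t1
        if t in wider_than_t2:           # first common ancestor is the LUB
            return t
    raise TypeError(f"no LUB for {t1!r} and {t2!r}")
```

With this table, lub("short", "char") gives "int" and lub("byte", "float") gives "float", matching the two examples above; a type outside the table, such as an array type, raises an error.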

widen is more interesting. It involves n² cases for n types. Many of these are error cases (e.g., if t is wider than w). Below is the code for our situation with two possible types integer and real. The four cases consist of 2 nops (when t=w), one error (t=real; w=integer) and one conversion (t=integer; w=real).

    widen (a:addr, t:type, w:type, newcode:string, newaddr:addr)
      if t=w
        newcode = ""
        newaddr = a
      else if t=integer and w=real
        newaddr = new Temp()
        newcode = gen(newaddr = (real) a)
      else signal error
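
For readers who want to run the four cases, here is a rough Python rendering of the pseudocode above. It is a sketch under two assumptions: the outputs newcode and newaddr are returned as a pair instead of being passed as parameters, and temporaries are named t1, t2, ... by a hypothetical helper new_temp.

```python
import itertools

_temp_ids = itertools.count(1)


def new_temp():
    """Return a fresh temporary name such as t1, t2, ... (hypothetical helper)."""
    return f"t{next(_temp_ids)}"


def widen(a, t, w):
    """Return (newcode, newaddr) so that newaddr holds a converted to type w.

    Only the two lab types are handled: "integer" and "real".
    """
    if t == w:                               # two of the four cases: nothing to do
        return [], a
    if t == "integer" and w == "real":       # the one genuine conversion
        newaddr = new_temp()
        return [f"{newaddr} = (real) {a}"], newaddr
    raise TypeError(f"cannot widen {t} to {w}")   # t = real, w = integer
```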

With these two functions it is not hard to modify the rules to catch type errors and perform coercions for arithmetic expressions.

Maintain the type of each operand by defining type attributes for e, t, and f.
Coerce each operand to the LUB.

This requires that we have type information for the base entities, identifiers and numbers. The lexer can supply the type of the numbers. We retrieve it via get(NUM.type).

It is more interesting for the identifiers. We insert that information when we process declarations. So we now have another semantic check: Is the identifier declared before it is used?

I will use the function get(ID.type), which returns the type from the identifier table and signals an error if it is not there. The original SDD for assignment statements and the changes for arrays were both given earlier.

Before taking on the entire SDD, let's examine a particularly interesting entry. Consider the assignment statement

    A[3/X+4] := X*5+Y;

whose parse tree is shown on the right. Consider the ra node, i.e., the node corresponding to the production

    ra → [ e ] := e₁ ;

When the tree traversal gets to this node, its parent has passed in the value of the inherited attribute ra.id = ID.entry. Thus the ra node has access to the identifier table entry for ID, which in our example is the variable A.

Prior to doing its calculations, the ra node invokes its children and gets back all the synthesized attributes. To summarize, when the ra node performs its calculations, it has available:

ra.id: the identifier entry for A.
e.addr/e.code: executing e.code results in e.addr containing the value of the expression 3/X+4.
e₁.addr/e₁.code: executing e₁.code results in e₁.addr containing the value of the expression X*5+Y.
e.type/e₁.type: The types of the expressions.

What must the ra node do?

Ensure execution of e.code and e₁.code.
Check that e.type is int (I don't do this).
Multiply e by the base width of the array A. (We need a temporary, ra.t1, to hold the computed value).
Widen e₁ to the base type of A. (We need, and widen generates, a temporary ra.addr to hold the widened value).
Do the actual assignment of X*5+Y to A[3/X+4].
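
The steps in this list can be sketched in Python as below. This is an illustration, not the lab solution: the children's attributes are assumed to arrive as dictionaries with keys "code", "addr", and "type", the helper names make_new_temp and array_assign are invented here, and only the integer-to-real coercion is handled. Unlike the notes, the sketch does check that the index is an integer.

```python
import itertools


def make_new_temp():
    """Return a function that hands out fresh temporaries t1, t2, ..."""
    counter = itertools.count(1)
    return lambda: f"t{next(counter)}"


def array_assign(name, base_type, base_width, e, e1, new_temp):
    """Emit code for  name[e] = e1  from the children's synthesized attributes.

    e and e1 are dicts with keys "code" (list of instructions), "addr", "type".
    """
    if e["type"] != "integer":                   # index check (skipped in the notes)
        raise TypeError("array index must be an integer")
    t1 = new_temp()                              # ra.t1: holds the scaled index
    if e1["type"] == base_type:                  # widening is a nop
        widen_code, rhs = [], e1["addr"]
    elif (e1["type"], base_type) == ("integer", "real"):
        rhs = new_temp()                         # ra.addr: holds the widened value
        widen_code = [f"{rhs} = (real) {e1['addr']}"]
    else:
        raise TypeError(f"cannot widen {e1['type']} to {base_type}")
    return (e["code"]                                      # evaluate the index
            + [f"{t1} = {base_width} * {e['addr']}"]       # scale by the base width
            + e1["code"]                                   # evaluate the right-hand side
            + widen_code                                   # coerce it if necessary
            + [f"{name}[{t1}] = {rhs}"])                   # the assignment itself
```

For a real array B with an assumed base width of 8, an integer index i, and an integer right-hand side n (both with empty code), the result is the three instructions t1 = 8 * i, t2 = (real) n, and B[t1] = t2.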

I hope this clarifies the semantic rules for this production.

Remark: In preparing this explanation, I fixed errors in the table below and also in the corresponding entry of the last table of section 6.4, which we did last time.

Assignment Statements With Type Checks and Coercions
Production	Semantic Rule

ids → ID ra	ra.id = ID.entry
	ids.code = ra.code

ra → = e ;	widen(e.addr, e.type, ra.id.basetype, ra.code₁, ra.addr)
	ra.code = e.code || ra.code₁ || gen(ra.id.lexeme = ra.addr)

ra → [ e ] = e₁ ;	ra.t1 = new Temp()
	widen(e₁.addr, e₁.type, ra.id.basetype, ra.code₁, ra.addr)
	ra.code = e.code || gen(ra.t1 = getBaseWidth(ra.id) * e.addr) || e₁.code || ra.code₁ || gen(ra.id.lexeme[ra.t1] = ra.addr)

e → e₁ + t	e.addr = new Temp()
	e.type = LUB(e₁.type, t.type)
	widen(e₁.addr, e₁.type, e.type, e.code1, e.addr1)
	widen(t.addr, t.type, e.type, e.code2, e.addr2)
	e.code = e₁.code || t.code || e.code1 || e.code2 || gen(e.addr = e.addr1 + e.addr2)

e → e₁ - t	e.addr = new Temp()
	e.type = LUB(e₁.type, t.type)
	widen(e₁.addr, e₁.type, e.type, e.code1, e.addr1)
	widen(t.addr, t.type, e.type, e.code2, e.addr2)
	e.code = e₁.code || t.code || e.code1 || e.code2 || gen(e.addr = e.addr1 - e.addr2)

e → t	e.addr = t.addr
	e.type = t.type
	e.code = t.code

t → t₁ * f	t.addr = new Temp()
	t.type = LUB(t₁.type, f.type)
	widen(t₁.addr, t₁.type, t.type, t.code1, t.addr1)
	widen(f.addr, f.type, t.type, t.code2, t.addr2)
	t.code = t₁.code || f.code || t.code1 || t.code2 || gen(t.addr = t.addr1 * t.addr2)

t → t₁ / f	t.addr = new Temp()
	t.type = LUB(t₁.type, f.type)
	widen(t₁.addr, t₁.type, t.type, t.code1, t.addr1)
	widen(f.addr, f.type, t.type, t.code2, t.addr2)
	t.code = t₁.code || f.code || t.code1 || t.code2 || gen(t.addr = t.addr1 / t.addr2)

t → f	t.addr = f.addr
	t.type = f.type
	t.code = f.code

f → ( e )	f.addr = e.addr
	f.type = e.type
	f.code = e.code

f → NUM	f.addr = get(NUM.lexeme)
	f.type = get(NUM.type)
	f.code = ε

f → if	f.addr = if.addr
	f.type = if.type
	f.code = if.code

if → ID	if.addr = ID.lexeme
	if.type = getBaseType(ID.type)
	if.code = ε

if → ID [ e ] Note: e not es	if.t1 = new Temp()
	if.addr = new Temp()
	if.type = getBaseType(ID.type)
	if.code = e.code || gen(if.t1 = e.addr * getBaseWidth(ID.entry)) || gen(if.addr = ID.lexeme[if.t1])
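
The arithmetic rules of this table can be tried out with a small Python sketch. It is a simplification under stated assumptions: expressions are given as nested tuples such as ("+", left, right) rather than parsed, only the types integer and real exist, and the class Translator and its method names are invented for illustration.

```python
import itertools

RANK = {"integer": 0, "real": 1}     # the two lab types, narrower first


def lub(t1, t2):
    """LUB for the two-type hierarchy: the wider of the two types."""
    return t1 if RANK[t1] >= RANK[t2] else t2


class Translator:
    def __init__(self, symbols):
        self.symbols = symbols                   # identifier table: name -> type
        self.temps = itertools.count(1)

    def new_temp(self):
        return f"t{next(self.temps)}"

    def widen(self, addr, t, w):
        """Return (newcode, newaddr) for converting addr from type t to type w."""
        if t == w:
            return [], addr
        if t == "integer" and w == "real":
            new = self.new_temp()
            return [f"{new} = (real) {addr}"], new
        raise TypeError(f"cannot widen {t} to {w}")

    def expr(self, node):
        """Return the synthesized attributes (code, addr, type) of node."""
        if isinstance(node, int):                # NUM of type integer
            return [], str(node), "integer"
        if isinstance(node, float):              # NUM of type real
            return [], str(node), "real"
        if isinstance(node, str):                # ID: must already be declared
            if node not in self.symbols:
                raise NameError(f"{node} used before declaration")
            return [], node, self.symbols[node]
        op, left, right = node                   # e -> e1 op t
        code1, addr1, type1 = self.expr(left)
        code2, addr2, type2 = self.expr(right)
        addr = self.new_temp()                   # e.addr = new Temp()
        typ = lub(type1, type2)                  # e.type = LUB(...)
        wcode1, waddr1 = self.widen(addr1, type1, typ)
        wcode2, waddr2 = self.widen(addr2, type2, typ)
        code = code1 + code2 + wcode1 + wcode2 + [f"{addr} = {waddr1} {op} {waddr2}"]
        return code, addr, typ
```

For instance, with i declared integer and r declared real, Translator({"i": "integer", "r": "real"}).expr(("+", ("*", "i", 2), "r")) produces the code t1 = i * 2, t3 = (real) t1, t2 = t3 + r, with the result in t2 and of type real. The temporary for the sum is allocated before the one for the conversion, as in the table.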

Homework: Same question as the previous homework (What code is generated for the program written above?). But the answer is different!

6.5.3: Overloading of Functions and Operators

Skipped.

Overloading is when a function or operator has several definitions depending on the types of the operands and result.

6.5.4: Type Inference and Polymorphic Functions

Skipped.

6.5.5: An Algorithm for Unification

Skipped.

6.6: Control Flow

Control flow includes the study of Boolean expressions, which have two roles.

They can be computed and treated similarly to integers or reals. One can declare Boolean variables, and there are Boolean constants and Boolean operators. There are also relational operators that produce Boolean values from arithmetic operands. From this point of view, Boolean expressions are similar to the expressions we have already treated. Our previous semantic rules could be modified to generate the code needed to evaluate these expressions.
They are used in certain statements that alter the normal flow of control. In this regard, we have something new to learn.

6.6.1: Boolean Expressions

One question that comes up with Boolean expressions is whether both operands need be evaluated. If we are evaluating A or B and find that A is true, must we evaluate B? For example, consider evaluating

     A=0  OR  3/A < 1.2

when A is zero.

This issue arises in other cases as well. Consider A*F(x). If the compiler knows that for this run A is zero, must it generate code to evaluate F(x)? Don't forget that functions can have side effects.
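
Both questions can be observed directly in a language with fixed evaluation rules. The following Python snippet is only an illustration of the language-level behavior (Python's or happens to short-circuit, and function arguments are evaluated before a call); the helper names f and eager_or are made up, and nothing here is compiler output.

```python
def f(x, log):
    """Stand-in for F(x): records that it was called (a side effect)."""
    log.append(x)
    return x + 1


def eager_or(left, right):
    """Both arguments are evaluated before the call, so nothing is skipped."""
    return left or right


A = 0

# Short-circuit: the right operand is never evaluated because A == 0 is true.
short_circuit_result = (A == 0) or (3 / A < 1.2)

# Eager evaluation: 3 / A is computed before the call and divides by zero.
try:
    eager_or(A == 0, 3 / A < 1.2)
    eager_raised = False
except ZeroDivisionError:
    eager_raised = True

# A * F(x): Python evaluates F(x) even though A is zero, so the side effect happens.
calls = []
product = A * f(7, calls)
```

After running this, short_circuit_result is True, eager_raised is True, and calls contains 7 even though product is 0.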

6.6.2: Short-Circuit Code

This is also called jumping code. Here the Boolean operators AND, OR, and NOT do not appear in the generated instruction stream. Instead we just generate jumps to either the true branch or the false branch.

6.6.3: Flow-of-Control Statements

The lab3 grammar has the following productions concerning flow of control statements. I added a start production which will be used to initiate a series of *.next inherited attributes.

    start             → program
    program           → function-def program | procedure-def program | ε
    procedure-def     → PROCEDURE name-and-parameters IS declarations BEGIN statement statements END ;
    statements        → statement statements | ε
    statement         → keyword-statement | identifier-statement
    keyword-statement → return-statement | while-statement | if-statement
    if-statement      → IF condition THEN statements optional-else END ;
    optional-else     → ELSE statements | ε
    while-statement   → WHILE condition DO statements END ;

I do not include the production for function-def since, for control flow, it is the same as procedure-def. I don't show the productions for name-and-parameters, declarations, identifier-statement, and return-statement since they do not have conditional control flow (return just generates a goto, but doesn't use any of the techniques in this section). The production for condition will be done in the next section.

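To make the tables fit we abbreviate the names of the nonterminals appearing in the grammar above to pg, fd, pd, np, ds, ss, s, ks, is, rs, ws, is, c, and oe. Note that is abbreviates both identifier-statement and if-statement; in the table below it denotes if-statement.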

In this section we will produce an SDD for these productions under the assumption that the SDD for condition generates jumps to the labels c.true and c.false (depending of course on whether the condition c is true or false).

In the lab3 grammar, a condition is simply two expressions separated by a relational operator e RELOP e₁. In the next section we give the SDD for a more general nonterminal boolean-expression, for which the lab3 condition is a special case.

The grammar for boolean-expression given in the next section is not LL(1). Since the lab3 grammar does not have general Boolean expressions, I did not massage the grammar to make it LL(1).

I don't know why the sections aren't in the reverse order and I came close to reversing the order of presentation.

The diagrams on the right give the idea for the three basic control flow statements, if-then (not in the lab3 grammar), if-then-else, and while-do. The table below gives the details for the latter two via an SDD.

If and While SDDs
Production	Semantic Rules

start → pg	pg.next = newLabel()
	start.code = pg.code || label(pg.next)

pg → pd pg₁	pd.next = newLabel()
	pg₁.next = pg.next
	pg.code = pd.code || label(pd.next) || pg₁.code

pg → ε	pg.code = ""

pd → PROC np IS ds BEGIN s ss END ;	s.next = newLabel()
	ss.next = pd.next
	pd.code = s.code || label(s.next) || ss.code

ss → s ss₁	s.next = newLabel()
	ss₁.next = ss.next
	ss.code = s.code || label(s.next) || ss₁.code

ss → ε	ss.code = ""

s → ks	ks.next = s.next
	s.code = ks.code

ks → is	is.next = ks.next
	ks.code = is.code

is → IF c THEN ss oe END ;	c.true = newLabel()
	c.false = newLabel()
	ss.next = is.next
	oe.next = is.next
	is.code = c.code || label(c.true) || ss.code || gen(goto is.next) || label(c.false) || oe.code

oe → ELSE ss	ss.next = oe.next
	oe.code = ss.code

oe → ε	oe.code = ""

ks → ws	ws.next = ks.next
	ks.code = ws.code

ws → WHILE c DO ss END ;	begin = newLabel()
	c.true = newLabel()
	c.false = ws.next
	ss.next = begin
	ws.code = label(begin) || c.code || label(c.true) || ss.code || gen(goto begin)

The treatment of the various *.next attributes deserves some comment. Each statement is given, as an inherited attribute, a label *.next which the parent normally places after the code for the child. The child and/or the parent can generate code that performs a goto to this label. An alternative would be for the child to itself generate the label and place it as the last component of its code. I believe this alternative would make for a clearer SDD; however, the method chosen does have two advantages.

Look at ws.code, the code attribute for the while statement. The parent does not place ss.next after ss.code. If we used the alternative, the parent would need a goto from after ss.code to the begin label.
Perhaps there is a case where it is awkward for the child to place something at the end of its code (but I don't quite see how this could be). To investigate this possibility one should examine the treatment of case statements below.
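
To see how the *.next labels flow, here is a rough Python sketch of the if and while rules from the SDD above. It rests on several assumptions: statements are nested tuples instead of parse-tree nodes, a condition is a string such as "i < 10" whose code is a conditional jump to c.true followed by a goto c.false, and the class and method names are invented for illustration.

```python
import itertools


class ControlFlowGen:
    def __init__(self):
        self.labels = itertools.count(1)

    def new_label(self):
        return f"L{next(self.labels)}"

    def cond(self, c, true, false):
        """Assumed shape of c.code: jump to c.true or to c.false."""
        return [f"if {c} goto {true}", f"goto {false}"]

    def stmts(self, ss, next_label):
        """ss -> s ss1 | ε. next_label mirrors ss.next; the rules only pass it along."""
        code = []
        for s in ss:
            s_next = self.new_label()                        # s.next = newLabel()
            code += self.stmt(s, s_next) + [f"{s_next}:"]    # s.code || label(s.next)
        return code

    def stmt(self, s, next_label):
        kind = s[0]
        if kind == "simple":                     # any statement without control flow
            return [s[1]]
        if kind == "if":                         # is -> IF c THEN ss oe END ;
            _, c, then_ss, else_ss = s
            true, false = self.new_label(), self.new_label()
            return (self.cond(c, true, false)
                    + [f"{true}:"] + self.stmts(then_ss, next_label)
                    + [f"goto {next_label}", f"{false}:"]
                    + self.stmts(else_ss, next_label))
        if kind == "while":                      # ws -> WHILE c DO ss END ;
            _, c, body = s
            begin, true = self.new_label(), self.new_label()
            return ([f"{begin}:"] + self.cond(c, true, next_label)   # c.false = ws.next
                    + [f"{true}:"] + self.stmts(body, begin)         # ss.next = begin
                    + [f"goto {begin}"])
        raise ValueError(f"unknown statement kind {kind!r}")
```

For the one-statement list [("while", "i < 10", [("simple", "i = i + 1")])] the sketch emits L2:, if i < 10 goto L3, goto L1, L3:, i = i + 1, L4:, goto L2, L1:. The label L1 is the while statement's next label, placed by the parent after the loop, and the false exit of the condition jumps straight to it.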

Homework: Give the SDD for a repeat statement
REPEAT ss WHILE c END ;

6.6.4: Control-Flow Translation of Boolean Expressions

Boolean Expressions
Production	Semantic Rules

BE → BE₁ || BT	BE₁.true = BE.true
	BE₁.false = newLabel()
	BT.true = BE.true
	BT.false = BE.false
	BE.code = BE₁.code || label(BE₁.false) || BT.code

BE → BT	BT.true = BE.true
	BT.false = BE.false
	BE.code = BT.code

BT → BT₁ && BF	BT₁.true = newLabel()
	BT₁.false = BT.false
	BF.true = BT.true
	BF.false = BT.false
	BT.code = BT₁.code || label(BT₁.true) || BF.code

BT → BF	BF.true = BT.true
	BF.false = BT.false
	BT.code = BF.code

BF → ! BF₁	BF₁.true = BF.false
	BF₁.false = BF.true
	BF.code = BF₁.code

BF → true	BF.code = gen(goto BF.true)

BF → false	BF.code = gen(goto BF.false)

BF → ID	BF.code = gen(if get(ID.lexeme) goto BF.true) || gen(goto BF.false)

BF → E relop E₁	BF.code = E.code || E₁.code || gen(if E.addr relop.lexeme E₁.addr goto BF.true) || gen(goto BF.false)

Do on the board the translation of

    if ( x < 5 || x > 10 && x == y ) x = 3 ;

We get

        if x < 5 goto L₂
        goto L₃
    L₃: if x > 10 goto L₄
        goto L₁
    L₄: if x == y goto L₂
        goto L₁
    L₂: x = 3

Note that there are three extra gotos. One is a goto to the next statement. Two others could be eliminated by using ifFalse.
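
The jumps for the condition in this example can be reproduced with a short Python sketch of the Boolean-expression rules. The assumptions are that expressions are nested tuples, that the operands of a relational operator are already addresses (so E.code is empty), that constants are written ("true",) and ("false",), and that labels are printed on lines of their own. The function name jumping_code is invented for this sketch.

```python
import itertools


def jumping_code(b, true, false, new_label):
    """Return jumping code for Boolean expression b with exits true and false."""
    op = b[0]
    if op == "||":                               # BE -> BE1 || BT
        _, left, right = b
        left_false = new_label()                 # BE1.false = newLabel()
        return (jumping_code(left, true, left_false, new_label)
                + [f"{left_false}:"]
                + jumping_code(right, true, false, new_label))
    if op == "&&":                               # BT -> BT1 && BF
        _, left, right = b
        left_true = new_label()                  # BT1.true = newLabel()
        return (jumping_code(left, left_true, false, new_label)
                + [f"{left_true}:"]
                + jumping_code(right, true, false, new_label))
    if op == "!":                                # BF -> ! BF1: swap the exits
        return jumping_code(b[1], false, true, new_label)
    if op == "true":
        return [f"goto {true}"]
    if op == "false":
        return [f"goto {false}"]
    relop, e, e1 = b                             # BF -> E relop E1
    return [f"if {e} {relop} {e1} goto {true}", f"goto {false}"]
```

Calling it on ("||", ("<", "x", "5"), ("&&", (">", "x", "10"), ("==", "x", "y"))) with true exit L2, false exit L1, and new labels starting at L3 yields exactly the six jumps shown above, with the labels L3 and L4 in the same positions; the enclosing if statement then supplies L₂: x = 3.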

6.6.5: Avoiding Redundant Gotos

Skipped.

6.6.6: Boolean Values and Jumping Code

If there are boolean variables (or variables into which a boolean value can be placed), we can have boolean assignment statements. That is we might evaluate boolean expressions outside of control flow statements.

Recall that the code we generated for boolean expressions (inside control flow statements) used inherited attributes to push down the tree the exit labels B.true and B.false. How are we to deal with Boolean assignment statements?

Two Methods for Booleans: Method 1

Up to now we have used the so-called jumping code method for Boolean quantities. We evaluated Boolean expressions (in the context of control flow statements) by using inherited attributes to push down the tree the true and false exits (i.e., the target locations to jump to if the expression evaluates to true and false).

With this method if we have a Boolean assignment statement, we just let the true and false exits lead respectively to statements

    LHS = true
    LHS = false
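
As a rough illustration of Method 1, the Python sketch below emits the code for assigning a single relational test to a variable. It assumes the condition is just E relop E₁ with both operands already in addresses, and it adds a third label after the two assignments so that the true branch can skip the false one; the function name and that extra label are assumptions of this sketch, not something taken from the notes.

```python
import itertools


def assign_relational(lhs, e, relop, e1, new_label):
    """Emit jumping code for  lhs = (e relop e1)."""
    true, false, done = new_label(), new_label(), new_label()
    return [
        f"if {e} {relop} {e1} goto {true}",   # jumping code for the condition
        f"goto {false}",
        f"{true}:",
        f"{lhs} = true",                      # the true exit
        f"goto {done}",                       # skip over the false exit
        f"{false}:",
        f"{lhs} = false",                     # the false exit
        f"{done}:",
    ]
```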

Two Methods for Booleans: Method 2

In the second method we simply treat boolean expressions as expressions. That is, we just mimic the actions we did for integer/real evaluations. Thus Boolean assignment statements like

    a = b OR (c AND d AND (x < y))

just work.

For control flow statements like

    while boolean-expression do statement-list end ;
    if boolean-expression then statement-list else statement-list end ;

we simply evaluate the boolean expression as if it were part of an assignment statement and then have two jumps to where we should go if the result is true or false.

However, as mentioned before, this is wrong.
In C and other languages, if (a==0 || 1/a > f(a)) is guaranteed not to divide by zero, and the above implementation fails to provide this guarantee. We must implement short-circuit boolean evaluation.