Compilers

Start Lecture #11

6.5: Type Checking

Type Checking includes several aspects.

  1. The language comes with a type system, i.e., a set of rules saying what types can appear where.
  2. The compiler assigns type expressions to parts of the source program.
  3. The compiler ensures that the type usage in the program conforms to the type system for the language. This does not mean that the necessary type checks are performed at compile time.

For any language, all type checking could be done at run time, i.e., there would be no compile-time checking. However, that does not mean the compiler is absolved from type checking. Instead, the compiler generates code to perform the checks at run time.

It is normally preferable to perform the checks at compile time, if possible. Some languages have very weak typing; for example, variables can change their type during execution. Often these languages need run-time checks. Examples include lisp, snobol, and apl.

A sound type system guarantees that all checks can be performed prior to execution. This does not mean that a given compiler will make all the necessary checks.

An implementation is strongly typed if compiled programs are guaranteed to run without type errors.

6.5.1: Rules for Type Checking

There are two forms of type checking.

  1. We will learn type synthesis where the types of parts are used to infer the type of the whole. For example, integer+real=real.
  2. Type inference is very slick. The type of a construct is determined from its usage. This permits languages like ML to check types even though names need not be declared.

We will implement type checking for expressions. Type checking statements is similar. The SDDs below (and for lab 4) contain type checking (and coercions) for assignment statements as well as expressions.

6.5.2: Type Conversions

A very strict type system would do no automatic conversion. Instead it would offer functions for the programmer to explicitly convert between selected types. With such a system, either the program has compatible types or is in error. Such explicit conversions supplied by the programmer are called casts.

We, however, will consider a more liberal approach in which the language permits certain implicit conversions that the compiler is to supply. This is called type coercion.

(Figure: the widening hierarchy of types)

We continue to work primarily with the two basic types integer and real, and postulate a unary function denoted (real) that converts an integer into the real having the same value. Nonetheless, we do consider the more general case where there are multiple types some of which have coercions (often called widenings). For example in C/Java, int can be widened to long, which in turn can be widened to float as shown in the figure to the right.

Mathematically the hierarchy on the right is a partially ordered set (poset) in which each pair of elements has a least upper bound (LUB). For many binary operators (all the arithmetic ones we are considering, but not exponentiation) the two operands are converted to the LUB. So adding a short to a char requires both to be converted to an int. Adding a byte to a float requires the byte to be converted to a float (the float remains a float and is not converted).

Checking and Coercing Types for Basic Arithmetic

The steps for addition, subtraction, multiplication, and division are all essentially the same:

  1. Determine the result type, which is the LUB of the two operand types.
  2. Convert each operand to the result type. This step may require generating code or may be a no-op.
  3. Perform the arithmetic on the converted types.
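
For example, suppose n is an integer and r is a real (an illustrative sketch in the three-address notation of these notes; the temporary names are made up). The LUB is real, so only n needs converting, and n+r becomes

    t1 = (real) n
    t2 = t1 + r

The first instruction is step 2 (the widening of n; r needs no conversion) and the second is step 3, now a purely real addition.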

Two functions are convenient.

  1. LUB(t1,t2) returns the type that is the LUB of the two given types. It signals an error if there is no LUB, for example if one of the types is an array.
  2. widen(A,T,W,newcode,newaddr). Given an address A of type T, and a wider (or equally wide) type W, produce newcode, the instructions needed so that the address newaddr is the conversion of address A to type W.

LUB is simple: just look at the type lattice. If one of the type arguments is not in the lattice, signal an error; otherwise find the lowest common ancestor. For our case the lattice is trivial: real is above int.
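
To make the lowest-common-ancestor idea concrete, here is a small sketch in Python (my own illustration, not lab code). The lattice is encoded as a child-to-parent map; the particular hierarchy and type names are assumptions, roughly the C/Java widenings mentioned above.

    # A widening lattice encoded as a child -> parent map (None marks the top).
    # Assumed hierarchy, roughly the Java widening rules discussed above.
    PARENT = {"char": "int", "byte": "short", "short": "int",
              "int": "long", "long": "float", "float": "double", "double": None}

    def ancestors(t):
        """Return t followed by its chain of ancestors up to the top."""
        chain = []
        while t is not None:
            chain.append(t)
            t = PARENT[t]
        return chain

    def lub(t1, t2):
        """Least upper bound of two types; signal an error if either is not in the lattice."""
        if t1 not in PARENT or t2 not in PARENT:
            raise TypeError("no LUB for %s and %s" % (t1, t2))
        above_t1 = set(ancestors(t1))
        for t in ancestors(t2):        # lowest ancestor of t2 that is also above t1
            if t in above_t1:
                return t
        raise TypeError("no LUB for %s and %s" % (t1, t2))

    # lub("short", "char") == "int"     lub("byte", "float") == "float"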

The widen function is more interesting. It involves n² cases for n types. Many of these are error cases (e.g., if T is wider than W). Below is the code for our situation with two possible types, integer and real. The four cases consist of two no-ops (when T=W), one error (T=real, W=integer), and one conversion (T=integer, W=real).

    widen (A:addr, T:type, W:type, newcode:string, newaddr:addr)
      if T=W
        newcode = ""
        newaddr = A
      else if T=integer and W=real
        newaddr = new Temp()
        newcode = gen(newaddr = (real) A)
      else signal error
  

With these two functions it is not hard to modify the rules to catch type errors and perform coercions for arithmetic expressions.

  1. Maintain the type of each operand by defining type attributes for e, t, and f.
  2. Coerce each operand to the LUB.

This requires that we have type information for the base entities, identifiers and numbers. The lexer supplies the type of the numbers.

It is more interesting for the identifiers. We inserted that information when we processed declarations. So we now have another semantic check: Is the identifier declared before it is used?

We will not explicitly perform this check. To do so would not be hard.

The following analysis is similar to the one given in the previous homework solution. I repeat the explanation here because it is important to understand the workings of SDDs.

(Figure: parse tree for the assignment statement)

Before taking on the entire SDD, let's examine a particularly interesting entry

    identifier-statement → IDENTIFIER rest-of-assignment
  
and its right child
    rest-of-assignment → index := expression ;
  

Consider the assignment statement

    A[3/X+4] := X*5+Y;
  
the top of whose parse tree is shown on the right (again making the simplification to one-dimensional arrays by replacing indices with index). Consider the ra node, i.e., the node corresponding to the production.
    ra → i := e ;
  

When the tree traversal gets to the ra node the first time, its parent has passed in the value of the inherited attribute ra.id = ID.entry. Thus the ra node has access to the identifier table entry for ID, which in our example is the variable A.

Prior to doing its calculations, the ra node invokes its children and gets back all the synthesized attributes. Put another way, when the tree traversal gets to this node the last time, the children have returned all the synthesized attributes. To summarize, when the ra node finally performs its calculations, it has available:

  1. ra.id: the identifier entry for A.

  2. i.addr/i.code: executing i.code results in i.addr containing the value of the index, which is the expression 3/X+4.

  3. e.addr/e.code: executing e.code results in e.addr containing the value of the expression X*5+Y.

  4. i.type/e.type: The types of the expressions.
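
For instance, assuming X and Y are integer variables (purely for illustration; the temporary names t1, ..., t4 are made up), i.code might be

    t1 = 3 / X
    t2 = t1 + 4

with i.addr = t2 and i.type = integer, and e.code might be

    t3 = X * 5
    t4 = t3 + Y

with e.addr = t4 and e.type = integer.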

Assignment Statements With Type Checks and Coercions
(Without Multidimensional Arrays)
Production / Semantic Rules

ids → ID ra
    ra.id = ID.entry
    ids.code = ra.code

ra → := e ;
    widen(e.addr, e.type, ra.id.basetype, ra.code1, ra.addr)
    ra.code = e.code || ra.code1 || gen(ra.id.lexeme = ra.addr)

ra → i := e ;          (note: i, not is)
    ra.t1 = new Temp()
    widen(e.addr, e.type, ra.id.basetype, ra.code1, ra.addr)
    ra.code = i.code || gen(ra.t1 = getBaseWidth(ra.id)*i.addr)
          || e.code || ra.code1 || gen(ra.id.lexeme[ra.t1] = ra.addr)

i → e
    i.addr = e.addr
    i.type = e.type
    i.code = e.code

e → e1 ADDOP t
    e.addr = new Temp()
    e.type = LUB(e1.type, t.type)
    widen(e1.addr, e1.type, e.type, e.code1, e.addr1)
    widen(t.addr, t.type, e.type, e.code2, e.addr2)
    e.code = e1.code || e.code1 || t.code || e.code2 ||
          gen(e.addr = e.addr1 ADDOP.lexeme e.addr2)

e → t
    e.addr = t.addr
    e.type = t.type
    e.code = t.code

t → t1 MULOP f
    t.addr = new Temp()
    t.type = LUB(t1.type, f.type)
    widen(t1.addr, t1.type, t.type, t.code1, t.addr1)
    widen(f.addr, f.type, t.type, t.code2, t.addr2)
    t.code = t1.code || t.code1 || f.code || t.code2 ||
          gen(t.addr = t.addr1 MULOP.lexeme t.addr2)

t → f
    t.addr = f.addr
    t.type = f.type
    t.code = f.code

f → ( e )
    f.addr = e.addr
    f.type = e.type
    f.code = e.code

f → NUM
    f.addr = NUM.lexeme
    f.type = NUM.entry.type
    f.code = ""

f → ID                 (i.e., indices = ε)
    f.addr = ID.lexeme
    f.type = getBaseType(ID.entry)
    f.code = ""

f → ID i               (note: i, not is)
    f.t1 = new Temp()
    f.addr = new Temp()
    f.type = getBaseType(ID.entry)
    f.code = i.code || gen(f.t1 = i.addr*getBaseWidth(ID.entry))
          || gen(f.addr = ID.lexeme[f.t1])

What must the ra node do?

  1. Check that i.type is int (I don't do this, but should/could).

  2. Ensure execution of i.code and e.code.

  3. Multiply i.addr by the base width of the array A. (We need a temporary, ra.t1, to hold the computed value).

  4. Widen e to the base type of A. (We may need, and widen would then generate, a temporary ra.addr to hold the widened value).

  5. Do the actual assignment of X*5+Y to A[3/X+4].

I hope the above illustration clarifies the semantic rules for the
    ra → i := e ;
production in the SDD on the right.
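
Putting the pieces together for this example, and assuming (purely for illustration) that X and Y are integers, that A is an array of reals, and that getBaseWidth(A) is 8, the ra node would assemble something like

    t1 = 3 / X
    t2 = t1 + 4
    t5 = 8 * t2
    t3 = X * 5
    t4 = t3 + Y
    t6 = (real) t4
    A[t5] = t6

Here t5 plays the role of ra.t1, t6 is the ra.addr produced by widen, and the last instruction is the actual assignment.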

Because we are not considering multidimensional arrays, the
    f → ID is
production (is abbreviates indices) is replaced by the two special cases corresponding to scalars and one-dimensional arrays, namely:
    f → ID     f → ID i

The above illustration should also help understanding the semantic rules for this last production.

The result of incorporating these rules in our growing SDD is given here.

Homework: Same question as the previous homework (What code is generated for the program written above?). But the answer is different!
Please remind me to go over the answer next class.

For lab 4, I eliminated the left recursion for expression evaluation. This was an instructive exercise for me, so let's do it here.
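
For reference, the standard transformation (a sketch of the usual technique; the actual lab 4 grammar and attribute names may differ) replaces the left-recursive pair

    e → e1 ADDOP t
    e → t

with a right-recursive tail nonterminal, say et:

    e  → t et
    et → ADDOP t et1
    et → ε

The price is that et needs inherited attributes: it receives the address and type of everything to its left (initially those of the leading t), combines them with its own t using LUB and widen exactly as in the e → e1 ADDOP t rule above, passes the result down to et1, and returns the final address and type as synthesized attributes. The t → t1 MULOP f pair is handled the same way.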

6.5.3: Overloading of Functions and Operators

Overloading is when a function or operator has several definitions depending on the types of the operands and result.

6.5.4: Type Inference and Polymorphic Functions

6.5.5: An Algorithm for Unification

6.6: Control Flow

A key to understanding control flow is the study of Boolean expressions, which themselves are used in two roles.

  1. They can be computed and treated similarly to integers and reals. In many programming languages one can declare Boolean variables, use the Boolean constants true and false, and construct larger Boolean expressions using Boolean operators such as and and or. There are also relational operators that produce Boolean values from arithmetic operands. From this point of view, Boolean expressions are similar to the expressions we have already treated. Our previous semantic rules could be modified to generate the code needed to evaluate these expressions.
  2. Boolean expressions are used in certain statements that alter the normal flow of control. It is in this regard that we have something new to learn.

6.6.1: Boolean Expressions

One question that comes up with Boolean expressions is whether both operands need be evaluated. If we are evaluating A or B and find that A is true, must we evaluate B? For example, consider evaluating

    A=0  OR  3/A < 1.2
  
when A is zero.

This issue arises in other cases not involving Booleans at all. Consider A*F(x). If the compiler determines that A must be zero at this point, must it generate code to evaluate F(x)? Don't forget that functions can have side effects.

6.6.2: Short-Circuit (or Jumping) Code

Consider

    IF bool-expr-1 OR bool-expr-2 THEN
        then-clause
    ELSE
        else-clause
    END;
  
where bool-expr-1 and bool-expr-2 are Boolean expressions we have already generated code for. For a simple case think of them as Boolean variables, x and y.

When we compile the if condition bool-expr-1 OR bool-expr-2 we do not use an OR operator. Instead, we generate jumps to either the true branch or the false branch. We shall see that, in the comparatively simple case where bool-expr-1 = x and bool-expr-2 = y, the above source code will generate

    if x goto L2
    goto L4
    L4:
    if y goto L2
    goto L3
    L2:
    then-clause-code
    goto L1
    L3:
    else-clause-code
    L1:  -- This label (if-stmt.next) was defined higher in the SDD tree
  

Note that the compiled code does not evaluate y if x is true. This is the sense in which it is called short-circuit code. As we have stated above, for many programming languages, it is required that we not evaluate y if x is true.

6.6.3: Flow-of-Control Statements

The class grammar has the following productions concerning flow of control statements.

    program           → procedure-def program | ε
    procedure-def     → PROCEDURE name-and-params IS decls BEGIN statement statements END ;
    statements        → statement statements | ε
    statement         → keyword-statement | identifier-statement
    keyword-statement → return-statement | while-statement | if-statement
    if-statement      → IF boolean-expression THEN statements optional-else END ;
    optional-else     → ELSE statements | ε
    while-statement   → WHILE boolean-expression DO statements END ;
  
(Figure: flow of control for if-then, if-then-else, and while-do)

I don't show the productions for name-and-parameters, declarations, identifier-statement, and return-statement since they do not have conditional control flow. The productions for boolean-expression will be done in the next section.

In this section we will produce an SDD for the productions above under the assumption that the SDD for Boolean expressions generates jumps to the labels be.true and be.false (depending of course on whether the Boolean expression be is true or false).

The diagrams on the right give the idea for the three basic control flow statements, if-then (upper left), if-then-else (right), and while-do (lower-left).
If and While SDDs
Production / Semantic Rules

pg → pd pg1
    pd.next = new Label()
    pg.code = pd.code || label(pd.next) || pg1.code

pg → ε
    pg.code = ""

pd → PROC np IS ds BEG ss END ;
    pd.code = ss.code

ss → s ss1
    s.next = new Label()
    ss.code = s.code || label(s.next) || ss1.code

ss → ε
    ss.code = ""

s → ks
    ks.next = s.next
    s.code = ks.code

ks → ifs
    ifs.next = ks.next
    ks.code = ifs.code

ifs → IF be THEN ss oe END ;
    be.true = new Label()
    be.false = new Label()
    oe.next = ifs.next
    ifs.code = be.code || label(be.true) || ss.code ||
        gen(goto ifs.next) || label(be.false) || oe.code

oe → ELSE ss
    oe.code = ss.code

oe → ε
    oe.code = ""

ks → ws
    ws.next = ks.next
    ks.code = ws.code

ws → WHILE be DO ss END ;
    ws.begin = new Label()
    be.true = new Label()
    be.false = ws.next
    ss.next = ws.begin
    ws.code = label(ws.begin) || be.code ||
          label(be.true) || ss.code || gen(goto ws.begin)

The table to the right gives the details via an SDD.

To enable the tables to fit I continue to abbreviate severely the names of the nonterminals. New abbreviations are ks (keyword-statement), ifs (if-statement), be (boolean-expression), oe (optional-else), and ws (while-statement). The remaining abbreviations, to be introduced in the next section, are bt (boolean-term) and bf (boolean-factor).

The treatment of the various *.next attributes deserves some comment. Each statement (e.g., if-statement abbreviated ifs or while-stmt abbreviated ws) is given, as an inherited attribute, a new label *.next. You can see the new label generated in the
statements → statement statements
production and then notice how it is passed down the tree from statements to statement to keyword-statement to if-statement. The child generates a goto *.next if it wants to end (look at ifs.code in the if-statement production).

The parent, in addition to sending this attribute to the child, normally places label(*.next) after the code for the child. See ss.code in the stmts → stmt stmts production.
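
As a concrete, if simplified, sketch of how these rules might look in code, here are the ss → s ss1 and while-statement actions written in Python. The new_label, gen, and label helpers are hypothetical (not part of any lab skeleton), and each translate_* callback stands for a child's translation, receiving its inherited attributes as arguments.

    _count = 0
    def new_label():
        """Hypothetical helper: fresh labels L1, L2, ..."""
        global _count
        _count += 1
        return "L%d" % _count

    def gen(instr):        # one three-address instruction, as a line of text
        return instr + "\n"

    def label(lab):        # a label definition, as a line of text
        return lab + ":\n"

    def translate_statements(stmt_translators):
        """ss -> s ss1 : give each statement a fresh s.next label and place
           label(s.next) immediately after s.code (ss -> epsilon yields "")."""
        code = ""
        for translate_stmt in stmt_translators:
            s_next = new_label()
            code += translate_stmt(s_next) + label(s_next)
        return code

    def translate_while(translate_be, translate_ss, ws_next):
        """ws -> WHILE be DO ss END ; -- note where each label is placed."""
        ws_begin = new_label()                    # ws.begin
        be_true = new_label()                     # be.true
        return (label(ws_begin) +
                translate_be(be_true, ws_next) +  # be.false = ws.next
                label(be_true) +
                translate_ss(ws_begin) +          # ss.next = ws.begin
                gen("goto " + ws_begin))

    # Example: WHILE x < 10 DO x := x + 1 END ;  with canned child translators.
    be = lambda true, false: gen("if x < 10 goto " + true) + gen("goto " + false)
    body = lambda next_label: gen("x := x + 1")
    print(translate_statements([lambda s_next: translate_while(be, body, s_next)]))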

An alternative design would have been for the child to itself generate the label and place it as the last component of its code (or elsewhere if needed). I believe that alternative would have resulted in a clearer SDD with fewer inherited attributes. The method actually chosen does, however, have one and possibly two advantages.

  1. Perhaps there is a case where it is awkward for the child to place something at the end of its code (but I don't quite see how this could be). To investigate whether this possibility occurs, one might want to examine the treatment of case statements below.

  2. Look at ws.code, the code attribute for the while statement. The parent (the while-stmt) does not place the ss.next label after ss.code. Instead it is placed before ss.code (actually before the while test, which is before ss.code) since right after ss.code is an unconditional goto to the while test. If we used the alternative where the child (the stmts) generated the ss.next at the end of its code, the parent would still need a goto from after ss.code to the begin label. If the child needed to jump to its end it would jump to the ss.next label and the next statement would be the jump to begin. With the scheme we are using we do not have this embarrassing jump to jump sequence.

    I must say, however, that, given all the non-optimal code we are generating, it might have been pedagogically better to use the alternative scheme and live with the jump to a jump embarrassment. I decided in the end to follow the book, but am not sure that was the right decision.
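
To picture the jump-to-jump sequence mentioned in point 2, the tail of a while statement under the alternative scheme would look roughly like (a sketch, with the labels written as the attributes they stand for)

    ...                  -- last instruction of ss.code
    ss.next:
    goto ws.begin

so a statement that jumps to ss.next lands on an unconditional goto. With the scheme actually used, ss.next is ws.begin, and that first jump already arrives at the while test.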

Homework: Give the SDD for a repeat statement
    REPEAT ss WHILE be END ;

Boolean Expressions
Production / Semantic Rules

be → bt OR be1
    bt.true = be.true
    bt.false = new Label()
    be1.true = be.true
    be1.false = be.false
    be.code = bt.code || label(bt.false) || be1.code

be → bt
    bt.true = be.true
    bt.false = be.false
    be.code = bt.code

bt → bf AND bt1
    bf.true = new Label()
    bf.false = bt.false
    bt1.true = bt.true
    bt1.false = bt.false
    bt.code = bf.code || label(bf.true) || bt1.code

bt → bf
    bf.true = bt.true
    bf.false = bt.false
    bt.code = bf.code

bf → NOT bf1
    bf1.true = bf.false
    bf1.false = bf.true
    bf.code = bf1.code

bf → TRUE
    bf.code = gen(goto bf.true)

bf → FALSE
    bf.code = gen(goto bf.false)

bf → e RELOP e1
    bf.code = e.code || e1.code
          || gen(if e.addr RELOP.lexeme e1.addr goto bf.true)
          || gen(goto bf.false)

6.6.4: Control-Flow Translation of Boolean Expressions

The SDD for the evaluation of Boolean expressions is given on the right. All the SDDs are assembled together here.

Recall that in evaluating the Boolean expression the objective is not simply to produce the value true or false, but instead, the goal is to generate a goto be.true or a goto be.false (depending of course on whether the Boolean expression evaluates to true or to false).

Recall as well that we must not evaluate the right-hand argument of an OR if the left-hand argument evaluates to true, and we must not evaluate the right-hand argument of an AND if the left-hand argument evaluates to false.

Notice a curiosity. When evaluating arithmetic expressions, we needed to write

    expr → expr + term
  
and not
    expr → term + expr
  
in order to achieve left associativity. But with jumping code we evaluate each factor (really boolean-factor) in left to right order and jump out as soon as the condition is determined. Hence we can use the productions in the table, which conveniently are not left recursive.

Look at the rules for the first production in the table. If bt yields true, then we are done (i.e., we want to jump to be.true). Thus we set bt.true=be.true, knowing that when bt is evaluated (look at be.code) bt will now jump to be.true.

On the other hand, if bt yields false, then we must evaluate be1, which is why be.code has the label for bt.false right before be1.code. In this case the value of be is the same as the value of be1, which explains the assignments to be1.true and be1.false.
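
Continuing the Python sketch from the flow-of-control section (and reusing its hypothetical new_label, gen, and label helpers), the OR rule and the relational-operator rule could be coded roughly as follows; this is my own illustration, not the lab code.

    def translate_or(translate_bt, translate_be1, be_true, be_false):
        """be -> bt OR be1 : bt jumps to be.true on success; on failure it
           falls through the bt.false label into the code for be1."""
        bt_false = new_label()                       # bt.false = new Label()
        return (translate_bt(be_true, bt_false) +    # bt.true = be.true
                label(bt_false) +
                translate_be1(be_true, be_false))    # be1 inherits be's labels

    def translate_relop(e_code, e_addr, op, e1_code, e1_addr, bf_true, bf_false):
        """bf -> e RELOP e1 : evaluate both operands, then jump."""
        return (e_code + e1_code +
                gen("if " + e_addr + " " + op + " " + e1_addr + " goto " + bf_true) +
                gen("goto " + bf_false))

    # Example: x < 5 OR x > 10 with be.true = L2 and be.false = L3.
    x_lt_5  = lambda true, false: translate_relop("", "x", "<",  "", "5",  true, false)
    x_gt_10 = lambda true, false: translate_relop("", "x", ">",  "", "10", true, false)
    print(translate_or(x_lt_5, x_gt_10, "L2", "L3"))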

Do on the board the translation of example 6.6.4

    IF   x<5  OR  x>10 AND x=y  THEN
       x := 3;
    END ;
  

We first produce the parse tree, as shown on the upper right. The tree appears quite large for such a small example. Do remember that production compilers use the (abstract) syntax tree, which is quite a bit smaller.

Notice that each e.code="" and that each e.addr is the ID or NUM directly below it. The real action begins with the three bf and one ra productions. Another point is that there are several identity productions such as be→bt.

Thus we can simplify the parse tree to the one at the lower right. This is not an official tree; I am using it here just to save some work. Basically, I eliminated nodes that would have simply copied attributes (inherited attributes are copied downward; synthesized attributes are copied upward). I also replaced the ID, NUM, and RELOP terminals with their corresponding lexemes. Then, as mentioned above, I moved these lexemes up the tree to the expr directly above them (the only exception is the LHS of the assignment statement; the ID there is not reduced to an expr).

The Euler-tour traversal to evaluate the SDD for this parse tree begins by processing the inherited attributes at the root. The code fragment we are considering is not a full example (recall that the start symbol of the grammar is program, not if-stmt). So when we start at the if-stmt node, its inherited attribute ifs.next will have been evaluated (it is a label, let's call it L1) and the .code computed higher in the tree will place this label right after ifs.code.

We go down the tree to bool-expr with be.true and .false set to new labels say L2 and L3 (these are shown in green in the diagram), and ss.next and oe.next set to ifs.next=L1 (these two .next attributes will not be used). We go down to bf (really be→bt→bf) and set .true to L2 and .false to a new label L4. We start back up with

  bf.code=""||""||gen(if x<5 goto L2)||
                  gen(goto L4)
  

Next we go up to be and down to bt setting .true and .false to L2 and L3. Then down to bf with .true=L5 .false=L3, giving

    bf.code=gen(if x>10 goto L5)||gen(goto L3)
  
Up and down to the other bf gives
    bf.code=gen(if x=y goto L2)||gen(goto L3)
  

Next we go back up to bt and synthesize

    bt.code=gen(if x>10 goto L5)||gen(goto L3)||
            label(L5)||gen(if x=y goto L2)||gen(goto L3)
  

We complete the Boolean expression processing by going up to be and synthesizing

    be.code=gen(if x<5 goto L2)||gen(goto L4)||label(L4)||
            gen(if x>10 goto L5)||gen(goto L3)||
            label(L5)||gen(if x=y goto L2)||gen(goto L3)
  

We have already seen assignments. You can check that s.code = gen(x:=3).

Finally we synthesize ifs.code and get (written in a more familiar style)

    if x<5 goto L2
    goto L4
    L4:
    if x>10 goto L5
    goto L3
    L5:
    if x=y goto L2
    goto L3
    L2:
    x:=3
    goto L1
    L3:
    // oe.code is empty
    L1:
  

Note that there are four extra gotos. One is a goto to the next statement. Two others could be eliminated by using ifFalse. The fourth just skips over a label and empty code.
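
For comparison, a hand-cleaned version of the same code (my own sketch; the SDD as given does not perform any of these eliminations) would be

    if x<5 goto L2
    ifFalse x>10 goto L3
    ifFalse x=y goto L3
    L2:
    x:=3
    L3:
    L1:

which drops the jump to the next statement, folds the two goto L3's into ifFalse tests, and omits the jump over the empty else part.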

Remark: This ends the material for lab 4.