Compilers

Start Lecture #11

6.4: Translation of Expressions

Expressions Without Arrays and Functions
Production	Semantic Rule

e → t	e.addr = t.addr
	e.code = t.code

e → e₁ + t	e.addr = new Temp()
	e.code = e₁.code || t.code || gen(e.addr = e₁.addr + t.addr)

e → e₁ - t	e.addr = new Temp()
	e.code = e₁.code || t.code || gen(e.addr = e₁.addr - t.addr)

t → f	t.addr = f.addr
	t.code = f.code

t → t₁ * f	t.addr = new Temp()
	t.code = t₁.code || f.code || gen(t.addr = t₁.addr * f.addr)

t → t₁ / f	t.addr = new Temp()
	t.code = t₁.code || f.code || gen(t.addr = t₁.addr / f.addr)

f → ( e )	f.addr = e.addr
	f.code = e.code

f → NUM	f.addr = get(NUM.lexeme)
	f.code = ""

f → if	f.addr = if.addr
	f.code = if.code

if → ID	if.addr = get(ID.lexeme)
	if.code = ""

if → ID [ expressions ]	Done later

if → ID ( expressions )	Done later

The goal is to generate 3-address code for expressions, written in the natural notation of 6.2. We assume there is a function gen() that, given the pieces needed, does the proper formatting, so gen(x = y + z) will output the corresponding 3-address instruction. gen() is often called with addresses rather than lexemes like x. The constructor Temp() produces a new address in whatever format gen() needs. Hopefully this will be clear in the tables that follow.

6.4.1: Operations Within Expressions

We will use two attributes, code and addr. For a parse tree node, the code attribute gives the three-address code to evaluate the input derived from that node. In particular, code at the root evaluates the entire expression.

The attribute addr at a node is the address that holds the value calculated by the code at the node. Recall that unlike real code for a real machine our 3-address code doesn't reuse addresses.

As one would expect for expressions, all the attributes in the table to the right are synthesized. The table is for the expression part of the lab 3 grammar. To save space let's use ID for IDENTIFIER, lv for lvalue, e for expression, t for term, and f for factor.

Since our current objective is primarily to illustrate the usage of the code and addr attributes, we omit arrays and function calls within expressions.
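To make the two attributes concrete, here is a minimal Python sketch of the rules in the table. It is not the lab code: the tuple encoding of the tree, the class name ExprTranslator, and the T1, T2, ... naming of temporaries are all assumptions made for illustration. Each call returns the pair (addr, code) for its node, just as the synthesized attributes would be computed bottom-up.

```python
from itertools import count


class ExprTranslator:
    """Computes the synthesized attributes addr and code for an expression tree.

    A tree node is a tuple (hypothetical encoding, not from the lab):
      ("num", "3")  or  ("id", "x")  or  (op, left, right) with op in + - * /
    """

    def __init__(self):
        self._ids = count(1)  # source of fresh temporary names

    def new_temp(self):
        # Plays the role of "new Temp()" in the semantic rules.
        return f"T{next(self._ids)}"

    def translate(self, node):
        kind = node[0]
        if kind in ("num", "id"):
            # f -> NUM and if -> ID: addr is the name itself, code is empty.
            return node[1], []
        # e -> e1 op t (and t -> t1 op f): code = left.code || right.code || gen(...)
        left_addr, left_code = self.translate(node[1])
        right_addr, right_code = self.translate(node[2])
        addr = self.new_temp()
        instr = f"{addr} = {left_addr} {kind} {right_addr}"
        return addr, left_code + right_code + [instr]
```

For example, translating the tree for a + b * 4, namely ("+", ("id", "a"), ("*", ("id", "b"), ("num", "4"))), yields the two instructions T1 = b * 4 and T2 = a + T1, with T2 as the addr of the root.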

6.4.2: Incremental Translation

We saw this in chapter 2.

The method in the previous section builds up long strings as attributes while we walk the tree. By using an SDT instead of an SDD, you can output each part of the code as its node is processed, so the code attribute is no longer needed.
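A minimal sketch of the incremental idea, in Python and under the same assumed tuple encoding of trees as before (("num", "3"), ("id", "x"), or (op, left, right)). The function name translate_incremental and the emit callback are hypothetical. Only addr is returned; each instruction is handed to emit as soon as its node is finished, so no long code string is ever built.

```python
from itertools import count


def translate_incremental(node, emit, temps=None):
    """Emit three-address code for node as a side effect and return its addr.

    emit is any callable taking one instruction string, for example print
    or the append method of a list.
    """
    if temps is None:
        temps = count(1)  # fresh temporaries T1, T2, ...
    kind = node[0]
    if kind in ("num", "id"):
        return node[1]  # nothing to emit for a leaf
    left = translate_incremental(node[1], emit, temps)
    right = translate_incremental(node[2], emit, temps)
    addr = f"T{next(temps)}"
    emit(f"{addr} = {left} {kind} {right}")  # output now, do not accumulate
    return addr
```

Passing print as emit writes the instructions directly to standard output in the order the nodes are completed.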

6.4.3: Addressing Array Elements

The idea is that you associate the base address with the array name. That is, the offset stored in the identifier table is the address of the first element of the array. The indices and the array bounds are used to compute the amount, often called the offset (unfortunately, we have already used that term), by which the address of the referenced element differs from the base address.

Multiple Declarations with basetypes and widths
Production	Semantic Rules

fd → FUNC np RET t IS ds BEG s ss END ;	ds.offset = 0

pd → PROC np IS ds BEG s ss END ;	ds.offset = 0

np → di ( ps ) | di	not used yet

ds → d ds₁	d.offset = ds.offset
	ds₁.offset = d.newoffset
	ds.totalSize = ds₁.totalSize

ds → ε	ds.totalSize = ds.offset

d → di : t ;	addType(di.entry, t.type)
	addBaseType(di.entry, t.basetype)
	addSize(di.entry, t.size)
	addOffset(di.entry, d.offset)
	d.newoffset = d.offset + t.size

t → ARRAY [ NUM ] OF t₁ ;	t.type = array(NUM.value, t₁.type)
	t.basetype = t₁.basetype
	t.size = NUM.value * t₁.size

t → INTEGER	t.type = integer
	t.basetype = integer
	t.size = 4

t → REAL	t.type = real
	t.basetype = real
	t.size = 8

To implement this technique, we store the base type of each identifier in the identifier table. For example, consider

    arr: array [ 10 ] of integer ;
    x  : real ;

Our previous SDD for declarations calculates the size and type of each identifier. For arr these are 40 and array(10,integer). The enhanced SDD on the right calculates, in addition, the base type. For arr this is integer. For a scalar, such as x, the base type is the same as the type, which in the case of x is real.
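The following Python sketch mimics what the enhanced SDD computes for a list of declarations. The encoding of types as "integer", "real", or ("array", n, element_type) and the function names type_attrs and lay_out are assumptions for illustration; the sizes 4 and 8 come from the table.

```python
def type_attrs(t):
    """Return (type, basetype, size) for a type expression.

    t is "integer", "real", or ("array", n, element_type) (hypothetical encoding).
    """
    if t == "integer":
        return "integer", "integer", 4
    if t == "real":
        return "real", "real", 8
    _, n, elem = t
    elem_type, basetype, elem_size = type_attrs(elem)
    # t -> ARRAY [ NUM ] OF t1: the base type is inherited from the element
    # type and the size is NUM.value * t1.size.
    return ("array", n, elem_type), basetype, n * elem_size


def lay_out(decls):
    """Assign offsets to a list of (name, type) declarations.

    Returns the identifier table and the total size (ds.totalSize).
    """
    table = {}
    offset = 0  # ds.offset = 0 at the start of a procedure or function
    for name, t in decls:
        typ, basetype, size = type_attrs(t)
        table[name] = {"type": typ, "basetype": basetype,
                       "size": size, "offset": offset}
        offset += size  # d.newoffset = d.offset + t.size
    return table, offset
```

For the two declarations above, lay_out gives arr the type ("array", 10, "integer"), base type integer, size 40, and offset 0, and gives x offset 40, for a total size of 48.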

Instead of a column distinguishing synthesized and inherited attributes, I now highlight in pink the inherited ones. This is not needed; you can look at the LHS of a rule and see if the attribute it defines is inherited or synthesized.

One Dimensional Arrays

Calculating the address of an element of a one dimensional array is easy. The address increment is the width of each element times the index (assuming indexes start at 0). So the address of A[i] is the base address of A, which is the offset component of A's entry in the identifier table, plus i times the width of each element of A.

The width of each element is the width of what we have called the base type. So for an ID the element width is sizeof(getBaseType(ID.entry.type)). For convenience we define getBaseWidth by the formula

getBaseWidth(ID.entry) = sizeof(getBaseType(ID.entry.type))
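As a small worked instance of the formula, here is a Python sketch. The dictionary BASE_WIDTH and the function address_1d are hypothetical names; the widths 4 and 8 are the sizes of integer and real used in the declarations SDD.

```python
# Width in bytes of each base type (the sizes used in the declarations SDD).
BASE_WIDTH = {"integer": 4, "real": 8}


def address_1d(base, index, basetype):
    """Address of A[index], where base is A's offset in the identifier table.

    Assumes indexes start at 0.
    """
    return base + index * BASE_WIDTH[basetype]
```

So if a real array starts at offset 40, element 3 is at 40 + 3 * 8 = 64.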

Two Dimensional Arrays

Let us assume row major ordering. That is, the first element stored is A[0,0], then A[0,1], ... A[0,k-1], then A[1,0], ... . Most modern languages, including C, use row major ordering; Fortran is the best-known exception.

With the alternative column major ordering, after A[0,0] comes A[1,0], A[2,0], ... .

For two dimensional arrays the address of A[i,j] is the sum of three terms

1. The base address of A.
2. The distance from A to the start of row i. This is i times the width of a row, which is i times the number of elements in a row times the width of an element. The number of elements in a row is the column array bound.
3. The distance from the start of row i to element A[i,j]. This is j times the width of an element.
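The three terms translate directly into code. This is only a sketch in Python; the function name address_2d and its parameter names are made up for illustration, and row major ordering with indexes starting at 0 is assumed.

```python
def address_2d(base, i, j, num_cols, width):
    """Row-major address of A[i,j].

    base     : base address of A
    num_cols : number of elements in a row (the column array bound)
    width    : width in bytes of one element
    """
    row_distance = i * num_cols * width  # from A to the start of row i
    col_distance = j * width             # from the start of row i to A[i,j]
    return base + row_distance + col_distance
```

For an array with 9 reals per row and base address 0, A[2,3] is at 2 * 9 * 8 + 3 * 8 = 168.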

Remark: Our grammar really declares one-dimensional arrays of one-dimensional arrays rather than 2D arrays. I think this makes it easier. We could make the SDD above fancier and capture, for a declaration like
A : array [5] of array [9] of real;
all the values needed to compute the offset of an element.
However, we won't do this and for lab4 will only have 1D arrays.

Higher Dimensional Arrays

The generalization to higher dimensional arrays is clear.
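For those who want to see it spelled out, here is one way to write the generalization, sketched in Python. The function name element_offset is hypothetical. It assumes row major ordering and indexes starting at 0, and it folds the indexes together one dimension at a time (Horner's rule) before scaling by the element width.

```python
def element_offset(indices, bounds, width):
    """Byte distance from the base address to A[indices] in row-major order.

    indices : one index per dimension, each starting at 0
    bounds  : the array bound of each dimension, in the same order
    width   : width in bytes of one element
    """
    position = 0
    for index, bound in zip(indices, bounds):
        # Moving one step in an earlier dimension skips `bound` positions
        # of the current dimension.
        position = position * bound + index
    return position * width
```

With one index this reduces to index times width, and with two it agrees with the three-term sum for 2D arrays (minus the base address). Note that the first bound is never actually needed.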

A Simple Example

Consider the following expression containing a simple array reference, where a and c are integers and b is a real array.

  a = b[3*c]

We want to generate code something like

  T1 = 3 * c     // i.e. mult T1,3,c
  T2 = T1 * 8    // each b[i] is size 8
  a  = b[T2]     // Uses the x[i] special form

If we considered it too easy to use the special form we would generate something like

  T1 = 3 * c
  T2 = 8 * T1
  T3 = &b
  T4 = T2 + T3
  a  = *T4

6.4.4: Translation of Array References

Translating One-Dimensional Array References in Expressions
Production	Semantic Rules

if → ID [ e ]	if.t1 = new Temp()
	if.addr = new Temp()
	if.code = e.code || gen(if.t1 = e.addr * getBaseWidth(ID.entry)) || gen(if.addr = get(ID.lexeme)[if.t1])

if → ID [ e ]	if.t1 = new Temp()
	if.t2 = new Temp()
	if.t3 = new Temp()
	if.addr = new Temp()
	if.code = e.code || gen(if.t1 = e.addr * getBaseWidth(ID.entry)) || gen(if.t2 = &get(ID.lexeme)) || gen(if.t3 = if.t2 + if.t1) || gen(if.addr = *if.t3)

To include arrays we need to specify the semantic actions for the production

identifier-factor → IDENTIFIER [ expressions ]

Since, at least for now, we will limit ourselves to one-dimensional arrays, we replace expressions by simply expression, which we abbreviate as e.

The table on the right does this in two ways, both with and without using the special addressing form x[i].
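Both versions of the rule can be sketched in a few lines of Python. Everything named here is an assumption for illustration: translate_index is a made-up function, BASE_WIDTH stands in for getBaseWidth(ID.entry), and the index expression is passed in already translated, as its addr and its list of instructions.

```python
from itertools import count

# Stand-in for getBaseWidth(ID.entry): b is assumed to be a real array.
BASE_WIDTH = {"b": 8}


def translate_index(array, index_addr, index_code, temps, special_form=True):
    """Translate the array reference array[e], given e.addr and e.code.

    temps yields the numbers for fresh temporaries. Returns (addr, code).
    """
    t1 = f"T{next(temps)}"
    code = list(index_code)
    code.append(f"{t1} = {index_addr} * {BASE_WIDTH[array]}")
    if special_form:
        # First rule: use the x[i] special form.
        addr = f"T{next(temps)}"
        code.append(f"{addr} = {array}[{t1}]")
    else:
        # Second rule: spell out the address arithmetic.
        t2 = f"T{next(temps)}"
        t3 = f"T{next(temps)}"
        addr = f"T{next(temps)}"
        code.append(f"{t2} = &{array}")
        code.append(f"{t3} = {t2} + {t1}")
        code.append(f"{addr} = *{t3}")
    return addr, code
```

For b[3*c], with the index already translated to T1 = 3 * c, the first rule produces T2 = T1 * 8 followed by T3 = b[T2]. Unlike the hand-written code in the simple example, the value lands in a new temporary rather than directly in a, because the rule for the array reference knows nothing about the enclosing assignment.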

An Aside on Special Forms

Normally Lisp is taught in our programming languages course, which is a prerequisite for compilers. If you no longer remember Lisp, don't worry.

In Lisp there is a simple evaluation rule. To evaluate, for example, (a b c d) you
1. Evaluate all four components.
2. Confirm that the first component can evaluate to a function.
3. Invoke this function passing as arguments the values calculated for the other three components.
But this rule is not always applied.
Instead there are special forms that are evaluated differently.
For example (setq a b) does not evaluate a prior to invoking setq.
A similar thing is happening with a[i]. It is a special form in that, unlike the normal rules for three-address code, we don't use the address of i but instead its value. Specifically the value of i is added to the address of a.

Since the goal of the semantic rules in the table is precisely to generate such code, the simpler version of the SDD uses a[i].

I also included a version without using a[i] for two reasons.

Since we are restricted to one dimensional arrays, the full code generation for the address of an element is not hard and
I thought it would be instructive to see the full address generation without hiding some of it under the covers.

It was definitely instructive for me! The rules for addresses in 3-address code also include

    a = &b
    a = *b
    *a = b

which are other special forms. They have the same meaning as in the C programming language.
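To pin down what the three forms mean, here is a toy model in Python. It is only a sketch: memory is a dictionary from addresses to values, addr_of maps each name to its address, and the operation names "addr", "load", and "store" are invented for this example.

```python
def run(instrs, addr_of, memory):
    """Execute a list of pointer instructions on a toy memory.

    Each instruction is a tuple (op, a, b):
      ("addr",  a, b)  means  a = &b
      ("load",  a, b)  means  a = *b
      ("store", a, b)  means  *a = b
    """
    for op, a, b in instrs:
        if op == "addr":
            # a receives the address of b, not its value.
            memory[addr_of[a]] = addr_of[b]
        elif op == "load":
            # The value of b is treated as an address; a receives its contents.
            memory[addr_of[a]] = memory[memory[addr_of[b]]]
        elif op == "store":
            # The value of a is treated as an address; the value of b goes there.
            memory[memory[addr_of[a]]] = memory[addr_of[b]]
        else:
            raise ValueError(f"unknown operation: {op}")
    return memory
```

For instance, with p, x, y, z at addresses 0, 4, 8, 12, the sequence p = &x, y = *p, *p = z first copies x into y and then overwrites x with z.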

Let's carefully evaluate the simple example above.

This is an exciting moment. At long last we really seem to be compiling!

The Left-Hand Side

The left hand side of an assignment statement
Production	Semantic Rules

ids → ID ra	ra.id = ID.entry
	ids.code = ra.code

ra → := e ;	ra.code = e.code || gen(ra.id.lexeme = e.addr)

ra → [ e ] := e₁ ;	ra.t1 = new Temp()
	ra.code = e₁.code || e.code || gen(ra.t1 = getBaseWidth(ra.id) * e.addr) || gen(ra.id.lexeme[ra.t1] = e₁.addr)

Now that we can evaluate expressions (even including one-dimensional array references) we need to handle the left-hand side of an assignment statement (which also can be an array reference). Specifically we need semantic actions for the following productions from the lab3 grammar.

    identifier-stmt    → IDENTIFIER rest-of-assignment
    rest-of-assignment → = expression ;
    rest-of-assignment → [ expressions ] = expression ;

Once again we restrict ourselves to one-dimensional arrays, which corresponds to replacing expressions by expression in the last production.
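Here is a minimal Python sketch of the two rest-of-assignment rules. The function translate_assignment and its parameters are hypothetical; the right-hand side and the index are passed in already translated, each as a pair (addr, code), and base_width plays the role of getBaseWidth.

```python
from itertools import count


def translate_assignment(name, rhs, index=None, base_width=None, temps=None):
    """Return the three-address code for an assignment to name.

    rhs and index are (addr, code) pairs for already translated expressions.
    With index=None this is the scalar case; otherwise the target is
    name[index] and base_width is the width of one element.
    """
    rhs_addr, rhs_code = rhs
    if index is None:
        # ra -> = e ;
        return rhs_code + [f"{name} = {rhs_addr}"]
    if temps is None:
        temps = count(1)
    index_addr, index_code = index
    t1 = f"T{next(temps)}"
    # ra -> [ e ] = e1 ;  the code for e1 comes first, then the code for e.
    return rhs_code + index_code + [
        f"{t1} = {base_width} * {index_addr}",
        f"{name}[{t1}] = {rhs_addr}",
    ]
```

For an integer array v, the statement v[i] = w; becomes T1 = 4 * i followed by v[T1] = w, while the scalar statement z = w; is the single instruction z = w.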

Recall the program we could partially handle.

    procedure test () is
        y : integer;
        x : array [10] of real;
    begin
        y = 5;        // we haven't yet done statements
        x[2] = y;     // type error?
    end;

Now we can do the statements.

What about the possible type error?

We could ignore errors.
We could assume the intermediate language permits mismatched types. Final code generation would then need to generate conversion code or signal an error.
We could change the program to use only one type.
We could learn about type checking and conversions.

Let's take the last option.

Homework: What code is generated for the program written above?