Compilers

Start Lecture #10

[Figure: type tree for int[3][4]]

6.3.4: Storage Layout for Local Names

Previously we considered an SDD for arrays that was able to compute the type. The key point was that it called the function array(size,type) so as to produce a tree structure exhibiting the dimensionality of the array. For example the tree on the right would be produced for
    int[3][4]     or     array [3] of array [4] of int.

Now we will extend the SDD to calculate the size of the array as well. For example, the array pictured has size 48, assuming that each int has size 4. When we declare many objects, we need to know the size of each one in order to determine the offsets of the objects from the start of the storage area.

We are considering here only those types for which the storage requirements can be computed at compile time. For others, e.g., string variables, dynamic arrays, etc, we would only be reserving space for a pointer to the structure; the structure itself would be created at run time. Such structures are discussed in the next chapter.

Type and Size of Arrays

    Production        Actions                                  Semantic Rules
    T → B             { t = B.type }                           C.bt = B.bt
                      { w = B.width }                          C.bw = B.bw
        C             { T.type = C.type;                       T.type = C.type
                        T.width = C.width; }                   T.width = C.width
    B → INT           { B.type = integer; B.width = 4; }       B.bt = integer
                                                               B.bw = 4
    B → FLOAT         { B.type = float; B.width = 8; }         B.bt = float
                                                               B.bw = 8
    C → [ NUM ] C1    { C.type = array(NUM.value, C1.type);    C1.bt = C.bt
                        C.width = NUM.value * C1.width; }      C1.bw = C.bw
                                                               C.type = array(NUM.value, C1.type)
                                                               C.width = NUM.value * C1.width
    C → ε             { C.type = t; C.width = w }              C.type = C.bt
                                                               C.width = C.bw

The idea (for arrays whose size can be determined at compile time) is that the basic type determines the width of the object, and the number of elements in the array determines the height. These are then multiplied to get the size (area) of the object. The terminology actually used is that the basetype determines the basewidth, which when multiplied by the number of elements gives the width.
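This basewidth-times-elements idea can be sketched in a few lines of Python (not the book's code; the tuple encoding of array(size, type) is my own):

```python
def array(size, elem_type):
    """Build the tree structure array(size, type) used in the notes."""
    return ("array", size, elem_type)

def declare(base_type, base_width, dims):
    """Return (type, width) for a declaration such as int[3][4] (dims=[3,4])."""
    t, w = base_type, base_width
    for n in reversed(dims):      # innermost dimension is applied first
        t = array(n, t)
        w = n * w                 # width = basewidth * product of dimensions
    return t, w

t, w = declare("integer", 4, [3, 4])
# t == ("array", 3, ("array", 4, "integer")), w == 48
```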

The book uses semantic actions, i.e., a syntax directed translation or SDT. I added the corresponding semantic rules so that the table to the right is an SDD as well. In both cases we follow the book and show a single type specification T rather than a list of object declarations D. The omitted productions are

    D → T ID ; D | ε
  

The goal of the SDD is to calculate two attributes of the start symbol T, namely T.type and T.width; the rules can be viewed as the implementation.

The basetype and hence the basewidth are determined by B, which is INT or FLOAT. The values are set by the lexer (as attributes of FLOAT and INT) or, as shown in the table, are constants in the parser. NUM.value is set by the lexer.

These attributes of B are pushed down the tree via the inherited attributes C.bt and C.bw until all the array dimensions have been passed over and the ε-production is reached. There they are turned around as usual and sent back up, during which the dimensions are processed and the final type and width are calculated.

An alternative implementation, using global variables, is given by the semantic actions (the braced column in the table). This is similar to the earlier comment that, instead of having the identifier table passed up and down via attributes, the bullet can be bitten and a globally visible table used instead.

Remember that for an SDT, the placement of the actions within the production is important. Since it aids reading to have the actions lined up in a column, we sometimes write the production itself on multiple lines. For example, the production T → B C in the table has the B and C on separate lines so that the first two actions can be in between, even though they are written to the right. These two actions are performed after the B child has been traversed, but before the C child has been traversed. The final two actions are at the very end, so are done after both children have been traversed.

The actions use global variables t and w to carry the base type (INT or FLOAT) and width down to the ε-production, where they are then sent on their way up and become multiplied by the various dimensions.

Using the Class Grammar

Lab 3 SDD for Declarations
(All Attributes Synthesized)

    Production                   Semantic Rules
    d → di : ty ;                addType(di.entry, ty.type)
                                 addSize(di.entry, ty.size)
    di → ID                      di.entry = ID.entry
    ty → ARRAY [ NUM ] OF ty1    ty.type = array(NUM.value, ty1.type)
                                 ty.size = NUM.value * ty1.size
    ty → INT                     ty.type = integer
                                 ty.size = 4
    ty → REAL                    ty.type = real
                                 ty.size = 8

This exercise is easier with the class grammar since there are no inherited attributes. We again assume that the lexer has defined NUM.value (it is likely a field in the numbers table entry for the token NUM). The goal is to augment the identifier table entry for ID to include the type and size information found in the declaration. This can be written two ways.

  1. addType(ID.entry,ty.type)
    addSize(ID.entry,ty.size)
  2. ID.entry.type=ty.type
    ID.entry.size=ty.size
The two notations mean the same thing, namely the type component of the identifier table entry for ID is set to ty.type (and similarly for size). It is common to write it the first way. (In the table, ID.entry is di.entry, but they are equal.)

Recall that addType is viewed as synthesized since its parameters come from the RHS, i.e., from children of this node.

addType has a side effect (it modifies the identifier table) so we must ensure that we do not use this table value before it is calculated.
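As a sketch of this side effect and the declaration-before-use check, here is a toy identifier table in Python (the dictionary encoding and the lookup_type helper are hypothetical, not part of the lab's actual implementation):

```python
id_table = {}  # lexeme -> entry; a stand-in for the identifier table

def addType(entry, ty):
    entry["type"] = ty       # side effect: mutates the table entry

def addSize(entry, size):
    entry["size"] = size

def lookup_type(lexeme):
    """Used during expression evaluation; enforces declaration before use."""
    entry = id_table.get(lexeme)
    if entry is None or "type" not in entry:
        raise NameError(lexeme + " used before declaration")
    return entry["type"]

id_table["y"] = {}                 # entry created when the lexer sees y
addType(id_table["y"], "integer")  # actions for  y : int ;
addSize(id_table["y"], 4)
# lookup_type("y") == "integer"; lookup_type("a") raises NameError
```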

When do we use this value?
Answer: When we evaluate expressions, we will need to look up the types of objects.

How can we ensure that the type has already been determined and saved?
Answer: We will need to enforce declaration before use. So, in expression evaluation, we should check the entry in the identifier table to be sure that the type has already been set.

[Figure: parse tree for example 6.3.4-1]

Example 1:
As a simple example, let us construct, on the board, the parse tree for the scalar declaration
      y : int ;
We get the diagram at the right, in which I have also shown the effects of the semantic rules.

Note specifically the effect of the addType and addSize functions on the identifier table.

[Figure: parse tree for example 6.3.4-2]

Example 2:
For our next example we choose the array declaration
    a : array [7] of int ;
The result is again shown on the right. The green numbers show the value of ty.size and the blue number shows the value of NUM.value.

Example 3:
For our final example in this section we combine the two previous declarations into the following simple complete program. In this example we show only the parse tree. In the next section we consider the semantic actions as well.

[Figure: parse tree for example 6.3.4-3]

    Procedure P1 is
      y : int;
      a : array [7] of real;
    begin
    end;
  

We observe several points about this example.

  1. Since we have 2 declarations, we need to use the two productions involving the nonterminal declarations (ds)
        ds → d ds | ε
  2. An Euler-tour traversal of the tree will visit the declarations before visiting any statements (I know this example doesn't have any statements). Thus, if we are processing a statement and find an undeclared variable, we can signal an error, since we know that there is no chance we will visit the declaration later.
  3. With multiple declarations, it will be necessary to determine the offset of each declared object.
  4. We will study an extension of this example in the next section. In particular, we will show the annotated parse tree, including calculations of the offset just mentioned.

6.3.5: Sequences of Declarations

Remark: Be careful to distinguish between three methods used to store and pass information.

  1. Attributes. These are variables in a phase of the compiler (the semantic analyzer, a.k.a. the intermediate code generator).
  2. Identifier (and other) table. This holds longer lived data; often passed between phases.
  3. Run time storage. This is storage established by the compiler, but not used by the compiler. It is allocated and used during run time.

To summarize, the identifier table (and other tables we have used) are not present when the program is run. But there must be run time storage for objects. In addition to allocating the storage (a subject discussed next chapter), we need to determine the address each object will have during execution. Specifically, we need to know its offset from the start of the area used for object storage.

For just one object, it is trivial: the offset is zero. When there are multiple objects, we need to keep a running sum of the sizes of the preceding objects, which is our next objective.

Multiple Declarations

The goal is to permit multiple declarations in the same procedure (or program or function). For C/Java-like languages this can occur in two ways.

  1. Multiple objects in a single declaration.
  2. Multiple declarations in a single procedure.

In either case we need to associate with each object being declared the location in which it will be stored at run time. Specifically we include in the table entry for the object, its offset from the beginning of the current procedure. We initialize this offset at the beginning of the procedure and increment it after each object declaration.

The lab3 grammar does not support multiple objects in a single declaration.

C/Java does permit multiple objects in a single declaration, but surprisingly the 2e grammar does not.

Naturally, the way to permit multiple declarations is to have a list of declarations in the natural right-recursive way. The 2e C/Java grammar has D which is a list of semicolon-separated T ID's
    D → T ID ; D | ε

The lab 3 grammar has a list of declarations (each of which ends in a semicolon). Shortening declarations to ds we have
    ds → d ds | ε
Multiple declarations snippet

    Production       Semantic Action
    P →              { offset = 0; }
        D
    D → T ID ;       { top.put(id.lexeme, T.type, offset);
                       offset = offset + T.width; }
        D1
    D → ε

As mentioned, we need to maintain an offset, the next storage location to be used by an object declaration. The 2e snippet on the right introduces a nonterminal P for program that gives a convenient place to initialize offset.

The name top is used to signify that we work with the top symbol table (when we have nested scopes for record definitions, nested procedures, or nested blocks, we need a stack of symbol tables). The call top.put places the identifier into this table with its type and storage location and then bumps offset for the next variable or next declaration.

Rather than figure out how to put this snippet together with the previous 2e code that handled arrays, we will just present the snippets and put everything together on the class grammar.

Multiple Declarations in the Class Grammar

Multiple Declarations

    Production                       Semantic Rules
    pd → PROC np IS ds BEG ss END ;  ds.offset = 0
    ds → d ds1                       d.offset = ds.offset
                                     ds1.offset = d.newoffset
                                     ds.totalSize = ds1.totalSize
    ds → ε                           ds.totalSize = ds.offset
    d → di : ty ;                    addType(di.entry, ty.type)
                                     addSize(di.entry, ty.size)
                                     addOffset(di.entry, d.offset)
                                     d.newoffset = d.offset + ty.size
    di → ID                          di.entry = ID.entry
    ty → ARRAY [ NUM ] OF ty1        ty.type = array(NUM.value, ty1.type)
                                     ty.size = NUM.value * ty1.size
    ty → INT                         ty.type = integer
                                     ty.size = 4
    ty → REAL                        ty.type = real
                                     ty.size = 8

On the right we show the part of the SDD used to translate multiple declarations for the class grammar. We do not show the productions for name-and-parameters (np) or statements (ss) since we are focusing on just the declaration of local variables.

The new part is determining the offset for each individual declaration. (In the original table the new items are shaded blue; inherited attributes are red.) The idea is that the first declaration has offset 0, and each subsequent offset is the current offset plus the current size. Specifically, we proceed as follows.

In the procedure-def (pd) production, we give the nonterminal declarations (ds) the inherited attribute offset (ds.offset), which we initialize to zero.

We inherit this offset down to individual declarations. At each declaration (d), we store the offset in the entry for the identifier being declared and increment the offset by the size of this object.

When we get to the end of the declarations (the ε-production), the offset value is the total size needed. We turn it around and send it back up the tree in case the total is needed by some higher-level production.
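The down-then-up offset computation can be sketched iteratively in Python (the list of (name, size) pairs is a stand-in for the parse tree of ds → d ds | ε):

```python
def process_declarations(decls, offset=0):
    """decls: list of (name, size) pairs, one per declaration d."""
    entries = {}
    for name, size in decls:   # each d records its offset, then bumps it
        entries[name] = {"offset": offset, "size": size}
        offset += size
    total_size = offset        # the epsilon production turns it around
    return entries, total_size

# y : integer (size 4), a : array [7] of real (size 7*8 = 56):
entries, total = process_declarations([("y", 4), ("a", 56)])
# entries["y"]["offset"] == 0, entries["a"]["offset"] == 4, total == 60
```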

[Figure: parse tree for the 6.3.5 example]

Example: What happens when the following program (an extension of P1 from the previous section) is parsed and the semantic rules above are applied.

  procedure P2 is
      y : integer;
      a : array [7] of real;
  begin
      y := 5;      // Statements not yet done
      a[2] := y;   // Type error?
  end;

On the right is the parse tree, limited to the declarations.

The dotted lines would be solid for a parse tree and connect to the lines below. They are shown dotted so that the figure can be compared to the annotated parse tree immediately below, in which the attributes and their values have been filled in.

In the annotated tree, the attributes shown in red are inherited. Those in black are synthesized.

To start the annotation process, look at the top production in the parse tree. It has an inherited attribute ds.offset, which is set to zero. Since the attribute is inherited, the entry can be placed immediately into the annotated tree.

We now do the left child and fill in its inherited attribute.

When the Euler tour comes back up the tree, the synthesized attributes are evaluated and recorded.

[Figure: fully annotated parse tree for the 6.3.5 example]

6.3.6: Fields in Records and Classes

Since records can essentially have a bunch of declarations inside, we need only add

    T → RECORD { D }
  
to get the syntax right. For the semantics we need to push the environment and offset onto stacks, since the namespace inside a record is distinct from that outside. The width of the record itself is the final value of the (inner) offset, which in turn is the value of totalSize at the root when the inner scope is concluded.

  T → record {  { Env.push(top);
                  top = new Env();
                  Stack.push(offset);
                  offset = 0; }
  D }           { T.type = record(top);
                  T.width = offset;
                  top = Env.pop();
                  offset = Stack.pop(); }

This does not apply directly to the class grammar, which does not have records.

This same technique would be used for other examples of nested scope, e.g., nested procedures/functions and nested blocks. To have nested procedures/functions, we need other alternatives for declaration: procedure/function definitions. Similarly if we wanted to have nested blocks we would add another alternative to statement.

  s           → ks | ids | block-stmt
  block-stmt  → DECLARE ds BEGIN ss END ;

If we wanted to generate code for nested procedures or nested blocks, we would need to stack the symbol table as done above and in the text.
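A sketch of that stacking, using the same dictionary-based tables as the earlier sketches (enter_record/leave_record are hypothetical names for the two action blocks in the record production above):

```python
env_stack, offset_stack = [], []
top, offset = {}, 0          # current symbol table and running offset

def enter_record():
    """First action block: save the outer scope, start a fresh one."""
    global top, offset
    env_stack.append(top);       top = {}      # top = new Env()
    offset_stack.append(offset); offset = 0    # field offsets start at 0

def leave_record():
    """Second action block: the inner offset becomes the record's width."""
    global top, offset
    fields, width = top, offset                # record(top), T.width = offset
    top = env_stack.pop()                      # restore the outer scope
    offset = offset_stack.pop()
    return ("record", fields), width

enter_record()
top["x"] = {"offset": offset}; offset += 4     # x : integer
top["y"] = {"offset": offset}; offset += 8     # y : real
rec_type, rec_width = leave_record()
# rec_width == 12; the outer top and offset are restored
```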

Homework: 1.

6.4: Translation of Expressions

Scalar Expressions

    Production        Semantic Rules
    e → t             e.addr = t.addr
                      e.code = t.code
    e → e1 ADDOP t    e.addr = new Temp()
                      e.code = e1.code || t.code ||
                          gen(e.addr = e1.addr ADDOP.lexeme t.addr)
    t → f             t.addr = f.addr
                      t.code = f.code
    t → t1 MULOP f    t.addr = new Temp()
                      t.code = t1.code || f.code ||
                          gen(t.addr = t1.addr MULOP.lexeme f.addr)
    f → ( e )         f.addr = e.addr
                      f.code = e.code
    f → NUM           f.addr = get(NUM.lexeme)
                      f.code = ""
    f → ID is         (assume indices is ε)
                      f.addr = get(ID.lexeme)
                      f.code = ""

The goal is to generate 3-address code for scalar expressions, i.e., arrays are not treated in this section (they will be shortly). Specifically, indices is assumed to be ε. We use is to abbreviate indices.

We generate the 3-address code using the natural notation of 6.2. In fact we assume there is a function gen() that, given the pieces needed, does the proper formatting so gen(x = y + z) will output the corresponding 3-address code. gen() is often called with addresses other than lexemes, e.g., temporaries and constants. The constructor Temp() produces a new address in whatever format gen needs. Hopefully this will be clear in the table to the right and the others that follow.

6.4.1: Operations Within Expressions

We will use two attributes code and address (addr). The key objective at each node of the parse tree for an expression is to produce values for the code and addr attributes so that the following crucial invariant is maintained.

If code is executed, then address addr contains the value of the (sub-)expression rooted at this node.
In particular, after TheRoot.code is evaluated, the address TheRoot.addr contains the value of the entire expression.

Said another way, the attribute addr at a node is the address that holds the value calculated by the code at the node. Recall that unlike real code for a real machine our 3-address code doesn't reuse temporary addresses.

As one would expect for expressions, all the attributes in the table to the right are synthesized. The table is for the expression part of the class grammar. To save space we use ID for IDENTIFIER, e for expression, t for term, and f for factor.
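The same scheme can be sketched in Python, with nested tuples standing in for the parse tree (new_temp and the string form of gen are my own simplifications, not the book's code):

```python
temp_count = 0
def new_temp():
    """Like the constructor Temp(): a fresh, never-reused address."""
    global temp_count
    temp_count += 1
    return "T" + str(temp_count)

def translate(node):
    """node: ('num', lexeme) | ('id', lexeme) | (op, left, right).
    Returns (addr, code) maintaining the invariant: running code
    leaves the value of this subexpression in addr."""
    if node[0] in ("num", "id"):
        return node[1], []            # addr is the lexeme itself, no code
    op, left, right = node
    laddr, lcode = translate(left)
    raddr, rcode = translate(right)
    addr = new_temp()
    return addr, lcode + rcode + [addr + " = " + laddr + " " + op + " " + raddr]

addr, code = translate(("+", ("id", "a"), ("*", ("id", "b"), ("num", "3"))))
# code == ["T1 = b * 3", "T2 = a + T1"], addr == "T2"
```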

The SDDs for declarations and scalar expressions can be easily combined by essentially concatenating them as shown here.

6.4.2: Incremental Translation

We saw this in chapter 2.

The method in the previous section generates long strings as we walk the tree. By using SDTs instead of SDDs, you can output parts of the string as each node is processed.

6.4.3: Addressing Array Elements

Declarations with Basetypes

    Production                       Semantic Rules
    pd → PROC np IS ds BEG ss END ;  ds.offset = 0
    np → di ( ps ) | di              not used yet
    ds → d ds1                       d.offset = ds.offset
                                     ds1.offset = d.newoffset
                                     ds.totalSize = ds1.totalSize
    ds → ε                           ds.totalSize = ds.offset
    d → di : ty ;                    addType(di.entry, ty.type)
                                     addBaseType(di.entry, ty.basetype)
                                     addSize(di.entry, ty.size)
                                     addOffset(di.entry, d.offset)
                                     d.newoffset = d.offset + ty.size
    di → ID                          di.entry = ID.entry
    ty → ARRAY [ NUM ] OF ty1        ty.type = array(NUM.value, ty1.type)
                                     ty.basetype = ty1.basetype
                                     ty.size = NUM.value * ty1.size
    ty → INT                         ty.type = integer
                                     ty.basetype = integer
                                     ty.size = 4
    ty → REAL                        ty.type = real
                                     ty.basetype = real
                                     ty.size = 8

The idea is to associate the base address with the array name. That is, the offset stored in the identifier table entry for the array is the address of the first element. When an element is referenced, the indices and the array bounds are used to compute the amount, often called the offset (unfortunately, we have already used that term), by which the address of the referenced element differs from the base address.

To implement this technique, we must first store the base type of each identifier in the identifier table. We use this basetype to determine the size of each element of the array. For example, consider

    arr: array [10] of integer;
    x  : real ;
  
Our previous SDD for declarations calculates the size and type of each identifier. For arr these are 40 and array(10,integer), respectively. The enhanced SDD on the right calculates, in addition, the base type. For arr this is integer. For a scalar, such as x, the base type is the same as the type, which in the case of x is real. The new material is shaded in blue.

This SDD is combined with the expression SDD here.

One Dimensional Arrays

Calculating the address of an element of a one dimensional array is easy. The address increment is the width of each element times the index (assuming indices start at 0). So the address of A[i] is the base address of A, which is the offset component of A's entry in the identifier table, plus i times the width of each element of A.

The width of each element of an array is the width of what we have called the base type of the array. For a scalar, there is just one element and its width is the width of the type, which is the same as the base type. Hence, for any ID the element width is sizeof(getBaseType(ID.entry)).

For convenience, we define getBaseWidth by the formula

    getBaseWidth(ID.entry) = sizeof(getBaseType(ID.entry)) = sizeof(ID.entry.baseType)
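As a sketch (the dictionary entry format is hypothetical, continuing the earlier identifier-table sketches):

```python
type_width = {"integer": 4, "real": 8}   # sizeof for the base types

def get_base_width(entry):
    """Element width of any ID: the width of its base type."""
    return type_width[entry["basetype"]]

# a : array [7] of real, stored at offset 0:
arr_entry = {"basetype": "real", "offset": 0}
# address of a[i] = offset of a + i * get_base_width(arr_entry)
# for i == 2: 0 + 2 * 8 == 16
```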
  

Two Dimensional Arrays

Let us assume row major ordering. That is, the first element stored is A[0,0], then A[0,1], ... A[0,k-1], then A[1,0], ... . Modern languages use row major ordering.

With the alternative column major ordering, after A[0,0] comes A[1,0], A[2,0], ... .

For two dimensional arrays the address of A[i,j] is the sum of three terms

  1. The base address of A.
  2. The distance from A to the start of row i. This is i times the width of a row, which is i times the number of elements in a row times the width of an element. The number of elements in a row is the column array bound.
  3. The distance from the start of row i to element A[i,j]. This is j times the width of an element.
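The three terms can be checked with a small sketch (the base address, bounds, and element width below are made-up values):

```python
def address_2d(base, n2, w, i, j):
    """Row-major address of A[i,j]: base, plus the distance to row i
    (i rows of n2 elements, each of width w), plus j elements into the row."""
    return base + (i * n2 + j) * w

# A : array [5] of array [9] of real  (element width 8), at base address 100:
# address_2d(100, 9, 8, 2, 3) == 100 + (2*9 + 3)*8 == 268
```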

Remarks

  1. Our grammar really declares one dimensional arrays of one dimensional arrays rather than 2D arrays. I believe this makes it easier.
  2. The SDD above when processing the declaration
        A : array [5] of array [9] of real;
    gives A the type array(5,array(9,real)) and thus the type component of the entry for A in the symbol table contains all the values needed to compute the address of any given element of the array.

End of Remarks

Higher Dimensional Arrays

The generalization to higher dimensional arrays is clear.

A Simple Example

Consider the following expression containing a simple array reference, where a and c are integers and b is a real array.

    a = b[3*c]
  
We want to generate code something like
    T1 = #3 * c    // i.e. mult T1,#3,c
    T2 = T1 * 8    // each b[i] is size 8
    a  = b[T2]     // Uses the x[i] special form
  
If we considered using that special form to be cheating, we would generate something like
    T1 = #3 * c
    T2 = T1 * 8
    T3 = &b
    T4 = T2 + T3
    a  = *T4
  

One-Dimensional Array References in Expressions

    Production      Semantic Rules
    f → ID i        f.t1 = new Temp()
                    f.addr = new Temp()
                    f.code = i.code ||
                        gen(f.t1 = i.addr * getBaseWidth(ID.entry)) ||
                        gen(f.addr = get(ID.lexeme)[f.t1])
    f → ID i        f.t1 = new Temp()
      (without      f.t2 = new Temp()
      the x[j]      f.t3 = new Temp()
      special       f.addr = new Temp()
      form)         f.code = i.code ||
                        gen(f.t1 = i.addr * getBaseWidth(ID.entry)) ||
                        gen(f.t2 = &get(ID.lexeme)) ||
                        gen(f.t3 = f.t2 + f.t1) ||
                        gen(f.addr = *f.t3)
    i → [ e ]       i.addr = e.addr
                    i.code = e.code

6.4.4: Translation of Array References (Within Expressions)

To permit arrays in expressions, we need to specify the semantic actions for the production

    factor → IDENTIFIER indices
  

One-Dimensional Array References

As a warm-up, let's start with references to one-dimensional arrays. That is, instead of the above production, we consider the simpler

    factor → IDENTIFIER index
  

The table above does this in two ways, both with and without using the special addressing form x[j]. In the table the nonterminal index is abbreviated i. I included the version without the x[j] special form for two reasons.

  1. Since we are restricted to one dimensional arrays, the full code generation for the address of an element is not hard.
  2. I thought it would be instructive to see the full address generation without hiding some of it under the covers. It was definitely instructive for me!

Note that by avoiding the special form b=x[j], I ended up using two other special forms.
Is it possible to avoid the special forms?

An Aside on Special Forms: An Example From Lisp

Lisp is taught in our programming languages course, which is a prerequisite for compilers. If you no longer remember Lisp, don't worry.

Our Special Forms for Addressing

We just (optionally) saw an exception to the basic Lisp evaluation rule. A similar exception occurs with x[j] in our three-address code. It is a special form in that, unlike the normal rules for three-address code, we don't use the address of j but instead its value. Specifically, the value of j is added to the address of x.

The rules for addresses in 3-address code also include

    a = &b
    a = *b
    *a = b
  
which are other special forms. They have the same meaning as in the C programming language.

Incorporating 1D Arrays in the Expression SDD

Our current SDD includes the production

    f → ID is
  
with the added assumption that indices is ε. We now want to permit indices to be a single index as well as ε. That is, we replace the above production with the pair
    f → ID
    f → ID i
  

The semantic rules for each case were given in previous tables. The ensemble to date is given here.

On the board evaluate e.code for the RHS of the simple example above: a=b[3*c].

This is an exciting moment. At long last we really seem to be compiling!

Multidimensional Array References

As mentioned above in the general case we must process the production

    f → IDENTIFIER is
  

Following the method used in the 1D case we need to construct is.code. The basic idea is shown here

The Left-Hand Side

Now that we can evaluate expressions (including one-dimensional array references), we need to handle the left-hand side of an assignment statement (which can also be an array reference). Specifically, we need semantic actions for the following productions from the class grammar.

    id-statement   → ID rest-of-either
    rest-of-either → rest-of-assignment
    rest-of-assignment → := expression ;
    rest-of-assignment → indices := expression
  

Scalars and One Dimensional Arrays on the Left Hand Side

Assignment Statements

    Production           Semantic Rules
    ss → s ss1           ss.code = s.code || ss1.code
    ss → ε               ss.code = ""
    s → ids              s.code = ids.code
    ids → IDENTIFIER re  re.id = IDENTIFIER.entry
                         ids.code = re.code
    re → ra              ra.id = re.id
                         re.code = ra.code
    ra → := e ;          ra.code = e.code ||
                             gen(ra.id.lexeme = e.addr)
    ra → i := e ;        ra.t1 = new Temp()
                         ra.code = i.code || e.code ||
                             gen(ra.t1 = i.addr * getBaseWidth(ra.id)) ||
                             gen(ra.id.lexeme[ra.t1] = e.addr)

Once again we begin by restricting ourselves to one-dimensional arrays, which corresponds to replacing indices by index in the last production. The SDD for this restricted case is shown on the right.

The first three productions reduce statements (ss) and statement (s) to identifier-statement (ids), which is used for statements such as assignment and procedure invocation, that begin with an identifier. The corresponding semantic rules simply concatenate all the code produced into the top ss.code.

The identifier-statement production captures the ID and sends it down to the appropriate rest-of-assignment (ra) production where the necessary code is generated and passed back up.

The simple ra → := e ; production generates ra.code by appending, to the code that evaluates the RHS, the natural assignment with the identifier as the LHS.

It is instructive to compare ra.code for the ra → i := e ; production with f.code for the f → ID i production in the expression SDD. Both compute the same offset (index * elementWidth), and both use a special form for the address: x[j] = y in this section, and y = x[j] for expressions.

Incorporating these productions and semantic rules gives this SDD. Note that we have added a semantic rule to the procedure-def (pd) production that simply sends ss.code to pd.code so that we have obtained the overall goal setting pd.code to the entire code needed for the procedure.
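The ra → i := e ; rule alone can be sketched in Python, with gen simplified to string formatting (the helper names and calling convention are mine, not the class grammar's):

```python
temp_count = 0
def new_temp():
    global temp_count
    temp_count += 1
    return "T" + str(temp_count)

def assign_indexed(name, base_width, i_addr, i_code, e_addr, e_code):
    """ra.code for  name[index] := expression  using the x[t] = y special form."""
    t1 = new_temp()
    return i_code + e_code + [
        t1 + " = " + i_addr + " * " + str(base_width),  # ra.t1 = i.addr * getBaseWidth(ra.id)
        name + "[" + t1 + "] = " + e_addr,              # special form on the LHS
    ]

# a[2] := y, where a is an array of real (element width 8):
code = assign_indexed("a", 8, "#2", [], "y", [])
# code == ["T1 = #2 * 8", "a[T1] = y"]
```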

Multi-dimensional Arrays on the Left Hand Side

The idea is the same as when a multidimensional array appears in an expression. Specifically,

  1. Traverse the sub-tree rooted at the top indices node and compute the total offset from the start of the array.
  2. Multiply this offset by the width of each entry.
  3. A[product] = RHS

Our Simple Example Revisited

Recall the program we could partially handle.

  procedure P2 is
      y : integer;
      a : array [7] of real;
  begin
      y := 5;      // Statements not yet done
      a[2] := y;   // Type error?
  end;

Now we can do the statements.

Homework: What code is generated for the program written above? Please remind me to go over this homework next class.

What should we do about the possible type error?

  1. We could ignore errors.
  2. We could assume the intermediate language permits mismatched types. Final code generation would then need to generate conversion code or signal an error.
  3. We could change the program to use only one type.
  4. We could learn about type checking and conversions.

Let's take the last option.