Start Lecture #10
Previously we considered an SDD for
arrays that was able to compute the type.
The key point was that it called the function array(size,type) so as
to produce a tree structure exhibiting the dimensionality of the
array.
For example, the tree on the right would be produced for int[3][4] or array [3] of array [4] of int.
Now we will extend the SDD to calculate the size of the array as well. For example, the array pictured has size 48, assuming that each int has size 4. When we declare many objects, we need to know the size of each one in order to determine the offsets of the objects from the start of the storage area.
We are considering here only those types for which the storage requirements can be computed at compile time. For others, e.g., string variables, dynamic arrays, etc, we would only be reserving space for a pointer to the structure; the structure itself would be created at run time. Such structures are discussed in the next chapter.
Production | Actions | Semantic Rules |
---|---|---|
T → B | { t = B.type; w = B.width } | C.bt = B.bt; C.bw = B.bw |
      C | { T.type = C.type; T.width = C.width; } | T.type = C.type; T.width = C.width |
B → INT | { B.type = integer; B.width = 4; } | B.bt = integer; B.bw = 4 |
B → FLOAT | { B.type = float; B.width = 8; } | B.bt = float; B.bw = 8 |
C → [ NUM ] C1 | { C.type = array(NUM.value, C1.type); C.width = NUM.value * C1.width; } | C1.bt = C.bt; C1.bw = C.bw; C.type = array(NUM.value, C1.type); C.width = NUM.value * C1.width |
C → ε | { C.type = t; C.width = w } | C.type = C.bt; C.width = C.bw |
The idea (for arrays whose size can be determined at compile time) is that the basic type determines the width of the object, and the number of elements in the array determines the height. These are then multiplied to get the size (area) of the object. The terminology actually used is that the basetype determines the basewidth, which when multiplied by the number of elements gives the width.
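To make the width calculation concrete, here is a small Python sketch (not part of the lecture's SDD machinery; the names Array and build_type are purely illustrative) that mimics what the C productions compute for int[3][4] with a base width of 4.

```python
# Mimics the C -> [ NUM ] C1 recursion: build the nested array type and
# multiply the element count into the width at each level.
class Array:
    def __init__(self, size, elem_type):
        self.size, self.elem_type = size, elem_type
    def __repr__(self):
        return f"array({self.size}, {self.elem_type})"

def build_type(dims, base_type, base_width):
    """dims = [3, 4] for int[3][4]; returns (type, width)."""
    if not dims:                      # the epsilon-production: turn around
        return base_type, base_width
    inner_type, inner_width = build_type(dims[1:], base_type, base_width)
    return Array(dims[0], inner_type), dims[0] * inner_width

t, w = build_type([3, 4], "integer", 4)
print(t, w)    # array(3, array(4, integer)) 48
```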
The book uses semantic actions, i.e., a syntax directed translation or SDT. I added the corresponding semantic rules so that the table to the right is an SDD as well. In both cases we follow the book and show a single type specification T rather than a list of object declarations D. The omitted productions are
D → T ID ; D | ε
The goal of the SDD is to calculate two attributes of the start symbol T, namely T.type and T.width; the rules can be viewed as the implementation.
The basetype and hence the basewidth are determined by B, which is INT or FLOAT. The values are set by the lexer (as attributes of FLOAT and INT) or, as shown in the table, are constants in the parser. NUM.value is set by the lexer.
These attributes of B are pushed down the tree via the inherited attributes C.bt and C.bw until all the array dimensions have been passed over and the ε-production is reached. There they are turned around as usual and sent back up; on the way up the dimensions are processed and the final type and width are calculated.
An alternative implementation, using global variables, is described in the grayed-out description of the semantic actions. This is similar to the comment above that, instead of having the identifier table passed up and down via attributes, the bullet can be bitten and a globally visible table used instead.
Remember that for an SDT, the placement of the actions within the production is important. Since it aids reading to have the actions lined up in a column, we sometimes write the production itself on multiple lines. For example, the production T→BC in the table above has the B and C on separate lines so that (the first two) actions can be in between even though they are written to the right. These two actions are performed after the B child has been traversed, but before the C child has been traversed. The final two actions are at the very end so are done after both children have been traversed.
The actions use global variables t and w to carry the base type (INT or FLOAT) and width down to the ε-production, where they are then sent on their way up and become multiplied by the various dimensions.
Production | Semantic Rules (All Attributes Synthesized) |
---|---|
d → di : ty ; | addType(di.entry, ty.type); addSize(di.entry, ty.size) |
di → ID | di.entry = ID.entry |
ty → ARRAY [ NUM ] OF ty1 | ty.type = array(NUM.value, ty1.type); ty.size = NUM.value * ty1.size |
ty → INT | ty.type = integer; ty.size = 4 |
ty → REAL | ty.type = real; ty.size = 8 |
This exercise is easier with the class grammar since there are no inherited attributes. We again assume that the lexer has defined NUM.value (it is likely a field in the numbers table entry for the token NUM). The goal is to augment the identifier table entry for ID to include the type and size information found in the declaration. This can be written two ways.
Recall that addType is viewed as synthesized since its parameters come from the RHS, i.e., from children of this node.
addType has a side effect (it modifies the identifier table) so we must ensure that we do not use this table value before it is calculated.
When do we use this value?
Answer: When we evaluate expressions, we will need to look up the
types of objects.
How can we ensure that the type has already been determined and
saved?
Answer: We will need to enforce declaration before use.
So, in expression evaluation, we should check the entry in the
identifier table to be sure that the type has already been set.
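For concreteness, here is a rough Python sketch of how addType and addSize might act on an identifier-table entry; the dictionary-based table and the lookup helper are assumptions for illustration, not the lab's actual data structures.

```python
# addType/addSize work by side effect on a shared identifier table.
id_table = {}                      # lexeme -> entry (a plain dict here)

def lookup(lexeme):
    return id_table.setdefault(lexeme, {"lexeme": lexeme})

def addType(entry, ty):
    entry["type"] = ty             # side effect on the shared table

def addSize(entry, size):
    entry["size"] = size

entry = lookup("a")
addType(entry, "array(7, integer)")
addSize(entry, 28)
print(id_table["a"])               # {'lexeme': 'a', 'type': ..., 'size': 28}
```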
Example 1:
As a simple example, let us construct,
on the board, the parse tree for the scalar declaration
y : int ;
We get the diagram at the right, in which I have also shown the
effects of the semantic rules.
Note specifically the effect of the addType and addSize functions on the identifier table.
Example 2:
For our next example we choose the array declaration
a : array [7] of int ;
The result is again shown on the right.
The green numbers show the value of ty.size and the blue number shows the value of NUM.value.
Example 3:
For our final example in this section we combine the two previous
declarations into the following simple complete program.
In this example we show only the parse tree.
In the next section we consider the semantic actions as well.
    procedure P1 is
        y : int;
        a : array [7] of real;
    begin
    end;
We observe several points about this example.
Remark: Be careful to distinguish between three methods used to store and pass information.
To summarize, the identifier table (and the other tables we have used) is not present when the program is run. But there must be run time storage for objects. In addition to allocating the storage (a subject discussed in the next chapter), we need to determine the address each object will have during execution. Specifically, we need to know its offset from the start of the area used for object storage.
For just one object, it is trivial: the offset is zero. When there are multiple objects, we need to keep a running sum of the sizes of the preceding objects, which is our next objective.
The goal is to permit multiple declarations in the same procedure (or program or function). For C/java like languages this can occur in two ways.
In either case we need to associate with each object being declared the location in which it will be stored at run time. Specifically we include in the table entry for the object, its offset from the beginning of the current procedure. We initialize this offset at the beginning of the procedure and increment it after each object declaration.
The lab3 grammar does not support multiple objects in a single declaration.
C/Java does permit multiple objects in a single declaration, but surprisingly the 2e grammar does not.
Naturally, the way to permit multiple declarations is to have a list of declarations in the natural right-recursive way. The 2e C/Java grammar has D, which is a list of semicolon-separated T ID's:
D → T ID ; D | ε
The lab 3 grammar has a list of declarations
(each of which ends in a semicolon).
Shortening declarations to ds we have
ds → d ds | ε
Production | Semantic Action |
---|---|
P → | { offset = 0; } |
      D | |
D → T ID ; | { top.put(ID.lexeme, T.type, offset); offset = offset + T.width; } |
      D1 | |
D → ε | |
As mentioned, we need to maintain an offset, the next storage location to be used by an object declaration. The 2e snippet on the right introduces a nonterminal P for program that gives a convenient place to initialize offset.
The name top is used to signify that we work with the top symbol table (when we have nested scopes for record definitions, nested procedures, or nested blocks we need a stack of symbol tables). Top.put places the identifier into this table with its type and storage location and then bumps offset for the next variable or next declaration.
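The following tiny Python sketch (with hypothetical names) imitates these actions: a global offset plus a top table whose put records the type and storage location and then bumps the offset.

```python
# A sketch of the 2e actions: a global offset and a "top" symbol table.
offset = 0
top = {}                                   # lexeme -> (type, offset)

def put(lexeme, ty, width):
    global offset
    top[lexeme] = (ty, offset)             # top.put(ID.lexeme, T.type, offset)
    offset = offset + width                # offset = offset + T.width

put("y", "integer", 4)
put("a", "array(7, real)", 56)
print(top, offset)   # {'y': ('integer', 0), 'a': ('array(7, real)', 4)} 60
```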
Rather than figure out how to put this snippet together with the previous 2e code that handled arrays, we will just present the snippets and put everything together on the class grammar.
Production | Semantic Rules |
---|---|
pd → PROC np IS ds BEG ss END ; | ds.offset = 0 |
ds → d ds1 | d.offset = ds.offset; ds1.offset = d.newoffset; ds.totalSize = ds1.totalSize |
ds → ε | ds.totalSize = ds.offset |
d → di : ty ; | addType(di.entry, ty.type); addSize(di.entry, ty.size); addOffset(di.entry, d.offset); d.newoffset = d.offset + ty.size |
di → ID | di.entry = ID.entry |
ty → ARRAY [ NUM ] OF ty1 | ty.type = array(NUM.value, ty1.type); ty.size = NUM.value * ty1.size |
ty → INT | ty.type = integer; ty.size = 4 |
ty → REAL | ty.type = real; ty.size = 8 |
On the right we show the part of the SDD used to translate multiple declarations for the class grammar. We do not show the productions for name-and-parameters (np) or statements (ss) since we are focusing on just the declaration of local variables.
The new part is determining the offset for each individual declaration. The new items have blue backgrounds (this includes new, inherited attributes, which were red). The idea is that the first declaration has offset 0 and the next offset is the current offset plus the current size. Specifically, we proceed as follows.
In the procedure-def (pd) production, we give the nonterminal declarations (ds) the inherited attribute offset (ds.offset), which we initialize to zero.
We inherit this offset down to individual declarations. At each declaration (d), we store the offset in the entry for the identifier being declared and increment the offset by the size of this object.
When we get to the end of the declarations (the ε-production), the offset value is the total size needed. We turn it around and send it back up the tree in case the total is needed by some higher-level production.
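Here is a minimal Python sketch of this flow, assuming the declarations have already been collected into a list of (name, type, size) triples; it threads the inherited offset down the list and returns the synthesized totalSize.

```python
# Threads ds.offset down the declaration list and returns ds.totalSize.
def do_decls(decls, offset=0, table=None):
    table = {} if table is None else table
    if not decls:                        # ds -> epsilon: turn the offset around
        return offset                    # ds.totalSize = ds.offset
    (name, ty, size), rest = decls[0], decls[1:]
    table[name] = {"type": ty, "size": size, "offset": offset}   # addOffset
    return do_decls(rest, offset + size, table)                  # d.newoffset

tbl = {}
total = do_decls([("y", "integer", 4), ("a", "array(7, real)", 56)], 0, tbl)
print(total, tbl)     # 60, with y at offset 0 and a at offset 4
```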
Example: What happens when the following program (an extension of P1 from the previous section) is parsed and the semantic rules above are applied?
    procedure P2 is
        y : integer;
        a : array [7] of real;
    begin
        y := 5;      // Statements not yet done
        a[2] := y;   // Type error?
    end;
On the right is the parse tree, limited to the declarations.
The dotted lines would be solid for a parse tree and connect to the lines below. They are shown dotted so that the figure can be compared to the annotated parse tree immediately below, in which the attributes and their values have been filled in.
In the annotated tree, the attributes shown in red are inherited. Those in black are synthesized.
To start the annotation process, look at the top production in the parse tree. It has an inherited attribute ds.offset, which is set to zero. Since the attribute is inherited, the entry can be placed immediately into the annotated tree.
We now do the left child and fill in its inherited attribute.
When the Euler tour comes back up the tree, the synthesized attributes are evaluated and recorded.
Since records can essentially have a bunch of declarations inside, we only need to add

    T → RECORD { D }

to get the syntax right. For the semantics we need to push the environment and offset onto stacks since the namespace inside a record is distinct from that on the outside. The width of the record itself is the final value of (the inner) offset, which in turn is the value of totalSize at the root when the inner scope is concluded.
    T → record {   { Env.push(top); top = new Env(); Stack.push(offset); offset = 0; }
          D }      { T.type = record(top); T.width = offset; top = Env.pop(); offset = Stack.pop(); }
This does not apply directly to the class grammar, which does not have records.
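Nevertheless, the push/pop idea itself is easy to sketch. The Python fragment below (illustrative names only, not the book's code) saves the current table and offset on entry to a record and restores them on exit, returning the record's type and width.

```python
# Save the current table and offset, build the record's own (inner) table,
# and restore the enclosing scope on exit.
env_stack, offset_stack = [], []
top, offset = {}, 0            # the "top" symbol table and the running offset

def enter_record():
    global top, offset
    env_stack.append(top)
    offset_stack.append(offset)
    top, offset = {}, 0        # fresh namespace and fresh offset for the record

def leave_record():
    global top, offset
    record_type, record_width = ("record", dict(top)), offset
    top = env_stack.pop()      # restore the enclosing scope
    offset = offset_stack.pop()
    return record_type, record_width

enter_record()
top["x"] = {"type": "integer", "offset": offset}
offset += 4                    # as a declaration inside the record would do
print(leave_record())          # (('record', {'x': ...}), 4)
```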
This same technique would be used for other examples of nested scope, e.g., nested procedures/functions and nested blocks. To have nested procedures/functions, we need other alternatives for declaration: procedure/function definitions. Similarly if we wanted to have nested blocks we would add another alternative to statement.
    s → ks | ids | block-stmt
    block-stmt → DECLARE ds BEGIN ss END ;
If we wanted to generate code for nested procedures or nested blocks, we would need to stack the symbol table as done above and in the text.
Homework: 1.
Production | Semantic Rule |
---|---|
e → t | e.addr = t.addr; e.code = t.code |
e → e1 ADDOP t | e.addr = new Temp(); e.code = e1.code || t.code || gen(e.addr = e1.addr ADDOP.lexeme t.addr) |
t → f | t.addr = f.addr; t.code = f.code |
t → t1 MULOP f | t.addr = new Temp(); t.code = t1.code || f.code || gen(t.addr = t1.addr MULOP.lexeme f.addr) |
f → ( e ) | f.addr = e.addr; f.code = e.code |
f → NUM | f.addr = get(NUM.lexeme); f.code = "" |
f → ID is | (assume indices is ε) f.addr = get(ID.lexeme); f.code = "" |
The goal is to generate 3-address code for scalar expressions, i.e., arrays are not treated in this section (they will be shortly). Specifically, indices is assumed to be ε. We use is to abbreviate indices.
We generate the 3-address code using the natural notation of Section 6.2. In fact we assume there is a function gen() that, given the pieces needed, does the proper formatting, so gen(x = y + z) will output the corresponding 3-address code. gen() is often called with addresses other than lexemes, e.g., temporaries and constants. The constructor Temp() produces a new address in whatever format gen() needs. Hopefully this will be clear in the table to the right and the others that follow.
We will use two attributes code and address (addr). The key objective at each node of the parse tree for an expression is to produce values for the code and addr attributes so that the following crucial invariant is maintained.
If the code in the attribute code is executed, then the address addr contains the value of the (sub-)expression rooted at this node. In particular, after TheRoot.code is executed, the address TheRoot.addr contains the value of the entire expression.
Said another way, the attribute addr at a node is the address that holds the value calculated by the code at the node. Recall that unlike real code for a real machine our 3-address code doesn't reuse temporary addresses.
As one would expect for expressions, all the attributes in the table to the right are synthesized. The table is for the expression part of the class grammar. To save space we use ID for IDENTIFIER, e for expression, t for term, and f for factor.
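Before combining SDDs, it may help to see the addr/code discipline in miniature. The Python sketch below (gen, new_temp, and the (addr, code) pair representation are assumptions, not the lab's API) builds the code for (x + y) * 5 bottom-up, in the spirit of the table.

```python
# Each node is an (addr, code) pair; a binary operator node concatenates the
# children's code and appends one new 3-address instruction into a new temp.
temps = 0
def new_temp():
    global temps
    temps += 1
    return f"T{temps}"

def gen(stmt):
    return stmt + "\n"

def binop(left, op, right):
    """left/right are (addr, code) pairs; returns the pair for the parent node."""
    addr = new_temp()
    code = left[1] + right[1] + gen(f"{addr} = {left[0]} {op} {right[0]}")
    return addr, code

# Leaves (IDs and NUMs) have empty code and their own lexeme as the address.
x, y, five = ("x", ""), ("y", ""), ("5", "")
addr, code = binop(binop(x, "+", y), "*", five)    # (x + y) * 5
print(code + "value is in " + addr)
```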
The SDDs for declarations and scalar expressions can be easily combined by essentially concatenating them as shown here.
We saw this in chapter 2.
The method in the previous section generates long strings as we walk the tree. By using SDTs instead of SDDs, you can output parts of the string as each node is processed.
Production | Semantic Rules |
---|---|
pd → PROC np IS ds BEG ss END ; | ds.offset = 0 |
np → di ( ps ) | di | not used yet |
ds → d ds1 | d.offset = ds.offset; ds1.offset = d.newoffset; ds.totalSize = ds1.totalSize |
ds → ε | ds.totalSize = ds.offset |
d → di : ty ; | addType(di.entry, ty.type); addBaseType(di.entry, ty.basetype); addSize(di.entry, ty.size); addOffset(di.entry, d.offset); d.newoffset = d.offset + ty.size |
di → ID | di.entry = ID.entry |
ty → ARRAY [ NUM ] OF ty1 | ty.type = array(NUM.value, ty1.type); ty.basetype = ty1.basetype; ty.size = NUM.value * ty1.size |
ty → INT | ty.type = integer; ty.basetype = integer; ty.size = 4 |
ty → REAL | ty.type = real; ty.basetype = real; ty.size = 8 |
The idea is to associate the base address with the array name. That is, the offset stored in the identifier table entry for the array is the address of the first element. When an element is referenced, the indices and the array bounds are used to compute the amount, often called the offset (unfortunately, we have already used that term), by which the address of the referenced element differs from the base address.
To implement this technique, we must first store the base type of each identifier in the identifier table. We use this basetype to determine the size of each element of the array. For example, consider
    arr : array [10] of integer;  x : real ;

Our previous SDD for declarations calculates the size and type of each identifier. For arr these are 40 and array(10,integer), respectively. The enhanced SDD on the right calculates, in addition, the base type. For arr this is integer. For a scalar, such as x, the base type is the same as the type, which in the case of x is real. The new material is shaded in blue.
This SDD is combined with the expression SDD here.
Calculating the address of an element of a one dimensional array is easy. The address increment is the width of each element times the index (assuming indices start at 0). So the address of A[i] is the base address of A, which is the offset component of A's entry in the identifier table, plus i times the width of each element of A.
The width of each element of an array is the width of what we have called the base type of the array. For a scalar, there is just one element and its width is the width of the type, which is the same as the base type. Hence, for any ID the element width is sizeof(getBaseType(ID.entry)).
For convenience, we define getBaseWidth by the formula
getBaseWidth(ID.entry) = sizeof(getBaseType(ID.entry)) = sizeof(ID.entry.baseType)
Let us assume row major ordering. That is, the first element stored is A[0,0], then A[0,1], ... A[0,k-1], then A[1,0], ... . Modern languages use row major ordering.
With the alternative column major ordering, after A[0,0] comes A[1,0], A[2,0], ... .
For two dimensional arrays the address of A[i,j] is the sum of three terms: the base address of A; i times the width of a row (i.e., the number of columns times the element width); and j times the element width.
The generalization to higher dimensional arrays is clear.
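As a quick numeric illustration, the row-major address computation is just the following; base, n2 (the number of columns), and w (the element width) are hypothetical parameters, and indices are assumed to start at 0.

```python
# Row-major address of A[i][j]: base + (i * columns + j) * element_width.
def addr_2d(base, i, j, n2, w):
    return base + (i * n2 + j) * w

print(addr_2d(1000, 2, 3, 4, 8))   # A[2][3] in a 4-column array of reals -> 1088
```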
Consider the following expression containing a simple array reference, where a and c are integers and b is a real array.
    a = b[3*c]

We want to generate code something like

    T1 = #3 * c    // i.e., mult T1,#3,c
    T2 = T1 * 8    // each b[i] is size 8
    a  = b[T2]     // uses the x[i] special form

If we considered it too easy to use that special form, we would generate something like

    T1 = #3 * c
    T2 = T1 * 8
    T3 = &b
    T4 = T2 + T3
    a  = *T4
Production | Semantic Rules |
---|---|
f → ID i | (with the x[j] special form) f.t1 = new Temp(); f.addr = new Temp(); f.code = i.code || gen(f.t1 = i.addr * getBaseWidth(ID.entry)) || gen(f.addr = get(ID.lexeme)[f.t1]) |
f → ID i | (without the x[j] special form) f.t1 = new Temp(); f.t2 = new Temp(); f.t3 = new Temp(); f.addr = new Temp(); f.code = i.code || gen(f.t1 = i.addr * getBaseWidth(ID.entry)) || gen(f.t2 = &get(ID.lexeme)) || gen(f.t3 = f.t2 + f.t1) || gen(f.addr = *f.t3) |
i → [ e ] | i.addr = e.addr; i.code = e.code |
To permit arrays in expressions, we need to specify the semantic actions for the production
factor → IDENTIFIER indices
As a warm-up, let's start with references to one-dimensional arrays. That is, instead of the above production, we consider the simpler

factor → IDENTIFIER index

The table on the right does this in two ways, both with and without using the special addressing form x[j]. In the table the nonterminal index is abbreviated i. I included the version without the x[j] special form for two reasons.
Note that by avoiding the special form b=x[j], I ended up
using two other special forms.
Is it possible to avoid the special forms?
Lisp is taught in our programming languages course, which is a prerequisite for compilers. If you no longer remember Lisp, don't worry; the relevant point is that Lisp evaluation follows one uniform rule, except for a few special forms that are evaluated differently. We just (optionally) saw an exception to the basic Lisp evaluation rule. A similar exception occurs with x[j] in our three-address code. It is a special form in that, unlike the normal rules for three-address code, we don't use the address of j but instead its value. Specifically, the value of j is added to the address of x.
The rules for addresses in 3-address code also include

    a = &b      a = *b      *a = b

which are other special forms. They have the same meaning as in the C programming language.
Our current SDD includes the production

    f → ID is

with the added assumption that indices is ε. We now want to permit indices to be a single index as well as ε. That is, we replace the above production with the pair

    f → ID
    f → ID i
The semantic rules for each case were given in previous tables. The ensemble to date is given here.
On the board evaluate e.code for the RHS of the simple example above: a=b[3*c].
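For reference, here is a small self-contained Python sketch that emits the three lines shown earlier for this example; the counter and gen() are stand-ins for the helpers described above, and the width 8 reflects b being an array of reals.

```python
# Emits the target code for the RHS of a = b[3*c] and the final store.
count = 0
lines = []

def new_temp():
    global count
    count += 1
    return f"T{count}"

def gen(stmt):
    lines.append(stmt)

t = new_temp();  gen(f"{t} = #3 * c")       # code for the index expression 3*c
t1 = new_temp(); gen(f"{t1} = {t} * 8")     # times the element width of b
gen(f"a = b[{t1}]")                         # the x[j] special form, then the store
print("\n".join(lines))
```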
This is an exciting moment. At long last we really seem to be compiling!
As mentioned above, in the general case we must process the production

    f → IDENTIFIER is

Following the method used in the 1D case, we need to construct is.code. The basic idea is shown here.
Now that we can evaluate expressions (including one-dimensional array references), we need to handle the left-hand side of an assignment statement (which can also be an array reference). Specifically, we need semantic actions for the following productions from the class grammar.
    id-statement → ID rest-of-either
    rest-of-either → rest-of-assignment
    rest-of-assignment → := expression ;
    rest-of-assignment → indices := expression
Production | Semantic Rules |
---|---|
ss → s ss1 | ss.code = s.code || ss1.code |
ss → ε | ss.code = "" |
s → ids | s.code = ids.code |
ids → IDENTIFIER re | re.id = IDENTIFIER.entry; ids.code = re.code |
re → ra | ra.id = re.id; re.code = ra.code |
ra → := e ; | ra.code = e.code || gen(ra.id.lexeme = e.addr) |
ra → i := e ; | ra.t1 = new Temp(); ra.code = i.code || e.code || gen(ra.t1 = i.addr * getBaseWidth(ra.id)) || gen(ra.id.lexeme[ra.t1] = e.addr) |
Once again we begin by restricting ourselves to one-dimensional arrays, which corresponds to replacing indices by index in the last production. The SDD for this restricted case is shown on the right.
The first three productions reduce statements (ss) and statement (s) to identifier-statement (ids), which is used for statements, such as assignment and procedure invocation, that begin with an identifier. The corresponding semantic rules simply concatenate all the code produced into the top-level ss.code.
The identifier-statement production captures the ID and sends it down to the appropriate rest-of-assignment (ra) production where the necessary code is generated and passed back up.
The simple ra → := e ; production generates ra.code by appending, to the code that evaluates the RHS, the natural assignment with the identifier as the LHS.
It is instructive to compare ra.code for the ra → i := e production with f.code for the f → ID i production in the expression SDD. Both compute the same offset (index * elementSize) and both use a special form for the address: x[j] = y in this section, and y = x[j] for expressions.
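A Python sketch of the ra → i := e ; rule (assumed helper names, not the lab API) makes the comparison concrete; note that the special form now appears on the left-hand side.

```python
# Same offset computation as the expression case, but ending with x[j] = y.
count = 0

def new_temp():
    global count
    count += 1
    return f"T{count}"

def gen(stmt):
    return stmt + "\n"

def array_assign(id_lexeme, base_width, index, rhs):
    """index and rhs are (addr, code) pairs, as in the expression SDD."""
    t1 = new_temp()                                     # ra.t1
    return (index[1] + rhs[1]
            + gen(f"{t1} = {index[0]} * {base_width}")  # index * element size
            + gen(f"{id_lexeme}[{t1}] = {rhs[0]}"))     # x[j] = y special form

print(array_assign("x", 8, ("j", ""), ("y", "")), end="")
```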
Incorporating these productions and semantic rules gives this SDD. Note that we have added a semantic rule to the procedure-def (pd) production that simply sends ss.code to pd.code, so that we achieve the overall goal: pd.code holds the entire code needed for the procedure.
The idea is the same as when a multidimensional array appears in an expression. Specifically,
Recall the program we could partially handle.
    procedure P2 is
        y : integer;
        a : array [7] of real;
    begin
        y := 5;      // Statements not yet done
        a[2] := y;   // Type error?
    end;
Now we can do the statements.
Homework: What code is generated for the program written above? Please remind me to go over this homework next class.
What should we do about the possible type error?
Let's take the last option.