Start Lecture #9
Production | Semantic Rules |
---|---|
E → E1 + T | E.node = new Node('+',E1.node,T.node) |
E → E1 - T | E.node = new Node('-',E1.node,T.node) |
E → T | E.node = T.node |
T → ( E ) | T.node = E.node |
T → ID | T.node = new Leaf(ID,ID.entry) |
T → NUM | T.node = new Leaf(NUM,NUM.val) |
Recall that a syntax tree (technically an abstract syntax tree) contains just the essential nodes. For example, 7+3*5 would have one + node, one * node, and the three numbers. Let's see how to construct the syntax tree from an SDD.
Assume we have two functions Leaf(op,val) and Node(op,c1,...,cn), which create leaves and interior nodes, respectively, of the syntax tree. Leaf is called for terminals: op is the label of the node (op for operation) and val is the lexical value of the token. Node is called for nonterminals, and the ci's are pointers to the children.
Production | Semantic Rules | Type |
---|---|---|
E → T E' | E.node = E'.syn | Synthesized |
 | E'.node = T.node | Inherited |
E' → + T E'1 | E'1.node = new Node('+',E'.node,T.node) | Inherited |
 | E'.syn = E'1.syn | Synthesized |
E' → - T E'1 | E'1.node = new Node('-',E'.node,T.node) | Inherited |
 | E'.syn = E'1.syn | Synthesized |
E' → ε | E'.syn = E'.node | Synthesized |
T → ( E ) | T.node = E.node | Synthesized |
T → ID | T.node = new Leaf(ID,ID.entry) | Synthesized |
T → NUM | T.node = new Leaf(NUM,NUM.val) | Synthesized |
The upper table on the right shows a left-recursive grammar that is S-attributed (so all attributes are synthesized).
Try this for x-2+y and see that we get the syntax tree.
When we eliminate the left recursion, we get the lower table on the right. It is a good illustration of dependencies. Follow it through and see that you get the same syntax tree as for the left-recursive version.
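As a concrete illustration, here is a minimal Python version of the Leaf and Node constructors described above, together with the calls the S-attributed rules would make for x-2+y. Only the constructor names come from the SDD; the class layout is my own sketch.

```python
class Leaf:
    def __init__(self, op, val):
        self.op, self.val = op, val          # op = token class, val = lexical value / symbol-table entry

class Node:
    def __init__(self, op, *children):
        self.op, self.children = op, children

# Building the syntax tree for x-2+y by hand, in the order the
# semantic rules would fire (bottom-up, left to right):
t_x   = Leaf('ID', 'x')                      # T -> ID
t_2   = Leaf('NUM', 2)                       # T -> NUM
minus = Node('-', t_x, t_2)                  # E -> E - T
t_y   = Leaf('ID', 'y')                      # T -> ID
root  = Node('+', minus, t_y)                # E -> E + T
```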
Remarks:
This course emphasizes top-down parsing (at least for the labs) and hence we must eliminate left recursion. The resulting grammars often need inherited attributes, since operations and operands are in different productions. But sometimes the language itself demands inherited attributes. Consider two ways to declare a 3x4, two-dimensional array. (Strictly speaking, these are one-dimensional arrays of one-dimensional arrays.)
array [3] of array [4] of int and int[3][4]
Assume that we want to produce a tree structure like the one on the right for either of the array declarations on the left. The tree structure is generated by calling a function array(num,type). Our job is to create an SDD so that the function gets called with the correct arguments.
Production | Semantic Rule |
---|---|
A → ARRAY [ NUM ] OF A1 | A.t=array(NUM.val,A1.t) |
A → INT | A.t=integer |
A → REAL | A.t=real |
For the first language representation of arrays (found in Ada and
in lab 3), it is easy to generate an S-attributed
(non-left-recursive) grammar based on
A → ARRAY [ NUM ] OF A | INT | REAL
This is shown in the upper table on the right.
On the board draw the parse tree and see that simple synthesized attributes above suffice.
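As a sanity check, here is a tiny Python simulation of the S-attributed rules applied bottom-up to array [3] of array [4] of int. The names array, integer, and real come from the semantic rules; the tuple encoding of types is just a placeholder.

```python
def array(num, t):                 # the constructor named in the semantic rules
    return ('array', num, t)

integer, real = 'integer', 'real'

# A -> INT                  gives  A.t = integer
inner = integer
# A -> ARRAY [ 4 ] OF A1    gives  A.t = array(4, A1.t)
inner = array(4, inner)
# A -> ARRAY [ 3 ] OF A1    gives  A.t = array(3, A1.t)
result = array(3, inner)           # ('array', 3, ('array', 4, 'integer'))
```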
Production | Semantic Rules |
---|---|
T → B C | T.t = C.t |
 | C.b = B.t |
B → INT | B.t = integer |
B → REAL | B.t = real |
C → [ NUM ] C1 | C.t = array(NUM.val,C1.t) |
 | C1.b = C.b |
C → ε | C.t = C.b |
For the second language representation of arrays (the C-style), we need some smarts (and some inherited attributes) to move the int all the way to the right. Fortunately, the result, shown in the table on the right, is L-attributed and therefore all is well.
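To see the inherited attribute b at work, here is a hand simulation (a sketch only, reusing the array constructor from above) of these rules on int [3][4]; b carries the base type down the chain of C's and t builds the array type on the way back up.

```python
def array(num, t):
    return ('array', num, t)

# T -> B C  with  B -> INT
B_t = 'integer'
C_b = B_t                          # inherited: C.b = B.t   (flows down)

# C -> [ 3 ] C1 :  C1.b = C.b, later C.t = array(3, C1.t)
C1_b = C_b
#   C1 -> [ 4 ] C2 :  C2.b = C1.b, later C1.t = array(4, C2.t)
C2_b = C1_b
#     C2 -> epsilon :  C2.t = C2.b
C2_t = C2_b
C1_t = array(4, C2_t)              # synthesized, flows back up
C_t  = array(3, C1_t)
T_t  = C_t                         # ('array', 3, ('array', 4, 'integer'))
```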
Note that, instead of a third column stating whether the attribute is synthesized or inherited, I have adopted the convention of drawing the inherited attribute definitions with a pink background.
Also note that this is not necessary.
That is, one can look at a production and the associated semantic
rule and determine if the attribute is inherited or synthesized.
How is this done?
Answer: If the attribute being defined (i.e., the one on the LHS of
the semantic rule) is associated with the nonterminal on the LHS of
the production, the attribute is synthesized.
If the attribute being defined is associated with a nonterminal on
the RHS of the production, the attribute is inherited.
Homework: 1.
Basically skipped.
The idea is that instead of the SDD approach, which requires that we build a parse tree and then perform the semantic rules in an order determined by the dependency graph, we can attach semantic actions to the grammar (as in chapter 2) and perform these actions during parsing, thus saving the construction of the parse tree.
But except for very simple languages, the tree cannot be eliminated. Modern commercial quality compilers all make multiple passes over the tree, which is actually the syntax tree (technically, the abstract syntax tree) rather than the parse tree (the concrete syntax tree).
If parsing is done bottom up and the SDD is S-attributed, one can generate an SDT with the actions at the end (hence, postfix). In this case the action is performed at the same time as the RHS is reduced to the LHS.
A good summary of the available techniques.
preorder is relevant).
Recall that in fully-general, recursive-descent parsing one normally writes one procedure for each nonterminal.
Assume the SDD is L-attributed.
We are interested in using predictive parsing for LL(1) grammars. In this case we have a procedure P for each production, which is implemented as follows.
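The recipe is that the inherited attributes of the LHS nonterminal become parameters of the procedure and its synthesized attributes become the return value. A minimal Python sketch for the E' productions of the expression SDD above (lookahead(), match(), and T() are assumed parser helpers, and Node is the constructor from the earlier sketch):

```python
def E_prime(node_inh):                     # E'.node is the inherited attribute (the parameter)
    if lookahead() in ('+', '-'):          # E' -> + T E'1   or   E' -> - T E'1
        op = lookahead()
        match(op)
        t_node = T()                       # T's procedure returns its synthesized T.node
        e1_inh = Node(op, node_inh, t_node)   # E'1.node = new Node(op, E'.node, T.node)
        return E_prime(e1_inh)             # E'.syn = E'1.syn
    else:                                  # E' -> epsilon
        return node_inh                    # E'.syn = E'.node
```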
Assume we have the parse tree on the right as produced, for example, by your lab 3. (The numbers should really be a token, e.g. NUM, not the lexemes shown.) You now want to write the semantic analyzer, or intermediate code generator, or lab 4. In any of these cases you have semantic rules or actions (for lab 4 it will be semantic rules) that need to be performed. Assume the SDD or SDT is L-attributed (that is my job for lab 4), so we don't have to worry about dependence loops.
You start to write
analyze-i.e.-traverse (tree-node)
which will be initially called with tree-node=the-root.
The procedure will perform an Euler-tour traversal of the parse tree
and during its visits of the nodes, it will evaluate the relevant
semantic rules.
The visit() procedure is basically a big switch statement where the cases correspond to evaluating the semantic rules for the different productions in the grammar. The tree-node is the LHS of the production and the children are the RHS.
By first switching on the tree-node and then inspecting enough of the children, visit() can tell which production the tree-node corresponds to and which semantic rules to apply.
As described in 5.5.1 above, visit() has received as parameters (in addition to tree-node), the inherited attributes of the node. The traversal calls itself recursively, with the tree-node argument set to the leftmost child, then calls again using the next child, etc. Each time, the child is passed the inherited attributes.
When each child returns, it passes back its synthesized attributes.
After the last child returns, the parent returns, passing back the synthesized attributes that were calculated.
A programming point is that, since tree-node can be any node, traverse() and visit() must each be prepared to accept as parameters any inherited attribute that any nonterminal can have.
Instead of a giant switch, you could have separate routines for each nonterminal and just switch on the productions having this nonterminal as LHS.
In this case each routine need only be prepared to accept as parameters the inherited attributes that its nonterminal can have.
You could have separate routines for each production and thus each routine has as parameters exactly the inherited attributes that this production receives.
To do this requires knowing which visit() procedure to call for
each nonterminal (child node of the tree).
For example, assume you are processing the production
B → C D
You need to know which production to call for C
(remember that C can be the LHS of many different productions).
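Here is a rough Python sketch of the traversal just described; the parse-tree fields nonterminal, children, token, and value are invented for illustration, and I use the array-declaration grammar from the table above as the example case. Inherited attributes come in as keyword arguments and synthesized attributes are returned.

```python
def traverse(tree_node, **inherited):
    return visit(tree_node, **inherited)

def visit(node, **inh):
    # Big switch: first on the nonterminal, then on enough of the children
    # to identify the production, then apply that production's semantic rules.
    if node.nonterminal == 'A':                      # A -> ARRAY [ NUM ] OF A | INT | REAL
        kids = node.children
        if kids[0].token == 'INT':
            return {'t': 'integer'}                  # synthesized attribute A.t
        if kids[0].token == 'REAL':
            return {'t': 'real'}
        size  = kids[2].value                        # the NUM between the brackets
        child = visit(kids[4], **inh)                # recurse on the nested A, passing inherited attrs
        return {'t': ('array', size, child['t'])}    # A.t = array(NUM.val, A1.t)
    # ... one case per nonterminal (or per production) for the rest of the grammar ...
```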
Homework: Read Chapter 6.
The difference between a syntax DAG and a syntax tree is that the
former can have undirected cycles.
DAGs are useful where there are multiple, identical portions in a
given input.
The common case of this is for expressions where there often are
common subexpressions.
For example in the expression
X + a + b + c - X + ( a + b + c )
each individual variable is a common subexpression.
But a+b+c is not since the first occurrence has the X already
added.
This is a real difference when one considers the possibility of
overflow or of loss of precision.
The easy case is
x + y * z * w - ( q + y * z * w )
where y*z*w is a common subexpression.
It is easy to find such common subexpressions. The constructor Node() above checks if an identical node exists before creating a new one. So Node ('/',left,right) first checks if there is a node with op='/' and children left and right. If so, a reference to that node is returned; if not, a new node is created as before.
Homework: 1.
Often one stores the tree or DAG in an array, one entry per node.
Then the array index, rather than a pointer, is used to reference a
node.
This index is called the node's value-number and the triple
<op, value-number of left, value-number of right>
is called the signature of the node.
When Node(op,left,right) needs to determine if an identical node
exists, it simply searches the table for an entry with the required
signature.
Searching an unordered array is slow; there are many better data structures to use. Hash tables are a good choice.
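Here is a sketch of such a table-based Node(); the dictionary plays the role of the hash table and the tuple (op, left, right) is the signature (the variable names are my own).

```python
nodes = []      # the array of nodes; the index is the value-number
table = {}      # signature -> value-number; a hash table instead of a linear search

def node(op, left, right):
    """left and right are value-numbers of the children (or None for leaves)."""
    signature = (op, left, right)
    if signature in table:              # an identical node already exists
        return table[signature]
    nodes.append(signature)             # otherwise create the node
    vn = len(nodes) - 1                 # its value-number is its array index
    table[signature] = vn
    return vn
```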
Homework: 2.
We will use three-address code, i.e., instructions of the
form op a,b,c, where op is a primitive
operator.
For example
lshift a,b,4 // left shift b by 4 and place result in a
add a,b,c    // a = b + c
a = b + c    // alternate (more natural) representation of above
If we are starting with an expression DAG (or syntax tree if less aggressive), then transforming into 3-address code is just a topological sort and an assignment of a 3-address operation with a new name for the result to each interior node (the leaves already have names and values).
A key point is that nodes in an expression dag (or tree) have at most 2 children so three-address code is easy. As we produce three-address code for various constructs, we may have to generate several instructions to translate one construct.
For example, (B+A)*(Y-(B+A)) produces the DAG on the right, which yields the following 3-address code.
t1 = B + A
t2 = Y - t1
t3 = t1 * t2
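A small Python sketch of that recipe (the list layout and the helper name dag_to_three_address are mine; it assumes the value-number representation from the earlier section, so the array order is already topological):

```python
def dag_to_three_address(nodes):
    """nodes[i] is either a leaf name like 'B' or a signature (op, left, right),
       where left and right are value-numbers of earlier entries."""
    name = {}                                    # value-number -> the name holding its value
    code = []
    for vn, n in enumerate(nodes):
        if isinstance(n, str):                   # a leaf already has a name
            name[vn] = n
        else:
            op, l, r = n
            name[vn] = f"t{len(code) + 1}"       # fresh temporary for each interior node
            code.append(f"{name[vn]} = {name[l]} {op} {name[r]}")
    return code

# (B+A)*(Y-(B+A)) as a DAG: B, A, Y are leaves; the B+A node is shared.
print(dag_to_three_address(['B', 'A', 'Y', ('+', 0, 1), ('-', 2, 3), ('*', 3, 4)]))
# ['t1 = B + A', 't2 = Y - t1', 't3 = t1 * t2']
```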
Notes
We use the terminology 3-address code since instructions in our intermediate language consist of one elementary operation with three operands, each of which is often an address.
Typically two of the addresses represent source operands or
arguments of the operation and the third represents the result.
Some of the 3-address operations have fewer than three addresses; we
simply think of the missing addresses as unused (or ignored) fields
in the instruction.
Consider
Q := Z;   or   A[f(x)+B*D] := g(B+C*h(x,y));
where [] indicates an array reference and () indicates a function call.
From a macroscopic view, performing each assignment involves three tasks: evaluate the LHS to obtain an l-value (a location), evaluate the RHS to obtain an r-value, and store the r-value into the location given by the l-value.
Note the difference between L-values, quantities that can appear on the LHS of an assignment, and R-values, quantities that can appear only on the RHS.
There is no universally agreed to set of three-address instructions, or even whether 3-address code should be the intermediate code for the compiler. Some prefer a set close to a machine architecture. Others prefer a higher-level set closer to the source, for example, subsets of C have been used. Others prefer to have multiple levels of intermediate code in the compiler and define a compilation phase that converts the high-level intermediate code into the low-level intermediate code. What follows is the set proposed in the book.
In the list below, x, y, and z are addresses; i is an integer (not an address); and L is a symbolic label.
The instructions can be thought of as numbered, and the labels can be converted to the numbers with another pass over the output or via backpatching, which is discussed below.
param S
param U
param V
t = call G,2
param t
param W
A = call F,3

This is not important for lab 4 since we do not have functions, and procedure calls cannot be embedded one inside the other the way function calls can.
Indexed copy ops: x = y[i] and x[i] = y.
In the second example, x[i] is the address that is
the-value-of-i locations after the address x; in particular
x[0] is the same address as x.
Similarly for y[i] in the first example.
But a crucial difference is that, since y[i] is on the RHS,
the address is dereferenced and the value in the
address is what is used.
Note that x[i] = y[j] is not permitted as that requires 4 addresses. x[i] = y[i] could be permitted, but is not.
Quads are an easy, almost obvious, way to represent the three address instructions: put the op into the first of four fields and the three addresses into the remaining three fields. Some instructions do not use all the fields. Many operands will be references to entries in tables (e.g., the identifier table).
Homework: 1, 2 (you may use the parse tree instead of the syntax tree if you prefer). You may omit the part about triples.
A triple optimizes a quad by eliminating the result field, since the result is often a temporary.
When this result occurs as a source operand of a subsequent instruction, the source operand is written as the value-number of the instruction yielding this result (distinguished some way, say with parens).
If the result field of a quad is a program name and not a temporary then two triples may be needed: for example, x = y + z would give one triple for + y z and a second copy triple assigning that result, written (0), to x.
When an optimizing compiler reorders instructions for increased performance, extra work is needed with triples since the instruction numbers, which have changed, are used implicitly. Hence the triples must be regenerated with correct numbers as operands.
With Indirect triples we maintain an array of pointers to triples and, if it is necessary to reorder instructions, just reorder these pointers. This has two advantages.
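To make the three representations concrete, here is the earlier example t1 = B + A; t2 = Y - t1; t3 = t1 * t2 written as quads, triples, and indirect triples. The tuple layouts below are just one possible encoding.

```python
# Quads: op, arg1, arg2, result -- the temporaries appear explicitly.
quads = [('+', 'B', 'A', 't1'),
         ('-', 'Y', 't1', 't2'),
         ('*', 't1', 't2', 't3')]

# Triples: no result field; a source operand that is the result of an earlier
# instruction is written as that instruction's number, shown here as ('#', n).
triples = [('+', 'B', 'A'),
           ('-', 'Y', ('#', 0)),
           ('*', ('#', 0), ('#', 1))]

# Indirect triples: an array of pointers (here, indices) into the triple list;
# reordering instructions means permuting this list, not renumbering operands.
instruction_order = [0, 1, 2]
```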
This has become a big deal in modern optimizers, but we will
largely ignore it.
The idea is that you have all assignments go to unique (temporary)
variables.
So if the code is
if x then y=4 else y=5
it is treated as though it was
if x then y1=4 else y2=5
The interesting part comes when y is used later in the program and
the compiler must choose between y1 and y2.
Much of the early part of this section is really about programming languages more than about compilers.
A type expression is either a basic type or the result of applying a type constructor.
There are two camps, name equivalence and structural equivalence.
Consider the following example.
declare
    type MyInteger is new Integer;
    MyX : MyInteger;
    x : Integer := 0;
begin
    MyX := x;
end

This generates a type error in Ada, which has name equivalence, since the types of x and MyX do not have the same name, although they have the same structure.
As another example, consider an object of an anonymous
type
as in
X : array [5] of integer;
Since the type of X does not have a name,
X does not have the same type as any other object, not even Y
declared as
Y : array [5] of integer;
However, X[2] has the same type as Y[3]; both are integers.
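As a sketch of how the two camps differ in code (using the same tuple encoding of array types as in the earlier examples; under name equivalence one would compare declared type names instead):

```python
def structurally_equivalent(t1, t2):
    # Basic types are equivalent only to themselves.
    if isinstance(t1, str) or isinstance(t2, str):
        return t1 == t2
    # Otherwise both are constructed types: same constructor, same bound,
    # and structurally equivalent element types.
    con1, n1, elem1 = t1
    con2, n2, elem2 = t2
    return con1 == con2 and n1 == n2 and structurally_equivalent(elem1, elem2)

x_type = ('array', 5, 'integer')   # X : array [5] of integer;
y_type = ('array', 5, 'integer')   # Y : array [5] of integer;
print(structurally_equivalent(x_type, y_type))   # True, even though both types are anonymous
```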
The following example from the 2e uses an unusual C/Java-like array notation. (The 1e had Pascal-like notation.) Although I prefer Ada-like constructs as in lab 3, I realize that the class knows C/Java best, so, like the authors, I will sometimes follow the 2e as well as presenting lab3-like grammars. I will often refer to lab3-like grammars as the class grammar.
The grammar below gives C/Java-like records/structs/methodless-classes as well as multidimensional arrays (really singly dimensioned arrays of singly dimensioned arrays).
D → T id ; D | ε
T → B C | RECORD { D }
B → INT | FLOAT
C → [ NUM ] C | ε
Note that an example sentence derived from D is
int [5] x ;
which is not legal C, but does have the virtue that the type is separate from the identifier being declared.
The class grammar doesn't support records. This part of the class grammar declares ints, reals, arrays, and user-defined types.
declarations → declaration declarations | ε
declaration → defining-identifier : type ; | TYPE defining-identifier IS type ;
defining-identifier → IDENTIFIER
type → INT | REAL | ARRAY [ NUMBER ] OF type | IDENTIFIER

So that the tables below are not too wide, let's use shorter names for the nonterminals. Specifically, we abbreviate declaration as d, declarations as ds, defining-identifier as di, and type as ty (unfortunately, we have already used t to abbreviate term).
For now we ignore the second possibility for a declaration (declaring a type itself).
ds → d ds | ε
d → di : ty ;
di → ID
ty → INT | REAL | ARRAY [ NUMBER ] OF ty
It is useful to support user-declared types. For example
type vector5 is array [5] of real;
v5 : vector5;

The full class grammar does support this.
ds → d ds | ε
d → di : ty ; | TYPE di IS ty ;
di → ID
ty → INT | REAL | ARRAY [ NUMBER ] OF ty | ID
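One plausible way to handle the two new ID alternatives is a small type table: a TYPE declaration enters the name with its type expression, and ty → ID looks the name up. The dictionary and function names below are only an illustration, not the lab interface.

```python
type_table = {}

def declare_type(name, ty):        # d -> TYPE di IS ty ;
    type_table[name] = ty

def lookup_type(name):             # ty -> ID
    return type_table[name]

declare_type('vector5', ('array', 5, 'real'))   # type vector5 is array [5] of real;
v5_type = lookup_type('vector5')                # v5 : vector5;  gives ('array', 5, 'real')
```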
Ada supports both constrained array types such as

type t1 is array [5] of integer;

and unconstrained array types such as

type t2 is array of integer;

With the latter, the constraint is specified when the array (object) itself is declared.

x1 : t1
x2 : t2[5]
You might wonder why we want the unconstrained type. These types permit a procedure to have a parameter that is an array of integers of unspecified size. Remember that the declaration of a procedure specifies only the type of the parameter; the object is determined at the time of the procedure call.