Compilers

Start Lecture #9

When we eliminate the left recursion, we get the lower table on the right. It is a good illustration of dependencies. Follow it through and see that you get the same syntax tree as for the left-recursive version.

Remarks:

  1. You probably did, are doing, or will do some variant of new Node and new Leaf for lab 3 (a sketch follows these remarks). When processing a production:
    i. Create a parse tree node for the LHS.
    ii. Call subroutines for the RHS symbols and connect the resulting nodes to the node created in i.
    iii. Return a reference to the new node so the parent can hook it into the parse tree.
  2. It is the lack of a call to new in the third and fourth productions (of the first, left recursive, grammar) that causes the (abstract) syntax tree to be produced rather than the parse (concrete syntax) tree.
  3. Production compilers do not produce parse trees; rather, they produce syntax trees. The syntax tree is smaller, and hence more (space and time) efficient for subsequent passes that walk the tree. The parse tree is (I believe) slightly easier to construct since you don't have to decide which nodes to produce; you simply produce them all.
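
For remark 1, here is a minimal sketch in Java; the grammar and the names (Node, Leaf, parseTerm(), match(), lookahead) are hypothetical, not taken from the labs:

    // Sketch for a production  expr -> term + expr  |  term
    Node parseExpr() {
        Node n = new Node("expr");              // i.   node for the LHS
        n.addChild(parseTerm());                // ii.  subroutine for an RHS symbol
        if (lookahead == '+') {
            n.addChild(new Leaf(match('+')));   // ii.  leaf for a terminal
            n.addChild(parseExpr());
        }
        return n;                               // iii. reference for the parent
    }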

5.3.2: The Structure of a Type

This course emphasizes top-down parsing (at least for the labs) and hence we must eliminate left recursion. The resulting grammars often need inherited attributes, since operations and operands are in different productions. But sometimes the language itself demands inherited attributes. Consider two ways to declare a 3x4, two-dimensional array.

[Figure: tree representation of the array type]

    array [3] of array [4] of int    and     int[3][4]
  

Assume that we want to produce a tree structure like the one in the figure above for either of the array declarations. The tree structure is generated by calling a function array(num,type). Our job is to create an SDD so that the function gets called with the correct arguments.
Production                  Semantic Rule
A → ARRAY [ NUM ] OF A1     A.t = array(NUM.val, A1.t)
A → INT                     A.t = integer
A → REAL                    A.t = real

For the first language representation of arrays (found in Ada and in lab 3), it is easy to generate an S-attributed (non-left-recursive) grammar based on
A → ARRAY [ NUM ] OF A | INT | REAL
This is shown in the table above.

On the board draw the parse tree and see that simple synthesized attributes above suffice.
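
For concreteness, here is a minimal sketch in Java of node classes that the function array(num, type) might construct; the class and field names are my own, not from the text:

    // Hypothetical classes for the type trees above.
    abstract class Type { }

    class Basic extends Type {                  // integer, real
        final String name;
        Basic(String name) { this.name = name; }
    }

    class Array extends Type {                  // array(num, type)
        final int num;                          // number of elements
        final Type elem;                        // element type
        Array(int num, Type elem) { this.num = num; this.elem = elem; }
    }

Evaluating the S-attributed rules bottom-up for array [3] of array [4] of int then builds new Array(3, new Array(4, new Basic("integer"))).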

Production              Semantic Rules
T → B C                 T.t = C.t
                        C.b = B.t                     (inherited)
B → INT                 B.t = integer
B → REAL                B.t = real
C → [ NUM ] C1          C.t = array(NUM.val, C1.t)
                        C1.b = C.b                    (inherited)
C → ε                   C.t = C.b

For the second language representation of arrays (the C-style), we need some smarts (and some inherited attributes) to move the int all the way to the right. Fortunately, the result, shown in the table above, is L-attributed and therefore all is well.
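
For example, for int[3][4] (parsed as T → B C, B → INT, C → [3] C1, C1 → [4] C2, C2 → ε) the rules fire in the following order:

    B.t  = integer
    C.b  = B.t  = integer                      (inherited)
    C1.b = C.b  = integer                      (inherited)
    C2.b = C1.b = integer                      (inherited)
    C2.t = C2.b = integer                      (C2 → ε)
    C1.t = array(4, C2.t) = array(4, integer)
    C.t  = array(3, C1.t) = array(3, array(4, integer))
    T.t  = C.t = array(3, array(4, integer))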

Note that, instead of a third column stating whether each attribute is synthesized or inherited, I have adopted the convention of marking the inherited attribute definitions with the tag (inherited).

Also note that this marking is not necessary; one can look at a production and the associated attribute definition and determine whether the attribute is inherited or synthesized.
How is this done?
Answer: If the attribute being defined (i.e., the one on the LHS of the definition) is associated with the nonterminal on the LHS of the production, the attribute is synthesized. If the attribute being defined is associated with a nonterminal on the RHS of the production, the attribute is inherited.

Homework: 1.

5.4: Syntax-Directed Translation Schemes (SDTs)

Basically skipped.

The idea is that instead of the SDD approach, which requires that we build a parse tree and then perform the semantic rules in an order determined by the dependency graph, we can attach semantic actions to the grammar (as in chapter 2) and perform these actions during parsing, thus saving the construction of the parse tree.

But except for very simple languages, the tree cannot be eliminated. Modern commercial quality compilers all make multiple passes over the tree, which is actually the syntax tree (technically, the abstract syntax tree) rather than the parse tree (the concrete syntax tree).

5.4.1: Postfix Translation Schemes

If parsing is done bottom up and the SDD is S-attributed, one can generate an SDT with the actions at the end (hence, postfix). In this case the action is performed at the same time as the RHS is reduced to the LHS.
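
For example, a postfix SDT for part of an expression grammar might look like the following (a sketch in the style of chapter 2, not taken from this section); each action executes exactly when the corresponding reduction is performed:

    E → E1 + T   { E.val = E1.val + T.val }
    E → T        { E.val = T.val }
    T → NUM      { T.val = NUM.val }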

5.4.2: Parser-Stack Implementation of Postfix SDTs

Skipped.

5.4.3: SDTs with Actions Inside Productions

Skipped.

5.4.4: Eliminating Left Recursion from SDTs

Skipped.

5.4.5: SDTs For L-Attributed Definitions

Skipped.

5.5: Implementing L-Attributed SDD's

A good summary of the available techniques.

  1. Build the parse tree and annotate. Works as long as no cycles are present (guaranteed if the SDD is L- or S-attributed).
  2. Build the parse tree, add actions, and execute the actions in preorder. Works for any L-attributed definition. Actions can be added based on the semantic rules of the SDD. (Since the actions are leaves of the tree, I don't see why preorder is relevant.)
  3. Translate During Recursive Descent Parsing. See below.
  4. Generate Code on the Fly. Also uses recursive descent, but is restrictive.
  5. Implement an SDT during LL-parsing. Skipped.
  6. Implement an SDT during LR-parsing of an LL Language. Skipped.

5.5.1: Translation During Recursive-Descent Parsing

Recall that in recursive-descent parsing there is one procedure for each nonterminal. Assume the SDD is L-attributed.

Also assume the grammar is LL(1) so that we can employ predictive parsing and not have to worry about backtracking. In this case we have a procedure for each production.

Then the procedure P for a given production is implemented as follows.

  1. The procedure is passed the inherited attributes of its LHS nonterminal.
  2. The procedure defines a variable for each attribute it will calculate. These are the inherited attributes of the nonterminals in the RHS and the synthesized attributes of the nonterminal constituting the LHS.
  3. Process in left-to-right order each nonterminal B appearing in the body of the production as follows.
    1. Calculate the inherited attributes of B. Since the SDD is L-attributed, these attributes do not depend on synthesized attributes for nonterminals to the right of B (which have not yet been processed).
    2. Call the procedure Q associated with this use of the nonterminal B. Since the grammar is LL(1) and we are using predictive parsing, we know what production with LHS B will be used here.
    3. The procedure Q returns to P the synthesized attributes of B, which may be needed for nonterminals to the right of B.
  4. The procedure P calculates the synthesized attributes of the nonterminal constituting its LHS.
  5. P returns these synthesized attributes to its caller (which is the procedure associated with its parent).
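
As a concrete sketch of this recipe, here is what the procedure for C (from the array-type SDD of 5.3.2) might look like in Java, using the Type and Array classes sketched there. The Parser class, its tiny scanner, and all names are my own assumptions, not a prescribed implementation:

    // C -> [ NUM ] C1 | epsilon, with inherited attribute b and
    // synthesized attribute t (the return value).
    class Parser {
        private final String input;   // e.g. "[3][4]"
        private int pos = 0;
        Parser(String input) { this.input = input; }

        private char lookahead() {
            return pos < input.length() ? input.charAt(pos) : '$';
        }
        private void match(char c) {
            if (lookahead() != c) throw new IllegalStateException("expected " + c);
            pos++;
        }
        private int numValue() {      // scan a NUM token
            int start = pos;
            while (pos < input.length() && Character.isDigit(input.charAt(pos))) pos++;
            return Integer.parseInt(input.substring(start, pos));
        }

        Type parseC(Type b) {             // step 1: b is the inherited C.b
            if (lookahead() == '[') {     // predict C -> [ NUM ] C1
                match('[');
                int n = numValue();       // NUM.val
                match(']');
                Type t1 = parseC(b);      // steps 3a-3c: C1.b = C.b; t1 = C1.t
                return new Array(n, t1);  // steps 4-5: C.t = array(NUM.val, C1.t)
            }
            return b;                     // predict C -> epsilon: C.t = C.b
        }
    }

For instance, new Parser("[3][4]").parseC(new Basic("integer")) returns the tree for array(3, array(4, integer)).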

5.5.2: On-the-fly Code Generation

Skipped.

5.5.3: L-attributed SDDs and LL Parsing

Skipped.

5.5.4: Bottom-Up Parsing of L-Attributed SDDs

Basically skipped. It is interesting that this bottom-up technique requires an LL (not just LR) language.

5.A: What Is This All Used For?

Assume we have a parse tree as produced, for example, by your lab3. You now want to write the semantic analyzer, or intermediate code generator, or lab 4. In any of these cases you have semantic rules or actions (for lab4 it will be semantic rules) that need to be performed. Assume the SDD or SDT is L-attributed (that is my job for lab4), so we don't have to worry about dependence loops.

You start to write
      analyze (tree-node)
which will be initially called with tree-node=root. The procedure will perform an Euler-tour traversal of the parse tree and during its visits of the nodes, it will evaluate the relevant semantic rules.

This procedure (called visit() below) is basically a big switch statement where the cases correspond to the different productions in the grammar. The tree-node is the LHS of the production and its children are the RHS.

By first switching on the tree-node and then inspecting enough of the children, visit() can tell which production the tree-node corresponds to and which semantic rules to apply.

As described in 5.5.1 above, visit() receives as parameters (in addition to tree-node) the inherited attributes of the node. The traversal calls itself recursively, with the tree-node argument set to the leftmost child, then calls again using the next child, etc. Each time, the child is passed its inherited attributes.

When each child returns, it passes back its synthesized attributes.

After the last child returns, the parent returns, passing back the synthesized attributes that were calculated.

A programming point is that, since tree-node can be any node, visit() must be prepared to accept as parameters any inherited attribute that any nonterminal can have.
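
A minimal sketch of this traversal in Java, specialized to the array-type SDD of 5.3.2 and reusing the Type, Basic, and Array classes sketched there. TreeNode and every name here are hypothetical, not from the labs:

    class TreeNode {
        String production;   // e.g. "T->BC", "B->INT", "C->[NUM]C", "C->eps"
        TreeNode[] children;
        int numVal;          // NUM.val, for productions containing a NUM
    }

    class Analyzer {
        // The inherited attribute (here the type b) comes in as a parameter;
        // the synthesized attribute (the type t) is the return value.
        Type visit(TreeNode n, Type b) {
            switch (n.production) {
            case "T->BC": {
                Type bt = visit(n.children[0], null);  // B.t (B inherits nothing)
                return visit(n.children[1], bt);       // C.b = B.t; T.t = C.t
            }
            case "B->INT":
                return new Basic("integer");           // B.t = integer
            case "B->REAL":
                return new Basic("real");              // B.t = real
            case "C->[NUM]C": {
                Type ct = visit(n.children[0], b);     // C1.b = C.b
                return new Array(n.numVal, ct);        // C.t = array(NUM.val, C1.t)
            }
            case "C->eps":
                return b;                              // C.t = C.b
            default:
                throw new IllegalStateException(n.production);
            }
        }
    }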

5.A.A: Variations

  1. Instead of a giant switch, you could have separate routines for each nonterminal and just switch on the productions having this nonterminal as LHS.

    In this case each routine need be prepared to accept as parameters only the inherited attributes that its nonterminal can have.

  2. You could have separate routines for each production and thus each routine has as parameters exactly the inherited attributes that this production receives.

    To do this requires knowing which visit() procedure to call for each nonterminal (child node of the tree). For example, assume you are processing the production
        B → C D
    You need to know which routine to call for C (remember that C can be the LHS of many different productions).


  3. If you like actions instead of rules, perform the actions where indicated in the SDT.

  4. Global variables can be used (with care) instead of parameters.

  5. As illustrated earlier, you can call routines instead of setting an attribute (see addType in 5.2.5).

Chapter 6: Intermediate-Code Generation

Homework: Read Chapter 6.

6.1: Variants of Syntax Trees

6.1.1: Directed Acyclic Graphs for Expressions

The difference between a syntax DAG and a syntax tree is that the former can have undirected cycles. DAGs are useful where there are multiple, identical portions in a given input. The common case is expressions, which often contain common subexpressions. For example, in the expression
      X + a + b + c - X + ( a + b + c )
each individual variable is a common subexpression. But a+b+c is not, since (with left associativity) the first occurrence has the X already added. This is a real difference when one considers the possibility of overflow or of loss of precision. The easy case is
      x + y * z * w - ( q + y * z * w )
where y*z*w is a common subexpression.

It is easy to find such common subexpressions. The constructor Node() above checks whether an identical node exists before creating a new one. So Node('/',left,right) first checks whether there is a node with op='/' and children left and right. If so, a reference to that node is returned; if not, a new node is created as before.
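
Here is a sketch of such a Node constructor in Java (all names are assumptions). Since the children are themselves already unique, identity equality on them is exactly the check we need:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // "Check before create": an identical node is found, not rebuilt.
    class Node {
        final String op;          // operator, or the name/value for a leaf
        final Node left, right;   // null for leaves

        private Node(String op, Node left, Node right) {
            this.op = op; this.left = left; this.right = right;
        }

        // <op, left, right> -> the unique node with that signature
        private static final Map<List<Object>, Node> table = new HashMap<>();

        static Node make(String op, Node left, Node right) {
            return table.computeIfAbsent(Arrays.asList(op, left, right),
                                         key -> new Node(op, left, right));
        }
    }

Calling Node.make("+", x, y) twice with the same children returns the same object, so the tree silently becomes a DAG.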

Homework: 1.

6.1.2: The Value-Number Method for Constructing DAGs

Often one stores the tree or DAG in an array, one entry per node. Then the array index, rather than a pointer, is used to reference a node. This index is called the node's value-number and the triple
    <op, value-number of left, value-number of right>
is called the signature of the node. When Node(op,left,right) needs to determine if an identical node exists, it simply searches the table for an entry with the required signature.

Searching an unordered array is slow; there are many better data structures to use. Hash tables are a good choice.
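
A sketch of the value-number version in Java, using a hash table from signatures to value numbers (the class and method names are mine):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class ValueNumberTable {
        record Sig(String op, int left, int right) { }  // the signature

        private final List<Sig> nodes = new ArrayList<>(); // index = value number
        private final Map<Sig, Integer> index = new HashMap<>();

        int node(String op, int left, int right) {
            Sig s = new Sig(op, left, right);
            Integer vn = index.get(s);
            if (vn != null) return vn;      // identical node already exists
            nodes.add(s);
            int fresh = nodes.size() - 1;   // value number of the new node
            index.put(s, fresh);
            return fresh;
        }
    }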

Homework: 2.

6.2: Three-Address Code

Instructions of the form op a,b,c, where op is a primitive operator. For example

    lshift a,b,4   // left shift b by 4 and place result in a
    add    a,b,c   // a = b + c
    a = b + c      // alternate (more natural) representation of above
  

If we are starting with an expression DAG (or syntax tree if less aggressive), then transforming into 3-address code is just a topological sort and an assignment of a 3-address operation with a new name for the result to each interior node (the leaves already have names and values).

A key point is that nodes in an expression DAG (or tree) have at most 2 children, so three-address code is easy. As we produce three-address code for various constructs, we may have to generate several instructions to translate one construct.

[Figure: the DAG for (B+A)*(Y-(B+A))]

For example, (B+A)*(Y-(B+A)) produces the DAG shown above, which yields the following 3-address code.

    t1 = B + A
    t2 = Y - t1
    t3 = t1 * t2
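
Using the hash-consed Node class sketched in 6.1.1, emitting this code is just a post-order walk that assigns one fresh temporary per interior node and reuses it for every additional parent (a sketch; the names are mine):

    import java.util.HashMap;
    import java.util.Map;

    class Emitter {
        private final Map<Node, String> addr = new HashMap<>(); // node -> name
        private int temps = 0;

        String emit(Node n) {
            String a = addr.get(n);
            if (a != null) return a;              // shared subexpression: reuse temp
            if (n.left == null)                   // leaf: its own name/value
                a = n.op;
            else {
                String l = emit(n.left), r = emit(n.right);
                a = "t" + (++temps);
                System.out.println(a + " = " + l + " " + n.op + " " + r);
            }
            addr.put(n, a);
            return a;
        }
    }

For the DAG of (B+A)*(Y-(B+A)), emit() prints exactly the three instructions above.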
  

6.2.1: Addresses and Instructions

We use the term 3-address since instructions in our intermediate code consist of one elementary operation with three operands, each of which is often an address. Typically two of the addresses represent source operands or arguments of the operation and the third represents the result. Some of the 3-address operations have fewer than three addresses; we simply think of the missing addresses as unused (or ignored) fields in the instruction.

Possible addresses

  1. (Source program) Names. Really the intermediate code would contain a reference to the (identifier) table entry for the name. For convenience, the actual identifier is often written. An important issue is type conversion, which will be discussed later.
  2. Constants. Again, this would often be a reference to a table entry. As with names, type conversion is an important issue for constants.
  3. (Compiler-generated) Temporaries. Although it may at first seem wasteful, modern practice assigns a new name to each temporary, rather than reusing the same temporary when possible. (Remember that a DAG node is considered one temporary even if it has many parents.) Later phases can combine several temporaries into one (e.g., if they have disjoint lifetimes).

L-values and R-values

Consider Q = Z; or A[f(x)+B*D] = g(B+C*h(x,y));. (I am using [] for array reference and () for function call.)

From a macroscopic view, we have three tasks.

  1. Evaluate the left hand side (LHS) to obtain an l-value.
  2. Evaluate the RHS to obtain an r-value.
  3. Perform the assignment.

Note the difference between L-values, quantities that can appear on the LHS of an assignment, and R-values, quantities that can appear only on the RHS.

Possible three-address instructions

There is no universally agreed-upon set of three-address instructions, nor is it agreed that 3-address code should be the intermediate code for the compiler. Some prefer a set close to a machine architecture. Others prefer a higher-level set closer to the source; for example, subsets of C have been used. Still others prefer to have multiple levels of intermediate code in the compiler and define a compilation phase that converts the high-level intermediate code into the low-level intermediate code. What follows is the set proposed in the book.

In the list below, x, y, and z are addresses; i is an integer (not an address); and L is a symbolic label. The instructions can be thought of as numbered, and the labels can be converted to those numbers with another pass over the output or via backpatching, which is discussed below.

  1. Binary and unary ops.
    1. x = y op z, a binary op.
    2. x = op y, a unary op.
    3. x = y, a unary op (in this case the identity f(x)=x).

  2. Binary, unary, and nullary op jumps
    1. Nullary jump.
      goto L
    2. Conditional unary op jumps.
      if x goto L
      ifFalse x goto L.
    3. Conditional binary op jumps.
      if x relop y goto L

  3. Procedure/Function Calls and Returns.
          param x     call p,n     y = call p,n     return     return y.
    The value n gives the number of parameters, which is needed when the argument of one function is the value returned by another, for example A = F(S,G(U,V),W) might become
    	param S
    	param U
    	param V
    	t = call G,2
    	param t
    	param W
    	A = call F,3
          
    This is not important for lab4 since we do not have functions, and procedures cannot be embedded one inside the other the way functions can.
  4. Indexed Copy ops. x = y[i]   x[i] = y.
    x[i] is the address that is the-value-of-i locations after x; in particular, x[0] is the same address as x. Similarly for y[i]. But there is a difference: since y[i] is on the RHS, that address is dereferenced and the value stored there is what is used.

    Note that x[i] = y[j] is not permitted as that requires 4 addresses. x[i] = y[i] could be permitted, but is not.

  5. Address and pointer ops. x = &y   x = *y   *x = y.
    x = &y sets the r-value of x equal to the l-value of y.
    x = *y sets the r-value of x equal to the contents of the location given by the r-value of y.
    *x = y sets the location given by the r-value of x equal to the r-value of y.

6.2.2: Quadruples (Quads)

Quads are an easy, almost obvious, way to represent the three-address instructions: put the op into the first of four fields and the three addresses into the remaining three fields. Some instructions do not use all the fields. Many operands will be references to entries in tables (e.g., the identifier table).
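
For example, the three-address code for (B+A)*(Y-(B+A)) from 6.2 becomes the quads:

    op    arg1    arg2    result
    +     B       A       t1
    -     Y       t1      t2
    *     t1      t2      t3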