Start Lecture #7

Our main example

E' → E
E → E + T | T
T → T * F | F
F → ( E ) | id

is larger than the toy I did before. The NFA would have 2+4+2+4+2+4+2=20 states (a production with k symbols on the RHS gives k+1 N-states since there are k+1 places to place the dot). This gives rise to 12 D-states. However, the development in the book, which we are following now, constructs the DFA directly. The resulting diagram is on the right.

Start constructing the diagram on the board:

Begin with {E' → ·E},
take the closure, and then keep applying GOTO.
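The two operations involved, closure and GOTO, can be sketched in a few lines of Python (a sketch only; the tuple encoding of the grammar and of items is my own, not the book's):

```python
# Grammar of the running example; production 0 is the augmenting E' -> E.
GRAMMAR = [
    ("E'", ("E",)),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "F")),
    ("T", ("F",)),
    ("F", ("(", "E", ")")),
    ("F", ("id",)),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

# An item is a pair (production number, dot position).
def closure(items):
    """Add B -> .gamma for every B that appears just after a dot."""
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (p, d) in list(result):
            rhs = GRAMMAR[p][1]
            if d < len(rhs) and rhs[d] in NONTERMINALS:
                for q, (lhs, _) in enumerate(GRAMMAR):
                    if lhs == rhs[d] and (q, 0) not in result:
                        result.add((q, 0))
                        changed = True
    return frozenset(result)

def goto(items, x):
    """Move the dot over x wherever possible, then take the closure."""
    return closure({(p, d + 1) for (p, d) in items
                    if d < len(GRAMMAR[p][1]) and GRAMMAR[p][1][d] == x})

I0 = closure({(0, 0)})        # {E' -> .E} plus the six items closure adds
I1 = goto(I0, "E")            # {E' -> E.,  E -> E.+T}
```

Repeatedly applying goto over every grammar symbol, starting from I0, yields the 12 D-states mentioned above.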

A state of the automaton is an item set as described previously.
The transition function is GOTO.
If during a parse we are up to item set I_{j} (often called
state s_{j} or simply state j) and the next input symbol is
b (it of course must be a terminal), and state j has an outgoing
transition labeled b, then the parser shifts b onto the stack.
Actually, although b is removed from the input string, what is
pushed on the stack is k, the destination state of the transition
labeled b.

If there is no such transition, then the parser performs a
reduction; choosing which reduction to use is determined by the
items in I_{j} and the FOLLOW sets.
(It is also possible that the parser will now accept the input
string or announce that the input string is not in the
language).

Naturally, programs can't use diagrams so we will construct a table, but it is instructive to see how the LR(0) diagram plus the FOLLOW sets we have seen before are enough for SLR parsing.

The diagram and FOLLOW sets (plus the production list and the ACTION/GOTO table) can be seen here. For now just use the diagram and FOLLOW.

Take a legal input, e.g. `id+id*id`, and shift whenever the
diagram says you can and, when you can't, reduce using the
**unique** completed item having the input symbol in
FOLLOW.
Don't forget that reducing means backing up the number of grammar
symbols in the RHS and then going forward using the arc labeled with
the LHS.

The LR-parsing algorithm must decide when to shift and when to reduce (and in the latter case, by which production). It does this by consulting two tables, ACTION and GOTO. The basic algorithm is the same for all LR parsers; what changes are the tables ACTION and GOTO.

We have already seen GOTO (for SLR).

**Remark**: Technical point that may, and probably
should, be ignored: our GOTO was defined on pairs
[item-set,grammar-symbol].
The new GOTO is defined on pairs [state,nonterminal].
A state is simply an item set (so nothing is new here).
We will not use the new GOTO on terminals so we just define it on
nonterminals.

Given a state i and a terminal a (or the endmarker $), ACTION[i,a] can be

- Shift j. The terminal a is shifted onto the stack and the parser enters state j.
- Reduce A → α. The parser reduces the α on the TOS to A.
- Accept.
- Error.

So ACTION is the key to deciding shift vs. reduce. We will soon see how this table is computed for SLR.

Since ACTION is defined on [state,terminal] pairs and GOTO is defined on [state,nonterminal] pairs, we can combine these tables into one defined on [state,grammar-symbol] pairs.

This formalism is useful for stating the actions of the parser precisely, but I believe the parser can be explained without this formalism. The essential idea of the formalism is that the entire state of the parser can be represented by the vector of states on the stack and input symbols not yet processed.

As mentioned above the Symbols column is redundant so a configuration of the parser consists of the current stack and the remainder of the input. Formally it is

(s_{0},s_{1}...s_{m},a_{i}a_{i+1}...a_{n}$)

where the s's are states and the a's input symbols.
This configuration could also be represented by the
right-sentential form
X_{1}...X_{m},a_{i}...a_{n}

where the X is the symbol associated with the state.
X is either the terminal just shifted in or the LHS of the
reduction just performed.
The parser consults the combined ACTION-GOTO table for its current
state (TOS) and next input symbol, formally this is
ACTION[s_{m},a_{i}], and proceeds based on the value
in the table.
If the action is a shift, the next state is clear from the DFA.
(We have done this informally just above; here we use the formal
treatment.)

- Shift s.
The input symbol a is pushed and s becomes the new state.
The new configuration is
(s_{0}...s_{m}s, a_{i+1}...a_{n}$).
- Reduce A → α.
Let r be the number of symbols in the RHS of the production.
The parser pops r items off the stack (backing up r states) and
enters the state GOTO(s_{m-r},A).
That is, after backing up, it goes where A says to go.
A real parser would now probably do something, e.g., build a tree
node or perform a semantic action.
Although we know about this from the chapter 2 overview, we don't
officially know about it here.
So for now think of it as simply printing the production the parser
reduced by.
- Accept.
- Error.

The missing piece of the puzzle is finally revealed.

State | a | b | + | $ | A | B | C
---|---|---|---|---|---|---|---
7 | | | | acc | | | |
8 | s11 | s10 | | | 9 | 7 | |
9 | | | | | | | |
10 | | | | | | | |
11 | | | s12 | | | | |
12 | | s13 | | | | | |
13 | | r2 | | | | | |

Before defining the ACTION and GOTO tables precisely, I want to do
it informally via the simple example on the right.
This is a **FAKE** example: the B'→·B item
would be in I_{0} and would have only B→·a+b
with it.

**IMPORTANT**: To construct the table, you do need
something not in the diagram, namely the FOLLOW sets from top-down
parsing.

For convenience number the productions of the grammar to make them easy to reference and assume that the production B → a+b is numbered 2. Also assume FOLLOW(B)={b} and all other follow sets are empty. Again, I am not claiming that there is a grammar with this diagram and these FOLLOW sets.

Show how to calculate every entry in the table using the diagram
and the FOLLOW sets.
Then consider the input a+b and see how the table guides the parser
to eventually accepting the string.

This table is the SLR equivalent of the predictive-parsing table we constructed for top-down parsing.

The action table is defined with states (item sets) as rows and
**terminals** and the $ endmarker as columns.
GOTO has the same rows, but has **nonterminals** as
columns.
So we construct a combined ACTION-GOTO table, with states as rows
and grammar symbols (terminals + nonterminals) plus $ as columns.

- Each arc in the diagram labeled with a terminal indicates a shift. In the entry with row the state at the tail of the arc and column the labeling terminal place sn, where n is the state at the head of the arc. This indicates that if we are in the given state and the input is the given terminal, we shift to new state n.
- Each arc in the diagram labeled with a nonterminal informs us what state to enter if we reduce. In the entry with row the state at the tail of the arc and column the labeling nonterminal place n, where n is the state at the head of the arc.
- Each completed item (dot at the extreme right) indicates a
**possible** reduction. In each entry with row the state containing
the completed item and column a terminal in the FOLLOW set of the LHS
of the production corresponding to this item, place rn, where n is
the number of the production. (In particular, at entry [13,b] place
an r2.)
- In the entry with row (i.e., state) containing S'→S· and column $,
place accept.
- If any entry is labelled twice (i.e., a conflict), the grammar is
not SLR(1).
- Any unlabeled entry corresponds to an input error. If the parser
accesses this entry, the input sentence is not in the language
generated by the grammar.

The book (both editions) and the rest of the world seem to use GOTO
for both the function defined on item sets and the derived function
on states.
As a result we will be defining GOTO in terms of GOTO.
Item sets are denoted by I or I_{j}, etc.
States are denoted by s or s_{i} or i.
Indeed both books use i in this section.
The advantage is that on the stack we placed integers (i.e., i's) so
this is consistent.
The disadvantage is that we are defining GOTO(i,A) in terms of
GOTO(I_{i},A), which looks confusing.
Actually, we view the old GOTO as a function and the new one as an
array (mathematically, they are the same) so we actually write
GOTO(i,A) and GOTO[I_{i},A].

The diagrams above are constructed for our use; they are not used in constructing a bottom-up parser. Here is the real algorithm, which uses just the augmented grammar (i.e., after adding S' → S) and the FOLLOW sets.

- Construct the LR(0) automaton with states {I_{0},...,I_{n}} and
transition function GOTO.
- The parsing actions for state i:
  - If A→α·bβ is in I_{i} for b a terminal, then
    ACTION[i,b] = shift j, where GOTO(I_{i},b)=I_{j}.
  - If A→α· is in I_{i}, for A≠S', then, for all b in FOLLOW(A),
    ACTION[i,b] = reduce A→α.
  - If S'→S· is in I_{i}, then ACTION[i,$] = accept.
  - If any conflicts occurred, the grammar is not SLR(1).
- If GOTO(I_{i},A)=I_{j}, for a nonterminal A, then GOTO[i,A]=j.
- All entries not yet defined are error.
- The initial state is the one containing S'→·S.
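The algorithm just described can be rendered directly in Python. This is a sketch under my own encodings (items are (production, dot) pairs, FOLLOW is hard-coded rather than computed, and my state numbering need not agree with the diagram's):

```python
# SLR table construction for the running expression grammar.
GRAMMAR = [
    ("E'", ("E",)),            # production 0 (the augmenting production)
    ("E", ("E", "+", "T")),    # 1
    ("E", ("T",)),             # 2
    ("T", ("T", "*", "F")),    # 3
    ("T", ("F",)),             # 4
    ("F", ("(", "E", ")")),    # 5
    ("F", ("id",)),            # 6
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}
FOLLOW = {"E": {"+", ")", "$"}, "T": {"*", "+", ")", "$"},
          "F": {"*", "+", ")", "$"}}   # taken as given, from top-down parsing

def closure(items):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (p, d) in list(result):
            rhs = GRAMMAR[p][1]
            if d < len(rhs) and rhs[d] in NONTERMINALS:
                for q, (lhs, _) in enumerate(GRAMMAR):
                    if lhs == rhs[d] and (q, 0) not in result:
                        result.add((q, 0))
                        changed = True
    return frozenset(result)

def goto(items, x):
    return closure({(p, d + 1) for (p, d) in items
                    if d < len(GRAMMAR[p][1]) and GRAMMAR[p][1][d] == x})

def build():
    states = [closure({(0, 0)})]
    work = list(states)
    while work:                       # step 1: collect all the item sets
        I = work.pop()
        for x in {GRAMMAR[p][1][d] for (p, d) in I if d < len(GRAMMAR[p][1])}:
            J = goto(I, x)
            if J not in states:
                states.append(J)
                work.append(J)
    action, goto_tab = {}, {}
    for i, I in enumerate(states):    # step 2: fill ACTION and GOTO
        for (p, d) in I:
            lhs, rhs = GRAMMAR[p]
            if d < len(rhs):
                j = states.index(goto(I, rhs[d]))
                if rhs[d] in NONTERMINALS:
                    goto_tab[(i, rhs[d])] = j          # GOTO[i,A] = j
                else:
                    action[(i, rhs[d])] = f"s{j}"      # shift j
            elif lhs == "E'":
                action[(i, "$")] = "acc"               # S' -> S.
            else:
                for b in FOLLOW[lhs]:                  # reduce on FOLLOW(LHS)
                    assert action.get((i, b), f"r{p}") == f"r{p}", "not SLR(1)"
                    action[(i, b)] = f"r{p}"
    return states, action, goto_tab

states, ACTION, GOTO = build()
print(len(states))   # → 12
```

Running this yields the expected 12 states, with no conflicts, confirming that the grammar is SLR(1).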

State | id | + | * | ( | ) | $ | E | T | F
---|---|---|---|---|---|---|---|---|---
0 | s5 | | | s4 | | | 1 | 2 | 3
1 | | s6 | | | | acc | | |
2 | | r2 | s7 | | r2 | r2 | | |
3 | | r4 | r4 | | r4 | r4 | | |
4 | s5 | | | s4 | | | 8 | 2 | 3
5 | | r6 | r6 | | r6 | r6 | | |
6 | s5 | | | s4 | | | | 9 | 3
7 | s5 | | | s4 | | | | | 10
8 | | s6 | | | s11 | | | |
9 | | r1 | s7 | | r1 | r1 | | |
10 | | r3 | r3 | | r3 | r3 | | |
11 | | r5 | r5 | | r5 | r5 | | |

(The columns id through $ form the ACTION part of the table; the columns E, T, F form the GOTO part.)

E' → E (the augmenting production)

1. E → E + T
2. E → T
3. T → T * F
4. T → F
5. F → ( E )
6. F → id

Symbol | FIRST | FOLLOW
---|---|---
E' | ( id | $
E | ( id | + ) $
T | ( id | * + ) $
F | ( id | * + ) $

Our main example, pictured on the right, gives the table shown on
the left.
The productions and FOLLOW sets are shown as well (the FIRST sets
are not used directly, but are needed to calculate FOLLOW).
The table entry s5 abbreviates "shift and go to state 5".
The table entry r2 abbreviates "reduce by production number 2",
where we have numbered the productions as shown.

The shift actions can be read directly off the DFA. For example I1 with a + goes to I6, I6 with an id goes to I5, and I9 with a * goes to I7.

The reduce actions require FOLLOW, which for this simple grammar is fairly easy to calculate.

Consider I_{5}={F→id·}.
Since the dot is at the end, we are ready to reduce, but we must
check if the next symbol can follow the F we are reducing to.
Since FOLLOW(F)={+,*,),$}, in row 5 (for I_{5}) we put
r6 (for "reduce by production 6") in the columns for
+, *, ), and $.

The GOTO columns can also be read directly off the DFA.
Since there is an E-transition (arc labeled E) from I_{0}
to I_{1}, the column labeled E in row 0 contains a 1.

Since the column labeled + is blank for row 7, we see that it would be an error if we arrived in state 7 when the next input character is +.

Finally, if we are in state 1 when the input is exhausted ($ is the next input character), then we have successfully parsed the input.

Stack | Symbols | Input | Action
---|---|---|---
0 | | id*id+id$ | shift
0 5 | id | *id+id$ | reduce by F→id
0 3 | F | *id+id$ | reduce by T→F
0 2 | T | *id+id$ | shift
0 2 7 | T* | id+id$ | shift
0 2 7 5 | T*id | +id$ | reduce by F→id
0 2 7 10 | T*F | +id$ | reduce by T→T*F
0 2 | T | +id$ | reduce by E→T
0 1 | E | +id$ | shift
0 1 6 | E+ | id$ | shift
0 1 6 5 | E+id | $ | reduce by F→id
0 1 6 3 | E+F | $ | reduce by T→F
0 1 6 9 | E+T | $ | reduce by E→E+T
0 1 | E | $ | accept
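The run above can be checked mechanically. The sketch below hard-codes the ACTION/GOTO table shown earlier (the dictionary encoding and the driver are my own) and replays the parse of id*id+id, returning the numbers of the productions reduced by:

```python
# Productions 1-6 as numbered in the production list: (LHS, length of RHS).
PRODS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3), 4: ("T", 1),
         5: ("F", 3), 6: ("F", 1)}

ACTION = {
    (0, "id"): "s5", (0, "("): "s4",
    (1, "+"): "s6", (1, "$"): "acc",
    (2, "+"): "r2", (2, "*"): "s7", (2, ")"): "r2", (2, "$"): "r2",
    (3, "+"): "r4", (3, "*"): "r4", (3, ")"): "r4", (3, "$"): "r4",
    (4, "id"): "s5", (4, "("): "s4",
    (5, "+"): "r6", (5, "*"): "r6", (5, ")"): "r6", (5, "$"): "r6",
    (6, "id"): "s5", (6, "("): "s4",
    (7, "id"): "s5", (7, "("): "s4",
    (8, "+"): "s6", (8, ")"): "s11",
    (9, "+"): "r1", (9, "*"): "s7", (9, ")"): "r1", (9, "$"): "r1",
    (10, "+"): "r3", (10, "*"): "r3", (10, ")"): "r3", (10, "$"): "r3",
    (11, "+"): "r5", (11, "*"): "r5", (11, ")"): "r5", (11, "$"): "r5",
}
GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3,
        (4, "E"): 8, (4, "T"): 2, (4, "F"): 3,
        (6, "T"): 9, (6, "F"): 3, (7, "F"): 10}

def parse(tokens):
    stack, reductions, i = [0], [], 0    # stack holds states only
    while True:
        entry = ACTION.get((stack[-1], tokens[i]))
        if entry == "acc":
            return reductions
        if entry is None:
            raise SyntaxError(f"state {stack[-1]}, lookahead {tokens[i]!r}")
        if entry[0] == "s":              # shift: push the destination state
            stack.append(int(entry[1:]))
            i += 1
        else:                            # reduce: pop |RHS| states, then GOTO
            n = int(entry[1:])
            lhs, r = PRODS[n]
            del stack[len(stack) - r:]
            stack.append(GOTO[(stack[-1], lhs)])
            reductions.append(n)

print(parse("id * id + id $".split()))   # → [6, 4, 6, 3, 2, 6, 4, 1]
```

The output matches the sequence of reductions in the trace: F→id, T→F, F→id, T→T*F, E→T, F→id, T→F, E→E+T.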

**Homework**: 2 (you already constructed the LR(0)
automaton for this example in the previous homework), 3,
4 (this problem refers to 4.2.2(a-g); only use 4.2.2(a-c)).

**Example**:
What about ε-productions?
Let's do

A → B D
B → b B | ε
D → d

Reducing by the ε-production actually adds a state to the stack (it pops ZERO states since there are zero symbols on RHS and pushes one).
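To see this concretely, here is a hand-built SLR table for this grammar, augmented with A' → A (the state numbers come from my own construction of the LR(0) automaton, so they are illustrative only), together with a driver that records the stack depth after each reduction:

```python
# Productions: 1: A -> B D   2: B -> b B   3: B -> ε   4: D -> d
PRODS = {1: ("A", 2), 2: ("B", 2), 3: ("B", 0), 4: ("D", 1)}

# FOLLOW(B) = {d}, so the ε-reduction r3 appears under d.
ACTION = {(0, "b"): "s3", (0, "d"): "r3", (1, "$"): "acc",
          (2, "d"): "s5", (3, "b"): "s3", (3, "d"): "r3",
          (4, "$"): "r1", (5, "$"): "r4", (6, "d"): "r2"}
GOTO = {(0, "A"): 1, (0, "B"): 2, (2, "D"): 4, (3, "B"): 6}

def parse(tokens):
    stack, i, trace = [0], 0, []
    while True:
        entry = ACTION[(stack[-1], tokens[i])]
        if entry == "acc":
            return trace
        if entry[0] == "s":
            stack.append(int(entry[1:]))
            i += 1
        else:
            lhs, r = PRODS[int(entry[1:])]
            del stack[len(stack) - r:]            # r = 0 for B -> ε: pop nothing
            stack.append(GOTO[(stack[-1], lhs)])  # ...but a state is still pushed
            trace.append((int(entry[1:]), len(stack)))  # (production, depth after)

print(parse(["b", "d", "$"]))   # → [(3, 3), (2, 2), (4, 3), (1, 2)]
```

On input bd, the first reduction is by B→ε at stack depth 2 (states 0 3); afterwards the depth is 3 (states 0 3 6): the reduce made the stack deeper.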

**Homework**: Do the following extension

A → B D
B → b B | ε
D → d D | ε

We consider **very** briefly two alternatives to SLR,
canonical-LR or LR, and lookahead-LR or LALR.

SLR used the LR(0) items, that is, the items used were productions with an embedded dot, but contained no other (lookahead) information. The LR(1) items contain the same productions with embedded dots, but add a second component, which is a terminal (or $). This second component becomes important only when the dot is at the extreme right (indicating that a reduction can be made if the input symbol is in the appropriate FOLLOW set). For LR(1) we do that reduction only if the input symbol is exactly the second component of the item. This finer control of when to perform reductions enables the parsing of a larger class of grammars.

For LALR we merge various LR(1) item sets together, obtaining nearly the LR(0) item sets we used in SLR. LR(1) items have two components, the first, called the core, is a production with a dot; the second a terminal. For LALR we merge all the item sets that have the same set of cores by combining the 2nd components of items with like cores (thus permitting reductions when any of these terminals is the next input symbol). Thus we obtain the same number of states (item sets) as in SLR since only the cores distinguish item sets.
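The merging step can be sketched as follows (the representation of LR(1) items as (core, lookahead) pairs, with a core being a (production, dot) pair, is my own):

```python
from collections import defaultdict

def merge_states(states):
    """Merge LR(1) states whose sets of cores coincide, unioning the
    lookaheads of items with the same core.  A state is a set of
    (core, lookahead) pairs; a core is a (production, dot) pair."""
    buckets = defaultdict(list)
    for s in states:                       # group states by their core set
        buckets[frozenset(core for core, _ in s)].append(s)
    merged = []
    for group in buckets.values():         # union lookaheads per core
        lookaheads = defaultdict(set)
        for s in group:
            for core, a in s:
                lookaheads[core].add(a)
        merged.append({core: frozenset(las) for core, las in lookaheads.items()})
    return merged

# Two hypothetical LR(1) states sharing the single core (1, 2),
# i.e. the dot at position 2 of production 1, with different lookaheads.
s1 = {((1, 2), "c")}
s2 = {((1, 2), "d")}
merged = merge_states([s1, s2])
print(len(merged))   # → 1
```

The two states collapse into one whose lookahead set is {c, d}, so a reduction is now permitted on either terminal; this union is exactly what can introduce the reduce-reduce conflicts mentioned below.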

Unlike SLR, we limit reductions to occurring only for certain specified input symbols. LR(1) gives finer control than LALR; it is possible for the LALR merger to introduce reduce-reduce conflicts when the LR(1) item sets on which it is based are conflict free.

Although these conflicts are possible, they are rare and the size reduction from LR(1) to LALR is quite large. LALR is the current method of choice for bottom-up, shift-reduce parsing.

Dangling-Else Ambiguity

The tool corresponding to Lex for parsing is yacc, which (at least originally) stood for yet another compiler compiler. This name is cute but somewhat misleading since yacc (like the previous compiler compilers) does not produce a complete compiler, just a parser.

The structure of the yacc user input is similar to that for lex, but instead of regular definitions, one includes productions with semantic actions.

There are ways to specify associativity and precedence of operators. It is not done with multiple grammar symbols as in a pure parser, but more like declarations.

Use of Yacc requires a serious session with its manual.

**Homework:** Read Chapter 5.

Again we are redoing, more formally and completely, things we briefly discussed when breezing over chapter 2.

Recall that a syntax-directed definition (SDD) adds semantic rules
to the productions of a grammar.
For example, to the production `T → T / F` we might
add the rule

T.code = T_{1}.code || F.code || '/'

if we were doing an infix to postfix translator.

Rather than constantly copying ever larger strings to finally output at the root of the tree after a depth first traversal, we can perform the output incrementally by embedding semantic actions within the productions themselves. The above example becomes

T → T_{1} / F { print '/' }

Since we are generating postfix, the action comes at the end (after
we have generated the subtrees for T_{1} and F, and hence
performed their actions).
In general the actions occur within the production, not necessarily
after the last symbol.
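The effect of placing the action at the end of the production can be simulated with a small tree walk (the tuple encoding of parse trees is mine):

```python
# Simulate the embedded action { print '/' }: visit the subtrees for
# T1 and F first, then perform the action (emit the operator) last.
def emit_postfix(node, out):
    """node is an operand string or (operator, left subtree, right subtree)."""
    if isinstance(node, str):
        out.append(node)          # an operand: emit it directly
    else:
        op, left, right = node
        emit_postfix(left, out)   # subtree for T1
        emit_postfix(right, out)  # subtree for F
        out.append(op)            # the action at the end of the production

out = []
emit_postfix(("+", "7", ("/", "6", "3")), out)   # tree for 7+6/3
print(" ".join(out))   # → 7 6 3 / +
```

Because the action fires after both subtrees have been visited, the operator comes out after its operands, which is exactly postfix.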

For SDD's we conceptually need to have the entire tree available
after the parse so that we can run the postorder traversal.
(It was postorder for the infix→postfix translator since we had
a simple S-attributed SDD.
We will traverse the parse tree in
other orders when the SDD is not S-attributed, and will see
situations when no traversal order is possible.)

In some situations semantic actions can be performed during the parse, without saving the tree. However, this does not occur for large commercial compilers since they make several passes over the tree for optimizations. We will not make use of this possibility either; your lab 3 (parser) will produce a tree and your lab 4 (semantic analyzer / intermediate code generator) will read this tree and perform the semantic rules specified by the SDD.

Formally, attributes are values (of any type) that are associated with grammar symbols. Write X.a for the attribute a of symbol X. You can think of attributes as fields in a record/struct/object.

Semantic rules (rules for short) are associated with productions and give formulas to calculate attributes of nonterminals in the production.

Terminals can have synthesized attributes (not yet officially
defined); these are given to them by the lexer (not the parser).
For example the token `id` might well have the
attribute `lexeme`, where `id.lexeme` is the lexeme
that the lexer converted into this instance of `id`.
There are no rules in an SDD giving attribute values to terminals.
Terminals do not have inherited attributes (to be defined shortly).

A nonterminal B can have both inherited and synthesized attributes. The difference is how they are computed. In either case the computation is specified by a rule associated with the production at a node N of the parse tree.

**Definition**:
A **synthesized** attribute of a nonterminal B is
defined at a node N where B is the LHS of the production associated
with N.
The attribute can depend on only (synthesized or inherited)
attribute values at the children of N (the RHS of the production)
and on inherited attribute values at N itself.

The arithmetic division example above was synthesized.

Production | Semantic Rules
---|---
L → E $ | L.val = E.val
E → E_{1} + T | E.val = E_{1}.val + T.val
E → E_{1} - T | E.val = E_{1}.val - T.val
E → T | E.val = T.val
T → T_{1} * F | T.val = T_{1}.val * F.val
T → T_{1} / F | T.val = T_{1}.val / F.val
T → F | T.val = F.val
F → ( E ) | F.val = E.val
F → num | F.val = num.lexval

**Example**:
The SDD at the right gives a left-recursive grammar for
expressions with an extra nonterminal L added as the start symbol.
The terminal num is given a value by the lexer,
which corresponds to the value stored in the numbers table for lab 2.

Draw the parse tree for 7+6/3 on the board and verify that L.val is 9, the value of the expression.

This example uses only synthesized attributes.

**Definition**: An SDD with only
synthesized attributes is called **S-attributed**.

For these SDDs all attributes depend only on attribute values at
the children.

Why?

Answer: The other possibility (depending on values at the node
itself) involves inherited attributes, which are absent in an
S-attributed SDD.

Thus, for an S-attributed SDD, all the rules can be evaluated by a single bottom-up (i.e., postorder) traversal of the annotated parse tree.
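That single bottom-up pass can be sketched for the SDD above (a sketch; the tuple encoding of the annotated parse tree, collapsed to operators at interior nodes, and the use of Python numbers for num.lexval are my own):

```python
# Postorder (bottom-up) evaluation of the S-attributed SDD above.
# A node is either a number (num.lexval) or (op, left child, right child).
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def val(node):
    """The synthesized attribute val: children are evaluated first,
    then the parent's rule combines their values."""
    if isinstance(node, (int, float)):
        return node                          # F -> num: F.val = num.lexval
    op, left, right = node
    return OPS[op](val(left), val(right))    # e.g. E.val = E1.val + T.val

tree = ("+", 7, ("/", 6, 3))                 # tree for 7+6/3
print(val(tree))                             # → 9.0
```

For 7+6/3 this yields 9, agreeing with the parse tree drawn on the board.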

Inherited attributes are more complicated since the node N of the parse tree with which the attribute is associated (and which is also the natural node to store the value) does not contain the production with the corresponding semantic rule.

**Definition**:
An **inherited** attribute of a nonterminal B at node
N, where B is the LHS of the production, is defined by a semantic
rule of the production at the **parent** of N (where B
occurs in the RHS of the production).
The value depends only on attributes at N, N's siblings, and N's
parent.

Note that when viewed from the parent node P (the site of the semantic rule), the inherited attribute depends on values at P and at P's children (the same as for synthesized attributes). However, and this is crucial, the nonterminal B is the LHS of a child of P and hence the attribute is naturally associated with that child. It is normally stored there and is shown there in the diagrams below.

We will soon see examples with inherited attributes.

**Definition**: Often the attributes are
just evaluations without side effects.
In such cases we call the SDD an
**attribute grammar**.