Start Lecture #7

**Remark**: There were bugs in the grammar (reported
today); lab3 part1 now due friday 7 march.

Say I is a set of items and one of these items is A→α·Bβ. This item represents the parser having seen α and records that the parser might soon see the remainder of the RHS. For that to happen the parser must first see a string derivable from B. Now consider any production starting with B, say B→γ. If the parser is to make progress on A→α·Bβ, it will need to be making progress on one such B→·γ. Hence we want to add all the latter productions to any state that contains the former. We formalize this into the notion of closure.

For any set of items I, *CLOSURE(I)* is formed as follows.

- Initialize CLOSURE(I) = I
- If A → α · B β is in CLOSURE(I) and B → γ is a production, then add B → · γ to the closure and repeat.

**Example**:
Recall our main example

E' → E E → E + T | T T → T * F | F F → ( E ) | idCLOSURE({E' → · E}) contains 7 elements. The 6 new elements are the 6 original productions each with a dot right after the arrow. Make

If X is a grammar symbol, then moving from A→α·Xβ to A→αX·β signifies that the parser has just processed (input derivable from) X. The parser was in the former position and (input derivable from) X was on the input; this caused the parser to go to the latter position. We (almost) indicate this by writing GOTO(A→α·Xβ,X) is A→αX·β. I said almost because GOTO is actually defined from item sets to item sets not from items to items.

**Definition**: If I is an item set and X is a
grammar symbol, then GOTO(I,X) is the closure of the set of items
A→αX·β where A→α·Xβ is
in I.

I really believe this is very clear, but I understand that the formalism makes it seem confusing. Let me begin with the idea.

We augment the grammar and get this one new production; take its
closure.
That is the first element of the collection; call it I_{0}.
Try GOTOing from I_{0}, i.e., for each grammar symbol,
consider GOTO(I_{0},X); each of these (almost) is another
element of the collection.
Now try GOTOing from each of these new elements of the collection,
etc.
Start with jane smith, add all her friends F, then add the friends
of everyone in F, called FF, then add all the friends of everyone in
FF, etc

The (almost)

is because GOTO(I_{0},X) could be
empty so formally we construct the canonical collection of LR(0)
items, C, as follows

- Initialize C = CLOSURE({S' → S})
- If I is in C, X is a grammar symbol, and GOTO(I,X)≠φ then add it to C and repeat.

This GOTO gives exactly the arcs in the DFA I constructed earlier. The formal treatment does not include the NFA, but works with the DFA from the beginning.

**Definition**: The above collection of
item sets (so this is a set of sets) is called the **canonical
LR(0) collection ** and the DFA having this collection as
nodes and the GOTO function as arcs is called the
**LR(0) automaton**.

**Homework**:
Construct the LR(0) automaton for the
following grammar (which produces simple postfix expressions).

` S → S S + | S S * | a`

Don't forget to augment the grammar.

Our main example

E' → E E → E + T | T T → T * F | F F → ( E ) | idis larger than the toy I did before. The NFA would have 2+4+2+4+2+4+2=20 states (a production with k symbols on the RHS gives k+1 N-states since there k+1 places to place the dot). This gives rise to 12 D-states. However, the development in the book, which we are following now, constructs the DFA directly. The resulting diagram is on the right.

Start constructing the diagram on the board:

Begin with {E' → ·E},
take the closure, and then keep applying GOTO.

A state of the automaton is an item set as described previously.
The transition function is GOTO.
If during a parse we are up to item set I_{j} (often called
state s_{j} or simply state j) and the next input symbol is
b (it of course must be a terminal), then the parser shifts in b if
the state j has an outgoing transition labeled b.
If there is no such transition, then the parser performs a
reduction; choosing which reduction to use is determined by the
items in I_{j} and the FOLLOW sets.
(It is also possible that the parser will now accept the input
string or announce that the input string is not in the language).

The LR-parsing algorithm must decide when to shift and when to reduce (and in the latter case, by which production). It does this by consulting two tables, ACTION and GOTO. The basic algorithm is the same for all LR parsers, what changes are the tables ACTION and GOTO.

We have already seen GOTO (for SLR).

Technical point that may, and probably should, be ignored: our GOTO was defined on pairs [item-set,grammar-symbol]. The new GOTO is defined on pairs [state,nonterminal]. A state is simply an item set (so nothing is new here). We will not use the new GOTO on terminals so we just define it on nonterminals.

Given a state i and a terminal a (or the endmarker), ACTION[i,a] can be

- Shift j. The terminal a is shifted on to the stack and the parser enters state j.
- Reduce A → α. The parser reduces α on the TOS to A.
- Accept.
- Error

So ACTION is the key to deciding shift vs. reduce. We will soon see how this table is computed for SLR.

Since ACTION is defined on [state,terminal] pairs and GOTO is defined on [state,nonterminal] pairs, we can combine these tables into one defined on [state,grammar-symbol] pairs.

This formalism is useful for stating the actions of the parser precisely, but I believe the parser can be explained without this formalism. The essential idea of the formalism is that the entire state of the parser can be represented by the vector of states on the stack and input symbols not yet processed.

As mentioned above the Symbols column is redundant so a configuration of the parser consists of the current stack and the remainder of the input. Formally it is

(s_{0},s_{1}...s_{m},a_{i}a_{i+1}...a_{n}$)

where the s's are states and the a's input symbols.
This configuration could also be represented by the
right-sentential form
X_{1}...X_{m},a_{i}...a_{n}

where the X is the symbol associated with the state.
X is either the terminal just shifted in or the LHS of the
reduction just performed.
The parser consults the combined ACTION-GOTO table for its current
state (TOS) and next input symbol, formally this is
ACTION[s_{m},a_{i}], and proceeds based on the value
in the table.
If the action is a shift, the next state is clear from the DFA
We have done this informally just above; here we use the formal
treatment).

- Shift s.
The input symbol a is pushed and s becomes the new state.
The new configuration is
(s
_{0}...s_{m}s,a_{i+1}...a_{n}) - Reduce A → α.
Let r be the number of symbols in the RHS of the production.
The parser pops r items off the stack (backing up r states) and
enters the state GOTO(s
_{m-r},A). That is after backing up it goes where A says to go. A real parser would now probably do something, e.g., build a tree node or perform a semantic action. Although we know about this from the chapter 2 overview, we don't officially know about it here. So for now simply print the production the parser reduced by. - Accept.
- Error.

The missing piece of the puzzle is finally revealed.

Before defining the ACTION and GOTO tables precisely, I want to do it informally via the simple example on the right. I produced that diagram without starting from a grammar so I really don't know if it is realistic, but it does illustrate how the tables are constructed directly from the diagram.

For convenience number the productions of the grammar to make them easy to reference. Assume that the production B → a+b is numbered 2.

**IMPORTANT**: In order to construct the ACTION table,
you do need something not in the diagram.
You need to know the FOLLOW sets, the same sets that we constructed
for top-down parsing.

For this example let us assume FOLLOW(B)={b} and all other follow sets are empty. Again, I am not claiming that there is a grammar with this diagram and these FOLLOW sets.

The action table is defined with states (item sets) as rows and
**terminals** and the $ endmarker as columns.
GOTO has the same rows, but has **nonterminals** as
columns.
So we construct a combined ACTION-GOTO table, with states as rows
and grammar symbols (terminals + nonterminals) as columns.

State | a | b | + | $ | A | B | C |
---|---|---|---|---|---|---|---|

7 | acc | ||||||

8 | s11 | s10 | 9 | 7 | |||

9 | |||||||

10 | |||||||

11 | s12 | ||||||

12 | s13 | ||||||

13 | r2 |

- Each arc in the diagram labeled with a terminal indicates a shift. In the entry with row the state at the tail of the arc and column the labeling terminal place sn, where n is the state at the head of the arc. This indicates that if we are in the given state and the input is the given terminal, we shift to new state n.
- Each arc in the diagram labeled with a nonterminal informs us what state to enter if we reduce. In the entry with row the state at the tail of the arc and column the labeling nonterminal place n, where n is the state at the head of the arc.
- Each completed item (dot at the extreme right) indicates
a
**possible**reduction. In each entry with row the state containing the completed item and column a terminal in the FOLLOW set of the LHS of the production corresponding to this item, place rn, where n is the number of the production. (In particular, at entry [13,c] place an r2.). - In the entry with row (i.e., state) containing S'→S and
column $, place
accept

. - If any entry is labelled twice (i.e., a conflict) the grammar is not SLR(1).
- Any unlabeled entry corresponds to an input error. If the parser accesses this entry, the input sentence is not in the language generated by the grammar.

The book (both editions) and the rest of the world seem to use GOTO
for both the function defined on item sets and the derived function
on states.
As a result we will be defining GOTO in terms of GOTO.
Item sets are denoted by I or I_{j}, etc.
States are denoted by s or s_{i} or i.
Indeed both books use i in this section.
The advantage is that on the stack we placed integers (i.e., i's) so
this is consistent.
The disadvantage is that we are defining GOTO(i,A) in terms of
GOTO(I_{i},A), which looks confusing.
Actually, we view the old GOTO as a function and the new one as an
array (mathematically, they are the same) so we actually write
GOTO(i,A) and GOTO[I_{i},A].

We start with an augmented grammar (i.e., we added S' → S).

- Construct {I
_{0},...,I_{n}} the LR(0) items. - The parsing actions for state i.
- If A→α·bβ is in I
_{i}for b a terminal, then ACTION[i,b]=shift j

, where GOTO(I_{i},b)=I_{j}. - If A→α· is in I
_{i}, for A≠S', then, for all b in FOLLOW(A), ACTION[i,b]=reduce A→α

. - If S'→S· is in I
_{i}, then ACTION[I,$]=accept

. - If any conflicts occurred, the grammar is not SLR(1).

- If A→α·bβ is in I
- If GOTO(I
_{i},A)=I_{j}, for a nonterminal A, then GOTO[i,A]=j. - All entries not yet defined are
error

. - The initial state is the one containing S'→·S.

State | ACTION | GOTO | |||||||
---|---|---|---|---|---|---|---|---|---|

id | + | * | ( | ) | $ | E | T | F | |

0 | s5 | s4 | 1 | 2 | 3 | ||||

1 | s6 | acc | |||||||

2 | r2 | s7 | r2 | r2 | |||||

3 | r4 | r4 | r4 | r4 | |||||

4 | s5 | s4 | 8 | 2 | 3 | ||||

5 | r6 | r6 | r6 | r6 | |||||

6 | s5 | s4 | 9 | 3 | |||||

7 | s5 | s4 | 10 | ||||||

8 | s6 | s11 | |||||||

9 | r1 | s7 | r1 | r1 | |||||

10 | r3 | r3 | r3 | r3 | |||||

11 | r5 | r5 | r5 | r5 |

shift and go to state 5.

The entry r2 abbreviates

reduce by production number 2, where we have numbered the productions as follows.

- E → E + T
- E → T
- T → T * F
- T → F
- F → ( E )
- F → id

The shift actions can be read directly off the DFA. For example I1 with a + goes to I6, I6 with an id goes to I5, and I9 with a * goes to I7.

The reduce actions require FOLLOW.
Consider I_{5}={F→id·}.
Since the dot is at the end, we are ready to reduce, but we must
check if the next symbol can follow the F we are reducing to.
Since FOLLOW(F)={+,*,),$}, in row 5 (for I5) we put
r6 (for reduce by production 6

) in the columns for
+, *, ), and $.

The GOTO columns can also be read directly off the DFA.
Since there is an E-transition (arc labeled E) from I_{0}
to I_{1}, the column labeled E in row 0 contains a 1.

Since the column labeled + is blank for row 7, we see that it would be an error if we arrived in state 7 when the next input character is +.

Finally, if we are in state 1 when the input is exhausted ($ is the next input character), then we have a successfully parsed the input.

Stack | Symbols | Input | Action |
---|---|---|---|

0 | id*id+id$ | shift | |

05 | id | *id+id$ | reduce by F→id |

03 | F | *id+id$ | reduct by T→id |

02 | T | *id+id$ | shift |

027 | T* | id+id$ | shift |

0275 | T*id | +id$ | reduce by F→id |

027 10 | T*F | +id$ | reduce by T→T*F |

02 | T | +id$ | reduce by E→T |

01 | E | +id$ | shift |

016 | E+ | id$ | shift |

0165 | E+id | $ | reduce by F→id |

0163 | E+F | $ | reduce by T→F |

0169 | E+T | $ | reduce by E→E+T |

01 | E | $ | accept |

**Homework**: 2 (you already constructed the LR(0)
automaton for this example in the previous homework), 3,
4 (this problem refers to 4.2.2(a-g); only use 4.2.2(a-c).

**Example**:
What about ε-productions?
Let's do

A → B D B → b B | ε D → d

Reducing by the ε-production actually adds a state (pops ZERO states since zero symbols on RHS and pushes one).

**Homework**: Do the following extension

A → B D B → b B | ε D → d D | ε