Start Lecture #7
Remark: Lab 3 assigned. Part 1 (no programming) due in one week; the remainder due a week later.
A state of the automaton is an item set as described previously. The transition function is GOTO. If during a parse we are up to item set Ij (often called state sj or simply state j) and the next input symbol is b (it of course must be a terminal), then the parser shifts in b if the state j has an outgoing transition labeled b. If there is no such transition, then the parser performs a reduction; choosing which reduction to use is determined by the items in Ij and the FOLLOW sets. (It is also possible that the parser will now accept the input string or announce that the input string is not in the language).
The LR-parsing algorithm must decide when to shift and when to reduce (and in the latter case, by which production). It does this by consulting two tables, ACTION and GOTO. The basic algorithm is the same for all LR parsers, what changes are the tables ACTION and GOTO.
We have already seen GOTO (for SLR).
Technical point that may, and probably should, be ignored: our GOTO was defined on pairs [item-set,grammar-symbol]. The new GOTO is defined on pairs [state,nonterminal]. A state is simply an item set (so nothing is new here). We will not use the new GOTO on terminals so we just define it on nonterminals.
Given a state i and a terminal a (or the endmarker $), ACTION[i,a] can be

1. Shift j. The terminal a is shifted and the parser enters state j.
2. Reduce A→α. The parser reduces by the production A→α.
3. Accept. The parse is complete and the input is accepted.
4. Error.
So ACTION is the key to deciding shift vs. reduce. We will soon see how this table is computed for SLR.
Since ACTION is defined on [state,terminal] pairs and GOTO is defined on [state,nonterminal] pairs, we can combine these tables into one defined on [state,grammar-symbol] pairs.
This formalism is useful for stating the actions of the parser precisely, but I believe the parser can be explained without this formalism. The essential idea of the formalism is that the entire state of the parser can be represented by the vector of states on the stack and input symbols not yet processed.
As mentioned above the Symbols column is redundant so a configuration of the parser consists of the current stack and the remainder of the input. Formally it is

(s0 s1 ... sm, ai ai+1 ... an $)

where the first component is the stack of states (s0 on the bottom, sm on top, so sm is the current state) and the second component is the input remaining to be processed.
The parser consults the combined ACTION-GOTO table for its current state (TOS) and next input symbol, formally this is ACTION[sm,ai], and proceeds based on the value in the table. If the action is a shift, the next state is clear from the DFA. (We have done this informally just above; here we use the formal treatment.)
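To make the loop concrete, here is a minimal Python sketch of the driver; the encodings are assumptions of the sketch (ACTION as a dictionary keyed by (state, terminal) with values like ("shift", j), GOTO as a dictionary keyed by (state, nonterminal), and each production recorded as its LHS plus the length of its RHS), not notation used above.

```python
# Hedged sketch of the generic LR driver loop.  Assumed table formats:
#   action[(state, terminal)] -> ("shift", j) | ("reduce", prod_num) | ("accept",)
#   goto_[(state, nonterminal)] -> j
#   prods[prod_num] -> (lhs_nonterminal, length_of_rhs)
# A missing ACTION entry means error.
def lr_parse(tokens, action, goto_, prods, start_state=0):
    stack = [start_state]                 # stack of states; s0 on the bottom
    tokens = list(tokens) + ["$"]         # append the endmarker
    i = 0
    while True:
        a = tokens[i]
        act = action.get((stack[-1], a))
        if act is None:
            raise SyntaxError(f"no action for state {stack[-1]} on {a!r}")
        if act[0] == "shift":
            stack.append(act[1])          # enter state j
            i += 1                        # consume the terminal
        elif act[0] == "reduce":
            lhs, rhs_len = prods[act[1]]
            if rhs_len:                   # pop one state per RHS symbol
                del stack[-rhs_len:]
            stack.append(goto_[(stack[-1], lhs)])   # push GOTO[exposed state, lhs]
        else:                             # "accept"
            return True
```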
The missing piece of the puzzle is finally revealed.
State | a | b | + | $ | A | B | C
---|---|---|---|---|---|---|---
7 | | | | acc | | |
8 | s11 | s10 | | | 9 | 7 |
9 | | | | | | |
10 | | | | | | |
11 | | | s12 | | | |
12 | | s13 | | | | |
13 | | r2 | | | | |
Before defining the ACTION and GOTO tables precisely, I want to do it informally via the simple example on the right. This is a FAKE example: in a real grammar the B'→·B item would be in I0 and would have only B→·a+b with it.
IMPORTANT: To construct the table, you do need something not in the diagram, namely the FOLLOW sets from top-down parsing.
For convenience number the productions of the grammar to make them easy to reference and assume that the production B → a+b is numbered 2. Also assume FOLLOW(B)={b} and all other follow sets are empty. Again, I am not claiming that there is a grammar with this diagram and these FOLLOW sets.
The ACTION table is defined with states (item sets) as rows and terminals and the $ endmarker as columns. GOTO has the same rows, but has nonterminals as columns. So we construct a combined ACTION-GOTO table, with states as rows and grammar symbols (terminals + nonterminals) plus $ as columns.

Show how to calculate every entry in the table using the diagram and the FOLLOW sets. Then consider the input a+b and trace the parse until the parser announces accept.
The book (both editions) and the rest of the world seem to use GOTO for both the function defined on item sets and the derived function on states. As a result we will be defining GOTO in terms of GOTO. Item sets are denoted by I or Ij, etc. States are denoted by s or si or i. Indeed both books use i in this section. The advantage is that on the stack we placed integers (i.e., i's) so this is consistent. The disadvantage is that we are defining GOTO(i,A) in terms of GOTO(Ii,A), which looks confusing. Actually, we view the old GOTO as a function and the new one as an array (mathematically, they are the same), so we write GOTO(Ii,A) for the function on item sets and GOTO[i,A] for the array indexed by states.
The diagrams above are constructed for our use; they are not used in constructing a bottom-up parser. Here is the real algorithm, which uses just the augmented grammar (i.e., after adding S' → S) and the FOLLOW sets.
1. Construct the LR(0) item sets I0, I1, ..., In (the states) and the GOTO function on item sets, as described previously.
2. Fill in the ACTION entries for state i from the items in Ii.
   - If A→α·bβ is in Ii with b a terminal, then ACTION[i,b] is shift j, where GOTO(Ii,b)=Ij.
   - If A→α· is in Ii (with A ≠ S'), then ACTION[i,a] is reduce A→α for every a in FOLLOW(A).
   - If S'→S· is in Ii, then ACTION[i,$] is accept.
   - Every entry not filled in by these rules is error. (If two rules try to fill the same entry, the grammar is not SLR(1).)
3. If GOTO(Ii,A)=Ij for a nonterminal A, then GOTO[i,A] is j.
4. The initial state is the one whose item set contains S'→·S.
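The construction can be written down almost verbatim as code. Here is a minimal Python sketch under assumed representations: an item is a (LHS, RHS-tuple, dot-position) triple, item_sets is the list I0...In, goto_sets(i,X) returns the index j with GOTO(Ii,X)=Ij (or None), follow maps each nonterminal to its FOLLOW set, and prod_num numbers the productions. The check that no entry gets filled twice (i.e., that the grammar really is SLR(1)) is omitted.

```python
# Hedged sketch of SLR table construction from the LR(0) item sets,
# the GOTO function on item sets, and the FOLLOW sets.
def build_slr_tables(item_sets, goto_sets, follow, prod_num,
                     terminals, nonterminals, start="S'"):
    action, goto_ = {}, {}
    for i, items in enumerate(item_sets):
        for (lhs, rhs, dot) in items:                    # rhs is a tuple of symbols
            if dot < len(rhs) and rhs[dot] in terminals:
                # A -> alpha . b beta  with b a terminal: shift to GOTO(Ii, b)
                action[(i, rhs[dot])] = ("shift", goto_sets(i, rhs[dot]))
            elif dot == len(rhs) and lhs == start:
                action[(i, "$")] = ("accept",)           # S' -> S .
            elif dot == len(rhs):
                for a in follow[lhs]:                    # A -> alpha . : reduce on FOLLOW(A)
                    action[(i, a)] = ("reduce", prod_num[(lhs, rhs)])
        for A in nonterminals:
            j = goto_sets(i, A)
            if j is not None:
                goto_[(i, A)] = j                        # GOTO[i, A] = j
    return action, goto_                                 # entries absent from action are errors
```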
(In the table, the columns id through $ form the ACTION part; the columns E, T, F form the GOTO part.)

State | id | + | * | ( | ) | $ | E | T | F
---|---|---|---|---|---|---|---|---|---
0 | s5 | | | s4 | | | 1 | 2 | 3
1 | | s6 | | | | acc | | |
2 | | r2 | s7 | | r2 | r2 | | |
3 | | r4 | r4 | | r4 | r4 | | |
4 | s5 | | | s4 | | | 8 | 2 | 3
5 | | r6 | r6 | | r6 | r6 | | |
6 | s5 | | | s4 | | | | 9 | 3
7 | s5 | | | s4 | | | | | 10
8 | | s6 | | | s11 | | | |
9 | | r1 | s7 | | r1 | r1 | | |
10 | | r3 | r3 | | r3 | r3 | | |
11 | | r5 | r5 | | r5 | r5 | | |
Nonterminal | FIRST | FOLLOW
---|---|---
E' | ( id | $
E | ( id | + ) $
T | ( id | * + ) $
F | ( id | * + ) $
Our main example, pictured on the right, gives the table shown on the left. The productions and FOLLOW sets are shown as well (the FIRST sets are not used directly, but are needed to calculate FOLLOW). The table entry s5 abbreviates shift and go to state 5. The table entry r2 abbreviates reduce by production number 2, where we have numbered the productions as follows.

1. E → E + T
2. E → T
3. T → T * F
4. T → F
5. F → ( E )
6. F → id
The shift actions can be read directly off the DFA. For example I1 with a + goes to I6, I6 with an id goes to I5, and I9 with a * goes to I7.
The reduce actions require FOLLOW, which for this simple grammar is fairly easy to calculate.
Consider I5={F→id·}. Since the dot is at the end, we are ready to reduce, but we must check if the next symbol can follow the F we are reducing to. Since FOLLOW(F)={+,*,),$}, in row 5 (for I5) we put r6 (for reduce by production 6) in the columns for +, *, ), and $.
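Since the reduce entries are exactly where FOLLOW enters, here is a small Python sketch of the usual fixed-point FOLLOW computation applied to this grammar; the dictionary encodings of the grammar and of FIRST are assumptions of the sketch, not part of the table above.

```python
# Minimal sketch of the standard fixed-point FOLLOW computation, applied to the
# expression grammar.  "" stands for epsilon; terminals are any symbols that are
# not keys of the grammar dictionary.
grammar = {                       # augmented expression grammar
    "E'": [["E"]],
    "E":  [["E", "+", "T"], ["T"]],
    "T":  [["T", "*", "F"], ["F"]],
    "F":  [["(", "E", ")"], ["id"]],
}
first = {A: {"(", "id"} for A in grammar}   # from the FIRST column above; nothing is nullable

def first_of(symbols):
    """FIRST of a string of grammar symbols; a terminal's FIRST is itself."""
    out = set()
    for X in symbols:
        fx = first.get(X, {X})
        out |= fx - {""}
        if "" not in fx:
            return out            # X is not nullable, so stop here
    out.add("")                   # every symbol was nullable (or the string was empty)
    return out

def compute_follow(start="E'"):
    follow = {A: set() for A in grammar}
    follow[start].add("$")
    changed = True
    while changed:                # iterate until no FOLLOW set grows
        changed = False
        for A, prods in grammar.items():
            for rhs in prods:
                for k, B in enumerate(rhs):
                    if B not in grammar:          # only nonterminals have FOLLOW sets
                        continue
                    rest = first_of(rhs[k + 1:])
                    new = (rest - {""}) | (follow[A] if "" in rest else set())
                    if not new <= follow[B]:
                        follow[B] |= new
                        changed = True
    return follow

print(compute_follow())   # E: {+, ), $}   T and F: {*, +, ), $}   E': {$}
```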
The GOTO columns can also be read directly off the DFA. Since there is an E-transition (arc labeled E) from I0 to I1, the column labeled E in row 0 contains a 1.
Since the column labeled + is blank for row 7, we see that it would be an error if we arrived in state 7 when the next input character is +.
Finally, if we are in state 1 when the input is exhausted ($ is the next input character), then we have successfully parsed the input.
Stack | Symbols | Input | Action
---|---|---|---
0 | | id*id+id$ | shift
0 5 | id | *id+id$ | reduce by F→id
0 3 | F | *id+id$ | reduce by T→F
0 2 | T | *id+id$ | shift
0 2 7 | T* | id+id$ | shift
0 2 7 5 | T*id | +id$ | reduce by F→id
0 2 7 10 | T*F | +id$ | reduce by T→T*F
0 2 | T | +id$ | reduce by E→T
0 1 | E | +id$ | shift
0 1 6 | E+ | id$ | shift
0 1 6 5 | E+id | $ | reduce by F→id
0 1 6 3 | E+F | $ | reduce by T→F
0 1 6 9 | E+T | $ | reduce by E→E+T
0 1 | E | $ | accept
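The trace above can be reproduced mechanically from the table. Below is a hedged Python sketch that encodes the ACTION-GOTO table (the cell strings s5, r2, acc are just a convenient encoding of my own) and the production lengths, and then runs the driver loop on id*id+id; it performs the same sequence of shifts and reductions as the table.

```python
# Hedged sketch: the ACTION-GOTO table above as Python dictionaries, plus a
# small driver that traces the parse of  id * id + id .
ACTION = {
    (0, "id"): "s5", (0, "("): "s4",
    (1, "+"): "s6", (1, "$"): "acc",
    (2, "+"): "r2", (2, "*"): "s7", (2, ")"): "r2", (2, "$"): "r2",
    (3, "+"): "r4", (3, "*"): "r4", (3, ")"): "r4", (3, "$"): "r4",
    (4, "id"): "s5", (4, "("): "s4",
    (5, "+"): "r6", (5, "*"): "r6", (5, ")"): "r6", (5, "$"): "r6",
    (6, "id"): "s5", (6, "("): "s4",
    (7, "id"): "s5", (7, "("): "s4",
    (8, "+"): "s6", (8, ")"): "s11",
    (9, "+"): "r1", (9, "*"): "s7", (9, ")"): "r1", (9, "$"): "r1",
    (10, "+"): "r3", (10, "*"): "r3", (10, ")"): "r3", (10, "$"): "r3",
    (11, "+"): "r5", (11, "*"): "r5", (11, ")"): "r5", (11, "$"): "r5",
}
GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3, (4, "E"): 8, (4, "T"): 2,
        (4, "F"): 3, (6, "T"): 9, (6, "F"): 3, (7, "F"): 10}
PRODS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3), 4: ("T", 1),   # (LHS, RHS length)
         5: ("F", 3), 6: ("F", 1)}

def trace(tokens):
    stack, i = [0], 0
    tokens = tokens + ["$"]
    while True:
        act = ACTION[(stack[-1], tokens[i])]       # a KeyError here would be a syntax error
        if act == "acc":
            print(stack, "accept")
            return
        if act[0] == "s":                          # shift: push the new state, consume input
            print(stack, "shift", tokens[i])
            stack.append(int(act[1:]))
            i += 1
        else:                                      # reduce: pop |RHS| states, push GOTO
            lhs, n = PRODS[int(act[1:])]
            print(stack, "reduce by production", act[1:])
            del stack[-n:]
            stack.append(GOTO[(stack[-1], lhs)])

trace(["id", "*", "id", "+", "id"])
```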
Homework: 2 (you already constructed the LR(0) automaton for this example in the previous homework), 3, 4 (this problem refers to 4.2.2(a-g); only use 4.2.2(a-c)).
Example: What about ε-productions? Let's do
A → B D
B → b B | ε
D → d
Reducing by the ε-production actually adds a state to the stack (it pops ZERO states since there are zero symbols on RHS and pushes one).
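In code the point is just that the reduce step pops len(RHS) states, which is zero for an ε-production, and then always pushes one GOTO state. A tiny sketch (the state numbers and the GOTO entry are made up for illustration):

```python
# Hedged sketch of the reduce step (same driver shape as the earlier sketches),
# showing that reducing by  B -> epsilon  pops zero states yet pushes one.
def reduce_by(stack, lhs, rhs_len, goto_):
    if rhs_len:                  # rhs_len == 0 for an epsilon-production: pop nothing
        del stack[-rhs_len:]
    stack.append(goto_[(stack[-1], lhs)])   # ...but always push GOTO[exposed state, lhs]

stack = [0]                      # hypothetical state; GOTO(0, B) = 2 is an assumption
reduce_by(stack, "B", 0, {(0, "B"): 2})
print(stack)                     # [0, 2]: the stack GREW by one state
```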
Homework: Do the following extension
A → B D
B → b B | ε
D → d D | ε
We consider very briefly two alternatives to SLR, canonical-LR or LR, and lookahead-LR or LALR.
SLR used the LR(0) items; that is, the items used were productions with an embedded dot but contained no other (lookahead) information. The LR(1) items contain the same productions with embedded dots, but add a second component, which is a terminal (or $). This second component becomes important only when the dot is at the extreme right (indicating that a reduction can be made if the input symbol is in the appropriate FOLLOW set). For LR(1) we do that reduction only if the input symbol is exactly the second component of the item. This finer control of when to perform reductions enables the parsing of a larger class of languages.
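The difference in the reduce decision can be stated in two short (sketch) functions; the item representations below are assumptions, not notation from the lecture.

```python
# Hedged sketch: the reduce test for a completed item under SLR vs. LR(1).
# An LR(0) item is (lhs, rhs, dot); an LR(1) item adds a lookahead terminal.
def slr_reduces(item, next_symbol, follow):
    lhs, rhs, dot = item
    return dot == len(rhs) and next_symbol in follow[lhs]   # any symbol in FOLLOW(lhs)

def lr1_reduces(item, next_symbol):
    lhs, rhs, dot, lookahead = item
    return dot == len(rhs) and next_symbol == lookahead     # exactly the item's 2nd component
```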
For LALR we merge various LR(1) item sets together, obtaining nearly the LR(0) item sets we used in SLR. LR(1) items have two components: the first, called the core, is a production with a dot; the second is a terminal (or $). For LALR we merge all the item sets that have the same cores by combining the second components (thus permitting reductions when any of these terminals is the next input symbol). Thus we obtain the same number of states (item sets) as in SLR, since only the cores distinguish item sets.
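A hedged sketch of the merge itself, assuming each LR(1) item is a (LHS, RHS, dot, lookahead) tuple and each item set is a frozenset of such tuples:

```python
# Hedged sketch of the LALR merge: LR(1) item sets whose cores (the LR(0) parts)
# agree are unioned, which pools their lookaheads into one state.
from collections import defaultdict

def lalr_merge(lr1_item_sets):
    merged = defaultdict(set)
    for items in lr1_item_sets:
        core = frozenset((lhs, rhs, dot) for (lhs, rhs, dot, _la) in items)
        merged[core] |= items               # same core -> same LALR state
    return [frozenset(s) for s in merged.values()]
```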
Unlike SLR, we limit reductions to occurring only for certain specified input symbols. LR(1) gives finer control; it is possible for the LALR merger to have reduce-reduce conflicts when the LR(1) items on which it is based are conflict-free.
Although these conflicts are possible, they are rare and the size reduction from LR(1) to LALR is quite large. LALR is the current method of choice for bottom-up, shift-reduce parsing.
Dangling-Else Ambiguity
The tool corresponding to Lex for parsing is yacc, which (at least originally) stood for yet another compiler compiler. This name is cute but somewhat misleading since yacc (like the previous compiler compilers) does not produce a compiler, just a parser.
The structure of the yacc user input is similar to that for lex, but instead of regular definitions, one includes productions with semantic actions.
There are ways to specify the associativity and precedence of operators. This is not done with multiple grammar symbols, as in a pure grammar, but rather with declarations.
Use of Yacc requires a serious session with its manual.