Compilers

================ Start Lecture #6 ================

Do the FIRST and FOLLOW sets for

    E  → T E'
    E' → + T E' | ε
    T  → F T'
    T' → * F T' | ε
    F  → ( E ) | id
  

Homework: Compute FIRST and FOLLOW for the postfix grammar S → S S + | S S * | a
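FIRST and FOLLOW are the smallest sets satisfying their defining rules, so a standard way to compute them is to iterate until nothing changes (a fixed point). Here is a rough Python sketch for the expression grammar above; the encoding (the grammar as a dict, the names EPS, first_of_string, etc.) is mine, not the book's.

    # Fixed-point computation of FIRST and FOLLOW for the grammar above.
    EPS = 'ε'

    grammar = {
        "E":  [["T", "E'"]],
        "E'": [["+", "T", "E'"], [EPS]],
        "T":  [["F", "T'"]],
        "T'": [["*", "F", "T'"], [EPS]],
        "F":  [["(", "E", ")"], ["id"]],
    }
    nonterminals = set(grammar)

    def first_of_string(symbols, first):
        """FIRST of a string of grammar symbols: scan left to right, passing ε."""
        result = set()
        for x in symbols:
            fx = first[x] if x in nonterminals else {x}
            result |= fx - {EPS}
            if EPS not in fx:
                return result
        return result | {EPS}          # every symbol could derive ε

    def compute_first():
        first = {a: set() for a in nonterminals}
        changed = True
        while changed:                 # iterate until no set grows
            changed = False
            for a, prods in grammar.items():
                for rhs in prods:
                    new = first_of_string(rhs, first)
                    if not new <= first[a]:
                        first[a] |= new
                        changed = True
        return first

    def compute_follow(first, start="E"):
        follow = {a: set() for a in nonterminals}
        follow[start].add("$")         # $ is in FOLLOW(start symbol)
        changed = True
        while changed:
            changed = False
            for a, prods in grammar.items():
                for rhs in prods:
                    for i, b in enumerate(rhs):
                        if b not in nonterminals:
                            continue
                        tail = first_of_string(rhs[i+1:], first)
                        new = (tail - {EPS}) | (follow[a] if EPS in tail else set())
                        if not new <= follow[b]:
                            follow[b] |= new
                            changed = True
        return follow

    first = compute_first()
    follow = compute_follow(first)
    for a in sorted(nonterminals):
        print(a, 'FIRST:', sorted(first[a]), 'FOLLOW:', sorted(follow[a]))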

4.4.3: LL(1) Grammars

The predictive parsers of chapter 2 are recursive descent parsers needing no backtracking. A predictive parser can be constructed for any grammar in the class LL(1). The two Ls stand for (processing the input) Left to right and for producing Leftmost derivations. The 1 in parens indicates that 1 symbol of lookahead is used.

Definition: A grammar is LL(1) if for all production pairs A → α | β

  1. FIRST(α) ∩ FIRST(β) = ∅.
  2. If β ⇒* ε, then no string derived from α begins with a terminal in FOLLOW(A). Similarly, if α ⇒* ε.

The 2nd condition may seem strange; it did to me for a while. Let's consider the simplest case that condition 2 is trying to avoid.

    A → ε      // β=ε so β derives ε
    S → A b    // b is in FOLLOW(A)
    A → b      // α=b so α derives a string beginning with b
  
[Diagram: parse tree illustrating the LL(1) definition example]

Assume we are using predictive parsing and, as illustrated in the diagram to the right, we are at A in the parse tree with b the next input symbol. Since the lookahead is b, and b is in FIRST(RHS) for the bottom A production, we would choose that production to expand A. But this could be wrong! Remember that we look ahead only in the input, not in the tree, so we would not have noticed that the next node in the tree (i.e., in the frontier) is b. This is possible since b is in FOLLOW(A). So perhaps we should instead use the other A production to produce ε in the tree, and then the next node b would match the input b.

Constructing a Predictive Parsing Table

The goal is to produce a table telling us, in each situation, which production to apply. A situation means a nonterminal in the parse tree together with the input symbol in the lookahead.

So we produce a table with rows corresponding to nonterminals and columns corresponding to input symbols (including $, the endmarker). In an entry we put the production to apply when we are in that situation.

We start with an empty table M and populate it as follows. (2e has typo; it has FIRST(A) instead of FIRST(α).) For each production A → α

  1. For each terminal a in FIRST(α), add A → α to M[A,a]. This is what we did with predictive parsing in chapter 2. The point was that if we are up to A in the tree and a is the lookahead, we could (should??) use the production A→α.
  2. If ε is in FIRST(α), then add A → α to M[A,b] for each terminal b in FOLLOW(A), and to M[A,$] if $ is in FOLLOW(A). This is not so obvious; it corresponds to the second (strange) condition above. If ε is in FIRST(α), then α ⇒* ε. Hence we could (should??) apply the production A → α, have the α go to ε, and then the b (or $) that follows A will match the b (or $) in the input. (A code sketch of this construction follows.)
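To make the two rules concrete, here is a rough Python sketch that fills in M for the expression grammar. FIRST and FOLLOW are hardcoded (they match the table in the example below); the representation of M, a dict keyed by (nonterminal, terminal) holding a list so that conflicts are visible, is my choice, not the book's.

    EPS = 'ε'
    grammar = {
        "E":  [["T", "E'"]],
        "E'": [["+", "T", "E'"], [EPS]],
        "T":  [["F", "T'"]],
        "T'": [["*", "F", "T'"], [EPS]],
        "F":  [["(", "E", ")"], ["id"]],
    }
    FIRST  = {"E": {"(", "id"}, "E'": {"+", EPS}, "T": {"(", "id"},
              "T'": {"*", EPS}, "F": {"(", "id"}}
    FOLLOW = {"E": {")", "$"}, "E'": {")", "$"}, "T": {"+", ")", "$"},
              "T'": {"+", ")", "$"}, "F": {"*", "+", ")", "$"}}

    def first_of(rhs):
        """FIRST of a right-hand side, using the FIRST sets for nonterminals."""
        out = set()
        for x in rhs:
            fx = FIRST.get(x, {x})     # a terminal is its own FIRST set
            out |= fx - {EPS}
            if EPS not in fx:
                return out
        return out | {EPS}

    M = {}
    for A, prods in grammar.items():
        for rhs in prods:
            f = first_of(rhs)
            for a in f - {EPS}:        # rule 1: a is a terminal in FIRST(α)
                M.setdefault((A, a), []).append(rhs)
            if EPS in f:               # rule 2: α can derive ε
                for b in FOLLOW[A]:    # b ranges over FOLLOW(A), including $
                    M.setdefault((A, b), []).append(rhs)

    for (A, a), prods in sorted(M.items()):
        conflict = '  <-- multiple entries: not LL(1)' if len(prods) > 1 else ''
        print(f"M[{A}, {a}] = {prods}{conflict}")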

When we have finished filling in the table M, what do we do if a slot has

  1. no entries? This means that from this situation, no production is appropriate. Hence, if parsing an input leads to this entry, we cannot parse the sentence (because it is not in the language) so we report an error and try to repair it.

  2. one entry? Perfect! This means we know exactly what to do in this situation.

  3. more than one entry? This should not happen since this section is entitled LL(1) grammars. Someone erred when they said the grammar was LL(1). Since the grammar is not LL(1), we must use a different technique. One possibility is to use bottom-up parsing, which we study next. Another is to modify the procedure for this nonterminal to look further ahead (typically one more token) to decide what action to perform.

Example: Work out the parsing table for

    E  → T E'
    E' → + T E' | ε
    T  → F T'
    T' → * F T' | ε
    F  → ( E ) | id
  
          FIRST     FOLLOW
    E     ( id      $ )
    E'    ε +       $ )
    T     ( id      + $ )
    T'    ε *       + $ )
    F     ( id      * + $ )

We already computed FIRST and FOLLOW, as shown above. The table skeleton is
                     Input Symbol
    Nonterminal    +      *      (      )      id     $
    E
    E'
    T
    T'
    F
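For reference, applying the two rules fills in the skeleton as follows (blank entries denote errors); this matches the construction sketched earlier.

                +           *           (          )         id         $
    E                                   E → TE'              E → TE'
    E'          E' → +TE'                          E' → ε               E' → ε
    T                                   T → FT'              T → FT'
    T'          T' → ε      T' → *FT'              T' → ε               T' → ε
    F                                   F → (E)              F → id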

Homework: Produce the predictive parsing table for

  1. S → 0 S 1 | 0 1
  2. the prefix grammar S → + S S | * S S | a
Don't forget to eliminate left recursion and perform left factoring if necessary.

4.4.4: Nonrecursive Predictive Parsing

This section illustrates the standard technique for eliminating recursion: keep the stack explicitly rather than relying on the implementation language's procedure-call mechanism. The runtime improvement can be considerable.
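Here is a rough Python sketch of such a parser for the expression grammar, driven by the table M worked out above. The representation choices (tuples as table keys, a Python list as the explicit stack) are mine.

    # Table-driven predictive parser with an explicit stack.
    M = {  # M[(nonterminal, lookahead)] = RHS to expand (empty list = ε)
        ("E", "("): ["T", "E'"],        ("E", "id"): ["T", "E'"],
        ("E'", "+"): ["+", "T", "E'"],  ("E'", ")"): [], ("E'", "$"): [],
        ("T", "("): ["F", "T'"],        ("T", "id"): ["F", "T'"],
        ("T'", "*"): ["*", "F", "T'"],  ("T'", "+"): [], ("T'", ")"): [],
        ("T'", "$"): [],
        ("F", "("): ["(", "E", ")"],    ("F", "id"): ["id"],
    }
    NONTERMINALS = {"E", "E'", "T", "T'", "F"}

    def parse(tokens):
        tokens = tokens + ["$"]            # endmarker after the input
        stack = ["$", "E"]                 # start symbol above the bottom marker
        i = 0
        while stack[-1] != "$":
            top, look = stack[-1], tokens[i]
            if top in NONTERMINALS:
                rhs = M.get((top, look))
                if rhs is None:            # empty table slot: error
                    raise SyntaxError(f"no entry M[{top}, {look}]")
                print(f"expand {top} -> {' '.join(rhs) or 'ε'}")
                stack.pop()
                stack.extend(reversed(rhs))  # leftmost RHS symbol ends up on top
            elif top == look:
                print(f"match {look}")
                stack.pop()
                i += 1
            else:
                raise SyntaxError(f"expected {top}, saw {look}")
        if tokens[i] != "$":
            raise SyntaxError("input remains after the stack emptied")
        print("accept")

    parse(["id", "+", "id", "*", "id"])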

4.4.5: Error Recovery in Predictive Parsing

Skipped.

4.5: Bottom-Up Parsing

Now we start with the input string, i.e., the bottom (leaves) of what will become the parse tree, and work our way up to the start symbol.

For bottom-up parsing, we are not as fearful of left recursion as we were with top-down. Our first few examples will use the left-recursive expression grammar

    E → E + T | T
    T → T * F | F
    F → ( E ) | id
  

4.5.1: Reductions

Remember that running a production in reverse, i.e., replacing the RHS by the LHS, is called reducing. So our goal is to reduce the input string to the start symbol.

On the right is a movie of parsing id*id in a bottom-up fashion. Note the way it is written. For example, from step 1 to step 2, we don't just put F above id*id. We draw it as we do because it is the current top of the tree (really forest), and not the bottom, that we are working on, so we want the tops to lie in a horizontal line and hence be easy to read.

The tops of the forest are the roots of the subtrees present in the diagram. For the movie those are
id * id,   F * id,   T * id,   T * F,   T,   E
Note that (since the reduction successfully reaches the start symbol) each of these sets of roots is a sentential form.

The steps from one frame of the movie to the next, when viewed going down the page, are reductions (replace the RHS of a production by the LHS). Naturally, when viewed going up the page, we have a derivation (replace LHS by RHS). For our example the derivation is
E ⇒ T ⇒ T * F ⇒ T * id ⇒ F * id ⇒ id * id

Note that this is a rightmost derivation and hence each of the sets of roots identified above is a right sentential form. So the reduction we did in the movie was a rightmost derivation in reverse.

Remember that for a non-ambiguous grammar there is only one rightmost derivation and hence there is only one rightmost derivation in reverse.

Remark: You cannot simply scan the string (the roots of the forest) from left to right and choose the first substring that matches the RHS of some production. If you try it in our movie you will reduce T to E right after T appears. The result is not a right sentential form.
    Right Sentential Form    Handle    Reducing Production
    id1 * id2                id1       F → id
    F * id2                  F         T → F
    T * id2                  id2       F → id
    T * F                    T * F     T → T * F
    T                        T         E → T

4.5.2: Handle Pruning

The strings that are reduced during the reverse of a rightmost derivation are called the handles. For our example, this is shown in the table above.

Note that the string to the right of the handle must contain only terminals. If there were a nonterminal to the right, it would have been expanded after this point in the RIGHTmost derivation, i.e., it would already have been reduced in the reverse of that derivation before we reached this right sentential form.

Often, instead of referring to the production A → α (together with a position) as a handle, we simply call α the handle. I should say a handle because there can be more than one if the grammar is ambiguous.

So (assuming a non-ambiguous grammar) the rightmost derivation in reverse can be obtained by constantly reducing the handle in the current string.

Homework: 4.23 a c

4.5.3: Shift-Reduce Parsing

We use two data structures for these parsers.

  1. A stack of grammar symbols, terminals and nonterminals. This stack is drawn in examples as having its top on the right and bottom on the left. The items shifted (see below) onto the stack will be terminals, but some are reduced to nonterminals. The bottom of the stack is marked with $ and initially the stack is empty (i.e., has just $).
  2. An input buffer that (conceptually) holds the remainder of the input, i.e., the part that has yet to be shifted onto the stack. An endmarker $ is placed after the end of the input. Initially the input buffer contains the entire input followed by $. (In practice we use some more sophisticated buffering technique, as we saw in section 3.2 with buffer pairs, that does not require having the entire input in memory at once.)

    Stack      Input        Action
    $          id1*id2$     shift
    $id1       *id2$        reduce F → id
    $F         *id2$        reduce T → F
    $T         *id2$        shift
    $T*        id2$         shift
    $T*id2     $            reduce F → id
    $T*F       $            reduce T → T*F
    $T         $            reduce E → T
    $E         $            accept

The idea, illustrated by the table above, is that at any point the parser can perform one of four operations (a toy driver is sketched after the list).

  1. The parser can shift a symbol from the beginning of the input onto the stack, where it becomes the new TOS.
  2. If the symbols on top of the stack form a handle, the parser can reduce them to the LHS of the corresponding production.
  3. If the parser reaches the accepting state with the stack $S and the input $, the parser terminates successfully.
  4. The parser reaches an error state.
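To make the mechanics concrete, here is a toy Python driver that replays the Action column of the table above. How to choose each action is exactly what the LR machinery to come will settle; this sketch (encoding mine) just performs scripted shifts and reductions, checking that each handle is on the TOS.

    GRAMMAR = {  # production name -> (LHS, RHS)
        "F→id": ("F", ["id"]), "T→F": ("T", ["F"]),
        "T→T*F": ("T", ["T", "*", "F"]), "E→T": ("E", ["T"]),
    }

    def run(tokens, script):
        stack, rest = ["$"], tokens + ["$"]
        for action in script:
            print(f"{''.join(stack):10} {''.join(rest):10} {action}")
            if action == "shift":
                stack.append(rest.pop(0))      # move one symbol: input -> TOS
            elif action == "accept":
                assert stack == ["$", "E"] and rest == ["$"]
                return
            else:                              # reduce by the named production
                lhs, rhs = GRAMMAR[action]
                assert stack[-len(rhs):] == rhs, "handle must be on TOS"
                del stack[-len(rhs):]          # pop the handle ...
                stack.append(lhs)              # ... and push the LHS

    run(["id", "*", "id"],
        ["shift", "F→id", "T→F", "shift", "shift", "F→id",
         "T→T*F", "E→T", "accept"])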

A technical point, which explains the use of a stack, is that a handle is always at the TOS. See the book for a proof; the idea is to look at what rightmost derivations can do (specifically two consecutive productions) and then trace back what the parser will do, since it does the reverse operations (reductions) in the reverse order.

We have not yet discussed how to decide whether to shift or reduce when both are possible. We have also not discussed which reduction to choose if multiple reductions are possible. These are the crucial questions for bottom-up (shift-reduce) parsing and will be addressed.

Homework: 4.23 b

4.5.4: Conflicts During Shift-Reduce Parsing

There are (non-LR) grammars for which no shift-reduce parser can always decide whether to shift or reduce when both are possible, or which reduction to perform when several are possible. However, for most languages, choosing a good lexer yields an LR(k) language of tokens. For example, Ada uses () for both function calls and array references. If the lexer returned id for both array names and procedure names, then a reduce/reduce conflict would occur when the stack was ... id ( id and the input ) ..., since the id on the TOS should be reduced to parameter if the first id was a procedure name and to expr if the first id was an array name. A better lexer (together with the assumption, true in Ada, that a declaration must precede any use) would return proc-id when it encounters a lexeme corresponding to a procedure name. It can do this by consulting the symbol table as it builds it.

4.6: Introduction to LR Parsing: Simple LR

I will have much more to say about SLR than the other LR schemes. The reason is that SLR is simpler to understand, but does capture the essence of shift-reduce, bottom-up parsing. The disadvantage of SLR is that there are LR grammars that are not SLR.

4.6.1: Why LR Parsers?

The text's presentation is somewhat controversial. Most commercial compilers use hand-written top-down parsers of the recursive-descent (LL not LR) variety. Since the grammars for these languages are not LL(1), the straightforward application of the techniques we have seen will not work. Instead the parsers actually look ahead further than one token, but only at those few places where the grammar is in fact not LL(1). Recall that (hand-written) recursive-descent parsers have a procedure for each nonterminal, so we can customize as needed.

These compiler writers claim that they are able to produce much better error messages than can readily be obtained by going to LR (with its attendant requirement that a parser-generator be used, since the parsers are too large to construct by hand). Note that compiler error messages are a very important user-interface issue and that with recursive descent one can augment the procedure for a nonterminal with statements like
    if (nextToken == X) then error("expected Y here")

Nonetheless, the claims made by the text are correct, namely:

  1. LR parsers can be constructed to recognize nearly all programming-language constructs for which CFGs exist.
  2. LR-parsing is the most general nonbacktracking, shift-reduce method known, yet can be implemented relatively efficiently.
  3. LR-parsing can detect a syntactic error as soon as it is possible to do so on a left-to-right scan of the input.
  4. LR grammars can describe more languages than LL grammars.

4.6.2: Items and the LR(0) Automaton

We now come to grips with the big question: How does a shift-reduce parser know when to shift and when to reduce? This will take a while to answer in a satisfactory manner. The unsatisfactory answer is that the parser has tables that say in each situation whether to shift or reduce (or announce error, or announce acceptance). To begin the path toward the answer, we need several definitions.

An item is a production with a marker saying how far the parser has gotten with this production. Formally,

Definition: An (LR(0)) item of a grammar is a production with a dot added somewhere to the RHS.

Examples:

  1. E → E + T generates 4 items.
    1. E → · E + T
    2. E → E · + T
    3. E → E + · T
    4. E → E + T ·
  2. A → ε generates A → · as its only item.

The item E → E · + T signifies that the parser has just processed input that is derivable from E and will look for input derivable from + T.

Item 4 indicates that the parser has just seen the entire RHS and must consider reducing it to E. Important: consider reducing does not mean reduce.
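In general, a production whose RHS has n grammar symbols generates n+1 items, and an ε-production generates just one. A small sketch (encoding mine):

    DOT = "·"

    def items(lhs, rhs):
        rhs = [] if rhs == ["ε"] else rhs    # an ε-production has an empty RHS
        return [f"{lhs} → {' '.join(rhs[:i] + [DOT] + rhs[i:])}"
                for i in range(len(rhs) + 1)]

    print(items("E", ["E", "+", "T"]))       # the 4 items listed above
    print(items("A", ["ε"]))                 # ['A → ·']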

The parser groups certain items together into states. As we shall see, the items within a given state are treated similarly.

Our goal is to construct first the canonical LR(0) collection of states and then a DFA called the LR(0) automaton (technically not a DFA since it has no dead state).

To construct the canonical LR(0) collection formally and present the parsing algorithm in detail we shall

  1. augment the grammar
  2. define functions CLOSURE and GOTO

Augmenting the grammar is easy. We simply add a new start symbol S' and one production S' → S. The purpose is to detect success, which occurs when the parser is ready to reduce S to S'.

So our example grammar

    E → E + T | T
    T → T * F | F
    F → ( E ) | id
  
is augmented by adding the production E' → E.

Interlude: The Rough Idea

I hope the following interlude will prove helpful. In preparing to present SLR, I was struck by how it looked like we were working with a DFA that came from some (unspecified and unmentioned) NFA. It seemed that by first doing the NFA, I could give some rough insight. Since for our current example the NFA has more states and hence a bigger diagram, let's consider the following extremely simple grammar.

    E → E + T
    E → T
    T → id
  
When augmented this becomes

    E' → E
    E → E + T
    E → T
    T → id
  
When the dots are added we get 10 items (4 from the second production, 2 each from the other three). See the diagram at the right. We begin at E' → ·E since it is the start item.

[Diagram: NFA whose states are the LR(0) items of this grammar]

Note that there are really four kinds of edges.

  1. Edges labeled with terminals. These correspond to shift actions, where the indicated terminal is shifted from the input to the stack.
  2. Edges labeled with nonterminals. These will correspond to reduce actions when we construct the DFA. The stack is reduced by a production having the given nonterminal as LHS. Reduce actions do more as we shall see.
  3. Edges labeled with ε. These are associated with the closure operation to be discussed and are the source of the nondeterminism (i.e., why the diagram is an NFA).
  4. An edge labeled $. This edge, which can be thought of as shifting the endmarker, is used when we are reducing via the E'→E production and accepting the input.

If we were at the item E→E·+T (the dot indicating that we have seen an E and now need a +) and shifted a + from the input to the stack we would move to the item E→E+·T. If the dot is before a non-terminal, the parser needs a reduction with that non-terminal as the LHS.

Now we come to the idea of closure, which I illustrate in the diagram with the ε's. Please note that this is rough; we are not doing regular expressions again, but I hope this will help you understand the idea of closure, which, like ε in regular expressions, leads to nondeterminism.

Look at the start state. The placement of the dot indicates that we next need to see an E. Since E is a nonterminal, we won't see it in the input, but will instead have to generate it via a production. Thus by looking for an E, we are also looking for any production that has E on the LHS. This is indicated by the two ε's leaving the top left box. Similarly, there are ε's leaving the other three boxes where the dot is immediately to the left of a nonterminal.

As with regular expressions, we combine n-items connected by an ε arc into a d-item. The actual terminology used is that we combine these items into a set of items (later referred to as a state). There is another combination that occurs. The top two n-items in the left column are combined into the same d-item, and both n-items have E transitions (outgoing arcs labeled E). Since we are considering these two n-items to be the same d-item and the arcs correspond to the same transition, the two targets (the top two n-items in the 2nd column) are combined. A d-item has all the outgoing arcs of the original n-items it contains. This is the way we converted an NFA into a DFA in the previous chapter.
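Here is a rough Python sketch of the combining just described: closure follows the ε arcs (a dot before a nonterminal B pulls in every item B → ·γ), and goto moves the dot across one labeled arc. A d-item is just a (frozen) set of n-items; the (lhs, rhs, dot) encoding is mine.

    GRAMMAR = [("E'", ("E",)), ("E", ("E", "+", "T")),
               ("E", ("T",)), ("T", ("id",))]
    NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

    def closure(items):
        """Add B → ·γ for every item with the dot before nonterminal B."""
        items = set(items)
        while True:
            new = {(lhs, rhs, 0)
                   for (_, r, d) in items
                   if d < len(r) and r[d] in NONTERMINALS
                   for (lhs, rhs) in GRAMMAR if lhs == r[d]} - items
            if not new:
                return frozenset(items)
            items |= new

    def goto(items, X):
        """Follow the arc labeled X: move the dot over one X."""
        return closure({(l, r, d + 1) for (l, r, d) in items
                        if d < len(r) and r[d] == X})

    start = closure({("E'", ("E",), 0)})     # the start d-item
    for l, r, d in sorted(start):
        print(l, "→", " ".join(r[:d] + ("·",) + r[d:]))
    print("goto on E has", len(goto(start, "E")), "items")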