Compilers

Start Lecture #5

3.8.1: The Structure of the Generated Analyzer

We have seen simulators for DFAs and NFAs.

The remaining large question is how the lex input is converted into one of these automata.

Also

  1. Lex permits functions to be passed through to the lex.yy.c file. This is fairly straightforward to implement, but is not part of lab2.
  2. Lex also supports actions that are to be invoked by the simulator when a match occurs. This is also fairly straightforward, but again is not part of lab2.
  3. The lookahead operator is not so simple in the general case and is discussed briefly below, but again is not part of lab2.

In this section we will use transition graphs. Of course lexer-generators do not draw pictures; instead they use the equivalent transition tables.

Recall that the regular definitions in Lex are mere conveniences that can easily be converted to REs and hence we need only convert REs into an FSA.

[Figure: the combined NFA built from the NFAs for the individual patterns]

We already know how to convert a single RE into an NFA. But lex input will contain (and lab 2 does contain) several REs since it wishes to recognize several different tokens. The solution is to

  1. Produce an NFA for each RE.
  2. Introduce a new start state.
  3. Introduce an ε transition from the new start state to the start of each NFA constructed in step 1.
  4. When one of the NFAs reaches one of the accepting states, the simulation does NOT stop. See below for an explanation.
The result is shown to the right.

Label each of the accepting states (for all NFAs constructed in step 1) with the actions specified in the lex program for the corresponding pattern.
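
Here is a minimal sketch of steps 1-4, assuming each per-pattern NFA has already been built (e.g., by the construction from 3.7) with globally unique state names; the dictionary-based NFA representation and the name combine_nfas are illustrative choices, not anything lex prescribes.

    # Sketch: combine several per-pattern NFAs with a new start state and
    # ε-transitions to each of their start states (steps 1-4 above).
    EPS = None                      # label used for ε-transitions (illustrative)

    def combine_nfas(nfas):
        """nfas: list of (start, accepting_states, transitions, action).
        transitions maps (state, symbol) -> set of states; state names are
        assumed to be globally unique across the individual NFAs."""
        new_start = "S0"
        trans = {}
        action_of = {}              # accepting N-state -> action of its pattern
        for start, accepting, t, action in nfas:
            for key, targets in t.items():
                trans.setdefault(key, set()).update(targets)
            trans.setdefault((new_start, EPS), set()).add(start)
            for a in accepting:
                action_of[a] = action
        return new_start, trans, action_of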

3.8.2: Pattern Matching Based on NFAs

We use the algorithm for simulating NFAs presented in 3.7.2.

The simulator starts reading characters and calculates the set of states it is at.
Pattern      Action to perform
a            Action1
abb          Action2
a*bb*        Action3

At some point the input character does not lead to any state or we have reached the eof. Since we wish to find the longest lexeme matching the pattern we proceed backwards from the current point (where there was no state) until we reach an accepting state (i.e., the set of N-states contains an accepting N-state). Each accepting N-state corresponds to a matched pattern. The lex rule is that if a lexeme matches multiple patterns we choose the pattern listed first in the lex-program. I don't believe this rule will be needed in lab 2 since I can't think of a case where two different patterns will match the same (longest) lexeme.

[Figure: the NFAs for the patterns a, abb, and a*bb*, joined by a new start state]

Example

Consider the example just above with three patterns and their associated actions and consider processing the input aaba.

  1. We begin by constructing the three NFAs. To save space, the third NFA is not the one that would be constructed by our algorithm, but is an equivalent smaller one. For example, some unnecessary ε-transitions have been eliminated. If one views the lex executable as a compiler transforming lex source into NFAs, this would be considered an optimization.
  2. We introduce a new start state and ε-transitions as in the previous section.
  3. We start at the ε-closure of the start state, which is {0,1,3,7}.
  4. The first a (remember the input is aaba) takes us to {2,4,7}. This includes an accepting state and indeed we have matched the first pattern. However, we do not stop since we may find a longer match.
  5. The next a takes us to {7}.
  6. The b takes us to {8}.
  7. The next a fails since there are no a-transitions out of state 8. So we must back up to before trying the last a.
  8. We are back in {8} and ask if one of these N-states (I know there is only one, but there could be more) is an accepting state.
  9. Indeed state 8 is accepting for third pattern. If there were more than one accepting state in the list, we would choose the one in the earliest listed pattern.
  10. Action3 would now be performed.
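
The trace above is exactly what a maximal-munch simulator does. Here is a small sketch, assuming the same illustrative NFA representation as in the earlier sketch (transitions keyed by (state, symbol), with None as the ε label); accept_priority records which pattern each accepting N-state belongs to, with 0 meaning the first-listed pattern.

    EPS = None                      # ε label, as in the earlier sketch

    def eps_closure(states, trans):
        stack, closure = list(states), set(states)
        while stack:
            s = stack.pop()
            for t in trans.get((s, EPS), ()):
                if t not in closure:
                    closure.add(t)
                    stack.append(t)
        return closure

    def longest_match(start, trans, accept_priority, text, pos=0):
        """accept_priority: accepting N-state -> index of its pattern
        (0 = listed first).  Returns (end, pattern index) of the longest
        match starting at pos, or None if no prefix matches."""
        current = eps_closure({start}, trans)
        best, i = None, pos
        while True:
            hits = [accept_priority[s] for s in current if s in accept_priority]
            if hits:
                best = (i, min(hits))       # min = earliest-listed pattern wins
            if i == len(text):
                break
            nxt = set()
            for s in current:
                nxt |= trans.get((s, text[i]), set())
            if not nxt:
                break                       # stuck: "back up" to the best match
            current = eps_closure(nxt, trans)
            i += 1
        return best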

[Figure: the DFA obtained from the combined NFA]

3.8.3: DFA's for Lexical Analyzers

We could also convert the NFA to a DFA and simulate that. The resulting DFA is on the right. Note that it shows the same D-states (i.e., sets of N-states) we saw in the previous section, plus some other D-states that arise from inputs other than aaba.

We label the accepting states with the pattern matched. If multiple patterns are matched (because the accepting D-state contains multiple accepting N-states), we use the first pattern listed (assuming we are using lex conventions). For example, the middle D-state on the bottom row contains two accepting N-states, 6 and 8. Since the RE for 6 was listed first, it appears below the state.

Consider processing the string aa. Show how you get two tokens.

Technical point. For a DFA, there must be an outgoing edge from each D-state for each possible character. In the diagram, when there is no NFA state possible, we do not show the edge. Technically we should show these edges, all of which lead to the same D-state, called the dead state, which corresponds to the empty subset of N-states.
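
If you take the DFA route, the driver is even simpler: the D-states become rows of a transition table and a missing entry plays the role of the dead state. The sketch below is illustrative; the table contents would come from your own NFA-to-DFA conversion, which is not shown here.

    def dfa_longest_match(start, table, accepting, text, pos=0):
        """table: dict (D-state, character) -> D-state; a missing entry is the
        dead state.  accepting: D-state -> index of the first-listed pattern."""
        state, best, i = start, None, pos
        while True:
            if state in accepting:
                best = (i, accepting[state])
            if i == len(text):
                break
            nxt = table.get((state, text[i]))
            if nxt is None:                 # would enter the dead state
                break
            state, i = nxt, i + 1
        return best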

Remark: A pure DFA or NFA simply accepts or rejects a single string. In the context of lexers, that would mean accepting or rejecting a single lexeme. But a lexer isn't given a single lexeme to process, it is given a program consisting of many lexemes. That is why we have the complication of backtracking after getting stuck; a pure NFA/DFA would just say reject if the next character did not correspond to an outgoing arc.

Alternatives for Implementing Lab 2

There are trade-offs depending on how much you want to do by hand and how much you want to program. At the extreme you could write a program that reads in the regular expression for the tokens and produces a lexer, i.e., you could write a lexical-analyzer-generator. I very much advise against this, especially since the first part of the lab requires you to draw the transition diagrams anyway.

The two reasonable alternatives are:

  1. By hand, convert the NFA to a DFA and then write your lexer based on this DFA, simulating its actions for input strings.
  2. Write your program based on the NFA.

3.8.4: Implementing the Lookahead Operator

This has some tricky points; we are basically skipping it. The lookahead operator is for when you must look further down the input, but the extra characters matched are not part of the lexeme. We write the pattern r1/r2. In the NFA we match r1, treat the / as an ε, and then match r2. It would be fairly easy to describe the situation when the NFA has only one ε-transition at the state where r1 is matched, but it is tricky when there is more than one such transition.

3.9: Optimization of DFA-Based Pattern Matchers

3.9.1: Important States of an NFA

3.9.2: Functions Computed from the Syntax Tree

3.9.3: Computing nullable, firstpos, and lastpos

3.9.4: Computing followpos

Remark: Lab 2 assigned. Part 1 (no programming) due in one week; the remainder is due in 2 weeks (i.e., one week after part 1).

Chapter 4: Syntax Analysis

Homework: Read Chapter 4.

4.1: Introduction

4.1.1: The role of the parser

As we saw in the previous chapter the parser calls the lexer to obtain the next token.

Conceptually, the parser accepts a sequence of tokens and produces a parse tree. In practice this might not occur.

  1. The source program might have errors. Shamefully, we will do very little error handling.
  2. Instead of explicitly constructing the parse tree, the actions that the downstream components of the front end would do on the tree can be integrated with the parser and done incrementally on components of the tree. We will see examples of this, but your lab number 3 will produce a parse tree. Your lab number 4 will process this parse tree and do the actions.
  3. Real compilers produce (abstract) syntax trees not parse trees (concrete syntax trees). We don't do this for the pedagogical reasons given previously.

There are three classes for grammar-based parsers.

  1. universal
  2. top-down
  3. bottom-up

The universal parsers are not used in practice as they are inefficient; we will not discuss them.

As expected, top-down parsers start from the root of the tree and proceed downward; whereas, bottom-up parsers start from the leaves and proceed upward.

The commonly used top-down and bottom-up parsers are not universal. That is, there are context-free grammars that cannot be used with them.

The LL (top down) and LR (bottom-up) parsers are important in practice. Hand written parsers are often LL. Specifically, the predictive parsers we looked at in chapter two are for LL grammars.

The LR grammars form a larger class. Parsers for this class are usually constructed with the aid of automatic tools.

4.1.2: Representative Grammars

Expressions with + and *

    E → E + T | T
    T → T * F | F
    F → ( E ) | id
  

This takes care of precedence, but as we saw before, gives predictive parsing trouble since it is left-recursive. So we used the following non-left-recursive grammar that generates the same language.

    E  → T E'
    E' → + T E' | ε
    T  → F T'
    T' → * F T' | ε
    F  → ( E ) | id
  

The following ambiguous grammar will be used for illustration, but in general we try to avoid ambiguity.

    E → E + E | E * E | ( E ) | id
  
This grammar does not enforce precedence and it does not specify left vs right associativity.
For example, id + id + id and id + id * id each have two parse trees.

4.1.3: Syntax Error Handling

There are different levels of errors.

  1. Lexical errors: For example, spelling.
  2. Syntactic errors: For example missing ; .
  3. Semantic errors: For example wrong number of array indexes.
  4. Logical errors: For example off by one usage of < instead of <=.

4.1.4: Error-Recovery Strategies

The goals are clear, but difficult.

Trivial Approach: No Recovery

Print an error message when parsing cannot continue and then terminate parsing.

Panic-Mode Recovery

The first level improvement. The parser discards input until it encounters a synchronizing token. These tokens are chosen so that the parser can make a fresh beginning. Good examples for C/Java are ; and }.

Phrase-Level Recovery

Locally replace some prefix of the remaining input by some string. Simple cases are exchanging ; with , and = with ==. Difficulties occur when the real error occurred long before an error was detected.

Error Productions

Include productions for common errors.

Global Correction

Change the input I to the closest correct input I' and produce the parse tree for I'.

4.2: Context-Free Grammars

4.2.1: Formal Definition

Definition: A Context-Free Grammar consists of

  1. Terminals: The basic components found by the lexer. They are sometimes called token names, i.e., the first component of the token as produced by the lexer.
  2. Nonterminals: Syntactic variables that help define the syntactic structure of the language.
  3. Start Symbol: A nonterminal that forms the root of the parse tree.
  4. Productions:
    1. Head or left (hand) side or LHS. For context-free grammars, which are our only interest, the LHS must consist of just a single nonterminal.
    2. Body or right (hand) side or RHS. A string of terminals and nonterminals.

4.2.2: Notational Conventions

I am not as formal as the book. In particular, I don't use italics. Nonetheless I do (try to) use some of the conventions, in particular the ones below. Please correct me if I violate them.

As I have mentioned before, when the entire grammar is written out, no conventions are needed to distinguish the nonterminals, terminals, and start symbol. The nonterminals are the LHSs, the terminals are everything else on the RHSs, and the start symbol is the LHS of the first production. The notational conventions are used when you give just a few productions rather than a full grammar.

4.2.3: Derivations

This is basically just notational convenience, but important nonetheless.

Assume we have a production A → α. We would then say that A derives α and write
A ⇒ α

We generalize this. If, in addition, β and γ are strings (each may contain terminals and/or nonterminals), we say that βAγ derives βαγ and write

    βAγ ⇒ βαγ
  
We say that βAγ derives βαγ in one step.

We generalize further. If α derives β and β derives γ, we say α derives γ and write
α ⇒* γ.

The notation used is ⇒ with a * over it (I don't see it in html). This should be read derives in zero or more steps. Formally,

  1. α ⇒* α, for any string α.
  2. If α ⇒* β and β ⇒ γ, then α ⇒* γ.
Informally, α ⇒* β means you can get from α to β, and α ⇒ β means you can get from α to β in one step.

Definition: If S is the start symbol and S⇒*α, we say α is a sentential form of the grammar.

A sentential form may contain nonterminals and terminals.

Definition: A sentential form containing only terminals is called a sentence of the grammar.

Definition: The language generated by a grammar G, written L(G), is the set of these sentences.

Definition: A language generated by a (context-free) grammar is called a context free language.

Definition: Two grammars generating the same language are called equivalent.

Examples: Recall the ambiguous grammar above

    E → E + E | E * E | ( E ) | id
  
We see that id + id is a sentence. Indeed it can be derived in two ways from the start symbol E.
    E ⇒ E + E ⇒ id + E ⇒ id + id
    E ⇒ E + E ⇒ E + id ⇒ id + id
  

(Since both derivations give the same parse tree, this does not show the grammar is ambiguous. You should be able to find a sentence—without looking back in the notes—that has two different parse trees.)

In the first derivation shown just above, each step replaced the leftmost nonterminal by the body of a production having the nonterminal as head. This is called a leftmost derivation. Similarly the second derivation, in which each step replaced the rightmost nonterminal, is called a rightmost derivation. Sometimes the latter are called canonical derivations, but we won't do so.

When one wishes to emphasize that a (one step) derivation is leftmost they write an lm under the ⇒. To emphasize that a (general) derivation is leftmost, one writes an lm under the ⇒*. Similarly one writes rm to indicate that a derivation is rightmost. I won't do this in the notes but will on the board.

Definition: If x can be derived using a leftmost derivation, we call x a left-sentential form. Similarly for a right-sentential form.

Homework: 1(ab), 2(ab).

4.2.4: Parse Trees and Derivations

The leaves of a parse tree (or of any other tree), when read left to right, are called the frontier of the tree. For a parse tree we also call them the yield of the tree.

If you are given a derivation starting with a single nonterminal,
A ⇒ α1 ⇒ α2 ... ⇒ αn it is easy to write a parse tree with A as the root and αn as the leaves. Just do what (the production contained in) each step of the derivation says. The LHS of each production is a nonterminal in the frontier of the current tree so replace it with the RHS to get the next tree.

Do this for both the leftmost and rightmost derivations of id+id above.

So there can be many derivations that wind up with the same final tree.

But for any parse tree there is a unique leftmost derivation producing that tree (always choose the leftmost unmarked nonterminal to be the LHS, mark it, and write the production with this LHS and the children as the RHS). Similarly, there is a unique rightmost derivation that produces the tree. There may be others as well (e.g., sometime choose the leftmost unmarked nonterminal to expand and other times choose the rightmost; or choose a middle unmarked nonterminal).

Homework: 1c

4.2.5: Ambiguity

Recall that an ambiguous grammar is one for which there is more than one parse tree for a single sentence. Since each parse tree corresponds to exactly one leftmost derivation, a grammar is ambiguous if and only if it permits more than one leftmost derivation of a given sentence. Similarly, a grammar is ambiguous if and only if it permits more than one rightmost derivation of a given sentence.

We know that the grammar

    E → E + E | E * E | ( E ) | id
  
is ambiguous. For example, there are two parse trees for
    id + id * id
  
So there must be at least two leftmost derivations. Here they are:
    E ⇒ E + E          E ⇒ E * E
      ⇒ id + E           ⇒ E + E * E
      ⇒ id + E * E       ⇒ id + E * E
      ⇒ id + id * E      ⇒ id + id * E
      ⇒ id + id * id     ⇒ id + id * id
  

As we stated before we prefer unambiguous grammars. Failing that, we want disambiguation rules, as are often given for the dangling else in the C language.

4.2.6: Verification

4.2.7: Context-Free Grammars Versus Regular Expressions

Alternatively context-free languages vs regular languages.

Given an RE, we learned in Chapter 3 how to construct an NFA accepting the same strings.

Now we show that, given an NFA, we can construct a grammar generating the same language.

  1. Define a nonterminal Ai for each state i.
  2. For a transition from Ai to Aj on input a (or ε), add a production
    Ai → aAj (or Ai → Aj).
  3. If i is accepting, add Ai → ε
  4. If i is start, make Ai start.

If you trace an NFA accepting a sentence, it just corresponds to the constructed grammar deriving the same sentence. Similarly, follow a derivation and notice that at any point prior to acceptance there is only one nonterminal; this nonterminal gives the state in the NFA corresponding to this point in the derivation.
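
Here is a sketch of this construction, using the same illustrative NFA representation as in the section 3.8 sketches (transitions keyed by (state, symbol), with None as the ε label); productions are returned as (head, body) pairs with an empty body standing for ε.

    EPS = None                      # ε label, as in the section 3.8 sketches

    def nfa_to_grammar(start, trans, accepting):
        """Return (start nonterminal, list of (head, body) productions);
        an empty body stands for ε."""
        prods = []
        for (state, sym), targets in trans.items():
            for t in targets:
                body = (f"A{t}",) if sym is EPS else (sym, f"A{t}")
                prods.append((f"A{state}", body))
        for a in accepting:
            prods.append((f"A{a}", ()))     # Ai -> ε for each accepting state i
        return f"A{start}", prods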

The book starts with (a|b)*abb and then uses the short NFA on the left below. Recall that the NFA generated by our construction is the longer one on the right.

[Figures: the book's short NFA for (a|b)*abb (left) and the longer NFA produced by our construction (right)]

The book gives the simple grammar for the short diagram.

Let's be ambitious and try the long diagram.

    A0 → A1 | A7
    A1 → A2 | A4
    A2 → a A3
    A3 → A6
    A4 → b A5
    A5 → A6
    A6 → A1 | A7
    A7 → a A8
    A8 → b A9
    A9 → b A10
    A10 → ε
  

Now trace a path in the NFA beginning at the start state and see that it is just a derivation. That is, the string corresponding to that path is a sentential form. The same is true in reverse (a derivation gives a path). The key is that at every stage you have at most one nonterminal.

Then notice that when you get to an accepting state, you have no nonterminals, so accepting a string in the NFA shows it is a sentence in the language.

Grammars, but not Regular Expressions, Can Count

The grammar

    A → c A b | ε
  
generates all strings of the form cⁿbⁿ, where there are the same number of c's and b's. In a sense the grammar has counted. No RE can generate this language.

A proof is in the book. The idea is that you would need an infinite number of states to remember the number of c's you have seen so that you can ensure you see the same number of b's. But an RE is equivalent to a DFA (or NFA), and the F stands for finite.

4.3: Writing a Grammar

4.3.1: Lexical vs Syntactic Analysis

Why have a separate lexer and parser? Since the lexer deals with REs / Regular Languages and the parser deals with the more powerful Context Free Grammars (CFGs) / Context Free Languages (CFLs), everything a lexer can do, a parser could do as well. The reasons for separating the lexer and parser are from software engineering considerations.

4.3.2: Eliminating Ambiguity

Recall the ambiguous grammar with the notorious dangling else problem.

      stmt → if expr then stmt
      | if expr then stmt else stmt
      | other
    

This has two leftmost derivations for
if E1 then S1 else if E2 then S2 else S3

Do these on the board. They differ in the beginning.

In this case we can find a non-ambiguous, equivalent grammar.

        stmt → matched-stmt | open-stmt
matched-stmt → if expr then matched-stmt else matched-stmt
	     | other
   open-stmt → if expr then stmt
	     | if expr then matched-stmt else open-stmt
    

On the board find the unique parse tree for the problem sentence and from that the unique leftmost derivation.

Remark:
There are three areas relevant to the above example.

  1. Language design. C vs Ada (end if). We are not studying this.
  2. Finding a non-ambiguous grammar for the C if-then-else. This was not easy.
  3. Parsing the dangling else example with the non-ambiguous grammar. We can do this.
End of Remark.

4.3.3: Eliminating Left Recursion

We did special cases in chapter 2. Now we do it right(tm).

Previously we did it separately for one production and for two productions with the same nonterminal on the LHS. Not surprisingly, this can be done for n such productions (together with other non-left recursive productions involving the same nonterminal).

Specifically we start with

    A → A α1 | A α2 | ... | A αn | β1 | β2 | ... | βm
  
where the α's and β's are strings, and no β begins with A (otherwise it would be an α).

The equivalent non-left recursive grammar is

    A  → β1 A' | ... | βm A'
    A' → α1 A' | ... | αn A' | ε
  

The idea is as follows. Look at the left recursive grammar. At some point you stop producing more As and have the A (which is always on the left) become one of the βs. So the final string starts with a β. Up to this point all the As became Aα for one of the αs. So the final string is a β followed by a bunch of αs, which is exactly what the non-left recursive definition says.

Example: Assume n=m=1, α1 is + and β1 is *. So the left recursive grammar is

    A → A + | *
  
and the non-left recursive grammar is
    A  → * A'
    A' → + A' | ε
  
With the recursive grammar, we have the following derivation.
    A ⇒ A + ⇒ A + + ⇒ * + +
  
With the non-recursive grammar we have
    A ⇒ * A' ⇒ * + A' ⇒ * + + A' ⇒ * + +
  

This procedure removes direct left recursion where a production with A on the left hand side begins with A on the right. If you also had direct left recursion with B, you would apply the procedure twice.
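
Here is a sketch of the transformation for a single nonterminal with immediate left recursion, representing each body as a tuple of symbols and the empty tuple as ε; the function name and representation are illustrative, not from the book.

    def eliminate_immediate_left_recursion(A, bodies):
        """bodies: list of tuples of symbols; () stands for ε."""
        alphas = [b[1:] for b in bodies if b and b[0] == A]     # the  A α  bodies
        betas  = [b for b in bodies if not (b and b[0] == A)]   # the  β    bodies
        if not alphas:
            return {A: bodies}              # no immediate left recursion
        A1 = A + "'"
        return {
            A:  [beta + (A1,) for beta in betas],
            A1: [alpha + (A1,) for alpha in alphas] + [()],     # () is ε
        }

    # The example above: A -> A + | *
    print(eliminate_immediate_left_recursion("A", [("A", "+"), ("*",)]))
    # {'A': [('*', "A'")], "A'": [('+', "A'"), ()]}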

The harder general case is where you permit indirect left recursion, where, for example one production has A as the LHS and begins with B on the RHS, and a second production has B on the LHS and begins with A on the RHS. Thus in two steps we can turn A into something starting again with A. Naturally, this indirection can involve more than 2 nonterminals.

Theorem: All left recursion can be eliminated.

Proof: The book proves this for grammars that have no ε-productions and no cycles and has exercises asking the reader to prove that cycles and ε-productions can be eliminated.

We will try to avoid these hard cases.

Homework: Eliminate left recursion in the following grammar for simple postfix expressions.

    S → S S + | S S * | a
  

4.3.4: Left Factoring

If two productions with the same LHS have RHSs beginning with the same symbol (terminal or nonterminal), then their FIRST sets will not be disjoint, so predictive parsing (chapter 2) will be impossible; more generally, top-down parsing (defined later in this chapter) will be more difficult since a longer lookahead is needed to decide which production to use.

So convert A → α β1 | α β2 into

   A  → α A'
   A' → β1 | β2
  
In other words factor out the α.
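
Here is a sketch of one round of this factoring, again with bodies as tuples of symbols. It factors only a single leading symbol, which is all the text above asks for; a full left-factoring algorithm would factor the longest common prefix and repeat until no two alternatives share one.

    def left_factor_once(A, bodies):
        """bodies: list of tuples of symbols; () stands for ε."""
        groups = {}
        for b in bodies:
            groups.setdefault(b[:1], []).append(b)      # key: first symbol (or ())
        new_prods, count = {A: []}, 0
        for prefix, group in groups.items():
            if len(group) == 1 or not prefix:
                new_prods[A].extend(group)              # nothing to factor here
            else:
                count += 1
                A1 = A + "'" * count                    # fresh primed nonterminal
                new_prods[A].append(prefix + (A1,))
                new_prods[A1] = [b[1:] for b in group]
        return new_prods

    # A -> a b1 | a b2   becomes   A -> a A' ;  A' -> b1 | b2
    print(left_factor_once("A", [("a", "b1"), ("a", "b2")]))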

Homework: Left factor your answer to the previous homework.

[Figure: movie of a recursive descent parse with backtracking]

4.3.5: Non-CFL Constructs

Although our grammars are powerful, they are not all-powerful. For example, we cannot write a grammar that checks that all variables are declared before used.

4.4: Top-Down Parsing

We did an example of top down parsing, namely predictive parsing, in chapter 2.

For top down parsing, we

  1. Start with the root of the parse tree, which is always the start symbol of the grammar. That is, initially the parse tree is just the start symbol.
  2. Choose a nonterminal in the frontier.
    1. Choose a production having that nonterminal as LHS.
    2. Expand the tree by making the RHS the children of the LHS.
  3. Repeat above until the frontier is all terminals.
  4. Hope that the frontier equals the input string.

The above has two nondeterministic choices (the nonterminal, and the production) and requires luck at the end. Indeed, the procedure will generate the entire language. So we have to be really lucky to get the input string.

Another problem is that the procedure may not terminate.

4.4.1: Recursive Descent Parsing

Let's reduce the nondeterminism in the above algorithm by specifying which nonterminal to expand. Specifically, we do a depth-first (left to right) expansion. This corresponds to a leftmost derivation. That is, we expand the leftmost nonterminal in the frontier.

We leave the choice of production nondeterministic.

We also process the terminals in the RHS, checking that they match the input. By doing the expansion depth-first, left to right, we ensure that we encounter the terminals in the order they will appear in the frontier of the final tree. Thus if the terminal does not match the corresponding input symbol now, it never will (since there are no nonterminals to its left) and the expansion so far will not produce the input string as desired.

Now our algorithm is

  1. Initially, the tree is the start symbol, the nonterminal we are currently processing.

  2. Choose a production having the current nonterminal as LHS and expand the tree with the RHS, X1 X2 ... Xn.

  3. for i = 1 to n
      if Xi is a nonterminal
        process Xi  // recursive call to step 2
      else if Xi (a terminal) matches current input symbol
        advance input to next symbol
      else // trouble: Xi doesn't match and never will
        abandon this choice of production (see below)

Note that the trouble mentioned at the end of the algorithm does not signify an erroneous input. We may simply have chosen the wrong production in step 2.

In a general recursive descent (top-down) parser, we would support backtracking. That is, when we hit the trouble, we would go back and choose another production.

It is possible that no productions work for this nonterminal, because the wrong choice was made earlier. In that case we must back up further.

As an example consider the grammar

    S → A a | B b
    A → 1 | 2
    B → 1 | 2
  
and the input string 1 b. On the right we show a movie of a recursive descent parse of this string in which we have to back up two steps.
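
Here is a small backtracking recursive-descent recognizer for this grammar, so you can watch the two-step backup on the input 1 b. The dictionary representation of the grammar is an illustrative choice, and this is a deliberately simplified backtracker: each nonterminal commits to its first successful alternative, which happens to suffice for this grammar but is not full backtracking.

    GRAMMAR = {
        "S": [("A", "a"), ("B", "b")],
        "A": [("1",), ("2",)],
        "B": [("1",), ("2",)],
    }

    def parse(symbol, tokens, pos):
        """Try to derive tokens[pos:] from symbol; return the new position
        on success or None on failure."""
        if symbol not in GRAMMAR:           # terminal: must match the input
            return pos + 1 if pos < len(tokens) and tokens[pos] == symbol else None
        for body in GRAMMAR[symbol]:        # try the productions in order
            p = pos
            for x in body:
                p = parse(x, tokens, p)
                if p is None:
                    break                   # this alternative failed; back up
            if p is not None:
                return p
        return None                         # every alternative failed

    print(parse("S", ["1", "b"], 0))        # 2: S => B b => 1 b, after backing up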

The good news is that we will work with grammars where we can control the nondeterminism much better. Recall that for predictive parsing, the use of 1 symbol of lookahead made the algorithm fully deterministic, without backtracking.

4.4.2: FIRST and FOLLOW

We used FIRST(RHS) when we did predictive parsing.

Now we learn the whole truth about these two sets, which proves to be quite useful for several parsing techniques (and for error recovery, but we won't make use of this).

The basic idea is that FIRST(α) tells you what the first terminal can be when you fully expand the string α and FOLLOW(A) tells what terminals can immediately follow the nonterminal A.

Definition: For any string α of grammar symbols, we define FIRST(α) to be the set of terminals that occur as the first symbol in a string derived from α. So, if α⇒*cβ for c a terminal and β a string, then c is in FIRST(α). In addition, if α⇒*ε, then ε is in FIRST(α).

Definition: For any nonterminal A, FOLLOW(A) is the set of terminals that can appear immediately to the right of A in a sentential form. Formally, it is the set of terminals c such that S⇒*αAcβ for some strings α and β. In addition, if A can be the rightmost symbol in a sentential form (i.e., if S⇒*αA), the endmarker $ is in FOLLOW(A).

Note that there might have been symbols between A and c during the derivation, providing they all derived ε and eventually c immediately follows A.

Unfortunately, the algorithms for computing FIRST and FOLLOW are not as simple to state as the definitions suggest, in large part because of ε-productions.

Calculating FIRST

Remember that FIRST(α) is the set of terminals that can begin a string derived from α. Since a terminal can derive only itself, FIRST of a terminal is trivial.

    FIRST(a) = {a}, for any terminal a
  

FIRST of a nonterminal is not so easy.

  1. Initialize FIRST(A)=φ for all nonterminals A
  2. If A → ε is a production, add ε to FIRST(A).
  3. For each production A → Y1 ... Yn,
    1. add to FIRST(A) any terminal a satisfying
      1. a is in FIRST(Yi) and
      2. ε is in all previous FIRST(Yj).
    2. add ε to FIRST(A) if ε is in all FIRST(Yj).
    Repeat step 3 until nothing is added.

FIRST of an arbitrary string now follows.

  1. FIRST of any string X=X1X2...Xn is initialized to φ.
  2. add to FIRST(X) any non-ε symbol in FIRST(Xi) if ε is in all previous FIRST(Xj).
  3. add ε to FIRST(X) if ε is in every FIRST(Xj). In particular if X is ε, FIRST(X)={ε}.

Calculating FOLLOW

Recall that FOLLOW is defined only for nonterminals A and is the set of terminals that can immediately follow A in a sentential form (a string derivable from the start symbol). A code sketch implementing both the FIRST and FOLLOW calculations appears after the algorithm below.

  1. Initialize FOLLOW(S)={$} and FOLLOW(A)=φ for all other nonterminals A.
  2. For every production A → α B β, add all of FIRST(β) except ε to FOLLOW(B).
  3. Apply the following rule until nothing is added to any FOLLOW set.
    For every production ending in B, i.e. for
        A → α B and for
        A → α B β, where FIRST(β) contains ε,
    add all of FOLLOW(A) to FOLLOW(B).
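
As promised above, here is a sketch implementing both iterations, with a grammar given as a dict mapping each nonterminal to a list of bodies (tuples of symbols, the empty tuple meaning an ε body); the markers for ε and $ and the tiny demo grammar at the bottom are illustrative choices, not from the notes.

    EPS, END = "ε", "$"

    def first_of_string(symbols, first):
        """FIRST of a string of grammar symbols, given FIRST of each symbol."""
        out = set()
        for x in symbols:
            fx = first.get(x, {x})          # a terminal's FIRST is just itself
            out |= fx - {EPS}
            if EPS not in fx:
                return out
        return out | {EPS}                  # every symbol can derive ε

    def compute_first_follow(grammar, start):
        first = {A: set() for A in grammar}
        changed = True
        while changed:                      # repeat until nothing is added
            changed = False
            for A, bodies in grammar.items():
                for body in bodies:
                    f = first_of_string(body, first)
                    if not f <= first[A]:
                        first[A] |= f
                        changed = True
        follow = {A: set() for A in grammar}
        follow[start].add(END)
        changed = True
        while changed:                      # again, repeat until stable
            changed = False
            for A, bodies in grammar.items():
                for body in bodies:
                    for i, B in enumerate(body):
                        if B not in grammar:
                            continue        # FOLLOW is only for nonterminals
                        f = first_of_string(body[i + 1:], first)
                        add = (f - {EPS}) | (follow[A] if EPS in f else set())
                        if not add <= follow[B]:
                            follow[B] |= add
                            changed = True
        return first, follow

    # Tiny illustrative grammar (not from the notes): S -> A b,  A -> a | ε
    print(compute_first_follow({"S": [("A", "b")], "A": [("a",), ()]}, "S"))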

Do the FIRST and FOLLOW sets for

    E  → T E'
    E' → + T E' | ε
    T  → F T'
    T' → * F T' | ε
    F  → ( E ) | id
  

Homework: Compute FIRST and FOLLOW for the postfix grammar S → S S + | S S * | a

4.4.3: LL(1) Grammars

The predictive parsers of chapter 2 are recursive descent parsers needing no backtracking. A predictive parser can be constructed for any grammar in the class LL(1). The two Ls stand for processing the input Left to right and for producing Leftmost derivations. The 1 in parens indicates that 1 symbol of lookahead is used.

Definition: A grammar is LL(1) if for all production pairs A → α | β

  1. FIRST(α) ∩ FIRST(β) = φ.
  2. If β ⇒* ε, then no string derived from α begins with a terminal in FOLLOW(A). Similarly, if α ⇒* ε.

[Figure: predictive parser positioned at A in the parse tree with lookahead c]

The 2nd condition may seem strange; it did to me for a while. Let's consider the simplest case that condition 2 is trying to avoid. S is the start symbol.

    A → ε      // β=ε so β derives ε
    A → c      // α=c so α derives a string beginning with c
    S → A c    // c is in FOLLOW(A)
  

Probably the simplest derivation possible is

    S ⇒ A c ⇒ c
  

Assume we are using predictive parsing and, as illustrated in the diagram to the right, we are at A in the parse tree and c in the input. Since lookahead=c and c is in FIRST(RHS) for the second A production, we would choose that production to expand A. But this is wrong! Remember that we don't look ahead in the tree, we look ahead just in the input. So we would not have noticed that the next node in the tree (i.e., in the frontier) is c. The next node can indeed be c since c is in FOLLOW(A). So we should have used the top A production to produce ε in the tree, and then the next node c would match the input c.
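
The discussion above can be packaged as a small check of the two LL(1) conditions for one pair of alternatives A → α | β, given FIRST(α), FIRST(β), and FOLLOW(A) (for instance as computed by the sketch in 4.4.2); the helper name is of course just illustrative.

    def ll1_pair_ok(first_alpha, first_beta, follow_A, eps="ε"):
        # Condition 1: FIRST(α) and FIRST(β) must be disjoint.
        if first_alpha & first_beta:
            return False
        # Condition 2: if one alternative can derive ε, the other's FIRST
        # must be disjoint from FOLLOW(A).
        if eps in first_alpha and (first_beta & follow_A):
            return False
        if eps in first_beta and (first_alpha & follow_A):
            return False
        return True

    # The troublesome grammar above: A -> c | ε  with c in FOLLOW(A)
    print(ll1_pair_ok({"c"}, {"ε"}, {"c"}))     # False: condition 2 is violated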