Parsing algorithms

Note: All these routines suppose that the grammar G is given in a set of CFG rules of the form "L -> RHS", where RHS is a non-empty string of symbols.

Algorithm 1: Non-deterministic algorithm for parsing CFGs top-down.

parse(in Q : sentence; G : set of CFG rules;
      out T : parse tree)

begin initialize T to be a tree with root "S";
      loop while there are non-terminal leaves in T
         pick a non-terminal leaf L of T;
         choose a rule in G of the form "L -> RHS"
         if RHS is a word W 
           then choose an occurrence O of W in Q;
                associate O with L;
                if the order of leaves in T violates the order of
                   associated words in Q then fail endif;
           else make children for L corresponding to RHS
        endif
      endloop
      if not all words in Q are attached then fail;
end

Algorithm 2: Deterministic algorithm for parsing CFG's top-down, left to right, using DFS.

parse(in Q : sentence; G : set of CFG rules; 
      out T_OUT : parse tree)

begin initialize T_IN to be a tree with root "S";
      parse1(Q,G,T_IN,T_OUT)
end


/* parse1 is a recursive routine that does the work. Note: each recursion
corresponds to one more step in building the tree. You are not recurring
down the parse tree. 

parse1 returns TRUE if T_IN can be expanded into a complete parse tree T_OUTi
    and FALSE if it cannot. 


Note: The modifications to T_IN must be done non-destructively. That is,
the value of T_IN in a calling rountine must not be changed by its subroutines.
*/

parse1(in Q : sentence; G : set of CFG rules; T_IN : parse tree;
       out T_OUT : parse tree) :  boolean;

begin if all the leaves in T_IN are terminals 
         then if words in Q are all attached to T_IN
                 then begin T_OUT:= T_IN;
                            return(TRUE)
                      end endif
              else return(FALSE) /* there are words in Q unaccounted for */
      endif     
      L := leftmost non-terminal in T_IN;
      if L is a part-of-speech 
        then W := first word in Q not attached to T_IN;
             if L is a possible part of speech for W
               then begin make W a child of L;
                          return(parse1(Q,G,T_IN,T_OUT))
                     end
                else return(FALSE) endif 
        else for each rule in G of form "L -> RHS"
                    create chidren for L with non-terminals RHS;
                    FOUND := parse1(Q,G,T_IN,T_OUT);
                    if FOUND then exitloop
                endfor
                return(FOUND)
         endif

end

Algorithm 3: Non-deterministic algorithm for parsing CFGs bottom-up.

parse(in Q : sentence; G : set of CFG rules;
      out T : forest of parse tree)
/* If the routine succeeds then at the end T will be a single parse tree */

begin initialize T to be the forest of trees where each tree contains one
          word in Q;
      repeat choose a sequence RHS of consecutive roots of trees in T;
             find a rule of form "L -> RHS" in G;
             make L the parent of RHS in T;
       until T is a single tree;
end.

Algorithm 4: Deterministic algorithm for parsing CFGs bottom-up, left-right using DFS.

parse(in Q : sentence; G : set of CFG rules;
      out T : parse tree)

begin initialize T to be the forest of trees where each tree contains one
          word in Q;
      Q1 := the list of the first word of Q;
      parse1(Q,G,Q1,T_IN,T_OUT)
end


/* parse1 return TRUE if T_IN can be grown into a complete parse tree T_OUT
   and FALSE if it cannot. 

In each recursive call it either builds a node whose rightmost descendant 
is the last word in Q1
or it extends Q1 by one more word in Q. This enables the state space to
be tree-structured (systematic).
*/

parse1(in Q : sentence; G : set of CFG rules; Q1 : list of words;
          T_IN : forest of parse trees;
       out T_OUT : parse tree) :  boolean;

begin if T_IN is a single tree containing all the words of Q
       then begin T_OUT := T_IN;
                  return(TRUE)
            end
      FOUND := FALSE;
      RL := all the roots of the trees in T_IN;
      R := the last element of RL;
      if R is a word of Q 
        then repeat for each possible part of speech P for R do
                      make a parent of M labelled R;
                 until(FOUND := parse1(Q,G,Q1,T_IN,T_OUT))
        else for M in RL going from right to left do
                RHS := the end of RL from M to R;
                repeat for each rule of form "L -> RHS" in G do
                     make a parent of RHS labelled L
                  until(FOUND := parse1(Q,G,Q1,T_IN,T_OUT))             
             endfor
             if (not FOUND) and (Q1 is not all of Q) 
                then  begin add the next word from Q into Q1;
                            FOUND := parse1(Q,G,Q1,T_IN,T_OUT) 
                      end endif endif
      return(FOUND)
end.