CYK-PARSE algorithm with tree recovery

The CYK-PARSE algorithm on p. 894 of Russell and Norvig 3rd edn. is a little too stripped down, in that recovering the tree from the data structure returned is laborious. Here is an amplified version that constructs the optimal tree (plus all possible parses of every substring).

We assume an enumerated type NonTerm whose values are all the non-terminals, both phrase-structure symbols and parts of speech.

A node of the parse tree is an object with the following data fields (in Java-esque notation):

class Tree {
    NonTerm phrase;              % the non-terminal
    int startPhrase, endPhrase;  % indices of starting and ending word
    String word;                 % if a leaf, then the word
    Tree left;                   % left-hand child in the best parse found so far
    Tree right;                  % right-hand child in the best parse found so far
    double prob;                 % probability of the best parse found so far
}

The fields left, right, and prob are the left-hand child, right-hand child, and probability of the best parse found so far for this particular non-terminal from this start to this end (e.g. the best parse for an NP from word 3 to word 6). Note that the probability on a node is _not_ the probability of this parse tree given the phrase; it is the probability of the parse tree, including the choice of words, given the non-terminal at the root. However, for any particular non-terminal and choice of sentence, those two probabilities are proportional, by Bayes' rule, which we will come to later in the course.
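To make the prob field concrete, here is a tiny worked example. The rule probabilities are invented for illustration; the point is just that a node's prob is the product of its children's probs and the probability of the rule that joins them.

```java
// Hypothetical rule probabilities (invented for illustration):
//   VP -> Modal Verb [1.0],  Modal -> can [0.9],  Verb -> fish [0.4]
public class NodeProbDemo {
    public static void main(String[] args) {
        double modalProb = 0.9;  // prob of the Modal leaf for "can"
        double verbProb  = 0.4;  // prob of the Verb leaf for "fish"
        double ruleProb  = 1.0;  // prob of the rule VP -> Modal Verb
        // prob field of the VP node spanning "can fish":
        double vpProb = modalProb * verbProb * ruleProb;  // approximately 0.36
        System.out.println("P(VP spanning \"can fish\") = " + vpProb);
    }
}
```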

The main data structure in the procedure is a chart, an array P[M,I,J], where M ranges over the non-terminals and I and J run from 1 to N (the length of the sentence). P[M,I,J] is the node with phrase==M, startPhrase==I, and endPhrase==J. Note that this uses 1-based indexing, like the textbook. We assume that any cell that is never filled in (e.g. a part of speech with no lexical rule for the word at that position) behaves as a node with prob == 0, so the products at line 14 below are well defined.

function CYK-PARSE(sentence, grammar) returns P, a chart {

1. N = length(sentence);
2. for (i = 1 to N) {
3.   word = sentence[i];
4.    for (each rule  "POS --> Word [prob]" in the grammar) 
5.       P[POS,i,i] = new Tree(POS,i,i,word,null,null,prob);
6.    }                           % endfor line 2.

7. for (length = 2 to N)          % length = length of phrase
8.   for (i = 1 to N+1-length) {  % i == start of phrase
9.     j = i+length-1;            % j == end of phrase
10.    for (each NonTerm M)  {
11.        P[M,i,j] = new Tree(M,i,j,null,null,null,0.0);
12.        for (k = i to j-1)    % k = end of first subphrase
13.            for (each rule "M -> Y,Z [prob]" in the grammar) {
14.                newProb = P[Y,i,k].prob * P[Z,k+1,j].prob * prob;
15.                if (newProb > P[M,i,j].prob) {
16.                   P[M,i,j].left = P[Y,i,k];
17.                   P[M,i,j].right = P[Z,k+1,j];
18.                   P[M,i,j].prob = newProb;
19.                }  % endif line 15 
20.            }      % endfor line 13
21.      }            % endfor line 10
22.    }              % endfor line 8

23. return P;
24. }  % end CYK-PARSE.
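For concreteness, here is a runnable Java sketch of the procedure above. The grammar (rules and probabilities) is a hypothetical toy invented for illustration, and one detail differs from the pseudocode: instead of pre-creating a zero-probability node for every cell (line 11), empty cells are left null and treated as probability 0.

```java
import java.util.*;

public class CykParse {
    enum NonTerm { S, VP, Noun, Modal, Verb }   // toy non-terminal set (assumed)

    static class Tree {
        NonTerm phrase;              // the non-terminal
        int startPhrase, endPhrase;  // indices of starting and ending word
        String word;                 // if a leaf, then the word
        Tree left, right;            // children in the best parse found so far
        double prob;                 // probability of the best parse found so far
        Tree(NonTerm m, int i, int j, String w, Tree l, Tree r, double p) {
            phrase = m; startPhrase = i; endPhrase = j;
            word = w; left = l; right = r; prob = p;
        }
    }

    static class LexRule {           // lexical rule: POS -> word [prob]
        NonTerm pos; String word; double prob;
        LexRule(NonTerm aPos, String aWord, double aProb) {
            pos = aPos; word = aWord; prob = aProb;
        }
    }

    static class BinRule {           // binary rule: M -> Y Z [prob]
        NonTerm m, y, z; double prob;
        BinRule(NonTerm aM, NonTerm aY, NonTerm aZ, double aProb) {
            m = aM; y = aY; z = aZ; prob = aProb;
        }
    }

    // Chart indexed as chart[M.ordinal()][i][j], with 1-based word positions.
    static Tree[][][] cykParse(String[] sentence,
                               List<LexRule> lexical, List<BinRule> binary) {
        int n = sentence.length;
        Tree[][][] p = new Tree[NonTerm.values().length][n + 1][n + 1];
        for (int i = 1; i <= n; i++)                    // lines 2-6: fill leaves
            for (LexRule r : lexical)
                if (r.word.equals(sentence[i - 1]))
                    p[r.pos.ordinal()][i][i] =
                        new Tree(r.pos, i, i, sentence[i - 1], null, null, r.prob);
        for (int len = 2; len <= n; len++)              // lines 7-22: fill spans
            for (int i = 1; i <= n + 1 - len; i++) {
                int j = i + len - 1;
                for (int k = i; k < j; k++)             // k = end of first subphrase
                    for (BinRule r : binary) {
                        Tree left = p[r.y.ordinal()][i][k];
                        Tree right = p[r.z.ordinal()][k + 1][j];
                        if (left == null || right == null) continue;  // prob 0
                        double newProb = left.prob * right.prob * r.prob;
                        Tree cur = p[r.m.ordinal()][i][j];
                        if (cur == null || newProb > cur.prob)
                            p[r.m.ordinal()][i][j] =
                                new Tree(r.m, i, j, null, left, right, newProb);
                    }
            }
        return p;
    }

    public static void main(String[] args) {
        // Toy grammar for "people can fish" (probabilities invented):
        List<LexRule> lexical = Arrays.asList(
            new LexRule(NonTerm.Noun, "people", 0.5),
            new LexRule(NonTerm.Modal, "can", 0.9),
            new LexRule(NonTerm.Verb, "fish", 0.4),
            new LexRule(NonTerm.Noun, "fish", 0.3));
        List<BinRule> binary = Arrays.asList(
            new BinRule(NonTerm.S, NonTerm.Noun, NonTerm.VP, 1.0),
            new BinRule(NonTerm.VP, NonTerm.Modal, NonTerm.Verb, 1.0));
        String[] sentence = { "people", "can", "fish" };
        Tree[][][] chart = cykParse(sentence, lexical, binary);
        Tree root = chart[NonTerm.S.ordinal()][1][sentence.length];
        // best S parse: 0.5 (Noun people) * 0.9 (Modal can) * 0.4 (Verb fish)
        System.out.println("P(best S parse) = " + root.prob);
    }
}
```

Leaving empty cells null saves allocating a node for every (M, i, j) triple, at the cost of a null check before each product; either convention gives the same chart contents for cells with a parse.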
The value returned in P["S",1,N] is the root of the most probable parse.

The most probable parse tree can be printed out (not very elegantly) using a pre-order traversal.

printTree(P) {  % P is the chart returned by CYK-PARSE above.
   printTree1(P[S,1,N],0)  % S is the top-level nonterminal in the grammar.
   }                       % N is the length of the sentence.



printTree1(Tree tree, int INDENT) {
   if (tree != null) {
      print out INDENT spaces;
      print out tree.phrase;
      if (tree.word != null) print out tree.word;
      printTree1(tree.left, INDENT+3);
      printTree1(tree.right, INDENT+3);
   }
}
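A runnable Java version of the printer, using a small stand-alone Tree class and a hand-built tree in place of a chart entry (the tree and its labels are assumed for illustration, matching the "people can fish" output below):

```java
public class PrintTreeDemo {
    static class Tree {
        String phrase, word;   // non-terminal label; word if a leaf, else null
        Tree left, right;
        Tree(String aPhrase, String aWord, Tree l, Tree r) {
            phrase = aPhrase; word = aWord; left = l; right = r;
        }
    }

    // Pre-order traversal, indenting three spaces per level.
    static void printTree1(Tree tree, int indent) {
        if (tree == null) return;
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < indent; i++) line.append(' ');
        line.append(tree.phrase);
        if (tree.word != null) line.append("  ").append(tree.word);
        System.out.println(line);
        printTree1(tree.left, indent + 3);
        printTree1(tree.right, indent + 3);
    }

    public static void main(String[] args) {
        // Hand-built best parse of "people can fish" (structure assumed):
        Tree vp = new Tree("VP", null,
                new Tree("Modal", "can", null, null),
                new Tree("Verb", "fish", null, null));
        Tree s = new Tree("S", null,
                new Tree("Noun", "people", null, null), vp);
        printTree1(s, 0);
    }
}
```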
The output looks like this:
S
   Noun  people
   VP
      Modal can
      Verb  fish      