The CYK-PARSE algorithm on p. 894 of Russell and Norvig 3rd edn. is a little too stripped down, in that recovering the tree from the data structure returned is laborious. Here is an amplified version that constructs the optimal tree (plus all possible parses of every substring).

We assume an enumerated class NonTerms with all the non-terminals, both phrase structure symbols and parts of speech.

A node of the parse tree is an object with the following data fields (in Java-esque notation):

class Tree {
   NonTerm phrase;                % The non-terminal
   int startPhrase, endPhrase;    % indices of starting and ending word
   String word;                   % If a leaf, then the word
   Tree left;
   Tree right;
   double prob;
}
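The same node type can be rendered in Python as a dataclass (a sketch; the field names follow the Java-esque class above, with non-terminals represented as plain strings):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Tree:
    phrase: str                     # the non-terminal, e.g. "NP"
    start_phrase: int               # index of the starting word
    end_phrase: int                 # index of the ending word
    word: Optional[str] = None      # if a leaf, then the word
    left: Optional["Tree"] = None   # left child in the best parse so far
    right: Optional["Tree"] = None  # right child in the best parse so far
    prob: float = 0.0               # probability of the best parse so far

# A leaf for the word "people" spanning position 3 only:
leaf = Tree("Noun", 3, 3, word="people", prob=0.5)
```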

The fields left, right, and prob are the left child, right child, and probability of the best parse found so far for this particular non-terminal over this span (e.g. the best parse for an NP from word 3 to word 6). Note that the probability on a node is _not_ the probability of this parse tree given the phrase; it is the probability of the parse tree, including the choice of words, given the non-terminal at the root. However, for any particular non-terminal and choice of sentence those two probabilities are proportional, by Bayes' rule, which we will come to later in the course.
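To see why the two probabilities are proportional: conditioning on the sentence divides every parse's joint probability by the same constant, so the ranking of parses is unchanged. A small numeric sketch (the parse names and probabilities here are invented for illustration):

```python
# Hypothetical joint probabilities P(tree, words | S) for two parses
# of the same sentence.
joint = {"parse_a": 0.0288, "parse_b": 0.0072}

# Dividing both by the same P(words | S) (here, their sum, as if these
# were the only two parses) rescales the numbers but preserves the ranking.
p_words = sum(joint.values())
posterior = {t: p / p_words for t, p in joint.items()}
```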

The main data structure in the procedure is a chart, an array P[M,I,J], where M ranges over the non-terminals and I and J go from 1 to N (the length of the sentence). P[M,I,J] is the node with phrase==M, startPhrase==I, and endPhrase==J. Note that this uses 1-based indexing, like the textbook.
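In a dynamically typed language the chart can also be sketched as a dictionary keyed by (non-terminal, start, end), with absent keys playing the role of empty cells (a hypothetical alternative to the 3-D array, not from the text):

```python
# Chart keyed by (non-terminal, start, end) with 1-based word indices.
# Cells never filled in are simply absent, standing in for null.
P = {}
P[("Noun", 1, 1)] = "leaf node for word 1"     # placeholder for a Tree object
P[("VP", 2, 3)] = "phrase node over words 2-3" # placeholder for a Tree object
```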

function CYK-PARSE(sentence, grammar) returns P, a chart {
 1.  N = length(sentence);
 2.  for (i = 1 to N) {
 3.     word = sentence[i];
 4.     for (each rule "POS --> word [prob]" in the grammar)
 5.        P[POS,i,i] = new Tree(POS,i,i,word,null,null,prob);
 6.  }                               % endfor line 2
 7.  for (length = 2 to N)           % length = length of phrase
 8.     for (i = 1 to N+1-length) {  % i == start of phrase
 9.        j = i+length-1;           % j == end of phrase
10.        for (each NonTerm M) {
11.           P[M,i,j] = new Tree(M,i,j,null,null,null,0.0);
12.           for (k = i to j-1)     % k == end of first subphrase
13.              for (each rule "M --> Y,Z [prob]" in the grammar) {
14.                 if (P[Y,i,k] == null || P[Z,k+1,j] == null) continue;
                                     % a length-1 subspan has entries only
                                     % for its matching parts of speech
15.                 newProb = P[Y,i,k].prob * P[Z,k+1,j].prob * prob;
16.                 if (newProb > P[M,i,j].prob) {
17.                    P[M,i,j].left = P[Y,i,k];
18.                    P[M,i,j].right = P[Z,k+1,j];
19.                    P[M,i,j].prob = newProb;
20.                 }                % endif line 16
21.              }                   % endfor line 13
22.           }                      % endfor line 10
23.        }                        % endfor line 8
24.  return P;
25. }                                % end CYK-PARSE

The value returned in P["S",1,N] is the root of the most probable parse.
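The pseudocode translates fairly directly into Python. In this sketch the lexicon is a list of (POS, word, prob) triples and the binary rules are (M, Y, Z, prob) quadruples; the toy grammar and all the probabilities are invented for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Tree:
    phrase: str
    start_phrase: int
    end_phrase: int
    word: Optional[str] = None
    left: Optional["Tree"] = None
    right: Optional["Tree"] = None
    prob: float = 0.0

def cyk_parse(sentence, lexicon, rules, nonterms):
    """lexicon: (POS, word, prob) triples; rules: (M, Y, Z, prob) quadruples."""
    N = len(sentence)
    P = {}                                   # chart: (M, i, j) -> Tree, 1-based
    for i in range(1, N + 1):                # base row: parts of speech
        for pos, word, prob in lexicon:
            if word == sentence[i - 1]:
                P[(pos, i, i)] = Tree(pos, i, i, word=word, prob=prob)
    for length in range(2, N + 1):           # length of phrase
        for i in range(1, N + 2 - length):   # start of phrase
            j = i + length - 1               # end of phrase
            for m in nonterms:
                P[(m, i, j)] = Tree(m, i, j)
            for k in range(i, j):            # end of first subphrase
                for m, y, z, prob in rules:
                    yt, zt = P.get((y, i, k)), P.get((z, k + 1, j))
                    if yt is None or zt is None:
                        continue             # subspan has no such constituent
                    new_prob = yt.prob * zt.prob * prob
                    node = P[(m, i, j)]
                    if new_prob > node.prob:
                        node.left, node.right, node.prob = yt, zt, new_prob
    return P

# Hypothetical toy grammar for "people can fish":
lexicon = [("Noun", "people", 0.5), ("Modal", "can", 0.4), ("Verb", "fish", 0.6)]
rules = [("S", "Noun", "VP", 0.8), ("VP", "Modal", "Verb", 0.3)]
nonterms = ["S", "VP", "Noun", "Modal", "Verb"]
P = cyk_parse("people can fish".split(), lexicon, rules, nonterms)
best = P[("S", 1, 3)]                        # root of the most probable parse
```

With these numbers the best parse has probability 0.5 * 0.4 * 0.6 * 0.3 * 0.8 = 0.0288, the product of every rule used in the tree.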

The most probable parse tree can be printed out (not very elegantly) using a pre-order traversal.

printTree(P) {                   % P is the chart returned by CYK-PARSE above.
   printTree1(P[S,1,N], 0);      % S is the top-level non-terminal in the grammar.
}                                % N is the length of the sentence.

printTree1(Tree tree, int INDENT) {
   if (tree != null) {
      print out INDENT spaces;
      print out tree.phrase;
      if (tree.word != null) print out tree.word;
      printTree1(tree.left, INDENT+3);
      printTree1(tree.right, INDENT+3);
   }
}

The output looks like this:

S
   Noun people
   VP
      Modal can
      Verb fish

A trace of an example is here.
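The same traversal can be sketched in Python; returning the lines instead of printing them makes the function easier to test (a hypothetical rendering, with non-terminals as strings and span indices omitted for brevity):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Tree:
    phrase: str
    word: Optional[str] = None
    left: Optional["Tree"] = None
    right: Optional["Tree"] = None

def tree_lines(tree, indent=0):
    """Pre-order traversal; returns the printout as a list of strings."""
    if tree is None:
        return []
    label = tree.phrase + (" " + tree.word if tree.word else "")
    return ([" " * indent + label]
            + tree_lines(tree.left, indent + 3)
            + tree_lines(tree.right, indent + 3))

# The parse of "people can fish" from the example output:
t = Tree("S",
         left=Tree("Noun", word="people"),
         right=Tree("VP",
                    left=Tree("Modal", word="can"),
                    right=Tree("Verb", word="fish")))
```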