G22.2591 - Advanced Natural Language Processing - Spring 2004

Assignment #5:  Word Sense Disambiguation

Due 8 am Monday April 12

For this assignment, you are to write a word sense disambiguation program for a set of six words, each with a two-way ambiguity.  The six words are bass, crane, motion, palm, plant, and tank.  Training corpora for these words are available from Rada Mihalcea's resources page;  these small corpora have 100-200 examples of each word.

We suggest a simple naive Bayes method, as described in last week's lecture (and J&M p. 638-640;  M&S p. 236-239), based on all the words in a window around the word being disambiguated, but you may try something else (decision list, SVM) if you prefer.  If you use a window, be sure to report how big a window you use.
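
As an illustration of what "all the words in a window" might mean, here is a minimal sketch in Python.  The function name context_window, the whitespace tokenization, the lowercasing, and the window size are all assumptions made for this example; the corpus format and the best window size are for you to work out.

```python
def context_window(tokens, target_index, window=3):
    """Return the words within `window` positions of the target word.

    The target word itself is excluded; words are lowercased so that
    'Bass' and 'bass' count as the same feature (an assumption, not a
    requirement of the assignment).
    """
    start = max(0, target_index - window)
    left = tokens[start:target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return [w.lower() for w in left + right]


# Toy sentence (not taken from the training corpora).
tokens = "He caught a huge bass in the lake yesterday".split()
features = context_window(tokens, tokens.index("bass"), window=2)
print(features)  # ['a', 'huge', 'in', 'the']
```

The window is simply truncated at the start or end of the example rather than padded.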

You should evaluate your procedure using 5-fold cross-validation:  train on 4/5 of the data and test on the remaining 1/5, selecting each 1/5 of the data in turn, and then averaging the results.  Report an accuracy for each word, a baseline (always selecting the sense with the highest prior probability, i.e. the most frequent sense), and an overall accuracy (total correct / total examples, not the average of the accuracy for each word).
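
The evaluation described above could be organized roughly as follows.  This is only a sketch:  the function names, the representation of an example as a (features, sense) pair, and the interleaved way of forming the folds are assumptions made for illustration.  If the corpus is sorted by sense, you may also want to shuffle the examples before splitting.

```python
from collections import Counter


def k_fold_splits(examples, k=5):
    """Yield (train, test) pairs; fold i tests on every k-th example from i."""
    for i in range(k):
        test = examples[i::k]
        train = [ex for j, ex in enumerate(examples) if j % k != i]
        yield train, test


def majority_baseline(train):
    """Baseline: always predict the most frequent sense in the training part."""
    most_frequent = Counter(sense for _, sense in train).most_common(1)[0][0]
    return lambda features: most_frequent


def fold_counts(examples, make_classifier, k=5):
    """Return a list of (correct, tested) counts, one pair per fold.

    make_classifier takes the training examples and returns a function
    mapping features to a predicted sense.
    """
    counts = []
    for train, test in k_fold_splits(examples, k):
        classify = make_classifier(train)
        correct = sum(1 for features, sense in test if classify(features) == sense)
        counts.append((correct, len(test)))
    return counts


def word_accuracy(counts):
    """Average of the per-fold accuracies for one word (folds must be non-empty)."""
    return sum(c / n for c, n in counts) / len(counts)


def overall_accuracy(counts_by_word):
    """Total correct / total examples, pooled over all words and folds."""
    correct = sum(c for counts in counts_by_word.values() for c, _ in counts)
    total = sum(n for counts in counts_by_word.values() for _, n in counts)
    return correct / total
```

Passing majority_baseline as make_classifier gives the baseline figure;  passing your own training function gives the accuracy of your method on exactly the same folds.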

For naive Bayes, some smoothing of the counts is essential to get a good result.  J&M have a good discussion of smoothing in section 6.3 (p. 206-216).  I found that even 'add-one' smoothing works quite well, though you may prefer to add something smaller than 1.
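
To make the role of smoothing concrete, here is a minimal sketch of a naive Bayes classifier with add-alpha smoothing, where P(w | sense) is estimated as (count(w, sense) + alpha) / (N_sense + alpha * |V|).  The function name train_naive_bayes, the (context_words, sense) example format, and the decision to ignore test words never seen in training are assumptions of this sketch, not requirements.  Setting alpha=1.0 gives 'add-one' smoothing.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples, alpha=1.0):
    """Train on a list of (context_words, sense) pairs.

    Returns a function classify(context_words) -> most probable sense.
    """
    sense_counts = Counter()            # number of examples per sense
    word_counts = defaultdict(Counter)  # word_counts[sense][word]
    vocab = set()
    for words, sense in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)

    total_examples = sum(sense_counts.values())
    # Smoothed denominator for each sense: N_sense + alpha * |V|
    denom = {s: sum(word_counts[s].values()) + alpha * len(vocab)
             for s in sense_counts}

    def classify(words):
        best_sense, best_score = None, -math.inf
        for sense, n in sense_counts.items():
            # Work in log space to avoid underflow from many small factors.
            score = math.log(n / total_examples)
            for w in words:
                if w in vocab:  # words unseen in training are skipped
                    score += math.log((word_counts[sense][w] + alpha) / denom[sense])
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense

    return classify


# Toy training data (not taken from the assignment corpora).
train = [
    (["fish", "lake"], "fish"),
    (["fish", "river"], "fish"),
    (["guitar", "play"], "music"),
]
classify = train_naive_bayes(train, alpha=1.0)
print(classify(["guitar"]))  # music
print(classify(["lake"]))    # fish
```

Without smoothing (alpha = 0), a context word that never occurred with one sense in training would give that sense a probability of zero, no matter what the other words suggest.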