G22.2591 - Advanced Natural Language Processing - Spring 2004
Assignment #5: Word Sense
Due 8 am Monday April 12
For this assignment, you are to write a program for word sense
disambiguation for a set of 6 words with a two-way ambiguity. The
six words are bass, crane, motion, palm, plant, and tank. Training corpora for
these words are available from Rada Mihalcea's resources page;
these small corpora have 100-200 examples of each word.
We suggest a simple naive Bayes method, as described in last week's
lecture (and J&M p. 638-640; M&S p. 236-239), based on
all the words in a window around the word being disambiguated, but you
may try something else (decision list, SVM) if you prefer. If you
use a window, be sure to report how big a window you use.
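As an illustration of the window idea, here is a minimal sketch of
collecting the context words around the target. The function name
context_window, the whitespace tokenization, and the window size of 3 are
assumptions made for this example, not part of the assignment.

```python
def context_window(tokens, target_index, window_size):
    """Return up to `window_size` words on each side of the target word.

    The target word itself is excluded from the result.
    """
    start = max(0, target_index - window_size)
    left = tokens[start:target_index]
    right = tokens[target_index + 1:target_index + 1 + window_size]
    return left + right


# Toy sentence (invented for illustration), split on whitespace.
tokens = "he caught a large bass in the lake".split()
features = context_window(tokens, tokens.index("bass"), window_size=3)
print(features)
```

The window here is symmetric and is cut off at the sentence boundaries; a
window measured differently (for example, the whole sentence) would be an
equally reasonable choice, as long as it is reported.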
You should evaluate your procedure using 5-fold cross-validation:
train on 4/5 of the data and test on 1/5, selecting each 1/5 of the
data in turn, and then averaging the results. Report an accuracy
for each word, a baseline (the accuracy obtained by always selecting the
most frequent sense), and an overall accuracy (total correct / total
examples, not the average of the accuracy for each word).
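The evaluation procedure might be sketched as follows. The names
cross_validate, overall_accuracy, and baseline_trainer, and the
representation of an example as a (context_words, sense) pair, are
assumptions for this sketch. A "trainer" is any function that takes
training examples and returns a predict function. Folds are formed by
taking every 5th example, which assumes each word has at least 5 examples.

```python
from collections import Counter


def baseline_trainer(train):
    """Baseline: always predict the most frequent sense in the training data."""
    majority = Counter(sense for _, sense in train).most_common(1)[0][0]
    return lambda context_words: majority


def cross_validate(examples, trainer, k=5):
    """Run k-fold cross-validation for one word.

    Returns (mean of the fold accuracies, total correct, total examples).
    """
    fold_accuracies = []
    correct_total = 0
    for i in range(k):
        # Fold i tests on every k-th example starting at i, trains on the rest.
        test = examples[i::k]
        train = [ex for j, ex in enumerate(examples) if j % k != i]
        predict = trainer(train)
        correct = sum(predict(words) == sense for words, sense in test)
        fold_accuracies.append(correct / len(test))
        correct_total += correct
    return sum(fold_accuracies) / k, correct_total, len(examples)


def overall_accuracy(data_by_word, trainer, k=5):
    """Total correct / total examples, pooled over all words."""
    correct = total = 0
    for examples in data_by_word.values():
        _, c, n = cross_validate(examples, trainer, k)
        correct += c
        total += n
    return correct / total


# Toy data, invented for illustration (empty contexts, sense labels only).
data_by_word = {
    "bass": [([], "fish")] * 8 + [([], "music")] * 2,
    "tank": [([], "vehicle")] * 5,
}
for word, examples in data_by_word.items():
    accuracy, _, _ = cross_validate(examples, baseline_trainer)
    print(word, accuracy)
print("overall", overall_accuracy(data_by_word, baseline_trainer))
```

Pooling the counts in overall_accuracy weights each word by its number of
examples, which is what distinguishes it from averaging the per-word
accuracies.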
For naive Bayes, some smoothing of the counts is essential to get a
good result. J&M have a good discussion of smoothing in
section 6.3 (p. 206-216). I found that even 'add-one' smoothing
works quite well, though you may prefer to add something smaller than 1.
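To make the smoothing concrete, here is a minimal sketch of a naive Bayes
sense classifier with add-k smoothing (k = 1 gives add-one). The function
name train_naive_bayes, the (context_words, sense) data layout, and the
choice to ignore test words never seen in training are assumptions of this
sketch, not a prescribed implementation. Scores are summed in log space
to avoid underflow, and k must be greater than 0.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(train, k=1.0):
    """Train on a list of (context_words, sense); return a predict function.

    P(word | sense) is estimated as (count + k) / (sense_total + k * V),
    where V is the number of distinct words seen in training.
    """
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, sense in train:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    total_examples = sum(sense_counts.values())
    vocab_size = len(vocab)

    def predict(words):
        best_sense, best_score = None, -math.inf
        for sense, n in sense_counts.items():
            score = math.log(n / total_examples)  # log prior of the sense
            denom = sum(word_counts[sense].values()) + k * vocab_size
            for w in words:
                if w in vocab:  # assumption: skip words unseen in training
                    score += math.log((word_counts[sense][w] + k) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense

    return predict


# Toy training data, invented for illustration.
train = [
    (["fish", "lake"], "fish"),
    (["river", "fish"], "fish"),
    (["guitar", "play"], "music"),
]
predict = train_naive_bayes(train, k=1.0)
print(predict(["play", "guitar"]))
print(predict(["lake"]))
```

Without the + k term, any context word that never occurred with a sense in
training would drive that sense's probability to zero, which is why some
smoothing is needed with corpora this small.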