G22.2591 - Advanced Natural Language Processing - Spring 2011

Assignment #2

Due Sunday March 20 by midnight

For the second assignment, you will be using lexico-syntactic models to acquire hyponym information. The challenge in this task lies less in applying the models than in having a good evaluation, possibly along with some discussion of the evaluation.

To search for the lexico-syntactic patterns, we suggest that you use the n-gram search tool developed by Prof. Sekine. This has the advantage of doing a fast search for matches to a pattern including token wild cards. He has indexed several collections using this tool, including a corpus with 86 years of newswire data ... about 1.8 gigawords of text (an order of magnitude more than what Snow used), and provided basic instructions for using the tool (these instructions also describe how to invoke the tool from Perl, which is not essential but may be convenient for the evaluations).

We want to evaluate the precision of a pattern (what is the probability that a word pair matching the pattern is a hyponym-hypernym pair) and the recall of a pattern (what is the probability that a hyponym-hypernym pair will be found occurring in such a pattern). Recall can be measured relative to a single sentence containing this pair of words or relative to the entire corpus. We will make measurements relative to the corpus, in effect treating the corpus + pattern(s) as a classifier.

The evaluation needs a source of "ground truth" and I suggest that we use WordNet. Following Snow, we want to treat word pairs as Known Hyponym, Known Non-Hyponym, or unknown (one or both words not in WordNet). Measuring recall requires taking a random set of known hyponyms and seeing how many can be found using the patterns. Measuring precision involves taking a set of word pairs matching the pattern and seeing how many are Known Hyponyms and Known Non-Hyponyms (unknowns don't count). You can check with a small sample by hand; if you are adventurous, you can automate the checking of pairs through WordNet.  Interfaces are available for most popular programming languages (C, Java, Perl, Python).

You may try this for a single Hearst pattern, several patterns, or for the union of some Hearst patterns. It is also possible to add some constraint to the pattern if you find that it matches some identifiable "junk".

You may also be able to identify some systematic causes of overgeneration with some of the patterns; if so, give some examples and explanation in your report.