G22.2591 - Advanced Natural Language Processing - Spring 2004

Assignment #3

Due 8 am Monday March 1

For the next assignment (due in 3 weeks), you are to build a simple system to learn the rules for NP chunking.  This is a relatively simple task in that we are basically assigning (B I O) tags to words.  The training and test data are available from UPenn in a simple form, one word per line, with the word, its part of speech (as assigned by the Brill tagger), and its chunk tag, so almost no work has to be done massaging the data.  The standard training set is 200K words, large enough for good performance but small enough to run most learning algorithms on a PC.
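For illustration, a fragment of the training data might look like this (a hypothetical sentence; the exact tag inventory follows the distributed data):
       He PRP B
       reckons VBZ O
       the DT B
       current JJ I
       account NN I
       deficit NN I
       will MD O
       narrow VB O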

As one simple learning tool, we suggest the opennlp MaxEnt package in Java.  The package is relatively well documented and makes the train/test cycle quite simple.  With this package, you
  1. write a series of lines (one for each word in the training corpus) of the form
       feature=value feature=value ... tag
     (short sketches of steps 1 and 3 appear after this list)
  2. train a MaxEnt model from this training data with code such as
    import java.io.File;
    import java.io.FileReader;

    import opennlp.maxent.BasicEventStream;
    import opennlp.maxent.EventStream;
    import opennlp.maxent.GIS;
    import opennlp.maxent.GISModel;
    import opennlp.maxent.PlainTextByLineDataStream;
    import opennlp.maxent.io.GISModelWriter;
    import opennlp.maxent.io.SuffixSensitiveGISModelWriter;

    public class TrainModel {
        static final boolean USE_SMOOTHING = true;        // smoothing on/off (your choice)
        static final double SMOOTHING_OBSERVATION = 0.1;  // count given to unseen events

        public static void main (String[] args) {
            String dataFileName = <the file written in step 1>;
            String modelFileName = <maxent model file>;
            try {
                FileReader datafr = new FileReader(new File(dataFileName));
                EventStream es = new BasicEventStream(new PlainTextByLineDataStream(datafr));
                GIS.SMOOTHING = USE_SMOOTHING;
                GIS.SMOOTHING_OBSERVATION = SMOOTHING_OBSERVATION;
                GISModel model = GIS.trainModel(es, 100, 4);  // 100 iterations, feature cut-off 4

                File outputFile = new File(modelFileName);
                GISModelWriter writer = new SuffixSensitiveGISModelWriter(model, outputFile);
                writer.persist();
            } catch (Exception e) {
                System.out.print("Unable to create model due to exception: ");
                System.out.println(e);
            }
        }
    }
  3. use this model to tag the test corpus:
    read the model in with
       GISModel m = new SuffixSensitiveGISModelReader(new File(modelFileName)).getModel();
    and then, for each word in the test corpus, build an array of strings
       feature=value feature=value ...
    (the same features you used for training, but without the final tag) and use
       m.getBestOutcome(m.eval(features))
    to select the best tag according to the MaxEnt model (eval takes the features as a String array).  Compare this tag to the one in the test corpus; to be a bit fancier, you should check whether the predicted tags yield the correct start and end boundaries of each noun group.
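To make step 1 concrete, here is a minimal sketch of a feature writer, assuming "word POS tag" training lines and a deliberately small, illustrative feature set (the current word and the POS of the current, previous, and next words); choosing good features is the real work of the assignment.  The class and feature names here are ours, not part of the package.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.ArrayList;

    public class FeatureWriter {
        public static void main (String[] args) throws IOException {
            BufferedReader in = new BufferedReader(new FileReader(args[0]));  // training corpus
            PrintWriter out = new PrintWriter(new FileWriter(args[1]));       // events file for step 2
            ArrayList tokens = new ArrayList();
            String line;
            while ((line = in.readLine()) != null)
                if (line.trim().length() > 0)                 // skip blank sentence separators
                    tokens.add(line.trim().split("\\s+"));    // [word, pos, tag]
            for (int i = 0; i < tokens.size(); i++) {
                String[] t = (String[]) tokens.get(i);
                // a fuller version would reset the context window at sentence boundaries
                String prevPos = (i > 0) ? ((String[]) tokens.get(i - 1))[1] : "BOS";
                String nextPos = (i < tokens.size() - 1) ? ((String[]) tokens.get(i + 1))[1] : "EOS";
                // one event per word: feature=value ... tag
                out.println("w=" + t[0] + " pos=" + t[1] + " pos-1=" + prevPos
                            + " pos+1=" + nextPos + " " + t[2]);
            }
            in.close();
            out.close();
        }
    }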
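A matching sketch of step 3, under the same illustrative features, tags greedily (the best tag at each word) and reports per-word accuracy as the simplest score, before any boundary-level evaluation:
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;

    import opennlp.maxent.GISModel;
    import opennlp.maxent.io.SuffixSensitiveGISModelReader;

    public class ChunkTagger {
        public static void main (String[] args) throws IOException {
            GISModel m = new SuffixSensitiveGISModelReader(new File(args[0])).getModel();
            BufferedReader in = new BufferedReader(new FileReader(args[1]));  // test corpus
            ArrayList tokens = new ArrayList();
            String line;
            while ((line = in.readLine()) != null)
                if (line.trim().length() > 0)
                    tokens.add(line.trim().split("\\s+"));    // [word, pos, tag]
            int correct = 0;
            for (int i = 0; i < tokens.size(); i++) {
                String[] t = (String[]) tokens.get(i);
                String prevPos = (i > 0) ? ((String[]) tokens.get(i - 1))[1] : "BOS";
                String nextPos = (i < tokens.size() - 1) ? ((String[]) tokens.get(i + 1))[1] : "EOS";
                String[] features = { "w=" + t[0], "pos=" + t[1],
                                      "pos-1=" + prevPos, "pos+1=" + nextPos };
                String tag = m.getBestOutcome(m.eval(features));  // greedy: best tag at each word
                if (tag.equals(t[2])) correct++;
            }
            System.out.println("per-word tag accuracy: " + ((double) correct / tokens.size()));
            in.close();
        }
    }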
The important question is what features to compute on the word sequence.  Should you use just parts of speech, or words as well?  How far should you look ahead?  Should you use feature conjunctions?  What cut-off should be placed on features?  (This has a big effect on training speed and space.)  You don't have to do a beam search as Ratnaparkhi did; it is sufficient to take the best tag at each word (i.e., extend only a single hypothesis).
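For example, a training event that adds one conjunction feature to the feature set sketched above might look like this (feature names and values are illustrative, not required):
       w=pound pos=NN pos-1=DT pos-1&pos=DT&NN I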

A good assignment will