CSCI-GA.2590 - Natural Language Processing -- Spring 2013 -- Prof. Grishman

Assignment #6

Due:  March 26 [4 points + up to 3 points extra, see below]

For the previous two assignments, you developed hand-coded patterns for noun and verb group chunking.

For the next assignment (due in 2 weeks), you are to build a simple system to learn a model for noun group chunking.

You will use a portion of the training data available from UPenn.
This data is provided in a simple format, one word per line, with the word, its part of speech, and its BIO tag.
Sentence boundaries are indicated by a blank line.
The B tag is used only if a word begins a new noun group, and the previous word ends a noun group.  Any other word at the beginning or inside a noun group is tagged with an I.
The Penn data consists of 200K words of training data; we provide 100K lines of this for training and 10K lines for test.

As your learning tool, we suggest the opennlp MaxEnt package in Java. The package is relatively well documented and makes the train/test cycle quite simple.  With this package, you
  1. write a series of lines (one for each word in the training corpus) of the form
       feature=value feature=value ... tag

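As a concrete sketch of step 1, here is one way the feature line for a single word might be generated. The feature names (pos, word, prevPos) and the particular feature set, including the POS-bigram conjunction, are our own illustrative choices, not a recommendation; choosing good features is the point of the assignment.

```java
public class FeatureWriter {
    // Build one training line for the word at position i.
    // tokens[i] = the word, tags[i] = its part of speech; bioTag is the outcome.
    public static String featureLine(String[] tokens, String[] tags, int i, String bioTag) {
        String prevPOS = (i > 0) ? tags[i - 1] : "BOS";   // "BOS" marks sentence start
        StringBuilder sb = new StringBuilder();
        sb.append("pos=").append(tags[i]);
        sb.append(" word=").append(tokens[i]);
        sb.append(" prevPos=").append(prevPOS);
        // a feature conjunction: previous POS paired with current POS
        sb.append(" prevPos+pos=").append(prevPOS).append('+').append(tags[i]);
        sb.append(' ').append(bioTag);                    // the tag goes last on the line
        return sb.toString();
    }
}
```

Writing one such line per word of the training corpus produces the file consumed in step 2.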
  2. train a MaxEnt model from this training data with code such as
    import opennlp.maxent.*;
    import opennlp.maxent.io.*;
    import java.io.*;
    public class Train {
        public static void main (String[] args) {
            String dataFileName = <the file written in step 1>;
            String modelFileName = <maxent model file>;
            try {
                FileReader datafr = new FileReader(new File(dataFileName));
                EventStream es = new BasicEventStream(new PlainTextByLineDataStream(datafr));
                GISModel model = GIS.trainModel(es, 100, 4);
                File outputFile = new File(modelFileName);
                GISModelWriter writer = new SuffixSensitiveGISModelWriter(model, outputFile);
                writer.persist();
            } catch (Exception e) {
                System.out.print("Unable to create model due to exception: ");
                System.out.println(e);
            }
        }
    }
    Compile this program, putting the maxent and trove jar files (distributed with the Jet package) on the class path.

  3. use this model to tag the test corpus:
    read the model in with
       GISModel m = new SuffixSensitiveGISModelReader(new File(modelFileName)).getModel();
    and then, for each line in the test corpus, build an array of feature/value strings
       ["feature=value", "feature=value" ... ]
    (like you used for training, but with each feature/value a separate array element, and without the final tag) and use
       m.getBestOutcome(m.eval(features))
    to select the best tag according to the MaxEnt model; if you are more ambitious, implement a Viterbi search, retrieving the probabilities of successors with the getOutcomeProbabilities method.
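The Viterbi option mentioned above can be sketched independently of the MaxEnt API. Here the model is hidden behind a hypothetical Scorer interface of our own devising; in a real tagger its score method would evaluate the model on the feature array for the given position, with the previous tag included as a feature:

```java
public class Viterbi {
    // Stand-in for the MaxEnt model: the probability of each candidate tag
    // at a position, given the previous tag.  In a real tagger this would
    // evaluate the model on the features for that position.
    public interface Scorer {
        double[] score(int position, String prevTag);
    }

    static final String[] TAGS = {"B", "I", "O"};

    // Return the highest-probability tag sequence of the given length.
    public static String[] decode(int length, Scorer scorer) {
        double[][] best = new double[length][TAGS.length];  // best log-prob of a path ending in tag t
        int[][] back = new int[length][TAGS.length];        // backpointers to the previous tag
        for (int t = 0; t < TAGS.length; t++)
            best[0][t] = Math.log(scorer.score(0, "BOS")[t]);
        for (int i = 1; i < length; i++) {
            for (int t = 0; t < TAGS.length; t++) {
                best[i][t] = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < TAGS.length; p++) {
                    double s = best[i - 1][p] + Math.log(scorer.score(i, TAGS[p])[t]);
                    if (s > best[i][t]) { best[i][t] = s; back[i][t] = p; }
                }
            }
        }
        // Recover the best path by following backpointers from the best final tag.
        int t = 0;
        for (int u = 1; u < TAGS.length; u++)
            if (best[length - 1][u] > best[length - 1][t]) t = u;
        String[] result = new String[length];
        for (int i = length - 1; i >= 0; i--) { result[i] = TAGS[t]; t = back[i][t]; }
        return result;
    }
}
```

Unlike greedy tagging, this keeps one running hypothesis per tag at each position, so an early low-probability choice can still win if its continuations score well.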

  4. compare these tags to the ones in the test corpus to produce a tag accuracy (to be a bit fancier, check whether the tags predict the correct start and end boundaries of each noun group, producing recall, precision, and F-measure)

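The chunk-level scoring suggested in step 4 might look like the following sketch: recover the (start, end) spans of noun groups from each BIO sequence, then compare the two span sets. All names here are our own illustrations, not part of any package:

```java
import java.util.*;

public class ChunkScorer {
    // Extract noun-group spans "start-end" (end exclusive) from a BIO sequence.
    // Under the tagging convention above, a group starts at a B, or at an I
    // that does not directly continue a preceding group.
    public static Set<String> spans(String[] tags) {
        Set<String> result = new HashSet<>();
        int start = -1;
        for (int i = 0; i <= tags.length; i++) {
            String tag = (i < tags.length) ? tags[i] : "O";  // sentinel closes a final group
            boolean inGroup = tag.equals("B") || tag.equals("I");
            boolean continues = tag.equals("I") && start >= 0;
            if (start >= 0 && !continues) { result.add(start + "-" + i); start = -1; }
            if (inGroup && start < 0) start = i;
        }
        return result;
    }

    // F-measure: harmonic mean of precision and recall over exact span matches.
    public static double fMeasure(String[] gold, String[] predicted) {
        Set<String> g = spans(gold), p = spans(predicted);
        Set<String> both = new HashSet<>(g);
        both.retainAll(p);
        if (g.isEmpty() || p.isEmpty()) return 0.0;
        double recall = (double) both.size() / g.size();
        double precision = (double) both.size() / p.size();
        return (precision + recall == 0) ? 0.0 : 2 * precision * recall / (precision + recall);
    }
}
```

Note that a tagging can have high per-tag accuracy yet a much lower F-measure, since a noun group counts as correct only when both boundaries match.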
The important question is what features to compute on the word sequence.  Should you use just the parts of speech, or also the words themselves?  Should you include the parts of speech of the previous or following words?  Should you use feature conjunctions?  Does it help if the prior state is a feature?

Please submit a very brief report along with a listing of your code.  A good assignment will