G22.2591 - Advanced Natural Language Processing - Spring 2011

Assignment #1

Due Friday February 25

For the first assignment, you will get some experience with a supervised version of the name tagging task for English.  We will use the training and test sets from the CoNLL 2003 evaluation.  This evaluation used the Reuters news corpus, which was prepared in a format with one line per token, which makes it easy to train on and annotate.
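As a rough illustration of the one-line-per-token format, the sketch below groups token lines into sentences. It assumes the usual CoNLL 2003 layout of four whitespace-separated fields per line (word, part-of-speech tag, chunk tag, name tag), blank lines between sentences, and -DOCSTART- lines between documents; check these assumptions against the files you actually receive. The class name ConllReader and the sample lines are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ConllReader {
    /** One token line: word, part-of-speech tag, chunk tag, name tag. */
    static final class Token {
        final String word, pos, chunk, nameTag;
        Token(String word, String pos, String chunk, String nameTag) {
            this.word = word; this.pos = pos; this.chunk = chunk; this.nameTag = nameTag;
        }
    }

    /** Groups token lines into sentences; blank and -DOCSTART- lines end a sentence. */
    static List<List<Token>> parse(List<String> lines) {
        List<List<Token>> sentences = new ArrayList<>();
        List<Token> current = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("-DOCSTART-")) {
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            String[] f = trimmed.split("\\s+");
            if (f.length != 4) {   // assumption: exactly four columns per token
                throw new IllegalArgumentException("expected 4 fields: " + line);
            }
            current.add(new Token(f[0], f[1], f[2], f[3]));
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Illustrative lines in the assumed format, not read from the real corpus.
        List<String> sample = Arrays.asList(
            "U.N. NNP I-NP I-ORG",
            "official NN I-NP O",
            "Ekeus NNP I-NP I-PER",
            "heads VBZ I-VP O",
            "for IN I-PP O",
            "Baghdad NNP I-NP I-LOC",
            ". . O O");
        for (List<Token> sentence : parse(sample)) {
            for (Token t : sentence) {
                System.out.println(t.word + "\t" + t.nameTag);
            }
        }
    }
}
```

Keeping the sentence grouping makes it straightforward to compute features that look at the previous or next token without crossing a sentence boundary.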

As one simple learning tool, we suggest the OpenNLP MaxEnt package in Java, which we used in the basic NLP course for chunking. The package is relatively well documented and makes the train/test cycle quite simple.  With this package, you
  1. write a series of lines (one for each word in the training corpus) of the form
       feature=value feature=value ... tag
  2. train a MaxEnt model from this training data with code such as
    // needs java.io.File and java.io.FileReader, plus the opennlp.maxent classes used below
    public static void main (String[] args) {
        String dataFileName = <the file written in step 1>;
        String modelFileName = <maxent model file>;
        try {
            // read the feature lines written in step 1 and train the model
            FileReader datafr = new FileReader(new File(dataFileName));
            EventStream es = new BasicEventStream(new PlainTextByLineDataStream(datafr));
            GISModel model = GIS.trainModel(es);

            // write the trained model to disk
            File outputFile = new File(modelFileName);
            GISModelWriter writer = new SuffixSensitiveGISModelWriter(model, outputFile);
            writer.persist();
        } catch (Exception e) {
            System.out.print("Unable to create model due to exception: ");
            System.out.println(e);
        }
    }
  3. use this model to tag the test corpus:
    read the model in with
       m = new SuffixSensitiveGISModelReader(new File(modelFileName)).getModel();
    and then, for each line in the test corpus, build an array of strings
       feature=value feature=value ...
    (like you used for training, but without the final tag) and use
       m.getBestOutcome(m.eval(features))
    (where features is that array of strings) to select the best tag according to the MaxEnt model;  compare this tag to the one in the test corpus (to be a bit fancier, you should check whether the tags predict the correct start and end boundaries for each name).
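The fancier check in step 3 can be sketched as follows: turn each tag sequence into a set of name spans and count a predicted name as correct only if its start, end, and type all match a name in the test corpus. This is a minimal sketch under stated assumptions, not a required design. It assumes tags of the form O, I-TYPE, or B-TYPE, and it summarizes the comparison as precision, recall, and F1, which is one common choice rather than something the assignment prescribes. The class name NameSpanScorer and its methods are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class NameSpanScorer {
    /** Returns each name in the sequence as "start-end:TYPE" (end exclusive). */
    static Set<String> spans(String[] tags) {
        Set<String> result = new LinkedHashSet<>();
        int start = -1;
        String type = null;                      // type of the currently open name, if any
        for (int i = 0; i <= tags.length; i++) {
            String tag = i < tags.length ? tags[i] : "O";   // sentinel closes a final name
            boolean outside = tag.equals("O");
            String tagType = outside ? null : tag.substring(2);  // assumes "I-"/"B-" prefix
            // A name begins at B-TYPE, or at I-TYPE when no name of that type is open.
            boolean begins = !outside && (tag.startsWith("B-") || !tagType.equals(type));
            if (type != null && (outside || begins)) {
                result.add(start + "-" + i + ":" + type);
                type = null;
            }
            if (begins) {
                start = i;
                type = tagType;
            }
        }
        return result;
    }

    /** Returns {precision, recall, F1} over exactly matching name spans. */
    static double[] score(String[] gold, String[] predicted) {
        Set<String> g = spans(gold);
        Set<String> p = spans(predicted);
        int correct = 0;
        for (String s : p) {
            if (g.contains(s)) {
                correct++;
            }
        }
        double precision = p.isEmpty() ? 0.0 : (double) correct / p.size();
        double recall = g.isEmpty() ? 0.0 : (double) correct / g.size();
        double f1 = precision + recall == 0.0
            ? 0.0 : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        // Made-up tags for illustration: the first predicted name ends one token early.
        String[] gold      = {"I-PER", "I-PER", "O", "I-LOC"};
        String[] predicted = {"I-PER", "O",     "O", "I-LOC"};
        double[] prf = score(gold, predicted);
        System.out.println("precision=" + prf[0] + " recall=" + prf[1] + " F1=" + prf[2]);
    }
}
```

For a whole test corpus, you would accumulate the correct, predicted, and gold counts over all sentences before dividing, rather than averaging per-sentence scores.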
A minimal assignment will
The more ambitious will
We will announce in class the location of the CoNLL training and test data.