G22.2590 - Natural Language Processing -- Spring 2006 -- Prof. Grishman

Assignment #4

February 7, 2006

1.  HMM:  (a) [1.5 points] Consider an HMM with two states, Cow and Duck, and a start and end state.  Emission probabilities:
(Nothing is emitted in the start or end state.)  Transition probabilities: 
Using the Viterbi algorithm, decode (find the most likely state sequence for) 'moo hello quack'.  What is the probability of emitting this sentence from this state sequence?  Show your work, so that you can get partial credit even if you make an error.

(b) [1 point] Is there another state sequence which also generates 'moo hello quack'?  What is the total probability of emitting this sentence?
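
As a way of checking hand calculations for parts (a) and (b), here is a minimal Python sketch of Viterbi decoding and of the forward (total probability) computation. The numbers in EMIT and TRANS are invented placeholders, not the probabilities given for this problem; the function names viterbi and forward are also just illustrative. Substitute the assignment's own tables before comparing results.

```python
# Placeholder probabilities, invented for illustration only.
# They are NOT the values from the assignment's tables.
EMIT = {
    "Cow":  {"moo": 0.9, "hello": 0.1, "quack": 0.0},
    "Duck": {"moo": 0.0, "hello": 0.2, "quack": 0.8},
}
TRANS = {
    "start": {"Cow": 0.5, "Duck": 0.5},
    "Cow":   {"Cow": 0.5, "Duck": 0.3, "end": 0.2},
    "Duck":  {"Cow": 0.3, "Duck": 0.5, "end": 0.2},
}
STATES = ["Cow", "Duck"]


def viterbi(words, states, trans, emit):
    """Return (best state sequence, its joint probability with the words)."""
    # best[s] = (probability of the best path ending in s, that path)
    best = {
        s: (trans["start"].get(s, 0.0) * emit[s].get(words[0], 0.0), [s])
        for s in states
    }
    for w in words[1:]:
        new = {}
        for s in states:
            # Pick the predecessor that maximizes the path probability.
            prob, path = max(
                ((best[p][0] * trans[p].get(s, 0.0), best[p][1]) for p in states),
                key=lambda x: x[0],
            )
            new[s] = (prob * emit[s].get(w, 0.0), path + [s])
        best = new
    # Include the transition into the end state.
    prob, path = max(
        ((best[s][0] * trans[s].get("end", 0.0), best[s][1]) for s in states),
        key=lambda x: x[0],
    )
    return path, prob


def forward(words, states, trans, emit):
    """Return the total probability of the words, summed over all paths."""
    alpha = {
        s: trans["start"].get(s, 0.0) * emit[s].get(words[0], 0.0) for s in states
    }
    for w in words[1:]:
        alpha = {
            s: sum(alpha[p] * trans[p].get(s, 0.0) for p in states)
            * emit[s].get(w, 0.0)
            for s in states
        }
    return sum(alpha[s] * trans[s].get("end", 0.0) for s in states)


sentence = ["moo", "hello", "quack"]
best_path, best_prob = viterbi(sentence, STATES, TRANS, EMIT)
total_prob = forward(sentence, STATES, TRANS, EMIT)
print(best_path, best_prob, total_prob)
```

The only difference between the two functions is that Viterbi takes the maximum over predecessor states where the forward computation takes the sum.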

2.  JET HMM Tagger.  [1.5 points] Try the Jet HMM tagger.  Submit the output for one correctly tagged sentence and for one sentence with a single incorrect tag.  Explain the error in terms of the emission and transition probabilities in the HMM (file pos_hmm.txt).  This is not a lengthy calculation: you need only compute the relative probabilities of the two tag sequences.  We recommend (so that you do not have to deal with the back-off statistics of the tagger) that you choose an erroneous example for which the word occurred with both the correct and incorrect parts of speech in the training corpus.
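
To illustrate why the calculation is short: if the two tag sequences differ at a single position, every factor they share cancels in the ratio, leaving only the transition into the differing tag, the emission of the word from that tag, and the transition out of it. The Python sketch below assumes one HMM state per tag. The tag names, the word, and all probabilities are hypothetical and are not taken from pos_hmm.txt.

```python
# Hypothetical probabilities for illustration; not taken from pos_hmm.txt.
trans_p = {
    "TO": {"VB": 0.6, "NN": 0.05},
    "VB": {"DT": 0.3},
    "NN": {"DT": 0.02},
}
emit_p = {
    "VB": {"run": 0.002},
    "NN": {"run": 0.001},
}


def local_score(prev_tag, tag, next_tag, word):
    """Product of the three factors that depend on the tag at one position."""
    return trans_p[prev_tag][tag] * emit_p[tag][word] * trans_p[tag][next_tag]


# Compare two taggings of the same word between the same neighboring tags.
score_vb = local_score("TO", "VB", "DT", "run")
score_nn = local_score("TO", "NN", "DT", "run")
ratio = score_vb / score_nn
print(ratio)
```

A ratio greater than 1 means the tagger prefers the first tag at that position; the individual factors show whether the emission or the transitions are responsible.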

Due February 14th.


Running the tagger:

Add the properties file tagPOS.jet to your props directory:
# JET properties file for POS tagging
Jet.dataPath     = data
Tags.fileName    = pos_hmm.txt
processSentence  = tagPOS
On the "tagger" menu, turn on the "POS tagger trace".

Cautions:
Analyzing the HMM file:

The pos_hmm file consists of a series of lines each beginning with a keyword:
STATE state-name
Defines a new state with name state-name.  All following lines until the next STATE line are part of the definition of this state.
ARC TO state-name [count]
Indicates that there is an arc from the current state to the state named state-name.  The count, which will be used to compute the probability of this transition, indicates how often the transition to state-name was observed.  If absent, a count of 1 is assumed.
EMIT token [count]
Indicates that the current state can emit token token.  The count, which will be used to compute the probability of this emission, indicates how often the emission of token was observed.  If absent, a count of 1 is assumed.
TAG tag
Indicates that the current state is associated with tag tag.  These tags are used to associate HMM states with annotations, as explained below.

An example of a simple file which matches a sequence of "oink"s and "quack"s is:

STATE start
ARC TO middle
STATE middle
EMIT quack 1
EMIT oink 2
ARC TO middle 2
ARC TO end 1
STATE end
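
For inspecting a file in this format, the following Python sketch reads the four keywords described above into a dictionary. It is not part of Jet; the function name parse_hmm and the dictionary layout are assumptions made for illustration, and it handles only the keywords and optional counts as described here.

```python
def parse_hmm(text):
    """Parse STATE / ARC TO / EMIT / TAG lines into a dict keyed by state name."""
    states = {}
    current = None
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        keyword = fields[0]
        if keyword == "STATE":
            current = {"arcs": {}, "emits": {}, "tag": None}
            states[fields[1]] = current
        elif keyword == "ARC" and len(fields) > 2 and fields[1] == "TO":
            # A missing count defaults to 1.
            current["arcs"][fields[2]] = int(fields[3]) if len(fields) > 3 else 1
        elif keyword == "EMIT":
            current["emits"][fields[1]] = int(fields[2]) if len(fields) > 2 else 1
        elif keyword == "TAG":
            current["tag"] = fields[1]
    return states


example = """\
STATE start
ARC TO middle
STATE middle
EMIT quack 1
EMIT oink 2
ARC TO middle 2
ARC TO end 1
STATE end
"""

hmm = parse_hmm(example)
print(hmm["middle"])
```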
Note that the pos_hmm file gives counts, not probabilities (these are actual counts from a million words of text).  To compute the emission and transition probabilities, you also need to know the total count for a state.  This is not included in the pos_hmm file distributed with Jet, but we have created a new pos_hmm file which provides this additional information (as a count on each STATE line).
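
As a small worked example of turning counts into probabilities, the Python sketch below uses the counts of the "middle" state from the oink/quack example. It assumes each probability is the count divided by the state's total count. The helper name normalize is hypothetical; when no total is passed in, it falls back to the sum of the listed counts, which is an assumption that holds for the toy example but should be replaced by the count on the STATE line when working with the new pos_hmm file.

```python
from fractions import Fraction


def normalize(counts, total=None):
    """Divide each count by the state total (default: sum of listed counts)."""
    if total is None:
        total = sum(counts.values())
    return {name: Fraction(count, total) for name, count in counts.items()}


# Counts for the "middle" state of the oink/quack example.
middle_emits = {"quack": 1, "oink": 2}
middle_arcs = {"middle": 2, "end": 1}

emit_prob = normalize(middle_emits)
trans_prob = normalize(middle_arcs)
print(emit_prob, trans_prob)
```

With these counts, the "middle" state emits "oink" with probability 2/3 and moves to "end" with probability 1/3.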