G22.2590 - Natural Language Processing -- Spring 2008 -- Prof.
February 14, 2008
1. HMM: (a) [1.5 points] Consider an HMM with two states,
Cow and Duck, and a start and end state. Emission probabilities:
- In state Cow, the HMM can emit 'moo' (with 0.9 probability) or
'hello' (0.1 probability).
- In state Duck, the HMM can emit 'quack' (0.6 probability) or
'hello' (0.4 probability). The Duck has been studying English longer.
(Nothing is emitted in the start or end state.) Transition
probabilities:
- From the start state, the HMM goes to state Cow with 1.0
probability (i.e., always).
- From state Cow, the HMM can remain in state Cow (0.5
probability), go to state Duck (0.3 probability), or go to state end
(0.2 probability).
- From state Duck, the HMM can remain in state Duck (0.5
probability), go to state Cow (0.3 probability), or go to state end
(0.2 probability).
Using the Viterbi algorithm, decode (find the most likely state
sequence for) 'moo hello quack'. What is the probability of
emitting this sentence from this state sequence? Show your work,
so that you can get partial credit even if you make an error.
(b) [1 point] Is there another state sequence which also generates 'moo
hello quack'? What is the total probability
of emitting this sentence?
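As a way to check a hand calculation, the Viterbi recurrence can be sketched in a few lines of Python. This is only an illustrative sketch, not part of Jet: the function name `viterbi` and the dictionary layout are assumptions, and the 0.2 probability of going to state end is assumed to be the remainder after the 0.5 and 0.3 transitions. The returned probability includes the final transition into the end state.

```python
def viterbi(tokens, trans, emit, start="start", end="end"):
    """Return (best_path, probability) for emitting tokens and then reaching end."""
    # best[state] = (probability of the best path ending in state, that path)
    best = {start: (1.0, [])}
    for token in tokens:
        nxt = {}
        for prev, (p_prev, path) in best.items():
            for state, p_arc in trans.get(prev, {}).items():
                # transition probability times emission probability
                p = p_prev * p_arc * emit.get(state, {}).get(token, 0.0)
                if p > nxt.get(state, (0.0, None))[0]:
                    nxt[state] = (p, path + [state])
        best = nxt
    # Finish by taking the arc from each surviving state into the end state.
    final = [(p * trans.get(s, {}).get(end, 0.0), path)
             for s, (p, path) in best.items()]
    prob, path = max(final, key=lambda item: item[0], default=(0.0, []))
    return path, prob


# The Cow/Duck HMM from part (a); the 0.2 values are the assumed remainders.
trans = {
    "start": {"Cow": 1.0},
    "Cow": {"Cow": 0.5, "Duck": 0.3, "end": 0.2},
    "Duck": {"Duck": 0.5, "Cow": 0.3, "end": 0.2},
}
emit = {
    "Cow": {"moo": 0.9, "hello": 0.1},
    "Duck": {"quack": 0.6, "hello": 0.4},
}

path, prob = viterbi("moo hello quack".split(), trans, emit)
print(path, prob)
```

Replacing the `if p > ...` maximum with a sum over incoming paths would give the total probability asked for in part (b) instead of the single best path.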
2. JET HMM Tagger. [1.5 points] Try the Jet HMM tagger.
Submit the output for one correctly tagged sentence and for one
with a single incorrect tag. Explain the error in terms of the
emission and transition probabilities in the HMM (file pos_hmm.txt). This is not a lengthy calculation ... you
need only compute the relative probabilities
of the two tag sequences. We recommend (so that you do not have to deal
with the back-off statistics of the tagger) that you choose an error
for which the word occurred with both the correct and incorrect parts of
speech in the training corpus.
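When two tag sequences differ in a single tag, every factor that does not involve that tag cancels, so the relative probability reduces to three factors per sequence: the transition into the tag, the emission of the word, and the transition out of the tag. The sketch below illustrates this under the assumption that each HMM state corresponds to one tag. The function names are hypothetical, and the numbers are made up for illustration; they are not taken from pos_hmm.txt.

```python
def local_score(prev_tag, tag, next_tag, word, trans, emit):
    """Product of the three factors that involve the tag at one position."""
    return trans[prev_tag][tag] * emit[tag][word] * trans[tag][next_tag]


def relative_probability(prev_tag, next_tag, word, tag_a, tag_b, trans, emit):
    """Ratio P(sequence with tag_a) / P(sequence with tag_b)."""
    score_a = local_score(prev_tag, tag_a, next_tag, word, trans, emit)
    score_b = local_score(prev_tag, tag_b, next_tag, word, trans, emit)
    return score_a / score_b


# Hypothetical probabilities, for illustration only.
trans = {
    "TO": {"VB": 0.8, "NN": 0.2},
    "VB": {".": 0.1},
    "NN": {".": 0.3},
}
emit = {
    "VB": {"race": 0.001},
    "NN": {"race": 0.002},
}

ratio = relative_probability("TO", ".", "race", "VB", "NN", trans, emit)
print(ratio)
```

A ratio below 1 means the tagger prefers the second tag at that position, which is the kind of comparison the explanation of the error needs.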
Due February 21st.
Running the tagger:
Add the properties file tagPOS.jet to your props directory:
# JET properties file for POS tagging
Jet.dataPath = data
Tags.fileName = pos_hmm.txt
processSentence = tagPOS
On the "tagger" menu, turn on the "POS tagger trace".
- The tagger was trained on sentences ending in a period; be
sure to include a period when entering sentences or you may get bizarre tags.
- The training corpus is taken from the Wall Street Journal. The tagger
is therefore likely to do better on words you would expect to find in
news, particularly the business news.
Analyzing the HMM file:
The pos_hmm file consists of a series of lines each beginning with a
keyword:
- STATE state-name
- Defines a new state with name state-name. All
lines until the next STATE line are part of the definition of this
state.
- ARC TO state-name [count]
- Indicates that there is an arc from the current state to the
state named state-name. The count, which will be used to compute
the probability of this transition, indicates how often the transition
to state-name was observed. If absent, a count of 1 is
assumed.
- EMIT token [count]
- Indicates that the current state can emit token token. The count,
which will be used to compute the probability of this emission,
indicates how often the emission of token was observed.
If absent, a count
of 1 is assumed.
- TAG tag
- Indicates that the current state is associated with tag tag.
These tags are used to associate HMM states with annotations, as
An example of a simple file which matches a sequence of "oink"s and
"quack"s:
ARC TO middle
EMIT quack 1
EMIT oink 2
ARC TO middle 2
ARC TO end 1
Note that the pos_hmm file gives counts, not probabilities (these are actual
counts from a million words of text). To compute the emission and
transition probabilities, you also need to know the total count for a state.
This is not included in the pos_hmm file distributed with Jet, but we have
prepared a new pos_hmm file which
provides this additional information (as a count on each STATE line).
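The keyword format and the count-to-probability computation described in this section can be sketched as follows. This is a minimal illustration, not Jet's own reader: the function names are hypothetical, the STATE lines and the state name `start` in the sample are assumed for the sake of a complete file, and falling back to the sum of the listed counts when a STATE line carries no total is an assumption of this sketch.

```python
def parse_hmm(lines):
    """Parse STATE / ARC TO / EMIT / TAG lines into a dict keyed by state name."""
    states = {}
    current = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "STATE":
            # Optional total count on the STATE line (new-style file).
            total = int(fields[2]) if len(fields) > 2 else None
            current = states.setdefault(
                fields[1], {"total": total, "arcs": {}, "emits": {}, "tag": None}
            )
        elif current is None:
            continue  # ignore anything before the first STATE line
        elif fields[:2] == ["ARC", "TO"]:
            current["arcs"][fields[2]] = int(fields[3]) if len(fields) > 3 else 1
        elif fields[0] == "EMIT":
            current["emits"][fields[1]] = int(fields[2]) if len(fields) > 2 else 1
        elif fields[0] == "TAG":
            current["tag"] = fields[1]
    return states


def probabilities(counts, total=None):
    """Turn counts into probabilities by dividing by the state's total count."""
    total = total or sum(counts.values())  # fallback is an assumption
    return {name: count / total for name, count in counts.items()}


# Sample file; the STATE lines and the name "start" are assumed.
SAMPLE = """\
STATE start
ARC TO middle
STATE middle 3
EMIT quack 1
EMIT oink 2
ARC TO middle 2
ARC TO end 1
STATE end
"""

hmm = parse_hmm(SAMPLE.splitlines())
middle = hmm["middle"]
print(probabilities(middle["emits"], middle["total"]))
print(probabilities(middle["arcs"], middle["total"]))
```

The same state total is used as the denominator for both emissions and transitions, on the assumption that each visit to a state emits one token and takes one arc.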