G22.2590 - Natural Language Processing -- Spring 2005 -- Prof.
February 14, 2005
1. HMM: (a) [1.5 points] Consider an HMM with two
states, Cow and Duck,
and a start and end state. (Nothing is emitted in the start or end state.)
Emission probabilities:
- In state Cow, the HMM can emit 'moo' (with 0.9 probability) or
'hello' (0.1 probability).
- In state Duck, the HMM can emit 'quack' (0.6 probability) or
'hello' (0.4 probability). The Duck has been studying English longer.
Transition probabilities:
- From the start state, the HMM goes to state Cow with 1.0
probability.
- From state Cow, the HMM can remain in state Cow (0.5
probability), go to state Duck (0.3 probability), or go to state end (0.2
probability).
- From state Duck, the HMM can remain in state Duck (0.5
probability), go to state Cow (0.3 probability), or go to state end (0.2 probability).
Using the Viterbi algorithm, decode (find the most likely state
sequence for) 'moo hello quack'. What is the probability of emitting this
sentence from this state sequence? Show your work, so that you can get
credit even if you make an error.
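One way to check a hand computation for part (a) is a short Viterbi sketch in Python. The function and variable names below are illustrative, not part of Jet. The model assumes that Cow emits 'hello' with the remaining 0.1 probability and that a complete path must finish with a transition into the end state.

```python
# Toy HMM from problem 1(a). Assumption: Cow emits 'hello' with 0.1,
# the probability left over after 'moo' (0.9).
TRANS = {
    "start": {"Cow": 1.0},
    "Cow": {"Cow": 0.5, "Duck": 0.3, "end": 0.2},
    "Duck": {"Duck": 0.5, "Cow": 0.3, "end": 0.2},
}
EMIT = {
    "Cow": {"moo": 0.9, "hello": 0.1},
    "Duck": {"quack": 0.6, "hello": 0.4},
}
STATES = ["Cow", "Duck"]


def viterbi(words):
    """Return (best_path, probability), including the final arc to 'end'."""
    # best[s] = (probability of the best path ending in s, that path)
    best = {
        s: (TRANS["start"].get(s, 0.0) * EMIT[s].get(words[0], 0.0), [s])
        for s in STATES
    }
    for word in words[1:]:
        new_best = {}
        for s in STATES:
            # Pick the predecessor that maximizes the path probability.
            prob, path = max(
                (best[p][0] * TRANS[p].get(s, 0.0), best[p][1]) for p in STATES
            )
            new_best[s] = (prob * EMIT[s].get(word, 0.0), path + [s])
        best = new_best
    # Close the path with the transition into the end state.
    prob, path = max(
        (best[s][0] * TRANS[s].get("end", 0.0), best[s][1]) for s in STATES
    )
    return path, prob


if __name__ == "__main__":
    path, prob = viterbi("moo hello quack".split())
    print(path, prob)
```

Writing out the same table of per-state maxima by hand, one column per word, is the work the problem asks you to show.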
(b) [1 point] Is there another state sequence which also generates 'moo
hello quack'? What is the total
probability of emitting this sentence?
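The total probability in part (b) sums over every state sequence rather than taking the maximum. A minimal sketch of the forward algorithm follows, under the same assumptions as the model description: Cow emits 'hello' with the remaining 0.1 probability, and each path ends with a transition into the end state. All names are illustrative.

```python
# Toy HMM from problem 1. Assumption: Cow emits 'hello' with 0.1,
# the probability left over after 'moo' (0.9).
TRANS = {
    "start": {"Cow": 1.0},
    "Cow": {"Cow": 0.5, "Duck": 0.3, "end": 0.2},
    "Duck": {"Duck": 0.5, "Cow": 0.3, "end": 0.2},
}
EMIT = {
    "Cow": {"moo": 0.9, "hello": 0.1},
    "Duck": {"quack": 0.6, "hello": 0.4},
}
STATES = ["Cow", "Duck"]


def total_probability(words):
    """Forward algorithm: sum the probability of all paths emitting words."""
    # alpha[s] = probability of emitting the words so far and being in s
    alpha = {
        s: TRANS["start"].get(s, 0.0) * EMIT[s].get(words[0], 0.0)
        for s in STATES
    }
    for word in words[1:]:
        alpha = {
            s: sum(alpha[p] * TRANS[p].get(s, 0.0) for p in STATES)
            * EMIT[s].get(word, 0.0)
            for s in STATES
        }
    # Every complete path must end with a transition into the end state.
    return sum(alpha[s] * TRANS[s].get("end", 0.0) for s in STATES)


if __name__ == "__main__":
    print(total_probability("moo hello quack".split()))
```

The only difference from Viterbi is that `max` over predecessors is replaced by `sum`.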
2. JET HMM Tagger. [1.5 points] Try the Jet HMM tagger, and submit
the output for one correctly tagged sentence and for one sentence with
a single incorrect tag. Explain the error in terms of the
emission and transition probabilities in the HMM (file pos_hmm.txt). This is not a lengthy calculation ... you
need only compute the relative probabilities
of the two tag sequences.
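As a sketch of why the calculation is short: assuming the tagger is a first-order HMM with one tag per state, two tag sequences T and T' that differ only in the tag at position i share every factor except the three that touch that position. The symbols here are notation introduced for illustration (w_i is the i-th word, t_i the correct tag, t'_i the incorrect tag):

```latex
\frac{P(T, W)}{P(T', W)}
  = \frac{P(t_i \mid t_{i-1}) \; P(w_i \mid t_i) \; P(t_{i+1} \mid t_i)}
         {P(t'_i \mid t_{i-1}) \; P(w_i \mid t'_i) \; P(t_{i+1} \mid t'_i)}
```

Each factor is a count from pos_hmm.txt divided by the total count of the state it leaves or is emitted from. If the differing tag is on the first or last word, the neighboring tag is the start or end state.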
Due March 7th.
Running the tagger:
Add the properties file tagPOS.jet to your props directory:
    # JET properties file for POS tagging
    Jet.dataPath = data
    Tags.fileName = pos_hmm.txt
    processSentence = tagPOS
On the "tagger" menu, turn on the "POS tagger trace".
- The tagger was trained on sentences ending in a period; be
sure to include a period when entering sentences or you may get bizarre
results.
- The training corpus is taken from the Wall Street Journal.
It is therefore likely to do better on words you would expect to find in
the news, particularly the business news.
Analyzing the HMM file:
The pos_hmm file consists of a series of lines each beginning with a
keyword:
- STATE state-name
- Defines a new state with name state-name. All lines
until the next STATE line are part of the definition of this state.
- ARC TO state-name [count]
- Indicates that there is an arc from the current state to the
state named state-name.
The count, which will be used to compute the probability of
transition, indicates how often the transition to state-name was
observed. If absent, a count of 1 is assumed.
- EMIT token [count]
- Indicates that the current state can emit token token. The count,
which will be used to compute the probability of this emission, indicates
how often the emission of token was observed. If absent, a
count of 1 is assumed.
- TAG tag
- Indicates that the current state is associated with tag tag.
These tags are used to associate HMM states with annotations, as
An example of a simple file which matches a sequence of "oink"s and
"quack"s:
    ARC TO middle
    EMIT quack 1
    EMIT oink 2
    ARC TO middle 2
    ARC TO end 1
Note that the pos_hmm file gives counts, not probabilities (these are actual
counts from a million words of text). Furthermore, the file does
not give the total count for a state; that value is computed by
Jet from the sum of the ARC TO counts leaving a state. You need
this value to compute the emission and transition probabilities. For
the homework it will be OK (if you don't want to add all 40+ counts) to
estimate this total by summing
the few arcs with the largest values.
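The file format and the count-to-probability step can be sketched in a few lines of Python. This is an illustration only, not Jet's own reader: the function names are hypothetical, counts are assumed to be integers, and tokens are assumed to contain no whitespace. The sample text reuses the oink/quack counts; its STATE lines (start, middle, end) are assumptions added so that every ARC TO and EMIT line belongs to a state.

```python
def parse_hmm(text):
    """Parse STATE / ARC TO / EMIT / TAG lines into per-state dictionaries."""
    states = {}
    current = None
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue
        keyword = fields[0]
        if keyword == "STATE":
            # All following lines belong to this state until the next STATE.
            current = states.setdefault(
                fields[1], {"arcs": {}, "emits": {}, "tag": None}
            )
        elif current is None:
            raise ValueError(f"line before any STATE line: {raw!r}")
        elif keyword == "ARC" and len(fields) > 2 and fields[1] == "TO":
            # A missing count defaults to 1.
            current["arcs"][fields[2]] = int(fields[3]) if len(fields) > 3 else 1
        elif keyword == "EMIT":
            current["emits"][fields[1]] = int(fields[2]) if len(fields) > 2 else 1
        elif keyword == "TAG":
            current["tag"] = fields[1]
    return states


def state_probabilities(state):
    """Turn one state's counts into (transition, emission) probabilities.

    The state's total count is the sum of its ARC TO counts.
    """
    total = sum(state["arcs"].values())
    if total == 0:
        return {}, {}
    trans = {dest: count / total for dest, count in state["arcs"].items()}
    emit = {token: count / total for token, count in state["emits"].items()}
    return trans, emit


# Assumption: the three STATE lines are not in the handout's fragment.
SAMPLE = """\
STATE start
ARC TO middle
STATE middle
EMIT quack 1
EMIT oink 2
ARC TO middle 2
ARC TO end 1
STATE end
"""

if __name__ == "__main__":
    hmm = parse_hmm(SAMPLE)
    trans, emit = state_probabilities(hmm["middle"])
    print("transitions:", trans)
    print("emissions:", emit)
```

For pos_hmm.txt the same division applies to each state; summing only the few largest ARC TO counts gives a slightly low total and therefore slightly high probability estimates, which does not matter much when comparing two tag sequences.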