JET HMM Tools

JET provides the basic tools for building Hidden Markov Models (HMMs), and for using HMMs to annotate text.  Two simple annotators are included in JET, a part-of-speech tagger and a name tagger;  this documentation is provided to allow users to modify these taggers and to write their own taggers.

There are four separate components:

The HMM and HMMannotator are described below in separate sections.

The HMM

JET provides a set of classes for defining HMMs.  An HMM is made up of a set of HMMstates.  Each HMMstate in turn has an HMMemitter, which specifies which tokens can be emitted by that state, and a set of HMMarcs, which indicate which states can follow the current state, and with what probability.  Note that in JET HMMs, emission is associated with states, not arcs.  Every state other than the start and end states emits exactly one token.

HMMemitter is an abstract class.  The actual emitter class used in a specific HMM must be an extension of HMMemitter.  Two such extensions are currently implemented, BasicHMMemitter and WordFeatureHMMemitter.

Every state has a name.  Every HMM should have a state named "start" and a state named "end".  The HMM always begins in the start state, and always ends in the end state.  The start and end states do not emit tokens;  therefore, a sequence of n+2 states, including the start and end states, generates a sequence of n tokens.

External Representation of an HMM

An HMM has an external (readable) representation.  This representation is generated by method HMM.store, and can be read by HMM.load.  This representation consists of a series of lines each beginning with a keyword.  The following types of lines are generated and recognized:
STATE state-name
Defines a new state with name state-name.  All following lines until the next STATE line are part of the definition of this state.
ARC TO state-name [count]
Indicates that there is an arc from the current state to the state named state-name.  The count, which will be used to compute the probability of this transition, indicates how often the transition to state-name was observed.  If absent, a count of 1 is assumed.
EMIT token [count]
Indicates that the current state can emit token token.  The count, which will be used to compute the probability of this emission, indicates how often the emission of token was observed.  If absent, a count of 1 is assumed.
TAG tag
Indicates that the current state is associated with tag tag.  These tags are used to associate HMM states with annotations, as explained below.
Keywords (STATE, ARC, TO, EMIT, TAG) may be in upper or lower case.  Blank lines are allowed in the file.

An example of a simple file which matches a sequence of "oink"s and "quack"s is:

STATE start
ARC TO middle
STATE middle
EMIT quack 1
EMIT oink 2
ARC TO middle 2
ARC TO end 1
STATE end
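
To make the line format concrete, here is a minimal sketch of a reader for it, written without any JET classes. It is not JET's HMM.load: the class HmmFormatSketch, its StateDef holder, and the relativeFrequency helper are hypothetical names introduced for illustration, and turning counts into probabilities by plain relative frequency is an assumption (JET's own computation may differ, for example by smoothing).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative reader for the line-oriented HMM format; not JET's HMM.load. */
final class HmmFormatSketch {

    /** Hypothetical holder for the contents of one STATE block. */
    static final class StateDef {
        String tag = "";
        final Map<String, Integer> arcCounts = new LinkedHashMap<>();
        final Map<String, Integer> emitCounts = new LinkedHashMap<>();
    }

    /** Parses STATE / ARC TO / EMIT / TAG lines into per-state count tables. */
    static Map<String, StateDef> parse(String text) {
        Map<String, StateDef> states = new LinkedHashMap<>();
        StateDef current = null;
        for (String line : text.split("\n")) {
            String[] f = line.trim().split("\\s+");
            if (f[0].isEmpty()) continue;                    // blank lines are allowed
            String keyword = f[0].toUpperCase(Locale.ROOT);  // keywords may be in either case
            if (keyword.equals("STATE")) {
                current = new StateDef();
                states.put(f[1], current);
            } else if (current == null) {
                throw new IllegalArgumentException("line before first STATE: " + line);
            } else if (keyword.equals("ARC")) {              // ARC TO state-name [count]
                current.arcCounts.put(f[2], count(f, 3));
            } else if (keyword.equals("EMIT")) {             // EMIT token [count]
                current.emitCounts.put(f[1], count(f, 2));
            } else if (keyword.equals("TAG")) {              // TAG tag
                current.tag = f[1];
            } else {
                throw new IllegalArgumentException("unknown keyword: " + line);
            }
        }
        return states;
    }

    /** Returns the optional count at the given field index, or 1 if it is absent. */
    static int count(String[] f, int index) {
        return f.length > index ? Integer.parseInt(f[index]) : 1;
    }

    /** Relative frequency of one key; an assumed estimate, not necessarily JET's. */
    static double relativeFrequency(Map<String, Integer> counts, String key) {
        int total = 0;
        for (int c : counts.values()) total += c;
        return total == 0 ? 0.0 : counts.getOrDefault(key, 0) / (double) total;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "STATE start", "ARC TO middle",
            "STATE middle", "EMIT quack 1", "EMIT oink 2",
            "ARC TO middle 2", "ARC TO end 1",
            "STATE end");
        Map<String, StateDef> hmm = parse(text);
        System.out.println(relativeFrequency(hmm.get("middle").emitCounts, "oink"));
        System.out.println(relativeFrequency(hmm.get("middle").arcCounts, "end"));
    }
}
```

Under the relative-frequency assumption, the counts in the file above give state middle a 2-in-3 chance of emitting "oink" and a 1-in-3 chance of moving to end.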

Defining the HMM Topology

The first step in creating an HMM is to define its topology --- its states and arcs (i.e., to define its possible transitions, but not the transition or emission probabilities).  The topology can be defined in two ways.  First, you can describe it in the external representation (omitting the counts), and then create an HMM with
HMM h = new HMM();
h.load(new BufferedReader (new FileReader (file-name)));
Alternatively, you can create the HMM with calls to the HMMstate and HMMarc constructors:
HMM h = new HMM();
HMMstate start = new HMMstate("start", "", BasicHMMemitter.class);
start.addArc( new HMMarc("middle", 0));
h.addState(start);
HMMstate middle = new HMMstate("middle", "personTag", BasicHMMemitter.class);
middle.addArc(new HMMarc("middle",0));
middle.addArc(new HMMarc("end", 0));
h.addState(middle);
HMMstate end = new HMMstate("end", "", BasicHMMemitter.class);
h.addState(end);
h.resolveNames();
The latter approach is particularly useful for large, regular HMMs, such as ergodic HMMs.
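
The regularity of an ergodic topology (every emitting state connected to every other) is easy to capture in a nested loop. Since the JET classes are not available in a standalone snippet, the sketch below writes the topology in the external representation instead of calling the constructors; the same loops could drive HMMstate and HMMarc constructor calls. The class name ErgodicTopologySketch and the choice to set each state's tag equal to its name are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative generator of an ergodic topology in the external HMM format. */
final class ErgodicTopologySketch {

    /**
     * Returns text in which start has an arc to every named state, and every
     * named state has an arc to every named state (itself included) and to end.
     * Counts are omitted, so only the topology is described.
     */
    static String ergodic(List<String> stateNames) {
        StringBuilder sb = new StringBuilder();
        sb.append("STATE start\n");
        for (String to : stateNames) {
            sb.append("ARC TO ").append(to).append('\n');
        }
        for (String from : stateNames) {
            sb.append("STATE ").append(from).append('\n');
            sb.append("TAG ").append(from).append('\n');   // assumption: tag equals name
            for (String to : stateNames) {
                sb.append("ARC TO ").append(to).append('\n');
            }
            sb.append("ARC TO end\n");
        }
        sb.append("STATE end\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(ergodic(Arrays.asList("noun", "verb", "other")));
    }
}
```

The resulting text could be saved to a file and read with HMM.load as described above. For k emitting states it contains k*(k+2) arcs, which is why writing such a file by hand quickly becomes impractical.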

The HMMannotator:  associating HMMs with annotations

Defining the correspondence

Most processing in JET works by adding annotations to a document.  In the case of HMMs, we do this by associating particular states of the HMM with particular annotations on the document.  This is done by class HMMannotator.  The properties of an HMMannotator determine the correspondence between the HMM and the annotations.

One of the properties of an HMMstate is its tag.  The tag is used to establish the correspondence between the HMM state and the annotations;  all states with the same tag are considered equivalent for annotation purposes.  (In simple cases, we will set the tag of a state equal to its name, but having names and tags separate allows for greater flexibility.)

The most important property of an HMMannotator is its tagTable.  The tag table is of type String[][4].  Each row of the tag table is a quadruple:

{annotation-type,    annotation-attribute,    annotation-value,    tag}
This row says that tag tag corresponds to having an annotation of type annotation-type with attribute annotation-attribute with value annotation-value.   For example,
{"namex",    "type",    "person",    "personTag"}
indicates that the state with tag personTag corresponds to an annotation on the document of <namex type=person> ... </namex>.  This means that, if we analyze a document with an HMM, and in the most likely analysis the word "Anastasia" is matched by state "middle", which has tag "personTag", then we will add an annotation <namex type=person>Anastasia</namex>.  The tag table can be read from a file, one row per line, by the readTagTable method.

This doesn't completely define the correspondence, however.  Suppose the document contains the words "Albert Anastasia", and both tokens "Albert" and "Anastasia" are matched by the same state, with tag personTag.  Should we generate one namex annotation covering both words, or two separate annotations?  If the property annotateEachToken is true, then a separate annotation is produced for each token;  this is appropriate, for example, for part-of-speech tagging, where each token should be separately tagged.  If this property is false, then a single annotation is generated for one or more consecutive states with the same tag;  this is appropriate whenever we need to tag multi-token items.

This is not quite sufficient, because we may have two consecutive multi-word names, as in the sentence "By accident, I called Albert Anastasia Fred Smith.", which we would like to annotate as "By accident, I called <namex type=person>Albert Anastasia</namex> <namex type=person>Fred Smith</namex>."  To handle such cases, we must distinguish the state which starts a person name from the state which continues a person name.  This is done with the BItag property.  If BItag is false, correspondences are as previously described.  If BItag is true, and the tag table is as given above, then the state corresponding to the first token of a name must have tag B-personTag, while the state corresponding to the continuation of a name must have tag I-personTag.
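
The effect of the two properties can be seen in a small standalone sketch that groups a sequence of per-token tags into annotation spans. This is not JET's implementation: the class TagSpanSketch, its methods, and the convention that an empty string marks a token with no annotation are all assumptions introduced for illustration. Only the row layout of the tag table and the B-/I- prefixes come from the description above.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative grouping of per-token tags into spans; not JET's HMMannotator. */
final class TagSpanSketch {

    /** Looks up a tag in a {type, attribute, value, tag} table; null if absent. */
    static String annotationFor(String[][] tagTable, String tag) {
        for (String[] row : tagTable) {
            if (row[3].equals(tag)) {
                return "<" + row[0] + " " + row[1] + "=" + row[2] + ">";
            }
        }
        return null;
    }

    /**
     * Returns spans written as tag[start,end), with end exclusive.
     * An empty tag means "no annotation" (an assumption of this sketch).
     */
    static List<String> spans(String[] tags, boolean annotateEachToken, boolean biTag) {
        List<String> out = new ArrayList<>();
        int start = -1;      // index where the open span began, or -1 if none is open
        String open = "";    // tag of the open span, without any B-/I- prefix
        for (int i = 0; i <= tags.length; i++) {
            String tag = i < tags.length ? tags[i] : "";   // sentinel closes the last span
            boolean begins = false;
            String base = tag;
            if (biTag && (tag.startsWith("B-") || tag.startsWith("I-"))) {
                begins = tag.startsWith("B-");
                base = tag.substring(2);
            }
            boolean continues = start >= 0 && !annotateEachToken
                    && !begins && base.equals(open);
            if (!continues) {
                if (start >= 0) out.add(open + "[" + start + "," + i + ")");
                start = base.isEmpty() ? -1 : i;
                open = base;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] tagTable = {{"namex", "type", "person", "personTag"}};
        // Tags for the tokens: called Albert Anastasia Fred Smith .
        String[] plain = {"", "personTag", "personTag", "personTag", "personTag", ""};
        String[] bi = {"", "B-personTag", "I-personTag", "B-personTag", "I-personTag", ""};
        System.out.println(annotationFor(tagTable, "personTag"));
        System.out.println(spans(plain, true, false));   // one span per token
        System.out.println(spans(plain, false, false));  // one span over all four tokens
        System.out.println(spans(bi, false, true));      // two spans of two tokens each
    }
}
```

With plain tags and annotateEachToken false, the four name tokens collapse into a single span, which is exactly the problem described above; the B- tag on "Fred" is what lets the second name start a new span.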

To build an annotator based on an HMM, we first create the HMM (as described in the previous section),

HMM h = new HMM();
then create an annotator using this HMM,
HMMannotator annotator = new HMMannotator (h);
and finally set the properties of this annotator, using
annotator.setTagTable (...)
or
annotator.readTagTable (...)

Training the HMMannotator

Once a correspondence has been established, we can use the HMMannotator to train the HMM.  If we have a collection of documents with the appropriate annotations, this can be done by the HMMannotator.train method:
Jet.Tipster.Collection col = new Jet.Tipster.Collection(...);
annotator.train (col);
h.computeProbabilities();
HMMannotator.train applies the HMM separately to each sentence (sequence of tokens marked with an S annotation) in the document.  This can be changed by using the zoneToTag property of the annotator.
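
The idea behind count-based training can be sketched without JET as follows. This is a simplified illustration, not HMMannotator.train: it assumes that every token of a training sentence has already been assigned to a single known state (JET must work this out from the annotations and the tag table, and several states may share a tag), and it assumes probabilities are plain relative frequencies. The class CountTrainingSketch and its methods are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative count accumulation for an HMM; not JET's HMMannotator.train. */
final class CountTrainingSketch {

    /** arcCounts.get(from).get(to) is how often the transition from -> to was seen. */
    final Map<String, Map<String, Integer>> arcCounts = new TreeMap<>();
    /** emitCounts.get(state).get(token) is how often state emitted token. */
    final Map<String, Map<String, Integer>> emitCounts = new TreeMap<>();

    /** Adds one sentence: states[i] is the state assumed to have emitted tokens[i]. */
    void train(String[] tokens, String[] states) {
        String prev = "start";                       // every path begins in start
        for (int i = 0; i < tokens.length; i++) {
            bump(arcCounts, prev, states[i]);
            bump(emitCounts, states[i], tokens[i]);
            prev = states[i];
        }
        bump(arcCounts, prev, "end");                // and finishes in end
    }

    private static void bump(Map<String, Map<String, Integer>> table, String row, String col) {
        table.computeIfAbsent(row, k -> new TreeMap<>()).merge(col, 1, Integer::sum);
    }

    /** Relative-frequency estimate from a count table; an assumption of this sketch. */
    static double probability(Map<String, Map<String, Integer>> table, String row, String col) {
        Map<String, Integer> counts = table.get(row);
        if (counts == null) return 0.0;
        int total = 0;
        for (int c : counts.values()) total += c;
        return counts.getOrDefault(col, 0) / (double) total;
    }

    public static void main(String[] args) {
        CountTrainingSketch t = new CountTrainingSketch();
        t.train(new String[] {"quack", "oink", "oink"},
                new String[] {"middle", "middle", "middle"});
        System.out.println(t.arcCounts);
        System.out.println(t.emitCounts);
        System.out.println(probability(t.arcCounts, "middle", "end"));
    }
}
```

Training on the single sentence "quack oink oink" reproduces the counts of the oink/quack example file given earlier: two middle-to-middle transitions, one middle-to-end transition, one "quack" and two "oink" emissions.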

Using the trained HMMannotator

The trained annotator can then be used to annotate new documents.  The HMMannotator.annotate method annotates each sentence in the document (again, the unit to annotate can be changed through the zoneToTag property).