There are four separate components:
HMMemitter is an abstract class. The actual emitter class used in a specific HMM must be an extension of HMMemitter. Two such extensions are currently implemented, BasicHMMemitter and WordFeatureHMMemitter.
Every state has a name. Every HMM should have a state named "start" and a state named "end". The HMM always begins in the start state, and always ends in the end state. The start and end states do not emit tokens; therefore, a sequence of n+2 states, including the start and end states, generates a sequence of n tokens.
An example of a simple file which matches a sequence of "oink"s and "quack"s is:
STATE start
ARC TO middle
STATE middle
EMIT quack 1
EMIT oink 2
ARC TO middle 2
ARC TO end 1
STATE end
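To make the file format concrete, here is a minimal sketch of a reader for files like the one above. It is not Jet's own loader: the class HmmFileSketch and its State type are hypothetical names introduced for illustration, and the sketch assumes that an omitted number on an ARC or EMIT line stands for 0.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical reader for the STATE / ARC TO / EMIT format; not part of Jet. */
final class HmmFileSketch {

    /** One parsed state: arcs and emissions, each mapped to its trailing number. */
    static final class State {
        final Map<String, Integer> arcs = new LinkedHashMap<>();
        final Map<String, Integer> emits = new LinkedHashMap<>();
    }

    /** Parses the text of an HMM file into a map from state name to State. */
    static Map<String, State> parse(String text) {
        Map<String, State> states = new LinkedHashMap<>();
        State current = null;
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) continue;
            String[] f = line.split("\\s+");
            if (f[0].equals("STATE") && f.length == 2) {
                current = new State();
                states.put(f[1], current);
            } else if (current == null) {
                throw new IllegalArgumentException("line before first STATE: " + line);
            } else if (f[0].equals("ARC") && f.length >= 3 && f[1].equals("TO")) {
                // Assumption: an omitted number means 0.
                current.arcs.put(f[2], f.length > 3 ? Integer.parseInt(f[3]) : 0);
            } else if (f[0].equals("EMIT") && f.length >= 2) {
                current.emits.put(f[1], f.length > 2 ? Integer.parseInt(f[2]) : 0);
            } else {
                throw new IllegalArgumentException("unrecognized line: " + line);
            }
        }
        return states;
    }

    public static void main(String[] args) {
        String file = String.join("\n",
            "STATE start", "ARC TO middle",
            "STATE middle", "EMIT quack 1", "EMIT oink 2",
            "ARC TO middle 2", "ARC TO end 1",
            "STATE end");
        Map<String, State> hmm = parse(file);
        System.out.println(hmm.keySet());             // state names in file order
        System.out.println(hmm.get("middle").emits);  // emissions of the middle state
    }
}
```

Note that the start and end states carry no EMIT lines, which matches the rule that they do not emit tokens.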
HMM h = new HMM();
HMMstate start = new HMMstate("start", "", BasicHMMemitter.class);
start.addArc( new HMMarc("middle", 0));
h.addState(start);
HMMstate middle = new HMMstate("middle", "personTag", BasicHMMemitter.class);
middle.addArc(new HMMarc("middle",0));
middle.addArc(new HMMarc("end", 0));
h.addState(middle);
HMMstate end = new HMMstate("end", "", BasicHMMemitter.class);
h.addState(end);
h.resolveNames();
The latter approach is particularly useful for large, regular HMMs, such as ergodic HMMs.
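The number of arcs in an ergodic HMM grows quadratically with the number of emitting states, so a loop is the natural way to generate them. The following is a rough sketch of that loop. To stay runnable without Jet, it writes the HMM out in the STATE / ARC TO file format rather than calling addState and addArc; the class name ErgodicHmmSketch and the choice to link every emitting state to "end" are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical generator of an ergodic HMM description; not part of Jet. */
final class ErgodicHmmSketch {

    /** Builds file-format text in which every emitting state reaches every emitting state. */
    static String build(List<String> emitting) {
        StringBuilder sb = new StringBuilder();
        // The start state can enter any emitting state.
        sb.append("STATE start\n");
        for (String to : emitting) {
            sb.append("ARC TO ").append(to).append('\n');
        }
        // Each emitting state links to all emitting states (itself included) and to end.
        for (String from : emitting) {
            sb.append("STATE ").append(from).append('\n');
            for (String to : emitting) {
                sb.append("ARC TO ").append(to).append('\n');
            }
            sb.append("ARC TO end\n");
        }
        sb.append("STATE end\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(build(Arrays.asList("noun", "verb", "other")));
    }
}
```

With the Jet classes, the same two nested loops would create one HMMarc per pair of states and add it to the source state before calling resolveNames.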
One of the properties of an HMMstate is its tag. The tag is used to establish the correspondence between the HMM state and the annotations; all states with the same tag are considered equivalent for annotation purposes. (In simple cases, we will set the tag of a state equal to its name, but having names and tags separate allows for greater flexibility.)
The most important property of an HMMannotator is its tagTable. The tag table is of type String[][4]. Each row of the tag table is a quadruple:
{annotation-type, annotation-attribute, annotation-value, tag}
This row says that the given tag corresponds to an annotation of type annotation-type whose attribute annotation-attribute has the value annotation-value. For example,
{"namex", "type", "person", "personTag"}
indicates that a state with tag personTag corresponds to an annotation on the document of <namex type=person> ... </namex>. This means that, if we analyze a document with an HMM, and in the most likely analysis the word "Anastasia" is matched by state "middle", which has tag "personTag", then we will add an annotation <namex type=person>Anastasia</namex>. The tag table can be read from a file, one row per line, by the readTagTable method.
This doesn't completely define the correspondence, however. Suppose the document contains the words "Albert Anastasia", and both tokens "Albert" and "Anastasia" are matched by the same state, with tag personTag. Should we generate one namex annotation covering both words, or two separate annotations? If the property annotateEachToken is true, then a separate annotation is produced for each token; this is appropriate, for example, for part-of-speech tagging, where each token should be separately tagged. If this property is false, then a single annotation is generated for each run of one or more consecutive states with the same tag; this is appropriate whenever we need to tag multi-token items.
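The effect of annotateEachToken can be illustrated with a small standalone sketch. It does not use Jet's annotator: the class TagSpanSketch and its spans method are hypothetical, and each span is reported simply as a pair of token indices (first token, one past the last token).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of annotateEachToken; not part of Jet. */
final class TagSpanSketch {

    /**
     * Returns the token spans to annotate for the given tag, as {start, endExclusive}.
     * tags[i] is the tag of the state matched to token i.
     */
    static List<int[]> spans(String[] tags, String tag, boolean annotateEachToken) {
        List<int[]> out = new ArrayList<>();
        int i = 0;
        while (i < tags.length) {
            if (!tag.equals(tags[i])) {
                i++;
                continue;
            }
            int j = i + 1;
            if (!annotateEachToken) {
                // Extend the span over the whole run of consecutive identical tags.
                while (j < tags.length && tag.equals(tags[j])) {
                    j++;
                }
            }
            out.add(new int[] {i, j});
            i = j;
        }
        return out;
    }

    public static void main(String[] args) {
        // Tokens: "called", "Albert", "Anastasia"
        String[] tags = {"other", "personTag", "personTag"};
        System.out.println(spans(tags, "personTag", false).size() + " span(s) when merging runs");
        System.out.println(spans(tags, "personTag", true).size() + " span(s) when tagging each token");
    }
}
```

With annotateEachToken false, "Albert Anastasia" yields a single span covering both tokens; with it true, each token gets its own span.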
This is not quite sufficient, because we may have two consecutive multi-word names, as in the sentence "By accident, I called Albert Anastasia Fred Smith.", which we would like to annotate as "By accident, I called <namex type=person>Albert Anastasia</namex> <namex type=person>Fred Smith</namex>." To handle such cases, we must distinguish the state which starts a person name from the state which continues a person name. This is done with the BItag property. If BItag is false, correspondences are as previously described. If BItag is true, and the tag table is as given above, then the state corresponding to the first token of a name must have tag B-personTag, while the state corresponding to the continuation of a name must have tag I-personTag.
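As a rough illustration of how B- and I- tags separate adjacent names, the sketch below decodes a tag sequence into annotated text. It is a standalone example, not Jet's implementation: the class BiTagSketch is hypothetical, and treating a stray I- tag with no preceding B- tag as the start of a new span is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical decoder for B-/I- tags; not part of Jet. */
final class BiTagSketch {

    /** Returns spans {start, endExclusive}: a B- tag opens a span, following I- tags extend it. */
    static List<int[]> spans(String[] tags, String tag) {
        String begin = "B-" + tag;
        String inside = "I-" + tag;
        List<int[]> out = new ArrayList<>();
        int i = 0;
        while (i < tags.length) {
            // Assumption: a stray I- tag also opens a span.
            if (!begin.equals(tags[i]) && !inside.equals(tags[i])) {
                i++;
                continue;
            }
            int j = i + 1;
            while (j < tags.length && inside.equals(tags[j])) {
                j++;
            }
            out.add(new int[] {i, j});
            i = j;
        }
        return out;
    }

    /** Joins tokens with spaces, wrapping each span in the given open and close markup. */
    static String annotate(String[] tokens, String[] tags, String tag, String open, String close) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        for (int[] s : spans(tags, tag)) {
            for (; i < s[0]; i++) {
                sb.append(tokens[i]).append(' ');
            }
            sb.append(open);
            for (; i < s[1]; i++) {
                sb.append(tokens[i]);
                if (i + 1 < s[1]) sb.append(' ');
            }
            sb.append(close).append(' ');
        }
        for (; i < tokens.length; i++) {
            sb.append(tokens[i]).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String[] tokens = {"called", "Albert", "Anastasia", "Fred", "Smith"};
        String[] tags = {"other", "B-personTag", "I-personTag", "B-personTag", "I-personTag"};
        System.out.println(annotate(tokens, tags, "personTag", "<namex type=person>", "</namex>"));
    }
}
```

Because "Fred" carries a B- tag, the run of four person tokens is split into two annotations instead of being merged into one.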
To build an annotator based on an HMM, we first create the HMM (as described in the previous section),
HMM h = new HMM();
then create an annotator using this HMM,
HMMannotator annotator = new HMMannotator (h);
and finally set the properties of this annotator, using
annotator.setTagTable (...)
or
annotator.readTagTable (...)
Jet.Tipster.Collection col = new Jet.Tipster.Collection(...);
annotator.train (col);
h.computeProbabilities();
HMMannotator.train applies the HMM separately to each sentence (sequence of tokens marked with an S annotation) in the document. This can be changed by using the zoneToTag property of the annotator.
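This text does not spell out what computeProbabilities does internally. One plausible reading, offered here only as an assumption, is that the counts gathered during training (such as the numbers on the ARC and EMIT lines of the example file) are turned into relative frequencies. The sketch below shows that calculation in isolation; the class CountNormalizerSketch is hypothetical, and Jet's actual method may smooth or otherwise adjust the estimates.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical relative-frequency estimate from counts; not Jet's computeProbabilities. */
final class CountNormalizerSketch {

    /** Divides each count by the total; returns all zeros if the total is zero. */
    static Map<String, Double> normalize(Map<String, Integer> counts) {
        long total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        Map<String, Double> probs = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            probs.put(e.getKey(), total == 0 ? 0.0 : (double) e.getValue() / total);
        }
        return probs;
    }

    public static void main(String[] args) {
        // Arc counts of the "middle" state in the example file.
        Map<String, Integer> arcs = new LinkedHashMap<>();
        arcs.put("middle", 2);
        arcs.put("end", 1);
        System.out.println(normalize(arcs));
    }
}
```

Under this assumption, the "middle" state of the example file would loop back to itself with probability 2/3 and move to "end" with probability 1/3.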