HMM taggers

HMM-based taggers

Jet incorporates procedures for training Hidden Markov Models (HMMs) and for using trained HMMs to annotate new text. These procedures have been used to implement part-of-speech taggers and a name tagger within Jet.
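This page does not say how a trained HMM is applied to new text. The standard decoding method for HMMs is the Viterbi algorithm, so the following is a minimal sketch under that assumption; it is not taken from Jet's source. The class name `ViterbiSketch`, the method `decode`, and the array layout are all hypothetical, and the probabilities are assumed to be supplied as logarithms.

```java
class ViterbiSketch {

    /**
     * Returns the most likely state sequence for the given observations.
     * All parameters are hypothetical and hold log-probabilities:
     *   start[s]    = log P(first state is s)
     *   trans[p][s] = log P(next state is s | current state is p)
     *   emit[s][o]  = log P(observation o | state s)
     */
    static int[] decode(int[] obs, double[] start, double[][] trans, double[][] emit) {
        int n = obs.length;
        int k = start.length;
        if (n == 0) {
            return new int[0];
        }
        double[][] score = new double[n][k]; // best log-score of any path ending in state s at time t
        int[][] back = new int[n][k];        // predecessor state on that best path

        for (int s = 0; s < k; s++) {
            score[0][s] = start[s] + emit[s][obs[0]];
        }
        for (int t = 1; t < n; t++) {
            for (int s = 0; s < k; s++) {
                int bestPrev = 0;
                double best = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < k; p++) {
                    double candidate = score[t - 1][p] + trans[p][s];
                    if (candidate > best) {
                        best = candidate;
                        bestPrev = p;
                    }
                }
                score[t][s] = best + emit[s][obs[t]];
                back[t][s] = bestPrev;
            }
        }

        // Pick the best final state, then follow the back-pointers.
        int last = 0;
        for (int s = 1; s < k; s++) {
            if (score[n - 1][s] > score[n - 1][last]) {
                last = s;
            }
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int t = n - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }
}
```

In a tagger, the states would correspond to tags and the observations to tokens, so the returned path would assign one tag to each token.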

Part-of-speech taggers

The part-of-speech taggers are trained from Version II of the Penn Tree Bank, and therefore are based on the tag set used by the tree bank. However, the patterns and lexicons in Jet generally use a more standard set of word categories and attributes, in which, for example, NN and NNS are represented by [cat=noun number=singular] and [cat=noun number=plural]. The annotation actions provided in Jet allow you to use either the Penn or Jet categories:

`tagPOS`	Assigns annotations of type constit to each token, with feature cat corresponding to the Penn part-of-speech tag.
`tagJet`	First assigns annotations of type tagger to each token, with feature cat corresponding to the Penn part-of-speech tag. Then (using the tagger annotations) assigns annotations of type constit to each token, with features cat and number corresponding to the Jet part-of-speech encoding.
`pruneTags`	This action assumes that the tokens have already been assigned constit annotations by dictionary look-up, using the Jet part-of-speech categories; words with several parts of speech will have been assigned several such annotations. pruneTags uses the HMM tagger to determine the most likely part of speech P of each word in context, and removes all constit annotations except those corresponding to P. This makes it possible to use the additional information provided by the lexicon (base word forms, syntactic and semantic features, predicate-argument structures) while still retaining the benefit of the tagger (one part of speech per word). Like tagJet, pruneTags first assigns annotations of type tagger to each token, with feature cat corresponding to the Penn part-of-speech tag; it then uses these annotations to guide the pruning of the constit annotations.
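The relationship between the Penn tags and the Jet categories, and the pruning step that relies on it, can be sketched as follows. This is an illustration only and does not use the Jet API: the class `PennToJetSketch` and its methods are hypothetical, annotations are modelled as plain feature maps, and only the NN and NNS rows of the mapping come from the description above. Two further simplifying assumptions: candidates are matched on the cat feature alone, and when a Penn tag has no mapping in the sketch all candidates are kept.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PennToJetSketch {

    /** Builds a feature map from alternating key/value strings. */
    static Map<String, String> feats(String... keyValues) {
        Map<String, String> features = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            features.put(keyValues[i], keyValues[i + 1]);
        }
        return features;
    }

    /**
     * Maps a Penn tag to Jet-style features. Only NN and NNS are covered;
     * any other tag yields an empty map (a choice made for this sketch).
     */
    static Map<String, String> toJet(String pennTag) {
        switch (pennTag) {
            case "NN":
                return feats("cat", "noun", "number", "singular");
            case "NNS":
                return feats("cat", "noun", "number", "plural");
            default:
                return feats();
        }
    }

    /**
     * Keeps only the candidate annotations whose cat matches the category
     * implied by the tagger's Penn tag. If the tag is not covered by toJet,
     * the candidates are returned unchanged (an assumption, not Jet behaviour).
     */
    static List<Map<String, String>> prune(List<Map<String, String>> candidates, String pennTag) {
        String wantedCat = toJet(pennTag).get("cat");
        if (wantedCat == null) {
            return new ArrayList<>(candidates);
        }
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> candidate : candidates) {
            if (wantedCat.equals(candidate.get("cat"))) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A word that the dictionary lists as both a noun and a verb.
        List<Map<String, String>> candidates = new ArrayList<>();
        candidates.add(feats("cat", "noun", "number", "singular"));
        candidates.add(feats("cat", "verb"));
        // The tagger chose NN, so only the noun reading should survive.
        System.out.println(prune(candidates, "NN"));
    }
}
```

Because the surviving annotation is the dictionary's own entry, any extra lexical features it carries are preserved, which is the benefit pruneTags is described as providing.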

Name tagger

The Jet name tagger is trained on the named-entity training corpus of the Seventh Message Understanding Conference (MUC-7), and uses the tags adopted for that evaluation. The following tags are used:

annotation type	TYPE feature	significance
ENAMEX	ORGANIZATION	organization name
ENAMEX	PERSON	person's name
ENAMEX	LOCATION	location name
TIMEX	DATE	date
TIMEX	TIME	time
NUMEX	MONEY	monetary expression
NUMEX	PERCENT	percentage

The action to assign these tags is `tagNames`.
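To make the tag table concrete, the sketch below pairs each TYPE value with its annotation type and renders tagged spans as inline markup in the style used for MUC-7 named-entity annotation. It does not call Jet: the class `NameTagSketch`, the `Span` type, and the `markUp` method are hypothetical, and the spans are supplied by hand rather than produced by a tagger. Spans are assumed not to overlap.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class NameTagSketch {

    /** The seven TYPE values from the table, each paired with its annotation type. */
    enum NameTag {
        ORGANIZATION("ENAMEX"), PERSON("ENAMEX"), LOCATION("ENAMEX"),
        DATE("TIMEX"), TIME("TIMEX"),
        MONEY("NUMEX"), PERCENT("NUMEX");

        final String annotationType;

        NameTag(String annotationType) {
            this.annotationType = annotationType;
        }
    }

    /** A tagged character range [start, end) in the text (hypothetical type). */
    static final class Span {
        final int start;
        final int end;
        final NameTag tag;

        Span(int start, int end, NameTag tag) {
            this.start = start;
            this.end = end;
            this.tag = tag;
        }
    }

    /** Wraps each span in inline markup; spans must not overlap. */
    static String markUp(String text, List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingInt(s -> s.start));
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Span s : sorted) {
            out.append(text, pos, s.start); // untagged text before the span
            out.append('<').append(s.tag.annotationType)
               .append(" TYPE=\"").append(s.tag.name()).append("\">");
            out.append(text, s.start, s.end);
            out.append("</").append(s.tag.annotationType).append('>');
            pos = s.end;
        }
        out.append(text.substring(pos)); // remaining untagged text
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Fred Jones joined Acme in 1998.";
        List<Span> spans = new ArrayList<>();
        spans.add(new Span(0, 10, NameTag.PERSON));
        spans.add(new Span(18, 22, NameTag.ORGANIZATION));
        spans.add(new Span(26, 30, NameTag.DATE));
        System.out.println(markUp(text, spans));
    }
}
```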

Note: A detailed description of the internal and external representation of the HMMs is provided as part of the API documentation.