Part of Speech Tagger


action names: tagPOS, tagJet, pruneTags
resources required: HMM part-of-speech model
properties: Tags.fileName
annotations required: token
annotations added: constit, tagger

The part-of-speech tagger assigns parts of speech to tokens based on lexical statistics (the frequency with which a word is assigned a given part of speech) and POS bigram statistics (the frequency with which part of speech X is followed by part of speech Y). Jet provides a tagger file trained on a portion of the part-of-speech tagged Penn Treebank, although other tagger files can be used. For previously unseen tokens, the tagger falls back on statistics for upper-case, lower-case and numeric tokens; it uses no other morphological information.
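How the two kinds of statistics combine can be illustrated with a small Viterbi search in Java. This is a sketch under stated assumptions, not Jet's implementation: the class name BigramTaggerSketch, the three-tag toy model, all probabilities, the shape placeholders (<NUM>, <UPPER>, <LOWER>) and the 1e-6 floor are invented for illustration.

```java
import java.util.Map;

// Illustrative sketch only; not Jet's tagger. All names and numbers are assumptions.
class BigramTaggerSketch {

    // Coarse class substituted for a word the model has never seen.
    static String shape(String word) {
        char c = word.isEmpty() ? ' ' : word.charAt(0);
        if (Character.isDigit(c)) return "<NUM>";
        if (Character.isUpperCase(c)) return "<UPPER>";
        return "<LOWER>";
    }

    // Lexical statistic P(word | tag); unseen words back off to their shape class.
    static double emission(Map<String, Map<String, Double>> emit, String tag, String word) {
        boolean seen = false;
        for (Map<String, Double> perTag : emit.values()) {
            if (perTag.containsKey(word)) seen = true;
        }
        String key = seen ? word : shape(word);
        return emit.get(tag).getOrDefault(key, 1e-6); // small floor instead of zero
    }

    // Viterbi search: start[j] = P(tag j first), trans[i][j] = P(tag j | previous tag i).
    static String[] tag(String[] tokens, String[] tags, double[] start,
                        double[][] trans, Map<String, Map<String, Double>> emit) {
        int n = tokens.length, k = tags.length;
        if (n == 0) return new String[0];
        double[][] score = new double[n][k]; // best log-probability ending in tag j at position t
        int[][] back = new int[n][k];        // predecessor tag on that best path
        for (int j = 0; j < k; j++) {
            score[0][j] = Math.log(start[j]) + Math.log(emission(emit, tags[j], tokens[0]));
        }
        for (int t = 1; t < n; t++) {
            for (int j = 0; j < k; j++) {
                double best = Double.NEGATIVE_INFINITY;
                int arg = 0;
                for (int i = 0; i < k; i++) {
                    double s = score[t - 1][i] + Math.log(trans[i][j]);
                    if (s > best) { best = s; arg = i; }
                }
                score[t][j] = best + Math.log(emission(emit, tags[j], tokens[t]));
                back[t][j] = arg;
            }
        }
        int last = 0;
        for (int j = 1; j < k; j++) {
            if (score[n - 1][j] > score[n - 1][last]) last = j;
        }
        String[] out = new String[n];
        for (int t = n - 1; t >= 0; t--) {
            out[t] = tags[last];
            last = back[t][last];
        }
        return out;
    }

    // Toy model: "book" is lexically more likely a verb, but DT is usually followed by NN.
    static String[] demo(String... tokens) {
        String[] tags = {"DT", "NN", "VB"};
        double[] start = {0.6, 0.3, 0.1};
        double[][] trans = {
            {0.01, 0.90, 0.09},  // after DT
            {0.10, 0.30, 0.60},  // after NN
            {0.50, 0.40, 0.10}}; // after VB
        Map<String, Map<String, Double>> emit = Map.of(
            "DT", Map.of("the", 0.9),
            "NN", Map.of("book", 0.3, "<UPPER>", 0.1),
            "VB", Map.of("book", 0.5, "<UPPER>", 0.01));
        return tag(tokens, tags, start, trans, emit);
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", demo("the", "book"))); // prints "DT NN"
    }
}
```

In this toy model the lexical statistics alone would favor VB for "book", but the bigram statistic for DT followed by NN outweighs them, so the sequence DT NN wins. An unseen capitalized token is scored through the <UPPER> entries, which is one possible reading of the upper-case, lower-case and numeric fallback described above.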

The part-of-speech taggers are trained on Version II of the Penn Treebank and therefore use the Treebank tag set. However, the patterns and lexicons in Jet generally use a more standard set of word categories and attributes, in which, for example, NN and NNS are represented by [cat=noun number=singular] and [cat=noun number=plural], respectively. The annotation actions provided in Jet allow you to use either the Penn categories or the Jet categories:
 
tagPOS Assigns annotations of type constit to each token, with feature cat corresponding to the Penn part-of-speech tag.
tagJet First assigns annotations of type tagger to each token, with feature cat corresponding to the Penn part-of-speech tag. Then (using the tagger annotations) assigns annotations of type constit to each token, with features cat and number corresponding to the Jet part-of-speech encoding.
pruneTags This action assumes that the tokens have already been assigned constit annotations by dictionary look-up, using the Jet part-of-speech; words with several parts of speech will have been assigned several such annotations. pruneTags uses the HMM tagger to determine the most likely part-of-speech P of the word in context, and removes all constit annotations except those corresponding to P. This makes it possible to use the additional information provided by the lexicon (base word forms, syntactic and semantic features, predicate-argument structures) while still retaining the benefit of the tagger (one part-of-speech per word). It also provides more accurate tagging of words that are in the lexicon but not known to the statistical tagger.

Like tagJet, pruneTags first assigns annotations of type tagger to each token, with feature cat corresponding to the Penn part-of-speech tag;  it then uses these annotations to guide the pruning of the constit annotations.
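
The conversion from Penn tags to Jet features and the pruning step can be sketched in Java as follows. This is a hypothetical illustration, not Jet's code: annotations are modeled as plain feature maps, only the NN and NNS mappings given above are included, the cat=verb value and the base feature are invented for the example, and keeping every candidate when nothing matches is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not Jet's implementation.
class PruneTagsSketch {

    // Jet encoding of a Penn tag. Only the two mappings stated in the text
    // are included; any other tag yields an empty feature map.
    static Map<String, String> toJet(String pennTag) {
        Map<String, String> features = new LinkedHashMap<>();
        if (pennTag.equals("NN")) {
            features.put("cat", "noun");
            features.put("number", "singular");
        } else if (pennTag.equals("NNS")) {
            features.put("cat", "noun");
            features.put("number", "plural");
        }
        return features;
    }

    // True if the candidate carries every feature value in 'wanted'.
    static boolean matches(Map<String, String> candidate, Map<String, String> wanted) {
        for (Map.Entry<String, String> e : wanted.entrySet()) {
            if (!e.getValue().equals(candidate.get(e.getKey()))) return false;
        }
        return true;
    }

    // Keeps the constit candidates that correspond to the tagger's choice.
    // Assumption: if the tag is unmapped or nothing matches, all candidates are kept.
    static List<Map<String, String>> prune(List<Map<String, String>> constits, String pennTag) {
        Map<String, String> wanted = toJet(pennTag);
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> c : constits) {
            if (!wanted.isEmpty() && matches(c, wanted)) kept.add(c);
        }
        return kept.isEmpty() ? new ArrayList<>(constits) : kept;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary entries for the token "books".
        List<Map<String, String>> candidates = List.of(
            Map.of("cat", "noun", "number", "plural", "base", "book"),
            Map.of("cat", "verb", "base", "book"));
        // Suppose the HMM tagger chose NNS in context.
        System.out.println(prune(candidates, "NNS").size()); // prints 1
    }
}
```

The surviving annotation still carries its lexicon features (here the hypothetical base feature), which is the point of pruning dictionary entries rather than replacing them with the tagger's output.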