action name | tagNamesFromOnoma
resources required | onomasticon (name dictionary)
properties | Onoma.fileName
annotations required | token
annotations added | ENAMEX
Jet provides two means of tagging names: a statistical name model, implemented as an
HMM or MEMM, and a name dictionary, formally called an onomasticon. Each line in the
onomasticon defines a single name and should consist of one or more tokens separated by spaces,
a tab character, and a name type; a second tab and an entity subtype are optional.
For example, the line
New York (tab) GPE
defines "New York" as a geo-political entity name; the line
New York (tab) GPE (tab) Population-Center
further specifies it as being of subtype Population-Center.
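The line format described above can be illustrated with a small parser. This is a sketch in Python, not Jet's own (Java) implementation; the names `OnomaEntry` and `parse_onoma_line` are hypothetical and chosen only for this example.

```python
from typing import NamedTuple, Optional, Tuple


class OnomaEntry(NamedTuple):
    """One onomasticon line (hypothetical structure, for illustration only)."""
    tokens: Tuple[str, ...]   # the name's tokens, e.g. ("New", "York")
    name_type: str            # e.g. "GPE"
    subtype: Optional[str]    # e.g. "Population-Center", or None if absent


def parse_onoma_line(line: str) -> OnomaEntry:
    """Parse 'tokens<TAB>type' or 'tokens<TAB>type<TAB>subtype'."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) not in (2, 3):
        raise ValueError(f"expected 2 or 3 tab-separated fields: {line!r}")
    tokens = tuple(fields[0].split())   # tokens are separated by spaces
    if not tokens or not fields[1]:
        raise ValueError(f"missing name or name type: {line!r}")
    subtype = fields[2] if len(fields) == 3 else None
    return OnomaEntry(tokens, fields[1], subtype)
```

For the two example lines above, `parse_onoma_line("New York\tGPE")` yields an entry with no subtype, while the three-field form carries `Population-Center` as its subtype.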
Matches must be exact, including case.
In case of ambiguity, the longest match is preferred. Nested matches are not recognized;
after a name is matched, the matcher advances to the first token following the matched name.
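The matching behavior just described (exact, case-sensitive, longest match first, no nested matches) can be sketched as follows. This is an illustrative Python sketch under the assumption that the dictionary is held as a mapping from token tuples to name types; it is not Jet's actual code, and `tag_from_onoma` is a hypothetical name.

```python
from typing import Dict, List, Sequence, Tuple


def tag_from_onoma(
    tokens: Sequence[str],
    onoma: Dict[Tuple[str, ...], str],
) -> List[Tuple[int, int, str]]:
    """Return (start, end, name_type) spans; 'end' is exclusive."""
    max_len = max((len(key) for key in onoma), default=0)
    spans = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first so the longest match wins.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])   # exact comparison, case included
            if key in onoma:
                spans.append((i, i + n, onoma[key]))
                i += n                     # skip past the name: no nesting
                matched = True
                break
        if not matched:
            i += 1
    return spans
```

With hypothetical entries for "New York", "New York Times", and "York", the token sequence "New York Times" produces a single span for the three-token name; neither "New York" nor "York" is reported inside it.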
It is possible to use both a name dictionary and a statistical name tagger. In this
case the statistical tagger is applied first, followed by the onoma tagger:
processSentence = ..., tagNames, tagNamesFromOnoma, ...
A token sequence in the text which matches an onoma entry will be tagged by the onoma
tagger unless some name tagged by the statistical tagger is partially but not
wholly contained in the sequence. In particular, this means that if a sequence
X Y Z has been tagged by the statistical tagger, shorter sequences such as Y
Z or partially overlapping sequences such as W X will not be retagged by the
onoma tagger.
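The overlap rule can be made concrete with a small check. The following Python sketch assumes names are represented as half-open token spans `(start, end)`; the function name `onoma_may_tag` is hypothetical, and the sketch only decides whether the onoma tagger is allowed to tag a candidate sequence. It does not model anything else Jet may do with the annotations.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]   # (start, end) token indices, end exclusive


def onoma_may_tag(candidate: Span, statistical_names: Iterable[Span]) -> bool:
    """True unless some statistical name is partially but not wholly
    contained in the candidate onoma match."""
    c_start, c_end = candidate
    for s_start, s_end in statistical_names:
        overlaps = s_start < c_end and c_start < s_end
        wholly_inside = c_start <= s_start and s_end <= c_end
        if overlaps and not wholly_inside:
            return False   # the statistical name sticks out of the candidate
    return True
```

Taking the tokens W X Y Z at positions 0 to 3, with X Y Z tagged by the statistical tagger as `(1, 4)`: the candidates Y Z `(2, 4)` and W X `(0, 2)` are rejected, while X Y Z `(1, 4)` and W X Y Z `(0, 4)` are accepted because the statistical name lies wholly within them.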