Onomasticon (Name Dictionary)


action name:          tagNamesFromOnoma
resources required:   onomasticon (name dictionary)
properties:           Onoma.fileName
annotations required: token
annotations added:    ENAMEX

Jet provides two means of tagging names: a statistical name model, implemented as an HMM or MEMM, and a name dictionary, formally called an onomasticon. Each line in the onomasticon defines a single name and should consist of the name itself (one or more tokens separated by spaces), a tab character, and a name type; a second tab and an entity subtype are optional. For example, the line

New York (tab) GPE

defines "New York" as a geo-political entity name; the line

New York (tab) GPE (tab) Population-Center

further specifies it as being of subtype Population-Center. Matches must be exact, including case. In case of ambiguity, the longest match is preferred. Nested matches are not recognized; after a name is matched, the matcher advances to the first token following the matched name.
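
The file format and matching rules above can be illustrated with a short sketch. This is not Jet's implementation (Jet is written in Java); it is a minimal Python illustration that assumes the onomasticon has already been read into a list of lines. The function names parse_onoma and tag_names are hypothetical, and how malformed lines are handled is an assumption, since the text above does not say.

```python
from typing import Dict, List, Optional, Tuple

# Maps the tokens of a name to its (type, subtype); subtype may be None.
Onoma = Dict[Tuple[str, ...], Tuple[str, Optional[str]]]


def parse_onoma(lines: List[str]) -> Onoma:
    """Parse lines of the form: name TAB type [TAB subtype]."""
    entries: Onoma = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0].strip():
            continue  # assumption: silently skip blank or malformed lines
        tokens = tuple(fields[0].split())
        name_type = fields[1].strip()
        subtype = fields[2].strip() if len(fields) > 2 and fields[2].strip() else None
        entries[tokens] = (name_type, subtype)
    return entries


def tag_names(tokens: List[str], onoma: Onoma) -> List[Tuple[int, int, str, Optional[str]]]:
    """Return (start, end, type, subtype) spans, with end exclusive.

    Matching is exact and case-sensitive. At each position the longest
    matching entry wins, and scanning resumes at the first token after
    the match, so nested names are never reported.
    """
    max_len = max((len(key) for key in onoma), default=0)
    spans = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in onoma:
                name_type, subtype = onoma[key]
                spans.append((i, i + n, name_type, subtype))
                i += n  # advance past the matched name
                matched = True
                break
        if not matched:
            i += 1
    return spans
```

With made-up entries for "New York" and "New York Times", the token sequence "The New York Times covers New York" yields one span for "New York Times" and one for the final "New York"; the "New York" inside "New York Times" is not reported separately.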

It is possible to use both a name dictionary and a statistical name tagger. In this case the statistical tagger is applied first, followed by the onoma tagger:

processSentence = ..., tagNames, tagNamesFromOnoma, ...

A token sequence in the text which matches an onoma entry will be tagged by the onoma tagger unless some name tagged by the statistical tagger is partially but not wholly contained in the sequence. In particular, this means that if a sequence X Y Z has been tagged by the statistical tagger, shorter sequences such as Y Z or partially overlapping sequences such as W X will not be retagged by the onoma tagger.
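
The overlap rule can be stated as a small predicate. The following is a sketch in Python rather than Jet's own code; the function name onoma_may_tag and the representation of names as half-open token offsets (start, end) are assumptions made for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open token offsets: [start, end)


def onoma_may_tag(match: Span, statistical: List[Span]) -> bool:
    """Decide whether an onoma match may be tagged.

    The match is rejected if any statistically tagged name overlaps it
    without being wholly contained in it.
    """
    start, end = match
    for s_start, s_end in statistical:
        overlaps = s_start < end and start < s_end
        wholly_contained = start <= s_start and s_end <= end
        if overlaps and not wholly_contained:
            return False
    return True
```

For tokens W X Y Z at offsets 0 to 3, with X Y Z tagged statistically as (1, 4), the onoma matches Y Z (2, 4) and W X (0, 2) are rejected, while a match covering X Y Z exactly (1, 4) or all of W X Y Z (0, 4) is accepted.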