Document and sentence processing is controlled by a set of scripts. Each script specifies a series of actions to be performed on a portion of the document. The actions correspond for the most part to the tools or annotators provided with Jet.
Scripts are defined in the Jet configuration file. A script definition has the form
scriptName = action1, action2, ...
The following actions are allowed:
tokenize | divide text into tokens |
tag(XMLtag) | assign annotation XMLtag to text enclosed in <XMLtag> ... </XMLtag> |
sentenceSplit | divide text into sentences |
lexLookup | look up tokens in lexicon |
pruneTags | use HMM POS tagger to select most likely POS from those given in lexicon |
tagPOS | use HMM POS tagger to assign Penn POS tags |
tagJet | use HMM POS tagger to assign Jet POS tags |
tagNames | use HMM name tagger to assign MUC name tags |
chunk | use Maxent tagger to identify noun groups |
parse | parse sentence using context-free parser |
statParse | parse sentence using TreeBank-trained probabilistic parser |
pat(patternSetName) | apply patterns in patternSetName |
resolve | resolve references |
tag : scriptName | apply script scriptName to each span of text that carries an annotation of type tag |
Note the last action: it allows one script to invoke another script over a portion of the text.
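The three shapes of action listed above (a bare name, a name with a parameter, and tag:scriptName) can be told apart mechanically. The following is a minimal Python sketch of such a split, not Jet's own parser; the names Action and parse_script are hypothetical, and the sketch assumes that parameters and script names contain no commas.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Action(NamedTuple):
    kind: str            # "simple", "param", or "invoke"
    name: str            # action name, or the annotation type for "invoke"
    arg: Optional[str]   # parameter, or the script to invoke; None if absent


def parse_script(line: str) -> Tuple[str, List[Action]]:
    """Split 'name = a, b(x), tag:script' into a script name and its actions."""
    script_name, _, body = line.partition("=")
    actions = []
    for item in body.split(","):
        item = item.strip()
        if not item:
            continue
        # Parameterized action such as tag(TEXT) or pat(dates).
        match = re.fullmatch(r"(\w+)\s*\(\s*(\w+)\s*\)", item)
        if match:
            actions.append(Action("param", match.group(1), match.group(2)))
        elif ":" in item:
            # tag:scriptName, which runs another script over each tagged span.
            tag, _, target = item.partition(":")
            actions.append(Action("invoke", tag.strip(), target.strip()))
        else:
            actions.append(Action("simple", item, None))
    return script_name.strip(), actions


if __name__ == "__main__":
    print(parse_script("processSentence = lexLookup, pruneTags, pat(dates)"))
```

Keeping the annotation type and the target script as separate fields makes the nesting between scripts explicit, which is the point of the last action in the table.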
In batch mode (or when the processDocuments menu item is selected from the console), the processDocument script is applied to each input document. When a sentence is typed in directly at the console, the processSentence script is applied. The following scripts are defined by default:
processDocument = tag(TEXT), TEXT:processTextZone
processTextZone = tokenize, sentenceSplit, sentence:processSentence
This means that when Jet processes a document, it looks for tags of the form <TEXT> ... </TEXT> in the document and adds an annotation of type TEXT to the enclosed text. It then runs the script processTextZone on each such text. That script runs the tokenizer and then the sentence splitter, which adds annotations of type sentence to the text. Finally, the script processSentence is run on the text subsumed by each sentence annotation.
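The nesting of the default scripts can be made visible by expanding each tag:scriptName action in place. The sketch below is illustrative only and does not call Jet; the function trace and the SCRIPTS dictionary are hypothetical names, and because processSentence has no default, a simple definition (lexLookup, parse) is assumed for it here.

```python
# The two default scripts, plus an assumed processSentence for illustration.
SCRIPTS = {
    "processDocument": ["tag(TEXT)", "TEXT:processTextZone"],
    "processTextZone": ["tokenize", "sentenceSplit", "sentence:processSentence"],
    "processSentence": ["lexLookup", "parse"],  # assumption: no default exists
}


def trace(script, scripts, depth=0):
    """Return an indented outline of the actions a script performs."""
    lines = []
    indent = "  " * depth
    for action in scripts[script]:
        if ":" in action:
            # tag:scriptName runs another script over each annotated span.
            tag, target = (part.strip() for part in action.split(":", 1))
            lines.append(f"{indent}for each {tag} annotation, run {target}:")
            lines.extend(trace(target, scripts, depth + 1))
        else:
            lines.append(f"{indent}{action}")
    return lines


if __name__ == "__main__":
    print("\n".join(trace("processDocument", SCRIPTS)))
```

The outline shows three levels: the document, each TEXT zone within it, and each sentence within a zone.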
There is no default for the processSentence script. A simple example, which looks words up in the dictionary and then runs the parser, would be
processSentence = lexLookup, parse
To look up words, prune the result using the POS tagger, and then apply two sets of patterns, for dates and names, one would write
processSentence = lexLookup, pruneTags, pat(dates), pat(names)
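Because processSentence has no default, a configuration that relies on the default scripts but omits it refers to a script that does not exist. The following Python sketch shows one way such a gap, or a misspelled action name, might be caught before running Jet. It is a stand-alone check under stated assumptions, not part of Jet: check_scripts and KNOWN_ACTIONS are hypothetical names, and the action list is copied from the table above.

```python
import re

# Action names taken from the table of allowed actions.
KNOWN_ACTIONS = {
    "tokenize", "tag", "sentenceSplit", "lexLookup", "pruneTags", "tagPOS",
    "tagJet", "tagNames", "chunk", "parse", "statParse", "pat", "resolve",
}


def check_scripts(scripts):
    """Return a list of problems found in a {scriptName: definition} mapping."""
    problems = []
    for name, body in scripts.items():
        for item in (part.strip() for part in body.split(",")):
            if ":" in item:
                # tag:scriptName must point at a script that is defined.
                target = item.split(":", 1)[1].strip()
                if target not in scripts:
                    problems.append(f"{name}: script '{target}' is not defined")
            else:
                # Compare the leading identifier, ignoring any (parameter).
                action = re.match(r"\w+", item)
                if action is None or action.group(0) not in KNOWN_ACTIONS:
                    problems.append(f"{name}: unknown action '{item}'")
    return problems


if __name__ == "__main__":
    defaults = {
        "processDocument": "tag(TEXT), TEXT:processTextZone",
        "processTextZone": "tokenize, sentenceSplit, sentence:processSentence",
    }
    print(check_scripts(defaults))
```

With only the two default scripts, the check reports that processSentence is undefined; adding either of the processSentence definitions shown above clears the report.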