Jet scripts

Jet scripts: document and sentence processing

When Jet is run in batch mode, the basic unit of processing is the document. In addition, the Jet console can be used to enter and process individual sentences.

Document and sentence processing is controlled by a set of scripts. Each script specifies a series of actions to be performed on a portion of the document. The actions correspond for the most part to the tools or annotators provided with Jet.

Scripts are defined in the Jet configuration file. A script definition has the form

scriptName = action₁, action₂, ...
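As an illustration only, the definition form above can be split into a script name and a list of actions. The Python sketch below is not part of Jet; the function name parse_script_definition is hypothetical, and it assumes that whitespace around names is not significant and that action arguments never contain commas.

```python
def parse_script_definition(line):
    """Split 'name = action1, action2, ...' into (name, [actions])."""
    name, sep, body = line.partition("=")
    if not sep or not name.strip():
        raise ValueError(f"not a script definition: {line!r}")
    # Assumption: actions are separated by commas and no argument contains one.
    actions = [a.strip() for a in body.split(",") if a.strip()]
    return name.strip(), actions


name, actions = parse_script_definition(
    "processTextZone = tokenize, sentenceSplit, sentence:processSentence"
)
print(name)
print(actions)
```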

The following actions are allowed:

tokenize	divide text into tokens
tag(XMLtag)	assign annotation XMLtag to text enclosed in `<XMLtag> ... </XMLtag>`
sentenceSplit	divide text into sentences
lexLookup	look up tokens in lexicon
pruneTags	use HMM POS tagger to select most likely POS from those given in lexicon
tagPOS	use HMM POS tagger to assign Penn POS tags
tagJet	use HMM POS tagger to assign Jet POS tags
tagNames	use HMM name tagger to assign MUC name tags
chunk	use Maxent tagger to identify noun groups
parse	parse sentence using context-free parser
statParse	parse sentence using TreeBank-trained probabilistic parser
pat(patternSetName)	apply patterns in patternSetName
resolve	resolve references
tag : scriptName	apply script scriptName to every span of text that carries an annotation of type tag

Note the last action: it allows one script to invoke another script over a portion of the text.
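The actions listed above fall into four shapes: a bare action name, tag(XMLtag), pat(patternSetName), and tag : scriptName. A minimal Python sketch of telling them apart follows. It is illustrative only and says nothing about how Jet itself parses actions; classify_action and SIMPLE_ACTIONS are hypothetical names, and the regular expressions assume that names contain no whitespace, parentheses, or colons.

```python
import re

# The parameterless actions from the list above.
SIMPLE_ACTIONS = {
    "tokenize", "sentenceSplit", "lexLookup", "pruneTags", "tagPOS",
    "tagJet", "tagNames", "chunk", "parse", "statParse", "resolve",
}


def classify_action(action):
    """Return a tuple describing the action: (kind, argument, ...)."""
    action = action.strip()
    if action in SIMPLE_ACTIONS:
        return ("simple", action)
    # tag(XMLtag) or pat(patternSetName)
    m = re.fullmatch(r"(tag|pat)\(\s*([^()\s]+)\s*\)", action)
    if m:
        return (m.group(1), m.group(2))
    # annotationType : scriptName, which invokes another script
    m = re.fullmatch(r"([^:\s]+)\s*:\s*([^:\s]+)", action)
    if m:
        return ("invoke", m.group(1), m.group(2))
    raise ValueError(f"unrecognized action: {action!r}")


for a in ["tokenize", "tag(TEXT)", "pat(dates)", "TEXT:processTextZone"]:
    print(a, "->", classify_action(a))
```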

In batch mode (or when the processDocuments menu item is selected from the console), the processDocument script is applied to each input document. When a sentence is typed in directly at the console, the processSentence script is applied. The following scripts are defined by default:

processDocument = tag(TEXT), TEXT:processTextZone
processTextZone = tokenize, sentenceSplit, sentence:processSentence

This means that when Jet processes a document, it looks for text enclosed in `<TEXT> ... </TEXT>` tags and adds an annotation of type TEXT to the enclosed text. It then runs the script processTextZone on each such text zone. processTextZone runs the tokenizer and then the sentence splitter on that zone; the sentence splitter adds annotations of type sentence. Finally, for the text subsumed by each sentence annotation, Jet runs the script processSentence.
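The nesting described above can be traced with a toy interpreter. The Python sketch below is an assumption-laden illustration, not Jet's implementation: run_script and SCRIPTS are hypothetical names, tag(TEXT) and sentenceSplit are replaced by simple regular-expression stand-ins, every other action is only logged, and processSentence is given an example definition because Jet supplies no default for it.

```python
import re

SCRIPTS = {
    "processDocument": ["tag(TEXT)", "TEXT:processTextZone"],
    "processTextZone": ["tokenize", "sentenceSplit", "sentence:processSentence"],
    "processSentence": ["lexLookup", "parse"],  # example only; no default exists
}


def run_script(name, text, start, end, annotations, log):
    """Apply script `name` to text[start:end], recording each step in `log`."""
    for action in SCRIPTS[name]:
        if action == "tag(TEXT)":
            # Stand-in: annotate the contents of each <TEXT> ... </TEXT> element.
            for m in re.finditer(r"<TEXT>(.*?)</TEXT>", text[start:end], re.S):
                annotations.append(("TEXT", start + m.start(1), start + m.end(1)))
        elif action == "sentenceSplit":
            # Stand-in: a sentence is a run of text ending in '.', '!' or '?'.
            for m in re.finditer(r"[^.!?\s][^.!?]*[.!?]", text[start:end]):
                annotations.append(("sentence", start + m.start(), start + m.end()))
        elif ":" in action:
            # annotationType:scriptName runs another script on each matching span.
            ann_type, script = (part.strip() for part in action.split(":"))
            spans = [(s, e) for t, s, e in annotations
                     if t == ann_type and start <= s and e <= end]
            for s, e in spans:
                run_script(script, text, s, e, annotations, log)
        else:
            # All other actions are only recorded, together with their text span.
            log.append((action, text[start:end]))


doc = "<DOC><TEXT>Jet reads text. It finds names.</TEXT></DOC>"
annotations, log = [], []
run_script("processDocument", doc, 0, len(doc), annotations, log)
for action, span in log:
    print(f"{action}: {span!r}")
```

In this trace, tokenize is logged once for the whole text zone, while lexLookup and parse are logged once per sentence, mirroring the order in which the default scripts hand control to one another.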

There is no default for the processSentence script. A simple example, to look words up in the dictionary and then run the parser, would be

processSentence = lexLookup, parse

To look up words, prune the result using the POS tagger, and then apply two sets of patterns, for dates and names, one would write

processSentence = lexLookup, pruneTags, pat(dates), pat(names)
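Because processSentence has no default, a configuration that relies on the default scripts still has to define it. The Python sketch below shows one way such a gap could be detected before running. It is a hypothetical helper, not a Jet feature; undefined_scripts is an invented name, and it assumes that any action containing a colon is a script invocation of the form tag : scriptName.

```python
def undefined_scripts(scripts):
    """Return the names of scripts that are invoked but never defined."""
    missing = set()
    for actions in scripts.values():
        for action in actions:
            # Assumption: an invocation has the form 'annotationType:scriptName'.
            if ":" in action:
                target = action.split(":", 1)[1].strip()
                if target not in scripts:
                    missing.add(target)
    return sorted(missing)


defaults = {
    "processDocument": ["tag(TEXT)", "TEXT:processTextZone"],
    "processTextZone": ["tokenize", "sentenceSplit", "sentence:processSentence"],
}
print(undefined_scripts(defaults))

configured = dict(
    defaults,
    processSentence=["lexLookup", "pruneTags", "pat(dates)", "pat(names)"],
)
print(undefined_scripts(configured))
```

With only the default scripts, the helper reports processSentence as missing; once the definition from the example above is added, the list is empty.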