Jet scripts:  document and sentence processing

When Jet is run in batch mode, the basic unit of processing is the document.  In addition, the Jet console can be used to enter and process individual sentences.

Document and sentence processing is controlled by a set of scripts.  Each script specifies a series of actions to be performed on a portion of the document.  The actions correspond for the most part to the tools or annotators provided with Jet.

Scripts are defined in the Jet configuration file.  A script definition has the form

scriptName = action1, action2, ...

The following actions are allowed:
 
tokenize              divide text into tokens
tag(XMLtag)           assign annotation XMLtag to text enclosed in <XMLtag> ... </XMLtag>
sentenceSplit         divide text into sentences
lexLookup             look up tokens in lexicon
pruneTags             use HMM POS tagger to select most likely POS from those given in lexicon
tagPOS                use HMM POS tagger to assign Penn POS tags
tagJet                use HMM POS tagger to assign Jet POS tags
tagNames              use HMM name tagger to assign MUC name tags
chunk                 use Maxent tagger to identify noun groups
parse                 parse sentence using context-free parser
statParse             parse sentence using TreeBank-trained probabilistic parser
pat(patternSetName)   apply patterns in patternSetName
resolve               resolve references
tag : scriptName      apply script scriptName to every instance of text carrying an annotation of type tag

Note the last action:  it allows one script to invoke another script over a portion of the text.
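
The three shapes of action listed above (a bare name, a name with a parenthesized argument, and tag : scriptName) can be told apart with a few lines of code.  The following Python sketch is only an illustration, not Jet's own parser; the function parse_script and the labels "simple", "call" and "subscript" are hypothetical names introduced here.

```python
import re

# Hypothetical helper, not part of Jet: split one
# "scriptName = action1, action2, ..." line into its name and actions.
ACTION_CALL = re.compile(r"^(\w+)\((\w+)\)$")          # e.g. pat(dates), tag(TEXT)
ACTION_SUBSCRIPT = re.compile(r"^(\w+)\s*:\s*(\w+)$")  # e.g. sentence:processSentence


def parse_script(line):
    """Return (script name, list of (kind, name, argument) tuples)."""
    name, _, body = line.partition("=")
    actions = []
    for raw in body.split(","):
        item = raw.strip()
        if not item:
            continue
        call = ACTION_CALL.match(item)
        sub = ACTION_SUBSCRIPT.match(item)
        if call:
            # An action with an argument, such as tag(XMLtag) or pat(patternSetName).
            actions.append(("call", call.group(1), call.group(2)))
        elif sub:
            # tag : scriptName, which invokes another script on annotated text.
            actions.append(("subscript", sub.group(1), sub.group(2)))
        else:
            # A bare action such as tokenize or lexLookup.
            actions.append(("simple", item, None))
    return name.strip(), actions
```

For example, parse_script("processTextZone = tokenize, sentenceSplit, sentence:processSentence") yields two "simple" actions followed by one "subscript" action that points at processSentence.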

In batch mode (or when the processDocuments menu item is selected from the console), the processDocument script is applied to each input document.  When a sentence is typed in directly at the console, the processSentence script is applied.   The following scripts are defined by default:

processDocument = tag(TEXT), TEXT:processTextZone
processTextZone = tokenize, sentenceSplit, sentence:processSentence

This means that when Jet processes a document, it looks for tags of the form <TEXT> ... </TEXT> in the document and adds an annotation of type TEXT to the enclosed text.  It then runs the script processTextZone on each such text.  The script processTextZone runs the tokenizer and then the sentence splitter on that text.  The sentence splitter adds annotations of type sentence to the text.  Finally, for the text subsumed by each sentence annotation, Jet runs the script processSentence.
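
The nesting described above can be traced with a toy interpreter.  The Python sketch below is an assumption-laden illustration, not Jet code: run_script is a hypothetical function, the regular expressions are crude stand-ins for Jet's tag reader and sentence splitter, and processSentence is given an example value only so that the trace reaches the innermost level.

```python
import re

# Script table for the trace. The first two entries are the defaults quoted
# above; Jet defines no default for processSentence, so its value here is
# only an example.
SCRIPTS = {
    "processDocument": ["tag(TEXT)", "TEXT:processTextZone"],
    "processTextZone": ["tokenize", "sentenceSplit", "sentence:processSentence"],
    "processSentence": ["lexLookup", "parse"],
}


def run_script(name, doc, span, annotations, trace, depth=0):
    """Apply script `name` to doc[start:end], logging each action in `trace`."""
    start, end = span
    for action in SCRIPTS[name]:
        trace.append("  " * depth + f"{name}: {action} on {doc[start:end]!r}")
        if action == "tag(TEXT)":
            # Annotate the text enclosed in <TEXT> ... </TEXT>.
            for m in re.finditer(r"<TEXT>(.*?)</TEXT>", doc[start:end], re.S):
                annotations.append(("TEXT", start + m.start(1), start + m.end(1)))
        elif action == "sentenceSplit":
            # Naive stand-in: a sentence ends at '.', '!' or '?'.
            for m in re.finditer(r"[^.!?\s][^.!?]*[.!?]", doc[start:end]):
                annotations.append(("sentence", start + m.start(), start + m.end()))
        elif ":" in action:
            # tag:scriptName runs another script on each matching annotation
            # that lies inside the current span.
            tag, sub = action.split(":")
            for kind, s, e in list(annotations):
                if kind == tag and start <= s and e <= end:
                    run_script(sub, doc, (s, e), annotations, trace, depth + 1)
        # All other actions (tokenize, lexLookup, parse) are logged only.


doc = "<DOC><TEXT>Jet parses text. It is fast.</TEXT></DOC>"
annotations, trace = [], []
run_script("processDocument", doc, (0, len(doc)), annotations, trace)
```

Printing the entries of trace shows processDocument at the outer level, processTextZone indented once for the TEXT zone, and processSentence indented twice, once per sentence.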

There is no default for the processSentence script.  A simple example, to look words up in the dictionary and then run the parser, would be

processSentence  = lexLookup, parse

To look up words, prune the result using the POS tagger, and then apply two sets of patterns, for dates and names, one would write

processSentence  = lexLookup, pruneTags, pat(dates), pat(names)
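
Because processSentence has no default, a configuration that omits it leaves the default processTextZone script pointing at an undefined script.  The following Python sketch shows one way to check for that before running Jet.  It is a hypothetical helper, not part of Jet, and it assumes that user definitions simply override the defaults quoted in this document.

```python
# Defaults quoted in this document; processSentence has no default.
DEFAULT_SCRIPTS = {
    "processDocument": "tag(TEXT), TEXT:processTextZone",
    "processTextZone": "tokenize, sentenceSplit, sentence:processSentence",
}


def undefined_scripts(user_scripts):
    """Return the sorted names of scripts that are invoked but never defined.

    `user_scripts` maps script names to their comma-separated action lists.
    Assumption: a user definition replaces a default of the same name.
    """
    scripts = {**DEFAULT_SCRIPTS, **user_scripts}
    missing = set()
    for body in scripts.values():
        for action in body.split(","):
            if ":" in action:
                # tag:scriptName refers to another script by name.
                target = action.split(":", 1)[1].strip()
                if target not in scripts:
                    missing.add(target)
    return sorted(missing)
```

With no user scripts the check returns ['processSentence']; once a definition such as the dates-and-names example is supplied, it returns an empty list.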