Jet can be used in two ways: processing individual documents interactively, for instructional and development purposes, and processing large volumes of text for data mining. This tutorial section provides an introduction to Jet using its interactive capabilities. All of the sample property files and language resources are provided as part of the standard Jet distribution.
We will begin by parsing a few sentences with a very simple grammar. Each sentence goes through 3 stages of processing: tokenization; dictionary look-up; and parsing. Two 'resource files' are involved: a grammar and a lexicon. Jet's operation is controlled by a properties file; the Jet properties file must specify the resource file names, and the processing steps for each sentence (tokenization is automatic for sentences entered from the console):
Jet.dataPath = directory with grammars and dictionaries
Grammar.fileName = grammar file name
EnglishLex.fileName1 = dictionary file name
processSentence = lexLookup, parse
The Jet distribution includes a grammar, 'grammar1.txt', a dictionary, 'tiny1.dict', and a properties file 'tinyParse.jet', shown here:
Jet.dataPath = data
Grammar.fileName = grammar1.txt
EnglishLex.fileName1 = tiny1.dict
processSentence = lexLookup, parse
After installing Jet, type jet into a command window, select tinyParse.jet as your properties file, and you are ready to go.
Note that resource files are generally loaded from the directory specified by Jet.dataPath (in this case, directory data) relative to the Jet home directory. To use a file from your own current directory, prefix the file name with "./". For example, if you put a copy of the dictionary in your local directory you would change the third line to EnglishLex.fileName1 = ./tiny1.dict
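The lookup rule above can be sketched in a few lines of Python. This is only an illustration, not Jet's own loader: the function names `parse_properties` and `resolve_resource` and the Jet home directory `/opt/jet` are hypothetical, and the properties parser handles only the plain `key = value` lines and `#` comments seen in this tutorial.

```python
from pathlib import PurePosixPath

def parse_properties(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def resolve_resource(jet_home, props, key):
    """Return the path a resource name resolves to under the rule described above."""
    name = props[key]
    if name.startswith("./"):
        # Explicitly local: relative to the user's current directory.
        return PurePosixPath(name)
    # Otherwise: <Jet home>/<Jet.dataPath>/<file name>.
    return PurePosixPath(jet_home) / props["Jet.dataPath"] / name

TINY_PARSE = """\
Jet.dataPath = data
Grammar.fileName = grammar1.txt
EnglishLex.fileName1 = tiny1.dict
processSentence = lexLookup, parse
"""

props = parse_properties(TINY_PARSE)
# "/opt/jet" is a made-up Jet home directory used only for this example.
print(resolve_resource("/opt/jet", props, "Grammar.fileName"))

# Switching the dictionary to a local copy, as in the text:
props["EnglishLex.fileName1"] = "./tiny1.dict"
print(resolve_resource("/opt/jet", props, "EnglishLex.fileName1"))
```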
The dictionary entries used here have the form
word,, cat = part-of-speech;
where part-of-speech is some pre-terminal symbol in the grammar. For example,
my,, cat = art;
old,, cat = adj;
cat,, cat = n;
cats,, cat = n;
chases,, cat = v;
mouse,, cat = n;
mice,, cat = n;
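To make the entry format concrete, here is a minimal Python sketch that reads entries of exactly this shape into a word-to-category table. It is not Jet's lexicon reader; the regular expression and the name `parse_lexicon` are assumptions for illustration, and any richer entry syntax is simply skipped.

```python
import re

# Matches only the simple form shown above: word,, cat = part-of-speech;
ENTRY = re.compile(r"^\s*(?P<word>[^,]+),,\s*cat\s*=\s*(?P<cat>\w+)\s*;\s*$")

def parse_lexicon(text):
    """Map each word to the set of categories listed for it."""
    lexicon = {}
    for line in text.splitlines():
        m = ENTRY.match(line)
        if m:
            lexicon.setdefault(m.group("word").strip(), set()).add(m.group("cat"))
    return lexicon

TINY_DICT = """\
my,, cat = art;
old,, cat = adj;
cat,, cat = n;
cats,, cat = n;
chases,, cat = v;
mouse,, cat = n;
mice,, cat = n;
"""

lexicon = parse_lexicon(TINY_DICT)
print(sorted(lexicon))
```

A set is used per word so that a word listed under several categories would keep all of them.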
Each grammar rule has the form
symbol := element element ... | element element ... | ... ;
where each element is either a non-terminal symbol (defined elsewhere in the grammar), a pre-terminal symbol (defined in the lexicon), or a string (enclosed in double quotes). For example,
sentence := np vp;
np := n | art n | art adj n;
vp := v | v np | v np "to" np;
Note that this grammar does not include a sentence endmark (period), so the sentences you type in should not include a period.
The top-down parser operates with a goal stack. It produces the message "looking for x" when it removes x (a symbol or string) from the goal stack; it then tries to satisfy this goal, either by looking for an instance of x as the next word in the sentence or by expanding x using its definition. It produces the message "found x" when it has succeeded in building a node of type x. The bottom-up parser produces the message "adding node x" when it has succeeded in building a node of type x.
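The goal-stack idea can be illustrated with a small backtracking recognizer in Python over the sample grammar and dictionary. This is a sketch under several assumptions, not Jet's parser: the grammar and lexicon are hard-coded, the "/x" end-of-rule marker is a device invented here to know when to report "found x", and the exact order and wording of Jet's real trace may differ (in particular, this sketch also reports nodes on paths it later abandons).

```python
# Sample grammar and dictionary from this tutorial, written as Python data.
GRAMMAR = {
    "sentence": [["np", "vp"]],
    "np": [["n"], ["art", "n"], ["art", "adj", "n"]],
    "vp": [["v"], ["v", "np"], ["v", "np", '"to"', "np"]],
}
LEXICON = {"my": "art", "old": "adj", "cat": "n", "cats": "n",
           "chases": "v", "mouse": "n", "mice": "n"}

def recognize(words, goals, pos, trace):
    """Return True if the goal stack can be satisfied by words[pos:] exactly."""
    if not goals:
        return pos == len(words)
    goal, rest = goals[0], goals[1:]
    if goal.startswith("/"):
        # Hypothetical marker: every element of a rule for this symbol matched.
        trace.append(f"found {goal[1:]}")
        return recognize(words, rest, pos, trace)
    trace.append(f"looking for {goal}")
    if goal in GRAMMAR:
        # Non-terminal: try each alternative in turn, backtracking on failure.
        return any(recognize(words, alt + ["/" + goal] + rest, pos, trace)
                   for alt in GRAMMAR[goal])
    if pos < len(words):
        word = words[pos]
        if goal.startswith('"'):
            matched = goal.strip('"') == word      # literal string
        else:
            matched = LEXICON.get(word) == goal    # pre-terminal
        if matched:
            trace.append(f"found {goal}")
            return recognize(words, rest, pos + 1, trace)
    return False

trace = []
ok = recognize("my old cat chases mice".split(), ["sentence"], 0, trace)
print(ok)
print("\n".join(trace[:6]))
```

Because the sample grammar has no left-recursive rules, the recursion always terminates; a grammar with a rule such as `np := np "and" np` would need a different strategy.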
# JET properties file for POS tagging
Jet.dataPath = data
Tags.fileName = pos_hmm.txt
processSentence = tagPOS
To see the results of the tagger, on the "tagger" menu, turn on the "POS tagger trace".
# JET properties file for name tagging
Jet.dataPath = data
NameTags.fileName = MUCnameHMM.txt
processSentence = tagNames
In addition to handling individual sentences from the console, Jet is able to process complete documents. Each document should begin with the line <DOC> and end with the line </DOC>. A file can contain one or more documents. The portion of the document to be processed by Jet must be enclosed in <TEXT> and </TEXT> tags; anything outside of these tags is ignored.
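As a rough picture of the document layout just described, the following Python sketch pulls the <TEXT> portion out of each <DOC> in a file. It is not Jet's document reader; the function name `text_regions`, the sample sentences, and the <DOCID> line (included only to show that material outside <TEXT> is ignored) are all invented for this example.

```python
import re

DOC = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
TEXT = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)

def text_regions(file_contents):
    """Return, for each document in the file, the text inside <TEXT>...</TEXT>."""
    regions = []
    for doc in DOC.finditer(file_contents):
        body = TEXT.search(doc.group(1))
        regions.append(body.group(1).strip() if body else "")
    return regions

# Invented sample file containing two documents.
SAMPLE = """\
<DOC>
<DOCID> sample-1 </DOCID>
<TEXT>
Fred Smith works for IBM.
</TEXT>
</DOC>
<DOC>
<TEXT>
He lives in New York.
</TEXT>
</DOC>
"""

print(text_regions(SAMPLE))
```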
The name of the file containing the document(s) to be processed is specified in the Jet properties file as
JetTest.fileName1 = ./article.txt
(assuming here that the file is in your local directory). If several files are to be processed, they can be specified as fileName1, fileName2, etc. The document is segmented into sentences and each sentence is processed (as specified by the processSentence properties). When a document is read from a file, tokenization is not automatic; the tokenize command must be explicitly included, generally as the first command in processSentence. When the document is complete, if the properties file contains a line of the form
WriteSGML.type = ENAMEX
then all annotations of the specified type (in this case, ENAMEX) are converted to SGML tags and the resulting document is written to a file whose name is computed by adding "response-" to the front of the input document file name (so in the example above, the output would be written to response-article.txt).
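The output step can be pictured with a short Python sketch: one function derives the response file name, another turns annotations of one type into inline SGML tags. This is an illustration only. The annotation representation (start offset, end offset, type, attribute dictionary), the TYPE attribute values, and the function names are assumptions made here, not Jet's data structures; the sketch also assumes the selected annotations do not overlap.

```python
import os

def response_name(input_name):
    """Prefix the input file's base name with 'response-'."""
    return "response-" + os.path.basename(input_name)

def to_sgml(text, annotations, tag_type):
    """Wrap each span annotated with tag_type in <TYPE ...>...</TYPE> tags."""
    out, pos = [], 0
    for start, end, ann_type, attrs in sorted(annotations, key=lambda a: a[0]):
        if ann_type != tag_type:
            continue  # only the type named by WriteSGML.type is written
        attr_text = "".join(f' {k}="{v}"' for k, v in attrs.items())
        out.append(text[pos:start])
        out.append(f"<{ann_type}{attr_text}>{text[start:end]}</{ann_type}>")
        pos = end
    out.append(text[pos:])
    return "".join(out)

# Invented sentence and hand-written annotations (character offsets).
text = "Fred Smith works for IBM."
annotations = [
    (0, 10, "ENAMEX", {"TYPE": "PERSON"}),
    (21, 24, "ENAMEX", {"TYPE": "ORGANIZATION"}),
    (11, 16, "token", {}),
]

print(response_name("./article.txt"))   # response-article.txt
print(to_sgml(text, annotations, "ENAMEX"))
```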
The tools menu on the Jet console contains a menu item "Process Documents". Selecting this item causes all the document files to be processed. Traces are written to the Console.
The results of all the linguistic analysis are recorded as a set of annotations on the document. Once multiple annotation types are involved, traces are no longer convenient for examining the results, so Jet also provides a document viewer. Selecting "Process Documents and View Annotated Documents" causes Jet to create a document viewer window for each document it processes.
The document viewer has 3 panes: the Annotated Document (the largest pane), with the list of annotation types to its left and a pane for displaying individual annotations below. If a type is selected from the type list, portions of the document tagged with that type of annotation are highlighted. Selecting any portion of the document by dragging the mouse causes all annotations within the selected region to be listed in the Annotations pane.
Jet includes a pattern language
which allows the developer to write regular expressions stated
in terms of specific tokens and annotations, and to add
new annotations to the document. Patterns are organized
into pattern sets; having several sets allows annotation
structures to be built up in stages. We will begin with
one pattern set, chunks, designed to recognize noun groups
and verb groups. These pattern sets are invoked
from the Jet properties file by specifying a processing step of the
form pat(pattern-set); in this
case, pat(chunks).
# JET properties file to run chunk patterns
Jet.dataPath = data
Tags.fileName = pos_hmm.txt
Pattern.fileName1 = chunkPatterns.txt
processSentence = tagJet, pat(chunks)
Note that tagJet assigns Jet
part-of-speech tags, as listed in the Jet documentation.
These are the tags which must be used with these patterns.
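A processSentence value is a comma-separated list of steps, some of which take an argument in parentheses, as pat(chunks) does. The Python sketch below splits such a value into (step, argument) pairs. It is only a guess at a convenient representation: the name `parse_steps` and the assumption that arguments contain no commas or nested parentheses are made up for this example and say nothing about how Jet itself reads the property.

```python
import re

# A step is a name, optionally followed by one parenthesized argument.
STEP = re.compile(r"^(?P<name>\w+)(?:\((?P<arg>[^()]*)\))?$")

def parse_steps(value):
    """Split a processSentence value into (step name, argument or None) pairs."""
    steps = []
    for item in value.split(","):
        m = STEP.match(item.strip())
        if m is None:
            raise ValueError(f"unrecognized step: {item!r}")
        steps.append((m.group("name"), m.group("arg")))
    return steps

print(parse_steps("tagJet, pat(chunks)"))
# [('tagJet', None), ('pat', 'chunks')]
```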
Two simple traces are provided for pattern matching (on the pattern
menu). The Pattern Match Trace prints a message every
time a (complete) pattern matches. If several patterns match, only one
will be applied (the one spanning the most tokens; among those
spanning the same tokens, the pattern appearing first in the
file); this is shown by the Pattern Apply Trace.
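The selection rule just described (most tokens first, then earliest in the file) amounts to a two-part sort key. The Python sketch below shows that rule in isolation; the `Match` record, the pattern names, and the token offsets are hypothetical and do not reflect Jet's internal representation.

```python
from dataclasses import dataclass

@dataclass
class Match:
    pattern: str   # pattern name (hypothetical)
    order: int     # position of the pattern in the pattern file, 0 = first
    start: int     # index of the first token matched
    end: int       # index one past the last token matched

def select(matches):
    """Pick the match spanning the most tokens; ties go to the earliest pattern."""
    return min(matches, key=lambda m: (-(m.end - m.start), m.order))

# Three invented matches competing at the same starting token.
candidates = [
    Match("ng-short", order=0, start=0, end=2),
    Match("ng-long", order=2, start=0, end=3),
    Match("ng-alt", order=1, start=0, end=3),
]

print(select(candidates).pattern)   # ng-alt
```

Here "ng-long" and "ng-alt" both span three tokens, so the one that appears earlier in the file is applied.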
The final example in our introduction involves reference resolution. Reference resolution takes the set of (entity) mentions in a document -- principally the noun phrases -- and groups coreferential mentions to form entities. In the example, reference resolution is applied to the output of the chunk patterns, which form noun groups. The properties file, resolve.jet, is as follows:
# JET properties file to run reference resolution on file 'article.txt'
Jet.dataPath = data
JetTest.fileName1 = ./article.txt
Tags.fileName = pos_hmm.txt
NameTags.fileName = MUCnameHMM.txt
Pattern.fileName1 = chunkPatterns.txt
processSentence = tokenize, tagJet, tagNames, pat(chunks), resolve
If you have selected "Process Documents and View Annotated Documents" and the processed document includes entity annotations, then -- in addition to the standard document viewer -- Jet will display an entity viewer, showing all the entity mentions and how they have been linked.
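The grouping half of reference resolution, turning pairwise "these two mentions corefer" links into entities, can be sketched with a small union-find in Python. The hard part, deciding which mentions corefer, is not shown: the links below are written by hand, and the mention strings and the name `group_mentions` are invented for this example. This is not a description of how Jet's resolve step works internally.

```python
def group_mentions(mentions, links):
    """Group mentions into entities, given pairs of coreferential mentions."""
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent pointers to the representative, compressing the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in links:
        parent[find(a)] = find(b)

    entities = {}
    for m in mentions:
        entities.setdefault(find(m), []).append(m)
    return list(entities.values())

# Invented mentions and hand-supplied coreference links.
mentions = ["Fred Smith", "IBM", "he", "the company"]
links = [("he", "Fred Smith"), ("the company", "IBM")]

print(group_mentions(mentions, links))
```

In a real document, mentions would be identified by position rather than by string, since the same words can occur more than once.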