An Introduction to Jet: Interactive Use

Jet can be used in two ways: processing individual documents interactively for instructional and development purposes, and processing large volumes of text for data mining. This tutorial section provides an introduction to Jet using its interactive capabilities. All of the sample property files and language resources are provided as part of the standard Jet distribution.

1. Getting Started: Parsing Sentences with a Simple Grammar

We will begin by parsing a few sentences with a very simple grammar. Each sentence goes through three stages of processing: tokenization, dictionary look-up, and parsing. Two resource files are involved: a grammar and a lexicon. Jet's operation is controlled by a properties file, which must specify the resource file names and the processing steps for each sentence (tokenization is automatic for sentences entered from the console):

Jet.dataPath         = directory with grammars and dictionaries
Grammar.fileName     = grammar file name
EnglishLex.fileName1 = dictionary file name
processSentence      = lexLookup, parse

The Jet distribution includes a grammar, 'grammar1.txt', a dictionary, 'tiny1.dict', and a properties file 'tinyParse.jet', shown here:

Jet.dataPath         = data
Grammar.fileName     = grammar1.txt
EnglishLex.fileName1 = tiny1.dict
processSentence      = lexLookup, parse
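To make the key = value layout concrete, here is a minimal Python sketch of how such a properties file could be read and how the processSentence value breaks down into an ordered list of steps. It is not Jet's own loader (Jet is a Java program); the function name parse_properties is hypothetical and chosen only for illustration.

```python
def parse_properties(text: str) -> dict:
    """Read 'key = value' lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# Contents of tinyParse.jet as listed in this tutorial.
TINY_PARSE = """\
Jet.dataPath         = data
Grammar.fileName     = grammar1.txt
EnglishLex.fileName1 = tiny1.dict
processSentence      = lexLookup, parse
"""

props = parse_properties(TINY_PARSE)

# The processing steps are a comma-separated list, applied in order.
steps = [step.strip() for step in props["processSentence"].split(",")]
print(steps)  # ['lexLookup', 'parse']
```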

After installing Jet, type jet into a command window, select tinyParse.jet as your properties file, and you are ready to go.

Note that resource files are generally loaded from the directory specified by Jet.dataPath (in this case, directory data) relative to the Jet home directory. To use a file from your own current directory, prefix the file name with "./". For example, if you put a copy of the dictionary in your local directory you would change the third line to EnglishLex.fileName1 = ./tiny1.dict
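The lookup rule just described can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this tutorial, not Jet's actual code; the function name resolve_resource and the example directories /opt/jet and /home/me are assumptions.

```python
from pathlib import PurePosixPath


def resolve_resource(jet_home: str, data_path: str, file_name: str, cwd: str) -> str:
    """Return the location a resource file name refers to (illustrative only)."""
    if file_name.startswith("./"):
        # A "./" prefix means: relative to the user's current directory.
        return str(PurePosixPath(cwd) / file_name[2:])
    # Otherwise the file is taken from <Jet home>/<Jet.dataPath>.
    return str(PurePosixPath(jet_home) / data_path / file_name)


print(resolve_resource("/opt/jet", "data", "tiny1.dict", "/home/me"))
# /opt/jet/data/tiny1.dict
print(resolve_resource("/opt/jet", "data", "./tiny1.dict", "/home/me"))
# /home/me/tiny1.dict
```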

Resource file formats

The resource files (grammar and lexicon) are free format: spaces, tabs, and newlines may be used freely. Definitions in the grammar and lexicon must end with semicolons (;). Java-style comments (both // and /* ... */) are allowed.

Format of the lexicon:

Each word which appears in a sentence must be defined in the lexicon, or the parser will fail. Each definition has the form

word,, cat = part-of-speech;

where part-of-speech is some pre-terminal symbol in the grammar. For example,

my,, cat = art;
old,, cat = adj;
cat,, cat = n;
cats,, cat = n;
chases,, cat = v;
mouse,, cat = n;
mice,, cat = n;
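As a rough illustration of this format, the following Python sketch strips Java-style comments, splits the text at semicolons, and collects the category of each word. It handles only the single-word "word,, cat = part-of-speech;" form shown above; the names strip_comments and parse_lexicon are hypothetical and this is not Jet's lexicon reader.

```python
import re


def strip_comments(text: str) -> str:
    """Remove /* ... */ and // comments."""
    text = re.sub(r"/\*.*?\*/", " ", text, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", " ", text)


def parse_lexicon(text: str) -> dict:
    """Map each word to the list of categories defined for it."""
    lexicon = {}
    for definition in strip_comments(text).split(";"):
        definition = definition.strip()
        if not definition:
            continue
        match = re.fullmatch(r"(\S+?)\s*,,\s*cat\s*=\s*(\S+)", definition)
        if match is None:
            raise ValueError(f"unrecognized definition: {definition!r}")
        word, category = match.groups()
        lexicon.setdefault(word, []).append(category)
    return lexicon


SAMPLE = """
my,, cat = art;      // article
old,, cat = adj;
/* nouns */
cat,, cat = n;  mouse,, cat = n;
chases,, cat = v;
"""

lexicon = parse_lexicon(SAMPLE)
print(lexicon["cat"])     # ['n']
print(lexicon["chases"])  # ['v']
```

A word is mapped to a list rather than a single category so that a word defined more than once keeps all of its categories.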

Format of the grammar:

The grammar consists of a set of definitions, where each definition has the form

symbol := element element ... | element element ... | ... ;

and each element is either a non-terminal symbol (defined elsewhere in the grammar), a pre-terminal symbol (defined in the lexicon), or a string (enclosed in double quotes). For example,

sentence := np vp;
np := n | art n | art adj n;
vp := v | v np | v np "to" np;

Note that this grammar does not include a sentence endmark (period), so the sentences you type in should not include a period.
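The grammar format maps naturally onto a table from each symbol to its list of alternatives. The Python sketch below shows one way to build that table from the three example definitions. It is a simplification made for illustration: it assumes quoted strings contain no ";" or "|" characters and no spaces, it does not handle comments, and the name parse_grammar is hypothetical.

```python
def parse_grammar(text: str) -> dict:
    """Map each defined symbol to a list of alternatives (lists of elements)."""
    grammar = {}
    for definition in text.split(";"):
        definition = definition.strip()
        if not definition:
            continue
        symbol, separator, body = definition.partition(":=")
        if not separator:
            raise ValueError(f"missing ':=' in {definition!r}")
        # Alternatives are separated by '|'; elements by white space.
        grammar[symbol.strip()] = [alt.split() for alt in body.split("|")]
    return grammar


GRAMMAR_TEXT = """
sentence := np vp;
np := n | art n | art adj n;
vp := v | v np | v np "to" np;
"""

grammar = parse_grammar(GRAMMAR_TEXT)
print(grammar["np"])  # [['n'], ['art', 'n'], ['art', 'adj', 'n']]
```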

Jet Console

Sentences will be entered, and results displayed, using the Jet console. At the bottom, the console contains a single line for entering sentences. The main part of the console is a scrolling text window which displays system output. At top are the Jet menus. The Parser menu allows you to select the parser to be used:

a top-down, backtracking recognizer (which does not produce a parse tree)
a top-down, backtracking parser
a bottom-up ('immediate-constituent analyzer') parser
a chart parser

It allows you to turn a parser trace on or off (the trace is displayed in the console). It also allows you to display a parse as a tree diagram.

Parser traces

The top-down parser operates with a goal stack. It produces the message "looking for x" when it removes x (a symbol or string) from the goal stack (and then tries to satisfy this goal, by either looking for an instance of x as the next word in the sentence, or by expanding x using its definition), and the message "found x" when it has succeeded in building a node of type x. The bottom-up parser produces the message "adding node x" when it has succeeded in building a node of type x.
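To give a feel for the goal-stack idea, here is a small Python sketch of a top-down, backtracking recognizer over the example grammar and lexicon from this section. It is an illustration written for this tutorial, not Jet's implementation: it records a "looking for x" message each time a goal is taken off the goal list, but because it only recognizes and builds no nodes, it emits no "found x" messages. The names GRAMMAR, LEXICON, and recognize are hypothetical, and the sketch would loop forever on a left-recursive grammar.

```python
GRAMMAR = {
    "sentence": [["np", "vp"]],
    "np": [["n"], ["art", "n"], ["art", "adj", "n"]],
    "vp": [["v"], ["v", "np"], ["v", "np", '"to"', "np"]],
}
LEXICON = {
    "my": "art", "old": "adj", "cat": "n", "cats": "n",
    "chases": "v", "mouse": "n", "mice": "n",
}


def recognize(words, trace=None):
    """Return True if the words form a sentence; log goals to trace if given."""

    def satisfy(goals, pos):
        if not goals:
            # Success only if every word has been consumed.
            return pos == len(words)
        goal, rest = goals[0], goals[1:]
        if trace is not None:
            trace.append(f"looking for {goal}")
        if goal in GRAMMAR:
            # Non-terminal: try each alternative in turn, backtracking on failure.
            return any(satisfy(alt + rest, pos) for alt in GRAMMAR[goal])
        if pos < len(words):
            word = words[pos]
            # Pre-terminal (lexicon category) or quoted string.
            if LEXICON.get(word) == goal or goal == f'"{word}"':
                return satisfy(rest, pos + 1)
        return False

    return satisfy(["sentence"], 0)


trace = []
print(recognize("my old cat chases mice".split(), trace))  # True
print(trace[:3])  # ['looking for sentence', 'looking for np', 'looking for n']
print(recognize("my cat".split()))  # False
```

The trace shows the backtracking directly: the first alternative for np (a bare n) is tried and abandoned before the longer alternatives are attempted.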

2. Part-of-speech tagging

The first tagger we will try is a part-of-speech tagger (tagPOS command). Part-of-speech tagging is provided in Jet using a bigram HMM (Hidden Markov Model); Jet includes (in the data directory) an HMM trained on most of the Penn TreeBank. To run with just the POS tagger, select the tagPOS.jet properties file:

# JET properties file for POS tagging
Jet.dataPath = data
Tags.fileName = pos_hmm.txt
processSentence = tagPOS

To see the results of the tagger, on the "tagger" menu, turn on the "POS tagger trace".
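For readers unfamiliar with bigram HMM tagging, the following Python sketch shows the standard Viterbi decoding step on a toy model. Everything in it is an assumption made for illustration: the three tags, the vocabulary, and all probabilities are invented here and have nothing to do with the contents of pos_hmm.txt, and the function name viterbi is not part of Jet. A practical tagger would also work with log probabilities and handle unknown words.

```python
def viterbi(words, tags, start, trans, emit):
    """Return the most probable tag sequence under a bigram HMM."""
    # best[t] = (probability, path) of the best tag sequence ending in tag t.
    best = {t: (start[t] * emit[t].get(words[0], 0.0), [t]) for t in tags}
    for word in words[1:]:
        new_best = {}
        for t in tags:
            # Best previous tag for reaching t (bigram transition).
            prob, path = max(
                ((best[p][0] * trans[p][t], best[p][1]) for p in tags),
                key=lambda item: item[0],
            )
            new_best[t] = (prob * emit[t].get(word, 0.0), path + [t])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Toy model: all numbers below are invented for this example.
TAGS = ["DT", "NN", "VBZ"]
START = {"DT": 0.8, "NN": 0.15, "VBZ": 0.05}
TRANS = {
    "DT": {"DT": 0.05, "NN": 0.9, "VBZ": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VBZ": 0.6},
    "VBZ": {"DT": 0.6, "NN": 0.3, "VBZ": 0.1},
}
EMIT = {
    "DT": {"the": 1.0},
    "NN": {"cat": 0.5, "mouse": 0.4, "chases": 0.1},
    "VBZ": {"chases": 1.0},
}

sentence = "the cat chases the mouse".split()
print(viterbi(sentence, TAGS, START, TRANS, EMIT))
# ['DT', 'NN', 'VBZ', 'DT', 'NN']
```

Note that "chases" can be emitted by both NN and VBZ in this toy model; the transition probabilities are what make the tagger prefer VBZ after a noun.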

3. Name tagging and Document Processing

Next we will try the named entity tagger. Several name taggers are provided in Jet, including one based on an HMM, one based on an MEMM (Maximum Entropy Markov Model), and one based on lists and hand-coded patterns. Several HMM data files are provided; the simplest is one trained on the original named entity file, from MUC6. To run with this tagger, select the tagNames.jet properties file:

# JET properties file for name tagging
Jet.dataPath        = data
NameTags.fileName   = MUCnameHMM.txt
processSentence     = tagNames

In addition to handling individual sentences from the console, Jet is able to process complete documents. Each document should begin with the line <DOC> and end with the line </DOC>. A file can contain one or more documents. The portion of the document to be processed by Jet must be enclosed in <TEXT> and </TEXT> tags; anything outside of these tags is ignored.
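The document layout can be illustrated with a short Python sketch that splits a file into documents and keeps only the text between the TEXT tags. This is a simplified illustration and not Jet's document reader; the function name text_portions, the HEADLINE element, and the sample sentences are all invented for the example.

```python
import re


def text_portions(file_contents: str) -> list:
    """For each <DOC> ... </DOC> in the file, return its <TEXT> contents."""
    portions = []
    for doc in re.findall(r"<DOC>(.*?)</DOC>", file_contents, flags=re.DOTALL):
        texts = re.findall(r"<TEXT>(.*?)</TEXT>", doc, flags=re.DOTALL)
        # Anything outside <TEXT> ... </TEXT> is dropped.
        portions.append(" ".join(t.strip() for t in texts))
    return portions


# Hypothetical two-document file; the HEADLINE line lies outside TEXT.
SAMPLE_FILE = """<DOC>
<HEADLINE> ignored by processing </HEADLINE>
<TEXT>
Fred Smith joined IBM.
</TEXT>
</DOC>
<DOC>
<TEXT>
My old cat chases mice.
</TEXT>
</DOC>
"""

print(text_portions(SAMPLE_FILE))
# ['Fred Smith joined IBM.', 'My old cat chases mice.']
```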

Jet Properties for Document Processing

The name of the file containing the document(s) to be processed is specified in the Jet properties file as

JetTest.fileName1 = ./article.txt

(assuming here that the file is in your local directory). If several files are to be processed, they can be specified as fileName1, fileName2, etc. The document is segmented into sentences and each sentence is processed (as specified by the processSentence property). When a document is read from a file, tokenization is not automatic; the tokenize command must be explicitly included, generally as the first command in processSentence. When the document is complete, if the properties file contains a line of the form

WriteSGML.type = ENAMEX

then all annotations of the specified type (in this case, ENAMEX) are converted to SGML tags and the resulting document is written to a file whose name is computed by adding "response-" to the front of the input document file name (so in the example above, the output would be written to response-article.txt).
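The two operations described here, turning annotations into inline SGML tags and deriving the output file name, can be sketched in Python as follows. This is an illustration under stated assumptions, not Jet's writer: annotations are modeled as non-overlapping (start, end, attributes) character spans, the TYPE attribute values follow the usual MUC ENAMEX convention, and the names to_sgml and response_name are hypothetical.

```python
import os


def to_sgml(text: str, annotations: list, tag: str) -> str:
    """Wrap each annotated span of text in <tag ...> ... </tag>."""
    pieces = []
    cursor = 0
    for start, end, attrs in sorted(annotations, key=lambda a: a[0]):
        attr_text = "".join(f' {name}="{value}"' for name, value in attrs.items())
        pieces.append(text[cursor:start])
        pieces.append(f"<{tag}{attr_text}>{text[start:end]}</{tag}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def response_name(input_path: str) -> str:
    """Prefix the file name (not the directory) with 'response-'."""
    directory, name = os.path.split(input_path)
    return os.path.join(directory, "response-" + name)


text = "Fred Smith joined IBM."
ibm = text.index("IBM")
annotations = [
    (ibm, ibm + len("IBM"), {"TYPE": "ORGANIZATION"}),
    (0, len("Fred Smith"), {"TYPE": "PERSON"}),
]

print(to_sgml(text, annotations, "ENAMEX"))
# <ENAMEX TYPE="PERSON">Fred Smith</ENAMEX> joined <ENAMEX TYPE="ORGANIZATION">IBM</ENAMEX>.
print(os.path.basename(response_name("./article.txt")))
# response-article.txt
```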

The tools menu on the Jet console contains a menu item "Process Documents". Selecting this item causes all the document files to be processed. Traces are written to the Console.

Viewing Tagged Documents

The results of all the linguistic analysis are recorded as a set of annotations on the document. Once multiple annotation types are involved, traces are no longer convenient for examining the results, so Jet also provides a document viewer. Selecting "Process Documents and View Annotated Documents" causes Jet to create a document viewer window for each document it processes.

The document viewer has three panes: the Annotated Document (the largest pane), with the list of annotation types to its left and a pane for displaying individual annotations below. If a type is selected from the type list, portions of the document tagged with that type of annotation are highlighted. Selecting any portion of the document by dragging the mouse causes all annotations within the selected region to be listed in the Annotations pane.

4. Pattern language

Jet includes a pattern language which allows the developer to write regular expressions stated in terms of specific tokens and annotations, and to add new annotations to the document. Patterns are organized into pattern sets; having several sets allows annotation structures to be built up in stages. We will begin with one pattern set, chunks, designed to recognize noun groups and verb groups. These pattern sets are invoked from the Jet properties file by specifying a processing step of the form pat(pattern-set); in this case, pat(chunks).

A properties file (chunk.jet) to run the noun/verb group chunk patterns is

# JET properties file to run chunk patterns
Jet.dataPath      = data
Tags.fileName     = pos_hmm.txt
Pattern.fileName1 = chunkPatterns.txt
processSentence   = tagJet, pat(chunks)

Note that tagJet assigns Jet part-of-speech tags, as listed in the Jet documentation. These are the tags which must be used with these patterns.

Two simple traces are provided for pattern matching (on the pattern menu). The Pattern Match Trace prints a message every time a (complete) pattern matches. If several patterns match, only one will be applied (the one spanning the most tokens; among those spanning the same tokens, the pattern appearing first in the file); this is shown by the Pattern Apply Trace.
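The selection rule (longest span wins, ties broken by position in the pattern file) can be written down compactly. The Python sketch below illustrates only that rule as stated above; the Match record, the pattern names, and the function choose are hypothetical and do not correspond to Jet classes or to the contents of chunkPatterns.txt.

```python
from typing import NamedTuple


class Match(NamedTuple):
    pattern: str      # name of the pattern that matched (hypothetical)
    file_order: int   # position of the pattern in the pattern file
    start: int        # index of the first token matched
    end: int          # index just past the last token matched


def choose(matches):
    """Pick the match spanning the most tokens; on a tie, the earliest in the file."""
    return min(matches, key=lambda m: (-(m.end - m.start), m.file_order))


candidates = [
    Match("det-noun", 0, 0, 2),
    Match("det-adj-noun", 1, 0, 3),
    Match("noun-group", 2, 0, 3),
]

print(choose(candidates).pattern)  # det-adj-noun
```

Here all three patterns would appear in the Pattern Match Trace, while only det-adj-noun would appear in the Pattern Apply Trace.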

5. Coreference

The final example in our introduction involves reference resolution. Reference resolution takes the set of (entity) mentions in a document -- principally the noun phrases -- and groups coreferential mentions to form entities. In the example, reference resolution is applied to the output of the chunk patterns, which form noun groups. The properties file, resolve.jet, is as follows:

# JET properties file to run reference resolution on file 'article.txt'
Jet.dataPath       = data
JetTest.fileName1  = ./article.txt
Tags.fileName      = pos_hmm.txt
NameTags.fileName  = MUCnameHMM.txt
Pattern.fileName1  = chunkPatterns.txt
processSentence    = tokenize, tagJet, tagNames, pat(chunks), resolve

If you have selected "Process Documents and View Annotated Documents" and the processed document includes entity annotations, then -- in addition to the standard document viewer -- Jet will display an entity viewer, showing all the entity mentions and how they have been linked.