The Lexicon and Lexical Lookup

Still to mention:  multi-word entries, matching;  case sensitivity;  pa forms

The primary source of grammatical information about individual words is the lexicon.  A lexical entry for a word will give its part of speech and various features of the word.  The lexical lookup annotator processes a span of text which has already been divided into tokens, marked by token annotations (thus you must run a tokenizer prior to lexical lookup).  It looks up each token in the lexicon and adds a constit annotation for each definition it finds.  The attributes of the constit annotation are taken from the features of the lexical entry.

Basic Lexical Entry Format

The simplest form for a lexical entry is

     word,, cat = part-of-speech;

where part-of-speech is some pre-terminal symbol in the grammar (note the double comma!).  For example,

my,,  cat = art;
old,,  cat = adj;
dog,,  cat = n;
dogs,, cat = n;
chases,, cat = v;
cars,, cat=n;

The entry may give additional features for the word, in the form feature=value;  for example

dog,, cat=n, number=singular;
dogs,, cat=n, number=plural;

Thus if the word "dog" appears in a sentence, lexical lookup will assign it the annotation <constit cat=n number=singular>dog</constit>.  If a word has multiple parts of speech, it should have several entries in the lexicon:

walk,, cat=v, number=plural;
walk,, cat=n, number=singular;

When "walk" appears in a sentence, lexical lookup will add two constit annotations, one for each definition.
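
The lookup behaviour described above can be illustrated with a small Python sketch.  This is not JET's implementation (JET is written in Java);  the names parse_basic_entries and lookup are hypothetical, and the sketch assumes that tokens are matched exactly, including case, since case handling is not specified here.

```python
from collections import defaultdict

def parse_basic_entries(text):
    """Parse entries of the form 'word,, feature=value, ...;' (hypothetical helper)."""
    lexicon = defaultdict(list)
    for raw in text.split(";"):
        entry = raw.strip()
        if not entry:
            continue
        # The double comma separates the word from its feature list.
        word, _, features = entry.partition(",,")
        feats = {}
        for pair in features.split(","):
            if not pair.strip():
                continue
            name, _, value = pair.partition("=")
            feats[name.strip()] = value.strip()
        lexicon[word.strip()].append(feats)
    return lexicon

def lookup(lexicon, tokens):
    """Return one (token, features) pair for each definition of each token."""
    return [(tok, feats) for tok in tokens for feats in lexicon.get(tok, [])]

entries = """
dog,, cat=n, number=singular;
walk,, cat=v, number=plural;
walk,, cat=n, number=singular;
"""
lexicon = parse_basic_entries(entries)
annotations = lookup(lexicon, ["dog", "walk"])
```

Here annotations holds one item for "dog" and two for "walk", mirroring the one-constit-per-definition rule.  A token with no lexicon entry simply yields no items.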

English Lexicon Entries

In order to keep the size of the external lexicon small, it is convenient to have a single entry for a noun or verb rather than have to write a separate entry for each inflected form.  For example, we want to have one entry for "repeat" rather than separate entries for "repeat", "repeats", "repeated", and "repeating".  JET supports this by having a small set of standard entry types;  each entry type is automatically expanded into the various inflected forms in the internal lexicon.

The basic form of an entry is

defined-item, type, feature = value, feature = value, ... ;

The defined-item is the word or word sequence being defined.  It may be a single word, a sequence of words, or a string enclosed in double quotes (").  If the defined item contains any characters other than letters (and the spaces between words), it must be enclosed in quotes.  Thus:

cat, noun;
floppy disk, noun;
"cat o' nine tails", noun;

The type field may be noun, verb, adj, or adv, or may be empty.  If the field is empty, the attribute / value pairs are used directly to create the internal lexicon entry.  In this case, there should be at least a cat feature, indicating the word category of the lexical item:

of,, cat=p;

Each feature value may be an integer, a symbol (a sequence of letters beginning with a lower-case letter), or a string (enclosed in double quotes).  For features representing inflected forms, if the value is a single word, it may be written as either a symbol or a string:

ox, noun, plural = oxen;
ox, noun, plural = "oxen";

If the inflected form consists of more than one word, or includes a non-letter, it must be enclosed in quotes:

musk ox, noun, plural = "musk oxen";
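
As a rough illustration of the entry format, the following Python sketch splits one entry into its defined item, type, and features.  It is an assumption-laden sketch rather than JET's actual reader:  the function names are hypothetical, commas inside double quotes are assumed not to separate fields, and all feature values are kept as plain strings (integers are not converted).

```python
def split_outside_quotes(text, sep=","):
    """Split text on sep, ignoring separators inside double quotes."""
    parts, current, in_quotes = [], [], False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == sep and not in_quotes:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts

def unquote(s):
    """Remove one pair of surrounding double quotes, if present."""
    return s[1:-1] if len(s) >= 2 and s[0] == s[-1] == '"' else s

def parse_entry(entry):
    """Return (defined_item, type, features) for one entry (hypothetical helper)."""
    fields = split_outside_quotes(entry.strip().rstrip(";"))
    item = unquote(fields[0])
    entry_type = fields[1] if len(fields) > 1 else ""
    features = {}
    for field in fields[2:]:
        if not field:
            continue
        name, _, value = field.partition("=")
        features[name.strip()] = unquote(value.strip())
    return item, entry_type, features

item, entry_type, features = parse_entry('musk ox, noun, plural = "musk oxen";')
```

An entry with an empty type field, such as the one for "of", comes back with an empty string as its type and the cat feature in the feature dictionary.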
The entry types and their features are described below.  All features are optional.


base-form, noun, plural = plural, attributes = attributes, xn = xn;
defines a noun (word category n) whose singular form is base-form.  The plural form is chosen as follows:  if plural is none, no plural is defined;  if plural is given explicitly, that value is used as the plural form;  otherwise the plural form is derived from the base form by these rules:
if it ends in 'x', 'z', 's', 'ch', or 'sh', add 'es'
if it ends in a vowel + 'y', add 's'
if it ends in a consonant + 'y', change the 'y' to 'ies'
otherwise add 's'
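
The plural rules above can be sketched in Python as follows.  The function name plural_form is hypothetical, and the sketch assumes that "vowel" means one of a, e, i, o, u and that the rules are tried in the order listed;  it is an illustration, not JET's own code.

```python
VOWELS = "aeiou"  # assumption: 'y' itself is not counted as a vowel

def plural_form(base, plural=None):
    """Return the plural of a noun entry, or None if no plural is defined."""
    if plural == "none":
        return None                      # plural = none: no plural form
    if plural is not None:
        return plural                    # explicit plural overrides the rules
    if base.endswith(("x", "z", "s", "ch", "sh")):
        return base + "es"
    if len(base) >= 2 and base.endswith("y"):
        if base[-2] in VOWELS:
            return base + "s"            # vowel + 'y'
        return base[:-1] + "ies"         # consonant + 'y'
    return base + "s"

examples = [plural_form(w) for w in ["box", "church", "day", "city", "dog"]]
```

With these rules, examples contains "boxes", "churches", "days", "cities", and "dogs", while plural_form("ox", "oxen") returns the explicit form "oxen".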


base-form, verb, thirdSing = singular, plural = plural, past = past, pastPart = past-participle, presPart = present-participle, attributes = attributes, xn = xn;
defines a verb whose infinitival form is base-form.  The following inflected forms are generated:


form, adj, attributes = attributes;
defines an adjective (word category adj) with the specified attributes.


form, adv, attributes = attributes;
defines an adverb (word category adv) with the specified attributes.