Jet:  Basic Concepts

Jet is based on the concepts of annotations, annotated documents, and annotators as developed for the Tipster Architecture.  An annotation is a piece of information about a portion of a document.  It consists of a span (a start and an end position in the text), a type, and zero or more features, each with a name and a value.  For example, in "My mouse eats cheese.", to indicate that "My mouse" is a noun phrase, we would record an annotation with start = 0 and end = 9 (end points to the position just past the end of the annotation, and the annotation includes the white space following the token), type = "constit" (constituent), and feature "cat" (category) with value "np".
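
The noun-phrase example can be made concrete with a small sketch.  The class AnnotationSketch and its method names are hypothetical and chosen only for illustration; they are not Jet's actual classes.  The sketch assumes zero-based character offsets and an end position that lies just past the white space following the last token.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class for illustration only; not Jet's actual annotation class.
final class AnnotationSketch {
    final String type;                  // e.g. "constit"
    final int start;                    // offset of the first character covered
    final int end;                      // offset just past the last character covered
    final Map<String, String> features; // e.g. cat -> np

    AnnotationSketch(String type, int start, int end, Map<String, String> features) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid span: " + start + ".." + end);
        }
        this.type = type;
        this.start = start;
        this.end = end;
        this.features = Collections.unmodifiableMap(new LinkedHashMap<String, String>(features));
    }

    // The text the annotation covers, including any trailing white space.
    String coveredText(String documentText) {
        return documentText.substring(start, end);
    }

    // Builds an annotation over the first occurrence of `phrase`,
    // extending the span over the white space that follows it.
    static AnnotationSketch over(String text, String phrase, String type,
                                 Map<String, String> features) {
        int start = text.indexOf(phrase);
        if (start < 0) {
            throw new IllegalArgumentException("phrase not found: " + phrase);
        }
        int end = start + phrase.length();
        while (end < text.length() && Character.isWhitespace(text.charAt(end))) {
            end++;
        }
        return new AnnotationSketch(type, start, end, features);
    }

    public static void main(String[] args) {
        String text = "My mouse eats cheese.";
        AnnotationSketch np =
            over(text, "My mouse", "constit", Collections.singletonMap("cat", "np"));
        System.out.println(np.type + " " + np.features
            + " [" + np.start + ", " + np.end + ") \"" + np.coveredText(text) + "\"");
    }
}
```

Because the span is stored as offsets into the text, the annotation never copies or modifies the text itself; the covered string is recovered on demand with substring.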

A document consists of some text (a character string) and some annotations on that text.  In processing the document, we normally do not alter the text;  rather, we add annotations to it.

An annotator is a procedure (method) which analyzes a document, including the annotations already on it, and adds new annotations to the document.  (It can also delete annotations, although that is less common.)  The processing of a document is organized as a series of annotation steps, each performed by one annotator.  For example, processing may begin with a text zoner (which finds the portion of the document containing the text to be further analyzed), a sentence splitter (which divides that portion into sentences), and a part-of-speech tagger.
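
The organization of processing as a series of annotators can be sketched as follows.  Everything here is a simplified assumption made for illustration: the names PipelineSketch, Doc, Ann, and Annotator are hypothetical, the annotation types "TEXT" and "sentence" are placeholders, and the toy zoner and sentence splitter are far cruder than real ones.  The point is only that each step reads the fixed text plus the annotations left by earlier steps, and adds its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical names throughout; Jet's real classes and annotators differ.
final class PipelineSketch {

    // A typed span of text (features omitted to keep the sketch short).
    static final class Ann {
        final String type;
        final int start;
        final int end;

        Ann(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // A document: text that is never altered, plus a growing list of annotations.
    static final class Doc {
        final String text;
        final List<Ann> annotations = new ArrayList<Ann>();

        Doc(String text) {
            this.text = text;
        }

        // Returns a copy, so callers may add annotations while iterating over it.
        List<Ann> ofType(String type) {
            List<Ann> result = new ArrayList<Ann>();
            for (Ann a : annotations) {
                if (a.type.equals(type)) {
                    result.add(a);
                }
            }
            return result;
        }
    }

    // An annotator inspects the document and adds new annotations to it.
    interface Annotator {
        void annotate(Doc doc);
    }

    // Toy zoner: treats the whole document as the zone to be analyzed.
    static final Annotator ZONER =
        doc -> doc.annotations.add(new Ann("TEXT", 0, doc.text.length()));

    // Toy sentence splitter: a sentence ends after '.', '!' or '?' and
    // includes the white space that follows. It relies on the zoner's output.
    static final Annotator SENTENCE_SPLITTER = doc -> {
        for (Ann zone : doc.ofType("TEXT")) {
            int sentenceStart = zone.start;
            int i = zone.start;
            while (i < zone.end) {
                char c = doc.text.charAt(i++);
                if (c == '.' || c == '!' || c == '?') {
                    while (i < zone.end && Character.isWhitespace(doc.text.charAt(i))) {
                        i++;
                    }
                    doc.annotations.add(new Ann("sentence", sentenceStart, i));
                    sentenceStart = i;
                }
            }
            if (sentenceStart < zone.end) {
                doc.annotations.add(new Ann("sentence", sentenceStart, zone.end));
            }
        }
    };

    // Runs the annotators in order over a fresh document.
    static Doc process(String text, List<Annotator> pipeline) {
        Doc doc = new Doc(text);
        for (Annotator step : pipeline) {
            step.annotate(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        Doc doc = process("My mouse eats cheese. The cat chases mice.",
                          Arrays.asList(ZONER, SENTENCE_SPLITTER));
        for (Ann a : doc.annotations) {
            System.out.println(a.type + " [" + a.start + ", " + a.end + ") \""
                + doc.text.substring(a.start, a.end) + "\"");
        }
    }
}
```

In this sketch the order of the steps matters: the sentence splitter finds nothing to do unless the zoner has already run, which mirrors the way later annotators build on the annotations of earlier ones.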

An external document is a document which exists outside of a Jet system, typically as a file or a Web page.  When an external document is opened (converted to an (internal) document), some of the information from the external document may be converted to annotations.  Several different formats of external documents are recognized.  The most common uses XML-like tags, such as "<constit cat="np"> ... </constit>" to indicate that some words form an NP constituent.  Another format, for part-of-speech information only, represents data as token/POStag (for example, "The/DT cat/NN chases/VBZ mice/NNS").
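
As an illustration of how an external format can be turned into text plus annotations, here is a minimal sketch that reads the token/POStag format.  It is not Jet's actual reader: the class PosFormatSketch and its nested types are hypothetical, and the sketch assumes that items are separated by white space, that the tag follows the last slash in each item, and that each token's span includes one following space (so the rebuilt text ends with a space).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for illustration; not Jet's actual document-loading code.
final class PosFormatSketch {

    // One part-of-speech annotation: a tag over a span of the rebuilt text.
    static final class Tagged {
        final String tag;
        final int start;
        final int end;

        Tagged(String tag, int start, int end) {
            this.tag = tag;
            this.start = start;
            this.end = end;
        }
    }

    // The internal form: plain text plus the annotations derived from the tags.
    static final class Result {
        final String text;
        final List<Tagged> tokens;

        Result(String text, List<Tagged> tokens) {
            this.text = text;
            this.tokens = tokens;
        }
    }

    static Result read(String external) {
        StringBuilder text = new StringBuilder();
        List<Tagged> tokens = new ArrayList<Tagged>();
        String trimmed = external.trim();
        if (trimmed.isEmpty()) {
            return new Result("", tokens);
        }
        for (String item : trimmed.split("\\s+")) {
            // Split on the last slash so that tokens containing '/' still parse.
            int slash = item.lastIndexOf('/');
            if (slash <= 0 || slash == item.length() - 1) {
                throw new IllegalArgumentException("expected token/POStag but found: " + item);
            }
            int start = text.length();
            text.append(item, 0, slash).append(' ');
            // The span covers the token and the single space appended after it.
            tokens.add(new Tagged(item.substring(slash + 1), start, text.length()));
        }
        return new Result(text.toString(), tokens);
    }

    public static void main(String[] args) {
        Result r = read("The/DT cat/NN chases/VBZ mice/NNS");
        for (Tagged t : r.tokens) {
            System.out.println(t.tag + " [" + t.start + ", " + t.end + ") \""
                + r.text.substring(t.start, t.end) + "\"");
        }
    }
}
```

The tags disappear from the text and survive only as annotations, which is the sense in which information from the external document is converted when it is opened.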