Tipster Architecture: Planned Changes

In addition to changes for which formal RFC's (Requests for Change) have been submitted (and in some cases, approved), a number of changes to the Architecture are "in the works". At least three are expected to become RFC's over the summer of 1996:

Defining a Document Manager
Error Handling
Semantics of Annotation Update

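A number of other revisions are in various stages of consideration:

Writeable Documents and Document Protection
Transaction Control
Annotate Collection
Registering Annotators
Detection Object Classes
Multiple Retrieval Engines
C++ Interface
Attribute-based Collection Search
Pattern Specification Language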

Defining a Document Manager

Although developers make frequent reference to a "document manager" and a "retrieval engine", the Design Document never defines either term. A definition is to be added.

Error Handling

Standardizing an architecture includes standardizing how errors are signaled; a list of codes for major errors is now being assembled. Error signaling will then use these codes in place of the "longjmp"s currently specified in the Design Document.
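
To illustrate the difference in calling style, here is a minimal sketch of an operation that reports failure through a returned error code rather than a longjmp. Every name in it (TipStatus, the TIP_* codes, Doc, find_doc) is hypothetical and chosen only for this example; the actual list of codes is still being assembled and may look quite different.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical error codes, invented for this sketch only. */
typedef enum {
    TIP_OK = 0,
    TIP_ERR_NOT_FOUND,
    TIP_ERR_BAD_ARGUMENT
} TipStatus;

/* Hypothetical document record; not the Architecture's Document class. */
typedef struct {
    const char *id;
    const char *raw_data;
} Doc;

/* Look up a document by id.
 * The status is the return value; the result is delivered through the
 * out-parameter, which is set to NULL on any failure. */
TipStatus find_doc(const Doc *docs, size_t n, const char *id,
                   const Doc **out)
{
    assert(out != NULL);
    *out = NULL;
    if (docs == NULL || id == NULL)
        return TIP_ERR_BAD_ARGUMENT;
    for (size_t i = 0; i < n; i++) {
        if (strcmp(docs[i].id, id) == 0) {
            *out = &docs[i];
            return TIP_OK;
        }
    }
    return TIP_ERR_NOT_FOUND;
}
```

With this style the caller tests the returned status after each call and decides locally how to recover, instead of having control transferred to a previously saved jump buffer.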

Semantics of Annotation Update

The semantics of the operations which modify annotations are ambiguous in several respects. A revised description of these operations is being prepared to resolve some of these ambiguities.

Writeable Documents and Document Protection

The Contractor Architecture Working Group decided in November 1995 to allow for writeable documents: the RawData of a document may be changed, and the spans associated with an annotation may be changed. They decided, however, not to submit this as an RFC until some protection scheme for documents can be incorporated into the Architecture.

Transaction Control

The Design Document says little about multi-process semantics. In the current architecture, changes to a Document must be written out when the Document is Sync'ed but can be written out at any time before then. For multi-process systems, we need some sort of transaction control on document modifications, so that a document will not be written out in an inconsistent state. We will also need to control access to a document, so several processes can access a document for reading, but only one can do so for writing.
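
The access rule in the last sentence (many readers or one writer) can be sketched as follows. This is only a model of the policy, held in ordinary memory within one process; a real multi-process implementation would need shared state or operating-system locking, and would also have to address consistent write-out, which this sketch does not attempt. The type DocAccess and the doc_open_* and doc_close_* functions are hypothetical names, not part of the Architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical access state for a single document. */
typedef struct {
    int  readers;  /* number of openers currently holding read access */
    bool writer;   /* true while one opener holds write access */
} DocAccess;

/* Grant read access unless a writer is active. */
bool doc_open_read(DocAccess *a)
{
    assert(a != NULL);
    if (a->writer)
        return false;
    a->readers++;
    return true;
}

/* Grant write access only if nobody else holds the document. */
bool doc_open_write(DocAccess *a)
{
    assert(a != NULL);
    if (a->writer || a->readers > 0)
        return false;
    a->writer = true;
    return true;
}

/* Release one reader's access. */
void doc_close_read(DocAccess *a)
{
    assert(a != NULL && a->readers > 0);
    a->readers--;
}

/* Release the writer's access. */
void doc_close_write(DocAccess *a)
{
    assert(a != NULL && a->writer);
    a->writer = false;
}
```

In this sketch a refused request simply returns false; whether a process should instead wait, retry, or report an error is one of the questions a transaction control design would have to settle.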

Annotate Collection

The current definition of AnnotateCollection is predicated on the original notion of document id's, which were unique across a system. If the RFC on document id's is approved, a new definition of AnnotateCollection will be required.

Registering Annotators

As mentioned at the end of section 5.6 of the Design Document, the Architecture needs a mechanism to add annotators to a system. At present, the set of annotators must be hard-coded when an application is built.
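
One possible shape for such a mechanism is a table of named annotators that is filled in at run time rather than fixed at build time. The sketch below is an assumption made for illustration, not a proposed design: AnnotatorFn, AnnotatorRegistry, register_annotator, find_annotator and the toy count_tokens annotator are all hypothetical, and a real annotator would operate on Documents and produce annotations rather than return a count.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ANNOTATORS 16

/* Hypothetical annotator signature: takes document text and returns
 * the number of annotations it would produce. */
typedef int (*AnnotatorFn)(const char *text);

typedef struct {
    const char *name;
    AnnotatorFn fn;
} AnnotatorEntry;

typedef struct {
    AnnotatorEntry entries[MAX_ANNOTATORS];
    size_t count;
} AnnotatorRegistry;

/* Start with an empty registry. */
void registry_init(AnnotatorRegistry *r)
{
    assert(r != NULL);
    r->count = 0;
}

/* Return the annotator registered under name, or NULL if none. */
AnnotatorFn find_annotator(const AnnotatorRegistry *r, const char *name)
{
    assert(r != NULL && name != NULL);
    for (size_t i = 0; i < r->count; i++) {
        if (strcmp(r->entries[i].name, name) == 0)
            return r->entries[i].fn;
    }
    return NULL;
}

/* Add an annotator at run time.
 * Returns 0 on success, -1 if the name is taken or the table is full. */
int register_annotator(AnnotatorRegistry *r, const char *name, AnnotatorFn fn)
{
    assert(r != NULL && name != NULL && fn != NULL);
    if (r->count == MAX_ANNOTATORS || find_annotator(r, name) != NULL)
        return -1;
    r->entries[r->count].name = name;
    r->entries[r->count].fn = fn;
    r->count++;
    return 0;
}

/* Toy annotator: counts whitespace-separated tokens. */
int count_tokens(const char *text)
{
    int n = 0;
    int in_token = 0;
    for (; *text != '\0'; text++) {
        if (*text == ' ' || *text == '\t' || *text == '\n') {
            in_token = 0;
        } else if (!in_token) {
            in_token = 1;
            n++;
        }
    }
    return n;
}
```

An application would then look up annotators by name when it runs, so a new annotator could be made available by registering it instead of rebuilding the application around a fixed set.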

Detection Object Classes

There are several deficiencies in the current structure of detection object classes. For example, there is no way to examine the retrieval or routing queries produced by relevance feedback. Several proposals have been made for revising these objects, and the structure of these objects will probably be re-examined in the fall of 1996.

Multiple Retrieval Engines

The current Architecture assumes only a single retrieval engine; it needs to be extended to handle multiple engines.

C++ Interface

A C++ interface specification needs to be developed, to parallel the C specification in the Design Document. As a starting point we expect to take the specification prepared by Logicon for the Prides project.

Attribute-based Collection Search

Searching a collection for documents with particular attributes is currently very expensive (each document must be opened in turn). A proposal has been made by BBN to provide an operation specifically for such searches.

Pattern Specification Language

Many information extraction systems now use quite similar "pattern matching" strategies, based on the application of a series of (regular expression) patterns. Boyan Onyshkevich is organizing a series of discussion meetings to see if some standard for such patterns can be developed.

Last updated July 15 1996 by R. Grishman (grishman@cs.nyu.edu)
