Tipster Architecture: Planned Changes
In addition to changes for which formal RFC's (Requests for Change)
have been submitted (and in some cases, approved), a number of changes
to the Architecture are "in the works". At least three are expected
to become RFC's over the summer of 1996:
A number of other revisions are in various stages of consideration:
Although developers make frequent reference to a "document manager"
and a "retrieval engine", the Design Document never explains what
either one is. It will soon.
Part of standardizing an architecture involves standardizing the
error signaling; a list of codes for major errors is now being
assembled. The standard signaling for errors will be changed to
use these error codes in place of "longjmp"s (which are currently
specified in the Design Document).
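As a rough illustration of what signaling by error code could look like, here is a minimal sketch in C. The list of codes was still being assembled, so the enumeration values, the function names (tip_find_document, tip_status_message), and the toy in-memory document table are all hypothetical and are not taken from the Design Document.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical error codes; the actual list is not given in this note. */
typedef enum {
    TIP_OK = 0,
    TIP_ERR_INVALID_ARGUMENT,
    TIP_ERR_NO_SUCH_DOCUMENT
} TipStatus;

/* Toy in-memory table standing in for a real document store. */
static const char *const known_ids[] = { "doc-001", "doc-002" };

/* Look up a document id. Failure is reported through the return
   value, so the caller never has to set up a setjmp/longjmp handler. */
TipStatus tip_find_document(const char *id, size_t *index_out)
{
    size_t i;

    if (id == NULL || index_out == NULL)
        return TIP_ERR_INVALID_ARGUMENT;

    for (i = 0; i < sizeof known_ids / sizeof known_ids[0]; i++) {
        if (strcmp(known_ids[i], id) == 0) {
            *index_out = i;
            return TIP_OK;
        }
    }
    return TIP_ERR_NO_SUCH_DOCUMENT;
}

/* Map a status code to a short human-readable message. */
const char *tip_status_message(TipStatus status)
{
    switch (status) {
    case TIP_OK:                   return "ok";
    case TIP_ERR_INVALID_ARGUMENT: return "invalid argument";
    case TIP_ERR_NO_SUCH_DOCUMENT: return "no such document";
    default:                       return "unknown error";
    }
}
```

With this style, each caller checks the returned status and decides locally whether to recover or to pass the code upward, which is easier to standardize across modules than non-local jumps.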
There are a number of ambiguities about the semantics of
operations which modify annotations. A revision of the
description of these operations is being prepared to resolve
some of these ambiguities.
The Contractor Architecture Working Group decided in November 1995 to
allow for writeable documents: the RawData of a document may be
changed, and the spans associated with an annotation may be changed.
They decided, however, not to submit this as an RFC until some
protection scheme for documents can be incorporated into the
Architecture.
The Design Document says little about multi-process semantics. In
the current architecture, changes to a Document must be written out
when the Document is Sync'ed but can be written out at any time
before then. For multi-process systems, we need some sort of
transaction control on document modifications, so that a document
will not be written out in an inconsistent state. We will also need
to control access to a document, so several processes can access a
document for reading, but only one can do so for writing.
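The access rule described above (many readers, or exactly one writer) can be sketched in C as follows. This is only an illustration of the admission logic: the type and function names (DocAccess, doc_acquire, doc_release) are hypothetical, and a real multi-process implementation would have to keep this state somewhere shared and update it atomically, for example through operating system file locks.

```c
#include <stdbool.h>

typedef enum { ACCESS_READ, ACCESS_WRITE } AccessMode;

/* Per-document access state (hypothetical). */
typedef struct {
    int  readers;  /* number of processes currently reading */
    bool writer;   /* true while one process holds write access */
} DocAccess;

/* Try to acquire access; returns true on success, false if the
   request conflicts with access already granted. */
bool doc_acquire(DocAccess *a, AccessMode mode)
{
    if (a->writer)
        return false;            /* a writer excludes everyone else */

    if (mode == ACCESS_WRITE) {
        if (a->readers > 0)
            return false;        /* readers exclude a new writer */
        a->writer = true;
    } else {
        a->readers++;            /* any number of readers may share */
    }
    return true;
}

/* Give back access previously granted by doc_acquire. */
void doc_release(DocAccess *a, AccessMode mode)
{
    if (mode == ACCESS_WRITE)
        a->writer = false;
    else if (a->readers > 0)
        a->readers--;
}
```

Transaction control would sit on top of this: a writer would hold write access for the whole sequence of modifications and release it only after the document has been written out in a consistent state.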
The current definition of AnnotateCollection is predicated on
the original notion of document id's, which were unique across
a system. If the RFC on document id's is approved, a new
definition of AnnotateCollection will be required.
As is mentioned at the end of section 5.6 of the Design
Document, the Architecture needs a mechanism to add annotators
to a system. At present, the set of annotators must be
hard coded when an application is built.
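One possible shape for such a mechanism is a registry that maps annotator names to functions, so that annotators can be added at run time instead of being fixed at build time. The sketch below is an assumption for illustration only: the AnnotatorFn signature, the registry functions, and the toy count_words annotator are hypothetical and do not come from the Design Document.

```c
#include <stddef.h>
#include <string.h>

#define MAX_ANNOTATORS 16

/* Hypothetical annotator signature: takes document text and returns
   the number of annotations it would produce. */
typedef int (*AnnotatorFn)(const char *text);

typedef struct {
    const char *name;  /* caller must keep this string alive */
    AnnotatorFn fn;
} AnnotatorEntry;

static AnnotatorEntry registry[MAX_ANNOTATORS];
static size_t registry_count = 0;

/* Return the annotator registered under name, or NULL if none. */
AnnotatorFn find_annotator(const char *name)
{
    size_t i;
    if (name == NULL)
        return NULL;
    for (i = 0; i < registry_count; i++) {
        if (strcmp(registry[i].name, name) == 0)
            return registry[i].fn;
    }
    return NULL;
}

/* Add an annotator at run time. Returns 0 on success, -1 if the
   arguments are invalid, the name is taken, or the table is full. */
int register_annotator(const char *name, AnnotatorFn fn)
{
    if (name == NULL || fn == NULL)
        return -1;
    if (registry_count == MAX_ANNOTATORS || find_annotator(name) != NULL)
        return -1;
    registry[registry_count].name = name;
    registry[registry_count].fn = fn;
    registry_count++;
    return 0;
}

/* Toy annotator: counts space-separated words. */
int count_words(const char *text)
{
    int n = 0, in_word = 0;
    for (; *text != '\0'; text++) {
        if (*text == ' ') {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            n++;
        }
    }
    return n;
}
```

An application would call register_annotator("words", count_words) during start-up and later look the annotator up by name, so new annotators could be added without changing the code that invokes them.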
There are several deficiencies in the current structure of
detection object classes. For example, there is no way to
examine the retrieval or routing queries produced by relevance
feedback. Several proposals have been made for revising these
objects, and the structure of these objects will probably be
re-examined in the fall of 1996.
The current Architecture assumes only a single retrieval engine;
it needs to be extended to handle multiple engines.
A C++ interface specification needs to be developed, to parallel
the C specification in the Design Document. As a starting point
we expect to take the specification prepared by Logicon for the
Prides project.
Searching a collection for documents with particular attributes
is currently very expensive (each document must be opened in
turn). A proposal has been made by BBN to provide an operation
specifically for such searches.
Many information extraction systems now use a quite similar "pattern
matching" strategy, based on the application of a series of (regular
expression) patterns. Boyan Onyshkevych is organizing a series of
discussion meetings to see if some standard for such patterns can be
developed.
Last updated July 15 1996 by R. Grishman (grishman@cs.nyu.edu)