The Medical Language Parser

The MLP System
involves three successive stages, as shown in Figure 3. This arrangement allows the user to easily alter the specifications of his/her "user-oriented" language as his/her needs change. The system may easily be adapted for use with an English grammar of very different external appearance by changing the input to the first stage of the process.

Figure 1
OBJECT GRAMMAR OF BNF

RESTRICTION LANGUAGE SYNTAX&rarr Stage I
&darr
OBJECT RLS
&darr
ENGLISH GRAMMAR & WORD DICTIONARY&rarr Stage II
&darr
OBJECT EG & WD
&darr
ENGLISH&rarr Stage III
&darr
SENTENCE PARSES

Figure 3
The Three Stages of the System generation

The Parser
The core of the program is a very simple top-down parser for context-free grammars, which generates multiple parses of ambiguous sentences sequentially using a back-up mechanism. This parser, together with a table-driven lexical processor and a directive processor which invokes all the other system components, is present in the program for each of the three stages of the system.

GENERATORS
&darr &uarr &darr &uarr
PARSER
|
LEXICAL PROCESSOR
|
DIRECTIVE PROCESSOR
(loading, updating, etc.)

Figure 2
Organization of Stages I and II

As indicated in Figures 2 and 3, the parser has "hooks" on it to permit various routines to be invoked during the top-down parse. In the first two stages, the parser is used as a syntax-directed compiler which translates the grammar of the grammar of English (i.e. the grammar of the user-oriented language) or the grammar of English from its input text form to list structure. Hence, the routines invoked by the parser in stage I and stage II programs are code generators, which construct the requisite list structure during the top-down analysis (Figure 2).

RESTRICTIONS
&darr &uarr &darr &uarr
PARSER
|
LEXICAL PROCESSOR
|
DIRECTIVE PROCESSOR
(loading, updating, etc.)

Figure 3
Organization of Stage III

For stage III, the generators are replaced by a restriction interpreter (Figure 3). The grammar of English consists of a context-free component plus a set of restrictions, each of which is associated with one or more productions in the context-free component. These restrictions state conditions which the parse tree must meet if the analysis is to be accepted. Each time a node is added to the parse tree, and each time a level in the tree is completed, the parser invokes the restriction interpreter to execute those restrictions appearing in the corresponding production; the restriction interpreter returns a success or failure indication to the parser. If the restriction has succeeded, the parser continues normally (i.e., as if there had been no restriction). If the restriction has failed, the parser must either try an alternate option, or if all options in a production have been exhausted, dismantle part of the parse tree.

The system has been implemented on the Control Data 6600 at New York University in 1973. Today, it is totally in C++. It consisted of approximately 8,000 source lines in about 100 subroutines, only some of which are included in the program for any one stage of the system. It is written almost entirely in FORTRAN, with only a few assembly routines for manipulating part-word fields. Each stage occupies approximately 55,000 60-bit words (corresponding roughly to 400,000 8-bit bytes).

The Compiler
In Stages I and II, the compiler has a throughput of about 1,500 to 3,000 cards (grammar lines) per minute, depending on the complexity of the grammar. With the English grammar of about 20,000 source lines, including the word dictionary, a fast compiler (directive *COMPILE()) is clearly an asset. Still, re-compiling the grammar for seven or eight minutes each time a change is made is rather expensive, so an updating system (directive *MODIFY()) was included in the compiler. In the output of stages I and II, the source text and generated list structure are combined in a single file called an object grammar or an object dictionary (named in analogy with the object program produced by a compiler). Once an object grammar (file name with the extension obg) or object dictionary (file name with the extension wdo) as been initially created, the user may specify modifications to it on a statement-by-statement basis. The system will compile the new statements and will insert, delete, or replace the source text and corresponding list structure in paralle.

The function of the various stages, and format of the source input to these stages, will now be described.

Stage 1
parses the grammar of the grammar of English, which is a set of BNF statements describing the syntax of five components of the English grammar:
  1. The context-free component
  2. The lists
  3. The restrictions
  4. The dictionary canonical forms (lexical entry conventions)
  5. the dictionary
The Dictionary Look-Up