System Description

Click to download the MLP system [in tar format].

The MLP medical language processor includes:
     English healthcare syntactic lexicon and medically tagged lexicon,
     the MLP C++ parser,
     parsing with English medical grammar,
     selection with medical cooccurrence patterns,
     English transformation,
     syntactic regularization, and
     mapping into medical information format structures.

It also includes a set of XML tools for browsing and display.

The Output Data Structure, Information Format

An Information Format (I-F) is a template for holding the words of a sentence (or sentence-part) that corresponds to a statement type of the sublanguage. Figure 1 shows a clinical sentence analyzed into I-F occurrences of the PATIENT STATE type (IF-5), the most commonly occurring one in clinical documents.

Figure 1
*SID=GLCBA 002B.1.01
* THIS 29 YEAR OLD GIRL WHO IS KNOWN TO BE ASTHMATIC FOR 15 YEARS PRESENTED
* WITH 4 DAY HISTORY OF STEADILY INCREASING EXERTIONAL WHEEZY DYSPNOEA WITH
* A COUGH PRODUCTIVE OF GREEN SPUTUM .
(CONNECTIVE (REL-CLAUSE (CONN = who))
(CONNECTIVE (RELATION (CONN = with (H-CONN)))
(I-F5: PATIENT STATE
(PSTATE-SUBJ(PT = This 29 (QNUMBER) year (NUNIT NTIME1) old (H-AGE) girl (H-PT))
(VERB = presented (H-PTVERB) with (H-CONN))
(TIME(TM-PERIOD = a history (H-TMPER))
(Q-N(NUM = 4 (QNUMBER))
(UNIT = day (NUNIT NTIME1))))
(TENSE = [PAST])
(PSTATE-DATA (S-S = of exertional (H-PTFUNC) wheezy (H-INDIC) dyspnoea (H-INDIC))
(QUANT = steadily increasing (H-CHANGE [(MORE)]))))
(I-F5: PATIENT STATE
(PSTATE-DATA(S-S = a cough (H-INDIC) productive of ('OF') green sputum (H-INDIC))))
(CONNECTIVE (EMBEDDED (CONN = [EMBEDDED-OBJ]))
(I-F00: SENTENTIAL OPERATOR
(VERB = is (H-VTEST VBE) known)
(TENSE ([PRESENT]))
(I-F5: PATIENT STATE
(PSTATE-SUBJ (PT = girl (H-PT))
(VERB = be (H-VTEST VBE))
(PSTATE-DATA(DIAG = asthmatic (H-DIAG))
(TIME(TPREP1 = for ('FOR'))
(Q-N(NUM = 15 (QNUMBER))
(UNIT = years (NTIME1)))))))

System Architecture

The MLP information-formatting program is composed of five modules that operate in sequence on each successive sentence of the document, as illustrated in Fig. 2. Equipped with a sublanguage dictionary and grammar, the program determines what statement types are present in a given sentence, creates the appropriate information formats and connective relations, and maps the words of the sentence into the resulting structures. The MLP system for medical documents now operates in English, French, and (at the level of a Ph.D. thesis) in German and Dutch. The type of grammar used made it relatively straightforward to move the system from English to neighboring languages. The medical portions of the system carried over with almost no change.

Figure 2

Referring to Fig. 2, the first module parses the sentence into its grammatical components using a grammar that embodies syntactic structures and constraints. The second module filters out alternative syntactic analyses that are not semantically correct, based on established patterns of medical word-class combination (medical co-occurrence patterns or "selection lists"). The third (Transformation) module makes every conjunctional substatement complete (e.g. by expanding "pain in epigastrium and right lower quadrant" to "pain in epigastrium and pain in right lower quadrant") and also reduces syntactic variation in other ways. The fourth (Regularization) module treats the connective structure, turning the whole into Polish notation. Finally, the formatting module places sentence words into the appropriate slots of the I-Fs and prepares the output for mapping into the current database structure or the online Viewer.
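
The Transformation and Regularization steps can be illustrated with a toy sketch. It assumes the parse has already identified the shared head phrase and the conjuncts; the function names `expand_conjunction` and `to_polish` and the tuple representation are hypothetical and are not part of the MLP system.

```python
def expand_conjunction(head, conjuncts, conjunction="and"):
    """Repeat the shared head in front of every conjunct.

    Assumes an earlier step has already split the phrase into a shared
    head ("pain in") and its conjuncts; this is an illustration only.
    """
    return f" {conjunction} ".join(f"{head} {item}" for item in conjuncts)


def to_polish(node):
    """Turn an infix (left, connective, right) tuple into a prefix list.

    Leaves are plain strings; the connective is emitted before its two
    operands, which is what Polish (prefix) notation means.
    """
    if isinstance(node, str):
        return [node]
    left, connective, right = node
    return [connective] + to_polish(left) + to_polish(right)


expanded = expand_conjunction("pain in", ["epigastrium", "right lower quadrant"])
prefix = to_polish(("pain in epigastrium", "and", "pain in right lower quadrant"))
print(expanded)
print(prefix)
```

Putting the connective first matches the layout of Figure 1, where each CONNECTIVE line precedes the information formats it relates.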

While the output of the information-formatting program is in the form of tree structures (XML format), each information format tree can also be mapped into a "flattened" form to become a row (record) of a relational database table.
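
As a minimal sketch of the flattening idea, the following assumes an information format tree is held as nested dictionaries and that column names are formed by joining slot names with an underscore. Both choices are assumptions made for illustration; the actual MLP database layout is not described here. The sample tree is loosely modeled on the last I-F5 occurrence in Figure 1.

```python
def flatten_format(tree, prefix=""):
    """Flatten a nested information-format tree into one flat record.

    Each leaf becomes a column whose (hypothetical) name is the path of
    slot names joined by underscores.
    """
    row = {}
    for slot, value in tree.items():
        column = f"{prefix}_{slot}" if prefix else slot
        if isinstance(value, dict):
            row.update(flatten_format(value, column))
        else:
            row[column] = value
    return row


# Simplified stand-in for one PATIENT STATE format occurrence.
patient_state = {
    "FORMAT": "I-F5",
    "PSTATE-SUBJ": {"PT": "girl"},
    "VERB": "be",
    "PSTATE-DATA": {"DIAG": "asthmatic"},
    "TIME": {"TPREP1": "for", "Q-N": {"NUM": "15", "UNIT": "years"}},
}

row = flatten_format(patient_state)
for column, value in row.items():
    print(column, "=", value)
```

Each dictionary returned this way corresponds to one row of a table whose columns are the flattened slot names.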

Quality Control of Language Processing

To ensure that medical language processing when applied to patient documents produces reliable patient data, the LSP system contains a procedure for quality control of language processing (the 'NIMPH' program). A database field is created to hold the results of the quality assessment of the record to be loaded. It contains:
(empty) if the record passes all tests;
N if there is a potential Negative problem;
I if the record is semantically Ill-formed (wrong type word in field);
M if there is a potential Modal problem (Modal=uncertainty);
P if there is a failure and the sentence is Partially recovered (the unrecoverable text is stored in the TEXTPLUS field);
H if there is a system Hangup (the whole sentence text is stored in the TEXTPLUS field).
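
The quality field can be pictured as a small lookup table. The sketch below assumes the field holds a single code per record, which the description above does not state explicitly; the names `QUALITY_FLAGS`, `needs_review`, and `uses_textplus` are hypothetical.

```python
# Codes and meanings as listed above; the empty string means "passes all tests".
QUALITY_FLAGS = {
    "": "record passes all tests",
    "N": "potential Negative problem",
    "I": "semantically Ill-formed record (wrong type of word in field)",
    "M": "potential Modal problem (uncertainty)",
    "P": "failure, sentence Partially recovered",
    "H": "system Hangup",
}


def needs_review(flag):
    """Return True when the record carries any quality flag."""
    if flag not in QUALITY_FLAGS:
        raise ValueError(f"unknown quality flag: {flag!r}")
    return flag != ""


def uses_textplus(flag):
    """Return True for the codes whose text is stored in TEXTPLUS."""
    return flag in ("P", "H")


for code in ("", "N", "P"):
    print(repr(code), QUALITY_FLAGS[code], needs_review(code), uses_textplus(code))
```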

The LSP System (core of MLP)
involves three successive stages, as shown in Figure 3. This arrangement allows users to alter the specifications of their "user-oriented" language as their needs change. The system may be adapted for use with a grammar of very different external appearance by changing the input to the first stage of the process.


Figure 3

OBJECT GRAMMAR OF BNF + RESTRICTION LANGUAGE SYNTAX → Stage I → OBJECT RLS
OBJECT RLS + GRAMMAR & WORD DICTIONARY → Stage II → OBJECT OBG & WDO
OBJECT OBG & WDO + TEXT → Stage III → SENTENCE PARSES

Stage I
parses the grammar of the grammar of [English] (or the restriction language syntax, the RLS), which is a set of BNF statements describing the syntax of five components of the [English] grammar:
  1. The context-free component
  2. The lists
  3. The restrictions
  4. The dictionary canonical forms (lexical entry conventions)
  5. The dictionary
Stage II
parses the grammar and dictionary of [English], interpreted by the compiled grammar of the grammar of [English] in Stage I, and generates object grammar and dictionary. The input grammar and dictionary sources consist of
  1. The BNF declaration, switched on by a statement *BNF: a context-free component describing the grammar of [English]
  2. The attributes, lists, global functions, and types, switched on by a statement *LISTS
  3. The routines and restrictions, switched on by a statement *RESTR. There are three types of restrictions, distinguished by the first letter of their names: 'D' (disqualifying a BNF generation), 'W' (wellformedness), and 'T' (transformation).
  4. The dictionary canonical forms, switched on by a statement *WDCAN
  5. The dictionary, switched on by a statement *WD
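
To illustrate how the switching statements partition a source file, here is a small sketch. It assumes each directive sits alone on its own line, and the sample statements are placeholders, since the real statement syntax is not shown in this description. The helper names and the restriction name `WAGREE1` are hypothetical.

```python
SECTION_DIRECTIVES = ("*BNF", "*LISTS", "*RESTR", "*WDCAN", "*WD")

# First letter of a restriction name -> restriction type, as listed above.
RESTRICTION_TYPES = {
    "D": "disqualifying a BNF generation",
    "W": "wellformedness",
    "T": "transformation",
}


def split_sections(source):
    """Group source lines under the directive that switched them on."""
    sections = {}
    current = None
    for line in source.splitlines():
        stripped = line.strip()
        if stripped in SECTION_DIRECTIVES:  # exact match, so *WD != *WDCAN
            current = stripped
            sections.setdefault(current, [])
        elif current is not None and stripped:
            sections[current].append(stripped)
    return sections


def restriction_type(name):
    """Classify a restriction by the first letter of its name."""
    return RESTRICTION_TYPES[name[0].upper()]


SAMPLE = """
*BNF
placeholder bnf statement
*LISTS
placeholder list statement
*RESTR
placeholder restriction statement
*WDCAN
placeholder canonical form
*WD
placeholder dictionary entry
"""

sections = split_sections(SAMPLE)
print(list(sections))
print(restriction_type("WAGREE1"))
```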
Stage III
reads in a standardized source text document, performs tokenization and dictionary lookup (by a process called dlookup), using the compiled dictionary and lists, and passes the results to the parser, which uses the compiled grammar to map each source sentence into one or more grammatical parse trees.
The Compiler
The Compiler is a particular use of the basic MLP parser. In the first two stages, it is used as a syntax-directed compiler which translates the grammar of the grammar of [English] (i.e. the grammar of the user-oriented language) or the grammar of [English] from its input text form to list structure. Hence, the routines invoked by the parser in stage I and stage II programs are code generators, which construct the requisite list structure during the top-down analysis (Figure 4).


Figure 4
Organization of Stages I and II

GENERATORS
↓ ↑ ↓ ↑
COMPILER
|
LEXICAL PROCESSOR
|
DIRECTIVE PROCESSOR
(loading, updating, etc.)

In the output of stages I and II, the source text and generated list structure are combined in a single file called an object grammar or an object dictionary (named in analogy with the object program produced by a compiler). Once an object grammar (file name with the extension obg) or object dictionary (file name with the extension wdo) has been initially created, the user may specify modifications to it on a statement-by-statement basis. The system will compile the new statements and will insert, delete, or replace the source text and corresponding list structure in parallel.

In Stages I and II, a compiler (directive *COMPILE()) is used to create object grammars or dictionaries. One can also use an updating system (directive *MODIFY()), which was included in the compiler.

The function of the various stages, and format of the source input to these stages, will now be described.

The Parser
As indicated in Figures 4 and 5, the parser has "hooks" on it to permit various routines to be invoked during the top-down parse.

The core of the program is a very simple top-down parser for context-free grammars, which generates multiple parses of ambiguous sentences sequentially using a back-up mechanism. This parser, together with a table-driven lexical processor and a directive processor which invokes all the other system components, is present in the program for each of the three stages of the system.


Figure 5
Organization of Stage III

RESTRICTIONS
↓ ↑ ↓ ↑
PARSER
|
LEXICAL PROCESSOR
|
DIRECTIVE PROCESSOR
(loading, updating, etc.)

For stage III, the generators are replaced by a restriction interpreter (Figure 5). The grammar of English consists of a context-free component plus a set of restrictions, each of which is associated with one or more productions in the context-free component. These restrictions state conditions which the parse tree must meet if the analysis is to be accepted. Each time a node is added to the parse tree, and each time a level in the tree is completed, the parser invokes the restriction interpreter to execute those restrictions appearing in the corresponding production; the restriction interpreter returns a success or failure indication to the parser. If the restriction has succeeded, the parser continues normally (i.e., as if there had been no restriction). If the restriction has failed, the parser must either try an alternate option, or if all options in a production have been exhausted, dismantle part of the parse tree.
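
The interplay of the top-down parser and the restriction interpreter can be sketched in a few lines. This is a toy illustration, not the MLP parser: the grammar, lexicon, and the agreement restriction `w_agree` are invented, restrictions run only when a node is completed (not when it is first added), and Python generators stand in for the back-up mechanism.

```python
# Toy context-free component: each symbol maps to its list of options.
# It has no left recursion, which a plain top-down parser cannot handle.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("N",)],
    "VP": [("V", "NP"), ("V",)],
}

# Toy lexicon: word -> list of (category, number) readings.
LEXICON = {
    "patient": [("N", "sg")],
    "patients": [("N", "pl")],
    "has": [("V", "sg")],
    "coughs": [("V", "sg")],
    "cough": [("V", "pl"), ("N", "sg")],
}


def first_leaf(tree):
    """Follow first children down to a leaf (category, word, number)."""
    while len(tree) == 2:  # inner nodes are (symbol, children)
        tree = tree[1][0]
    return tree


def w_agree(node):
    """Hypothetical wellformedness restriction: subject and verb agree."""
    subject, predicate = node[1]
    return first_leaf(subject)[2] == first_leaf(predicate)[2]


# Restrictions attached to productions, keyed by the symbol being built.
RESTRICTIONS = {"S": [w_agree]}


def parse(symbol, words, pos):
    """Yield (tree, next_pos) for each way symbol can start at pos."""
    if symbol not in GRAMMAR:  # a word category: consult the lexicon
        if pos < len(words):
            for category, number in LEXICON.get(words[pos], []):
                if category == symbol:
                    yield (category, words[pos], number), pos + 1
        return
    for option in GRAMMAR[symbol]:
        for children, end in expand(option, words, pos):
            node = (symbol, children)
            if all(check(node) for check in RESTRICTIONS.get(symbol, [])):
                yield node, end
            # On failure nothing is yielded: the parser backs up and
            # tries the next alternative.


def expand(option, words, pos):
    """Yield (children, next_pos) for a sequence of symbols."""
    if not option:
        yield [], pos
        return
    for first, middle in parse(option[0], words, pos):
        for rest, end in expand(option[1:], words, middle):
            yield [first] + rest, end


def parses(sentence):
    """Return every complete parse of the sentence, in order."""
    words = sentence.lower().split()
    return [tree for tree, end in parse("S", words, 0) if end == len(words)]


print(parses("patients cough"))
print(parses("patients coughs"))  # rejected by the agreement restriction
```

Because `parse` is a generator, asking for the next result resumes the search where it left off, which mirrors producing multiple parses of an ambiguous sentence one after another.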

The parser is written entirely in C++. It consists of 22,359 source lines in about 100 subroutines, only some of which are included in the program for any one stage of the system.

The Dictionary Look-Up
The dictionary lookup (dlookup) was formerly part of the parser. It is now separated and forms an independent function, with three main jobs:
  1. A Tokenizer

    dlookup breaks the input sentence at the blank delimiters into a series of 2^(n-1) possible sentences (n: number of words), one for each way of joining adjacent words into multi-word tokens. The tokens of these sentences are matched against lexical entries to find the best matches. The candidates are evaluated for the greatest number of matched lexical entries and the fewest tokens, and one of the best matches is chosen as the sentence to be parsed.

  2. A generator of lexical entries

    dlookup automatically generates appropriate lexical categories and classes for:

    • standard numbers, times and dates
    • medical terms, according to a medical list
    • dose strings, according to a dose pattern list
    • organism terms, according to an organism list
    • geographic nouns, according to a geographic list
    • patient nouns, according to a patient list
    • institution/ward/service nouns, according to an institution list
    • physician/staff nouns, according to a staff list

  3. A reader of lexical entries from the main dictionary
The dictionary lookup is 7,573 lines long, and creates a list of lexical entries with all categories and attributes for the input sentence to be picked up by the parser for parsing.
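
The tokenizer step described above can be sketched as follows, reading the count of candidate sentences as 2^(n-1), that is, one candidate per way of joining or splitting at each of the n-1 blanks. How the two criteria are combined is an assumption: this sketch ranks by the number of matched entries first and uses the number of tokens as a tie-breaker. The entry set and function names are hypothetical.

```python
from itertools import product

# Hypothetical lexical entries; note the multi-word entry.
ENTRIES = {"pain", "in", "right lower quadrant"}


def segmentations(words):
    """Yield every grouping of adjacent words into tokens (2**(n-1) in all)."""
    if not words:
        return
    for joins in product((False, True), repeat=len(words) - 1):
        tokens = [words[0]]
        for word, join in zip(words[1:], joins):
            if join:
                tokens[-1] += " " + word  # glue onto the previous token
            else:
                tokens.append(word)  # start a new token
        yield tokens


def best_segmentation(sentence, entries):
    """Pick the candidate with the most matched entries, then fewest tokens."""
    words = sentence.split()
    if not words:
        return []

    def score(tokens):
        matched = sum(token in entries for token in tokens)
        return (matched, -len(tokens))

    return max(segmentations(words), key=score)


print(len(list(segmentations("a b c d".split()))))
print(best_segmentation("pain in right lower quadrant", ENTRIES))
```

Enumerating all candidates is exponential in sentence length, so this form is only suitable for illustration on short inputs.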