What is LSP?

The Linguistic String Project (LSP) began in 1965 at New York University with funding from the U.S. National Science Foundation (NSF) to implement an English-language parsing program as a first step in the computer processing of the scientific literature. The goal was to facilitate the retrieval of specific information from texts in answer to queries by investigators. The parsing program was based on Linguistic String Analysis (Zellig S. Harris, String Analysis of Sentence Structure, The Hague: Mouton & Co., 1962) and an algorithm developed by Sager (N. Sager, Procedure for left-to-right recognition of sentence structure, T.D.A.P. No. 27, University of Pennsylvania, 1960). Both the program and the computer grammar used by the parser underwent successive implementations at NYU (N. Sager, Natural Language Information Processing, Addison-Wesley Publishing Co., 1981).

The system came to embody its own programming language (N. Sager & R. Grishman, "The restriction language for computer grammars of natural language," Communications of the ACM 18, 1975, pp. 390-400) and an extensive "dictionary" providing parts of speech and lexical subclass memberships of the words in the sentences to be parsed (E. Fitzpatrick & N. Sager, Appendix 3: The lexical subclasses of the LSP English grammar, op. cit. 1981, pp. 322-374, and "The lexical subclasses of the Linguistic String Parser," American Journal of Computational Linguistics, No. 2, 1974).

The main features of the program and grammar have remained intact over time, serving as the basis for further developments.

As a result of the parsing program's initial success, further funding was supplied by the National Library of Medicine (NLM) of the National Institutes of Health (NIH) to develop an application for processing the clinical narrative in patient documents. This work resulted in the Medical Language Processor system.

Theory & methods

The linguistic basis for LSP text processing and the organization of the programs that carry it out have been summarized, first in N. Sager's "Syntactic analysis of natural language" (Advances in Computers, vol. 8, Academic Press, 1967) and, after further developments, in N. Sager's "Natural language information formatting" (Advances in Computers, vol. 17, Academic Press, 1978, pp. 89-162).

For a quick overview, consult these sections of the 1978 publication cited above:

Section 2: Principles and Methods of Analysis (pp. 97-99)
2.1 The Form–Content Relation in Language
2.2 Sublanguage Grammar
Section 3: Computer Programs for Information Formatting (pp. 116-120)
3.1 Linguistic Framework: String analysis, Relation to information, Relation to Transformations
3.2 Representation of the Grammar

Sublanguage grammar (Zellig S. Harris, Mathematical Structures of Language, Section 5.9, Wiley Interscience Publishers, 1968, pp. 152-155) provided the framework for specializing the language processing to apply to texts in a particular scientific subfield. This occurred in the LSP context in relation to medical documents, in particular, to the narrative found in reports of clinic visits, hospital discharge summaries, and the like. The resulting system comprises five stages of processing to arrive at a standard representation of the content of the documents:

  • The Parsing component obtains a syntactic analysis of each input sentence in the form of a parse tree. The linguistic string character of the analysis is maintained by a typing of tree nodes such that routines are enabled to carry out linguistic string-based operations.

  • The Selection component reads in the selection "grammar" (procedures) and the medical word-class co-occurrence patterns, and applies them as filters on the parse tree to resolve ambiguities.

  • The Transformation component decomposes each sentence into its basic canonical sentences.

  • The Regularization component operates on the transformed parse tree so as to obtain a uniform connective structure among component assertions (or assertion-level fragments) in each sentence. The connective structure is arranged according to the Polish (or prefix) notation, but the XML allows the actual connective words to appear between the conjuncts.

  • The Information Formatting component maps each parsed, transformed, and regularized tree into a medically labeled representation based on the medical subclasses of the words covered by the tree. The template of the mapping is called an Information Format (IF). At present, 11 IFs are defined.

This system was named the Medical Language Processor (MLP). The term "MLP" is also used in a general sense to refer to the processing of medical texts, regardless of who performs it or by what means.
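
The prefix (Polish) arrangement mentioned for the Regularization stage can be illustrated with a small sketch. This is not the LSP or MLP implementation: the class names Assertion and Connected, the function to_prefix, and the sample sentence are all hypothetical, and each component assertion is simplified to a plain string. The sketch only shows how a connective structure written with the connective between its conjuncts can be reordered so that each connective precedes its two arguments.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Assertion:
    """Leaf node: one component assertion (hypothetical; simplified to a string)."""
    text: str


@dataclass(frozen=True)
class Connected:
    """Infix node: a connective word joining a left and a right conjunct."""
    left: "Node"
    connective: str
    right: "Node"


Node = Union[Assertion, Connected]


def to_prefix(node: Node) -> List[str]:
    """Flatten a connective structure into prefix order.

    Each connective is emitted before its two arguments. Because every
    connective here is binary, the flat list is unambiguous without brackets.
    """
    if isinstance(node, Assertion):
        return [node.text]
    return [node.connective] + to_prefix(node.left) + to_prefix(node.right)


# Illustrative input only: "(fever AND cough) BUT no rash" as three assertions.
sentence = Connected(
    Connected(Assertion("patient had fever"), "and", Assertion("patient had cough")),
    "but",
    Assertion("patient had no rash"),
)

print(to_prefix(sentence))
# ['but', 'and', 'patient had fever', 'patient had cough', 'patient had no rash']
```

In the prefix form the outermost connective ("but") comes first, so a later stage can read the overall structure of the sentence before visiting any individual assertion.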