MLP Medical Language Processing

GRAMMAR BUILDING LEVEL 2
The Tower Text, getting started

The Basic Grammar and Dictionary — eg2X, wd2X

Read also Homework 2X.

Files in use:
Grammar 2X
Dictionary 2X
Parse directive 2X
Sentences 2X
The old stone tower
tower.ocf

The file eg2X.txt is the source form of the grammar that will be used, after successive updates, to parse a short text, The Old Stone Tower. The text appears at the end of this file and, in the form of input to the parser, in the file tower.ocf. When compiled, the 'object' grammar is eg2X.obg, and its associated symbol table is eg2X.sym. We will refer collectively to these three files as 'eg2X'. The same convention will apply to the successively enriched grammars eg2A, eg2B, eg2C, eg2D.

The parser inputs include a grammar, a dictionary of the words in the text, and the sentences of the text (each with its sentence identifier SID). Each grammar eg2X, eg2A, eg2B, eg2C, eg2D has a correspondingly enhanced word dictionary in both source form and compiled form. Thus, on the level of eg2X, we have wd2X.src (source form) and wd2X.wdo (compiled form).

The grammar eg2X is essentially the same grammar as that in the Restriction Language Manual (RLM) Table 1, pp. 22-23, with a few differences described below. Because it is so elementary it cannot handle many constructions, such as relative clauses, passives, and conjunction strings. These and other problems will be dealt with in creating the successors to eg2X and wd2X, each level of grammar and dictionary building upon its predecessor. At the end of this process, the eg2D level grammar and dictionary will parse tower.txt.

These graded grammars all treat grammatical issues in a manner similar to their treatment in the more complete grammar of N. Sager (Natural Language Information Processing: A Computer Grammar of English and Its Applications, Addison-Wesley, 1981). For example, the names of the BNF definitions, word categories (parts of speech), and attributes used in the graded grammars of these lessons are the same as those that play the corresponding roles in the more complete computer grammar of English.

The differences between the grammar eg2X and the beginning grammar presented in the Restriction Language Manual are minor and are mostly tied in with the dictionary developed for this text. The attribute POBJLIST has been added for verb occurrences in passive constructions. The attribute NCOUNT of the RLM beginning grammar has been changed to NCOUNT1 to accord with the large MLP dictionary.

A new BNF definition <NSTGO> is added for occurrences of <NSTG> in object positions. It appears in the definitions of <PN> and <OBJECT> in place of <NSTG>. <NSTGO> is defined as <NSTG> but differs from it in carrying a case constraint.

Another change from the RLM grammar is the inclusion of LN and SENTENCE on the TYPE STRING LIST. LN stands for Left adjuncts of N. It is like a linguistic string in that its elements occur in a fixed order (two large cats, not large two cats), just as the elements SUBJECT, VERB, OBJECT are ordered in the ASSERTION string. But unlike the elements of the ASSERTION string, the elements of LN are optional (cats, two cats, large cats). The root node, SENTENCE, is string-like in that it has two required elements, a CENTER string followed by a period.
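The ordering property of LN can be illustrated with a small Python sketch. The slot names QUANTIFIER and ADJECTIVE and the tiny word table are hypothetical, chosen only to cover the examples above; they are not the actual elements of LN in eg2X.

```python
# Hypothetical slot names and word table, for illustration only.
LN_ORDER = ["QUANTIFIER", "ADJECTIVE"]
WORD_SLOT = {"two": "QUANTIFIER", "large": "ADJECTIVE"}

def is_valid_ln(words):
    """True if the words fill the LN slots in order, each slot at most once.

    Any slot may be left empty, so the empty list is also valid.
    """
    positions = [LN_ORDER.index(WORD_SLOT[w]) for w in words]
    return all(a < b for a, b in zip(positions, positions[1:]))
```

Under these assumptions, the left adjuncts of "two large cats", "two cats", "large cats", and "cats" are accepted, while those of "large two cats" are rejected.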

The dictionary wd2X includes all the words in the tower text plus a few extra words that were added to allow for a greater range of test sentences. The definitions in wd2X are simpler than the corresponding dictionary entries in the full MLP dictionary because the grammar, eg2X, is simpler than the full MLP English grammar. The BNF definitions in eg2X contain a limited set of word categories (parts of speech) as their terminal elements (ATOMIC nodes of the parse tree), and the restrictions in eg2X test for a limited set of word attributes. Note that some words have no definitions, e.g. "what" and "." (the period). These appear as literals in BNF definitions of the grammar. Some other words that appear as literals in wd2X will have definitions added in a later update.

With regard to the form of dictionary entries: In the source dictionary associated with the grammar, a definition consists of the word to be defined, followed by a list of its parts of speech ("categories"), separated by commas. If the word has attributes associated with particular category occurrences, these occur as a sublist following the category symbol, e.g. N: (SINGULAR) for a singular noun. For example, the definition for find in this notation is

FIND: N:(SINGULAR, NCOUNT1), V:(OBJLIST:(NSTGO)),
TV:(PLURAL, OBJLIST:(NSTGO)).

This says that find as a noun N (They made an unusual find) has the attributes SINGULAR and NCOUNT1, that find as an infinitive verb V (to find a tower, They did not find a tower) has the attribute OBJLIST, which in turn has the attribute NSTGO, and that find as a tensed verb TV (Archeologists find towers) has the attribute PLURAL and has the same OBJLIST attribute as the infinitive V.
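As a rough illustration of how such an entry is structured, the definition of find can be modeled as nested Python dictionaries. This is only a sketch for reading the notation; the helper name has_attribute is hypothetical, and the layout is not the compiled dictionary format used by the parser.

```python
# Illustrative model of the FIND entry: category -> attributes -> sub-attributes.
# An attribute with no sub-attributes maps to an empty dict.
FIND = {
    "N": {"SINGULAR": {}, "NCOUNT1": {}},
    "V": {"OBJLIST": {"NSTGO": {}}},
    "TV": {"PLURAL": {}, "OBJLIST": {"NSTGO": {}}},
}

def has_attribute(entry, category, *path):
    """Return True if `category` exists and carries the attribute reached via `path`."""
    if category not in entry:
        return False
    node = entry[category]
    for name in path:
        if name not in node:
            return False
        node = node[name]
    return True
```

For example, has_attribute(FIND, "TV", "OBJLIST", "NSTGO") holds, while has_attribute(FIND, "N", "PLURAL") does not.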

An alternative way of representing an attribute list is to write its contents on a separate numbered line and to refer to that line by its number in place of the parenthesized list. Thus,

FIND: N: (.11, SINGULAR), V: .12, TV: (.12, PLURAL).
.11 = NCOUNT1.
.12 = OBJLIST: (NSTGO).

The attributes SINGULAR and PLURAL are separated from the other attributes on an attribute list so that the remaining portions of the list can be shared.
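The sharing of numbered lines can be sketched in Python as follows. This is a minimal sketch, assuming a simple list representation invented here for illustration (a string beginning with "." is a line reference, and a tuple is an attribute with its own sublist); the real dictionary compiler may work quite differently.

```python
# Numbered lines, keyed by their reference (hypothetical representation).
LINES = {
    ".11": ["NCOUNT1"],
    ".12": [("OBJLIST", ["NSTGO"])],
}

# The FIND entry written with line references.
FIND = {
    "N": [".11", "SINGULAR"],
    "V": [".12"],
    "TV": [".12", "PLURAL"],
}

def resolve(items, lines):
    """Replace every line reference in `items` by the contents of that line."""
    out = []
    for item in items:
        if isinstance(item, tuple):
            name, sublist = item
            out.append((name, resolve(sublist, lines)))
        elif item.startswith("."):
            out.extend(resolve(lines[item], lines))
        else:
            out.append(item)
    return out
```

Resolving each category of FIND gives back the same attributes as the fully parenthesized definition; line .12 is written once but used by both V and TV.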

One more special feature of the source form dictionary is the use of so-called canonical forms. Many words have similar syntactic profiles. For example, many words are like find in being in the 3 categories N, V, and TV. Think of find, walk, talk, jump, swim, but not build or know or think (unless one accepts such sentences as I am going to have a think on this). A canonical form is simply a way of abbreviating the portion of a definition that is a case of a common syntactic profile. As an example, the canonical form (NVTV) is defined as follows:

(NVTV) = N: (.11, SINGULAR), V: .12, TV: (.12, PLURAL).

Using this canonical form, the definition for find can be written:

(NVTV) FIND.

.11 = NCOUNT1.

.12 = OBJLIST: .3.

.3 = NSTGO.

When the dictionary is compiled, canonical forms are expanded to create the full form of the source entry. The canonical forms used in wd2X are listed in the file: top.canforms.
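The expansion step can be pictured with a short Python sketch. It assumes, purely for illustration, that a canonical form is stored as the text of its definition and that an abbreviated entry has the shape "(FORM) WORD."; the function name expand is hypothetical and the actual compiler is not described here.

```python
# Canonical forms keyed by name; the value is the text they abbreviate.
CANONICAL_FORMS = {
    "(NVTV)": "N: (.11, SINGULAR), V: .12, TV: (.12, PLURAL).",
}

def expand(entry_line, forms):
    """Expand an abbreviated entry such as '(NVTV) FIND.' into its full source form."""
    form, word = entry_line.strip().rstrip(".").split(None, 1)
    return f"{word.strip()}: {forms[form]}"

full_entry = expand("(NVTV) FIND.", CANONICAL_FORMS)
```

Here full_entry is the string "FIND: N: (.11, SINGULAR), V: .12, TV: (.12, PLURAL).", which still refers to the numbered lines .11 and .12.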

Following are some linguistic comments on some of the attributes in the grammar and dictionary on the eg2X level.

For words such as find which are both verbs and nouns, the noun attribute NCOUNT1 is common: That was a real find, *That was real find. Similarly, for the noun group: We formed a group, *We formed group.

The case attributes NOMINATIVE and ACCUSATIVE apply to pronouns: I vs. me, he vs. him, she vs. her, etc. The constraint is that pronouns in subject position are not ACCUSATIVE, and those occurring as a verb object or in a prepositional phrase are not NOMINATIVE. Pronouns (e.g. you) which are neither NOMINATIVE nor ACCUSATIVE can occur in either position. The constraint is more easily applied in a restriction by defining the object-case <NSTG> as <NSTGO>, to which the case constraint applies. The object of the verb be (often called the predicate) is <NSTG>, not <NSTGO>.
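The logic of the case constraint can be sketched in Python. This is not Restriction Language, and the pronoun table and the function name case_ok are assumptions made for the illustration; it only restates the rule given above.

```python
# Case attribute of a few pronouns; None means neither NOMINATIVE nor ACCUSATIVE.
PRONOUN_CASE = {
    "I": "NOMINATIVE", "me": "ACCUSATIVE",
    "he": "NOMINATIVE", "him": "ACCUSATIVE",
    "she": "NOMINATIVE", "her": "ACCUSATIVE",
    "you": None,
}

def case_ok(pronoun, position):
    """Check the case constraint for a pronoun in the given position.

    `position` is "SUBJECT", "NSTGO" (verb object or object of a preposition),
    or "NSTG" (e.g. the object of be, where no case constraint is applied here).
    """
    case = PRONOUN_CASE[pronoun]
    if position == "SUBJECT":
        return case != "ACCUSATIVE"
    if position == "NSTGO":
        return case != "NOMINATIVE"
    return True
```

So case_ok("him", "SUBJECT") and case_ok("I", "NSTGO") fail, while "you" passes in both positions because it carries neither attribute.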

THE OLD STONE TOWER
(In G. G. Doty, J. Ross, Language and Life in the USA)

In Newport, Rhode Island, there is an old stone tower that was built many years ago. No one knows by whom it was built, but it must be very old. Some people think it may have been built by the Vikings, who may have come into the region before America was discovered by Columbus. A few years ago an investigation was begun by a group of scientists to find out who built the tower. They dug under it. They had to dig quite a while before anything was found. Finally some old buttons and Indian arrowheads were found, but the scientists never discovered who built the tower.