Comlex Syntax

Comlex Syntax is a monolingual English dictionary of 38,000 head words intended for use in natural language processing. It is being developed by the Proteus Project at New York University under the auspices of the Linguistic Data Consortium (LDC) as one of the lexical resources that will make up "COMLEX" (COMmon LEXicon). Like other LDC products, Comlex Syntax is available to LDC members for both research and commercial use, with minimal legal restrictions on its usage. The first version of Comlex Syntax was delivered to the LDC in May 1994 and is available to members by ftp from the LDC. The LDC will eventually distribute the dictionary on CD-ROM. We plan to continue running quality checks on Version 1 and to provide updated versions to the LDC periodically for further distribution.

The dictionary includes entries for approximately 21,000 nouns, 8,000 adjectives and 6,000 verbs, all of which are marked with a rich set of syntactic features and complements. Nouns have 9 possible features and 9 possible complements; adjectives have 7 features and 14 complements; and verbs have 5 features and 92 complements. Other entries identify words as adverbs, prepositions, cardinal numbers, etc., without further specification. The noun, adjective and verb entries were created by a team of four linguistics graduate students, working half-time for approximately one year. Each ELF (enterer of lexical features) has been provided with a menu-based entry program, written in Lisp using the Garnet GUI package, which provides access to a concordance based on approximately 90 MB of text. Elves enter features and complements for verbs based on: (1) the concordance; (2) hard copy dictionaries; and (3) their individual usage.

Each lexical entry is organized as a nested set of feature-value lists, using a Lisp-style notation which, if needed, could be mapped into other forms, e.g., Prolog, SGML-marked text, etc. Each list consists of a type symbol followed by zero or more keyword-value pairs. Each value may in turn be an atom, a string, a list of strings, a feature-value list, or a list of feature-value lists. Keywords identify orthography (:orth), inflected forms (e.g., :plural, :pastpart, etc.), features (:features), subcategorization/complements (:subc), and other information. The subcategorization labels are mostly self-explanatory, e.g., verbs marked with "np" and "part-np" take "np" and "particle + np" complements respectively. Features include "apreq", which is marked on adjectives that can modify a numerically quantified NP, e.g., "the above-mentioned one hundred gorillas", where "above-mentioned" modifies the group of one hundred gorillas (each gorilla is not above-mentioned), and "ntitle", which is marked on nouns that occur as titles preceding names, e.g., "Prof. Mary Fitzburg". Some example lexical entries follow.

(verb		:orth "build"
		:subc ((np) (np-for-np) (part-np :adval ("up"))))

(noun		:orth "assertion"
		:subc ((noun-that-s) (noun-be-that-s)))

(adverb		:orth "even")

(adjective	:orth "above-mentioned"
		:features ((apreq) (attributive)))

(verb		:orth "abbreviate"
		:subc ((np-pp :pval ("to")) (np) (np-np-pred) (np-as-np))
		:features ((vveryving :pastpart t)))

(noun		:orth "Prof."
		:features ((ntitle)))
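
Because each entry is a type symbol followed by keyword-value pairs, the notation can be read mechanically. The following Python code is a minimal sketch of such a reader, not part of Comlex or its distribution; the names tokenize, parse and to_entry are hypothetical and chosen only for illustration. For brevity it turns both atoms and quoted strings into Python strings and assumes well-formed input with balanced parentheses.

```python
import re

# Tokens: an open paren, a close paren, a double-quoted string, or a bare atom.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')

def tokenize(text):
    """Split Lisp-style text into a flat list of tokens."""
    return TOKEN.findall(text)

def parse(tokens, pos=0):
    """Return (value, next_pos) for the expression starting at tokens[pos].

    Lists become Python lists; atoms and quoted strings become str
    (quotes stripped). Assumes balanced parentheses.
    """
    tok = tokens[pos]
    if tok == "(":
        items = []
        pos += 1
        while tokens[pos] != ")":
            item, pos = parse(tokens, pos)
            items.append(item)
        return items, pos + 1  # skip the closing paren
    if tok.startswith('"'):
        return tok[1:-1], pos + 1
    return tok, pos + 1

def to_entry(sexp):
    """Split a parsed entry into its type symbol and a keyword-to-value dict."""
    type_symbol, rest = sexp[0], sexp[1:]
    pairs = {rest[i]: rest[i + 1] for i in range(0, len(rest), 2)}
    return type_symbol, pairs

# The "build" entry from the examples above.
text = '(verb :orth "build" :subc ((np) (np-for-np) (part-np :adval ("up"))))'
sexp, _ = parse(tokenize(text))
kind, fields = to_entry(sexp)
print(kind, fields[":orth"])                    # verb build
print([frame[0] for frame in fields[":subc"]])  # ['np', 'np-for-np', 'part-np']
```

Once an entry is held as nested lists and a dictionary in this way, writing it out in another form such as Prolog terms or SGML-marked text is a matter of walking the same structure.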

We expect to complete Version 2 of Comlex Syntax in May of 1995. The two most significant changes will be: (1) an improvement in the quality and coverage of Comlex as the result of our own quality checks as well as feedback from users; and (2) a corpus, tagged with all of our verb complement classes and many of our verb features. Each of our lexical entries for verbs will include a list of tags, where each tag will consist of one feature or complement, the name of the source (Brown Corpus, Wall Street Journal, etc.) and a pointer to a corpus file. The tagged corpus will make it possible to gather statistics on the frequency of complements and features.
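
To illustrate how tags of this kind could support frequency statistics, here is a small Python sketch. It is an assumption about one possible representation, not the Version 2 format: the Tag class, its field names, the complement_frequencies function and the file pointers are all hypothetical, and the sample tags are invented for illustration rather than drawn from any corpus.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Tag:
    """One tag on a verb entry (hypothetical layout)."""
    label: str        # one feature or complement, e.g. "np" or "part-np"
    source: str       # name of the source corpus
    corpus_file: str  # pointer to a corpus file (invented path format)

def complement_frequencies(tags):
    """Count how often each feature or complement label was tagged."""
    return Counter(tag.label for tag in tags)

# Invented sample tags for the verb "build"; not real corpus data.
build_tags = [
    Tag("np", "Brown Corpus", "brown/file-a"),
    Tag("np", "Wall Street Journal", "wsj/file-b"),
    Tag("part-np", "Wall Street Journal", "wsj/file-c"),
]

counts = complement_frequencies(build_tags)
print(counts.most_common())  # [('np', 2), ('part-np', 1)]
```

Keeping the source name on each tag means the same counts could also be broken down by corpus.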

For more information about Comlex, please see the references listed below.

(The NYU Proteus Project also provides information about its other natural language processing research.)

Catherine Macleod, Adam Meyers, and Ralph Grishman
{macleod,meyers,grishman}@cs.nyu.edu

References

Grishman, Ralph, Catherine Macleod and Adam Meyers (1994). "Comlex Syntax: Building a Computational Lexicon", Presented at Coling 1994, Kyoto.

Macleod, Catherine and Ralph Grishman (1995). Comlex Syntax Reference Manual, Proteus Project, NYU.

Macleod, Catherine, Ralph Grishman and Adam Meyers (1994a). "The Comlex Syntax Project: The First Year", Presented at the 1994 ARPA Human Language Technology Workshop.

Macleod, Catherine, Ralph Grishman and Adam Meyers (1994b). "Creating a Common Syntactic Dictionary of English", Presented at SNLR: International Workshop on Sharable Natural Language Resources, Nara, August, 1994.

Macleod, Catherine, Ralph Grishman and Adam Meyers (1994c). "Developing Multiply Tagged Corpora for Lexical Research", Presented at the International Workshop on Directions of Lexical Research, Beijing, China, August, 1994.

Meyers, Adam, Catherine Macleod and Ralph Grishman (1994). "Standardization of the Complement Adjunct Distinction", Proteus Project Memorandum 64, Computer Science Department, New York University.

Wolff, Susanne Rohen, Catherine Macleod and Adam Meyers (1993). Comlex Word Classes Manual, Proteus Project, New York University.