CSCI-GA.2590 - Natural Language Processing - Spring 2013 Prof. Grishman

Lecture 12 Outline

April 23, 2013

Term projects: presentations.

Statistical Parsers, cont'd

Evaluating Parsers

Constituent Parsers: the accuracy of constituent parsers is stated in terms of labeled constituent recall/precision/F-measure when compared to a standard parse. Comparison against a standard parse is feasible because UPenn parses have become such a widely used standard. (Text, section 14.7.) (Software at http://nlp.cs.nyu.edu/evalb/.)
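As a rough illustration of the labeled constituent measures, the sketch below treats each constituent as a (label, start, end) triple and counts exact matches. The function name prf and the toy spans are made up for this example, and the evalb software applies further scoring conventions that are not modeled here.

```python
from collections import Counter

def prf(gold, test):
    """Labeled precision, recall and F-measure.

    gold, test: lists of (label, start, end) constituents.
    A test constituent counts as correct only if label and span both match.
    """
    g, t = Counter(gold), Counter(test)
    matched = sum((g & t).values())          # multiset intersection
    precision = matched / sum(t.values()) if t else 0.0
    recall = matched / sum(g.values()) if g else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Toy example (invented spans): the parser mislabels one NP as PP.
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
test = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
precision, recall, f_measure = prf(gold, test)
print(precision, recall, f_measure)
```

Here three of four constituents match in each direction, so precision, recall and F-measure are all 0.75.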

Dependency Parsers: the accuracy of dependency parsers is generally stated in terms of the fraction of tokens for which the proper head and dependency label are assigned; unlabeled dependency accuracy (proper head only) may also be reported (see, for example, Nivre and Scholz).
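A minimal sketch of the two dependency measures, usually called labeled and unlabeled attachment score (LAS and UAS). The function name, the label names and the four-token example are invented for illustration.

```python
def attachment_scores(gold, test):
    """Return (labeled, unlabeled) attachment scores.

    gold, test: one (head, label) pair per token, in the same order.
    """
    n = len(gold)
    unlabeled = sum(g[0] == t[0] for g, t in zip(gold, test)) / n
    labeled = sum(g == t for g, t in zip(gold, test)) / n
    return labeled, unlabeled

# Toy example: all heads are right, one label is wrong.
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "dobj")]
test = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "iobj")]
las, uas = attachment_scores(gold, test)
print(las, uas)
```

In this example the unlabeled score is 1.0 and the labeled score is 0.75.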

Typical scores for constituent and dependency parsers are now in the low 90s (percent).

Fast Dependency Parsers

Dependency parses can be generated easily from constituent parses, so we can generate a constituent parse using a CKY parser in time O(n^3) and then convert it to a dependency parse. In the past few years, there has been considerable interest in producing dependency parses directly and quickly.

One benefit of dependency trees is that they can handle discontinuous constituents, as in "A hearing is scheduled on the issue today" [Nivre and McDonald 2008]. These are not common in English, but are much more common in languages with freer word order. Trees with such crossing edges are termed non-projective dependency parses.
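Crossing edges can be detected mechanically. The sketch below checks every pair of arcs, including the arc from an artificial root at position 0. The head assignment given for the example sentence is one plausible analysis assumed for illustration, not taken from the cited paper.

```python
from itertools import combinations

def is_projective(heads):
    """heads maps each token position (1-based) to its head (0 = root).

    Returns False if any two arcs cross.
    """
    arcs = [tuple(sorted((h, d))) for d, h in heads.items()]
    for (a, b), (c, d) in combinations(arcs, 2):
        if a < c < b < d or c < a < d < b:
            return False
    return True

# 1 A  2 hearing  3 is  4 scheduled  5 on  6 the  7 issue  8 today
# Assumed analysis: "on the issue" attaches to "hearing",
# "today" attaches to "scheduled".
hearing = {1: 2, 2: 3, 3: 0, 4: 3, 5: 2, 6: 7, 7: 5, 8: 4}
print(is_projective(hearing))                # False: hearing->on crosses scheduled->today
print(is_projective({1: 2, 2: 0, 3: 2}))     # True
```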

We will discuss three general approaches: graph-based, transition-based, and easy-first.

Graph-based parsers

Graph-based parsers make an exhaustive search of possible dependency structures, seeking the highest-scoring tree. The score of a tree is the product (or sum) of the scores of the individual arcs; the score of an arc may represent its probability (as for a probabilistic constituent grammar) or some other linear combination of features. There are implementations which are similar to CKY parsers and take time O(n^3) (J. Eisner, COLING 1996).
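To make the arc-factored scoring concrete, here is a brute-force sketch that enumerates every head assignment, keeps those that form a tree, and returns the one with the highest sum of arc scores. It is exponential in sentence length and is not Eisner's O(n^3) algorithm; it only illustrates what is being maximized. The score table is invented, and the sketch does not require a single root word.

```python
from itertools import product

def is_tree(heads):
    """heads[i-1] is the head of word i; 0 is the artificial root.

    True if every word reaches the root without revisiting a word.
    """
    for i in range(1, len(heads) + 1):
        seen, node = set(), i
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

def best_tree(score):
    """score[h][d] is the score of the arc h -> d (h in 0..n, d in 1..n)."""
    n = len(score) - 1
    best, best_heads = float("-inf"), None
    for heads in product(range(n + 1), repeat=n):
        if not is_tree(heads):
            continue
        total = sum(score[h][d] for d, h in enumerate(heads, 1))
        if total > best:
            best, best_heads = total, heads
    return best_heads, best

# Invented scores for "She ate fish" (column 0 is unused).
score = [
    [0, 1, 10, 1],   # root -> She / ate / fish
    [0, 0, 2, 1],    # She  -> ...
    [0, 9, 0, 8],    # ate  -> ...
    [0, 1, 3, 0],    # fish -> ...
]
print(best_tree(score))   # heads (2, 0, 2) with total score 27
```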

Transition-based parsers

Transition-based parsers (also called shift-reduce parsers) are deterministic left-to-right parsers. They are similar to the parsers used for programming languages. Given an input sequence and a stack, at each step the parser can push the next word onto the stack or link the top item on the stack with the next word in the input. The decision is made by a classifier which takes as inputs the current state of the stack and a limited number of words ahead in the input sequence. The classifier is trained by converting each dependency tree to a transition sequence which generates that tree. This is a linear-time (O(n)) algorithm.
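The sketch below shows one possible transition system of this kind, with three actions: SHIFT, LEFT (the stack top becomes a dependent of the next word) and RIGHT (the next word becomes a dependent of the stack top, which is then moved back to the front of the input). The action names and this particular variant are assumptions for illustration; the notes do not fix a specific system. In place of a trained classifier, the demo uses an oracle that reads the gold tree, which is also how a tree would be converted into a transition sequence for training. It assumes a projective tree.

```python
from collections import deque

def oracle(gold):
    """Build a stand-in 'classifier' that consults the gold tree."""
    def choose(stack, buffer, heads):
        if stack and buffer:
            s, b = stack[-1], buffer[0]
            if gold.get(s) == b:
                return "LEFT"
            # b may be attached only after all of its own dependents are.
            b_done = all(d in heads for d, h in gold.items() if h == b)
            if gold.get(b) == s and b_done:
                return "RIGHT"
        return "SHIFT"
    return choose

def parse(n, choose):
    """Parse words 1..n; 0 is the artificial root, initially on the stack."""
    stack, buffer = [0], deque(range(1, n + 1))
    heads, actions = {}, []
    while buffer:
        action = choose(stack, buffer, heads)
        if not stack or (action == "LEFT" and stack[-1] == 0):
            action = "SHIFT"                 # guard against invalid choices
        actions.append(action)
        if action == "LEFT":
            heads[stack.pop()] = buffer[0]
        elif action == "RIGHT":
            heads[buffer[0]] = stack[-1]
            buffer[0] = stack.pop()          # head returns to the input front
        else:
            stack.append(buffer.popleft())
    return heads, actions

gold = {1: 2, 2: 0, 3: 2}                    # "She ate fish", invented example
heads, actions = parse(3, oracle(gold))
print(heads == gold, actions)
```

Each word is shifted and attached a bounded number of times, so the number of transitions grows linearly with sentence length.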

Making deterministic decisions with limited look-ahead limits the accuracy of the parser. The effect can be reduced by maintaining multiple states. Beam search keeps the top beam-width states. Equivalent states can be merged (Huang and Sagae 2010).

One widely-used parser of this type is the MALT parser.

Easy-First

Reference: Goldberg and Elhadad, NAACL 2010

Easy-first parsers are deterministic bottom-up parsers. In contrast to transition-based parsers, they do not necessarily build their structures from left to right; at each step they select the best pair of neighbors to link. In that way, an action may have more information about its right context than in a transition-based parser.

The parsing algorithm is quite simple: we initialize pending to the sequence of words in the sentence. At each step there are 2*(|pending|-1) possible actions, two for each adjacent pair: attaching the i-th element of pending above or below the i+1-st element. A score function, based on a linear combination of features around i and i+1, assigns a score to each possible action; we choose the action with the highest score. The element attached as a dependent is removed from pending, and the process repeats until a single element, the root, remains.
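A minimal sketch of this loop follows. The score function here is a lookup table invented for the example; a real parser would compute a weighted sum of features around the two neighbors. The function and variable names other than pending are assumptions.

```python
def easy_first_parse(n, score):
    """Parse words 1..n. score(head, child, pending) returns a float."""
    pending = list(range(1, n + 1))
    heads = {}
    while len(pending) > 1:
        candidates = []
        for i in range(len(pending) - 1):
            left, right = pending[i], pending[i + 1]
            # Two actions per adjacent pair: either neighbor can be the head.
            candidates.append((score(left, right, pending), left, right))
            candidates.append((score(right, left, pending), right, left))
        _, head, child = max(candidates)     # highest-scoring action
        heads[child] = head
        pending.remove(child)                # the dependent leaves pending
    heads[pending[0]] = 0                    # last remaining element is the root
    return heads

# Invented scores for "She ate fish": attaching She under ate is "easiest".
table = {(2, 1): 5.0, (2, 3): 4.0}

def toy_score(head, child, pending):
    return table.get((head, child), 0.0)

print(easy_first_parse(3, toy_score))
```

With these scores the parser first attaches word 1 under word 2, then word 3 under word 2, leaving word 2 as the root.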

The training algorithm adjusts the weights of the scoring function. For each sentence, for each processing step, if the scoring function selects a valid reduction, we perform that reduction and continue. If it selects an invalid action, we reduce the weights associated with the invalid action and increase the weights associated with the valid action.
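The update step can be sketched as a perceptron-style adjustment. Two things here are assumptions rather than statements from the notes: the definition of a valid action (the arc is in the gold tree and the dependent has already collected all of its own gold dependents), and the unit step size. Feature strings are invented.

```python
from collections import defaultdict

def is_valid(head, child, gold, heads):
    """Assumed validity test for attaching child under head.

    gold maps each word to its gold head; heads holds the arcs built so far.
    """
    if gold[child] != head:
        return False
    return all(d in heads for d, h in gold.items() if h == child)

def update(weights, chosen_feats, valid_feats):
    """Penalize the invalid action that was chosen, reward a valid one."""
    for f in chosen_feats:
        weights[f] -= 1.0
    for f in valid_feats:
        weights[f] += 1.0

gold = {1: 2, 2: 0, 3: 2}                    # "She ate fish", invented example
weights = defaultdict(float)

# Suppose the scorer picked "attach ate under She" (invalid) at the first step.
if not is_valid(1, 2, gold, {}):
    update(weights, ["head=She,child=ate"], ["head=ate,child=She"])
print(dict(weights))
```

After the update the feature of the invalid action has weight -1.0 and the feature of the valid action has weight 1.0.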

Performance: Goldberg reports accuracy slightly better than a shift-reduce (MALT) parser but slightly below a graph-based parser. Running time is O(n log n), from computing the max at each of n steps. In practice the dominant cost is feature calculation, which is O(n).

Overall speed-up

Improvements in speed due to the shift away from CKY and graph-based models over the past few years have been dramatic, moving from 4 sentences per second (e.g., Charniak parser) to 75 sentences per second (Tratz/Hovy easy-first parser) with little change in parse accuracy. Even faster systems (e.g., the ClearNLP parser, a transition-based parser) are available with some loss of accuracy.