G22.2590 - Natural Language Processing - Spring 2006 - Prof. Grishman

Lecture 3 Outline

January 31, 2006

Parsers and their Problems, cont'd

Problems with the top-down backtracking parser
Problems with the bottom-up parser
Top-down chart parser (Earley Algorithm) (J&M 10.4)
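A minimal sketch of a top-down chart recognizer in the Earley style may help make the chart idea concrete. The toy grammar, the toy lexicon, and the names `State` and `earley_recognize` are assumptions made for illustration and are not taken from J&M. The sketch only recognizes (it does not build parse trees) and assumes the grammar has no empty productions.

```python
from collections import namedtuple

# A dotted rule: lhs -> rhs, with the dot before rhs[dot]; `start` is the
# input position where this constituent began.
State = namedtuple("State", "lhs rhs dot start")

# Hypothetical toy grammar and lexicon, for illustration only.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("Det", "N"), ("N",)],
    "VP": [("V", "NP"), ("V",)],
}
LEXICON = {"the": {"Det"}, "dog": {"N"}, "cats": {"N"}, "chases": {"V"}}


def earley_recognize(words, grammar=GRAMMAR, lexicon=LEXICON, start_symbol="S"):
    """Return True if `words` can be derived from `start_symbol`."""
    chart = [[] for _ in range(len(words) + 1)]

    def add(state, position):
        # Each state is entered at most once per chart position.
        if state not in chart[position]:
            chart[position].append(state)

    add(State("GAMMA", (start_symbol,), 0, 0), 0)  # dummy start state
    for i in range(len(words) + 1):
        j = 0
        while j < len(chart[i]):  # chart[i] may grow while we process it
            st = chart[i][j]
            j += 1
            if st.dot < len(st.rhs):
                nxt = st.rhs[st.dot]
                if nxt in grammar:
                    # Predictor: expand the expected nonterminal top-down.
                    for rhs in grammar[nxt]:
                        add(State(nxt, rhs, 0, i), i)
                elif i < len(words) and nxt in lexicon.get(words[i], ()):
                    # Scanner: the next word can have the expected category.
                    add(State(st.lhs, st.rhs, st.dot + 1, st.start), i + 1)
            else:
                # Completer: advance every state that was waiting for st.lhs.
                for prev in list(chart[st.start]):
                    if prev.dot < len(prev.rhs) and prev.rhs[prev.dot] == st.lhs:
                        add(State(prev.lhs, prev.rhs, prev.dot + 1, prev.start), i)
    return State("GAMMA", (start_symbol,), 1, 0) in chart[len(words)]
```

Because each dotted rule is stored once per position and reused, the recognizer avoids the repeated re-parsing of the same substring that a backtracking parser performs.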

Problem of ambiguity (p. 372)

Even if we can generate all the parses with reasonable efficiency, what do we do with all these parses (all but one of which are wrong)?  We will look at a number of different approaches to the problem.
Capturing constraints in a context-free grammar

Part-of-Speech Tagging (J&M chapter 8)

Role of parts-of-speech in grammar:  rules stated in terms of classes of words sharing syntactic properties

How fine should these classes be?
    Range of answers ... different part-of-speech 'tag' sets
    Penn Tag Set ... used to tag the Univ. of Pennsylvania Treebank (1 million words)  (p. 297)
       (a detailed manual about Penn part-of-speech tagging is available from the Penn Treebank Project web site.)

The tagging task:  determining the tag of each word
    Not trivial:  many common words have several tags
    How?
        A dictionary will tell us which tags are possible for a word, independent of context.
        We could parse the sentence, and see which tags are used in the parses, but that's an expensive
        and difficult process (we might not always get a parse).
        Instead, we develop separate part-of-speech taggers.
    Why?
        Help parsing (reduce ambiguity).
        Resolve pronunciation ambiguities (for text-to-speech).
        Resolve semantic ambiguities.
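The dictionary step described above can be sketched as follows. The tiny dictionary is hand-written and incomplete, and the names `TOY_TAG_DICTIONARY`, `possible_tags`, and `ambiguous_words` are hypothetical; the tags are Penn tags. The point is only that a lookup gives the possible tags for a word but cannot choose among them.

```python
# Hand-written toy dictionary (an assumption for illustration, not a real lexicon).
TOY_TAG_DICTIONARY = {
    "the": ["DT"],
    "book": ["NN", "VB"],
    "that": ["DT", "IN", "WDT", "RB"],
    "back": ["NN", "VB", "JJ", "RB", "RP"],
    "flight": ["NN"],
}


def possible_tags(word, dictionary=TOY_TAG_DICTIONARY):
    """Tags the dictionary allows for `word`, independent of context."""
    return dictionary.get(word.lower(), [])


def ambiguous_words(words, dictionary=TOY_TAG_DICTIONARY):
    """Words in the sentence for which the dictionary lists more than one tag."""
    return [w for w in words if len(possible_tags(w, dictionary)) > 1]
```

For the sentence "Book that flight", two of the three words remain ambiguous after lookup, which is why a separate tagging step that uses context is needed.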

Rule-based part-of-speech tagging (J&M 8.4)
    Ex:  Constraint-grammar tagger
    Needs large tagged corpus for testing
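A rough sketch in the spirit of a constraint-based tagger: every word starts with all the tags its dictionary entry allows, and hand-written constraints then eliminate readings that are impossible in context. The lexicon, the single rule, and all names below are assumptions for illustration; an actual constraint-grammar tagger uses a far larger lexicon and rule set.

```python
# Hypothetical toy lexicon: word -> set of possible Penn tags.
TOY_LEXICON = {
    "the": {"DT"},
    "book": {"NN", "VB"},
    "flight": {"NN"},
}


def no_verb_after_determiner(readings, i):
    """Constraint: drop the VB reading when the previous word can only be DT."""
    if i > 0 and readings[i - 1] == {"DT"}:
        return readings[i] - {"VB"}
    return readings[i]


def constraint_tag(words, lexicon=TOY_LEXICON, rules=(no_verb_after_determiner,)):
    """Return one set of surviving tags per word after applying all rules."""
    readings = [set(lexicon[w.lower()]) for w in words]
    changed = True
    while changed:  # repeat until no rule removes anything
        changed = False
        for i in range(len(readings)):
            for rule in rules:
                reduced = rule(readings, i)
                # Never remove the last remaining reading of a word.
                if reduced and reduced != readings[i]:
                    readings[i] = reduced
                    changed = True
    return readings
```

With this one rule, "the book" is reduced to a single reading per word, while "book the flight" keeps both readings for "book" because no constraint applies to it. Rules of this kind are checked against a tagged corpus, which is the testing role noted above.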

Statistical part-of-speech tagger (J&M 8.5)
    Needs large tagged corpus for training
    Unigram statistics (most common part-of-speech for each word) get us to about 90% accuracy
    For greater accuracy, need some information on adjacent words
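The unigram baseline can be sketched as below: count how often each word carries each tag in a tagged training corpus, then always assign the most frequent tag. The three-sentence corpus, the `NN` default for unknown words, and the function names are assumptions made for illustration; a real tagger would be trained on a large corpus such as the Penn Treebank.

```python
from collections import Counter, defaultdict

# Tiny hand-made training corpus (illustrative only).
TOY_TAGGED_CORPUS = [
    [("the", "DT"), ("book", "NN"), ("is", "VBZ"), ("good", "JJ")],
    [("book", "VB"), ("the", "DT"), ("flight", "NN")],
    [("the", "DT"), ("book", "NN"), ("sold", "VBD")],
]


def train_unigram(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_unigram(words, model, default="NN"):
    """Tag each word with its most common tag; unknown words get `default`."""
    return [(w, model.get(w.lower(), default)) for w in words]


model = train_unigram(TOY_TAGGED_CORPUS)
tagged = tag_unigram(["book", "the", "flight"], model)
```

Here "book" is tagged `NN` even in the imperative "book the flight", because the unigram model ignores context. This is the kind of error that information about adjacent words is meant to correct.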