G22.2590 - Natural Language Processing - Spring 2008 - Prof. Grishman

Lecture 1 Outline

January 24, 2008


Centrality of Natural Language
  • a primary (and natural) mode of human communication
  • representation for most recorded human knowledge
  • a very rich and flexible representation (when compared to most formal representations)
Applications of Natural Language Processing (NLP)
  • machine translation
  • grammar checking
  • dictation (IBM ViaVoice, Dragon NaturallySpeaking)
  • information retrieval / web search (primarily statistical -- minimal NLP)
  • question-answering systems
  • information extraction
      • from structured Web documents (Ex: Dealtime (http://www.dealtime.com/) and Froogle (http://www.froogle.com/) for products; FlipDog, from WhizBang! Labs, sold to Monster.com, for job ads)
      • name taggers (http://www.aktors.org/technologies/annie/)
      • from news (Ex: NYU Proteus-Bio (screenshot))
  • summarization
      • from news: Columbia's NewsBlaster
  • interactive systems … data base query, expert systems, help systems
Our Goal
  • create systems which can perform such applications: an engineering problem
      • the term "language engineering" has become popular, especially in Europe, to reflect this orientation
  • natural language processing systems are complex, and require good design techniques
      • modular approaches to break the problem up at appropriate points
      • formal models which reflect aspects of the structure of language

    Relation to other Fields

    Linguistics
      • goal of linguistics is to describe language
          • provide simple models which can predict language behavior
          • understand what is universal about language
          • through these formal models, understand how language can be acquired
      • formal models from linguistics have been of value in NLP, but the goals of linguistics are not the same as those of NLP
    Psycholinguistics
      • goal is to understand human performance in generating and analyzing language
      • little influence on NLP to date
          • some syntactic analyzers which try to mimic human performance & difficulties
    Artificial Intelligence & Machine Learning
      • AI is concerned primarily with generic problem solving strategies & suitable knowledge representations
      • there is an inherent link between AI and NLP: some NLP problems require the sort of deep reasoning addressed by AI
      • but NLP has found increasing success through avoiding deep reasoning, and so the link has weakened
    Statistics
      • statistical methods and models, originally used in signal processing, information theory, and physics, have become more widely used in NLP
      • easily trainable and easily computable models are more attractive now that lots of training data is available
    Analyzing our Needs: setting our agenda

    What functionality do we require in order to address NLP applications?

    Machine Translation

    People have been interested in machine translation since the earliest days of computing. At first, people imagined that machine translation is mostly a "data processing" task … a system looks up the words one at a time in a bilingual dictionary, and then maybe has to fix up the translation a bit. However, there is a lot more to do for machine translation:

    word segmentation: for some languages (such as Japanese and Chinese) there are no spaces between words, so it's not clear what the words are

    morphology: words appear in different forms, indicating singular vs. plural (for nouns), present tense vs. past tense (for verbs), nominative vs. accusative case, etc. English has only a few morphological forms, so it's possible to put them all in a dictionary. This isn't true of most Western languages; for example, a Spanish verb could have over 50 forms.

    syntax: word-for-word translation only works if the word order in the two languages is about the same; if it's not, we need to understand enough about the structure of the two languages (their syntax) to change from one word order to another. English has a rather fixed subject-verb-object order ("SVO"), while many more inflected languages have more variable word order.

    lexical semantics: many words are polysemous … they have multiple meanings. A word will have to be translated differently depending on its meaning in a particular context; otherwise the translation is likely to make little sense. For example, "bill" means both a statement of charges (an "invoice") and a part of a duck (its "beak"). It's not likely that any foreign language has a word with both these senses. If we chose the wrong sense in translating "bill" into a foreign language, it would be like reading the English sentence "At the end of the meal, the waiter presented the beak."

    discourse: in order to create a proper translation we sometimes have to look beyond the individual sentence. That can be true in selecting word senses. The need also arises in translating into English from languages where subject pronouns can be omitted; we need to figure out what the subject actually is, so that we can supply a "he" or "she" or "it" in English.
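
    As a rough illustration of the dictionary-lookup view of translation, and of how lexical semantics gets in its way, here is a minimal Python sketch. The Spanish entries, the cue words, and the function names are all invented for this example (they do not come from any real system), and counting cue-word overlap is only a stand-in for real word sense selection.

```python
# Toy English-to-Spanish entries, invented for illustration only.
# A flat dictionary stores one translation per word, so "bill" is
# locked into a single sense regardless of context.
FLAT_DICTIONARY = {
    "the": "el",
    "waiter": "camarero",
    "brought": "trajo",
    "bill": "pico",  # the "beak" sense
}

# Sense-aware entry for the same word, with hypothetical cue words per sense.
SENSES = {
    "bill": {
        "invoice": ("cuenta", {"waiter", "meal", "restaurant"}),
        "beak": ("pico", {"duck", "bird", "feathers"}),
    }
}


def tokenize(sentence):
    """Lowercase and split on whitespace, dropping a trailing period."""
    return sentence.lower().rstrip(".").split()


def translate_word_for_word(sentence):
    """Look up each word independently; mark unknown words with <...>."""
    return " ".join(FLAT_DICTIONARY.get(w, f"<{w}>") for w in tokenize(sentence))


def translate_ambiguous_word(word, sentence):
    """Choose the sense whose cue words overlap most with the sentence."""
    context = set(tokenize(sentence))
    translation, _cues = max(
        SENSES[word].values(), key=lambda entry: len(entry[1] & context)
    )
    return translation


sentence = "The waiter brought the bill."
print(translate_word_for_word(sentence))           # el camarero trajo el pico
print(translate_ambiguous_word("bill", sentence))  # cuenta
```

    The flat lookup produces exactly the "waiter presented the beak" kind of error. Even with the right sense chosen, a full translation would still need morphology and syntax: Spanish requires "la cuenta", not "el cuenta".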

    Information Extraction

    An information extraction system processes text and extracts information about a specific type of event or relationship. For example, one extraction system we've worked on reads newspaper articles and builds a database of executives who were hired for new management jobs. If it reads a sentence such as

        IBM hired Fred Smith as president.

    it would create a table entry

        person       company   position
        Fred Smith   IBM       president

    We can find some of these items by simple pattern matching, looking for something like

        <word> hired <word> <word> as <word>

    but this won't get us very far. For better performance, we need

    name recognition: a company name may be several words ("General Motors"), a person may have a title or middle name ("Mr. Smith", "Fred X. Smith")

    syntax: the information may appear in the passive ("Fred Smith was hired by IBM") or in a relative clause ("Fred Smith, who was hired by IBM"); also, there may be extra modifiers ("IBM yesterday hired Fred Smith as president")

    lexical semantics: there may be lots of synonyms for hired ("appointed", "named", …) which the system should recognize

    discourse - pronouns: if a pronoun appears in a relevant sentence, the system has to figure out what the pronoun refers to ("Fred Smith left Compaq last week. IBM hired him yesterday as president.")

    We will be studying (and testing) information extraction applications over the course of the semester.
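
    The simple hiring pattern discussed in this section can be written as a regular expression. The sketch below is a literal rendering of it in Python, run on sentences taken from the examples above to show where it breaks down. The function name and the dictionary layout of the table entry are assumptions made for illustration.

```python
import re

# Literal rendering of "<word> hired <word> <word> as <word>":
# one-word company, two-word person, one-word position.
HIRE_PATTERN = re.compile(r"(\w+) hired (\w+ \w+) as (\w+)")


def extract_hire(sentence):
    """Return a table row as a dict, or None if the pattern does not match."""
    match = HIRE_PATTERN.search(sentence)
    if match is None:
        return None
    company, person, position = match.groups()
    return {"person": person, "company": company, "position": position}


examples = [
    "IBM hired Fred Smith as president.",
    "General Motors hired Fred Smith as president.",  # multi-word company
    "IBM yesterday hired Fred Smith as president.",   # extra modifier
    "Fred Smith was hired by IBM.",                    # passive
]
for sentence in examples:
    print(sentence, "->", extract_hire(sentence))
```

    Only the first sentence yields the intended row. The second and third match but report the company as "Motors" and "yesterday", and the passive sentence is missed entirely, which is why name recognition and syntactic analysis are needed.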

    Interactive Command and Query

    A number of systems have been constructed to serve as natural language front ends for data base query. For complex queries, they spare the user the need to learn a formal query language. On the other hand, they run into difficulty if the user keeps asking questions the system cannot answer.

    The system has to translate a natural language query into a formal data base query. It has to be able to accept a wide range of queries or it will be worse than useless (it will be much more frustrating than a formal query language --- it won't really be "natural language"). To do so it needs to analyze

    syntax: it needs to divide the query into phrases which correspond, roughly, to different data base attributes:

        Which customers | bought | green widgets | last week?

    lexical semantics: it needs to figure out how these phrases map into data base relations

    quantifier semantics: if the user says "List all the customers who bought more than five widgets.", the system has to figure out the quantifier structure

    discourse - pronouns and sentence fragments: in asking a sequence of questions, the user is likely to use pronouns and fragments to keep the queries short: "How many widgets did General Motors buy? How many kumquats? Did it buy any tangelos?"

    dialog: a good interactive system will be responsive in its replies, pointing out false assumptions and asking for clarification:

        "How many programmers in the child care department make over $50,000?"
        "There are no programmers in the child care department."

        "How many people live in Washington?"
        "Washington, D.C. or the State of Washington?"

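    To make the quantifier question concrete, the sketch below shows two formal queries that "customers who bought more than five widgets" might map to. The table layout, the data, and the choice of SQL (run through Python's standard sqlite3 module) are assumptions for illustration only; these notes do not specify a particular data base or query language.

```python
import sqlite3

# Hypothetical schema and data, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, product TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("Acme", "widget", 3),
        ("Acme", "widget", 4),    # 7 widgets in total, never more than 5 at once
        ("Bolt", "widget", 6),    # a single purchase of more than 5
        ("Crane", "widget", 2),
        ("Crane", "kumquat", 9),  # not widgets
    ],
)

# Reading 1: more than five widgets summed over all of a customer's purchases.
TOTAL_READING = """
    SELECT customer FROM purchases
    WHERE product = 'widget'
    GROUP BY customer
    HAVING SUM(quantity) > 5
    ORDER BY customer
"""

# Reading 2: at least one single purchase of more than five widgets.
SINGLE_PURCHASE_READING = """
    SELECT DISTINCT customer FROM purchases
    WHERE product = 'widget' AND quantity > 5
    ORDER BY customer
"""


def customers(sql):
    """Run a query and return the customer names as a list."""
    return [row[0] for row in conn.execute(sql)]


print(customers(TOTAL_READING))            # ['Acme', 'Bolt']
print(customers(SINGLE_PURCHASE_READING))  # ['Bolt']
```

    The two readings return different customers from the same data, so a front end that picks one without analyzing the query (or asking for clarification) may silently answer a different question from the one the user meant.
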
    Natural language is a very rich and powerful communication medium. If we are to build systems which can utilize this medium, we must analyze language at several levels:

    syntax: what is the structure of a sentence?
    semantics: what is the meaning of a sentence (in isolation)?
    discourse: how can a sentence be interpreted in context?
    dialog: how is language used to exchange information?