CSCI-GA.2590 - Natural Language Processing - Spring 2013 - Prof. Grishman

Lecture 1 Outline

January 29, 2013

Introduction

Centrality of Natural Language

Applications of Natural Language Processing (NLP)

Our Goal

create systems which can perform such applications: an engineering problem
the term "language engineering" has become popular, especially in Europe, to reflect this orientation
natural language processing systems are complex, and require good design techniques
modular approaches to break the problem up at appropriate points
formal models which reflect aspects of the structure of language

Relation to other Fields

Linguistics

goal of linguistics is to describe language
provide simple models which can predict language behavior
understand what is universal about language
through these formal models, understand how language can be acquired
formal models from linguistics have been of value in NLP, but the goals of linguistics are not the same as those of NLP

Psycholinguistics

goal is to understand human performance in generating and analyzing language
little influence on NLP to date
some syntactic analyzers which try to mimic human performance & difficulties

(symbolic) Artificial Intelligence

classical 'symbolic' AI is concerned primarily with generic problem solving strategies & suitable knowledge representations
there is an inherent link between AI and NLP: some NLP problems require the sort of deep reasoning addressed by these AI methods
but NLP (and AI) has found increasing success through avoiding deep reasoning and turning instead to statistics and machine learning

Statistics and Machine Learning

early NLP systems (before 1990) were purely symbolic and hand-crafted
statistical methods and models have become more widely used in NLP since the mid-1990s
easily trainable and easily computable models have for some NLP tasks proven much more effective than more complex hand-crafted models; furthermore, they have become more attractive now that lots of training data is available ("Big Data" initiatives)

Analyzing our Needs: setting our agenda

What functionality do we require in order to address NLP applications?

Machine Translation

People have been interested in machine translation since the earliest days of computing. At first, people imagined that machine translation is mostly a "data processing" task … a system looks up the words one at a time in a bilingual dictionary, and then maybe has to fix up the translation a bit. However, there is a lot more to do for machine translation:

word segmentation: for some languages (such as Japanese and Chinese) there are no spaces between words, so it's not clear what the words are

morphology: words appear in different forms, indicating singular vs. plural (for nouns), present tense vs. past tense (for verbs), nominative vs. accusative case, etc. English has only a few morphological forms, so it's possible to put them all in a dictionary. This isn't true of most Western languages; for example, a Spanish verb could have over 50 forms.

syntax: word-for-word translation only works if the word order in the two languages is about the same; if it's not, we need to understand enough about the structure of the two languages (their syntax) to change from one word order to another. English has a rather fixed subject-verb-object order ("SVO"), while many more inflected languages have more variable word order.

lexical semantics: many words are polysemous … they have multiple meanings. A word will have to be translated differently depending on its meaning in a particular context; otherwise the translation is likely to make little sense. For example, "bill" means both a statement of charges (an "invoice") and a part of a duck (its "beak"). It's not likely that any foreign language has a word with both these senses. If we chose the wrong sense in translating "bill" into a foreign language, it would be like reading the English sentence "At the end of the meal, the waiter presented the beak."

discourse: in order to create a proper translation we sometimes have to look beyond the individual sentence. That can be true in selecting word senses. The need also arises in translating into English from languages where subject pronouns can be omitted; we need to figure out what the subject actually is, so that we can supply a "he" or "she" or "it" in English.
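
As a rough illustration of the word segmentation problem listed above, the sketch below applies greedy longest-match segmentation to unspaced text. It is only a toy under stated assumptions: English with the spaces removed stands in for a language written without spaces, and the function name max_match and the tiny lexicon are invented for this example.

```python
def max_match(text, lexicon):
    """Greedy left-to-right longest-match segmentation (illustrative only)."""
    words = []
    i = 0
    max_len = max(len(w) for w in lexicon)
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # when no lexicon entry starts at position i.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


# Hypothetical toy lexicon; a real system would use a large dictionary.
LEXICON = {"the", "theta", "table", "bled", "down", "own", "there"}

print(max_match("thetabledownthere", LEXICON))
```

With this lexicon the greedy choice yields "theta bled own there" rather than the intended "the table down there", which suggests why a dictionary alone does not settle what the words are.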

Information Extraction

An information extraction system processes text and extracts information about a specific type of event or relationship. For example, one extraction system we've worked on reads newspaper articles and builds a database of executives who were hired for new management jobs. If it reads a sentence such as

IBM hired Fred Smith as president.

it would create a table entry

person       company   position
Fred Smith   IBM       president

We can find some of these items by simple pattern matching, looking for something like

<word> hired <word> <word> as <word>

but this won't get us very far. For better performance, we need

name recognition: a company name may be several words ("General Motors"), a person may have a title or middle name ("Mr. Smith", "Fred X. Smith")

syntax: the information may appear in the passive ("Fred Smith was hired by IBM") or in a relative clause ("Fred Smith, who was hired by IBM"); also, there may be extra modifiers ("IBM yesterday hired Fred Smith as president")

lexical semantics: there may be lots of synonyms for hired ("appointed", "named", …) which the system should recognize

discourse - pronouns: if a pronoun appears in a relevant sentence, the system has to figure out what the pronoun refers to ("Fred Smith left Compaq last week. IBM hired him yesterday as president.")
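
The simple pattern above, and the ways it falls short, can be sketched with a regular expression. This is a minimal illustration, not the extraction system described in these notes: the function name extract_hire, the record layout, and the two test sentences not quoted above (the "General Motors" and "appointed" ones) are assumptions made for the example.

```python
import re

# Literal rendering of "<word> hired <word> <word> as <word>".
PATTERN = re.compile(r"(\w+) hired (\w+) (\w+) as (\w+)")


def extract_hire(sentence):
    """Return a person/company/position record, or None if nothing matches."""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    company, first, last, position = m.groups()
    return {"person": f"{first} {last}", "company": company, "position": position}


sentences = [
    "IBM hired Fred Smith as president.",             # the basic case
    "General Motors hired Fred Smith as president.",  # multi-word company name
    "Fred Smith was hired by IBM.",                   # passive
    "IBM yesterday hired Fred Smith as president.",   # extra modifier
    "IBM appointed Fred Smith as president.",         # synonym of "hired"
]
for s in sentences:
    print(s, "->", extract_hire(s))
```

Only the first sentence gives the right table entry. The second and fourth match but report "Motors" and "yesterday" as the company, and the passive and synonym cases are missed entirely, which is where name recognition, syntax, and lexical semantics come in.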

We will be studying (and testing) information extraction applications over the course of the semester.

Conversational Agents

Conversational agents have been constructed to serve as natural language front ends for such tasks as data base query, order taking, and system control. For complex requests, they spare the user the need to learn a formal query or command language. On the other hand, they run into difficulty if the user keeps asking questions the system cannot answer.

For data base query, the system has to translate a natural language query into a formal data base query. It has to be able to accept a wide range of queries or it will be worse than useless (it will be much more frustrating than a formal query language, and it won't really be "natural language"). To do so it needs to analyze

syntax: it needs to divide the query into phrases which correspond, roughly, to different data base attributes: Which customers | bought | green widgets | last week?

lexical semantics: it needs to figure out how these phrases map into data base relations

quantifier semantics: if the user says "List all the customers who bought more than five widgets.", the system has to figure out the quantifier structure

discourse - pronouns and sentence fragments: in asking a sequence of questions, the user is likely to use pronouns and fragments to keep the queries short: "How many widgets did General Motors buy? How many kumquats? Did it buy any tangelos?"

dialog: a good interactive system will be responsive in its replies, pointing out false assumptions and asking for clarification:

"How many programmers in the child care department make over $50,000?"
"There are no programmers in the child care department."

"How many people live in Washington?"
"Washington, D.C. or the State of Washington?"

Summary

Natural language is a very rich and powerful communication medium. If we are to build systems which can utilize this medium, we must analyze language at several levels:

syntax: what is the structure of a sentence?
semantics: what is the meaning of a sentence (in isolation)?
discourse: how can a sentence be interpreted in context?
dialog: how is language used to exchange information?