G22.2590 - Natural Language Processing - Spring 2003 - Prof. Grishman
Lecture 1 Outline
January 23, 2003
Centrality of Natural Language
a primary (and natural) mode of human communication
representation for most recorded human knowledge
a very rich and flexible representation (when compared
to most formal representations)
Applications of Natural Language Processing (NLP)
dictation (IBM ViaVoice, Dragon NaturallySpeaking)
information retrieval / web search (primarily statistical -- minimal NLP)
from structured Web documents (Ex: Dealtime (http://www.dealtime.com/)
and Froogle (http://www.froogle.com/))
still not good enough for polished translations, but effective for producing
rough drafts for post-editing, or for getting the gist of a text
ex: Systran (http://www.systransoft.com/)
(Ex: FlipDog, from WhizBang! Labs, sold to Monster.com, for job ads)
from news (Ex: NYU Proteus-Bio (http://factum.cs.nyu.edu/))
from news: Columbia's NewsBlaster (http://www.cs.columbia.edu/nlp/newsblaster/)
interactive systems … data base query, expert systems, help systems
create systems which can perform such applications: an engineering
the term "language engineering" has become popular, especially in
Europe, to reflect this orientation
natural language processing systems are complex, and require good design
modular approaches to break the problem up at appropriate points
limited appeal with written input: people don't like to type a lot;
GUIs have been more effective
telephone-based customer support ... simple querying / reservations
formal models which reflect aspects of the structure of language
Relation to other Fields
goal of linguistics is to describe language
provide simple models which can predict language behavior
understand what is universal about language
through these formal models, understand how language can be acquired
formal models from linguistics have been of value in NLP, but its goals
are not the same as NLP:
- a single counterexample can invalidate a model as a linguistic theory,
but would not significantly lessen its value for NLP
goal is to understand human performance in generating and analyzing
- NLP must address all phenomena which arise in an application, while
linguistics may focus on select phenomena which give insight into the language
little influence on NLP to date
some syntactic analyzers which try to mimic human performance &
Artificial Intelligence & Machine Learning
AI is concerned primarily with generic problem solving strategies
& suitable knowledge representations
there is an inherent link between AI and NLP: some NLP problems require
the sort of deep reasoning addressed by AI
but NLP has found increasing success through avoiding deep reasoning,
and so the link has weakened
statistical methods and models, originally used in signal processing,
information theory, and physics, have become more widely used in NLP
easily trainable and easily computable models are more attractive now
that lots of training data is available
Analyzing our Needs: setting our agenda
What functionality do we require
in order to address NLP applications?
People have been interested in machine
translation since the earliest days of computing. At first, people imagined
that machine translation was mostly a "data processing" task … a system
looks up the words one at a time in a bilingual dictionary, and then maybe
has to fix up the translation a bit. However, there is a lot more to do
for machine translation:
word segmentation: for some
languages (such as Japanese and Chinese) there are no spaces between words,
so it's not clear what the words are
morphology: words appear in
different forms, indicating singular vs. plural (for nouns), present tense
vs. past tense (for verbs), nominative vs. accusative case, etc. English
has only a few morphological forms, so it's possible to put them all in
a dictionary. This isn't true of most Western languages; for example, a
Spanish verb could have over 50 forms.
syntax: word-for-word translation
only works if the word order in the two languages is about the same; if
it's not, we need to understand enough about the structure of the two languages
(their syntax) to change from one word order to another. English has a
rather fixed subject-verb-object order ("SVO"), while many more inflected
languages have more variable word order.
lexical semantics: many words
are polysemous … they have multiple meanings. A word will have to
be translated differently depending on its meaning in a particular context;
otherwise the translation is likely to make little sense. For example,
"bill" means both a statement of charges (an "invoice") and a part of a
duck (its "beak"). It's not likely that any foreign language has a word
with both these senses. If we chose the wrong sense in translating "bill"
into a foreign language, it would be like reading the English sentence
"At the end of the meal, the waiter presented the beak."
discourse: in order to create
a proper translation we sometimes have to look beyond the individual sentence.
That can be true in selecting word senses. The need also arises in translating
into English from languages where subject pronouns can be omitted; we need
to figure out what the subject actually is, so that we can supply a "he"
or "she" or "it" in English.
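The "bill" example under lexical semantics can be made concrete with a small sketch. Everything below is an illustrative assumption rather than part of any real translation system: the sense inventory, the cue words, and the function names are made up, and the target "language" is just English glosses so that the effect of a wrong sense choice is easy to see.

```python
# Toy sense inventory for one polysemous word. The cue words are invented
# for illustration; a real system needs far richer context modeling.
SENSES = {
    "bill": [
        ("beak", {"duck", "bird", "feathers"}),
        ("invoice", {"waiter", "meal", "pay", "restaurant"}),
    ],
}


def translate_naive(words):
    """Word-for-word lookup: always take the first listed sense."""
    return [SENSES[w][0][0] if w in SENSES else w for w in words]


def translate_with_context(words):
    """Pick the sense whose cue words overlap most with the sentence."""
    context = set(words)
    out = []
    for w in words:
        if w in SENSES:
            # max() keeps the first sense on ties, so with no cues at all
            # this falls back to the naive choice.
            gloss, _ = max(SENSES[w], key=lambda s: len(s[1] & context))
            out.append(gloss)
        else:
            out.append(w)
    return out


sentence = "at the end of the meal the waiter presented the bill".split()
print(" ".join(translate_naive(sentence)))
print(" ".join(translate_with_context(sentence)))
```

The naive lookup reproduces the "waiter presented the beak" error, while the cue-word overlap selects "invoice" here. Counting overlapping cue words is only one simple way to use context; it is a stand-in for illustration, not a claim about how sense selection is done in practice.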
An information extraction system
processes text and extracts information about a specific type of event
or relationship. For example, one extraction system we've worked on reads
newspaper articles and builds a database of executives who were hired for
new management jobs. If it reads a sentence such as
IBM hired Fred Smith as president.
it would create a table entry
We can find some of these items by
simple pattern matching, looking for something like
<word> hired <word> <word>
but this won't get us very far. For
better performance, we need
name recognition: a company
name may be several words ("General Motors"), a person may have a title
or middle name ("Mr. Smith", "Fred X. Smith")
syntax: the information may
appear in the passive ("Fred Smith was hired by IBM") or in a relative
clause ("Fred Smith, who was hired by IBM"); also, there may be extra modifiers
("IBM yesterday hired Fred Smith as president")
lexical semantics: there may
be lots of synonyms for hired ("appointed", "named", …) which the system
discourse - pronouns: if a
pronoun appears in a relevant sentence, the system has to figure out what
the pronoun refers to ("Fred Smith left Compaq last week. IBM hired him
yesterday as president.")
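To see why simple pattern matching "won't get us very far", here is a minimal sketch that turns the pattern <word> hired <word> <word> into a regular expression and runs it on variants of the example sentence. The optional "as <word>" part, the group names, and the function name are assumptions added for illustration.

```python
import re

# Literal version of the "<word> hired <word> <word>" pattern, with an
# optional "as <word>" for the position. Group names are illustrative.
PATTERN = re.compile(
    r"(?P<company>\w+) hired (?P<person>\w+ \w+)(?: as (?P<position>\w+))?"
)


def extract_hire(sentence):
    """Return a dict for the first match, or None if the pattern fails."""
    m = PATTERN.search(sentence)
    return m.groupdict() if m else None


examples = [
    "IBM hired Fred Smith as president.",
    "General Motors hired Fred Smith as president.",  # multi-word company
    "Fred Smith was hired by IBM.",                   # passive
    "IBM yesterday hired Fred Smith as president.",   # extra modifier
    "IBM appointed Fred Smith as president.",         # synonym for "hired"
]
for s in examples:
    print(s, "->", extract_hire(s))
```

Only the first sentence yields a correct entry. The multi-word company name is cut down to "Motors", the passive sentence produces a spurious match with "was" as the company, the modifier "yesterday" is taken as the company, and the synonym is missed entirely. Each failure corresponds to one of the needs listed above.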
We will be studying (and testing) information
extraction applications over the course of the semester.
Interactive Command and Query
A number of systems have been constructed
to serve as natural language front ends for data base query. For complex
queries, they spare the user the need to learn a formal query language.
On the other hand, they run into difficulty if the user keeps asking questions
the system cannot answer.
The system has to translate a natural
language query into a formal data base query. It has to be able to accept
a wide range of queries or it will be worse than useless (it will be much
more frustrating than a formal query language --- it won't really be "natural
language"). To do so it needs to analyze
syntax: it needs to divide
the query into phrases which correspond roughly to different data base
attributes: Which customers | bought | green widgets | last week?
lexical semantics: it needs
to figure out how these phrases map into data base relations
quantifier semantics: if the
user says "List all the customers who bought more than five widgets.",
the system has to figure out the quantifier structure
discourse - pronouns and sentence
fragments: in asking a sequence of questions, the user is likely to
use pronouns and fragments to keep the queries short: "How many widgets
did General Motors buy? How many kumquats? Did it buy any tangelos?"
dialog: a good interactive
system will be responsive in its replies, pointing out false assumptions
and asking for clarification:
"How many programmers in the child
care department make over $50,000?"
"There are no programmers in the
child care department."
"How many people live in Washington?"
"Washington, D.C. or the State
Natural language is a very rich and
powerful communication medium. If we are to build systems which can utilize
this medium, we must analyze language at several levels:
syntax: what is the structure
of a sentence?
semantics: what is the meaning
of a sentence (in isolation)?
discourse: how can a sentence
be interpreted in context?
dialog: how is language used
to exchange information?