Lecture 6: Question Answering
Comment on clustering and page rank: Teoma, which powers AskJeeves,
seems to claim to do the following: (a) cluster documents; (b)
compute page rank within each cluster.
Suggested reading
Scaling Question Answering to the Web
Cody Kwok, Oren Etzioni, and Daniel Weld.
Question Answering in Webclopedia Ed Hovy et al.
(CiteSeer Reference Page)
Web Question Answering: Is More Always Better?
Susan Dumais et al. (CiteSeer Reference Page)
Natural Language Query: Document retrieval
Google etc. "Allow" natural language queries; strip off stop words,
process as ordinary query.
AskJeeves. No clear description offered. Extensive manual editing
of important web pages.
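A minimal sketch of the "strip off stop words, process as ordinary query" step. The stop-word list and the function name strip_stop_words are illustrative assumptions, not what any particular engine uses.

```python
import re
from typing import List

# Hypothetical, tiny stop-word list; real engines use much larger ones.
STOP_WORDS = {"a", "an", "the", "is", "was", "of", "in", "to", "did",
              "who", "what", "when", "where", "how"}

def strip_stop_words(question: str) -> List[str]:
    """Lower-case the question, drop punctuation, and remove stop words."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(strip_stop_words("Who was the first American in space?"))
```

The remaining keywords are then submitted as an ordinary keyword query.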
Natural Language Query: Information retrieval
QA: Text(s) vs. query, both in NL.
2 approaches.
1. Interpret entire text into representation. Answer query from
representation.
2. Skim text for promising passages. Search for answer in passages.
All the usual difficulties and issues of NL (e.g. ambiguity, paraphrase,
combining information across sentences, inference).
Problems specific to QA
Answer categorization. Consider e.g. "Who" questions, where
the answer is always a person or set of people (or anthropomorphized
entity, but we'll skip that.)
- Q: Who wrote "Alice in Wonderland?"
A: Lewis Carroll.
Category: Name.
- Q: Who was Lewis Carroll?
A: The author of "Alice in Wonderland"
Category: What is he famous for: Unique identifier?
- Q: Who was Ty Cobb?
A: A famous baseball player.
Category: What is he famous for: Non-unique descriptor.
- Q: Who, actually, was "Lewis Carroll"?
A: Charles Dodgson.
Category: Actual name associated with alternative name.
- Q: Who is the Attorney General?
A1: John Ashcroft.
A2: The cabinet secretary who heads the Justice Department.
Category: Definition of a title in terms of a role.
- Q: If two cars arrive simultaneously at a four-way stop, who has
the right of way?
A: The driver on the right.
Category: Characteristic in an abstract situation.
- Q: Who is on the Supreme Court?
A: William Rehnquist, Antonin Scalia,
Ruth Bader Ginsburg ...
Category: Set of names.
- Q: Who meets at the G-8 summits?
A: Representatives of Britain,
Canada, France, Germany, Italy, Japan, Russia, and the US.
Category: Set of descriptors.
Note: Genuine ambiguity (i.e. not resolvable by human understander) much
more common here than in most kinds of text. Must be resolved by
user interaction.
"When" and "where" questions just about as problematic to categorize.
"What", "Which", "How", and "Why" much harder. At least
Yes/No questions are easy to categorize.
Non-sentence information in documents (e.g. tables.) Seems as though
it might be easier; actually probably harder.
Rather than try to generate a well-worded answer, QA systems generally
try to return a small segment of text that contains the answer.
The wider the window of text allowed, the more reliably this can be
done. (e.g. in TREC-9 there was "short answer" of 50 bytes and
"long answer" of 250 bytes; long answer often 50% more accurate than
short answer.)
Problems specific to QA on Web
Unreliability of information more a problem than in standard document
retrieval. Engine giving an answer to question naturally read
as endorsement of answer.
Immense volume makes the approach of preprocessing all pages into an
information store (as in the START system) impractical.
MULDER
(Cody Kwok, Oren Etzioni, Dan Weld)
General Components:
- Retrieval engine: Find documents on Web.
- Query formulation: NL question to retrieval query.
- Answer extraction: Answer question from documents retrieved.
Specific components
- Natural language parser: Question -> Parse tree [Preexisting]
- Question classifier: Parse tree -> category
- Query formulator: Parse tree -> Set of queries
- Web search engine: Query -> Web pages [Preexisting]
- Summary extraction and ranking: Web page, query -> snippets.
- Answer extraction: Web page, snippet -> candidate answers
- Answer selection: Rank and present answers.
Question classifier
Want to identify three categories of questions
- Nominal: Answer is a noun phrase.
- Numerical: Answer is a number
- Temporal: Answer is a date.
Four syntactic categories:
- wh-adjective: "How" followed by adjective e.g. "How many", "How tall" etc.
Categorized as numerical
- wh-adverb. Interrogative followed by verb. Categorize by type of
interrogative. "When" is temporal, rest are nominal.
- wh-noun. "What" or "which" followed by noun. E.g. "What car",
"What height", "What year". Categorized by the noun, using the WordNet
semantic network of words.
- No interrogative (E.g. "Name a book ...") Assume nominal.
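The four syntactic cases above can be sketched as follows. This is a simplification under stated assumptions: it looks only at the first two words instead of a parse tree, and the small tables ADJECTIVES and NOUN_CATEGORY are hypothetical stand-ins for the parser and the WordNet lookup.

```python
# Hypothetical word lists; the system described uses a parser and WordNet.
ADJECTIVES = {"many", "much", "tall", "far", "long", "old", "high"}
NOUN_CATEGORY = {"year": "temporal", "date": "temporal", "day": "temporal",
                 "height": "numerical", "distance": "numerical"}

def classify(question: str) -> str:
    """Return 'nominal', 'numerical', or 'temporal' for a question."""
    words = question.lower().rstrip("?").split()
    if not words:
        return "nominal"
    first = words[0]
    second = words[1] if len(words) > 1 else ""
    if first == "how" and second in ADJECTIVES:
        return "numerical"                         # wh-adjective
    if first in ("what", "which"):
        return NOUN_CATEGORY.get(second, "nominal")  # wh-noun
    if first == "when":
        return "temporal"                          # wh-adverb, "when"
    return "nominal"          # other wh-adverbs, or no interrogative

print(classify("How tall is Mt. Everest?"))
```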
Query formulation:
1. Search for target sentences.
E.g. if the question is "Who was the first American in space", search for
"The first American in space was ..." (Presumably also query on
separate keywords?)
2. Various reformulations at different levels of generality.
- Question to sentence fragment transformation: E.g. "Who was the first
American in space" -> "was the first American in space", "the
first American in space was". "Who shot JFK" -> "shot JFK".
- Verb conversion: E.g. "When did Nixon visit China?" -> "Nixon visited
China ..." Preexisting inflecting module.
- Query expansion. Replace adjective by attribute noun. E.g.
"How tall is Mt. Everest" -> "The height of Mt. Everest is"
4 search engine queries on average.
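A rough sketch of the question-to-sentence-fragment transformation, working on word lists rather than on a parse tree. The names fragments, WH_WORDS and COPULAS are illustrative assumptions; verb conversion and query expansion are not covered.

```python
from typing import List

WH_WORDS = {"who", "what", "which", "when", "where"}
COPULAS = {"is", "was", "are", "were"}

def fragments(question: str) -> List[str]:
    """Drop the wh-word; if a copula follows, also move it to the end."""
    words = question.rstrip("?").split()
    if not words or words[0].lower() not in WH_WORDS:
        return []
    rest = words[1:]
    out = [" ".join(rest)]
    if rest and rest[0].lower() in COPULAS:
        out.append(" ".join(rest[1:] + [rest[0]]))
    return out

print(fragments("Who was the first American in space"))
print(fragments("Who shot JFK"))
```

Each fragment would be sent to the search engine as a quoted phrase.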
Summary extractor
Extract chunks of up to 40 words containing query keywords, if possible at
sentence boundaries. Score each chunk, giving high scores to chunks in which
many keywords appear together. Select the N best summaries.
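A simplified sketch of the summary extractor. Assumptions: chunks are single sentences truncated to max_words, and the score is just the number of distinct query keywords in the chunk; the actual scoring function is not specified in these notes.

```python
import re
from typing import List

def best_snippets(text: str, keywords: List[str],
                  n: int = 2, max_words: int = 40) -> List[str]:
    """Return the n sentences containing the most distinct keywords."""
    keys = {k.lower() for k in keywords}
    scored = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = sentence.split()[:max_words]
        tokens = {w.strip(".,;:!?\"'()").lower() for w in words}
        score = len(keys & tokens)
        if score:
            scored.append((score, " ".join(words)))
    scored.sort(key=lambda p: -p[0])  # stable: ties keep document order
    return [s for _, s in scored[:n]]

text = ("Alan Shepard was the first American in space. He flew in 1961. "
        "Yuri Gagarin was the first human in space.")
print(best_snippets(text, ["first", "American", "space"], n=1))
```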
Answer extractor
Parse snippets. Collect phrases of the expected answer type (noun phrase, number, date).
Answer selection
- Score candidates by how close they are to keywords, proper syntactic
position relative to query.
- Cluster by similarity (e.g. "Alan B. Shepard" = "Shepard")
- Score of cluster = sum of scores of elements.
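The clustering and voting steps can be sketched as below. The similarity test used here (one candidate's words are a subset of another's) is an assumption chosen to reproduce the "Alan B. Shepard" = "Shepard" example; the candidate scores are made up for illustration.

```python
from typing import List, Tuple

def cluster_answers(cands: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Merge candidates contained in a longer one; cluster score = sum."""
    clusters = []  # each entry: [representative, word set, total score]
    # Visit longer candidates first so they become the representatives.
    for text, score in sorted(cands, key=lambda c: -len(c[0].split())):
        words = set(text.lower().replace(".", "").split())
        for c in clusters:
            if words <= c[1]:
                c[2] += score
                break
        else:
            clusters.append([text, words, score])
    return sorted(((c[0], c[2]) for c in clusters), key=lambda p: -p[1])

# Hypothetical candidate scores.
print(cluster_answers([("Shepard", 2.0), ("Alan B. Shepard", 1.5),
                       ("John Glenn", 3.0)]))
```

Note how the two Shepard variants together outscore a candidate that is individually stronger.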
Some sample questions
Who is the President of Ghana? Answer: Rawlings.
What is the name of the rare neurological disease with symptoms
such as: involuntary movements (tics), swearing, and incoherent
vocalizations (grunts, shouts etc.)? Answer: Tourette's syndrome.
How far is Yaroslavl from Moscow? Answer: 280 miles.
What country is the biggest producer of tungsten? Answer: China
When did the Jurassic period end? Answer: 144 million years ago.
Evaluation measures
Percentage correct: percentage of questions for which the correct answer is
found.
User effort:
If correct answer appears first at Kth page {
if answer appears in Kth snippet
then sum of lengths of first K snippets
else sum of lengths of first K snippets + length of Kth page. }
Plot percentage correct vs. user effort.
Finding: MULDER has much lower user effort at a given level of percentage
correct.
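The user-effort measure above can be written out as follows. Assumptions: length is measured in characters, "answer appears" is tested by substring match, and None is returned when the answer is never found; none of these choices is fixed by the notes.

```python
from typing import List, Optional

def user_effort(snippets: List[str], pages: List[str],
                answer: str) -> Optional[int]:
    """Effort to reach the first result (rank K) containing the answer."""
    for k, (snippet, page) in enumerate(zip(snippets, pages), start=1):
        read = sum(len(s) for s in snippets[:k])  # first K snippets
        if answer in snippet:
            return read
        if answer in page:
            return read + len(page)  # user must also read the Kth page
    return None

print(user_effort(["abc", "xyz Shepard"],
                  ["abc def", "xyz Shepard more"], "Shepard"))
```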
WordNet is a standard tool for dealing with English words.
Nouns, verbs, adjectives, and adverbs are organized in synonym sets.
Each synonym set corresponds to a concept. Concepts are linked by relations.
Includes both single words and standard phrases (collocations), e.g. man-eating
shark, blue-collar, etc.
Web interface for
WordNet 1.5
Webclopedia
Similar general architecture.
QA typology
Extensive QA hierarchical typology
Each node annotated with examples, typical
patterns of answering text.
Question typology
Patterns collected automatically:
Collection
- Get instances of question and answer of given type (e.g. for type
"BIRTHYEAR", pair "Mozart", "1756"; "Washington", "1732")
- Search for sentences with both terms close together.
- Extract and abstract phrase containing both. Tabulate common
patterns. E.g. "Mozart (1756-1791)" abstracted to
"[person] ([birthyear] - number)".
Evaluation
- Collect sentences with query term matching pattern.
- Determine frequency with which pattern supplies correct answer.
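A small sketch of the collection step for the BIRTHYEAR type. Assumptions: the abstracted phrase is simply the span between the two placeholders (so the trailing "- number)" of the example is not captured), and the sentences are made-up inputs rather than search results.

```python
from collections import Counter
from typing import List, Optional, Tuple

def abstract(sentence: str, person: str, year: str) -> Optional[str]:
    """Replace both terms by placeholders and keep the span joining them."""
    if person not in sentence or year not in sentence:
        return None
    s = sentence.replace(person, "[person]").replace(year, "[birthyear]")
    p, y = s.index("[person]"), s.index("[birthyear]")
    start = min(p, y)
    end = max(p + len("[person]"), y + len("[birthyear]"))
    return s[start:end]

def tabulate(sentences: List[str], pairs: List[Tuple[str, str]]) -> Counter:
    """Count how often each abstracted pattern occurs."""
    counts = Counter()
    for sentence in sentences:
        for person, year in pairs:
            pattern = abstract(sentence, person, year)
            if pattern:
                counts[pattern] += 1
    return counts

pairs = [("Mozart", "1756"), ("Washington", "1732")]
sentences = ["Mozart (1756-1791) was a composer.",
             "Washington (1732-1799) was the first president.",
             "Mozart was born in 1756."]
print(tabulate(sentences, pairs).most_common())
```

The evaluation step would then apply each pattern to fresh sentences and record how often the extracted year is correct.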
Web Question Answering: Is More Always Better?
(Susan Dumais et al.)
Addressed to information often repeated on the Web. If a fact is stated
often enough, it will eventually be stated in the form that you expect.
Hence, using a vast collection of documents (Web) reduces the problems
raised by paraphrase.
Steps
Rewrite query. Pattern matching from form of query.
Output. Set of triples [query-string; position of answer w.r.t string
(left, right, anywhere), weight.]
E.g. "Who created the character of Scrooge?" generates
"created the character of Scrooge", LEFT, 5
"the character of Scrooge was created by", RIGHT, 5
"created" AND "the character" AND "of Scrooge", ANY, 2
"created" AND "character" AND "Scrooge", ANY, 1
6.7 queries per question on average.
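A partial sketch of the rewrite step for "Who ..." questions. It produces only the first and last triples of the Scrooge example; the passive rewrite and the phrase-chunked conjunction need linguistic knowledge that is not modeled here. The stop-word list STOP and the function name rewrites are assumptions.

```python
from typing import List, Tuple

STOP = {"the", "of", "a", "an"}  # hypothetical stop-word list

def rewrites(question: str) -> List[Tuple[str, str, int]]:
    """Return (query string, answer position, weight) triples."""
    words = question.rstrip("?").split()
    if len(words) < 2 or words[0].lower() != "who":
        return []
    phrase = '"%s"' % " ".join(words[1:])
    content = [w for w in words[1:] if w.lower() not in STOP]
    conjunction = " AND ".join('"%s"' % w for w in content)
    # Exact phrase: answer expected to its left, high weight.
    # Keyword conjunction: answer may be anywhere, low weight.
    return [(phrase, "LEFT", 5), (conjunction, "ANY", 1)]

for triple in rewrites("Who created the character of Scrooge?"):
    print(triple)
```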
Extract and count common answers. Take summaries from
results page (hope that truncation not harmful). Quality
of answer A = sum over documents D of highest rank of pattern yielding
A on summary of D. (Each doc counts only once.)
Dickens: 117
Christmas Carol: 78
Charles Dickens: 75
Disney: 72
Carl Banks: 54 (cartoonist who created Donald Duck's Uncle Scrooge)
A Christmas: 41
uncle: 34
Filter answers by category (weakly).
Combine combinable answers. (E.g. "Dickens" gets folded into
"Charles Dickens"; "A Christmas" and "Christmas Carol" get combined
to "A Christmas Carol".)
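The answer-quality rule in the "Extract and count common answers" step can be sketched as below, reading "highest rank of pattern" as the highest rewrite weight that produced the answer in a given document's summary (an interpretation, not something the notes state outright). The match data is made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def score_answers(matches: List[Tuple[str, str, int]]) -> Dict[str, int]:
    """matches: (document id, candidate answer, weight of matching rewrite)."""
    best = {}  # (doc, answer) -> highest weight seen in that document
    for doc, answer, weight in matches:
        key = (doc, answer)
        best[key] = max(best.get(key, 0), weight)
    totals = defaultdict(int)
    for (_doc, answer), weight in best.items():
        totals[answer] += weight  # each document counts only once
    return dict(totals)

# Hypothetical matches.
print(score_answers([("d1", "Dickens", 5), ("d1", "Dickens", 2),
                     ("d2", "Dickens", 1), ("d2", "Disney", 5)]))
```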
MetaCrawler
The MetaCrawler Architecture for Resource Aggregation on the Web
by Erik Selberg and Oren Etzioni (1995)
Link to Metacrawler
List of metacrawlers (at SearchEngineWatch).
- Query reformulation. Query must be reformulated in the syntax of
each service. If a query feature is not in the query language of a service,
it must be approximated.
- Optional background download and check:
  - Check liveness of link.
  - If query feature was deleted in reformulation, can be checked.
  - New features added.
- Collation and Duplicate Removal.
- Reranking: Overall score is a function of rank from services.
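A minimal sketch of collation, duplicate removal and reranking. Assumptions: duplicates are detected by a crude URL normalization, and the overall score is the sum of 1/rank over services; the notes say only that the score is a function of rank, so this formula is a placeholder. The service names and URLs are made up.

```python
from collections import defaultdict
from typing import Dict, List

def normalize(url: str) -> str:
    """Crude normalization for duplicate detection (an assumption)."""
    return url.lower().rstrip("/")

def collate(results_by_service: Dict[str, List[str]]) -> List[str]:
    """Merge ranked lists, drop duplicates, rerank by summed 1/rank."""
    scores = defaultdict(float)
    for urls in results_by_service.values():
        for rank, url in enumerate(urls, start=1):
            scores[normalize(url)] += 1.0 / rank
    # Ties are broken alphabetically to keep the order deterministic.
    return sorted(scores, key=lambda u: (-scores[u], u))

print(collate({"A": ["http://x.org/", "http://y.org"],
               "B": ["http://y.org/", "http://z.org"]}))
```

A page returned by several services rises above a page ranked first by only one.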