Lecture 6: Question Answering

A comment on clustering and PageRank: Teoma, which powers AskJeeves, appears to claim to do the following: (a) cluster documents, (b) compute PageRank within each cluster.

Suggested reading

Scaling Question Answering to the Web. Cody Kwok, Oren Etzioni, and Daniel Weld.
Question Answering in Webclopedia. Ed Hovy et al.
Web Question Answering: Is More Always Better? Susan Dumais et al.

Natural Language Query: Document retrieval

Google etc. "allow" natural language queries: strip off the stop words, then process as an ordinary keyword query.
AskJeeves: no clear description offered. Extensive manual editing of important web pages.
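The stop-word-stripping step can be sketched as follows (the stop-word list here is illustrative, not any engine's actual list):

```python
# Sketch: reduce a natural-language question to a keyword query by
# stripping stop words, then treating what remains as an ordinary query.
STOP_WORDS = {"who", "what", "when", "where", "which", "how", "why",
              "is", "was", "the", "a", "an", "of", "in", "to", "did"}

def to_keyword_query(question: str) -> str:
    tokens = question.lower().strip("?").split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(to_keyword_query("Who was the first American in space?"))
# first american space
```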

Natural Language Query: Information retrieval

QA: Text(s) vs. query, both in NL.

Two approaches:
1. Interpret entire text into representation. Answer query from representation.
2. Skim text for promising passages. Search for answer in passages.

All the usual difficulties and issues of NL (e.g. ambiguity, paraphrase, combining information across sentences, inference).

Problems specific to QA

Answer categorization. Consider e.g. "Who" questions, where the answer is always a person or set of people (or an anthropomorphized entity, but we'll skip that).

Note: Genuine ambiguity (i.e. not resolvable by human understander) much more common here than in most kinds of text. Must be resolved by user interaction.

"When" and "where" questions are just about as problematic to categorize. "What", "Which", "How", and "Why" are much harder. At least Yes/No questions are easy to categorize.
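The categorization above can be sketched as a simple lookup on the question word (the category labels are illustrative; real systems use richer hierarchies):

```python
# Sketch: map a question's wh-word to an expected answer category.
# "Who"/"When"/"Where" constrain the answer type well; "What"/"Which"/
# "How"/"Why" leave it open; auxiliary-initial questions are Yes/No.
def answer_category(question: str) -> str:
    first = question.lower().split()[0]
    easy = {"who": "PERSON", "when": "DATE", "where": "LOCATION"}
    if first in easy:
        return easy[first]
    if first in {"what", "which", "how", "why"}:
        return "UNRESTRICTED"   # much harder: type depends on the rest
    if first in {"is", "are", "do", "does", "did", "can"}:
        return "YES/NO"
    return "UNKNOWN"

print(answer_category("Who is the President of Ghana?"))    # PERSON
print(answer_category("When did the Jurassic period end?")) # DATE
```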

Non-sentence information in documents (e.g. tables). Seems as though it might be easier; actually it is probably harder.

Rather than try to generate a well-worded answer, QA systems generally try to return a small segment of text that contains the answer. The wider the window of text allowed, the more reliably this can be done. (e.g. in TREC-9 there was "short answer" of 50 bytes and "long answer" of 250 bytes; long answer often 50% more accurate than short answer.)

Problems specific to QA on Web

Unreliability of information is more of a problem than in standard document retrieval: an engine giving an answer to a question is naturally read as an endorsement of that answer.

Immense volume makes the approach of preprocessing all pages into an information store (as in the START system) infeasible.


MULDER (Cody Kwok, Oren Etzioni, Dan Weld)

General Components:

Specific components

Question classifier

Want to identify three categories of questions. Four syntactic categories:

Query formulation:

1. Search for target sentences.
E.g. if the question is "Who was the first American in space?", search for "The first American in space was ...". (Presumably also query on separate keywords?)

2. Various reformulations at different levels of generality.

4 search engine queries on average.
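A minimal sketch of the target-sentence reformulation, assuming questions of the simple form "Who/What is/was <NP>" (real MULDER handles many more forms, plus a keyword fallback):

```python
# Sketch: turn "Who was X?" into the declarative stem "X was", which
# should appear verbatim in a sentence that answers the question.
def target_sentence_query(question: str):
    words = question.rstrip("?").split()
    wh, verb, rest = words[0], words[1], words[2:]
    if wh.lower() in {"who", "what"} and verb.lower() in {"is", "was"}:
        stem = " ".join(rest)
        return ['"%s %s"' % (stem, verb),   # exact target sentence prefix
                " AND ".join(rest)]         # fallback: separate keywords
    return [" ".join(words)]                # unhandled form: pass through

for q in target_sentence_query("Who was the first American in space?"):
    print(q)
```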

Summary extractor

Extract chunks of up to 40 words containing query keywords, at sentence boundaries if possible. Evaluate by giving high scores when many keywords appear close together. Select the N best summaries.
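A rough sketch of the extractor, with an assumed scoring function (number of distinct query keywords in the window) standing in for MULDER's actual one, and ignoring the sentence-boundary refinement:

```python
# Sketch: slide a window of up to 40 words over the document, score each
# window by how many distinct query keywords it contains, keep the N best.
def best_summaries(text, keywords, window=40, n=3):
    words = text.split()
    scored = []
    for i in range(0, max(1, len(words) - window + 1)):
        chunk = words[i:i + window]
        lower = {w.lower().strip(".,") for w in chunk}
        score = sum(1 for k in keywords if k.lower() in lower)
        scored.append((score, " ".join(chunk)))
    scored.sort(key=lambda p: -p[0])
    return [s for _, s in scored[:n]]
```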

Answer extractor

Parse snippets. Collect phrases of the expected answer type (noun phrase, number, date).
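A sketch of type-based candidate extraction; plain regexes stand in here for the parser's number and date recognizers (noun-phrase extraction would need a real parser):

```python
# Sketch: pull candidate answers of the expected type from a snippet.
import re

PATTERNS = {
    "number": r"\b\d[\d,.]*\b",
    "date": r"\b(?:\d{4}|\d{1,2}/\d{1,2}/\d{2,4})\b",  # crude date shapes
}

def candidates(snippet: str, answer_type: str):
    return re.findall(PATTERNS[answer_type], snippet)

print(candidates("Yaroslavl is 280 miles from Moscow.", "number"))
# ['280']
```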

Answer selection

Some sample questions

Who is the President of Ghana? Answer: Rawlings.
What is the name of the rare neurological disease with symptoms such as involuntary movements (tics), swearing, and incoherent vocalizations (grunts, shouts, etc.)? Answer: Tourette's syndrome.
How far is Yaroslavl from Moscow? Answer: 280 miles.
What country is the biggest producer of tungsten? Answer: China
When did the Jurassic period end? Answer: 144 million years ago.

Evaluation measures

Percentage correct: percentage of questions for which the correct answer is found.
User effort:
If the correct answer first appears at the Kth result:
  if the answer appears in the Kth snippet,
     user effort = sum of the lengths of the first K snippets;
  else
     user effort = sum of the lengths of the first K snippets + the length of the Kth page.
Plot percentage correct vs. user effort.
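The measure can be written out directly (argument names are mine):

```python
# Sketch of the user-effort measure: the reader scans snippets in order;
# if the answer is in the Kth snippet, effort is the text read so far,
# otherwise they must also read the Kth page itself.
def user_effort(snippet_lengths, page_lengths, k, in_snippet):
    effort = sum(snippet_lengths[:k])      # first K snippets
    if not in_snippet:
        effort += page_lengths[k - 1]      # plus the Kth page
    return effort

print(user_effort([50, 60, 55], [4000, 3500, 5000], k=2, in_snippet=False))
# 50 + 60 + 3500 = 3610
```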

Finding: MULDER has much lower user effort at a given level of percentage correct.


WordNet is a standard tool for dealing with English words. Nouns, verbs, adjectives, and adverbs are organized into synonym sets (synsets), each corresponding to a concept; concepts are linked by relations. It covers both single words and standard phrases (collocations), e.g. man-eating shark, blue-collar, etc.
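The synset-plus-relations organization can be illustrated with a toy fragment (the data below is made up for illustration, not WordNet's actual content):

```python
# Minimal sketch of the WordNet idea: words map to synonym sets
# (synsets), and synsets are linked by relations such as hypernymy
# ("is-a"). Here each synset records its words and its hypernym synset.
synsets = {
    "s1": {"words": ["shark", "man-eating shark"], "hypernym": "s2"},
    "s2": {"words": ["fish"], "hypernym": "s3"},
    "s3": {"words": ["animal"], "hypernym": None},
}

def hypernym_chain(sid):
    """Follow hypernym links up to the root, collecting one word per synset."""
    chain = []
    while sid is not None:
        chain.append(synsets[sid]["words"][0])
        sid = synsets[sid]["hypernym"]
    return chain

print(hypernym_chain("s1"))  # ['shark', 'fish', 'animal']
```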

Web interface for WordNet 1.5


Webclopedia (Hovy et al.)

Similar general architecture.

QA typology

Extensive hierarchical QA typology. Each node is annotated with examples and typical patterns of answering text.

Question typology

Patterns collected automatically:



Dumais et al. (Microsoft)

Addressed to information often repeated on the Web. If a fact is stated often enough, it will eventually be stated in the form that you expect. Hence using a vast collection of documents (the Web) reduces the problems raised by paraphrase.


Rewrite query: pattern matching on the form of the query. Output: a set of triples [query string; position of answer relative to the string (LEFT, RIGHT, ANY); weight].

E.g. "Who created the character of Scrooge?" generates

"created the character of Scrooge", LEFT, 5
"the character of Scrooge was created by", RIGHT, 5
"created" AND "the character" AND "of Scrooge", ANY, 2
"created" AND "character" AND "Scrooge", ANY, 1

6.7 queries per question on average.
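A sketch of the rewrite step for "Who <verb> <rest>?" questions, producing triples at decreasing specificity as in the Scrooge example above; only the most and least specific rewrites are generated here, and the weights are illustrative:

```python
# Sketch: rewrite a question into (query string, answer position, weight)
# triples. An exact declarative phrase gets high weight; a plain
# conjunction of the words is the low-weight fallback.
def rewrites(question: str):
    words = question.rstrip("?").split()
    rest = words[1:]                        # drop the wh-word
    phrase = " ".join(rest)
    return [
        ('"%s"' % phrase, "LEFT", 5),       # exact phrase; answer precedes it
        (" AND ".join('"%s"' % w for w in rest), "ANY", 1),  # bag of words
    ]

for t in rewrites("Who created the character of Scrooge?"):
    print(t)
```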

Extract and count common answers. Take summaries from the results page (hoping that truncation is not harmful). Quality of answer A = sum over documents D of the highest rank of any pattern yielding A on the summary of D. (Each document counts only once.)

Dickens: 117
Christmas Carol: 78
Charles Dickens: 75
Disney: 72
Carl Banks: 54 (cartoonist who created Donald Duck's Uncle Scrooge)
A Christmas: 41
uncle: 34

Filter answers by category (weakly).

Combine combinable answers. (E.g. "Dickens" gets folded into "Charles Dickens"; "A Christmas" and "Christmas Carol" get combined into "A Christmas Carol".)
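The counting and folding steps can be sketched together (the input format and the fold rule, substring containment, are my assumptions):

```python
# Sketch of answer selection: per document, keep the best pattern weight
# that produced each candidate, sum over documents, then fold shorter
# candidates into longer ones that contain them as substrings.
from collections import defaultdict

def score_answers(extractions):
    """extractions: list of (doc_id, candidate_string, pattern_weight)."""
    best = defaultdict(dict)               # candidate -> doc -> best weight
    for doc, cand, w in extractions:
        best[cand][doc] = max(w, best[cand].get(doc, 0))
    # each document contributes once per candidate
    scores = {c: sum(per_doc.values()) for c, per_doc in best.items()}
    # fold "Dickens" into "Charles Dickens", etc.
    for short in sorted(scores, key=len):
        for long_ in list(scores):
            if short != long_ and short in long_ and short in scores:
                scores[long_] += scores.pop(short)
                break
    return scores
```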


The MetaCrawler Architecture for Resource Aggregation on the Web by Erik Selberg and Oren Etzioni (1995)

Link to Metacrawler

List of metacrawlers (at SearchEngineWatch).