Lecture 10: Natural Language Processing on the Web

Required Reading

Scaling Question Answering to the Web, Cody Kwok, Oren Etzioni, and Daniel Weld.
Learning Surface Text Patterns for a Question Answering System, Deepak Ravichandran and Eduard Hovy.
Unsupervised Named-Entity Extraction from the Web, Oren Etzioni et al.

Further Reading

Question Answering from the Web using Knowledge Annotation and Knowledge Mining Techniques, Jimmy Lin and Boris Katz.
Question Answering in Webclopedia, Ed Hovy et al.
Web Question Answering: Is More Always Better? Susan Dumais et al.

Rather than just matching keywords, a QA system can analyze the question, do a richer analysis of the Web documents, or both.

Google etc. "allow" natural language queries: they strip off stop words and process the rest as an ordinary keyword query.
AskJeeves. No clear description offered.

Question answering systems

Structured databases

Aranea (Lin and Katz) knows about 7 data sources (biography.com, Processed version of CIA world fact book, etc.)

HTML structured information on Web pages (tables, lists etc.)

Tables are extremely difficult; harder than NL texts. We'll discuss one system that uses lists below.

Non-textual information

Pictures, sound, etc.

NL text

Query answering architecture

Learning architecture

Question analysis

Questions are actually particularly difficult for NL processing. They encounter all the usual difficulties plus there is a ubiquitous ambiguity in determining what kind of answer is being sought.

Answer categorization. Consider e.g. "Who" questions, where the answer is always a person or set of people (or anthropomorphized entity, but we'll skip that.)

Note: Genuine ambiguity (i.e. not resolvable by a human understander) is much more common here than in most kinds of text. It must be resolved by user interaction (which, in practice, is generally "Please rephrase the question.")

"When" questions are easier; the main ambiguities involve precision (12:4; Jan. 11; 1943; the Mesozoic) and repeated events (Q: When is Thanksgiving? A: The fourth Thursday in November).

"Where" questions are ambiguous in precision and in repeated structures; also metaphorical rather than literal location. (E.g. "Where can I find information about treasury bonds?")

"What", "Which", "How", and "Why" are much harder to analyze. At least Yes/No questions are easy to categorize.
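The wh-word analysis above can be sketched as a lookup from the question's leading wh-phrase to an expected answer type. This is a toy illustration, not any particular system's classifier; real systems such as Webclopedia use a large hierarchical typology, and as noted, "what"/"which"/"how"/"why" need deeper analysis than a word lookup.

```python
WH_TYPES = {
    "who": "PERSON", "when": "DATE", "where": "LOCATION",
    "how far": "DISTANCE", "how many": "NUMBER",
}

def answer_type(question):
    """Map a question's leading wh-phrase to an expected answer type.
    Longer phrases are tried first so "how far" beats a bare "how"."""
    q = question.lower()
    for wh in sorted(WH_TYPES, key=len, reverse=True):
        if q.startswith(wh):
            return WH_TYPES[wh]
    return "UNKNOWN"  # "what"/"which"/"how"/"why" need real analysis
```

Note that the table deliberately has no entry for "what": the lookup returns UNKNOWN and defers to deeper processing.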

Rather than try to generate a well-worded answer, QA systems generally try to return a small segment of text that contains the answer. The wider the window of text allowed, the more reliably this can be done. (e.g. in TREC-9 there was "short answer" of 50 bytes and "long answer" of 250 bytes; long answer often 50% more accurate than short answer.)

Problems and opportunities for QA specific to Web


MULDER (Cody Kwok, Oren Etzioni, Dan Weld)

General Components:

Specific components

Question classifier

Want to identify three categories of questions. Four syntactic categories:

Query formulation:

1. Search for target sentences.
E.g. if the question is "Who was the first American in space?", search for "The first American in space was ..." (Presumably also query on separate keywords?)

2. Various reformulations at different levels of generality.

4 search engine queries on average.
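The query formulation step can be sketched as follows. This is a toy illustration of the idea, not MULDER's actual reformulation rules: the copula is moved into declarative position to search for target sentences, with a stop-word-stripped keyword query as fallback.

```python
import re

def reformulate(question):
    """Generate search-engine queries for a wh-question (illustrative sketch)."""
    q = question.strip().rstrip("?")
    queries = []
    # "Who/What was X" -> quoted declarative target sentence "X was"
    m = re.match(r"(?i)(who|what)\s+(was|is|were|are)\s+(.+)", q)
    if m:
        queries.append('"%s %s"' % (m.group(3), m.group(2)))
    # Fallback: keyword query with stop words stripped
    stop = {"who", "what", "when", "where", "is", "was", "the", "a", "of", "in"}
    queries.append(" ".join(w for w in q.split() if w.lower() not in stop))
    return queries
```

For "Who was the first American in space?" this yields the quoted target sentence '"the first American in space was"' plus the keyword query 'first American space'.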

Summary extractor

Extract chunks of up to 40 words containing query keywords, if possible at sentence boundaries. Evaluate by giving high scores to chunks with many keywords presented together. Select the N best summaries.
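A minimal sketch of this scoring scheme, assuming a simple sliding window over words (sentence-boundary handling and real tokenization omitted):

```python
def extract_summaries(text, keywords, window=40, n_best=3):
    """Score windows of up to `window` words by how many query keywords
    they contain; return the n_best highest-scoring chunks."""
    words = text.split()
    scored = []
    for i in range(max(1, len(words) - window + 1)):
        chunk = words[i:i + window]
        present = {w.lower().strip(".,;:!?") for w in chunk}
        score = sum(1 for k in keywords if k.lower() in present)
        scored.append((score, " ".join(chunk)))
    scored.sort(key=lambda s: -s[0])
    return [chunk for score, chunk in scored[:n_best] if score > 0]
```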

Answer extractor

Parse snippets. Collect candidates of the expected answer type (noun phrase, number, date).

Answer selection

Some sample questions

Who is the President of Ghana? Answer: Rawlings.
What is the name of the rare neurological disease with symptoms such as: involuntary movements (tics), swearing, and incoherent vocalizations (grunts, shouts etc.)? Answer: Tourette's syndrome.
How far is Yaroslavl from Moscow? Answer: 280 miles.
What country is the biggest producer of tungsten? Answer: China
When did the Jurassic period end? Answer: 144 million years ago.

Evaluation measures

Percentage correct: percentage of questions for which the correct answer is found.
User effort:
If correct answer appears first at Kth page {
  if answer appears in Kth snippet
     then sum of lengths of first K snippets
     else sum of lengths of first K snippets + length of Kth page. }
Plot percentage correct vs. user effort.
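The user-effort metric above transcribes directly into code; this is a sketch under the assumption that snippets and pages are given as parallel lists of result strings.

```python
def user_effort(snippets, pages, answer):
    """Scan results in order; if the answer appears in the K-th snippet,
    effort is the total length of the first K snippets; if it first
    appears only in the K-th page, add that page's length.
    Returns None if the answer is never found."""
    effort = 0
    for snippet, page in zip(snippets, pages):
        effort += len(snippet)
        if answer in snippet:
            return effort
        if answer in page:
            return effort + len(page)
    return None
```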

Finding: MULDER has much lower user effort at a given level of percentage correct.


WordNet is a standard tool for dealing with English words. Nouns, verbs, adjectives, and adverbs are organized into synonym sets; each synonym set corresponds to a concept, and concepts are linked by relations. WordNet covers both single words and standard phrases (collocations), e.g. man-eating shark, blue-collar, etc.

Web interface for WordNet 1.5


Similar general architecture.

QA typology

Extensive hierarchical QA typology. Each node is annotated with examples and typical patterns of answering text.

Question typology

Patterns collected automatically:



Dumais et al. (Microsoft)

Addresses information that is often repeated on the Web. If a fact is stated often enough, it will eventually be stated in the form that you expect. Hence, using a vast collection of documents (the Web) reduces the problems raised by paraphrase.


Rewrite query: pattern matching from the form of the query. Output: a set of triples [query-string; position of answer w.r.t. string (left, right, anywhere); weight].

E.g. "Who created the character of Scrooge?" generates

"created the character of Scrooge", LEFT, 5.
"the character of Scrooge was created by" RIGHT, 5
"created" AND "the character" AND "of Scrooge", ANY, 2
"created" AND "character" AND "Scrooge", ANY, 1

6.7 queries per question on average.
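The rewrite generation above can be sketched as follows. The patterns and weights here are illustrative, not the actual AskMSR rule set, and only "Who <verb> <object>" questions are handled.

```python
def rewrites(question):
    """Generate (query-string, answer-position, weight) triples
    for a "Who <verb> <object>" question (illustrative sketch)."""
    words = question.rstrip("?").split()
    if words[0].lower() != "who" or len(words) < 3:
        return [(" ".join(words), "ANY", 1)]
    phrase = " ".join(words[1:])          # e.g. "created the character of Scrooge"
    verb, obj = words[1], " ".join(words[2:])
    return [
        ('"%s"' % phrase, "LEFT", 5),                  # answer precedes the phrase
        ('"%s was %s by"' % (obj, verb), "RIGHT", 5),  # passive rewrite
        (" AND ".join('"%s"' % w for w in words[1:]), "ANY", 1),
    ]
```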

Extract and count common answers. Take summaries from the results page (hoping that truncation is not harmful). Quality of answer A = sum over documents D of the highest rank of any pattern yielding A on the summary of D. (Each document counts only once.)
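The scoring rule just stated can be sketched directly, assuming we are handed, per document, the (candidate answer, pattern weight) matches found in its summary:

```python
from collections import defaultdict

def score_answers(doc_matches):
    """Quality of answer A = sum over documents D of the highest
    weight of any pattern yielding A on D's summary; each document
    contributes at most once per candidate."""
    scores = defaultdict(int)
    for doc, matches in doc_matches.items():
        best = {}
        for answer, weight in matches:
            best[answer] = max(best.get(answer, 0), weight)
        for answer, weight in best.items():
            scores[answer] += weight
    return dict(scores)
```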

Dickens: 117
Christmas Carol: 78
Charles Dickens: 75
Disney: 72
Carl Banks: 54 (cartoonist who created Donald Duck's Uncle Scrooge )
A Christmas: 41
uncle: 34

Filter answers by category (weakly).

Combine combinable answers. (E.g. "Dickens" gets folded into "Charles Dickens"; "A Christmas" and "Christmas Carol" get combined into "A Christmas Carol".)
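This combining step amounts to n-gram tiling; a minimal sketch (my own simplification, not the system's actual algorithm):

```python
def tile(a, b):
    """Merge two candidate answers when one contains the other or they
    overlap at the edges (e.g. "A Christmas" + "Christmas Carol" ->
    "A Christmas Carol"). Returns None when they don't combine."""
    if a in b:
        return b
    if b in a:
        return a
    aw, bw = a.split(), b.split()
    for k in range(min(len(aw), len(bw)), 0, -1):  # longest overlap first
        if aw[-k:] == bw[:k]:
            return " ".join(aw + bw[k:])
        if bw[-k:] == aw[:k]:
            return " ".join(bw + aw[k:])
    return None
```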

Problems with patterns

False positives: easily misled by unobserved syntax (particularly if proximity matching is allowed). E.g. pattern "P is Q" matches "It has been falsely alleged that P is Q";
pattern "P is ... Q" matches "P is not Q".

In practice, you get much weirder things, especially with patterns that are automatically generated. E.g. (from Ravichandran and Hovy)
Q: "When was Lyndon Johnson born?"
A: "who ran for both re"

What's going on? The pattern for BIRTHDATE was "[NAME] -" (as in "Lyndon Johnson 1908-1973") and the text was "Tower gained national attention in 1960 when he lost to Democratic Sen. Lyndon B. Johnson who ran for both re-election and the vice presidency."

Of course, this can be blocked, either by a simple filter or by a more restrictive rule, but (a) even if you're constructing filters by hand, it's hard to be sure you have enough and that none are overly strong, and (b) it's hugely harder to get the right level of generality or the right filters when you're generating rules automatically.

False negatives: The opportunities are endless:

KnowItAll (Etzioni et al.)

Task: To collect as many instances as possible of various categories. (cities, states, countries, actors, and films.)

Note: This is a much easier problem than the Q/A problem:

Domain-independent extractor and assessor rules.

Extractor rule:

Predicate: Class1
Pattern: NP1 "such as" NPList2 
Constraints: head(NP1) = plural(label(Class1));
             properNoun(head(each(NPList2)))
Bindings: Class1(head(each(NPList2)))

E.g. for the class "City" the pattern is "cities such as NPList2"; "cities such as" can be used as a search string. The pattern would match "cities such as Providence, Pawtucket, and Cranston" and would label each of Providence, Pawtucket, and Cranston as a city.
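A crude regex rendering of this extractor rule; the helper name and the punctuation-based NP splitting are my own simplifications (the real system uses a proper NP chunker, and the properNoun constraint is approximated by an initial capital):

```python
import re

def extract_instances(text, plural_label):
    """Find '<plural class> such as NP1, NP2, and NP3' and label each
    capitalized NP as an instance of the class (illustrative sketch)."""
    m = re.search(r"%s such as ([^.;]+)" % re.escape(plural_label), text)
    if not m:
        return []
    instances = []
    for np in re.split(r",\s*(?:and\s+)?|\s+and\s+", m.group(1)):
        np = np.strip()
        if np and np[0].isupper():   # crude properNoun constraint
            instances.append(np)
    return instances
```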

Subclass extractors: Look for instances of a subclass rather than the superclass. E.g. it is easier to find people described as "physicists", "biologists", "chemists", etc. than as "scientists."

List extractor rules.

Assessor: a collection of high-precision, searchable patterns. E.g. "[INSTANCE] is a [CATEGORY]" ("Kalamazoo is a city.") There will not be very many of these on the Web, but if there are a few, that is sufficient evidence.


Learning synonyms for category names: E.g. learn "town" as a synonym for "city"; "nation" as a synonym for "country" etc.
Method: Run the extractor rules in the opposite direction. E.g. Look for patterns of the form "[CLASS](pl.) such as [INST1], [INST2] ..." where some of the instances are known to be cities.

Learning patterns

Some of the best patterns learned:
the cities of [CITY]
headquartered in [CITY]
for the city of [CITY]
in the movie [FILM]
[FILM] the movie starring
movie review of [FILM]
and physicist [SCIENTIST]
physicist [SCIENTIST]
[SCIENTIST], a British scientist

Learning Subclasses

Subclass patterns:
[SUPER] such as [SUB]
such [SUPER] as [SUB]
[SUB] and other [SUPER]
[SUPER] especially [SUB]
[SUB1] and [SUB2] (e.g. "physicists and chemists")

Learning list-pattern extractors

Looks for repeated substructures within an HTML subtree with many instances of the category in a particular place.
E.g. the pattern "<tr><td> CITY </td>" will detect CITY as the first element of a row in a table. (Allows wildcards in the argument to an HTML tag.) Predict that the leaves of the subtree are all instances.
Hugely effective strategy; increases overall number retrieved by a factor of 7, and increases "extraction rate" (number retrieved per query) by a factor of 40.
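As a toy stand-in for a learned list-pattern extractor, here is a regex that pulls the first cell of each table row; KnowItAll's real extractors are learned from repeated HTML substructure rather than hand-written like this:

```python
import re

def first_cells(html):
    """Return the text of the first <td> in each <tr> (illustrative
    sketch of a list-pattern extractor over an HTML table)."""
    rows = re.findall(r"<tr[^>]*>\s*<td[^>]*>(.*?)</td>", html,
                      re.IGNORECASE | re.DOTALL)
    return [re.sub(r"<[^>]+>", "", cell).strip() for cell in rows]
```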

Results: found 151,016 cities, of which 78,157 were correct: precision = 0.52. At precision = 0.8, found 33,000; at precision = 0.9, found 20,000.

Advanced Query Languages

Squeal: A Structured Query Language for the Web, Ellen Spertus and Lynn Andrea Stein.

Presents Web as relational database. SQL syntax for queries.

Tables (This is a simplified account.)

Page table: url, contents, bytes, when.
Tag table: url, tag_id, name (e.g. H1), startOffset, endOffset
Att(ribute) table: tag_id, name, value
Link table: source_url, anchor, destination_url, hstruct, lstruct (header/list structure in which the link occurs).
Parse table: url_value, component (either "host", "port", "path", or "ref"), value, depth.
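A sketch of how a page might be parsed into the tag table, using Python's standard HTML parser; the class name and the simplified row shape (endOffset omitted, position given as a (line, column) pair) are my own choices for illustration:

```python
from html.parser import HTMLParser

class TagTable(HTMLParser):
    """Build Squeal-style tag-table rows (url, tag_id, name, startPos)
    from a page's HTML (illustrative sketch)."""
    def __init__(self, url):
        super().__init__()
        self.url = url
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # getpos() gives (line, column) of the tag's start
        self.rows.append((self.url, len(self.rows), tag.upper(), self.getpos()))
```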


// Example 1: What pages contain the word "hypertext" and contain a picture?
SELECT p.url
FROM page p, tag t
WHERE p.contents LIKE "%hypertext%"
AND t.url = p.url
AND t.name = "IMG"

// Example 2: What tags appear on the page "http://www9.org"?

SELECT name
FROM tag
WHERE url = "http://www9.org"

// Example 3: What are the values of the SRC attributes associated with IMG tags on "http://www9.org"?

SELECT a.value
FROM att a, tag t
WHERE t.url = "http://www9.org"
AND t.name = "IMG"
AND a.tag_id = t.tag_id
AND a.name="SRC"

// Example 4: What pages are pointed to by "http://www9.org"?
SELECT destination_url
FROM link
WHERE source_url = "http://www9.org"

// Example 5: What pages are pointed to via hyperlinks with anchor text "Web conference"?
SELECT destination_url
FROM link
WHERE anchor = "Web conference"


Download specified pages or call standard search engine as appropriate. Follow links if necessary and download further pages. Parse all downloaded pages into temporary local database. Answer query from database.

e.g. Example 2: What tags appear on the page "http://www9.org"?

SELECT name
FROM tag
WHERE url = "http://www9.org"

Download the page, parse it, answer the question.

// Example 1: What pages contain the word "hypertext" and contain a picture?

SELECT p.url
FROM page p, tag t
WHERE p.contents LIKE "%hypertext%"
AND t.url = p.url
AND t.name = "IMG"
Call search engine with query "hypertext"; download files (all files? seems hard to believe); create local database; answer queries from database.

Note that a complete database of this kind could be easily constructed in the course of creating a Web search index. That is, if Google wanted to support this query engine, it could do so at little additional cost.