## Lecture 6: Evaluation / Clustering

### Precision and Recall

D = set of all documents
Q = set of documents retrieved
R = set of relevant documents

QR -- True positives. (Relevant documents retrieved)
Q(D-R) -- False positives. (Irrelevant documents retrieved)
(D-Q)R -- False negatives. (Relevant documents omitted)
(D-Q)(D-R) -- True negatives. (Irrelevant documents omitted)
(Juxtaposition here denotes intersection: QR is the set of documents in both Q and R.)

Percentage correct = (|QR| + |(D-Q)(D-R)|) / |D|.
Not a good measure; counts false positives and false negatives equally.
E.g. suppose |R|=3.
Q1 has two relevant documents and three irrelevant documents.
Q2 returns one irrelevant document.
Then both are making the same number of errors (4), but clearly Q1 is better than Q2.

Standard measures in IR (in fact, in all applications where the objective is to find a set of solutions):
Precision = |QR| / |Q| -- fraction of retrieved documents that are relevant = 1 - (fraction of retrieved documents that are false positives).
Recall = |QR| / |R| -- fraction of relevant documents that are retrieved = 1 - (fraction of relevant documents that are false negatives).

In the above example Q1 has precision 2/5 and recall 2/3. Q2 has precision and recall = 0.
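The Q1/Q2 example can be checked in a few lines of Python (the document IDs and the `precision_recall` helper are made up for illustration):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall); precision is taken as 0 if nothing is retrieved."""
    tp = len(retrieved & relevant)  # |QR|: true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"d1", "d2", "d3"}            # R, with |R| = 3
q1 = {"d1", "d2", "x1", "x2", "x3"}      # 2 relevant + 3 irrelevant
q2 = {"x4"}                              # 1 irrelevant

print(precision_recall(q1, relevant))    # (0.4, 0.666...)
print(precision_recall(q2, relevant))    # (0.0, 0.0)
```

Both queries make four errors, but precision and recall separate them cleanly.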

If Q1 is a subset of Q2, then Recall(Q2) >= Recall(Q1). Prec(Q2) can be either greater or less than Prec(Q1). If you consider the precision over the first K documents returned for K = 1, 2, ..., then the precision goes up every time document K is relevant and down every time it is irrelevant, so the graph is sawtoothed. But on the whole precision tends to go down, so there is a trade-off between recall and precision as you retrieve more documents.

Smoothed precision: Plot precision only at the points where relevant documents are found; interpolate in between. Set precision(0) = 1. Precision then need not be monotonically decreasing, but it will tend to be so, except possibly at the beginning.

Probabilistic model. Suppose that the matcher returns a measure of the "quality" of the document for the query. Suppose that this measured quality has some value in the following sense:
If q1 > q2, then Prob(d in R | qual(d)=q1) > Prob(d in R | qual(d) = q2)
Let QT = { d | qual(d) >= T }.
Then the expected value of precision(QT) is a decreasing function of T. The expected value of recall(QT) is an increasing function of T, but concave downward.
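The threshold trade-off can be sketched as follows. The quality scores are hypothetical, and the convention precision(empty set) = 1 matches the smoothed-precision convention above:

```python
def precision_recall_at_threshold(scored_docs, relevant, t):
    """Precision and recall of Q_T = {d | qual(d) >= t}."""
    q_t = {d for d, q in scored_docs.items() if q >= t}
    tp = len(q_t & relevant)
    p = tp / len(q_t) if q_t else 1.0   # convention: precision of empty set is 1
    r = tp / len(relevant)
    return p, r

# Hypothetical quality scores; d* are relevant, x* are not.
scores = {"d1": 0.9, "d2": 0.8, "x1": 0.7, "d3": 0.4, "x2": 0.3, "x3": 0.1}
relevant = {"d1", "d2", "d3"}

for t in (0.85, 0.5, 0.0):
    print(t, precision_recall_at_threshold(scores, relevant, t))
```

Lowering T from 0.85 to 0.0 drives precision from 1 down to 0.5 while recall climbs from 1/3 to 1, illustrating the monotone trade-off.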

Choices other than the threshold also tend to trade off precision vs. recall. (Of course, if they don't trade off, then just go with the better of the two: win-win.) E.g. stemming and inclusion of synonyms tend to increase recall at the cost of precision.

### Problems with Precision and Recall

• Users generally don't care much about recall. (Not entirely clear that precision is exactly the right measure either.)
• Measuring recall involves identifying R. Always difficult; on Web, extremely difficult.
• Doesn't take into account order of answers.
• Doesn't take into account degree of relevance.
• Two numbers (two curves, as functions of T). Would prefer one number.
• Zero values: when |QR| = 0, precision and recall are both zero (or undefined if Q is empty), erasing all distinctions among such result sets.
• What do you mean by "relevance" anyway?

### Alternative measures

F-measure: Harmonic mean of precision and recall:
1/F = average(1/p,1/r)
F = 2pr/(p+r).
If either p or r is small then F is small. If p and r are close then F is about the average of p and r.
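A minimal implementation of the F-measure, applied to the Q1 values from the earlier example:

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall; taken as 0 if both are 0."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

print(f_measure(0.4, 2/3))   # ≈ 0.5 for Q1 (precision 2/5, recall 2/3)
print(f_measure(0.0, 0.5))   # 0.0: a zero precision drags F to zero
```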

Generalized precision: Value of information obtained for user / cost of examining results.

Generalized recall: Value of information obtained / Value of optimal results (or: value of entire Web for user's current need)

Average precision: Average of precision at 20% recall, 50%, and 80%. Or average of precision at recall = 0%, 10%, 20%, ..., 90%, 100%. (Since recall does not attain these values exactly, and since recall remains constant until the next relevant document is found, the same value of recall can have several values of precision; take the max precision or the average precision, and interpolate. Similarly, precision at recall = 0% is extrapolated.)
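The 11-point version can be sketched as follows, using the max-precision interpolation mentioned above. The function name and the example ranking are made up; `ranking` is a list of booleans, True meaning the document at that rank is relevant:

```python
def eleven_point_ap(ranking, relevant_total):
    """Average of interpolated precision at recall = 0.0, 0.1, ..., 1.0."""
    # Record (recall, precision) after each relevant document is found.
    points = []
    tp = 0
    for k, rel in enumerate(ranking, start=1):
        if rel:
            tp += 1
            points.append((tp / relevant_total, tp / k))
    # Interpolated precision at level r = max precision at any recall >= r.
    levels = [i / 10 for i in range(11)]
    interp = []
    for r in levels:
        candidates = [p for (rec, p) in points if rec >= r]
        interp.append(max(candidates) if candidates else 0.0)
    return sum(interp) / len(interp)

# Relevant documents at ranks 1, 3, 6, with |R| = 3:
print(eleven_point_ap([True, False, True, False, False, True], 3))
```

Here the interpolated precision is 1 at recall levels up to 0.3, 2/3 up to 0.6, and 0.5 thereafter, averaging to 8/11.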

Precision over first K documents: (or average relevance over first K documents). User model: User will only read first K documents.

Rank of Kth relevant document (or rank such that sum of relevance = K). User model: User will read until he has gotten K relevant documents (or documents whose total relevance is K).

Weighted precision: (sum over dK in Q of rel(dK)) / |Q|.

Weighted recall: (sum over dK in Q of rel(dK)) / (sum over dK in R of rel(dK)).

Order diminishing sum: Value of search is sum over K of rel(dK) * p^K. User model: User starts reading at the beginning and at each step continues with probability p.
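A sketch of this measure; the graded relevances and the choice of p are made up, and the exponent follows the convention rel(dK) * p^K above:

```python
def order_diminishing_value(relevances, p):
    """Sum of rel(d_k) * p^k over ranks k = 1, 2, ...; later ranks count less."""
    return sum(rel * p**k for k, rel in enumerate(relevances, start=1))

# Hypothetical binary relevances for the first five results, p = 0.5:
print(order_diminishing_value([1, 0, 1, 1, 0], p=0.5))  # 0.5 + 0.125 + 0.0625 = 0.6875
```

Note how the relevant document at rank 4 contributes only an eighth as much as the one at rank 1.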

Total content precision: Total information relevant to the query in the pages retrieved, divided by |Q| (or divided by total reading time). Of course, this is hard to quantify. You can, for example, prepare a list of questions on the subject matter and measure "total information" as the fraction of questions that can be answered from the retrieved texts.

Total content recall: Total information relevant to query in pages retrieved divided by total information relevant to query in Web.

#### Estimating R

A relevant document may fail to be retrieved for one of two reasons:

• The crawler has not indexed it.
• The retriever doesn't recognize that it is relevant.

First case is hopeless. (We will talk later about how to estimate the size of this set, but no way to estimate # of relevant documents.)

Second case:

• Broaden query as far as feasible. Include disjunction of lots of related words, synonyms, alternative spellings. Set threshold as low as possible.
• "Seed" web with relevant documents. Or identify specific subset of documents that you know to be indexed (e.g. articles from specific journals.) See what fraction are retrieved, and extrapolate. (E.g. if 40% of seed documents are retrieved and 200 documents total are retrieved, estimate 500 relevant documents total.) Hard to be confident that seed documents are representative of existing documents, relative to query and engine.
• For accessible subcollection (e.g. newsgroup): randomly sample subcollection and count relevant documents.
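The seed-based extrapolation above amounts to estimating recall from the seed set and dividing through; the function name is illustrative, and it assumes the seed documents are representative:

```python
def estimate_relevant_total(seed_size, seed_retrieved, retrieved_relevant):
    """Estimate |R| as (relevant documents retrieved) / (estimated recall),
    where recall is estimated from the fraction of seed documents retrieved."""
    estimated_recall = seed_retrieved / seed_size
    return retrieved_relevant / estimated_recall

# 40% of 50 seed documents found, 200 relevant documents retrieved overall:
print(estimate_relevant_total(50, 20, 200))  # ≈ 500, as in the example above
```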

#### Relevance

• Relevance to query as stated
• Relevance to actual user question. (Includes limitations of query language; user skill at formulating question).
• Interest to user after the fact.
• Originality; new information contained (depends on other docs retrieved).
• Authority: User has confidence in information contained. (Much more an issue in Web than in traditional IR.)

Measured:

• By questionnaire.
• By clicking through.
• By external independent index.

### Experimental design

Observational studies: users are observed as they use the web for their own purposes. Ecologically valid (i.e. representative of actual use). The more interference, the less ecologically valid. (Just informing users that they are observed alters their behavior; however, there can be privacy issues if they are observed without being informed.)

Controlled experiment: Users carry out a task specified by the experimenter in a controlled setting. Much more information per task, much more demanding of the user, possible to design a narrowly focused experiment, but less clearly representative of "normal" use.

#### Significance of experiment

Failures and errors occur for the following causes:
• 1. Web content
• 1.A No information on Web
• 1.B Incorrect/outdated information
• 1.C Information hard to find on site. Page poorly designed, bad links, site too big, etc.
• 1.D Redundancy between web pages.
• 2. Crawler problem
• 2.A Document not indexed
• 2.B URL out of date (document moved)
• 2.D URL content changed
• 2.E Multiple copies of identical/near-identical page
• 3. Retrieval problem
• 3.A False positive
• 3.B False negative
• 3.C Misjudge importance of page
• 3.D Misordering of pages
• 3.E Wrong page on Web site (e.g. link to internal page when top-level page would be better or vice versa)
• 3.F Engine too slow
• 4. Query language problem
• 4.A Insufficiently expressive
• 4.B Ill-defined.
• 4.C Too complicated / misleading
• 5. User problem
• 5.A Poor choice of query terms
• 5.B Ineffective use of query language
• 5.C Ineffective use of browser
• 5.D Information overload -- too many pages causes user to give up.
• 5.E Impatient
• 5.F Distracted
• 6. Results page problem
• 6.A Confusing format
• 7. Browsing problem
• 7.C Page cannot be correctly displayed
• 8. Page unsuitable to user
• 8.A Too elementary
• 8.C Wrong language
• 8.D Pornographic
• 9. Non-text media
• 9.A Poor quality at web site.
• 9.B Browser doesn't know format.
• 9.C Poor quality at browser.
• 9.D Limited textual information. (In almost all real cases, such media are indexed by related text.)
• 9.E Format unsuited to user (e.g. limited vision/hearing).
Different experiments detect different combinations of these. For example:

Tester specifies query; test subject reads the first 30 pages and labels each page "relevant", "irrelevant", or with a category of failure (e.g. "bad link", "too long to download", etc.). This tests separately 3.A possibly combined with 3.E; 2.C; 7.A; 7.B.

Tester is aware (not through search engine) of a valuable page; runs a variety of queries; tabulates fraction of queries for which page is in top 100. Combines 2.A, 3.B, 3.C, 3.D.

Tester specifies list of questions in some subject area to be answered in fixed time period; test subjects use search engine as best they can. This combines pretty much all possible errors.

## Clustering: Introduction

The Northern Light search engine organizes its responses by topic. (Unfortunately, it is no longer publicly available.) Examples:

In 2002, the query "Jaguar" generated the topics: "Mark Brunnel" (a quarterback for the Jacksonville Jaguars), "National Football League", "Jaguar cars", "Jacksonville Jaguars", "Ligaments" (presumably because these tend to get injured), "Tennessee Titans", "Management", "Cleveland Browns", "Careers and Opportunities", "Green Bay Packers", "Surgery", "football", "automobile industry", "aerospace and defense industry", "cars", "business and professional services", "endangered species", "military aircraft" ...

(Curious though irrelevant observation: As of 2004, in Google, searching under "jaguar" gives first (mostly) the automobile, and to a lesser extent, the Mac OS version. The first page for the animal is #8, the second is #18. Searching under "jaguars" gives a mixture of the animal and football teams; the first page for the car is #25. Searching under "jaguar jaguars" gives only pages for S-type Jaguars (the car) down through page 56; page 57 is the animal.

The results in 2002 were quite different. At that time, searching under "jaguar" gave first (mostly) the automobile; the animal turned up first at #15. Searching under "jaguars" gave football teams for at least the first 50. Searching under "jaguar jaguars" gives mostly pages for the animal. I have no idea to what extent this change reflects changes in Google versus changes in the Web.)

The query "Hepburn" generated topics "Audrey Hepburn", "Katharine Hepburn", "Company information", "Assets and liabilities", "Spencer Tracy", ...

Clearly, this is useful. Clearly there is room for improvement in the grouping algorithm (not to mention in our apparent priorities.) Northern Light provides no information on how it does clustering. There are two issues here: One is doing the clustering, the other is extracting an identifying phrase. We will just talk about the first.

### Source of clustering

Generally, the set of Web pages returned in answer to a query falls into multiple clusters for one of two reasons (as with virtually everything, the boundary between these two is fuzzy):
• Polysemy. A query term has more than one meaning. Either it is an ambiguous word, such as "bat" (the creature or the club); or it is a name of more than one thing, such as "Washington" (the city, the state, the President, etc.); or it can be either a name or a word, such as "Banks". Here, the ideal clustering criterion is generally clear; achieving this clustering can be anywhere from easy (e.g. separating "jaguar" into the car, the football team, the creature, the software) to extremely difficult (e.g. separating "President Bush" into the 41st vs. the 43rd president).
• Multiple aspects of a single topic. Applies more or less to virtually everything with a substantial web presence. The problem is that, in any topic of any richness, there are likely to be multiple ways of organizing the topic into "aspects".

### Applications

• Structuring search results
• Suggesting related pages
• Automatic directory construction / update.
• Finding near identical pages:
• Finding mirror pages (e.g. for propagating updates)
• Eliminate near-duplicates from results page
• Plagiarism detection
• Lost and found (find identical pages at different URL's at different times.)

### Cluster structure

• Hierarchical vs. flat.
• Overlap:
• Disjoint partitioning. E.g. partition congressmen by state.
• Multiple dimensions of partitioning, each disjoint. E.g. partition congressmen by state; by party; by House vs. Senate.
• Arbitrary overlap. E.g. partition articles by author. Geographical regions (France, the Alps, German-speaking regions). Problem: No natural bound on the number of categories.
• Exhaustive vs. non-exhaustive.
Note: disjoint and exhaustive decomposition = tree.
Note difference between geometric point not part of any cluster (unclustered point) and document not part of any cluster (general subject matter). True analogy perhaps: document = region.
• Outliers: What to do?
• How many clusters? How large?

### Information source

• Text content
• Usage: Clickthrough logs give association between query and page.

### Position of clustering module

• At indexing time
• At query time applied to full pages.
• At query time applied to snippets. Turns out that, experimentally, most clustering algorithms do almost as well given only snippets; in fact, some do better with snippets than with the whole text.

### Textual similarity criterion

• Vector measure. Each document is considered as vector normalized to length 1.
• Overlap measure. Similarity of documents Q and R (viewed as sets of words) is |Q intersect R| / |Q union R| (the Jaccard coefficient).
Note: Safe to presume that cluster of documents is a convex region geometrically. That is, if subject S includes DOC1 with combination V1 of words and DOC2 with combination V2 of words then there could exist a document in S with combination p*V1 + (1-p)*V2 of words.
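Both similarity measures can be sketched in a few lines; the tiny word-count documents are made up for illustration:

```python
import math

def cosine_similarity(counts1, counts2):
    """Cosine of the angle between two term-count vectors
    (equivalent to the dot product after normalizing each to length 1)."""
    dot = sum(c * counts2.get(w, 0) for w, c in counts1.items())
    n1 = math.sqrt(sum(c * c for c in counts1.values()))
    n2 = math.sqrt(sum(c * c for c in counts2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def jaccard_overlap(words1, words2):
    """|intersection| / |union| of the two word sets."""
    union = words1 | words2
    return len(words1 & words2) / len(union) if union else 0.0

d1 = {"jaguar": 2, "car": 1}
d2 = {"jaguar": 1, "cat": 1}
print(cosine_similarity(d1, d2))          # ≈ 0.632
print(jaccard_overlap(set(d1), set(d2)))  # 1/3: one shared word out of three
```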

### Source of clustering in search results

• Polysemy. "bat", "Washington", "Banks".
Clustering criterion clear in principle, hard or easy in practice. (e.g. "President Bush")
• Multiple aspects of a single topic. Practically any rich topic. Many possible dimensions for clusters. Little agreement among human subjects. No ideal form of clustering. Qn: Can we generate a system of clusters that seems plausible and useful? Ultimately this amounts to the general problem of Web page / information structuring.