## Lecture 4: Near Duplicate Pages / Evaluation

### Near Duplicate Pages

See MR&S section 19.6. One point is not very clearly explained there. You can't actually choose a random permutation over the set {0 ... 2^63 − 1}, because it would take about 2^69 bits just to write down such a thing. But you don't have to specify the whole permutation, just its value on the shingle keys that actually occur.

So you carry out the following algorithm:

```
/* PermutationCount is a mapping D1 x D2 -> N, where D1, D2 are document
   IDs and N is the number of permutations π for which x^π_D1 = x^π_D2
   in the notation of MR&S. If there is no record for <D1,D2>, the count is 0.
   PermutationCount is implemented as a hash table. */

NumDocs: number of documents;
Docs[1 ... NumDocs]: document set;
Shingles[1 ... NumDocs]; /* Shingles[I] is the set of shingles in Docs[I],
                            each hashed to a 64-bit number */

procedure NextPermutation() {
  X[1 ... NumDocs]; /* X[I] = x^π_I, the minimum of π over Shingles[I] */
  H = empty hash table; /* H holds Key,Value pairs <Q, π(Q)>,
                           where Q is the 64-bit hash of a shingle */
  for (I = 1 ... NumDocs) {
    X[I] = infinity;
    for (Q in Shingles[I]) {
      P = retrieve(H,Q);
      if (P == Null) {
        P = random 64-bit number;
        store(H,Q,P);
      }
      if (P < X[I]) X[I] = P;
    } endfor
  } endfor

  P2Docs = empty hash table; /* P2Docs holds Key,Value pairs <P,L>,
                                where L is the list of docs I s.t. X[I] = P */
  for (I = 1 ... NumDocs) {
    L = retrieve(P2Docs,X[I]);
    if (L == Null) L = empty list;
    for (J in L) {
      C = retrieve(PermutationCount, <J,I>);
      if (C == Null) C = 0;
      store(PermutationCount, <J,I>, C+1);
    } endfor
    store(P2Docs, X[I], add(I,L));
  } endfor
} end NextPermutation

procedure TestForDups() {
  NumPerms = 200;
  for (I = 1 ... NumDocs) Shingles[I] = set of hashes of shingles in Docs[I];
  PermutationCount = empty hash table;
  for (P = 1 ... NumPerms) NextPermutation();
  any pair <I,J> for which retrieve(PermutationCount, <I,J>) > 0.80 * NumPerms
    is a duplicate pair;
}
```
This can get into trouble if there is a "boilerplate shingle" (e.g. one automatically produced by a web document authoring tool in every page) that appears in more than sqrt(NumDocs) documents. If so, then the inner loop in the last part of NextPermutation may iterate more than O(NumDocs) times in total. So once a shingle has too many documents associated with it, you ignore it, now and in the future (you keep a permanent table of ignored shingles). With that fix, the algorithm clearly runs in time O(NumPerms * total size of the document set).

## Evaluation of Web Search

MR&S Chapter 8.

"A taxonomy of web search," Andrei Broder, SIGIR Forum, 36:2, 2002, 3-10.

"The retrieval effectiveness of web search engines: considering results descriptions," Dirk Lewandowski, Journal of Documentation, 64:6, 2008, 915-937.

"Evaluating search engines by modeling the relationship between relevance and clicks," B. Carterette and R. Jones, NIPS 2007.

## Evaluating the results returned for one query

### Precision and Recall

D = set of all documents
Q = set of documents retrieved
R = set of relevant documents

Standard measures in IR (in fact, in all applications where the objective is to find a set of solutions):
Precision = |Q ∩ R| / |Q| -- fraction of retrieved documents that are relevant = 1 - (fraction of retrieved documents that are false positives).
Recall = |Q ∩ R| / |R| -- fraction of relevant documents that are retrieved = 1 - (fraction of relevant documents that are false negatives).

E.g. suppose there are actually 8 relevant pages. The query engine returns 5 pages of which 3 are relevant. Then Precision = 3/5 = 0.6. Recall= 3/8 = 0.375.

If Q1 subset Q2, then Recall(Q2) >= Recall(Q1). Prec(Q2) can be either greater or less than Prec(Q1). If you consider the precision over the first K documents returned for K = 1, 2, ..., then the precision goes up every time the Kth document is relevant and down every time it is irrelevant, so the graph is sawtoothed. But on the whole precision tends to go down, so there is a trade-off between recall and precision as you retrieve more documents.
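The worked example and the sawtooth curve can be checked with a short Python sketch (the function names are mine, not standard API):

```python
def precision_recall(retrieved, relevant):
    """Precision = |Q ∩ R| / |Q|, Recall = |Q ∩ R| / |R|."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

def precision_at_k(ranking, relevant):
    """Precision over the first K results for K = 1 .. len(ranking).
    The curve rises at each relevant document and falls at each
    irrelevant one -- the sawtooth."""
    hits, curve = 0, []
    for k, doc in enumerate(ranking, start=1):
        hits += doc in relevant
        curve.append(hits / k)
    return curve
```

With 8 relevant documents and 5 retrieved of which 3 are relevant, `precision_recall` returns 0.6 and 0.375, matching the example above.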

### Probabilistic definition of precision/recall

Pick a document d at random from the collection. Let Q be the event that d is retrieved in response to a given query and let R be the event that d is relevant.

Precision = Prob(R|Q).
Recall = Prob(Q|R).
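The probabilistic reading can be sanity-checked by Monte Carlo: draw documents uniformly at random and estimate the two conditional probabilities, which should converge to precision and recall. A small sketch (names and the toy collection are mine, assuming a fixed seed for reproducibility):

```python
import random

def conditional_prob_estimates(docs, retrieved, relevant, trials=100_000, seed=0):
    """Estimate Prob(R|Q) and Prob(Q|R) by uniform sampling of documents;
    these approximate precision and recall respectively."""
    rng = random.Random(seed)
    in_q = in_r = in_both = 0
    for _ in range(trials):
        d = rng.choice(docs)
        q, r = d in retrieved, d in relevant
        in_q += q
        in_r += r
        in_both += q and r
    return in_both / in_q, in_both / in_r  # ≈ precision, ≈ recall
```

For a collection of 100 documents with Q = {0..49} and R = {25..74}, both estimates should come out near 25/50 = 0.5.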

### Measures

Recall is pretty much irrelevant for web search because (a) users almost never want all relevant documents (b) it is impossible to determine what is the set of relevant documents. So most of traditional IR evaluation theory goes out the window.
• A. K-precision. Precision over first K results (e.g. K=3, K=10), regardless of order. (Problematic if there are fewer than K relevant pages on the web.)
• B. Discounted precision. W(1)*R(1) + W(2)*R(2) + ... where R(I) is the relevance of the Ith result and W is a decreasing sequence of weights.
• C. Rank of Kth relevant document (e.g. K = 1). (Larger is bad.) (Problematic if fewer than K relevant documents returned.)
Note that these correspond to different user models:
• User looks at all relevant documents in first K results.
• User starts at top, decides at each step whether to continue based on weighted coin flip. (Weight may depend on I but does not depend on whether or not he has seen any relevant documents.) Measure = expected value of docs he reads.
• User reads from top until he has read K relevant documents.
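The three measures above can be sketched directly in Python (a minimal illustration; the function names and the example weights are mine):

```python
def k_precision(ranking, relevant, k):
    """Measure A: precision over the first k results, regardless of order."""
    return sum(d in relevant for d in ranking[:k]) / k

def discounted_precision(ranking, relevant, weights):
    """Measure B: sum of W(i)*R(i), with weights a decreasing sequence."""
    return sum(w * (d in relevant) for w, d in zip(weights, ranking))

def rank_of_kth_relevant(ranking, relevant, k):
    """Measure C: rank of the kth relevant document; None if the results
    contain fewer than k relevant documents (the problematic case)."""
    seen = 0
    for rank, d in enumerate(ranking, start=1):
        seen += d in relevant
        if seen == k:
            return rank
    return None
```

Returning `None` when fewer than k relevant documents appear makes the "problematic" case explicit rather than silently assigning a rank.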

### Relevance

Relevance is taken relative to an information need, not to a query. The gold standard here is a reference librarian who can interact with the user to find out what he/she actually needs to know.

Note that the effectiveness of the query engine on a given query will often be inherently limited because of the inadequacy of the query as an expression of the information need. This inadequacy, in turn, may have several sources:

• The technical limitations of the query interface.
• The incompetence / impatience of the user formulating the query.
• The information need may:
• Be more naturally expressed in some non-verbal mode, e.g. an image.
• Involve meta-data. E.g. "I need an article on the subject that was published in a peer-reviewed scientific journal." "I need a web page that my 6-year-old will understand."
• Inherently involve multiple documents. E.g. "I want one article that argues in favor of balancing the budget and a second that refutes the first."
• Involve characteristics whose relation to the text is very complex. "I want a precedent that will help me win my case." "I want to find a proof that the Chebyshev center of a bounded region is unique."

Some of these are more problems for the serious researcher; others apply to the man-in-the-street user as well.

Also, the phrase "relevant to an information need" is actually ambiguous. Does it mean "has to do with the same subject as the information need" or "goes some way toward satisfying the information need"?

It can also be difficult to formulate the information need exactly. If it is narrow ("What year was Babe Ruth traded to the Yankees?") one page usually suffices. If it is broad ("Give me information about high blood pressure") relevance becomes hard to judge.

Other relevance issues:

• Scale:
• Boolean
• Bounded scale: 0,1,2,3.
• Fraction of information need met (number between 0 and 1)
• Amount of relevant information provided (unbounded number).
• How to count when part of the page is relevant?
• How to count pages with false/unreliable information?
• Web page, web document (which may be multi-page), or web site?
• Sum of relevance, or marginal additional relevance? E.g. if you have 4 pages all of which have exactly the same information, does that count as 4 good pages or as 1? Note: there is no way to do this fairly without somehow quantifying total informational content. E.g. there is no reason that 4 pages, each with a different quarter of the information, should in total be better than 1 page with all the information.

#### Usefulness of page as part of larger informational search

A page may be useful even if it does not directly address the informational need:
• Hub page that points to good authorities. Generally, a page that leads the user via browsing to good information.
• Page leads to an information source outside the web.
• The text may give meta-information about the information need. E.g. "Nothing is known about X". "Y is the expert/best book on X"
• The page may provide useful guidance for reformulating the query (e.g. gives synonyms or related words).

### Snippets

(Lewandowski 2008) 4 cases:
• 1. Document relevant, snippet relevant.
• 2. Document relevant, snippet unpromising.
• 3. Document irrelevant, snippet promising.
• 4. Document irrelevant, snippet unpromising.
Problems:
• How do we incorporate this into our measure of relevance?
• One's first thought is to say that (2) represents a failure of the snippet extractor. On second thought, that may not be the case. The document may be relevant even though there does not exist a convincingly interesting snippet, particularly with multi-word queries.
• One's first thought is to say that (3) does not represent a failure of the snippet extractor, whose job it is, after all, to find the segment of the text most relevant to the query, whether or not that is representative of the document. On the other hand, maybe that's not its job. E.g. with a scientific paper, you might do better to return both the snippet and the paper abstract, so that the user can see whether the paper as a whole is relevant.

## Comparing search engines overall

Formulate a set of information needs, evaluate the search engines on all, combine the results.
• Balance subject matter? By frequency in actual query space or what?
• How to combine (the eternal multi-judge problem)? The obvious thing is to add, and that has its advantages, but there are also drawbacks.
• Separate by query mode? Broder suggests three categories of query:
• Navigational. Searching for a specific web site the user has in mind.
• Informational. Searching for any and all relevant web pages.
• Transactional. Searching for an interactive web page (e.g. shopping).
but these categories are very ill-defined and unstable in the literature.

## Experimental design

#### Measure relevance

Experimenter formulates need, poses queries, evaluates pages for relevance. Or pays other people to do some of these.
• Same people formulate/pose/evaluate?
• Results page randomized for order?
• Identity of search engine anonymized?

Crowdsourcing
"Crowdsourcing for relevance evaluation," Omar Alonso, Daniel Rose, Benjamin Stewart, SIGIR Forum, 42:2, 2008.
Pay random people on the web 1 penny per answer. This seems utterly unreliable to me.

#### User log analysis.

Click-through data in user logs. Advantages: Natural data, lots of data. Disadvantage: difficulty of interpretation. Confounding effects. E.g. are people clicking on the first item returned because it is most relevant (i.e. looks most relevant from the snippet) or just because it is first? If they click on only one, is that because the remaining pages are not relevant or because their need was satisfied, or because they gave up? You can correct for some of this, to some extent, by feeding people results in randomized order, but you don't want to do a lot of that to naive users.

#### Survey

Ask users to fill out a survey about their purposes and their success. (Broder, 2002).

#### Protocol analysis

Ask users, engaged in a task of some complexity, to talk out loud as they work; explain how they formulate and reformulate queries, how they choose links to follow, how they decide whether to continue.

Well known dangers:

• The process of giving a protocol can alter behavior.
• People give rationalizations for their behaviors that are sincere but demonstrably false.
Data is rich but limited and hard to interpret.

#### Other experimental designs

Subjects are given a list of questions to answer. Compare how long they take to answer the questions with one search engine as opposed to another, or study how they use the search engines.

Subjects are given a task, asked to use two different search engines and compare the quality of their results.

#### Behavioral studies of relevance

Relevance: A Review of the Literature and a Framework for Thinking on the Notion in Information Science. Part III: Behavior and Effects of Relevance Tefko Saracevic. Journal of the American Society for Information Science and Technology 58(13):2126-2144, 2007.

### Significance of evaluation

Failures and errors occur for the following reasons:
• 1. Web content
• 1.A No information on Web
• 1.B Incorrect/outdated information
• 1.C Information hard to find on site. Page poorly designed, bad links, site too big, etc.
• 1.D Redundancy between web pages.
• 2. Crawler problem
• 2.A Document not indexed
• 2.B URL out of date (document moved)
• 2.C URL content changed
• 2.D Multiple copies of identical/near-identical page
• 3. Retrieval problem
• 3.A False positive
• 3.B False negative
• 3.C Misjudge importance of page
• 3.D Misordering of pages
• 3.E Wrong page on Web site (e.g. link to internal page when top-level page would be better or vice versa)
• 3.F Engine too slow
• 4. Query language problem
• 4.A Insufficiently expressive
• 4.B Ill-defined.
• 4.C Too complicated / misleading
• 5. User problem (particularly in observational studies)
• 5.A Poor choice of query terms
• 5.B Ineffective use of query language
• 5.C Ineffective use of browser
• 5.D Information overload -- too many pages causes user to give up.
• 5.E Impatient
• 5.F Distracted
• 6. Results page problem
• 6.A Confusing format
• 7. Browsing problem