Lecture 4: Near Duplicate Pages / Evaluation

Near Duplicate Pages

See MR&S section 19.6. One point is not very clearly explained there. You can't actually choose a random permutation over the set {0 ... 2^63 - 1}, because it would take about 2^69 bits just to write down such a thing. But you don't have to specify the whole permutation, just its value on the shingle keys that actually occur.
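The 2^69 figure can be sanity-checked numerically. The sketch below is not from MR&S; it only assumes that naming one of the N! permutations of N items takes about log2(N!) bits, and evaluates that for N = 2^63. The function name is made up for illustration.

```python
import math


def bits_to_store_permutation(n: float) -> float:
    """Approximate log2(n!), the bits needed to name one of the n! permutations."""
    # lgamma(n + 1) equals ln(n!); dividing by ln 2 converts nats to bits.
    return math.lgamma(n + 1.0) / math.log(2.0)


n = 2.0 ** 63
bits = bits_to_store_permutation(n)
exponent = math.log2(bits)  # bits is roughly 2 ** exponent
print(round(exponent, 2))
```

The printed exponent comes out just under 69, since log2(N!) is roughly N * (log2 N - 1.44), which is about 2^63 * 61.6 here.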

So you carry out the following algorithm:

/* PermutationCount is a mapping D1 x D2 -> N, where D1, D2 are document
ID's and N is the number of permutations π for which x^π_1 = x^π_2
(the minima for D1 and D2) in the notation of MR&S. If there is no
record for D1 x D2, then N = 0.
PermutationCount is implemented as a hash table */

NumDocs: number of documents;
Docs[1 ... NumDocs]: Document set.
Shingles[1 ... NumDocs]; /* Shingles[I] is the set of shingles in Docs[I]
                            hashed to a 64 bit number */

procedure NextPermutation () {

X[1 ... NumDocs] = infinity; /* X[I] = x^π_I, the minimum of π over Shingles[I];
                                it starts above every 64 bit value */
H = emptyhashtable /* H is hash table with Key, Value pairs < Q, Pi(Q) > 
                      where Q is the 64 bit hash of a shingle */
for (I = 1 ... NumDocs)
  for (Q in Shingles[I]) {
     P = retrieve(H,Q);
     if (P == Null) {
       P = random 64 bit number;
       store(H,Q,P); }
     if (P < X[I]) X[I] = P;
    } endfor

P2Docs = emptyhashtable; /* P2Docs is hash table with Key, Value pairs < P,L >
                         where L is the list of docs I s.t. x^π_I = P */
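for (I= 1 ... NumDocs) {
  L = retrieve(P2Docs,X[I]);
  if (L == Null) L = emptylist;
  for (J in L) {   /* every J here is an earlier doc with the same minimum as I */
      C = retrieve(PermutationCount, < I,J >);
      if (C == Null) C = 0;
      store(PermutationCount, < I,J >, C+1);
    } endfor
  store(P2Docs, X[I], append(L,I));   /* record I so later docs can match it */
} endfor 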

end NextPermutation

procedure TestForDups() {
NumPerms = 200;

for (I=1 ... NumDocs) Shingles[I] = set of hashes of shingles in Docs[I]; 
PermutationCount = emptyhashtable;
for (P=1 ... NumPerms) NextPermutation();
any < I,J > for which retrieve(PermutationCount, < I,J >) > 0.80 * NumPerms is a near-duplicate pair;

end TestForDups

This can get into trouble if there is a "boilerplate shingle" (e.g. one automatically inserted into every page by a web document authoring tool) which appears in more than sqrt(NumDocs) documents. If such a shingle attains the minimum for k documents, the inner loop in the last part of NextPermutation runs about k^2/2 times for it, so it may iterate more than O(NumDocs) times in total. So once a shingle gets too many documents associated with it, you ignore it, now and in future (you keep a permanent table). With that fix, the algorithm runs in time O(NumPerms * total size of document set).
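The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the definitive implementation: the helper `shingle_hashes` (word 4-grams hashed with BLAKE2b to 64 bits) and the names `next_permutation` and `find_near_duplicates` are made up here, the "random permutation" is approximated by independent random 64-bit values (collisions are possible but negligible), and the boilerplate-shingle cutoff is left out for brevity.

```python
import hashlib
import random
from collections import defaultdict
from itertools import combinations


def shingle_hashes(text, k=4):
    """Hypothetical helper: 64-bit hashes of the word k-grams of a text."""
    words = text.split()
    if not words:
        return set()
    last_start = max(len(words) - k, 0)
    grams = {" ".join(words[i:i + k]) for i in range(last_start + 1)}
    return {
        int.from_bytes(hashlib.blake2b(g.encode("utf-8"), digest_size=8).digest(), "big")
        for g in grams
    }


def next_permutation(shingles, permutation_count, rng):
    """One round: lazily draw pi on the shingles seen, then count agreeing minima."""
    pi = {}                      # shingle hash -> random 64-bit value, filled on demand
    buckets = defaultdict(list)  # minimum value -> ids of the docs attaining it
    for doc_id, doc_shingles in enumerate(shingles):
        if not doc_shingles:
            continue             # an empty document has no minimum
        for q in doc_shingles:
            if q not in pi:
                pi[q] = rng.getrandbits(64)
        buckets[min(pi[q] for q in doc_shingles)].append(doc_id)
    for docs in buckets.values():
        # Doc ids were appended in increasing order, so every pair has i < j.
        for i, j in combinations(docs, 2):
            permutation_count[(i, j)] += 1


def find_near_duplicates(shingles, num_perms=200, threshold=0.80, seed=0):
    """Return pairs (i, j) whose minima agree in more than threshold * num_perms rounds."""
    rng = random.Random(seed)
    permutation_count = defaultdict(int)
    for _ in range(num_perms):
        next_permutation(shingles, permutation_count, rng)
    return sorted(p for p, c in permutation_count.items() if c > threshold * num_perms)


docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bank",
    "an entirely different page about min hashing and shingles of text",
]
pairs = find_near_duplicates([shingle_hashes(d) for d in docs])
print(pairs)
```

In this toy run the first two documents are identical, so their minima agree in every round and the pair (0, 1) is reported; the third shares no shingles with them.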

Evaluation of Web Search

MR&S Chapter 8.

Additional reading

"A Taxonomy of Web Search," Andrei Broder, SIGIR Forum, 36:2, 2002, 3-10.

"The retrieval effectiveness of web search engines: considering results descriptions," Dirk Lewandowski, Journal of Documentation, 64:6, 2008, 915-937.

"The retrieval effectiveness of search engines on navigational queries," Dirk Lewandowski.

"Evaluating search engines by modeling the relationship between relevance and clicks," B. Carterette and R. Jones, NIPS 2007.

Evaluating the results returned for one query

Precision and Recall

D = set of all documents
Q = set of documents retrieved
R = set of relevant documents

Standard measures in IR (in fact, in all applications where the objective is to find a set of solutions):
Precision = |Q ∩ R| / |Q| -- fraction of retrieved documents that are relevant = 1 - (fraction of retrieved documents that are false positives).
Recall = |Q ∩ R| / |R| -- fraction of relevant documents that are retrieved = 1 - (fraction of relevant documents that are false negatives).

E.g. suppose there are actually 8 relevant pages. The query engine returns 5 pages of which 3 are relevant. Then Precision = 3/5 = 0.6. Recall= 3/8 = 0.375.
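The same example can be reproduced in a few lines of Python. This is a minimal sketch; the document IDs are invented, and returning 0.0 when a set is empty is a convention assumed here, not something the definitions prescribe.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set Q and a relevant set R."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # |Q intersect R|
    # Assumed convention: an empty denominator yields 0.0 instead of an error.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# 8 relevant pages; the engine returns 5 pages, 3 of them relevant.
relevant = {"r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8"}
retrieved = ["r1", "r2", "r3", "n1", "n2"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```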

If Q1 is a subset of Q2, then Recall(Q2) >= Recall(Q1). Prec(Q2) can be either greater or less than Prec(Q1). If you consider the precision over the first K documents returned for K = 1, 2, ... then the precision goes up (or stays at 1) every time d_K is relevant and down every time it is irrelevant, so the graph is sawtoothed. But on the whole precision tends to go down, so there is a trade-off between recall and precision as you get more documents.
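The sawtooth is easy to see on a small ranked list. The sketch below assumes a hypothetical list of relevance judgments in rank order and a known number of relevant documents (8, as in the earlier example); it tabulates precision and recall over the first K results.

```python
def precision_recall_at_k(ranked_relevance, num_relevant):
    """Return a list of (K, precision over first K, recall over first K)."""
    rows = []
    hits = 0
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        hits += bool(is_relevant)
        rows.append((k, hits / k, hits / num_relevant))
    return rows


# Hypothetical judgments for the top 5 results; 8 relevant documents exist overall.
ranked = [True, False, True, True, False]
rows = precision_recall_at_k(ranked, num_relevant=8)
for k, p, r in rows:
    print(k, round(p, 3), round(r, 3))
```

Precision moves 1.0, 0.5, 0.667, 0.75, 0.6 while recall only ever stays flat or rises.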

Probabilistic definition of precision/recall

Pick a document d at random from the collection. Let Q be the event that d is retrieved in response to a given query and let R be the event that d is relevant.

Precision = Prob(R|Q).
Recall = Prob(Q|R).
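
Assuming d is drawn uniformly from D, these agree with the set-based definitions. Writing Q and R both for the events and for the corresponding sets:

```latex
\begin{align*}
\mathrm{Prob}(R \mid Q) &= \frac{\mathrm{Prob}(R \wedge Q)}{\mathrm{Prob}(Q)}
  = \frac{|Q \cap R| / |D|}{|Q| / |D|}
  = \frac{|Q \cap R|}{|Q|} = \text{Precision}, \\
\mathrm{Prob}(Q \mid R) &= \frac{\mathrm{Prob}(Q \wedge R)}{\mathrm{Prob}(R)}
  = \frac{|Q \cap R| / |D|}{|R| / |D|}
  = \frac{|Q \cap R|}{|R|} = \text{Recall}.
\end{align*}
```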


Recall is pretty much irrelevant for web search because (a) users almost never want all relevant documents and (b) it is impossible to determine the set of relevant documents. So most of traditional IR evaluation theory goes out the window. Note that these correspond to different user models:


Relevance is taken relative to an information need, not to a query. The gold standard here is a reference librarian who can interact with the user to find out what he/she actually needs to know.

Note that the effectiveness of the query engine on a given query will often be inherently limited because of the inadequacy of the query as an expression of the information need. This inadequacy, in turn, may have several sources:

Some of these are more problems for the serious researcher; others apply to the man-in-the-street user as well.

Also, the phrase "relevant to an information need" is actually ambiguous. Does it mean "has to do with the same subject as the information need" or "goes some way toward satisfying the information need"?

It is difficult to formulate a good information need exactly. If it is narrow ("What year was Babe Ruth traded to the Yankees?"), one page usually suffices. If it is broad ("Give me information about high blood pressure"), relevance becomes hard to judge.

Other relevance issues:

Usefulness of page as part of larger informational search

A page may be useful even if it does not directly address the informational need:


(Lewandowski 2008) 4 cases:

Problems:

Comparing search engines overall

Formulate a set of information needs, evaluate the search engines on all, combine the results.
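
The notes do not say how the results should be combined. One simple possibility, assumed here purely for illustration, is to average precision over the top K results across all information needs for each engine. The data layout and the name `mean_precision_at_k` are hypothetical.

```python
from statistics import mean


def mean_precision_at_k(judgments, k):
    """judgments maps each information need to a ranked list of relevance flags.

    Results missing from a list shorter than k count as irrelevant,
    because each need's hit count is divided by k.
    """
    return mean(sum(ranked[:k]) / k for ranked in judgments.values())


# Hypothetical relevance judgments for two engines on the same two needs.
engine_a = {"need1": [True, True, False], "need2": [False, True, False]}
engine_b = {"need1": [True, False, False], "need2": [False, False, False]}
score_a = mean_precision_at_k(engine_a, k=3)
score_b = mean_precision_at_k(engine_b, k=3)
print(round(score_a, 3), round(score_b, 3))
```

Averaging per need gives every information need equal weight, whatever the number of results judged for it; other ways of combining are equally defensible.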

Experimental design

Measure relevance

Experimenter formulates need, poses queries, evaluates pages for relevance. Or pays other people to do some of these.

"Crowdsourcing for relevance evaluation," Omar Alonso, Daniel Rose, Benjamin Stewart, SIGIR Forum, 42:2, 2008.
Pay randoms on the web 1 penny per answer. This seems utterly unreliable to me.

User log analysis.

Click-through data in user logs. Advantages: Natural data, lots of data. Disadvantage: difficulty of interpretation. Confounding effects. E.g. are people clicking on the first item returned because it is most relevant (i.e. looks most relevant from the snippet) or just because it is first? If they click on only one, is that because the remaining pages are not relevant or because their need was satisfied, or because they gave up? You can correct for some of this, to some extent, by feeding people results in randomized order, but you don't want to do a lot of that to naive users.


Ask users to fill out a survey about their purposes and their success. (Broder, 2002).

Protocol analysis

Ask users, engaged in a task of some complexity, to talk out loud as they work; explain how they formulate and reformulate queries, how they choose links to follow, how they decide whether to continue.

Well known dangers:

Data is rich but limited and hard to interpret.

Other experimental designs

Subjects are given a list of questions to answer. Compare how long they take to answer the questions with one search engine as opposed to another, or study how they use the search engines.

Subjects are given a task, asked to use two different search engines and compare the quality of their results.

Behavioral studies of relevance

"Relevance: A Review of the Literature and a Framework for Thinking on the Notion in Information Science. Part III: Behavior and Effects of Relevance," Tefko Saracevic, Journal of the American Society for Information Science and Technology, 58:13, 2007, 2126-2144.

Significance of evaluation

Failures and errors occur for the following reasons: