Lecture 4: Similarity Queries / Evaluation

Required Reading

Chakrabarti, sec. 3.2.1, pp. 53-56; secs. 3.3.1, 3.3.2, pp. 68-72.

Additional Reading

Syntactic Clustering of the Web. Broder, Glassman, Manasse, and Zweig, 1998 (WWW6)

Similarity of two pages

Let T(D) be the set of terms in D (excluding stop-words). Measure the similarity of D1 to D2 as
|T(D1) intersect T(D2)| / |T(D1) union T(D2)| (Jaccard coefficient).
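As a concrete illustration, here is a minimal Python sketch of this measure. The stop-word list, the whitespace tokenizer, and the choice to return 0 for two empty term sets are assumptions made for the example, not part of the definition.

```python
STOP_WORDS = {"the", "a", "of", "on"}  # illustrative list only (assumption)


def terms(text):
    """T(D): lower-cased whitespace tokens of the text, minus stop-words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}


def jaccard(a, b):
    """Jaccard coefficient |a intersect b| / |a union b| of two term sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    return len(a & b) / len(union)


d1 = terms("the cat sat on the mat")  # {cat, sat, mat}
d2 = terms("a cat sat on a hat")      # {cat, sat, hat}
print(jaccard(d1, d2))
```

Here the two pages share 2 of 4 distinct terms, so the coefficient is 0.5.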

Algorithm for finding most similar pages in a collection (Jaccard coefficient)

for (each document D in collection C) 
   for (each term T in D)
       write record (T,D) to file F1;
sort F1 by T; 
aggregate records from F1 into form (T,S) where S is the set of documents
   containing T (discard terms where |S|=1).
for (each record (T,S))
   for each pair D1,D2 in S
     write record (D1,D2,1) to file F2;
sort F2 by D1,D2;
for (each pair (D1,D2) written to F2) {
   count the records (D1,D2,1); this count equals |T(D1) intersect T(D2)|;
   |T(D1) union T(D2)| = |T(D1)| + |T(D2)| - |T(D1) intersect T(D2)|;
   r = |T(D1) intersect T(D2)| / |T(D1) union T(D2)|;
}
save the pairs with the largest values of r (or with r greater than a threshold).
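The same computation can be sketched in Python for a collection small enough to fit in memory. This is only an illustration: dictionaries stand in for the sorted files F1 and F2, and the function name and input format are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def jaccard_all_pairs(docs):
    """docs: dict mapping document id -> set of terms.

    Returns a dict mapping (d1, d2), with d1 < d2, to the Jaccard
    coefficient, for every pair that shares at least one term.
    """
    # Plays the role of F1 after sorting and aggregation:
    # term -> set of documents containing it.
    postings = defaultdict(set)
    for d, ts in docs.items():
        for t in ts:
            postings[t].add(d)

    # Plays the role of F2 after sorting and counting:
    # (d1, d2) -> number of shared terms.
    shared = defaultdict(int)
    for s in postings.values():
        # A term found in only one document yields no pairs.
        for d1, d2 in combinations(sorted(s), 2):
            shared[(d1, d2)] += 1

    result = {}
    for (d1, d2), inter in shared.items():
        union = len(docs[d1]) + len(docs[d2]) - inter
        result[(d1, d2)] = inter / union
    return result


example = {"a": {"x", "y", "z"}, "b": {"y", "z", "w"}, "c": {"q"}}
print(jaccard_all_pairs(example))
```

Pairs with no term in common never appear in the output, which is the point of the method: the work is proportional to the number of co-occurrences rather than to the number of document pairs.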
This can be adapted to the vector model as follows:

Algorithm for finding most similar pages in a collection (vector model)

for (each document D in collection C) 
   for (each term T in D) {
       W := coefficient of T in normalized vector of D;
       write record (T,D,W) to file F1;
   }
sort F1 by T; 
aggregate records from F1 into form (T,S) where S is the set of pairs
   [D,W], where (T,D,W) has been written on F1
for (each record (T,S))
   for each (pair [D1,W1], [D2,W2] in S)
     write record (D1,D2,W1*W2) to file F2;
sort F2 by D1,D2;
for (each pair of documents (D1,D2) written to F2)
   similarity(D1,D2) = sum of X over all records (D1,D2,X) in F2;
return the most similar pairs of documents, or the pairs of documents with
   similarity greater than a threshold.
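A minimal in-memory Python sketch of the vector-model version follows. It assumes each document is given as a dictionary of raw term weights (for instance term frequencies) and that "normalized" means dividing by the Euclidean length, so the sum of products is the cosine similarity; both are assumptions for the example.

```python
import math
from collections import defaultdict
from itertools import combinations


def cosine_all_pairs(docs):
    """docs: dict mapping document id -> dict of term -> raw weight.

    Returns a dict mapping (d1, d2), with d1 < d2, to the dot product of
    the two normalized vectors, for pairs sharing at least one term.
    """
    # Plays the role of F1: term -> list of (document, normalized weight).
    postings = defaultdict(list)
    for d, vec in docs.items():
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm == 0:
            continue  # an all-zero vector is similar to nothing
        for t, w in vec.items():
            postings[t].append((d, w / norm))

    # Plays the role of F2: accumulate W1*W2 for each pair of documents.
    sim = defaultdict(float)
    for entries in postings.values():
        for (d1, w1), (d2, w2) in combinations(sorted(entries), 2):
            sim[(d1, d2)] += w1 * w2
    return dict(sim)


example = {"a": {"x": 1, "y": 1}, "b": {"x": 1, "y": 1}, "c": {"x": 1}}
print(cosine_all_pairs(example))
```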

Estimating the Jaccard coefficient of sets A and B using random sampling:

Step 1: Choose a value m and a random function P(x) from elements to integers.

Step 2: For set X, let V(X) = { x in X | P(x) mod m = 0} be a sample of X. 
        This is called the  sketch  of X. 
Step 3: | V(A) intersect V(B) | / |V(A) union V(B) | is an estimate of the similarity of A and B.

Note: if you instead sample A and sample B independently, you get a substantial underestimate of the resemblance, because an element common to both sets is unlikely to be drawn into both samples. Also note that you can get an independent estimate by using a different random function (or a relatively prime modulus), so by combining several of these you can get high precision and confidence.
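The three steps can be sketched in Python as follows. The use of SHA-256 with a seed as the "random function" P is an assumption made for the example; any function that behaves like a random map from elements to integers would do.

```python
import hashlib


def make_hash(seed):
    """Return a function P from elements to integers (assumed construction)."""
    def p(x):
        digest = hashlib.sha256(f"{seed}:{x}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return p


def sketch(xs, p, m):
    """V(X): the elements x of X with P(x) mod m = 0."""
    return {x for x in xs if p(x) % m == 0}


def estimate_jaccard(a, b, p, m):
    """Estimate the Jaccard coefficient of a and b from their sketches."""
    va, vb = sketch(a, p, m), sketch(b, p, m)
    union = va | vb
    if not union:
        return None  # the sample is empty, so there is no estimate
    return len(va & vb) / len(union)


A = set(range(0, 6000))
B = set(range(3000, 9000))  # true Jaccard coefficient is 3000/9000 = 1/3
p = make_hash(seed=1)
print(estimate_jaccard(A, B, p, m=4))
```

Because the same P is applied to both sets, an element common to A and B is either in both sketches or in neither, which is what makes the ratio an unbiased-looking estimate. Calling make_hash with a different seed gives a further estimate that can be averaged with the first.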

Detection of identical or near-identical pages

Shingle: K consecutive words.
Fix a shingle size K. Let S(D1) be the set of shingles in D1 and let S(D2) be the set of shingles in D2.

Compute |S(D1) intersect S(D2)| / |S(D1) union S(D2)| (Jaccard coefficient).

(Rather than save actual shingle, you can use a fingerprint of the shingle.)

Implementation uses shingle size = 10, m = 25, 40 bit sketch.

High-level algorithm from Broder et al.
Step 1: Normalize by removing HTML and converting to lower-case.
Step 2: Calculate sketch of shingle set of each document.
Step 3: For each pair of documents, compare sketches to see if their resemblance exceeds the threshold.
Step 4: Find connected components.
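The four steps can be sketched in Python under simplifying assumptions: full shingle sets are compared instead of sketches or fingerprints, all pairs are compared directly, HTML is stripped with a crude regular expression, and connected components are found with a small union-find. None of this is claimed to match the implementation in Broder et al.

```python
import re
from itertools import combinations


def shingles(text, k):
    """Step 1 and the shingle set: strip tags, lower-case, take k-word runs."""
    words = re.sub(r"<[^>]*>", " ", text).lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b):
    """Jaccard coefficient of two shingle sets (0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def clusters(docs, k, threshold):
    """docs: dict id -> text. Returns the connected components as sets of ids."""
    sets = {d: shingles(t, k) for d, t in docs.items()}
    parent = {d: d for d in docs}

    def find(d):
        while parent[d] != d:
            parent[d] = parent[parent[d]]  # path halving
            d = parent[d]
        return d

    # Step 3: link every pair whose resemblance reaches the threshold.
    for d1, d2 in combinations(docs, 2):
        if resemblance(sets[d1], sets[d2]) >= threshold:
            parent[find(d1)] = find(d2)

    # Step 4: group documents by the root of their component.
    groups = {}
    for d in docs:
        groups.setdefault(find(d), set()).add(d)
    return list(groups.values())


pages = {
    "a": "<p>The quick brown fox jumps over the lazy dog</p>",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "completely different text about search engines here",
}
print(clusters(pages, k=3, threshold=0.5))
```

With K = 3, pages a and b share 6 of 8 distinct shingles (resemblance 0.75) and fall in one cluster, while c is on its own.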

Further issues:

Evaluation of Web Search

Precision and Recall

D = set of all documents
Q = set of documents retrieved
R = set of relevant documents

QR -- True positives.
Q(D-R) -- False positives. (Irrelevant documents retrieved)
(D-Q)R -- False negatives. (Relevant documents omitted)
(D-Q)(D-R) -- True negatives. (Irrelevant documents omitted)

Percentage correct = (|QR| + |(D-Q)(D-R)|) / |D|.
Not a good measure; counts false positives and false negatives equally.
E.g. suppose |R|=3.
Q1 has two relevant documents and three irrelevant documents.
Q2 returns one irrelevant document.
Then both are making the same number of errors (4), but clearly Q1 is better than Q2.

Standard measures in IR (in fact, in all applications where the objective is to find a set of solutions):
Precision = |QR| / |Q| -- fraction of retrieved documents that are relevant = 1 - (fraction of retrieved documents that are false positives).
Recall = |QR| / |R| -- fraction of relevant documents that are retrieved = 1 - (fraction of relevant documents that are false negatives).

In the above example Q1 has precision 2/5 and recall 2/3. Q2 has precision and recall = 0.
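The example can be checked with a few lines of Python. The collection of 10 documents and the particular document numbers are hypothetical, chosen only to match the counts in the example; returning 0 for the precision of an empty result set is also an assumption.

```python
def precision(q, r):
    """|QR| / |Q|; taken to be 0 for an empty result set (assumption)."""
    return len(q & r) / len(q) if q else 0.0


def recall(q, r):
    """|QR| / |R|; taken to be 0 if nothing is relevant (assumption)."""
    return len(q & r) / len(r) if r else 0.0


def fraction_correct(q, r, d):
    """(true positives + true negatives) / |D|."""
    true_pos = len(q & r)
    true_neg = len(d - q - r)
    return (true_pos + true_neg) / len(d)


D = set(range(10))       # hypothetical collection of 10 documents
R = {0, 1, 2}            # |R| = 3
Q1 = {0, 1, 3, 4, 5}     # two relevant, three irrelevant
Q2 = {6}                 # one irrelevant document

for name, q in (("Q1", Q1), ("Q2", Q2)):
    print(name, fraction_correct(q, R, D), precision(q, R), recall(q, R))
```

Both result sets get the same fraction correct (0.6 in this 10-document collection), while precision and recall separate them.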

If Q1 subset Q2, then Recall(Q2) >= Recall(Q1). Prec(Q2) can be either greater or less than Prec(Q1). If you consider the precision over the first K documents returned for K = 1, 2, ... then the precision goes up every time dK is relevant and down every time it is irrelevant, so the graph is sawtoothed. But on the whole precision tends to go down, so there is a trade-off between recall and precision as you get more documents.
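The sawtooth is easy to see on a small ranked list. The sketch below assumes binary relevance and a hypothetical ranking; it records precision and recall over the first K documents for each K.

```python
def precision_recall_by_rank(ranking, relevant):
    """Return a list of (precision, recall) over the first K documents,
    for K = 1 .. len(ranking). `relevant` must be non-empty."""
    out = []
    hits = 0
    for k, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
        out.append((hits / k, hits / len(relevant)))
    return out


ranking = ["r1", "x", "r2", "y", "r3"]  # hypothetical result list
relevant = {"r1", "r2", "r3"}
for k, (p, r) in enumerate(precision_recall_by_rank(ranking, relevant), start=1):
    print(k, round(p, 3), round(r, 3))
```

On this list precision runs 1, 1/2, 2/3, 1/2, 3/5 while recall climbs 1/3, 1/3, 2/3, 2/3, 1.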

Smoothed precision: Plot precision only at the points where relevant documents have been found; interpolate in between. Set precision(0)=1. Then precision can be monotonically decreasing, and will tend to be so except possibly at the beginning.

Probabilistic model. Suppose that the matcher returns a measure of the "quality" of the document for the query. Suppose that this measured quality has some value in the following sense:
If q1 > q2, then Prob(d in R | qual(d)=q1) > Prob(d in R | qual(d) = q2)
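Let QT = { d | qual(d) >= T }.
Then the expected value of precision(QT) is an increasing function of T and the expected value of recall(QT) is a decreasing function of T. Equivalently, as T is lowered and QT grows, expected precision goes down while expected recall goes up, concave downward as a function of the number of documents retrieved.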

Other tradeoffs

Choices other than the threshold also tend to trade off precision against recall. (Of course, if they don't trade off, then just go with the better of the two: win-win.) E.g. stemming and inclusion of synonyms tend to increase recall at the cost of precision.

Problems with Precision and Recall

Alternative measures

F-measure: Harmonic mean of precision and recall:
1/F = average(1/p,1/r)
F = 2pr/(p+r).
If either p or r is small then F is small. If p and r are close then F is about the average of p and r.
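A short Python sketch makes both remarks concrete. Returning 0 when p and r are both 0 is an assumption, since the harmonic mean is undefined there.

```python
def f_measure(p, r):
    """Harmonic mean of precision p and recall r: F = 2pr/(p+r)."""
    if p + r == 0:
        return 0.0  # assumption: define F as 0 when both are 0
    return 2 * p * r / (p + r)


print(f_measure(0.4, 2 / 3))  # Q1 from the earlier example
print(f_measure(0.9, 0.1))    # one value small: F stays near the small one
print(f_measure(0.5, 0.6))    # close values: F is near their average
```

For p = 0.9 and r = 0.1 the arithmetic mean would be 0.5, but F is 0.18.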

Generalized precision: Value of information obtained for user / cost of examining results.

Generalized recall: Value of information obtained / Value of optimal results (or: value of entire Web for user's current need)

Average precision: Average of precision at 20% recall, 50%, 80%. Or average of precision at recall = 0%, 10%, 20% ... 90%, 100%. (Since recall does not attain these values exactly, and since recall remains constant until the next relevant document is found, so that the same value of recall can have several values of precision, take the max precision or the average precision, and interpolate. Similarly, precision at recall = 0% is extrapolated.)
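One way to carry this out is sketched below in Python. It assumes binary relevance and one particular interpolation rule: precision at a recall level is the maximum precision observed at any point whose recall is at least that level. That rule is a common convention, but it is only one of the options mentioned above.

```python
def interpolated_average_precision(ranking, relevant, levels=None):
    """Average of interpolated precision at the given recall levels
    (default: 0.0, 0.1, ..., 1.0). `relevant` must be non-empty."""
    if levels is None:
        levels = [i / 10 for i in range(11)]

    # (recall, precision) measured each time a relevant document is found.
    points = []
    hits = 0
    for k, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / k))

    total = 0.0
    for level in levels:
        candidates = [p for r, p in points if r >= level]
        # If the level is never reached, count its precision as 0.
        total += max(candidates) if candidates else 0.0
    return total / len(levels)


ranking = ["r1", "x", "r2", "y", "r3"]  # hypothetical result list
relevant = {"r1", "r2", "r3"}
print(interpolated_average_precision(ranking, relevant))
print(interpolated_average_precision(ranking, relevant, levels=[0.2, 0.5, 0.8]))
```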

Precision over first K documents: (or average relevance over first K documents). User model: User will only read first K documents.

Rank of Kth relevant document (or rank such that sum of relevance = K). User model: User will read until he has gotten K relevant documents (or documents whose total relevance is K).
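Both of these user-model measures are simple to compute from the list of relevance values in rank order. In the sketch below, relevance may be binary or graded; counting missing results as irrelevant when fewer than K documents are returned is an assumption.

```python
def precision_over_first_k(rels, k):
    """Average relevance over the first k results.
    rels: relevance values in rank order. Missing results count as 0."""
    return sum(rels[:k]) / k


def rank_reaching(rels, k):
    """Smallest rank at which the running sum of relevance reaches k,
    or None if the user never accumulates that much."""
    total = 0.0
    for rank, rel in enumerate(rels, start=1):
        total += rel
        if total >= k:
            return rank
    return None


rels = [1, 0, 1, 0, 1]  # hypothetical binary relevance of a ranked list
print(precision_over_first_k(rels, 3))
print(rank_reaching(rels, 2))
print(rank_reaching(rels, 4))
```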

Weighted precision: (sum over dK in Q of rel(dK)) / |Q|.

Weighted recall: (sum over dK in Q of rel(dK)) / (sum over dK in R of rel(dK)).

Order diminishing sum: Value of search is the sum over K of rel(dK) * p^K. User model: User starts reading at beginning, at each step continues with probability p.
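These three weighted measures can be sketched in Python as follows. Graded relevance values in [0, 1] and the particular numbers are assumptions for the example; K is counted from 1, as in the formula.

```python
def weighted_precision(retrieved_rels):
    """Sum of rel(dK) over retrieved documents, divided by |Q|."""
    return sum(retrieved_rels) / len(retrieved_rels)


def weighted_recall(retrieved_rels, relevant_rels):
    """Sum of rel over retrieved documents divided by sum of rel over R."""
    return sum(retrieved_rels) / sum(relevant_rels)


def order_diminishing_sum(retrieved_rels, p):
    """Sum over K of rel(dK) * p^K, with K = 1 for the first result."""
    return sum(rel * p ** k for k, rel in enumerate(retrieved_rels, start=1))


retrieved = [1.0, 0.0, 0.5]   # hypothetical rel(d1), rel(d2), rel(d3)
all_relevant = [1.0, 0.5, 0.5]  # hypothetical rel values of documents in R
print(weighted_precision(retrieved))
print(weighted_recall(retrieved, all_relevant))
print(order_diminishing_sum(retrieved, p=0.5))
```

With p = 0.5 the third result contributes only 0.5 * 0.125 to the total, which shows how strongly this measure rewards putting relevant documents first.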

Total content precision: Total information relevant to query in pages retrieved divided by |Q| (or divided by total reading time). Of course, this is hard to quantify. You can, for example, prepare a list of questions on the subject matter, and measure "Total information" as the fraction of questions that can be answered from the retrieved texts.

Total content recall: Total information relevant to query in pages retrieved divided by total information relevant to query in Web.

Estimating R

A document is not found for one of two reasons.

First case is hopeless. (We will talk later about how to estimate the size of this set, but there is no way to estimate the number of relevant documents in it.)

Second case:



Experimental design

Observational studies: users observed as they use web for their own purposes. Ecologically valid. The more interference, the less ecologically valid. (Just informing users that they are observed alters their behavior; however, there can be privacy issues if they are observed without being informed.)

Controlled experiment: Users carry out task specified by experimenter in controlled setting. Much more information per task, much more demanding of user, possible to design narrowly focussed experiment, less clearly representative of "normal" use.

Task design

What is a task? A single query or a research process?

How are queries chosen? Just taking the most frequent queries probably does not give a good test; these are neither very interesting nor very high-minded. Plus the search engines may well have hand-edited the answers. The same issue arises generally with regard to creating a corpus of benchmark problems. There must be some theory, but I don't know it.

Significance of experiment

Failures and errors occur for the following reasons: