2. Google's spelling correction is based entirely on user logs; that is, Google observes how users correct their own spelling in consecutive queries.
3. Ranking (query dependent) is partly based on user behavior. If users querying on Q generally decide to click on page P, that raises the relevance of P to Q.
4. The problem of dealing with "spam" --- garbage, duplicated pages, and attempts to trick the PageRank algorithm --- eats up a substantial fraction of Google's CPU cycles and is, generally, the hardest problem Google has to deal with.
In 2007, the query "Jaguar" generated the following top-level clusters: "Parts", "Photos", "Club", "Reviews", "Panthera onca", "Atari game", "Mac", "Jacksonville Jaguars", and 13 other topics. (Like most clustering systems, it oversplits.) It also has subclusters. These can be very small. For instance, "Panthera Onca" is divided into the subclusters "Western Hemisphere", "Big Cats", "Class, Kingdom" and "Other topics" each with 2 pages.
The query "Hepburn" gives the top-level clusters "Katharine Hepburn", "Photos" [mostly Audrey, some Katharine, some random], "Winner,Oscar", "Family", "Hepburn Biography", "Father, Born", "Van Heemstra Hepburn-Ruston" [Audrey], "Amazon.com", "Review, Allwatchers", "Canadian", and so on. Oddly, there is no cluster labelled "Audrey Hepburn". Clearly there is room for improvement here.
(Curious though irrelevant observation: As of 2007, the top pages returned by Google for the query "jaguar" were the car and the animal. The top 3 were for the car, #4 was for the animal, and of the top 30, 10 were for the animal, 12 were for the car, 2 for a quantum chemistry package, and 1 each for a model company; "USS Jaguar", a kid's Star Trek site; the Mac OS; Jaguar Computer Systems; Jaguar Wright, a pop singer; and flickr, the Yahoo photo collection, which among other things clusters photos.) The query "jaguars" gives mostly the football team and the animal; the first page for the car is #16. The query "jaguar jaguars" gives mostly the car and the animal, though a substantially different set of pages from the query "jaguar".
As of 2004, in Google, searching under "jaguar" gives first (mostly) the automobile, and to a lesser extent, the MAC OS version. The first page for the animal is #8, the second is #18. Searching under "jaguars" gives a mixture of the animal and football teams; the first page for the car is #25. Searching under "jaguar jaguars" gives only pages for S-type Jaguars (the car) down through page 56; page 57 is the animal.
The results in 2002 were quite different. At that time, searching under "jaguar" gave first (mostly) the automobile; the animal turned up first at #15. Searching under "jaguars" gave football teams for at least the first 50. Searching under "jaguar jaguars" gave mostly pages for the animal. I have no idea to what extent this change reflects changes in Google versus changes in the Web.
Clearly, this is useful. Clearly there is also room for improvement in the grouping algorithm (not to mention in our apparent priorities). Clusty provides no information on how it does clustering. There are two issues here: one is doing the clustering; the other is extracting an identifying phrase. We will just talk about the first.
K-means-cluster(S : set of vectors; k : integer) {
    let C[1] ... C[k] be a random partition of S into k parts;
    repeat {
        for i := 1 to k {
            X[i] := centroid of C[i];
            C[i] := empty
        }
        for j := 1 to N {
            q := the index in 1 ... k for which X[q] is closest to S[j];
            add S[j] to C[q]
        }
    } until the change to C (or the change to X) is small enough.
}
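The pseudocode above can be sketched in Python. This is a minimal illustration, not anyone's production code; the representation of points and centroids as lists of floats, and the balanced random initial partition, are my own choices:

```python
import random

def dist2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, iters=20, seed=0):
    """K-means following the pseudocode: random partition, then
    alternate centroid computation with nearest-centroid reassignment."""
    rng = random.Random(seed)
    # Random initial partition of S into k parts (balanced, then shuffled).
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    centroids = []
    for _ in range(iters):
        # X[i] := centroid of C[i]
        dim = len(points[0])
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for p, lab in zip(points, labels):
            counts[lab] += 1
            for d in range(dim):
                sums[lab][d] += p[d]
        centroids = []
        for i in range(k):
            if counts[i]:
                centroids.append([s / counts[i] for s in sums[i]])
            else:
                # Empty cluster: re-seed its centroid with a random point.
                centroids.append(list(rng.choice(points)))
        # C[q] := points whose closest centroid is X[q].
        new_labels = [min(range(k), key=lambda q: dist2(p, centroids[q]))
                      for p in points]
        if new_labels == labels:       # no change: converged
            break
        labels = new_labels
    return labels, centroids
```

On two well-separated blobs this recovers the blobs; with an unlucky initial partition it can still stop at a local optimum, per the hill-climbing caveat below.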
Example: (Note: I have started here with random points rather than a random partition.)
Features:
The sum of the squared distances from points to their cluster centroids steadily decreases over the iterations.
Hill-climbing algorithm; hence it converges only to a local optimum.
Problems:
Variant 1: Update centroids incrementally (after each point reassignment rather than once per pass). This is claimed to give better results.
Variant 2: Bisecting K-means algorithm:
for I := 1 to K-1 do {
    pick a leaf cluster C to split;
    for J := 1 to ITER do
        use 2-means clustering to split C into two subclusters C1 and C2;
    choose the best of the above splits and make it permanent
}

Generates a binary clustering hierarchy. Works better (so claimed) than K-means.
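A sketch of the bisecting variant, under the assumption (my reading of "pick a leaf cluster") that we always split the largest current leaf, keeping the best of several trial 2-means splits by sum of squared distances:

```python
import random

def dist2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def two_means(cluster, rng, iters=10):
    """One trial 2-means split of a cluster; returns (sse, part1, part2)."""
    c1, c2 = rng.sample(cluster, 2)
    for _ in range(iters):
        p1 = [p for p in cluster if dist2(p, c1) <= dist2(p, c2)]
        p2 = [p for p in cluster if dist2(p, c1) > dist2(p, c2)]
        if not p1 or not p2:              # degenerate split: re-seed
            c1, c2 = rng.sample(cluster, 2)
            continue
        c1 = [sum(xs) / len(p1) for xs in zip(*p1)]
        c2 = [sum(xs) / len(p2) for xs in zip(*p2)]
    p1 = [p for p in cluster if dist2(p, c1) <= dist2(p, c2)]
    p2 = [p for p in cluster if dist2(p, c1) > dist2(p, c2)]
    sse = sum(dist2(p, c1) for p in p1) + sum(dist2(p, c2) for p in p2)
    return sse, p1, p2

def bisecting_kmeans(points, k, trials=5, seed=0):
    """Split the largest leaf cluster until there are k leaves,
    keeping the best (lowest-SSE) of `trials` 2-means splits."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        target = clusters.pop()           # the largest leaf
        best = min((two_means(target, rng) for _ in range(trials)),
                   key=lambda t: t[0])
        clusters += [best[1], best[2]]
    return clusters
```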
Variant 3: Allow overlapping clusters: If distances from P to C1 and to C2 are close enough then put P in both clusters.
First, the computation of the new centroid requires only the points added to the cluster, the points taken out of the cluster, and the number of points in the cluster.
Second, computing the change to the clusters caused by moving a centroid can be made more efficient as follows:
Let X_{J}(T) be the centroid of the Jth cluster on the Tth iteration,
and let V_{I} be the Ith point. Note that
dist(V_{I},X_{J}(T+1)) <= dist(V_{I},X_{J}(T)) + dist(X_{J}(T),X_{J}(T+1)).
Therefore you can maintain two arrays:
RADIUS(J) = an overestimate of the maximum value of dist(V_{I},X_{J}), where V_{I} is in cluster J.
DIST(J,Z) = an underestimate of the minimum value of dist(V_{I},X_{Z}), where V_{I} is in cluster J.
which you update as follows:
RADIUS(J) := RADIUS(J) + dist(X_{J}(T),X_{J}(T+1));
DIST(J,Z) := DIST(J,Z) - dist(X_{Z}(T),X_{Z}(T+1))
- dist(X_{J}(T),X_{J}(T+1));
then as long as RADIUS(J) < DIST(J,Z) you can be sure that none of
the points in cluster J should be moved to cluster Z. (You update these
values more exactly whenever a point is moved.)
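A small numeric illustration of this bound maintenance. The points and the center movements below are made up; RADIUS and DIST are updated exactly as in the text, and the final comparison shows the reassignment test being skipped safely:

```python
import math

def d(u, v):
    return math.dist(u, v)

# Made-up data: cluster J's points, and centers J and Z before/after one step.
cluster_j = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
xj_old, xj_new = (0.4, 0.4), (1 / 3, 1 / 3)     # X_J(T), X_J(T+1)
xz_old, xz_new = (10.0, 10.0), (9.5, 10.2)      # X_Z(T), X_Z(T+1)

# Exact values before the move; the updates below turn them into bounds.
radius_j = max(d(p, xj_old) for p in cluster_j)   # RADIUS(J)
dist_jz = min(d(p, xz_old) for p in cluster_j)    # DIST(J,Z)

# Bound updates from the text.
radius_j += d(xj_old, xj_new)
dist_jz -= d(xz_old, xz_new) + d(xj_old, xj_new)

# RADIUS(J) < DIST(J,Z): every point of J is provably still closer to X_J
# than to X_Z, so no point needs to be tested for a move to Z.
skip = radius_j < dist_jz

# Sanity check against the exact distances.
assert all(d(p, xj_new) < d(p, xz_new) for p in cluster_j)
```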
This can be approximated in linear time as follows:
medoid(C, X) {
    U := the point in C furthest from X;
    V := the point in C furthest from U;
    Y := the point in C that minimizes d(Y,U) + d(Y,V) + |d(Y,U) - d(Y,V)|;
    return(Y)
}

Note that d(X,U) + d(X,V) is minimal, and equal to d(U,V), for points X on the line between U and V; and among points on that line, |d(X,U) - d(X,V)| is minimal when X is at the center of the line. [Geraci et al. 2006] call this the "medoid" of C.
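The medoid pseudocode translates directly; passing the metric d in as a parameter is my own packaging:

```python
def medoid(C, X, d):
    """Linear-time approximate medoid of C, per the pseudocode above.
    The metric d is a parameter here (an illustrative packaging choice)."""
    U = max(C, key=lambda p: d(p, X))   # furthest from X
    V = max(C, key=lambda p: d(p, U))   # furthest from U: a rough diameter U-V
    # Points near the middle of the U-V segment minimize both terms.
    return min(C, key=lambda Y: d(Y, U) + d(Y, V) + abs(d(Y, U) - d(Y, V)))
```

On the 1-D set {0, 1, 2, 3, 10} with X = 0, we get U = 10 and V = 0, so the winner is the point nearest the midpoint 5, namely 3.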
{
    V := random point in S;
    CENTERS := { V };
    for (U in S) { LEADER[U] := V; D[U] := d(U,V); }
    for (I := 1 to K) {
        X := the point in S - CENTERS with maximal value of D[X];
        C := emptyset;
        for (U in S - CENTERS)
            if (d(U,X) < D[U]) add U to C;
        X := medoid(C,X);
        add X to CENTERS;
        for (U in S - CENTERS)
            if (d(U,X) < D[U]) { LEADER[U] := X; D[U] := d(U,X) }
    }
    return(CENTERS)
}

(I think this is right; it is not quite clear to me from the paper when the "medoids" are updated.)
The final value of CENTERS is the set of centers of clusters. For any U in CENTERS, the cluster centered at U is the set of points X for which LEADER[X]=U.
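Putting the pieces together, a Python sketch of the furthest-point procedure as I read it. In particular, I assume each new center is snapped to the medoid of the set C it captures, that a center's own LEADER entry is itself, and that the result has k centers in total (the paper's loop bound may differ by one):

```python
import math
import random

def dist(u, v):
    return math.dist(u, v)

def medoid(C, X):
    U = max(C, key=lambda p: dist(p, X))
    V = max(C, key=lambda p: dist(p, U))
    return min(C, key=lambda Y: dist(Y, U) + dist(Y, V) + abs(dist(Y, U) - dist(Y, V)))

def furthest_point_clustering(S, k, seed=0):
    """Returns (CENTERS, LEADER); points are tuples so they can be dict keys."""
    rng = random.Random(seed)
    v = rng.choice(S)
    centers = [v]
    leader = {p: v for p in S}
    dd = {p: dist(p, v) for p in S}       # D[U] in the pseudocode
    for _ in range(k - 1):
        noncenters = [p for p in S if p not in centers]
        x = max(noncenters, key=lambda p: dd[p])   # furthest from its leader
        c = [u for u in noncenters if dist(u, x) < dd[u]]
        x = medoid(c, x)                  # snap the new center to the medoid
        centers.append(x)
        leader[x], dd[x] = x, 0.0
        for u in S:
            if u not in centers and dist(u, x) < dd[u]:
                leader[u], dd[u] = x, dist(u, x)
    return centers, leader
```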
Features:
1. Time = O(nk).
2. Uses only sparse vectors.
3. The "maximal radius" measure

      max_{U in S} min_{X in CENTERS} d(U,X)

   is within a factor of 2 of optimal.
{
    put every point in a cluster by itself;
    for I := 1 to N-1 do {
        let C1, C2 be the most mergeable pair of clusters;
        create C as parent of C1, C2
    }
}

Various measures of "mergeable" are used.
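One possible concrete reading, with "most mergeable" taken to be the single-link criterion (smallest minimum pairwise distance); the naive O(n^3) rescan below is for clarity, not speed:

```python
import math
from itertools import combinations

def single_link(c1, c2):
    """Single-link 'mergeability': smallest pairwise distance."""
    return min(math.dist(p, q) for p in c1[0] for q in c2[0])

def agglomerative(points):
    """Each cluster is (member_list, tree); leaves of the tree are points.
    Repeatedly merge the closest pair until one binary tree remains."""
    clusters = [([p], p) for p in points]
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        c2 = clusters.pop(j)            # j > i, so pop j first
        c1 = clusters.pop(i)
        clusters.append((c1[0] + c2[0], (c1[1], c2[1])))
    return clusters[0][1]               # the root of the binary merge tree
```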
Characteristics
pick a starting point D in S;
CC := { { D } };   /* set of clusters: initially one cluster containing D */
for Di in S do {
    C := the cluster in CC "closest" to Di;
    if similarity(C, Di) > threshold
        then add Di to C
        else add { Di } to CC;
}
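A sketch in Python. Since the text leaves "similarity" open, I use distance to the cluster centroid with a distance threshold, which inverts the comparison in the pseudocode (lower distance = more similar):

```python
import math

def centroid(c):
    """Componentwise mean of a list of point tuples."""
    return tuple(sum(xs) / len(c) for xs in zip(*c))

def leader_cluster(S, threshold):
    """One pass: each point joins the nearest existing cluster if its
    centroid is within `threshold`, else starts a new cluster."""
    clusters = []
    for p in S:
        if clusters:
            best = min(clusters, key=lambda c: math.dist(p, centroid(c)))
            if math.dist(p, centroid(best)) <= threshold:
                best.append(p)
                continue
        clusters.append([p])
    return clusters
```

Note the order dependence: the result depends on the order in which points arrive, which is the price of the single pass.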
Features:
1. "Soft" assignment to clusters. That is, if a point U fits two models M1,
M2 fairly well, then it has a reasonable probability of being in either.
Viewed the other way, U provides some degree of "evidentiary support" to
both M1 and M2.
2. Allows overlapping models of different degrees of precision. In particular, there can be narrow-diameter models against a broad background.
Applying this to documents requires a stochastic model of document generation. Even for quite simple documents, the statistics become quite hairy. See Chakrabarti.
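To make the "soft assignment" point concrete without the document-model machinery, here is EM for a two-component 1-D Gaussian mixture; everything about it (1-D data, two components, the crude initialization) is illustrative rather than anything from the notes:

```python
import math

def em_mixture(xs, iters=50):
    """EM for a two-component 1-D Gaussian mixture. Returns the means and
    the responsibilities r[i][j] = P(component j | x_i): the 'soft'
    cluster assignment of each point."""
    mu = [min(xs), max(xs)]           # crude initialization
    sigma = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E step: soft assignments under the current model.
        r = []
        for x in xs:
            w = [pi[j] * math.exp(-((x - mu[j]) ** 2) / (2 * sigma[j] ** 2))
                 / sigma[j] for j in range(2)]
            s = w[0] + w[1]
            r.append([w[0] / s, w[1] / s])
        # M step: re-estimate each component from its soft counts.
        for j in range(2):
            nj = sum(ri[j] for ri in r)
            pi[j] = nj / len(xs)
            mu[j] = sum(ri[j] * x for ri, x in zip(r, xs)) / nj
            var = sum(ri[j] * (x - mu[j]) ** 2 for ri, x in zip(r, xs)) / nj
            sigma[j] = max(1e-3, math.sqrt(var))  # floor keeps sigma positive
    return mu, r
```

A point midway between the two components would get responsibilities near 0.5 for each, supporting both models, which is the "evidentiary support" view above.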
Example: S = { "cat ate cheese", "mouse ate cheese too", "cat ate mouse too" }
Suffix tree can be constructed in linear time.
Step 2: Score nodes. For node N in suffix tree, let D(N) = set of documents in subtree of N. Let P(N) be the phrase labelling N. Define score of N, s(N) = |D(N)| * f(|P(N)|). f(1) is small; f(K) = K for K = 2 ... 6; f(K) = 6 for K > 6.
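The scoring step is easy to make concrete. The value f(1) = 0.5 below is a placeholder, since the text only says "f(1) is small", and the substring containment test is a crude stand-in for walking the suffix tree:

```python
def f(k):
    """Phrase-length weight from the text: small at 1, equals k for
    k = 2..6, caps at 6. The 0.5 for f(1) is an assumed placeholder."""
    return 0.5 if k == 1 else min(k, 6)

def score(docs, phrase):
    """s(N) = |D(N)| * f(|P(N)|): the number of documents containing the
    phrase, times the phrase-length weight."""
    n_docs = sum(1 for d in docs if phrase in d)
    return n_docs * f(len(phrase.split()))
```

For the three-document example above, score(docs, "ate cheese") = 2 * f(2) = 4, since the phrase occurs in two of the three documents.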
Step 3: Find clusters.
A. Construct an undirected graph whose vertices are the nodes of the suffix tree. There is an arc from N1 to N2 if the following conditions hold:
B. Each connected component of this graph is a cluster. Score of cluster computed from scores of nodes, overlap function. Top 10 clusters returned. Cluster can be described using phrases of its ST nodes.
In machine learning, this is known as classification rather than clustering. We will discuss this later.
Formal measures: Normalize all vectors to length 1. Assume fixed number of clusters.
Variable number of clusters: any of the above plus a suitable penalty for more clusters. Note: without a penalty, these are all optimized when each point is a cluster by itself.
Formal measures test the adequacy of the clustering algorithm, but not the relevance of the measures themselves to actual significance.
Comparison to Gold Standard
Gold standard can be either an existing clustering system (e.g. Open Directory) or user tests.
Various statistical and information-theoretic measures.