Lecture 5: Clustering

Four things I learned from Wilson Hsieh

1. Publications by researchers at Google can be found at http://research.google.com/pubs/papers.html

2. Google's spelling correction is done entirely on the basis of user logs; that is, Google observes how users correct their own spelling in consecutive queries.

3. Ranking (query dependent) is partly based on user behavior. If users querying on Q generally decide to click on page P, then that ups the relevance of P to Q.

4. The problem of dealing with "spam" --- garbage, duplicated pages, and attempts to trick the page rank algorithm --- eats up a substantial fraction of Google's CPU cycles, and, generally, is the hardest problem Google has to deal with.

Required Reading

Chakrabarti chap. 4 through section 4.2.

Further Reading

Chakrabarti sec. 4.4.
Grouper: A Dynamic Clustering Interface to Web Search Results, by Oren Zamir and Oren Etzioni.
A Comparison of Document Clustering Techniques, by Michael Steinbach, George Karypis, and Vipin Kumar.
Cluster Generation and Cluster Labelling for Web Snippets: A Fast and Accurate Hierarchical Solution, by Filippo Geraci, Marco Pellegrini, Marco Maggini, and Fabrizio Sebastiani.

Clustering: Introduction

The Clusty search engine organizes its responses by topic. Example:

In 2007, the query "Jaguar" generated the following top-level clusters: "Parts", "Photos", "Club", "Reviews", "Panthera onca", "Atari game", "Mac", "Jacksonville Jaguars", and 13 other topics. (Like most clustering systems, it oversplits.) It also has subclusters. These can be very small. For instance, "Panthera Onca" is divided into the subclusters "Western Hemisphere", "Big Cats", "Class, Kingdom" and "Other topics" each with 2 pages.

The query "Hepburn" gives the top-level clusters "Katharine Hepburn", "Photos" [mostly Audrey, some Katharine, some random], "Winner,Oscar", "Family", "Hepburn Biography", "Father, Born", "Van Heemstra Hepburn-Ruston" [Audrey], "Amazon.com", "Review, Allwatchers", "Canadian", and so on. Oddly, there is no cluster labelled "Audrey Hepburn". Clearly there is room for improvement here.

(Curious though irrelevant observation: As of 2007, the top pages returned by Google for the query "jaguar" were the car and the animal. The top 3 were the car, #4 was the animal, and of the top 30, 10 were for the animal, 12 were for the car, 2 for a quantum chemistry package, and 1 each for a model company; "USS Jaguar", a kid's Star Trek site; the Mac OS; Jaguar Computer Systems; Jaguar Wright, a pop singer; and flickr, the Yahoo photo collection, which among other things clusters photos. The query "jaguars" gave mostly the football team and the animal; the first page for the car was #16. The query "jaguar jaguars" gave mostly the car and the animal, though a substantially different set of pages from the query "jaguar".

As of 2004, in Google, searching under "jaguar" gave first (mostly) the automobile, and to a lesser extent the Mac OS version. The first page for the animal was #8, the second #18. Searching under "jaguars" gave a mixture of the animal and football teams; the first page for the car was #25. Searching under "jaguar jaguars" gave only pages for S-type Jaguars (the car) down through #56; #57 was the animal.

The results in 2002 were quite different. At that time, searching under "jaguar" gave first (mostly) the automobile; the animal turned up first at #15. Searching under "jaguars" gave football teams for at least the first 50. Searching under "jaguar jaguars" gave mostly pages for the animal. I have no idea to what extent this change reflects changes in Google versus changes in the Web.)

Clearly, this is useful. Equally clearly, there is room for improvement in the grouping algorithm (not to mention in our apparent priorities). Clusty provides no information on how it does its clustering. There are two issues here: one is doing the clustering; the other is extracting an identifying phrase for each cluster. We will talk only about the first.

Source of clustering

Generally, the set of Web pages returned in answer to a query falls into multiple clusters for one of two reasons (as with virtually everything, the boundary between the two is fuzzy).

Applications

Cluster structure

Information source

Position of clustering module

Textual similarity criterion

Note: It is generally safe to presume that a cluster of documents is a convex region geometrically. That is, if subject S includes DOC1 with combination V1 of words and DOC2 with combination V2 of words, then there could exist a document in S with combination p*V1 + (1-p)*V2 of words, for any p between 0 and 1.

Source of clustering in search results

Clustering algorithms

K-means algorithm

K-means-cluster (in S : set of vectors; k : integer)
{  let C[1] ... C[k] be a random partition of S into k parts;
   repeat {
            for i := 1 to k {
               X[i] := centroid of C[i];
               C[i] := empty 
              } 
             for j := 1 to N {    /* N = |S| */
                q := the index I such that X[I] is closest to S[j];
                add S[j] to C[q]
              } }
    until the change to C (or the change to X) is small enough.
}
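
To make this concrete, here is a minimal runnable version in Python with numpy. It is a sketch under my own naming, not anyone's production code; like the example below, it starts from k random points rather than a random partition.

import numpy as np

def kmeans(S, k, max_iter=100, tol=1e-6, seed=None):
    # S is an N x d array of vectors; initialize the centroids X
    # with k distinct random points of S.
    rng = np.random.default_rng(seed)
    X = S[rng.choice(len(S), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each point to its closest centroid.
        dists = np.linalg.norm(S[:, None, :] - X[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster empties.
        newX = np.array([S[labels == i].mean(axis=0) if (labels == i).any()
                         else X[i] for i in range(k)])
        X, shift = newX, np.linalg.norm(newX - X)
        if shift < tol:          # the change to X is small enough
            break
    return labels, X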


Example: (Note: I have started here with random points rather than a random partition.)

Features:

Problems:

Variant 1: Update centroids incrementally, as each point is reassigned. This is claimed to give better results.

Variant 2: Bisecting K-means algorithm:

for I := 1 to K-1 do {
    pick a leaf cluster C to split;
    for (J := 1 to ITER) 
        use 2-Means clustering to split C into two subclusters C1 and C2;
    choose the best of the above splits and make it permanent
}
Generates a binary clustering hierarchy. Works better (so claimed) than K-means.
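
A sketch of the bisecting variant in Python, reusing the kmeans function above. The pseudocode does not fix which leaf to split or what "best" means; splitting the largest leaf and scoring a split by total point-to-centroid distance are my assumptions here.

def bisecting_kmeans(S, k, iters=5):
    # Each cluster is an array of indices into S; start with everything.
    clusters = [np.arange(len(S))]
    while len(clusters) < k:
        # Pick a leaf cluster to split: here, the largest one.
        idx = clusters.pop(max(range(len(clusters)),
                               key=lambda i: len(clusters[i])))
        trials = []
        for _ in range(iters):
            labels, X = kmeans(S[idx], 2)
            # Score a split by total distance of points to their centroids.
            cost = sum(np.linalg.norm(S[idx][labels == j] - X[j], axis=1).sum()
                       for j in (0, 1))
            trials.append((cost, labels))
        cost, labels = min(trials, key=lambda t: t[0])  # keep the best split
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters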

Variant 3: Allow overlapping clusters: If distances from P to C1 and to C2 are close enough then put P in both clusters.

Efficient implementation of K-means clustering

As we wrote the algorithm before, on each iteration you have to compute the distance from every point to every centroid. If there are many points, and K is reasonably large, and the division into clusters has become fairly stable, then the update procedure can be made more efficient as follows:

First, the computation of the new centroid requires only the points added to the cluster, the points taken out of the cluster, and the number of points in the cluster.

Second, computing the change made to a cluster by moving the centroid can be made more efficient as follows:

Let XJ(T) be the centroid of the Jth cluster on the Tth iteration, and let VI be the Ith point. Note that, by the triangle inequality,
dist(VI,XJ(T+1)) <= dist(VI,XJ(T)) + dist(XJ(T),XJ(T+1)).

Therefore you can maintain two arrays:

RADIUS(J) = an overestimate of the maximum value of dist(VI,XJ), where VI is in cluster J.

DIST(J,Z) = an underestimate of the minimum value of dist(VI,XZ), where VI is in cluster J.

which you update as follows:

RADIUS(J) := RADIUS(J) + dist(XJ(T),XJ(T+1));
DIST(J,Z) := DIST(J,Z) - dist(XZ(T),XZ(T+1)) - dist(XJ(T),XJ(T+1));
Then, as long as RADIUS(J) < DIST(J,Z), you can be sure that none of the points in cluster J should be moved to cluster Z. (You update these values more exactly whenever a point is moved.)
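
In code, the bookkeeping might look as follows. This is a sketch with hypothetical names: radius and dist_lb are the arrays RADIUS and DIST above, and centroid_shift[j] is dist(XJ(T),XJ(T+1)).

def clusters_to_rescan(k, radius, dist_lb, centroid_shift):
    # Loosen both bounds to absorb this iteration's centroid movement.
    for j in range(k):
        radius[j] += centroid_shift[j]
        for z in range(k):
            if z != j:
                dist_lb[j][z] -= centroid_shift[z] + centroid_shift[j]
    # Points of cluster J need re-examination against center Z only if
    # the (overestimated) radius reaches past the (underestimated) bound.
    return [(j, z) for j in range(k) for z in range(k)
            if z != j and radius[j] >= dist_lb[j][z]]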

Sparse centroids

The centroid of a collection of documents contains a non-zero component corresponding to every term that appears in any of the documents. It is therefore a much less sparse vector than the individual documents, which can incur computational costs. One solution is to limit the number of non-zero terms, or to threshold them. Another is to replace the centroid of a set C by the "most central" point in C. There are a number of ways to define this; for instance, choose the point X in C that minimizes max_{V in C} d(X,V), or the one that minimizes sum_{V in C} d(X,V)^2.

This can be approximated in linear time as follows:

medoid(C,X) {
  U := the point in C furthest from X;
  V := the point in C furthest from U;
  Y := the point in C that minimizes d(Y,U) + d(Y,V) + |d(Y,U)-d(Y,V)|;
  return(Y)
}
Note that d(X,U)+d(X,V) is minimal, and equal to d(U,V), for points X on the line segment between U and V; and among points on that segment, |d(X,U)-d(X,V)| is minimized when X is at the midpoint. [Geraci et al. 2006] call this the "medoid" of C.
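
In Python, with d any distance function, the approximation is a direct transliteration of the pseudocode above:

def medoid(C, X, d):
    U = max(C, key=lambda p: d(p, X))   # furthest from X
    V = max(C, key=lambda p: d(p, U))   # furthest from U
    return min(C, key=lambda p: d(p, U) + d(p, V) + abs(d(p, U) - d(p, V)))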

Furthest Point First (FPF) Algorithm

[Geraci et al., 2006]
{ V := random point in S;
  CENTERS := { V };
  for (U in S) {
    LEADER[U] := V;
    D[U] := d(U,V);
   }
  for (I := 2 to K) {
     X := the point in S-CENTERS with maximal value of D[X];
     C := emptyset;
     for (U in S-CENTERS) 
         if (d(U,X) < D[U]) add U to C;
     X := medoid(C,X); 
     add X to CENTERS;
     for (U in S-CENTERS) {
        if (d(U,X) < D[U]) {
          LEADER[U] := X;
          D[U] := d(U,X)
         }
       }
    }
return(CENTERS)
}

(I think this is right; it is not quite clear to me from the paper when the "medoids" are updated.)

The final value of CENTERS is the set of centers of clusters. For any U in CENTERS, the cluster centered at U is the set of points X for which LEADER[X]=U.
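
Here is a sketch of the whole algorithm in Python, reusing the medoid function above and working with point indices. The first center is chosen arbitrarily rather than at random, and the variable names are mine.

def fpf_cluster(S, k, d):
    # S: list of points; d: distance on points. Returns the indices of the
    # k centers and, for each point, the index of its center.
    n = len(S)
    di = lambda i, j: d(S[i], S[j])          # distance on indices
    centers, leader = [0], [0] * n           # arbitrary first center
    D = [di(i, 0) for i in range(n)]
    for _ in range(k - 1):
        rest = [i for i in range(n) if i not in centers]
        x = max(rest, key=lambda i: D[i])    # furthest point first
        # C = the points the new center would capture.
        C = [i for i in rest if di(i, x) < D[i]]
        if C:
            x = medoid(C, x, di)             # recenter on the medoid of C
        centers.append(x)
        for i in rest:
            if di(i, x) < D[i]:
                leader[i] = x
                D[i] = di(i, x)
    return centers, leader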

Features:
1. Time=O(nk).
2. Uses only sparse vectors
3. The "maximal radius measure" maxU in S minX in CENTERSd(U,X) is within a factor of 2 of optimal.

Agglomerative Hierarchical Clustering Technique

{ put every point in a cluster by itself
  for I := 1 to N-1 do {
     let C1,C2 be the most mergeable pair of clusters;
     create C parent of C1, C2
  } }
Various measures of "mergeable" used.
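
A deliberately naive sketch in Python, using single-link distance (closest pair of points) as the measure of "mergeable"; complete-link or average-link are obtained by replacing min with max or a mean. It rescores all pairs on every merge, so it is far from an efficient implementation.

def agglomerate(S, d):
    # Each cluster is a triple: (set of point indices, left child, right child).
    clusters = [({i}, None, None) for i in range(len(S))]
    def single_link(a, b):                   # distance between two clusters
        return min(d(S[i], S[j]) for i in a[0] for j in b[0])
    while len(clusters) > 1:
        # The most mergeable pair = the closest pair of clusters.
        a, b = min(((a, b) for x, a in enumerate(clusters)
                    for b in clusters[x + 1:]),
                   key=lambda pair: single_link(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append((a[0] | b[0], a, b)) # create C parent of C1, C2
    return clusters[0]                       # root of the binary hierarchy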

Characteristics

Conflicting claims about quality of agglomerative vs. K-means.

One-pass Clustering

pick a starting point D in S;
CC := { { D } };  /* set of clusters: initially one cluster containing D */
for Di in S - {D} do {
    C := the cluster in CC "closest" to Di;
    if similarity(C,Di) > threshold
      then add Di to C;
      else add { Di } to CC;
}
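
A sketch in Python, taking similarity(C,Di) to be cosine similarity between Di and the centroid of C; that choice, and the assumption that the vectors are already normalized to unit length, are mine.

import numpy as np

def one_pass(S, threshold):
    # S: list of unit-normalized vectors.
    clusters = [[S[0]]]
    for x in S[1:]:
        # The "closest" cluster: highest cosine similarity to the centroid.
        sims = [np.dot(x, np.mean(c, axis=0)) for c in clusters]
        best = int(np.argmax(sims))
        if sims[best] > threshold:
            clusters[best].append(x)
        else:
            clusters.append([x])
    return clusters
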
Features:

Mixture Models Clustering

Assume that the data is generated by a weighted combination of parameterized random processes, and find the weights and parameter values that best fit the data. In particular, a cluster is a distribution (e.g. a Gaussian) around a center.

Features:
1. "Soft" assignment to clusters. That is, if a point U fits two models M1, M2 fairly well, then it has a reasonable probability of being in either. Viewed the other way, U provides some degree of "evidentiary support" to both M1 and M2.

2. Allows overlapping models of different degrees of precision. In particular, there can be narrow-diameter models against a broad background.

Applying this to documents requires a stochastic model of document generation. Even for quite simple models, the statistics become hairy. See Chakrabarti.
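
To make the "soft assignment" concrete, here is a minimal EM fit of a two-Gaussian mixture in one dimension. It is a toy sketch of the general idea, not a document model.

import numpy as np

def em_two_gaussians(x, iters=50):
    # Initialize: equal weights, means at the extremes, pooled spread.
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    sigma = np.array([x.std(), x.std()])
    for _ in range(iters):
        # E-step: soft assignment -- each point supports both models,
        # in proportion to how well each model explains it.
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: refit each model with points weighted by responsibility.
        n = r.sum(axis=0)
        w = n / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n)
    return w, mu, sigma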

STC (Suffix Tree Clustering) algorithm

Step 1: Construct a suffix tree. Suffix tree: let S be a set of strings. (In our case, each element of S is a sentence, viewed as a string of words.) The suffix tree is a compact tree containing all suffixes of the strings in S.

Example: S = { "cat ate cheese", "mouse ate cheese too", "cat ate mouse too" }

The suffix tree can be constructed in linear time.

Step 2: Score nodes. For node N in suffix tree, let D(N) = set of documents in subtree of N. Let P(N) be the phrase labelling N. Define score of N, s(N) = |D(N)| * f(|P(N)|). f(1) is small; f(K) = K for K = 2 ... 6; f(K) = 6 for K > 6.
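
A toy stand-in for Steps 1 and 2: a dictionary from phrases to documents, built quadratically rather than with a linear-time suffix tree, keeping phrases shared by at least two documents and scoring them as above. The value f(1) = 0.5 is my stand-in for "small".

from collections import defaultdict

def base_clusters(docs):
    # docs: a list of sentences, each a list of words.
    phrases = defaultdict(set)               # phrase -> set of doc indices
    for i, doc in enumerate(docs):
        for a in range(len(doc)):
            for b in range(a + 1, len(doc) + 1):
                phrases[tuple(doc[a:b])].add(i)
    def f(k):                                # phrase-length factor from above
        return 0.5 if k == 1 else min(k, 6)
    return sorted(((len(D) * f(len(P)), P, sorted(D))
                   for P, D in phrases.items() if len(D) >= 2),
                  reverse=True)

docs = [s.split() for s in ("cat ate cheese",
                            "mouse ate cheese too",
                            "cat ate mouse too")]
print(base_clusters(docs)[:4])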

Step 3: Find clusters.

A. Construct an undirected graph whose vertices are nodes of the suffix tree. There is an arc from N1 to N2 if the following conditions hold:

B. Each connected component of this graph is a cluster. Score of cluster computed from scores of nodes, overlap function. Top 10 clusters returned. Cluster can be described using phrases of its ST nodes.

Example

Query "salsa" submitted to MetaCrawler (consults with several search engines and combines answer.) Returns 246 documents in 15 clusters, of which the top are

Features

Clustering using query log

(Beeferman and Berger, 2000.) The log records each query and the links that were clicked through.
Create bipartite graph where query terms connect to links.
Cluster of pages = connected component.
Also gives cluster of query terms -- useful to suggest alternative queries to user.
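
A sketch of the connected-component construction in Python; the input format (a list of query-URL click pairs) is my assumption.

from collections import defaultdict

def click_components(clicks):
    # clicks: list of (query, url) pairs from the log.
    graph = defaultdict(set)
    for q, u in clicks:
        graph[('query', q)].add(('url', u))
        graph[('url', u)].add(('query', q))
    seen, components = set(), []
    for node in graph:
        if node in seen:
            continue
        # Depth-first search for the connected component containing node.
        comp, stack = [], [node]
        seen.add(node)
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in graph[v] - seen:
                seen.add(w)
                stack.append(w)
        # The urls in comp form a page cluster; the queries, a query cluster.
        components.append(comp)
    return components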

Known clusters

You have a known system of clusters with examples (e.g. Yahoo). The problem is to place a new page into the proper cluster(s).

In machine learning, this is known as classification rather than clustering. We will discuss this later.

Evaluating Clusters

Formal measures: Normalize all vectors to length 1. Assume fixed number of clusters.

Variable number of clusters: Any of the above + suitable penalty for more clusters. Note: Without penalty, these are all optimized if each point is a cluster by itself.

Formal measures test the adequacy of the clustering algorithm, but not whether the measures themselves reflect the actual significance of the clusters.

Comparison to Gold Standard
Gold standard can be either an existing clustering system (e.g. Open Directory) or user tests.

Various statistical and information-theoretic measures.