Lecture 5: Clustering

Required Reading

MR&S, chaps. 16 and 17.

Further Reading

Chakrabarti sec. 4.4.
A Survey of Web clustering engines, Claudio Carpineto et al. ACM Computing Surveys, 41:3 2009.
Enhancing cluster labelling using Wikipedia, David Carmel, Haggai Roitman, Naama Zwerdling, SIGIR '09.
Grouper: A Dynamic Clustering Interface to Web Search Results, Oren Zamir and Oren Etzioni.
A Comparison of Document Clustering Techniques, Michael Steinbach, George Karypis, and Vipin Kumar.
Cluster Generation and Cluster Labelling for Web Snippets: A Fast and Accurate Hierarchical Solution, Filippo Geraci, Marco Pellegrini, Marco Maggini, and Fabrizio Sebastiani.
Evaluating Hierarchical Clustering of Search Results, J. Cigarran et al., SPIRE 2005, LNCS 3772.

Probabilistic definition of precision/recall

Pick a document d at random from the collection. Let Q be the event that d is retrieved in response to a given query and let R be the event that d is relevant.

Precision = Prob(R|Q).
Recall = Prob(Q|R).
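With d drawn uniformly at random, these conditional probabilities reduce to the familiar set ratios. A toy illustration (the document ids are made up):

```python
# Toy example: with d drawn uniformly at random from the collection,
# Prob(R|Q) and Prob(Q|R) reduce to the usual set ratios.
retrieved = {1, 2, 3, 4, 5}        # event Q: d is retrieved
relevant  = {3, 4, 5, 6, 7, 8}     # event R: d is relevant

both = retrieved & relevant
precision = len(both) / len(retrieved)   # Prob(R|Q)
recall    = len(both) / len(relevant)    # Prob(Q|R)

print(precision, recall)  # 0.6 0.5
```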

Clustering: Introduction

The Yippy search engine organizes its responses by topic. Example:

In 2011, the query "Jaguar" generated the following top-level clusters: "Parts", "Pictures", "Vehicles in the greater", "Panthera onca", "History", "Cars for Sale", "Land Rover", "Club", "Animal", "Sports cars", etc., all related either to the car or the cat. "Jaguars" added "Jacksonville" and "Houston".

The query "Hepburn" gives the top-level clusters "Katharine", "Audrey Hepburn", "Photos" [Audrey and Katharine about equal, some random], "TV", "Posters", "Katherine", "History", "Definition", "Holiday, Eleuthera", and "Review, Allwatchers".


Two uses of clustering:

1. Improve presentation of search results. The user can find relevant pages faster.

2. Internal to search engine: For a search engine that returns a ranked list, you want a distribution of topics. E.g. a search on "CMU" should and does return Central Michigan U and Central Methodist U in addition to Carnegie Mellon, even though the Carnegie pages rank higher.


Why do the web pages returned for a query cluster?

The dimensions along which pages cluster can be orthogonal, making the job that much harder. E.g. under "Hesse" you have articles on the German state and on the author Hermann Hesse, and in each category you have articles in English and in German.

Cluster hypothesis: Documents in the same cluster behave similarly with respect to relevance for information needs.

General issues

Clustering algorithms

K-means algorithm

K-means-cluster(in S: set of vectors; k: integer)
{  let C[1] ... C[k] be a random partition of S into k parts;
   repeat {
            for i := 1 to k {
               X[i] := centroid of C[i];
               C[i] := empty
            }
            for j := 1 to N {
               q := the index i for which X[i] is closest to S[j];
               add S[j] to C[q]
            }
   } until the change to C (or the change to X) is small enough.
}
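A minimal Python sketch of the algorithm on 2-D points, with made-up data (starting, as in the example below, from random points rather than a random partition):

```python
import random

def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def dist2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def k_means(S, k, iters=100):
    X = random.sample(S, k)                  # start from k random points
    for _ in range(iters):
        C = [[] for _ in range(k)]
        for p in S:                          # assign each point to the nearest centroid
            q = min(range(k), key=lambda i: dist2(p, X[i]))
            C[q].append(p)
        newX = [centroid(c) if c else X[i] for i, c in enumerate(C)]
        if newX == X:                        # centroids unchanged: converged
            break
        X = newX
    return C, X

random.seed(0)
S = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, centers = k_means(S, 2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```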

Example: (Note: I have started here with random points rather than a random partition.)



Variant 1: Update centroids incrementally. Claims to give better results.

Variant 2: Bisecting K-means algorithm:

for I := 1 to K-1 do {
    pick a leaf cluster C to split;
    for J := 1 to ITER do
        use 2-means clustering to split C into two subclusters C1 and C2;
    choose the best of the above splits and make it permanent;
}
Generates a binary clustering hierarchy. Works better (so claimed) than K-means.

Variant 3: Allow overlapping clusters: If distances from P to C1 and to C2 are close enough then put P in both clusters.

Efficient implementation of K-means clustering

As we wrote the algorithm before, on each iteration you have to compute the distance from every point to every centroid. If there are many points, and K is reasonably large, and the division into clusters has become fairly stable, then the update procedure can be made more efficient as follows:

First, the computation of the new centroid requires only the points added to the cluster, the points taken out of the cluster, and the number of points in the cluster.

Second, computing the change to the cluster assignments caused by moving the centroids can be made more efficient as follows:

Let XJ(T) be the centroid of the Jth cluster on the Tth iteration, and let VI be the Ith point. Note that
dist(VI,XJ(T+1)) <= dist(VI,XJ(T)) + dist(XJ(T),XJ(T+1)).

Therefore you can maintain two arrays:

RADIUS(J) = an overestimate of the maximum value of dist(VI,XJ), where VI is in cluster J.

DIST(J,Z) = an underestimate of the minimum value of dist(VI,XZ), where VI is in cluster J.

which you update as follows:

RADIUS(J) := RADIUS(J) + dist(XJ(T),XJ(T+1));
DIST(J,Z) := DIST(J,Z) - dist(XZ(T),XZ(T+1)) - dist(XJ(T),XJ(T+1));
then as long as RADIUS(J) < DIST(J,Z) you can be sure that none of the points in cluster J should be moved to cluster Z. (You update these values more exactly whenever a point is moved.)
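A small numeric check of the bound maintenance (all the centroid positions and bound values below are made up):

```python
import math

def d(p, q):
    return math.dist(p, q)

# Hypothetical state: cluster J's centroid moved from old_xj to new_xj,
# cluster Z's centroid moved from old_xz to new_xz.
old_xj, new_xj = (0.0, 0.0), (0.2, 0.0)
old_xz, new_xz = (10.0, 0.0), (9.9, 0.0)

radius_j = 2.0        # overestimate of max dist from points of J to X[J]
dist_jz  = 7.0        # underestimate of min dist from points of J to X[Z]

# Update the bounds using the triangle inequality:
radius_j += d(old_xj, new_xj)                       # 2.0 + 0.2
dist_jz  -= d(old_xz, new_xz) + d(old_xj, new_xj)   # 7.0 - 0.1 - 0.2

# The (overestimated) radius is still below the (underestimated)
# separation, so no point of cluster J can be closer to Z's centroid:
print(radius_j < dist_jz)  # True
```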

Sparse centroids

The centroid of a collection of documents contains a non-zero component for every term that appears in any of the documents. It is therefore a much less sparse vector than the individual documents, which can incur computational costs. One solution is to limit the number of non-zero terms, or to threshold them. Another is to replace the centroid of a set C by the "most central" point in C. There are a number of ways to define this; for instance, choose the point X in C that minimizes max_{V in C} d(X,V), or the X that minimizes sum_{V in C} d(X,V)^2.

This can be approximated in linear time as follows:

medoid(C,X) {
  U := the point in C furthest from X;
  V := the point in C furthest from U;
  Y := the point in C that minimizes d(Y,U) + d(Y,V) + |d(Y,U)-d(Y,V)|;
  return Y;
}
Note that d(X,U)+d(X,V) is minimal, and equal to d(U,V) for points X on the line between U and V, and for points on that line |d(X,U)-d(X,V)| is minimal when X is at the center of the line. [Geraci et al. 2006] call this the "medoid" of C.
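A direct Python transcription of this approximation, run on made-up 2-D points:

```python
import math

def d(p, q):
    return math.dist(p, q)

def medoid(C, X):
    """Linear-time approximation of the most central point of C."""
    U = max(C, key=lambda p: d(p, X))   # furthest point in C from X
    V = max(C, key=lambda p: d(p, U))   # furthest point in C from U
    # Prefer points near the U-V line (small d(Y,U)+d(Y,V)) and
    # near its midpoint (small |d(Y,U)-d(Y,V)|).
    return min(C, key=lambda y: d(y, U) + d(y, V) + abs(d(y, U) - d(y, V)))

C = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (2, 1)]
print(medoid(C, (0, 0)))  # (2, 0)
```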

Finding the right value of K

Let S be a system of clusters.
Let S[d] be the cluster containing document d.
Let m(C) be the centroid of cluster C.
For a system of clusters S, let RSS(S) = sum over d of |d - m(S[d])|^2.

For K=1,2, ... compute SK = best system with K clusters.
Plot RSS(SK) vs. K (see MR&S figure 16.8 p. 337)
Look for a "knee" in plot. Choose that as value of K.
Question to which I don't know the answer: Is this curve always concave upward?
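A made-up 1-D illustration of the knee: RSS drops sharply up to the "right" K (here 2) and only slowly after it. For each K the partition below is the visibly best one, rather than the output of K-means:

```python
# RSS for hand-picked best partitions of 1-D data with two obvious groups.
def rss(clusters):
    total = 0.0
    for c in clusters:
        m = sum(c) / len(c)                  # centroid of the cluster
        total += sum((x - m) ** 2 for x in c)
    return total

data = [0, 1, 2, 10, 11, 12]
best = {
    1: [data],
    2: [[0, 1, 2], [10, 11, 12]],
    3: [[0, 1], [2], [10, 11, 12]],
    4: [[0, 1], [2], [10, 11], [12]],
}
for k in sorted(best):
    print(k, rss(best[k]))   # knee at K=2: 154.0, 4.0, 2.5, 1.0
```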

Furthest Point First (FPF) Algorithm

[Geraci et al., 2006]
{ V := random point in S;
  CENTERS := { V };
  for (U in S) {
     LEADER[U] := V;
     D[U] := d(U,V);
  }
  for (I := 2 to K) {
     X := the point in S-CENTERS with maximal value of D[X];
     C := emptyset;
     for (U in S-CENTERS)
         if (d(U,X) < D[U]) add U to C;
     X := medoid(C,X);
     add X to CENTERS;
     for (U in S-CENTERS) {
        if (d(U,X) < D[U]) {
          LEADER[U] := X;
          D[U] := d(U,X);
        }
     }
  }
}
(I think this is right; it is not quite clear to me from the paper when the "medoids" are updated.)

The final value of CENTERS is the set of centers of clusters. For any U in CENTERS, the cluster centered at U is the set of points X for which LEADER[X]=U.
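A Python sketch of FPF on made-up 2-D points, with a deterministic starting point and without the medoid refinement step (whose exact placement the note above flags as unclear):

```python
import math

def d(p, q):
    return math.dist(p, q)

def fpf(S, k):
    v = S[0]                                  # deterministic start (random in the paper)
    centers = [v]
    leader = {p: v for p in S}
    D = {p: d(p, v) for p in S}
    for _ in range(k - 1):
        # next center: the point furthest from its current leader
        x = max((p for p in S if p not in centers), key=lambda p: D[p])
        centers.append(x)
        for u in S:                           # reassign points now closer to x
            if u not in centers and d(u, x) < D[u]:
                leader[u] = x
                D[u] = d(u, x)
    return centers, leader

S = [(0, 0), (1, 0), (0, 1), (10, 10), (5, 5)]
centers, leader = fpf(S, 3)
print(centers)  # [(0, 0), (10, 10), (5, 5)]
```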

1. Time=O(nk).
2. Uses only sparse vectors
3. The "maximal radius measure" max_{U in S} min_{X in CENTERS} d(U,X) is within a factor of 2 of optimal.

Agglomerative Hierarchical Clustering Technique

{ put every point in a cluster by itself
  for I := 1 to N-1 do {
     let C1,C2 be the most mergeable pair of clusters;
     create C parent of C1, C2
  } }
Various measures of "mergeable" are used. Choose C1 != C2 so as to maximize, e.g., the maximum similarity between any document in C1 and any document in C2 (single-link), the minimum such similarity (complete-link), or the average similarity over all pairs (group-average).


Conflicting claims about quality of agglomerative vs. K-means.
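An illustrative (and deliberately naive, cubic-time) Python sketch of the algorithm, using single-link distance as the mergeability measure on made-up points:

```python
import math

def d(p, q):
    return math.dist(p, q)

def single_link(c1, c2):
    # single-link: distance between the closest pair across the two clusters
    return min(d(p, q) for p in c1 for q in c2)

def agglomerate(S):
    clusters = [[p] for p in S]
    history = []                              # record of each merge (the hierarchy)
    while len(clusters) > 1:
        i, j = min(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]))
        merged = clusters[i] + clusters[j]
        history.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

history = agglomerate([(0, 0), (0, 1), (5, 5), (5, 6)])
for step in history:
    print(sorted(step))
```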

One-pass Clustering

{ pick a starting point D in S;
  CC := { { D } };   /* set of clusters: initially one cluster containing D */
  for Di in S do {
      C := the cluster in CC "closest" to Di;
      if similarity(C,Di) > threshold
        then add Di to C;
        else add { Di } to CC;
  }
}
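A Python sketch on made-up 2-D points, taking "close enough" to mean distance to the cluster centroid below a threshold (the threshold value is also made up; any similarity measure could be plugged in):

```python
import math

def centroid(c):
    return (sum(q[0] for q in c) / len(c), sum(q[1] for q in c) / len(c))

def one_pass(S, threshold):
    clusters = [[S[0]]]                       # start with one cluster
    for p in S[1:]:
        best = min(clusters, key=lambda c: math.dist(p, centroid(c)))
        if math.dist(p, centroid(best)) <= threshold:   # "similar enough"
            best.append(p)
        else:
            clusters.append([p])              # open a new cluster
    return clusters

S = [(0, 0), (0.5, 0), (10, 10), (10.5, 10), (0, 0.5)]
print(len(one_pass(S, 2.0)))  # 2
```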

Mixture Models Clustering

Assert that data is generated by a weighted combination of parameterized random processes. Find the weights, parameter values that best fit the data. In particular a cluster is a distribution (e.g. Gaussian) around a center.

1. "Soft" assignment to clusters. That is, if a point U fits two models M1, M2 fairly well, then it has a reasonable probability of being in either. Viewed the other way, U provides some degree of "evidentiary support" to both M1 and M2.

2. Allows overlapping models of different degrees of precision. In particular there can be particular narrow-diameter models against a broad background.

Applying this to documents requires a stochastic model of document generation. Even for quite simple documents, the statistics become quite hairy. See Chakrabarti.
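As a hint of how the fitting works, here is a minimal EM sketch for a mixture of two 1-D Gaussians with fixed unit variance, on made-up data; only the means and mixture weights are fitted, and real document models are far more involved:

```python
import math

def pdf(x, mu):                      # unit-variance Gaussian density
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

data = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
mu = [2.0, 8.0]                      # initial guesses for the two centers
w = [0.5, 0.5]                       # mixture weights
for _ in range(50):
    # E-step: soft ("evidentiary") assignment of each point to each component
    r = [[w[k] * pdf(x, mu[k]) for k in range(2)] for x in data]
    r = [[a / sum(row) for a in row] for row in r]
    # M-step: re-estimate weights and means from the soft counts
    for k in range(2):
        nk = sum(row[k] for row in r)
        w[k] = nk / len(data)
        mu[k] = sum(row[k] * x for row, x in zip(r, data)) / nk

print([round(m, 2) for m in mu])  # [0.5, 9.5]
```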

STC (Suffix Tree Clustering) algorithm

Step 1: Construct the suffix tree. Suffix tree: let S be a set of strings (in our case, each element of S is a sentence, viewed as a string of words). The suffix tree is a compact tree containing all suffixes of the strings in S.

Example: S = { "cat ate cheese", "mouse ate cheese too", "cat ate mouse too" }

Suffix tree can be constructed in linear time.

Step 2: Score nodes. For node N in suffix tree, let D(N) = set of documents in subtree of N. Let P(N) be the phrase labelling N. Define score of N, s(N) = |D(N)| * f(|P(N)|). f(1) is small; f(K) = K for K = 2 ... 6; f(K) = 6 for K > 6.
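The scoring function as code. The notes only say that f(1) "is small"; the value 0.5 below is an assumption, not from the paper:

```python
# Sketch of the STC node score s(N) = |D(N)| * f(|P(N)|).
def f(k):
    if k <= 1:
        return 0.5                   # "f(1) is small" -- exact value assumed here
    return min(k, 6)                 # f(k) = k for k = 2..6, capped at 6 beyond

def score(num_docs, phrase_len):
    return num_docs * f(phrase_len)

print(score(10, 1))  # 5.0 (single-word phrases contribute little)
print(score(10, 3))  # 30
print(score(10, 9))  # 60 (capped)
```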

Step 3: Find clusters.

A. Construct an undirected graph whose vertices are nodes of the suffix tree. There is an arc from N1 to N2 if the following conditions hold: |D(N1) ∩ D(N2)| / |D(N1)| > 0.5 and |D(N1) ∩ D(N2)| / |D(N2)| > 0.5.

B. Each connected component of this graph is a cluster. Score of cluster computed from scores of nodes, overlap function. Top 10 clusters returned. Cluster can be described using phrases of its ST nodes.


Query "salsa" submitted to MetaCrawler (a meta-search engine that consults several search engines and combines the answers). Returns 246 documents in 15 clusters, of which the top are:


Clustering using query log

(Beeferman and Berger, 2000) Log records query, links that were clicked through.
Create bipartite graph where query terms connect to links.
Cluster of pages = connected component.
Also gives cluster of query terms -- useful to suggest alternative queries to user.
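A sketch of the construction on a made-up click log: queries and clicked URLs form a bipartite graph, and each connected component yields both a page cluster and a query cluster:

```python
from collections import defaultdict

# Made-up click log: (query, clicked URL) pairs.
log = [("jaguar car", "jaguar.com"),
       ("jaguar price", "jaguar.com"),
       ("big cats", "zoo.org/jaguar"),
       ("panthera", "zoo.org/jaguar")]

adj = defaultdict(set)               # bipartite graph: queries <-> URLs
for q, url in log:
    adj[("q", q)].add(("u", url))
    adj[("u", url)].add(("q", q))

seen, components = set(), []
for node in list(adj):
    if node in seen:
        continue
    comp, stack = set(), [node]      # depth-first search for the component
    while stack:
        n = stack.pop()
        if n in comp:
            continue
        comp.add(n)
        stack.extend(adj[n])
    seen |= comp
    components.append(comp)

print(len(components))  # 2 components: car pages/queries, animal pages/queries
```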

Evaluating Clusters

Overall internal measures

Measure quality of the algorithm and structure of the point set, not the reasonableness of the measure.

Variable number of clusters: Any of the above + suitable penalty for more clusters. Note: Without penalty, these are all optimized if each point is a cluster by itself.

Minimum description length. Incorporates penalty for large K and works for hierarchy.

Formal measures test adequacy of clustering algorithm, but not relevance of measures to actual significance.

Relative to an informational need

Compare the cost (i.e. user time) of browsing a clustered structure for relevant documents to cost of browsing the list. A cluster is relevant if it contains a relevant document. Assume that the labels on the clusters allow the user to judge which clusters are relevant. A cluster appears if it is a sibling of a relevant cluster. A document appears if it is a sibling of a relevant document or cluster. Charge A for each cluster that appears and B for each document that appears. Overall generalized precision is
(A * (number of clusters that appear) + B * (number of docs that appear)) / (B * (number of relevant docs retrieved)).
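Plugging made-up counts into this formula (A and B are the per-cluster and per-document browsing charges; all numbers below are invented for illustration):

```python
A, B = 2.0, 1.0                 # charge per cluster, per document that appears
clusters_appearing = 4
docs_appearing = 20
relevant_retrieved = 10

gp = (A * clusters_appearing + B * docs_appearing) / (B * relevant_retrieved)
print(gp)  # 2.8
```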

Problem: If one document is put into an irrelevant cluster, then that can have a large cost in precision, because now all of that cluster's sibling documents and clusters come along. It would be more reasonable to incur a small cost in recall. Another problem: clusters are given in ranked order (usually by number of docs, but still).

Alternative solutions that give a precision/recall curve:

1. Assume that whenever a list of clusters appears, the user goes through clusters that appear in order, going down each relevant cluster. Plot the amount of effort spent by the time he reaches the Kth relevant document.

2. Assume instead that the labels on the clusters are good enough that the user can carry out an optimal depth-first search, relative to, say, overall weighted precision; that is, the user can judge which clusters do and do not contain relevant documents. If weights are geometrically decreasing, this is easily computed; if not, determining the optimal search is probably NP-hard.

Comparison to Gold Standard

You have a "gold standard" set of documents with clusters.

Assigning labels to clusters

Statistics based

Choose most frequent / most highly weighted / most discriminative terms in each cluster. Often gives highly unsuggestive results.

Best document based

Find document in the cluster closest to the centroid; use the title, or the snippet for the query. Very hit and miss.

Wikipedia based

Find most similar Wikipedia article. Use title.

Evaluating label assignments

Criteria: 1. Run label assignment algorithm over gold standard clustering. Accept if assigned label is marked as a synonym for true label in WordNet.

2. Human evaluation, of the kinds discussed in lecture 4.

Aside on WordNet

Two online data collections are increasingly used in all kinds of ways in IR and particularly in web mining. One is Wikipedia. The other is WordNet. If you are at all serious about IR or web mining, you should spend some time checking them out.