Lecture 5: Clustering
Required Reading
MR&S, chaps. 16 and 17.
Further Reading
Chakrabarti sec. 4.4.
A Survey of Web clustering engines, Claudio Carpineto et al.
ACM Computing Surveys, 41:3 2009.
Enhancing cluster labelling using Wikipedia,
David Carmel, Haggai Roitman, Naama Zwerdling, SIGIR '09.
Grouper: A Dynamic Clustering Interface to Web Search Results
Oren Zamir and Oren Etzioni
A Comparison of Document Clustering Techniques by
Michael Steinbach, George Karypis, and Vipin Kumar
Cluster Generation and Cluster Labelling for Web Snippets: A Fast and
Accurate Hierarchical Solution
Filippo Geraci, Marco Pellegrini, Marco Maggini, and Fabrizio Sebastiani.
Evaluating Hierarchical Clustering of Search Results
J. Cigarran et al., SPIRE 2005, LNCS 3772.
Probabilistic definition of precision/recall
Pick a document d at random from the collection. Let Q be the event that
d is retrieved in response to a given query and let R be the event that
d is relevant.
Precision = Prob(R|Q).
Recall = Prob(Q|R).
Clustering: Introduction
The Yippy search
engine organizes its responses by topic. Example:
In 2011,
the query "Jaguar" generated the following top-level clusters:
"Parts", "Pictures", "Vehicles in the greater", "Panthera onca", "History",
"Cars for Sale", "Land Rover", "Club", Animal", "Sports cars" etc.
all related either to the car or the cat. "Jaguars" added "Jacksonville"
and "Houston".
The query "Hepburn" gives the top-level clusters "Katharine", "Audrey
Hepburn", "Photos"
[Audrey and Katharine about equal, some random], "TV", "Posters", "Katherine",
"History", "Definition", "Holiday, Eleuthera", and "Review, Allwatchers".
Motivation
1. Improve presentation of search results. User can find relevant pages faster.
2. Internal to search engine: For a search engine that returns a ranked list,
you want a distribution of topics. E.g. search on "CMU" should and does return
Central Michigan U and Central Methodist U in addition to Carnegie Mellon,
even though the Carnegie pages rank higher.
Outline
- Why do web pages returned for a query tend to cluster?
- Motivation for computing clusters. Just discuss effective search
results page.
- General issues
- Clustering algorithms
- Cluster evaluation metrics
- Cluster labelling algorithms
- Label evaluation metrics
Why do the web pages returned for a query cluster?
- Multiple languages.
- Polysemy (lexical ambiguity): A word has multiple meanings (jaguar,
Hepburn). The criterion for separating is usually clear --- that is,
human evaluators will have a large degree of agreement. Actually carrying out
the separation may be difficult (e.g. "President Bush").
- Different aspects of the same topic. E.g. for a query on a product,
there are vendors, reviews, manuals, explanations, etc. Here the
division is less clear-cut, and there will be much less agreement among
human evaluators.
These can be orthogonal, making the job that much harder. E.g.
under "Hesse" you have articles on the German state and the author Hermann
Hesse and on each you have articles in English and in German.
Cluster hypothesis: Documents in the same cluster behave similarly
with respect to relevance for information needs.
General issues
- Structure collection offline vs. structure query results online.
- Cluster based on document vs. cluster based on snippet.
- Cluster structure:
- Flat disjoint
- Flat overlapping
- Tree: All documents at leaves.
- Tree: Documents both at leaves and at internal nodes (if more general).
- DAG
- (Orthogonally) multiple dimensions. E.g. language vs. meaning vs. usage.
- Exhaustive vs. non-exhaustive.
Note the difference between a geometric point that is not part of any cluster
(an unclustered point) and a document that is not part of any cluster
(general subject matter). The true analogy is perhaps: document = region.
- Outliers: Singleton clusters vs. include in nearest cluster vs.
"Other".
- How many clusters? How large?
- Information source: text, link analysis, usage. (Not mutually
exclusive, of course).
Clustering algorithms
- Decompositional (top-down)
- Agglomerative (bottom-up)
- Mixture model (statistical)
K-means algorithm
K-means-cluster (in S : set of vectors; k : integer)
{ let C[1] ... C[k] be a random partition of S into k parts;
repeat {
for i := 1 to k {
X[i] := centroid of C[i];
C[i] := empty
}
for j := 1 to N {
q := the index i for which X[i] is closest to S[j];
add S[j] to C[q]
} }
until the change to C (or the change to X) is small enough.
}
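The loop above can be sketched in Python with numpy. This is a minimal
illustration, not an efficient implementation, and the two-blob data at the
bottom is made up:

```python
import numpy as np

def k_means(points, k, iters=100, seed=0):
    """Sketch of the K-means loop above; points is an (n, d) array."""
    rng = np.random.default_rng(seed)
    # Start from k random data points as centroids (as in the example below,
    # rather than from a random partition).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to the closest centroid ...
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :],
                               axis=2)
        labels = dists.argmin(axis=1)
        # ... then move each centroid to the mean of its cluster.
        new = np.array([points[labels == i].mean(axis=0)
                        if np.any(labels == i) else centroids[i]
                        for i in range(k)])
        done = np.allclose(new, centroids)   # "change small enough"
        centroids = new
        if done:
            break
    return labels, centroids

# Two well-separated blobs:
pts = np.array([[0., 0.], [0., 1.], [1., 0.],
                [10., 10.], [10., 11.], [11., 10.]])
labels, centroids = k_means(pts, 2)
```

On this toy data the algorithm recovers the two blobs from any choice of
initial points.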
Example: (Note: I have started here with random points rather than
a random partition.)
Problems:
- Have to guess K.
- Local minimum. Example: In the diagram below, if K=2 and you start
with centroids B and E, the algorithm converges on the two clusters
{A,B,C}, {D,E,F}.
- Disjoint and exhaustive decomposition.
- Starvation: a cluster C[j] may be starved completely (left empty), or
starved down to a single outlier.
- Assumes that clusters are spherical in vector space. Hence
particularly sensitive to coordinate changes (e.g. changes in weighting)
- Worst-case running time is superpolynomial. At least c^{sqrt(n)}
for some constant c>1 (see
On the Worst Case Complexity of the k-means Method by David
Arthur and Sergei Vassilvitskii, 2005.) However, (a) in practice it generally
converges quickly; (b) if there are dubious points near the boundary between
clusters, then it is not very important to get their assignment exactly right.
Variant 1: Update centroids incrementally. Claims to give better results.
Variant 2: Bisecting K-means algorithm:
for I := 1 to K-1 do {
pick a leaf cluster C to split;
for (J := 1 to ITER)
use 2-Means clustering to split C into two subclusters C1 and C2;
choose the best of the above splits and make it permanent
}
Generates a binary clustering hierarchy. Works better (so claimed) than
K-means.
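A sketch of the bisecting variant, under the assumption (left open in the
pseudocode above) that the leaf with the largest scatter about its centroid is
the one picked to split; the toy data is made up:

```python
import numpy as np

def two_means(points, seed):
    """One run of 2-means; returns (SSE, mask), mask selecting subcluster 1."""
    rng = np.random.default_rng(seed)
    cents = points[rng.choice(len(points), size=2, replace=False)]
    for _ in range(50):
        lab = np.linalg.norm(points[:, None] - cents[None],
                             axis=2).argmin(axis=1)
        new = np.array([points[lab == i].mean(axis=0)
                        if np.any(lab == i) else cents[i] for i in range(2)])
        done = np.allclose(new, cents)
        cents = new
        if done:
            break
    sse = float(((points - cents[lab]) ** 2).sum())
    return sse, lab == 1

def bisecting_k_means(points, k, trials=5):
    def scatter(idx):        # sum of squared distances to the leaf centroid
        return float(((points[idx] - points[idx].mean(axis=0)) ** 2).sum())
    leaves = [np.arange(len(points))]
    for _ in range(k - 1):
        # Pick the leaf with the largest scatter, try 'trials' random
        # 2-means splits, and make the best one permanent.
        i = max(range(len(leaves)), key=lambda j: scatter(leaves[j]))
        idx = leaves.pop(i)
        _, mask = min((two_means(points[idx], s) for s in range(trials)),
                      key=lambda t: t[0])
        leaves += [idx[mask], idx[~mask]]
    return leaves

pts = np.array([[0., 0.], [0., 1.], [1., 0.],
                [10., 10.], [10., 11.], [11., 10.]])
leaves = bisecting_k_means(pts, 2)
```

Recording each split, instead of just the leaves, yields the binary
clustering hierarchy mentioned above.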
Variant 3: Allow overlapping clusters: If distances from P to C1 and
to C2 are close enough then put P in both clusters.
Efficient implementation of K-means clustering
As we wrote the algorithm before, on each iteration you have to compute
the distance from every point to every centroid.
If there are many points, and K is reasonably large, and the division into
clusters has become fairly stable, then the update procedure can be
made more efficient as follows:
First, the computation of the new centroid requires only the points
added to the cluster, the points taken out of the cluster, and the
number of points in the cluster.
Second, computing the change to the cluster assignments caused by moving the
centroids can be made more efficient as follows:
Let X_{J}(T) be the centroid of the Jth cluster on the Tth iteration,
and let V_{I} be the Ith point. Note that
dist(V_{I},X_{J}(T+1)) <=
dist(V_{I},X_{J}(T)) +
dist(X_{J}(T),X_{J}(T+1)).
Therefore you can maintain two arrays:
RADIUS(J) = an overestimate of the maximum value of
dist(V_{I},X_{J}), where V_{I} is in cluster J.
DIST(J,Z) = an underestimate of the minimum value of
dist(V_{I},X_{Z}), where V_{I} is in cluster J.
which you update as follows:
RADIUS(J) := RADIUS(J) + dist(X_{J}(T),X_{J}(T+1));
DIST(J,Z) := DIST(J,Z) - dist(X_{Z}(T),X_{Z}(T+1))
- dist(X_{J}(T),X_{J}(T+1));
then as long as RADIUS(J) < DIST(J,Z) you can be sure that none of
the points in cluster J should be moved to cluster Z. (You update these
values more exactly whenever a point is moved.)
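A small numeric check of this bookkeeping (the points and centroid movements
are made up): after the updates above, RADIUS is still an overestimate, DIST
is still an underestimate, and RADIUS(J) < DIST(J,Z) licenses skipping
cluster Z entirely.

```python
import numpy as np

rng = np.random.default_rng(1)
# Cluster J: points near the origin; cluster Z's centroid far away.
cluster_J = rng.normal(0.0, 1.0, size=(50, 2))
x_J_old = cluster_J.mean(axis=0)
x_Z_old = np.array([20.0, 20.0])

RADIUS_J = np.linalg.norm(cluster_J - x_J_old, axis=1).max()
DIST_JZ = np.linalg.norm(cluster_J - x_Z_old, axis=1).min()

# The centroids drift a little on the next iteration:
x_J_new = x_J_old + np.array([0.3, -0.1])
x_Z_new = x_Z_old + np.array([-0.2, 0.4])

# Update the bounds as in the text, without touching the individual points:
RADIUS_J += np.linalg.norm(x_J_new - x_J_old)
DIST_JZ -= (np.linalg.norm(x_Z_new - x_Z_old) +
            np.linalg.norm(x_J_new - x_J_old))

# The bounds remain a valid overestimate / underestimate ...
true_radius = np.linalg.norm(cluster_J - x_J_new, axis=1).max()
true_dist = np.linalg.norm(cluster_J - x_Z_new, axis=1).min()
# ... and since RADIUS(J) < DIST(J,Z), no point of J can be closer to X_Z.
can_skip = RADIUS_J < DIST_JZ
```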
Sparse centroids
The centroid of a collection of documents contains a non-zero component
corresponding to every term that appears in any of the documents. Therefore
it is a much less sparse vector than the individual documents, which can
incur computational costs. One solution is to limit the number of non-zero
terms, or to threshold them. Another is to
replace the centroid of a set C by the ``most central''
point in C. There are a number of ways to define this; for instance,
choose the point X that minimizes max_{V in C} d(X,V) or
that minimizes sum_{V in C} d(X,V)^{2}
This can be approximated in linear time as follows:
medoid(C,X) {
U := the point in C furthest from X;
V := the point in C furthest from U;
Y := the point in C that minimizes d(Y,U) + d(Y,V) + |d(Y,U)-d(Y,V)|
return(Y)
}
Note that
d(X,U)+d(X,V) is minimal, and equal to d(U,V), for points X on the line
segment between U and V; and among points on that segment, |d(X,U)-d(X,V)|
is minimal when X is at the midpoint.
[Geraci et al. 2006] call this the "medoid" of C.
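A sketch of this approximation; for collinear points the returned point is
the middle one, matching the observation above:

```python
import numpy as np

def medoid(C, X):
    """Linear-time approximation of the most central point of C
    (the "medoid" of Geraci et al.), given a reference point X."""
    d = lambda a, b: float(np.linalg.norm(a - b))
    U = max(C, key=lambda p: d(p, X))     # furthest point from X
    V = max(C, key=lambda p: d(p, U))     # furthest point from U
    # Minimizing d(Y,U)+d(Y,V)+|d(Y,U)-d(Y,V)| favors points near the
    # midpoint of the segment U--V.
    return min(C, key=lambda p: d(p, U) + d(p, V) + abs(d(p, U) - d(p, V)))

# Five points on a line: the medoid is the middle one.
C = [np.array([float(i), 0.0]) for i in range(5)]
m = medoid(C, np.array([2.0, 0.0]))
```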
Finding the right value of K
Let S be a system of clusters.
Let S[d] be the cluster containing document d.
Let m(C) be the centroid of cluster C.
For a system of clusters S, let
RSS(S) = Σ_{d} |d - m(S[d])|^{2}
For K=1,2, ... compute S_{K} = best system with K clusters.
Plot RSS(S_{K}) vs. K (see MR&S figure 16.8 p. 337)
Look for a "knee" in plot. Choose that as value of K.
Question to which I don't know the answer: Is this curve always concave
upward?
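A sketch of the procedure on made-up data: for each K, approximate the best
system with K clusters by taking the lowest-RSS result over a few random
restarts of K-means, then inspect the resulting curve.

```python
import numpy as np

def k_means(points, k, seed=0, iters=100):
    rng = np.random.default_rng(seed)
    cents = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        lab = np.linalg.norm(points[:, None] - cents[None],
                             axis=2).argmin(axis=1)
        new = np.array([points[lab == i].mean(axis=0)
                        if np.any(lab == i) else cents[i] for i in range(k)])
        done = np.allclose(new, cents)
        cents = new
        if done:
            break
    return lab, cents

def rss(points, lab, cents):
    # RSS(S) = sum over documents d of |d - m(S[d])|^2
    return float(((points - cents[lab]) ** 2).sum())

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(c, 0.1, size=(20, 2))
                 for c in ([0, 0], [5, 5], [10, 0])])
curve = []
for k in range(1, 7):
    # "Best system with K clusters", approximated by 5 random restarts.
    curve.append(min(rss(pts, *k_means(pts, k, seed=s)) for s in range(5)))
# Plot curve against K and look for the knee.
```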
Furthest Point First (FPF) Algorithm
[Geraci et al., 2006]
{ V := random point in S;
CENTERS := { V };
for (U in S) {
LEADER[U] := V;
D[U] = d(U,V);
}
for (I := 1 to K-1) {
X := the point in S-CENTERS with maximal value of D[X];
C := emptyset;
for (U in S-CENTERS)
if (d(U,X) < D[U]) add U to C;
X := medoid(C,X);
add X to CENTERS;
for (U in S-CENTERS) {
if (d(U,X) < D[U]) {
LEADER[U] := X;
D[U] := d(U,X)
}
}
}
return(CENTERS)
}
(I think this is right; it is not quite clear to me from the paper when the
"medoids" are updated.)
The final value of CENTERS is the set of centers of clusters. For any U in
CENTERS, the cluster centered at U is the set of points X for which
LEADER[X]=U.
Features:
1. Time=O(nk).
2. Uses only sparse vectors
3. The "maximal radius measure"
max_{U in S} min_{X in CENTERS}d(U,X)
is within a factor of 2 of optimal.
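A sketch of FPF on made-up data, omitting the medoid-refinement step (so the
centers stay actual data points); each point also carries its LEADER as in
the pseudocode:

```python
import numpy as np

def fpf(points, k, seed=0):
    """Furthest Point First sketch, without the medoid refinement."""
    rng = np.random.default_rng(seed)
    n = len(points)
    centers = [int(rng.integers(n))]
    # D[u] = distance from u to its nearest center; LEADER[u] = that center.
    D = np.linalg.norm(points - points[centers[0]], axis=1)
    leader = np.full(n, centers[0])
    while len(centers) < k:
        x = int(D.argmax())        # the point furthest from every center
        centers.append(x)
        dx = np.linalg.norm(points - points[x], axis=1)
        closer = dx < D
        leader[closer] = x
        D[closer] = dx[closer]
    return centers, leader

pts = np.array([[0., 0.], [0., 1.], [1., 0.],
                [10., 10.], [10., 11.], [11., 10.]])
centers, leader = fpf(pts, 2)
# The cluster centered at U is { X : LEADER[X] = U }.
```

Each round does one pass over the points, giving the O(nk) time quoted above.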
Agglomerative Hierarchical Clustering Technique
{ put every point in a cluster by itself
for I := 1 to N-1 do {
let C1,C2 be the most mergeable pair of clusters;
create C parent of C1, C2
} }
Various measures of "mergeable" used. Choose C1 != C2 so as to maximize:
- Maximum similarity between d1 in C1 and d2 in C2.
Then this is basically Kruskal's algorithm for minimum spanning tree;
runs in almost linear time.
However, not a good clustering measure.
- Minimum similarity between d1 in C1 and d2 in C2 (= diameter
of C1 and C2)
- Similarity of centroids of C1 and C2 =
average similarity of d1 in C1 and d2 in C2.
- Average similarity of d1, d2 in C1 union C2.
Characteristics
- Creates complete binary tree of clusters
- Various ways to determine "mergeability".
- Deterministic
- O(N^{2}) mergeability calculations.
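A naive sketch of the agglomerative loop, parameterized by the mergeability
measure; here complete link (minimum similarity between members of the two
clusters) is used, and the four collinear points are made up. As written it
recomputes all pairwise scores each round, so it is slower than a careful
implementation:

```python
import numpy as np

def agglomerate(points, merge_sim):
    """Merge the most mergeable pair of clusters, N-1 times, recording the
    merge history (a complete binary tree of clusters). merge_sim(A, B)
    scores a candidate merge; higher means more mergeable."""
    clusters = {i: [i] for i in range(len(points))}
    history = []
    while len(clusters) > 1:
        pairs = [(a, b) for a in clusters for b in clusters if a < b]
        a, b = max(pairs, key=lambda p: merge_sim(points[clusters[p[0]]],
                                                  points[clusters[p[1]]]))
        merged = clusters.pop(a) + clusters.pop(b)
        history.append(tuple(sorted(merged)))
        clusters[min(merged)] = merged
    return history

def complete_link(A, B):
    # Complete link: negated maximum distance between members of A and B.
    return -max(np.linalg.norm(x - y) for x in A for y in B)

pts = np.array([[0., 0.], [0.5, 0.], [5., 0.], [5.5, 0.]])
history = agglomerate(pts, complete_link)
```

Swapping in a different merge_sim gives the other measures in the list above.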
Conflicting claims about quality of agglomerative vs. K-means.
One-pass Clustering
pick a starting point D in S;
CC := { { D } }; /* Set of clusters: Initially 1 cluster containing D */
for Di in S do {
C := the cluster in CC "closest" to Di
if similarity(C,Di) > threshold
then add Di to C;
else add { Di } to CC;
}
Features:
- Running time: O(KN) (K = number of clusters)
- Fixed threshold
- Order dependent. Can rerun with different order.
- Disjoint, exhaustive clusters
- Low precision
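A sketch of one-pass clustering on made-up data, taking similarity to a
cluster to be negative distance to its centroid, so the threshold becomes a
distance cutoff:

```python
import numpy as np

def one_pass(points, threshold):
    """One-pass clustering; 'threshold' is a distance cutoff here."""
    clusters = [[0]]                        # clusters as lists of indices
    centroids = [points[0].astype(float)]
    for i in range(1, len(points)):
        dists = [np.linalg.norm(points[i] - c) for c in centroids]
        j = int(np.argmin(dists))           # the "closest" cluster in CC
        if dists[j] < threshold:
            clusters[j].append(i)
            centroids[j] = points[clusters[j]].mean(axis=0)
        else:
            clusters.append([i])            # start a new cluster
            centroids.append(points[i].astype(float))
    return clusters

pts = np.array([[0., 0.], [0., 1.], [10., 10.], [1., 0.], [10., 11.]])
clusters = one_pass(pts, threshold=3.0)
```

Feeding the points in a different order can change the result, illustrating
the order dependence noted above.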
Mixture Models Clustering
Assert that data is generated by a weighted combination of parameterized
random processes. Find the weights, parameter values that best fit the data.
In particular a cluster is a distribution (e.g. Gaussian) around a center.
Features:
1. "Soft" assignment to clusters. That is, if a point U fits two models M1,
M2 fairly well, then it has a reasonable probability of being in either.
Viewed the other way, U provides some degree of "evidentiary support" to
both M1 and M2.
2. Allows overlapping models of different degrees of precision. In particular
there can be particular narrow-diameter models against a broad background.
Applying this to documents requires a stochastic model of document generation.
Even for quite simple documents, the statistics become quite hairy.
See Chakrabarti.
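To illustrate the idea without a document model, here is a toy EM fit of a
two-component 1-D Gaussian mixture (unit variance held fixed, synthetic
data). The point of the example is the soft assignment: a point midway
between the fitted centers gets roughly equal support from both components.

```python
import math
import random

random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(100)] +
        [random.gauss(6.0, 1.0) for _ in range(100)])

mu = [min(data), max(data)]          # crude initial centers
w = [0.5, 0.5]                       # mixture weights
for _ in range(30):
    # E step: responsibility of each component for each point.
    resp = []
    for x in data:
        p = [w[j] * math.exp(-0.5 * (x - mu[j]) ** 2) for j in range(2)]
        s = p[0] + p[1]
        resp.append([p[0] / s, p[1] / s])
    # M step: re-estimate weights and means from the soft assignments.
    for j in range(2):
        nj = sum(r[j] for r in resp)
        w[j] = nj / len(data)
        mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj

# A point midway between the two fitted centers gets split support:
x = (mu[0] + mu[1]) / 2
p = [w[j] * math.exp(-0.5 * (x - mu[j]) ** 2) for j in range(2)]
support = [p[0] / (p[0] + p[1]), p[1] / (p[0] + p[1])]
```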
STC (Suffix Tree Clustering) algorithm
Step 1: Construct suffix tree.
Suffix tree: S is a set of strings. (In our case, each element
of S is a sentence, viewed as a string of words.)
A compact tree containing all
suffixes of strings in S.
- Rooted directed tree.
- Each edge is labelled with a non-empty substring of some string in S.
Label of node N = concatenation of labels of edges from root to N.
- Compact: no two edges out of the same node have edge-labels
that begin with same word.
- For suffix Q in S, there is a node with label Q.
- Each node for a suffix Q of string Z in S labelled with index of Z
and starting position of Q in Z.
Example: S = { "cat ate cheese", "mouse ate cheese too", "cat ate mouse
too" }
Suffix tree can be constructed in linear time.
Step 2: Score nodes.
For node N in suffix tree, let D(N) = set of documents in subtree of N.
Let P(N) be the phrase labelling N. Define score of N, s(N) =
|D(N)| * f(|P(N)|). f(1) is small; f(K) = K for K = 2 ... 6;
f(K) = 6 for K > 6.
Step 3: Find clusters.
A. Construct an undirected graph whose vertices are nodes of the suffix tree.
There is an arc from N1 to N2 if the following conditions hold:
- Either N1 or N2 is among the 500 top-scoring nodes.
- | D(N1) intersect D(N2) | / max(|D(N1)|,|D(N2)|) > 0.5.
B. Each connected component of this graph is a cluster. Score of
cluster computed from scores of nodes, overlap function.
Top 10 clusters returned.
Cluster can be described
using phrases of its ST nodes.
Example
Query "salsa" submitted to MetaCrawler (consults with several
search engines and combines answer.) Returns 246 documents in 15 clusters,
of which the top are
- Puerto Rico; Latin Music (8 docs)
- Follow Up Post; York Salsa Dancers (20 docs)
- music; entertainment; latin; artists (40 docs)
- hot; food; chiles; sauces; condiments; companies (79 docs)
- pepper; onion; tomatoes (41 docs)
Features
- Overlapping clusters.
- Non-exhaustive
- Linear time.
- High precision.
Clustering using query log
(Beeferman and Berger, 2000)
Log records query, links that were clicked through.
Create bipartite graph where query terms connect to links.
Cluster of pages = connected component.
Also gives cluster of query terms -- useful to suggest alternative queries
to user.
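A sketch of the idea: build the bipartite click graph and take connected
components. The log entries below are invented for illustration.

```python
from collections import defaultdict

# Each log entry: (query, clicked-through URL).
log = [("jaguar car", "jaguar.com"),
       ("jaguar price", "jaguar.com"),
       ("jaguar cat", "wikipedia.org/Jaguar"),
       ("big cats", "wikipedia.org/Jaguar"),
       ("hepburn", "audreyhepburn.example.org")]

# Bipartite graph: query nodes on one side, URL nodes on the other.
graph = defaultdict(set)
for query, url in log:
    graph[("q", query)].add(("u", url))
    graph[("u", url)].add(("q", query))

def components(graph):
    """Connected components by depth-first search."""
    seen, comps = set(), []
    for node in graph:
        if node in seen:
            continue
        comp, stack = set(), [node]
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(graph[n])
        seen |= comp
        comps.append(comp)
    return comps

comps = components(graph)
# Each component clusters pages and, at the same time, query terms.
```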
Evaluating Clusters
Overall internal measures
These measure the quality of the algorithm and the structure of the point
set, not the reasonableness of the underlying distance measure.
- Minimize average diameter of clusters.
(Diameter of cluster C = max distance between two points in C)
- Minimize max diameter of clusters
- Minimize average radius of cluster (max distance from point to center)
- Minimize average distance between points in cluster.
- Minimize mean-squared distance from point to center.
Maximum likelihood estimate if clusters generated by normal distribution
around center with fixed variance.
Variable number of clusters:
Any of the above + suitable penalty for more clusters. Note: Without penalty,
these are all optimized if each point is a cluster by itself.
Minimum description length. Incorporates penalty for large K and works
for hierarchy.
Formal measures test the adequacy of the clustering algorithm, but not the
relevance of the distance measure to actual significance.
Relative to an informational need
Compare the cost (i.e. user time) of browsing a clustered structure for
relevant documents to cost of browsing the list. A cluster is relevant
if it contains a relevant document.
Assume that the labels on the clusters allow the user to judge
which clusters are relevant. A cluster appears if it is a sibling of
a relevant cluster. A document appears if it is a sibling of a relevant
document or cluster. Charge A for each cluster that appears and B
for each document that appears. Overall generalized precision is
B * (number of relevant docs retrieved) /
(A * (number of clusters that appear) + B * (number of docs that appear)).
Problem: If one relevant document is put into an otherwise irrelevant
cluster, that can have a large cost in precision, because now all of its
sibling documents and clusters come along. It would be more reasonable to
incur a small cost in recall. Another problem: clusters are given in ranked
order (usually by number of docs, but still).
Alternative solutions that give a precision/recall curve:
1. Assume that
whenever a list of clusters appears, the user goes through clusters
that appear in order, going down each relevant cluster. Plot the amount
of effort spent by the time he reaches the Kth relevant document.
2. Assume instead that the labels on the clusters are good enough that
the user can carry out an optimal depth-first search, relative to, say,
overall weighted precision; that is, the user can judge which clusters do
and do not contain relevant documents. If weights are geometrically
decreasing, this is easily computed; if not, determining the optimal search
is probably NP-hard.
Comparison to Gold Standard
You have a "gold standard" set of documents with clusters.
- Purity: For each computed cluster C, let M(C) be the true cluster that
best matches C. For document d, let C(d) be the computed cluster containing
d and let T(d) be the true cluster containing d. Then
Purity = fraction of d for which M(C(d)) = T(d).
Problem: Reaches an optimum value of 1 when each document is in
a singleton cluster.
Solution: Penalize for large values of K.
Question: How much to penalize?
- Normalized mutual information: Information theoretic measure of
how well the computed clusters and the true clusters predict one another,
normalized by the amount of information inherent in the two clustering
systems.
- Precision-recall. Consider the two clustering systems (true and computed)
as categorizations of pairs of documents: D1 and D2 are in the same cluster
vs. in different clusters. Then treat this as a retrieval evaluation problem
as in lecture 4.
E.g. Precision = fraction of pairs of documents that are actually in the
same category out of those that are computed to be.
Recall = fraction of pairs of documents that are computed to be in the
same category out of those that are actually in the same category.
Note that oversplitting gains precision at the cost of recall; both
precision and recall behave monotonically with respect to splitting.
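The purity measure above can be sketched as follows (the documents and
labels are made up); the last line sets up the degenerate singleton
clustering that drives purity to 1:

```python
from collections import Counter

def purity(computed, true):
    """computed and true map each document to a cluster id. M(C) is the
    true cluster with the largest overlap with computed cluster C; purity
    is the fraction of documents d with M(C(d)) = T(d)."""
    docs_by_cluster = {}
    for d, c in computed.items():
        docs_by_cluster.setdefault(c, []).append(d)
    best_match = {c: Counter(true[d] for d in ds).most_common(1)[0][0]
                  for c, ds in docs_by_cluster.items()}
    return sum(best_match[computed[d]] == true[d]
               for d in computed) / len(computed)

true = {"d1": "cars", "d2": "cars", "d3": "cats", "d4": "cats"}
computed = {"d1": 0, "d2": 0, "d3": 0, "d4": 1}
p = purity(computed, true)                 # 3 of the 4 documents match
singletons = {d: i for i, d in enumerate(true)}   # purity(singletons) = 1
```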
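The pair-counting precision/recall can be sketched as follows (documents and
labels made up); the example shows oversplitting gaining precision at the
cost of recall:

```python
from itertools import combinations

def pair_precision_recall(computed, true):
    """Treat both clusterings as classifiers of document pairs (same
    cluster vs. different) and score computed against true."""
    docs = sorted(computed)
    same_computed = {p for p in combinations(docs, 2)
                     if computed[p[0]] == computed[p[1]]}
    same_true = {p for p in combinations(docs, 2)
                 if true[p[0]] == true[p[1]]}
    tp = len(same_computed & same_true)
    return tp / len(same_computed), tp / len(same_true)

true = {"d1": "cars", "d2": "cars", "d3": "cars", "d4": "cats"}
merged = {d: 0 for d in true}                  # everything in one cluster
split = {"d1": 0, "d2": 0, "d3": 1, "d4": 2}   # oversplit
p1, r1 = pair_precision_recall(merged, true)
p2, r2 = pair_precision_recall(split, true)
# Splitting raised precision (p2 >= p1) and lowered recall (r2 <= r1).
```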
Assigning labels to clusters
Statistics based
Choose most frequent / most highly weighted / most discriminative terms
in each cluster. Often gives highly unsuggestive results.
Best document based
Find document in the cluster closest to the centroid; use the title, or the
snippet for the query. Very hit and miss.
Wikipedia based
Find most similar Wikipedia article. Use title.
Evaluating label assignments
Criteria:
- How well does the label describe the documents in the cluster,
relative to the query? (I.e. if the query is "jaguar" then the
label "jaguar" may be a good one for some cluster, just looking at the
documents, but not in the context of the query. Whereas the label
may be perfectly reasonable for the same documents in the context of some
other query, e.g. "feline species".)
- How well does the set of labels partition the space of results?
1. Run label assignment algorithm over gold standard clustering. Accept if
assigned label is marked as a synonym for true label in WordNet.
2. Human evaluation, of the kinds discussed in lecture 4.
Aside on WordNet
Two online data collections are increasingly used in all kinds of ways in
IR and particularly in web mining. One is Wikipedia. The other is
WordNet. If you are at all serious about IR or web mining, you should spend
some time checking them out.