Lecture 4: Clustering
Required Reading
Grouper: A Dynamic Clustering Interface to Web Search Results
Oren Zamir and Oren Etzioni
A Comparison of Document Clustering Techniques
Michael Steinbach, George Karypis, and Vipin Kumar
Applications
- Structuring search results
- Suggesting related pages
- Automatic directory construction / update.
- Finding near identical pages:
- Finding mirror pages (e.g. for propagating updates)
- Eliminate near-duplicates from results page
- Plagiarism detection
- Lost and found (find identical pages at different URLs at different
times).
Cluster structure
- Hierarchical vs. flat.
- Overlap:
- Disjoint partitioning. E.g. partition congressmen by state.
- Multiple dimensions of partitioning, each disjoint.
E.g. partition congressmen by state; by party; by House vs. Senate.
- Arbitrary overlap. E.g. partition articles by author. Geographical
regions (France, the Alps, German-speaking regions). Problem: No
natural bound on number of categories.
- Exhaustive vs. non-exhaustive.
Note: disjoint and exhaustive decomposition = tree.
Note difference between geometric point not part of any cluster (unclustered
point) and document not part of any cluster (general subject matter).
True analogy perhaps: document = region.
- Outliers: What to do?
- How many clusters? How large?
Information source
- Text content
- Links?
- Usage: Clickthrough logs give association between query and page.
Position of clustering module
- At indexing time
- At query time applied to full pages.
- At query time applied to snippets. Turns out that, experimentally,
most clustering algorithms do almost as well given only snippets; in fact,
some do better with snippets than with the whole text.
Textual similarity criterion
- Vector measure. Each document is considered as vector normalized
to length 1.
- Overlap measure. Similarity of documents Q and R is
|Q intersect R| / |Q union R|
Note: Safe to presume that cluster of documents is a convex region
geometrically. That is, if subject S includes DOC1 with combination V1
of words and DOC2 with combination V2 of words then there could exist
a document in S with combination p*V1 + (1-p)*V2 of words.
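A minimal Python sketch of the two similarity criteria above. It assumes each document is given as a list of words and uses raw term counts as vector weights; the notes do not specify a weighting scheme, and the function names are illustrative only.

```python
import math
from collections import Counter


def cosine_similarity(doc_q, doc_r):
    """Vector measure: cosine of the angle between two term-count vectors.

    Dividing by both norms is the same as normalizing each vector to length 1
    and taking the dot product. Raw counts as weights are an assumption.
    """
    q, r = Counter(doc_q), Counter(doc_r)
    dot = sum(q[w] * r[w] for w in q.keys() & r.keys())
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    norm_r = math.sqrt(sum(c * c for c in r.values()))
    if norm_q == 0 or norm_r == 0:
        return 0.0
    return dot / (norm_q * norm_r)


def overlap_similarity(doc_q, doc_r):
    """Overlap measure: |Q intersect R| / |Q union R| over the sets of words."""
    q, r = set(doc_q), set(doc_r)
    if not q and not r:
        return 0.0
    return len(q & r) / len(q | r)
```

For example, "a b c" and "b c d" share two of four distinct words, so their overlap measure is 0.5.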
Source of clustering in search results
- Polysemy. "bat", "Washington", "Banks".
Clustering criterion clear in principle, hard or easy in practice.
(e.g. "President Bush")
- Multiple aspects of a single topic. Practically any rich topic.
Many possible dimensions for clusters. Little agreement among human
subjects. No ideal form of clustering. Qn: Can we generate a system
of clusters that seems plausible and useful? Ultimately amounts
to general problem of Web page / information structuring.
Clustering algorithms
- Decompositional (top-down)
- Agglomerative (bottom-up)
Decompositional algorithms are almost always based on vector space
(only terms in which to see high-level structure.)
Any decompositional clustering algorithm can be made hierarchical by
recursive application.
K-means algorithm
K-means-cluster (in S : set of N vectors; k : integer)
{ let X[1] ... X[k] be k random points in S;
  repeat {
      for i := 1 to k {
          C[i] := empty
      }
      for j := 1 to N {
          q := index of the closest to S[j] of X[1] ... X[k];
          add S[j] to C[q]
      }
      for i := 1 to k {
          X[i] := centroid of C[i]
      }
  }
  until the change to X is small enough;
  return C[1] ... C[k]
}
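The pseudocode above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a reference implementation: points are tuples of numbers, distance is Euclidean, and the parameters `max_iter`, `tol`, and `seed` are additions of this sketch. An empty cluster simply keeps its old center.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, tol=1e-9, seed=0):
    """Return (centers, clusters) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k random points in S
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            q = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[q].append(p)
        # Update step: recompute centroids.
        # Assumption: a starved (empty) cluster keeps its previous center.
        new_centers = [
            centroid(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        shift = max(squared_distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:  # "the change to X is small enough"
            break
    return centers, clusters
```

Running it several times with different `seed` values and keeping the result with the lowest sum of squared distances is one way to act on the "multiple runs" remark below.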
Features:
- Objective function: sum of the squared distance from each point to the
centroid of its cluster. (For vectors normalized to length 1, minimizing
this corresponds to maximizing the average cosine of angles between points
in the same cluster.) Steadily decreases over the iterations.
Hill-climbing algorithm.
- Disjoint and exhaustive decomposition.
- For fixed number of iterations, linear in N.
Clever optimization reduces recomputation of X[q] if small change to S[j].
Second loop much shorter than O(kN) after the first couple of iterations.
- "Anytime" algorithm: C[1] ... C[k] is always a decomposition of S into
convex subregions.
- Random starting point: Multiple runs may give different results; choose
the "best".
Problems:
- Have to guess K.
- Local minimum. Example: In diagram below, if K=2, and you start
with centroids B and E, converges on the two clusters {A,B,C}, {D,E,F}.
- Disjoint and exhaustive decomposition.
- Starvation: Complete starvation of C[j] (no points assigned to it), or
starvation to a single outlier.
- Assumes that clusters are spherical in vector space. Hence
particularly sensitive to coordinate changes (e.g. changes in weighting)
Variant 1: Update centroids incrementally. Claims to give better results.
Variant 2: Bisecting K-means algorithm:
for I := 1 to K-1 do {
pick a leaf cluster C to split;
for J := 1 to ITER do split C into two subclusters C1 and C2;
choose the best of the above splits and make it permanent
}
Generates a binary clustering hierarchy. Works better (so claimed) than
K-means.
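A hedged Python sketch of the bisecting variant. The notes do not say which leaf to pick or how to judge the "best" split; this sketch assumes the largest leaf is split and that the best split is the one with the lowest total sum of squared distances to the subcluster centroids. All function names and the `rounds` and `seed` parameters are illustrative.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))


def sse(points):
    m = mean(points)
    return sum(sq_dist(p, m) for p in points)


def two_means(points, rng, rounds=20):
    """One trial split of `points` into two subclusters with basic 2-means."""
    c1, c2 = rng.sample(points, 2)
    left, right = list(points), []
    for _ in range(rounds):
        left = [p for p in points if sq_dist(p, c1) <= sq_dist(p, c2)]
        right = [p for p in points if sq_dist(p, c1) > sq_dist(p, c2)]
        if not left or not right:
            break  # degenerate split, e.g. both starting points coincide
        c1, c2 = mean(left), mean(right)
    return left, right


def bisecting_k_means(points, k, iters=5, seed=0):
    """Return up to k leaf clusters (lists of tuples)."""
    rng = random.Random(seed)
    leaves = [list(points)]
    for _ in range(k - 1):
        target = max(leaves, key=len)  # assumption: split the largest leaf
        if len(target) < 2:
            break
        trials = [two_means(target, rng) for _ in range(iters)]
        trials = [t for t in trials if t[0] and t[1]]
        if not trials:
            break
        # Assumption: "best" split = lowest total squared error.
        best = min(trials, key=lambda t: sse(t[0]) + sse(t[1]))
        leaves.remove(target)
        leaves.extend(best)
    return leaves
```

Recording which leaf each pair of subclusters came from would give the binary hierarchy mentioned above; the sketch returns only the leaves.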
Variant 3: Allow overlapping clusters: If distances from P to C1 and
to C2 are close enough then put P in both clusters.
Agglomerative Hierarchical Clustering Technique
{ put every point in a cluster by itself
for I := 1 to N-1 do {
let C1,C2 be the most mergeable pair of clusters;
create C parent of C1, C2
} }
Various measures of "mergeable" used.
(If "mergeability" is the distance between the two closest elements of
C1, C2, then this is Kruskal's algorithm for minimum spanning tree;
however, this is not, in practice, one of the measures used.)
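A small Python sketch of the agglomerative procedure above. The "mergeability" measure is left as a parameter; the default here is average pairwise distance (group average), which is an assumption of this sketch, and passing `min` gives the closest-pair measure mentioned in the parenthetical. This naive version recomputes all pairwise distances at every step, so it is slower than the O(N^2) quoted in these notes.

```python
def average_link(distances):
    return sum(distances) / len(distances)


def agglomerative(points, distance, linkage=average_link):
    """Return a nested-tuple binary tree over a non-empty list of points.

    `distance(a, b)` compares two points; `linkage` reduces the list of
    pairwise distances between two clusters to one number (lower = merge first).
    """
    # Each entry is (members, tree); initially every point is its own cluster.
    clusters = [([p], p) for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairwise = [
                    distance(a, b)
                    for a in clusters[i][0]
                    for b in clusters[j][0]
                ]
                d = linkage(pairwise)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (members_i, tree_i), (members_j, tree_j) = clusters[i], clusters[j]
        merged = (members_i + members_j, (tree_i, tree_j))  # C parent of C1, C2
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters[0][1]
```

With the one-dimensional points 0, 1, 10, 11 and absolute difference as the distance, the two tight pairs are merged first and then joined at the root.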
Characteristics
- Creates complete binary tree of clusters
- Various ways to determine "mergeability".
- Deterministic
- O(N^2) running time.
Conflicting claims about quality of agglomerative vs. K-means.
One-pass Clustering
pick a starting point D in S;
CC := { { D } } /* Set of clusters: Initially 1 cluster containing D */
for Di in S other than D do {
    C := the cluster in CC "closest" to Di;
    if similarity(C,Di) > threshold
        then add Di to C;
        else add { Di } to CC;
}
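A Python sketch of the one-pass procedure. The notes leave similarity(C, Di) unspecified; this sketch assumes documents are sets of words and takes the mean overlap measure between Di and the members of C. Both choices, and all names here, are illustrative assumptions.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_similarity(cluster, doc):
    """Assumed cluster-to-document similarity: mean overlap with the members."""
    return sum(jaccard(member, doc) for member in cluster) / len(cluster)


def one_pass_cluster(docs, threshold, similarity=mean_similarity):
    """Single pass over docs; each doc joins the closest cluster or starts one."""
    clusters = []
    for doc in docs:
        best = max(clusters, key=lambda c: similarity(c, doc), default=None)
        if best is not None and similarity(best, doc) > threshold:
            best.append(doc)
        else:
            clusters.append([doc])
    return clusters
```

Because each document is compared only against the clusters that exist when it arrives, shuffling `docs` and rerunning can produce a different clustering.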
Features:
- Running time: O(KN) (K = number of clusters)
- Fixed threshold
- Order dependent. Can rerun with different order.
- Disjoint, exhaustive clusters
- Low precision
STC (Suffix Tree Clustering) algorithm
Step 1: Construct suffix tree.
Suffix tree: S is a set of strings. (In our case, each element
of S is a sentence, viewed as a string of words.)
A compact tree containing all suffixes of strings in S.
- Rooted directed tree.
- Each edge is labelled with a non-empty substring of S. Label of
node N = concatenation of labels of edges from root to N.
- Compact: no two edges out of the same node have edge-labels
that begin with same word.
- For each suffix Q of a string in S, there is a node with label Q.
- Each node for a suffix Q of string Z in S labelled with index of Z
and starting position of Q in Z.
Example: S = { "cat ate cheese", "mouse ate cheese too", "cat ate mouse
too" }
Suffix tree can be constructed in linear time.
Step 2: Score nodes.
For node N in suffix tree, let D(N) = set of documents in subtree of N.
Let P(N) be the phrase labelling N. Define score of N, s(N) =
|D(N)| * f(|P(N)|). f(1) is small; f(K) = K for K = 2 ... 6;
f(K) = 6 for K > 6.
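The scoring rule can be illustrated with a short Python sketch. To stay small it does not build a suffix tree: it substitutes a plain index from every word n-gram to the documents containing it, which is quadratic in sentence length rather than linear. The value 0.5 for f(1) is an assumption, since the notes only say it is small.

```python
from collections import defaultdict


def phrase_documents(docs):
    """Map each phrase (tuple of consecutive words) to the set of doc indices.

    A naive stand-in for the suffix tree: not linear time.
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.split()
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                index[tuple(words[start:end])].add(doc_id)
    return index


def f(length, single_word_weight=0.5):
    """Phrase-length factor: small for 1 word, K for 2..6 words, 6 beyond."""
    if length == 1:
        return single_word_weight  # assumed value for "small"
    return min(length, 6)


def score(phrase, doc_ids):
    """s(N) = |D(N)| * f(|P(N)|)."""
    return len(doc_ids) * f(len(phrase))
```

On the example set S above, the phrase "ate cheese" occurs in two sentences and has two words, so it scores 2 * 2 = 4, while the single word "ate" occurs in all three and scores 3 * 0.5 = 1.5 under the assumed f(1).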
Step 3: Find clusters.
A. Construct an undirected graph whose vertices are nodes of the suffix tree.
There is an arc from N1 to N2 if the following conditions hold:
- Either N1 or N2 is among the 500 top-scoring nodes.
- | D(N1) intersect D(N2) | / max(|D(N1)|,|D(N2)|) > 0.5.
B. Each connected component of this graph is a cluster. Score of
cluster computed from scores of nodes, overlap function.
Top 10 clusters returned.
Cluster can be described using phrases of its ST nodes.
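Step 3 can be sketched in Python as a union-find over nodes. The sketch assumes each node is identified by its phrase and comes with a non-empty document set and a precomputed score, and it reads the overlap condition as dividing by the larger of the two document sets. The final scoring and top-10 selection of clusters are not shown, and the sample data in the usage note is made up.

```python
def merge_base_clusters(base, scores, top_n=500, min_overlap=0.5):
    """Group nodes into connected components of the overlap graph.

    base:   dict mapping node (phrase) -> non-empty set of document ids
    scores: dict mapping node (phrase) -> score s(N)
    Returns a list of (phrases, documents) pairs, one per component.
    """
    top = set(sorted(base, key=scores.get, reverse=True)[:top_n])
    nodes = list(base)
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            if a not in top and b not in top:
                continue
            shared = len(base[a] & base[b])
            if shared / max(len(base[a]), len(base[b])) > min_overlap:
                parent[find(a)] = find(b)  # add an arc: merge components

    components = {}
    for n in nodes:
        components.setdefault(find(n), []).append(n)
    return [
        (sorted(members), set().union(*(base[m] for m in members)))
        for members in components.values()
    ]
```

For instance, with hypothetical nodes "salsa music" -> {1, 2, 3}, "latin music" -> {2, 3, 4}, "hot sauce" -> {7, 8}, and "chile sauce" -> {7, 8, 9}, the first two and the last two each overlap by 2/3, giving two clusters whose descriptions are the phrases of their nodes.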
Example
Query "salsa" submitted to MetaCrawler (which consults several
search engines and combines their answers). Returns 246 documents in 15
clusters, of which the top are
- Puerto Rico; Latin Music (8 docs)
- Follow Up Post; York Salsa Dancers (20 docs)
- music; entertainment; latin; artists (40 docs)
- hot; food; chiles; sauces; condiments; companies (79 docs)
- pepper; onion; tomatoes (41 docs)
Features
- Overlapping clusters.
- Non-exhaustive
- Linear time.
- High precision.
Clustering using query log
(Beeferman and Berger, 2000)
Log records query, links that were clicked through.
Create bipartite graph where query terms connect to links.
Cluster of pages = connected component.
Also gives cluster of query terms -- useful to suggest alternative queries
to user.
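A minimal Python sketch of the connected-component view described above, using union-find over a bipartite graph. The log format (a list of (query, clicked URL) pairs) and the example URLs in the usage note are assumptions of this sketch.

```python
from collections import defaultdict


def cluster_click_log(log):
    """Return a list of (queries, pages) pairs, one per connected component.

    log: iterable of (query, clicked_url) pairs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Tag nodes so a query string can never collide with a URL string.
    for query, url in log:
        parent[find(("q", query))] = find(("u", url))

    queries, pages = defaultdict(set), defaultdict(set)
    for kind, value in list(parent):
        root = find((kind, value))
        (queries if kind == "q" else pages)[root].add(value)
    return [(queries[root], pages[root]) for root in pages]
```

Given clicks such as ("jaguar car", "cars.example/jaguar") and ("jaguar cat", "zoo.example/jaguar"), queries that never lead to a common page end up in different components, and the query side of each component is a source of alternative query suggestions.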
Detection of identical or near-identical pages
Syntactic Clustering of the Web
Broder, Glassman, Manasse, and Zweig, 1997 (WWW6)
Shingle : K consecutive words.
Fix a shingle size K. Let S(A) be the set of shingles in A and
let S(B) be the set of shingles in B.
The resemblance of A and B
is defined as | S(A) intersect S(B) | / | S(A) union S(B) |.
The containment of A in B
is defined as | S(A) intersect S(B) | / | S(A) |.
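The definitions above translate directly into a few lines of Python. This sketch assumes whitespace tokenization and lower-casing, and represents a shingle as a tuple of K consecutive words; the function names are illustrative.

```python
def shingles(text, k):
    """Set of all runs of k consecutive words in text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k):
    """| S(A) intersect S(B) | / | S(A) union S(B) |"""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def containment(a, b, k):
    """| S(A) intersect S(B) | / | S(A) |"""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa) if sa else 0.0
```

With K = 2, "a rose is a rose is a rose" has the three shingles (a, rose), (rose, is), (is, a), and "a rose is a flower" has those three plus (a, flower). The resemblance is 3/4, and the containment of the first in the second is 1.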
Estimate the resemblance of sets A and B using random sampling:
Step 1: Choose a value m and a random function P(W) from elements to integers.
Step 2: For set X, let V(X) = { x in X | P(x) mod m = 0} be a sample of X.
This is called the sketch of X.
Step 3: | V(A) intersect V(B) | / |V(A) union V(B) |
is an estimate of the resemblance of A and B.
|V(A) intersect V(B)| / |V(A)| is an estimate of the containment of A in B.
Note: if you just sample A and sample B, get a substantial underestimate of
the resemblance.
Implementation uses shingle size = 10, m = 25, 40 bit sketch.
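A Python sketch of the sampling estimate. A salted SHA-1 hash stands in for the random function P; that choice, the salt, and the function names are assumptions of this sketch rather than details of the original implementation. Because P depends only on the element, an element shared by A and B is either in both sketches or in neither, which is what avoids the underestimate mentioned in the note.

```python
import hashlib


def p(element, salt="demo"):
    """Stand-in for the random function P: element -> integer."""
    digest = hashlib.sha1(f"{salt}:{element!r}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(elements, m):
    """V(X) = { x in X | P(x) mod m = 0 }"""
    return {x for x in elements if p(x) % m == 0}


def estimate_resemblance(a, b, m):
    va, vb = sketch(a, m), sketch(b, m)
    union = va | vb
    return len(va & vb) / len(union) if union else 0.0


def estimate_containment(a, b, m):
    va, vb = sketch(a, m), sketch(b, m)
    return len(va & vb) / len(va) if va else 0.0
```

The estimate is only as good as the sample is large: with m = 25 roughly one element in 25 is kept, so very small sets can yield an empty sketch, which this sketch reports as 0.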
High-level algorithm
Step 1: Normalize by removing HTML and converting to lower-case.
Step 2: Calculate sketch of shingle set of each document.
Step 3: For each pair of documents, compare sketches to see if they
exceed resemblance threshold.
Step 4: Find connected components.
Implementation of Step 3. Obviously all-pairs comparison is not feasible;
however, most pairs have no shingles in common. Also want to reduce disk
I/O since this cannot be done in-memory.
3.1. Use merge-sort to generate sorted file F1 of < shingle, docID > pairs.
3.2. Eliminate shingles that appear in only one docID (most).
3.3. for each shingle S in F1,
for each pair of docs docID1, docID2 associated with S in F1,
write record < docID1, docID2 > to F2.
F2 now has one record of form < docID1, docID2 > for each shingle
shared by doc1 and doc2.
3.4. Merge-sort F2, combining and keeping count of identical elements.
Generate file F3 of records < docID1, docID2, count of common shingles >.
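Steps 3.1 to 3.4 can be mimicked in memory with sorted lists standing in for the files F1, F2, and F3. This is only a sketch of the data flow: the real implementation works on disk with merge-sort, and the input format (a dict from docID to its set of sketch shingles) is an assumption. The cutoff for very common shingles uses the figure of 1000 documents given under "Further issues" below.

```python
from itertools import combinations, groupby


def count_shared_shingles(doc_shingles, max_docs_per_shingle=1000):
    """Return sorted records (docID1, docID2, count of common shingles).

    doc_shingles: dict mapping docID -> set of shingles (mutually comparable).
    """
    # 3.1: sorted "file" F1 of (shingle, docID) pairs.
    f1 = sorted(
        (shingle, doc_id)
        for doc_id, shingle_set in doc_shingles.items()
        for shingle in shingle_set
    )
    f2 = []
    for _, group in groupby(f1, key=lambda record: record[0]):
        doc_ids = [doc_id for _, doc_id in group]
        if len(doc_ids) < 2:
            continue  # 3.2: shingle appears in only one document
        if len(doc_ids) > max_docs_per_shingle:
            continue  # very common shingle
        # 3.3: one (docID1, docID2) record per pair sharing this shingle.
        f2.extend(combinations(doc_ids, 2))
    # 3.4: sort F2 and count identical records to get F3.
    f2.sort()
    return [(d1, d2, len(list(g))) for (d1, d2), g in groupby(f2)]
```

Dividing each count by the size of the union of the two sketches (the two sketch sizes minus the count) gives the estimated resemblance to compare against the threshold.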
Further issues:
- Eliminate very common shingles (more than 1000 documents). Almost
all automatically generated by standard programs.
- Eliminate truly identical documents by computing and comparing fingerprint
of entire document.
Known clusters
You have a known system of clusters with examples (e.g. Yahoo).
The problem is to place a new page into the proper cluster(s).
In machine learning, this is known as classification rather
than clustering. Lots of algorithms.