## Lecture 7: Clustering

Chakrabarti, chap. 4 through section 4.2.

• Grouper: A Dynamic Clustering Interface to Web Search Results, by Oren Zamir and Oren Etzioni
• A Comparison of Document Clustering Techniques, by Michael Steinbach, George Karypis, and Vipin Kumar

### Clustering algorithms

• Decompositional (top-down)
• Agglomerative (bottom-up)

Decompositional algorithms are almost always based on the vector space model (term vectors are the only representation in which to see high-level structure).

Any decompositional clustering algorithm can be made hierarchical by recursive application.

### K-means algorithm

```K-means-cluster (in S : set of vectors; k : integer)
{  let C[1] ... C[k] be a random partition of S into k parts;
   repeat {
       for i := 1 to k {
           X[i] := centroid of C[i];
           C[i] := empty
       }
       for j := 1 to N {
           let X[q] be the closest to S[j] of X[1] ... X[k];
           add S[j] to C[q]
       }
   } until the change to C (or the change to X) is small enough.
}
```

Features:

• Objective function: minimize the sum of the squared distances from each point to the centroid of its cluster.
For unit vectors, this is equivalent to maximizing the average dot product S[i] dot S[j] where S[i] and S[j] are in the same cluster.

Sum of the squared distance steadily decreases over the iterations.
Hill-climbing algorithm.

• Disjoint and exhaustive decomposition.
• For a fixed number of iterations, linear in N. A clever optimization skips recomputing the closest centroid X[q] for S[j] when the centroids have moved only slightly; the second loop is much shorter than O(kN) after the first couple of iterations.
• "Anytime" algorithm: C[1] ... C[k] is always a decomposition of S into convex subregions.
• Random starting point: multiple runs may give different results; choose the "best".
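The pseudocode above can be sketched in Python. This is a minimal dense-vector version: squared Euclidean distance and the random starting partition follow the pseudocode, while the fixed `iters` cap is an assumption standing in for the convergence test.

```python
import random

def kmeans(points, k, iters=20):
    """Minimal K-means over equal-length tuples of numbers.
    `iters` caps the loop in place of the pseudocode's convergence test."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def centroid(cluster):
        return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

    shuffled = list(points)
    random.shuffle(shuffled)                      # random starting partition
    clusters = [shuffled[i::k] for i in range(k)]
    for _ in range(iters):
        centers = [centroid(c) for c in clusters if c]
        clusters = [[] for _ in centers]
        for p in points:
            # q = index of the center closest to p
            q = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            clusters[q].append(p)
    return clusters
```

Note that a cluster can be starved (receive no points), in which case it is simply dropped on the next iteration; a production implementation would reseed it.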

Problems:

• Have to guess K.
• Local minimum. Example: in the diagram below, if K=2 and you start with centroids B and E, the algorithm converges on the two clusters {A,B,C}, {D,E,F}.

• Disjoint and exhaustive decomposition may be inappropriate (no overlapping or partial clusters).
• Starvation: Complete starvation of C[j], or starvation to single outlier.
• Assumes that clusters are spherical in vector space. Hence particularly sensitive to coordinate changes (e.g. changes in weighting)

Variant 1: Update centroids incrementally. Claims to give better results.

Variant 2: Bisecting K-means algorithm:

```for I := 1 to K-1 do {
    pick a leaf cluster C to split;
    for J := 1 to ITER do
        split C into two subclusters C1 and C2 (e.g. by 2-means);
    choose the best of the above splits and make it permanent
}
```
Generates a binary clustering hierarchy. Works better (so claimed) than K-means.
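A sketch of the bisecting variant, under two assumptions the pseudocode leaves open: each trial split is a 2-means run, and "best split" / "leaf to split" are judged by sum of squared error (SSE), a common choice.

```python
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(cluster):
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

def sse(cluster):
    """Sum of squared distances to the centroid (0 for an empty cluster)."""
    if not cluster:
        return 0.0
    c = centroid(cluster)
    return sum(dist2(p, c) for p in cluster)

def two_means(points, iters=10):
    """One trial split of `points` into two subclusters (2-means)."""
    c1, c2 = random.sample(points, 2)
    a, b = points, []
    for _ in range(iters):
        a = [p for p in points if dist2(p, c1) <= dist2(p, c2)]
        b = [p for p in points if dist2(p, c1) > dist2(p, c2)]
        if not a or not b:
            break                      # degenerate split; give up this trial
        c1, c2 = centroid(a), centroid(b)
    return a, b

def bisecting_kmeans(points, k, trials=5):
    leaves = [list(points)]
    while len(leaves) < k:
        leaves.sort(key=sse, reverse=True)
        target = leaves.pop(0)         # split the leaf with the largest SSE
        if len(target) < 2:            # nothing left to split
            leaves.insert(0, target)
            break
        # ITER trial splits; keep the one with the lowest total SSE
        best = min((two_means(target) for _ in range(trials)),
                   key=lambda ab: sse(ab[0]) + sse(ab[1]))
        leaves.extend(best)
    return leaves
```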

Variant 3: Allow overlapping clusters: If distances from P to C1 and to C2 are close enough then put P in both clusters.

The full mean vector would get very long (one component for each distinct word in all the documents). Solution: truncate after the first M terms (typically M = 50 or 100).

### Agglomerative Hierarchical Clustering Technique

```{ put every point in a cluster by itself;
  for I := 1 to N-1 do {
      let C1, C2 be the most mergeable pair of clusters;
      create C, parent of C1 and C2;
      replace C1 and C2 by C in the set of clusters
  }
}
```
Various measures of "mergeable" used.
• Minimum distance between d1 in C1 and d2 in C2 (single link). Then this is basically Kruskal's algorithm for minimum spanning trees; runs in nearly linear time in the number of distance pairs. However, not a good clustering measure: it tends to chain clusters together.
• Average distance between d1, d2 in C1 union C2. Quickly computable.
• Maximum distance between d1 in C1 and d2 in C2 (= diameter of C1 and C2)

Characteristics

• Creates complete binary tree of clusters
• Various ways to determine "mergeability".
• Deterministic
• O(N²) running time.
Conflicting claims about quality of agglomerative vs. K-means.

### One-pass Clustering

```pick a starting point D in S;
CC := { { D } }   /* set of clusters: initially one cluster containing D */
for each Di in S do {
    C := the cluster in CC "closest" to Di;
    if similarity(C, Di) > threshold
        then add Di to C
        else add { Di } to CC;
}
```
Features:
• Running time: O(KN) (K = number of clusters)
• Fixed threshold
• Order dependent. Can rerun with different order.
• Disjoint, exhaustive clusters
• Low precision
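A runnable sketch of the pseudocode above. The similarity function is left open in the notes; negative squared distance to the cluster centroid is used here as an assumed stand-in.

```python
def one_pass_cluster(vectors, threshold, similarity):
    """One-pass clustering: assign each vector to the closest existing
    cluster if it is similar enough, else start a new cluster."""
    clusters = [[vectors[0]]]          # initially one cluster with the start point
    for v in vectors[1:]:
        best = max(clusters, key=lambda c: similarity(c, v))
        if similarity(best, v) > threshold:
            best.append(v)             # close enough: join the best cluster
        else:
            clusters.append([v])       # otherwise start a new cluster
    return clusters

def centroid_similarity(cluster, v):
    """Assumed similarity: negative squared distance to the cluster centroid."""
    c = [sum(xs) / len(cluster) for xs in zip(*cluster)]
    return -sum((a - b) ** 2 for a, b in zip(c, v))
```

The order dependence is visible here: permuting `vectors` can change which clusters get created.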

### STC (Suffix Tree Clustering) algorithm

Step 1: Construct a suffix tree. Suffix tree: S is a set of strings (in our case, each element of S is a sentence, viewed as a string of words). A compact tree containing all suffixes of the strings in S.
• Rooted directed tree.
• Each edge is labelled with a non-empty substring of some string in S. The label of node N = the concatenation of the labels of the edges from the root to N.
• Compact: no two edges out of the same node have labels that begin with the same word.
• For suffix Q in S, there is a node with label Q.
• Each node for a suffix Q of string Z in S labelled with index of Z and starting position of Q in Z.

Example: S = { "cat ate cheese", "mouse ate cheese too", "cat ate mouse too" }

Suffix tree can be constructed in linear time.

Step 2: Score nodes. For node N in suffix tree, let D(N) = set of documents in subtree of N. Let P(N) be the phrase labelling N. Define score of N, s(N) = |D(N)| * f(|P(N)|). f(1) is small; f(K) = K for K = 2 ... 6; f(K) = 6 for K > 6.
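The scoring rule can be written out directly; the value used for the "small" f(1) is an assumption (the notes only say it is small).

```python
def stc_score(num_docs, phrase_len):
    """s(N) = |D(N)| * f(|P(N)|): f penalizes one-word phrases and
    caps the benefit of phrases longer than 6 words."""
    if phrase_len == 1:
        f = 0.5            # "small" -- the exact value is an assumption
    elif phrase_len <= 6:
        f = phrase_len     # f(K) = K for K = 2 ... 6
    else:
        f = 6              # f(K) = 6 for K > 6
    return num_docs * f
```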

Step 3: Find clusters.

A. Construct an undirected graph whose vertices are nodes of the suffix tree. There is an edge between N1 and N2 if the following conditions hold:

• Either N1 or N2 is among the 500 top-scoring nodes.
• | D(N1) intersect D(N2) | / max(|D(N1)|, |D(N2)|) > 0.5.

B. Each connected component of this graph is a cluster. Score of cluster computed from scores of nodes, overlap function. Top 10 clusters returned. Cluster can be described using phrases of its ST nodes.

#### Example

Query "salsa" submitted to MetaCrawler (a meta-search engine that consults several search engines and combines their answers). Returns 246 documents in 15 clusters, of which the top are:
• Puerto Rico; Latin Music (8 docs)
• Follow Up Post; York Salsa Dancers (20 docs)
• music; entertainment; latin; artists (40 docs)
• hot; food; chiles; sauces; condiments; companies (79 docs)
• pepper; onion; tomatoes (41 docs)

#### Features

• Overlapping clusters.
• Non-exhaustive
• Linear time.
• High precision.

### Clustering using query log

(Beeferman and Berger, 2000) Log records query, links that were clicked through.
Create bipartite graph where query terms connect to links.
Cluster of pages = connected component.
Also gives cluster of query terms -- useful to suggest alternative queries to user.
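The connected-component step can be sketched with a breadth-first search over the bipartite click graph; the tagging of nodes as query vs. URL is just to keep the two sides apart.

```python
from collections import defaultdict, deque

def click_components(click_log):
    """click_log: iterable of (query, url) click-through pairs.
    Builds the bipartite query/URL graph and returns one
    (set of queries, set of urls) pair per connected component."""
    graph = defaultdict(set)
    for q, u in click_log:
        graph[("q", q)].add(("u", u))
        graph[("u", u)].add(("q", q))
    seen, components = set(), []
    for start in list(graph):
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:                       # BFS over one component
            node = queue.popleft()
            comp.append(node)
            for nbr in graph[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        components.append(({n for kind, n in comp if kind == "q"},
                           {n for kind, n in comp if kind == "u"}))
    return components
```

Each component's query side is the cluster of related query terms; the URL side is the cluster of pages.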

### Detection of identical or near-identical pages

Syntactic Clustering of the Web, by Broder, Glassman, Manasse, and Zweig, 1997 (WWW6)

Shingle: a sequence of K consecutive words.
Fix a shingle size K. Let S(A) be the set of shingles in A and let S(B) be the set of shingles in B.
The resemblance of A and B is defined as | S(A) intersect S(B) | / | S(A) union S(B) |.
The containment of A in B is defined as | S(A) intersect S(B) | / | S(A) |.
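These definitions translate directly; the default shingle size k=4 here is an arbitrary choice for illustration (the implementation discussed below uses 10).

```python
def shingles(words, k=4):
    """S(A): the set of k-word shingles of a document given as a word list."""
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a, b, k=4):
    """| S(A) intersect S(B) | / | S(A) union S(B) |"""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

def containment(a, b, k=4):
    """| S(A) intersect S(B) | / | S(A) |"""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa)
```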
Estimate the resemblance of sets A and B using random sampling:

```Step 1: Choose a value m and a random function P(W) from elements to integers.

Step 2: For set X, let V(X) = { x in X | P(x) mod m = 0} be a sample of X.
This is called the  sketch  of X.
Step 3: | V(A) intersect V(B) | / |V(A) union V(B) |
is an estimate of the resemblance of A and B.

|V(A) intersect V(B)| / |V(A)| is an estimate of the containment of A in B
```

Note: if you just sample A and sample B, get a substantial underestimate of the resemblance.
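The sampling steps can be sketched as follows; `zlib.crc32` stands in for the hash function P (an assumption for illustration — the actual implementation uses Rabin fingerprints).

```python
import zlib

def sketch(shingle_set, m):
    """V(X): keep the shingles whose value under P is 0 mod m."""
    def P(s):
        # stand-in hash; the real implementation uses fingerprints
        return zlib.crc32(" ".join(s).encode())
    return {s for s in shingle_set if P(s) % m == 0}

def estimated_resemblance(sa, sb, m):
    """| V(A) intersect V(B) | / | V(A) union V(B) |"""
    va, vb = sketch(sa, m), sketch(sb, m)
    union = va | vb
    if not union:
        return 0.0          # no sample survived; cannot estimate
    return len(va & vb) / len(union)
```

Note that both sets are sampled with the *same* function P, which is what avoids the underestimate mentioned above: a shared shingle is either kept in both sketches or dropped from both.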

Implementation uses shingle size = 10, m = 25, and a 40-bit sketch.

High-level algorithm
Step 1: Normalize by removing HTML and converting to lower-case.
Step 2: Calculate sketch of shingle set of each document.
Step 3: For each pair of documents, compare sketches to see if they exceed the resemblance threshold.
Step 4: Find connected components.

Implementation of Step 3. Obviously all-pairs comparison is not feasible; however, most pairs have no shingles in common. Also want to reduce disk I/O since this cannot be done in-memory.

```3.1. Use merge-sort to generate sorted file F1 of < shingle, docID > pairs.
3.2. Eliminate shingles that appear in only one docID (most).
3.3. for each shingle S in F1,
for each pair of docs docID1, docID2 associated with S in F1,
write record < docID1, docID2 > to F2.
F2 now has one record of form < docID1, docID2 > for each shingle
shared by docID1 and docID2.
3.4. Merge-sort F2, combining and keeping count of identical elements.
Generate file F3 of records < docID1, docID2, count of common shingles >
```
Further issues:
• Eliminate very common shingles (those appearing in more than 1000 documents); almost all of these are automatically generated by standard programs.
• Eliminate truly identical documents by computing and comparing fingerprint of entire document.

### Known clusters

You have a known system of clusters with examples (e.g. Yahoo). The problem is to place a new page into the proper cluster(s).

In machine learning, this is known as classification rather than clustering. Lots of algorithms.

### Evaluating Clusters

Formal measures: Normalize all vectors to length 1. Assume fixed number of clusters.

• Minimize average diameter of clusters. (Diameter of cluster C = max distance between two points in C)
• Minimize average distance between points in cluster.
• Minimize the mean-squared distance from each point to its centroid. This is the maximum-likelihood estimate if clusters are generated by a normal distribution around the centroid.
• Average over cluster C of longest edge in minimum spanning tree for C.

Variable number of clusters: Any of the above + suitable penalty for more clusters.

Formal measures test how well the clustering algorithm optimizes the measure, but not whether the measure reflects actual significance.

Ask subjects to cluster, compare systems of clusterings.

• Let E1, E2 be the sets of within-cluster pairs in cluster systems 1 and 2. Then measure |E1 intersect E2| / |E1 union E2|. (Note that, if the two systems each contain 2 clusters, then the above is at least 1/3, and on average 1/2.)
• For any cluster C in system 1, let m(C) be the closest cluster in system 2. Compute the weighted average, over all C, of the overlap between C and m(C).
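The first comparison measure above can be computed directly; clusters are given as collections of item ids.

```python
from itertools import combinations

def within_pairs(clustering):
    """E: the set of within-cluster pairs of a clustering
    (a list of clusters, each a collection of item ids)."""
    pairs = set()
    for cluster in clustering:
        pairs.update(combinations(sorted(cluster), 2))
    return pairs

def agreement(c1, c2):
    """|E1 intersect E2| / |E1 union E2| for two cluster systems."""
    e1, e2 = within_pairs(c1), within_pairs(c2)
    if not (e1 | e2):
        return 1.0          # no within-cluster pairs on either side
    return len(e1 & e2) / len(e1 | e2)
```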

Ask subjects to evaluate similarity of all pairs of documents. Correlate these similarities with clustering (e.g. average similarity within cluster / average similarity between clusters)

Ask subjects whether system of clustering seems natural or useful.

For clustering of responses to query:

• Max precision over all clusters (Hearst and Pedersen). (User model: the user can easily identify the most relevant cluster and examines only that one.)
• Variant: Sort clusters in decreasing order of precision. Examine them in order, down to a fixed number of documents. Compute precision over these.