Clustering Algorithms

• Decompositional (top-down)
• Agglomerative (bottom-up)

Any decompositional clustering algorithm can be made hierarchical by recursive application.

K-means algorithm

```K-means-cluster (in S : set of vectors; k : integer)
{ let C[1] ... C[k] be a random partition of S into k parts;
  repeat {
      for i := 1 to k {
          X[i] := centroid of C[i];
          C[i] := empty
      }
      for j := 1 to N {
          q := the index i for which X[i] is closest to S[j];
          add S[j] to C[q]
      }
  } until the change to C (or the change to X) is small enough.
}
```
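As a concrete sketch, the loop above might look like this in Python (all names are illustrative; it starts from random points rather than a random partition, and uses a fixed iteration count in place of the convergence test):

```python
import random

def kmeans(points, k, iterations=20):
    # Start from k random points rather than a random partition of S.
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point S[j] joins its closest centroid X[q].
        for p in points:
            q = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[q].append(p)
        # Update step: X[i] := centroid of C[i]; keep the old centroid if empty.
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = tuple(sum(xs) / len(c) for xs in zip(*c))
    return clusters, centroids
```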

Example: (Note: I have started here with random points rather than a random partition.)

Features:

• Objective function: minimize the sum of squared distances from each point to the centroid of its cluster.
For unit vectors, this is equivalent to maximizing the average dot product S[i] · S[j] over pairs S[i], S[j] in the same cluster.

The sum of squared distances decreases steadily over the iterations; K-means is a hill-climbing algorithm.

• Disjoint and exhaustive decomposition.
• For a fixed number of iterations, linear in N. A clever optimization avoids recomputing the closest centroid X[q] for points S[j] whose assignment is unlikely to change; the second loop is much shorter than O(kN) after the first couple of iterations.
• "Anytime" algorithm: C[1] ... C[k] is always a decomposition of S into convex subregions.
• Random starting point: multiple runs may give different results; choose the "best".
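The objective function above can be computed directly; a minimal Python sketch (function name is illustrative):

```python
def sse(clusters):
    # Sum over clusters of the squared distance from each point to its
    # cluster centroid -- the objective K-means hill-climbs on.
    total = 0.0
    for c in clusters:
        centroid = tuple(sum(xs) / len(c) for xs in zip(*c))
        total += sum(sum((a - b) ** 2 for a, b in zip(p, centroid)) for p in c)
    return total
```

This is also the natural score for comparing the results of multiple random restarts.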

Problems:

• Have to guess K.
• Local minimum. Example: in the diagram below, if K=2 and you start with centroids B and E, the algorithm converges on the two clusters {A,B,C}, {D,E,F}.

• Forces a disjoint and exhaustive decomposition, which may not fit the data (no overlapping clusters; no outliers left out).
• Starvation: a cluster C[j] may end up completely empty, or starved down to a single outlier.
• Assumes that clusters are spherical in vector space, and hence is particularly sensitive to coordinate changes (e.g. changes in weighting).

Variant 1: Update centroids incrementally. Claimed to give better results.

Variant 2: Bisecting K-means algorithm:

```for I := 1 to K-1 do {
    pick a leaf cluster C to split;
    for J := 1 to ITER do
        split C into two subclusters C1 and C2;
    choose the best of the above splits and make it permanent
}
```
Generates a binary clustering hierarchy. Claimed to work better than plain K-means.
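A rough Python sketch of the bisecting scheme, under the common assumption that each trial split uses basic 2-means (all names are illustrative; empty splits are possible in this naive version):

```python
import random

def two_means(points, iters=10):
    # One trial split: plain 2-means; returns (sse, (C1, C2)).
    cents = random.sample(points, 2)
    for _ in range(iters):
        parts = ([], [])
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in cents]
            parts[d[1] < d[0]].append(p)
        cents = [tuple(sum(xs) / len(part) for xs in zip(*part)) if part else cents[i]
                 for i, part in enumerate(parts)]
    err = sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in cents)
              for p in points)
    return err, parts

def bisecting_kmeans(points, k, trials=5):
    leaves = [list(points)]
    while len(leaves) < k:
        leaf = max(leaves, key=len)        # pick a leaf cluster to split
        leaves.remove(leaf)
        # try `trials` splits (ITER) and keep the best one
        best = min((two_means(leaf) for _ in range(trials)), key=lambda t: t[0])
        leaves.extend(best[1])
    return leaves
```

Recording the parent/child links instead of discarding them yields the binary hierarchy.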

Variant 3: Allow overlapping clusters: If distances from P to C1 and to C2 are close enough then put P in both clusters.

The full mean vector of a document cluster would get very long (one component per distinct word across all the documents). Solution: truncate the centroid after its first M terms. (Typically M = 50 or 100.)
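A sketch of the truncation idea, assuming documents are represented as sparse term-weight dictionaries (names are illustrative):

```python
from collections import Counter

def truncated_centroid(docs, m=50):
    # Mean of sparse term-weight vectors, keeping only the M heaviest
    # terms; M=50 is one of the typical values mentioned above.
    totals = Counter()
    for d in docs:
        totals.update(d)          # adds weights term by term
    mean = {t: w / len(docs) for t, w in totals.items()}
    top = sorted(mean, key=mean.get, reverse=True)[:m]
    return {t: mean[t] for t in top}
```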

Agglomerative Hierarchical Clustering Technique

```{ put every point in a cluster by itself;
  for I := 1 to N-1 do {
      let C1, C2 be the most mergeable pair of clusters;
      create C parent of C1, C2
  }
}
```
Various measures of "mergeable" used.
• Minimum distance between d1 in C1 and d2 in C2. Then this is basically Kruskal's algorithm for minimum spanning tree; runs in almost linear time. However, not a good clustering measure.
• Average distance between d1, d2 in C1 union C2. Quickly computable.
• Maximum distance between d1 in C1 and d2 in C2 (= diameter of C1 ∪ C2).
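A naive Python sketch of the agglomerative loop using the minimum-distance (single-link) measure; as written it rescans all pairs on every merge, so it is O(N³) and illustrative only (names are invented):

```python
def single_link(c1, c2, dist):
    # minimum distance between any d1 in C1 and d2 in C2
    return min(dist(p, q) for p in c1 for q in c2)

def agglomerate(points, dist):
    # Builds the binary tree as nested tuples of points.
    trees = list(points)             # current cluster trees
    members = [[p] for p in points]  # flat members of each tree
    while len(trees) > 1:
        # let C1, C2 be the most mergeable (closest) pair of clusters
        i, j = min(((i, j) for i in range(len(trees))
                    for j in range(i + 1, len(trees))),
                   key=lambda ij: single_link(members[ij[0]], members[ij[1]], dist))
        trees[i] = (trees[i], trees[j])   # create C parent of C1, C2
        members[i] = members[i] + members[j]
        del trees[j]; del members[j]
    return trees[0]
```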

Characteristics

• Creates complete binary tree of clusters
• Various ways to determine "mergeability".
• Deterministic
• O(N²) running time.
Conflicting claims about quality of agglomerative vs. K-means.

One-pass Clustering

```pick a starting point D in S;
CC := { { D } }   /* set of clusters: initially one cluster containing D */
for Di in S do {
    C := the cluster in CC "closest" to Di;
    if similarity(C, Di) > threshold
        then add Di to C
        else add { Di } to CC
}
```
Features:
• Running time: O(KN) (K = number of clusters)
• Fixed threshold
• Order dependent. Can rerun with different order.
• Disjoint, exhaustive clusters
• Low precision
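The one-pass loop can be sketched in Python as follows, using distance to the cluster centroid in place of the unspecified similarity (so the threshold test is inverted; all names are illustrative):

```python
def one_pass(points, threshold, dist):
    # Assign each point to the closest existing cluster if it is near
    # enough; otherwise start a new cluster. Order dependent.
    def centroid(c):
        return tuple(sum(xs) / len(c) for xs in zip(*c))
    clusters = [[points[0]]]          # initially one cluster with the start point
    for p in points[1:]:
        c = min(clusters, key=lambda c: dist(p, centroid(c)))
        if dist(p, centroid(c)) <= threshold:
            c.append(p)
        else:
            clusters.append([p])
    return clusters
```

Rerunning with the points in a different order can produce a different clustering, as noted above.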