A Comparison of Document Clustering Techniques by Michael Steinbach, George Karypis, and Vipin Kumar

- Structuring search results
- Suggesting related pages
- Automatic directory construction / update.
- Finding near-identical pages:
  - Finding mirror pages (e.g. for propagating updates)
  - Eliminating near-duplicates from results pages
  - Plagiarism detection
  - Lost and found (finding identical pages at different URLs at different times)

- Hierarchical vs. flat.
- Overlap:
- Disjoint partitioning. E.g. partition congressmen by state.
- Multiple dimensions of partitioning, each disjoint. E.g. partition congressmen by state; by party; by House vs. Senate.
- Arbitrary overlap. E.g. partition articles by author; geographical regions (France, the Alps, German-speaking regions). Problem: no natural bound on the number of categories.

- Exhaustive vs. non-exhaustive.

Note: disjoint and exhaustive decomposition = tree.

Note the difference between a geometric point that is not part of any cluster (an unclustered point) and a document that is not part of any cluster (general subject matter). The truer analogy is perhaps: document = region.

- Outliers: What to do?
- How many clusters? How large?

- Text content
- Links?
- Usage: Clickthrough logs give association between query and page.

- At indexing time
- At query time applied to pages.
- At query time applied to snippets. Turns out that, experimentally, most clustering algorithms do almost as well given only snippets; in fact, some do better with snippets than with the whole text.

- Vector measure. Each document is treated as a term vector, normalized to length 1; similarity is the cosine of the angle between the vectors.
- Overlap measure. Similarity of documents Q and R is |Q intersect R| / |Q union R|
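The two measures above, sketched in Python over raw term counts (a simplification; real systems typically apply tf-idf weighting before computing the cosine):

```python
from collections import Counter
from math import sqrt

def cosine_similarity(doc_q, doc_r):
    """Vector measure: represent each document as a term-count vector,
    normalize to length 1, and take the dot product (the cosine)."""
    q, r = Counter(doc_q.split()), Counter(doc_r.split())
    dot = sum(q[t] * r[t] for t in q)
    norm = sqrt(sum(v * v for v in q.values())) * sqrt(sum(v * v for v in r.values()))
    return dot / norm if norm else 0.0

def overlap_similarity(doc_q, doc_r):
    """Overlap measure: |Q intersect R| / |Q union R| on the word sets."""
    q, r = set(doc_q.split()), set(doc_r.split())
    return len(q & r) / len(q | r) if q | r else 0.0
```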

- Polysemy. "bat", "Washington", "Banks".

Clustering criterion: clear in principle, sometimes hard and sometimes easy in practice (e.g. "President Bush").

- Multiple aspects of a single topic: practically any rich topic.
- Many possible dimensions for clusters.
- Little agreement among human subjects.
- No ideal form of clustering.

Question: Can we generate a system of clusters that seems plausible and useful? Ultimately this amounts to the general problem of Web page / information structuring.

- Decompositional (top-down)
- Agglomerative (bottom-up)

Decompositional algorithms are almost always based on the vector space model (it provides the only terms in which to see high-level structure).

Any decompositional clustering algorithm can be made hierarchical by recursive application.

K-means-cluster(in S : set of vectors; k : integer) {
    let X[1] ... X[k] be k random points in S;
    repeat {
        for j := 1 to N {
            q := index of the closest of X[1] ... X[k] to S[j];
            add S[j] to C[q];
        }
        for i := 1 to k {
            X[i] := centroid of C[i];
            C[i] := empty;
        }
    } until the change to X is small enough.
}
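A direct runnable rendering of the pseudocode above, on points represented as tuples (squared Euclidean distance; a starved cluster keeps its old centroid, one common fix):

```python
import random

def k_means(S, k, iters=100):
    """Basic K-means: returns a list of k clusters (lists of points)."""
    X = random.sample(S, k)                      # k random starting centroids
    C = []
    for _ in range(iters):
        C = [[] for _ in range(k)]
        for p in S:                              # assign each point to nearest centroid
            q = min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, X[i])))
            C[q].append(p)
        # recompute centroids; an empty (starved) cluster keeps its old centroid
        X = [tuple(sum(col) / len(c) for col in zip(*c)) if c else X[i]
             for i, c in enumerate(C)]
    return C
```

A fixed iteration count stands in for the "until the change to X is small enough" test; for fixed iterations the running time is linear in N.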

Features:

- Objective function: sum of the squared distance from each point to centroid of cluster (= average cosine of angles between points) Steadily decreases over the iterations. Hill-climbing algorithm.
- Disjoint and exhaustive decomposition.
- For fixed number of iterations, linear in N. Clever optimization reduces recomputation of X[q] if small change to S[j]. Second loop much shorter than O(kN) after the first couple of iterations.
- "Anytime" algorithm: C[1] ... C[k] always form a decomposition of S into convex subregions.
- Random starting point: Multiple runs may give different results, choose "best"

Problems:

- Have to guess K.
- Local minimum. Example: In the diagram below, if K=2 and you start with centroids B and E, the algorithm converges on the two clusters {A,B,C}, {D,E,F}.
- Disjoint and exhaustive decomposition.
- Starvation: complete starvation of a cluster (it ends up empty), or starvation down to a single outlier.
- Assumes that clusters are spherical in vector space. Hence particularly sensitive to coordinate changes (e.g. changes in weighting)

Variant 1: Update centroids incrementally. Claims to give better results.

Variant 2: Bisecting K-means algorithm:

for I := 1 to K-1 do {
    pick a leaf cluster C to split;
    for J := 1 to ITER do
        split C into two subclusters C1 and C2;
    choose the best of the above splits and make it permanent;
}

Generates a binary clustering hierarchy. Works better (so claimed) than K-means.
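An in-memory sketch of the bisecting algorithm. The notes leave "pick a leaf cluster" and "best" open; here the largest leaf is split and the best of several trial splits is the one with the lowest total sum of squared error, both assumed choices:

```python
import random

def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    cen = tuple(sum(col) / len(cluster) for col in zip(*cluster))
    return sum(sum((a - b) ** 2 for a, b in zip(p, cen)) for p in cluster)

def two_means(S, iters=20):
    """One plain K-means run with k = 2; returns the two subclusters."""
    X = random.sample(S, 2)
    C = [[], []]
    for _ in range(iters):
        C = [[], []]
        for p in S:
            d0 = sum((a - b) ** 2 for a, b in zip(p, X[0]))
            d1 = sum((a - b) ** 2 for a, b in zip(p, X[1]))
            C[0 if d0 <= d1 else 1].append(p)
        X = [tuple(sum(col) / len(c) for col in zip(*c)) if c else X[i]
             for i, c in enumerate(C)]
    return C

def bisecting_k_means(S, k, trials=5):
    """Split a leaf cluster K-1 times, keeping the best of `trials` splits."""
    leaves = [list(S)]
    while len(leaves) < k:
        leaves.sort(key=len)
        target = leaves.pop()                  # pick a leaf cluster to split (largest)
        best = min((two_means(target) for _ in range(trials)),
                   key=lambda cs: sum(sse(c) for c in cs if c))
        leaves.extend(c for c in best if c)    # discard a degenerate empty half
    return leaves
```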

Variant 3: Allow overlapping clusters: If distances from P to C1 and to C2 are close enough then put P in both clusters.

{
    put every point in a cluster by itself;
    for I := 1 to N-1 do {
        let C1, C2 be the most mergeable pair of clusters;
        create C parent of C1, C2;
    }
}

Various measures of "mergeability" are used. (If "mergeability" is the distance between the two closest elements of C1 and C2, then this is Kruskal's algorithm for minimum spanning tree.)
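The loop above in runnable form, using single-link mergeability (distance between the two closest members, the measure mentioned in the Kruskal remark). Leaves are points (tuples); each merge creates a parent node represented as a 2-element list:

```python
def agglomerate(points):
    """Bottom-up clustering: repeatedly merge the most mergeable pair,
    returning the root of the resulting binary tree."""
    def leaves_of(c):
        return [c] if isinstance(c, tuple) else leaves_of(c[0]) + leaves_of(c[1])

    def single_link(c1, c2):
        # squared distance between the two closest members of c1 and c2
        return min(sum((a - b) ** 2 for a, b in zip(p, q))
                   for p in leaves_of(c1) for q in leaves_of(c2))

    clusters = list(points)                    # every point in a cluster by itself
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        parent = [clusters[i], clusters[j]]    # create C, parent of C1 and C2
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [parent]
    return clusters[0]
```

This naive version recomputes pair distances from scratch, so it is slower than the O(N^2) bound given below; a real implementation caches the distance matrix.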

Characteristics

- Creates complete binary tree of clusters
- Various ways to determine "mergeability".
- Deterministic
- O(N^2) running time.

pick a starting point D in S;
CC := { { D } };   /* Set of clusters: initially 1 cluster containing D */
for Di in S do {
    C := the cluster in CC "closest" to Di;
    if similarity(C, Di) > threshold
        then add Di to C;
        else add { Di } to CC;
}

Features:

- Running time: O(KN) (K = number of clusters)
- Fixed threshold
- Order dependent. Can rerun with different order.
- Disjoint, exhaustive clusters
- Low precision
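A runnable sketch of the single pass, over word-set documents. The notes leave "closest" open; here a cluster is represented by the union of its members' words and compared by set overlap, an assumed choice:

```python
def single_pass_cluster(docs, threshold):
    """One pass: join the most similar existing cluster if it clears the
    threshold, otherwise start a new cluster. Order dependent by design."""
    clusters = []                              # each cluster is a list of word sets
    for doc in docs:
        words = set(doc.split())

        def sim(cluster):
            rep = set().union(*cluster)        # cluster representative: union of members
            return len(words & rep) / len(words | rep)

        best = max(clusters, key=sim, default=None)
        if best is not None and sim(best) > threshold:
            best.append(words)
        else:
            clusters.append([words])
    return clusters
```

Reversing the input order can change the output, which is the order dependence noted above.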

- Rooted directed tree.
- Each edge is labelled with a non-empty substring of S. Label of node N = concatenation of labels of edges from root to N.
- Compact: no two edges out of the same node have edge-labels that begin with the same word.
- For suffix Q in S, there is a node with label Q.
- Each node for a suffix Q of a string Z in S is labelled with the index of Z and the starting position of Q in Z.

Example: S = { "cat ate cheese", "mouse ate cheese too", "cat ate mouse too" }

Suffix tree can be constructed in linear time.

**Step 2:** Score nodes.
For node N in suffix tree, let D(N) = set of documents in subtree of N.
Let P(N) be the phrase labelling N. Define score of N, s(N) =
|D(N)| * f(|P(N)|). f(1) is small; f(K) = K for K = 2 ... 6;
f(K) = 6 for K > 6.
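The scoring function as code. The notes say only that f(1) is small; f(1) = 0.5 here is an assumed value:

```python
def score(num_docs, phrase_words):
    """s(N) = |D(N)| * f(|P(N)|): discount one-word phrases, give linear
    credit for phrases of 2-6 words, and cap the credit beyond that."""
    if phrase_words == 1:
        f = 0.5          # "f(1) is small" -- assumed value
    elif phrase_words <= 6:
        f = phrase_words
    else:
        f = 6
    return num_docs * f
```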

**Step 3:** Find clusters.

A. Construct an undirected graph whose vertices are nodes of the suffix tree. There is an edge between N1 and N2 if the following conditions hold:

- Either N1 or N2 is among the 500 top-scoring nodes.
- | D(N1) intersect D(N2) | / max(|D(N1)|, |D(N2)|) > 0.5.

B. Each connected component of this graph is a cluster. Score of cluster computed from scores of nodes, overlap function. Top 10 clusters returned. Cluster can be described using phrases of its ST nodes.
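Step 3 can be sketched as follows, with union-find over the overlap graph (node ids, their document sets, and the top-scoring subset are passed in as plain Python values):

```python
def stc_clusters(nodes, top_ids, docs_of):
    """Connect suffix-tree nodes whose document sets overlap heavily, then
    read off connected components as clusters. `docs_of[n]` is D(n)."""
    def connected(n1, n2):
        if n1 not in top_ids and n2 not in top_ids:
            return False                      # at least one endpoint must be top-scoring
        d1, d2 = docs_of[n1], docs_of[n2]
        return len(d1 & d2) / max(len(d1), len(d2)) > 0.5

    parent = {n: n for n in nodes}            # union-find forest

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]     # path halving
            n = parent[n]
        return n

    for i, n1 in enumerate(nodes):
        for n2 in nodes[i + 1:]:
            if connected(n1, n2):
                parent[find(n1)] = find(n2)

    comps = {}
    for n in nodes:
        comps.setdefault(find(n), []).append(n)
    return list(comps.values())
```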

- Puerto Rico; Latin Music (8 docs)
- Follow Up Post; York Salsa Dancers (20 docs)
- music; entertainment; latin; artists (40 docs)
- hot; food; chiles; sauces; condiments; companies (79 docs)
- pepper; onion; tomatoes (41 docs)

- Overlapping clusters.
- Non-exhaustive
- Linear time.
- High precision.

Create bipartite graph where query terms connect to links.

Cluster of pages = connected component.

Also gives cluster of query terms -- useful to suggest alternative queries to user.
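A minimal sketch of this idea: build the bipartite graph from (query term, clicked page) pairs and take connected components, each of which groups related terms with related pages:

```python
from collections import defaultdict, deque

def cocitation_clusters(clicks):
    """`clicks` is an iterable of (term, page) pairs from a clickthrough log.
    Returns the connected components of the bipartite term/page graph."""
    adj = defaultdict(set)
    for term, page in clicks:                 # tag nodes so terms and pages can't collide
        adj[('t', term)].add(('p', page))
        adj[('p', page)].add(('t', term))

    seen, comps = set(), []
    for node in adj:                          # BFS from each unvisited node
        if node in seen:
            continue
        comp, queue = [], deque([node])
        seen.add(node)
        while queue:
            n = queue.popleft()
            comp.append(n)
            for m in adj[n] - seen:
                seen.add(m)
                queue.append(m)
        comps.append(comp)
    return comps
```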

Shingle: a sequence of K consecutive words.

Fix a shingle size K. Let S(A) be the set of shingles in A and
let S(B) be the set of shingles in B.

The *resemblance* of A and B is defined as |S(A) intersect S(B)| / |S(A) union S(B)|.

The *containment* of A in B is defined as |S(A) intersect S(B)| / |S(A)|.
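The definitions above in code (K = 4 here so the example documents can stay short; the implementation described below uses K = 10). Each document is assumed to have at least K words:

```python
def shingles(text, k=4):
    """S(A): the set of all runs of k consecutive words in the document."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a, b, k=4):
    """|S(A) intersect S(B)| / |S(A) union S(B)|"""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

def containment(a, b, k=4):
    """|S(A) intersect S(B)| / |S(A)|"""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa)
```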

Estimate the resemblance of sets A and B using random sampling:

Step 1: Choose a value m and a random function P(W) from elements to integers.

Step 2: For set X, let V(X) = { x in X | P(x) mod m = 0 } be a sample of X. This is called the *sketch* of X.

Step 3: | V(A) intersect V(B) | / |V(A) union V(B) | is an estimate of the resemblance of A and B.

|V(A) intersect V(B)| / |V(A)| is an estimate of the containment of A in B.

Note: if you just sample A and sample B independently, you get a substantial underestimate of the resemblance.
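The sampling steps as code, with CRC32 standing in for the random function P and m = 4 (assumed, small values for the example; the implementation described below uses m = 25). Applying the *same* P to every set is what makes the samples consistent, which is the point of the note above:

```python
import zlib

def sketch(shingle_set, m=4):
    """V(X) = { x in X | P(x) mod m = 0 }. The same hash P is applied to
    every document, so a shared shingle is either kept in both sketches
    or dropped from both -- unlike independent random samples."""
    P = lambda x: zlib.crc32(repr(x).encode())
    return {x for x in shingle_set if P(x) % m == 0}

def estimated_resemblance(A, B, m=4):
    """|V(A) intersect V(B)| / |V(A) union V(B)|"""
    vA, vB = sketch(A, m), sketch(B, m)
    return len(vA & vB) / len(vA | vB) if vA | vB else 0.0
```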

Implementation uses shingle size = 10, m = 25, and 40-bit sketches.

High-level algorithm

Step 1: Normalize by removing HTML and converting to lower-case.

Step 2: Calculate sketch of shingle set of each document.

Step 3: For each pair of documents, compare sketches to see if they
exceed the resemblance threshold.

Step 4: Find connected components.

Implementation of Step 3. Obviously all-pairs comparison is not feasible; however, most pairs have no shingles in common. We also want to reduce disk I/O, since this cannot be done in memory.

3.1. Use merge-sort to generate a sorted file F1 of <shingle, docID> pairs.

3.2. Eliminate shingles that appear in only one docID (most of them).

3.3. For each shingle S in F1, for each pair of docs docID1, docID2 associated with S in F1, write record <docID1, docID2> to F2. F2 now has one record of the form <docID1, docID2> for each shingle shared by docID1 and docID2.

3.4. Merge-sort F2, combining and keeping count of identical elements. Generate file F3 of records <docID1, docID2, count of common shingles>.

Further issues:

- Eliminate very common shingles (those appearing in more than 1000 documents). Almost all of these are automatically generated by standard programs.
- Eliminate truly identical documents by computing and comparing fingerprint of entire document.
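Steps 3.1-3.4 in miniature, done in memory with Python sorts and dicts where the real pipeline uses external merge-sorts over files F1, F2, F3:

```python
from collections import Counter
from itertools import combinations

def common_shingle_counts(doc_shingles):
    """`doc_shingles` maps docID -> set of shingles. Returns a Counter
    mapping each (docID1, docID2) pair to its count of common shingles."""
    # 3.1: sorted list of <shingle, docID> pairs (stands in for file F1)
    F1 = sorted((sh, doc) for doc, shs in doc_shingles.items() for sh in shs)
    by_shingle = {}
    for sh, doc in F1:
        by_shingle.setdefault(sh, []).append(doc)
    F3 = Counter()
    for sh, docs in by_shingle.items():
        if len(docs) < 2:
            continue                          # 3.2: shingles in only one docID (most)
        for d1, d2 in combinations(sorted(docs), 2):
            F3[(d1, d2)] += 1                 # 3.3 + 3.4: emit and count pair records
    return F3
```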

In machine learning, this is known as *classification* rather than clustering. Lots of algorithms.