Lecture 4: Clustering

Required Reading

Grouper: A Dynamic Clustering Interface to Web Search Results by Oren Zamir and Oren Etzioni
A Comparison of Document Clustering Techniques by Michael Steinbach, George Karypis, and Vipin Kumar

Applications

Cluster structure

Information source

Position of clustering module

Textual similarity criterion

Note: It is reasonably safe to presume that a cluster of documents forms a convex region geometrically. That is, if subject S includes DOC1 with word-frequency vector V1 and DOC2 with word-frequency vector V2, then there could exist a document in S with vector p*V1 + (1-p)*V2 for any p between 0 and 1.
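A tiny illustration of the convexity assumption; the vocabulary and counts below are made up for the example.

# Hypothetical term-count vectors over the vocabulary (cat, ate, cheese, mouse)
V1 = [2, 1, 1, 0]   # DOC1
V2 = [0, 1, 1, 2]   # DOC2

p = 0.5
blend = [p * a + (1 - p) * b for a, b in zip(V1, V2)]
print(blend)        # [1.0, 1.0, 1.0, 1.0] -- a plausible profile for a document on the same subject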

Source of clustering in search results

Clustering algorithms

Decompositional algorithms are almost always based on the vector space model (the only terms in which the high-level structure of a collection can be seen).

Any decompositional clustering algorithm can be made hierarchical by recursive application.

K-means algorithm

K-means-cluster (in S : set of vectors; k : integer)
{  let X[1] ... X[k] be k random points in S;
   let C[1] ... C[k] be empty sets;
   repeat {
            for j := 1 to N {          /* N = |S| */
               let X[q] be the closest to S[j] of X[1] ... X[k];
               add S[j] to C[q]
              }
            for i := 1 to k {
               X[i] := centroid of C[i];
               C[i] := empty
              } }
    until the change to X is small enough.
}
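A minimal runnable sketch of the same loop in Python; the sample data, k, iteration cap, and convergence tolerance are arbitrary choices for illustration.

import random

def kmeans(S, k, iters=100, tol=1e-6):
    """Plain k-means on a list of equal-length numeric vectors."""
    X = random.sample(S, k)                      # initial centroids: k random points of S
    for _ in range(iters):
        C = [[] for _ in range(k)]               # empty clusters
        for s in S:                              # assign each point to its nearest centroid
            q = min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(s, X[i])))
            C[q].append(s)
        newX = [[sum(col) / len(c) for col in zip(*c)] if c else X[i]   # centroid of each cluster
                for i, c in enumerate(C)]
        change = max(sum((a - b) ** 2 for a, b in zip(x, nx)) for x, nx in zip(X, newX))
        X = newX
        if change < tol:                         # stop when the centroids barely move
            break
    return C

print(kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], k=2))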

Features:

Problems:

Variant 1: Update centroids incrementally. Claims to give better results.

Variant 2: Bisecting K-means algorithm:

for I := 1 to K-1 do {
    pick a leaf cluster C to split;
    for J := 1 to ITER do split C into two subclusters C1 and C2;
    choose the best of the above splits and make it permanent
}
Generates a binary clustering hierarchy. Works better (so claimed) than K-means.
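A compact Python sketch of bisecting K-means. The splitting criterion (total within-cluster scatter), the rule for picking which leaf to split (the largest), and the ITER value are illustrative assumptions, not fixed by the algorithm as stated above.

import random

def two_means(C, iters=20):
    """Split cluster C (a list of vectors) into two subclusters with 2-means."""
    X = random.sample(C, 2)
    for _ in range(iters):
        parts = [[], []]
        for s in C:
            d = [sum((a - b) ** 2 for a, b in zip(s, x)) for x in X]
            parts[d.index(min(d))].append(s)
        X = [[sum(col) / len(p) for col in zip(*p)] if p else X[i]
             for i, p in enumerate(parts)]
    return parts

def scatter(C):
    """Sum of squared distances of the points of C to their centroid."""
    if not C:
        return 0.0
    m = [sum(col) / len(C) for col in zip(*C)]
    return sum(sum((a - b) ** 2 for a, b in zip(s, m)) for s in C)

def bisecting_kmeans(S, K, ITER=5):
    leaves = [list(S)]                            # start with one cluster holding everything
    for _ in range(K - 1):
        C = max(leaves, key=len)                  # pick a leaf cluster to split (here: the largest)
        trials = [two_means(C) for _ in range(ITER)]
        best = min(trials, key=lambda t: scatter(t[0]) + scatter(t[1]))
        leaves.remove(C)
        leaves.extend(best)                       # make the best split permanent
    return leaves

print(bisecting_kmeans([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]], K=3))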

Variant 3: Allow overlapping clusters: If distances from P to C1 and to C2 are close enough then put P in both clusters.

Agglomerative Hierarchical Clustering Technique

{ put every point in a cluster by itself
  for I := 1 to N-1 do {
     let C1,C2 be the most mergeable pair of clusters;
     create C parent of C1, C2
  } }
Various measures of "mergeable" are used. (If "mergeability" is the distance between the two closest elements of C1, C2, then this is Kruskal's algorithm for minimum spanning tree; however, this is not, in practice, one of the measures used.)
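A small Python sketch of the agglomerative loop. "Mergeability" here is single-link distance (distance between the two closest elements), chosen purely for concreteness; as noted above, other measures are more usual in practice. The data is made up.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_link(c1, c2):
    """Distance between the two closest elements of c1 and c2."""
    return min(dist(a, b) for a in c1 for b in c2)

def agglomerate(points):
    clusters = [[p] for p in points]              # every point starts in its own cluster
    tree = list(clusters)                         # record of all clusters ever created
    while len(clusters) > 1:
        # find the most mergeable (here: closest under single-link) pair
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]        # C, the parent of C1 and C2
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        tree.append(merged)
    return tree

print(agglomerate([[0, 0], [0, 1], [5, 5], [5, 6]]))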

Characteristics

Conflicting claims about quality of agglomerative vs. K-means.

One-pass Clustering

pick a starting point D in S;
CC := { { D } };  /* set of clusters: initially 1 cluster containing D */
for Di in S do {
    C := the cluster in CC "closest" to Di;
    if similarity(C,Di) > threshold
      then add Di to C
      else add { Di } to CC
}
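A Python sketch of the one-pass scheme, representing each cluster by its centroid and measuring "closeness" with cosine similarity against an arbitrary threshold; both choices are assumptions for illustration.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def centroid(c):
    return [sum(col) / len(c) for col in zip(*c)]

def one_pass(S, threshold=0.9):
    clusters = [[S[0]]]                           # initially one cluster holding the starting point
    for d in S[1:]:
        best = max(clusters, key=lambda c: cosine(centroid(c), d))
        if cosine(centroid(best), d) > threshold:
            best.append(d)                        # close enough: join the existing cluster
        else:
            clusters.append([d])                  # otherwise start a new cluster
    return clusters

print(one_pass([[1, 0, 0], [0.9, 0.1, 0], [0, 1, 1], [0, 0.9, 1.1]]))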
Features:

STC (Suffix Tree Clustering) algorithm

Step 1: Construct a suffix tree. Suffix tree: S is a set of strings. (In our case, each element of S is a sentence, viewed as a string of words.) A compact tree containing all suffixes of the strings in S.

Example: S = { "cat ate cheese", "mouse ate cheese too", "cat ate mouse too" }

Suffix tree can be constructed in linear time.
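A naive suffix-trie sketch over the example sentences, just to make the structure concrete. It runs in quadratic time and builds a trie rather than the compact, linear-time suffix tree the algorithm actually uses; each node also records which documents contain its phrase, which is the D(N) needed in Step 2.

def build_suffix_trie(sentences):
    """Naive suffix trie over word strings; each node records the documents
    in which its phrase occurs."""
    root = {"docs": set(), "children": {}}
    for doc_id, sentence in enumerate(sentences):
        words = sentence.split()
        for start in range(len(words)):          # insert every suffix of the sentence
            node = root
            for w in words[start:]:
                node = node["children"].setdefault(w, {"docs": set(), "children": {}})
                node["docs"].add(doc_id)
    return root

S = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
trie = build_suffix_trie(S)
# the node for the phrase "ate cheese" is shared by documents 0 and 1
print(trie["children"]["ate"]["children"]["cheese"]["docs"])   # {0, 1}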

Step 2: Score nodes. For node N in suffix tree, let D(N) = set of documents in subtree of N. Let P(N) be the phrase labelling N. Define score of N, s(N) = |D(N)| * f(|P(N)|). f(1) is small; f(K) = K for K = 2 ... 6; f(K) = 6 for K > 6.
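The scoring rule is easy to state directly in code; here D_size stands for |D(N)| and phrase_len for |P(N)|. The exact value used for f(1) ("small") is an assumption.

def f(phrase_len):
    """Weight on phrase length: single words count little, long phrases cap at 6."""
    if phrase_len == 1:
        return 0.5          # "small" -- the exact value is an illustrative assumption
    return min(phrase_len, 6)

def score(D_size, phrase_len):
    """s(N) = |D(N)| * f(|P(N)|)."""
    return D_size * f(phrase_len)

print(score(D_size=4, phrase_len=3))   # 4 * 3 = 12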

Step 3: Find clusters.

A. Construct an undirected graph whose vertices are the nodes of the suffix tree. There is an edge between N1 and N2 if the following conditions hold:

B. Each connected component of this graph is a cluster. Score of cluster computed from scores of nodes, overlap function. Top 10 clusters returned. Cluster can be described using phrases of its ST nodes.

Example

Query "salsa" submitted to MetaCrawler (consults with several search engines and combines answer.) Returns 246 documents in 15 clusters, of which the top are

Features

Clustering using query log

(Beeferman and Berger, 2000) The query log records each query and the links that were clicked through for it.
Create a bipartite graph in which query terms connect to the links clicked for them.
A cluster of pages = a connected component.
This also gives clusters of query terms -- useful for suggesting alternative queries to the user.
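A Python sketch of the idea: build the bipartite graph from (query, clicked URL) log records and take connected components. For simplicity the nodes here are whole queries rather than individual query terms, and the tiny log is invented.

from collections import defaultdict

# Invented click-through log: (query, clicked URL) pairs
log = [
    ("jaguar car", "http://jaguar.example/models"),
    ("jaguar price", "http://jaguar.example/models"),
    ("jaguar habitat", "http://zoo.example/jaguar"),
    ("big cats", "http://zoo.example/jaguar"),
]

# Bipartite adjacency: queries <-> pages
adj = defaultdict(set)
for q, url in log:
    adj[("q", q)].add(("p", url))
    adj[("p", url)].add(("q", q))

# Connected components by depth-first search; each component mixes queries and pages
seen, components = set(), []
for node in adj:
    if node in seen:
        continue
    stack, comp = [node], []
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        comp.append(n)
        stack.extend(adj[n] - seen)
    components.append(comp)

for comp in components:
    pages = [n[1] for n in comp if n[0] == "p"]
    queries = [n[1] for n in comp if n[0] == "q"]
    print("page cluster:", pages, "| query cluster:", queries)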

Detection of identical or near-identical pages

Syntactic Clustering of the Web by Broder, Glassman, Manasse, and Zweig, 1998 (WWW6)

Shingle: K consecutive words.
Fix a shingle size K. Let S(A) be the set of shingles in A and let S(B) be the set of shingles in B.
The resemblance of A and B is defined as | S(A) intersect S(B) | / | S(A) union S(B) |.
The containment of A in B is defined as | S(A) intersect S(B) | / | S(A) |.
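A direct computation of shingle sets, resemblance, and containment in Python, assuming whitespace tokenization and K = 3; both are illustrative choices.

def shingles(text, k=3):
    """Set of all runs of k consecutive words in text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a, b, k=3):
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

def containment(a, b, k=3):
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa)

A = "the cat ate the cheese and then the cat slept"
B = "the cat ate the cheese and then the mouse slept"
print(resemblance(A, B), containment(A, B))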
Estimate the resemblance of sets A and B using random sampling:

Step 1: Choose a value m and a random function P(W) from elements to integers.

Step 2: For set X, let V(X) = { x in X | P(x) mod m = 0 } be a sample of X.
        This is called the sketch of X.
Step 3: | V(A) intersect V(B) | / | V(A) union V(B) | is an estimate of the resemblance of A and B.
        | V(A) intersect V(B) | / | V(A) | is an estimate of the containment of A in B.

Note: if you sample A and B independently, rather than applying the same function P to both, you get a substantial underestimate of the resemblance.
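A Python sketch of the mod-m sampling. The "random" function P here is a CRC of the shingle text and m is kept small so the toy example produces nonempty samples; both are assumptions for illustration (the actual implementation works on shingle fingerprints).

import zlib

def shingles(text, k=3):
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def P(shingle):
    """Deterministic 'random' function from shingles to integers (an illustrative choice)."""
    return zlib.crc32(shingle.encode())

def sketch(shingle_set, m=4):
    """Keep only shingles whose value is divisible by m; the same P is used for every document."""
    return {s for s in shingle_set if P(s) % m == 0}

def estimated_resemblance(a, b, m=4):
    va, vb = sketch(shingles(a), m), sketch(shingles(b), m)
    return len(va & vb) / len(va | vb) if va | vb else 0.0

A = "the cat ate the cheese and then the cat slept on the mat near the door"
B = "the cat ate the cheese and then the mouse slept on the mat near the door"
print(estimated_resemblance(A, B))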

Implementation uses shingle size = 10, m = 25, 40 bit sketch.

High-level algorithm
Step 1: Normalize by removing HTML and converting to lower-case.
Step 2: Calculate sketch of shingle set of each document.
Step 3: For each pair of documents, compare sketches to see if they exceed the resemblance threshold.
Step 4: Find connected components.

Implementation of Step 3. Obviously an all-pairs comparison is not feasible; however, most pairs have no shingles in common. We also want to reduce disk I/O, since this cannot be done in memory.

3.1. Use merge-sort to generate a sorted file F1 of < shingle, docID > pairs.
3.2. Eliminate shingles that appear in only one docID (most of them).
3.3. For each shingle S in F1, for each pair of docs docID1, docID2 associated with S in F1, write the record < docID1, docID2 > to F2. F2 now has one record of the form < docID1, docID2 > for each shingle shared by docID1 and docID2.
3.4. Merge-sort F2, combining and keeping count of identical elements. Generate file F3 of records < docID1, docID2, count of common shingles >.
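An in-memory Python analogue of steps 3.1-3.4, using sorting and pair counting in place of the external merge-sorts; the file handling of the real implementation is elided and the F1 records are invented.

from collections import Counter
from itertools import combinations, groupby

# F1: (shingle, docID) pairs, here invented for illustration
F1 = sorted([
    ("cat ate cheese", 1), ("ate cheese too", 1),
    ("cat ate cheese", 2), ("ate cheese too", 2),
    ("mouse ate cheese", 3),
])

# Steps 3.2-3.3: for each shingle shared by several docs, emit one record per doc pair
F2 = []
for shingle, group in groupby(F1, key=lambda rec: rec[0]):
    docs = sorted(doc for _, doc in group)
    if len(docs) < 2:
        continue                         # shingles in only one doc are eliminated
    F2.extend(combinations(docs, 2))

# Step 3.4: combine identical pairs, keeping count of common shingles
F3 = Counter(F2)
print(F3)                                # Counter({(1, 2): 2})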
Further issues:

Known clusters

You have a known system of clusters with examples (e.g. Yahoo). The problem is to place a new page into the proper cluster(s).

In machine learning, this is known as classification rather than clustering. Lots of algorithms.