Decompositional algorithms are almost always based on the vector space model (the only terms in which to see high-level structure).
Any decompositional clustering algorithm can be made hierarchical by recursive application.
K-means-cluster(in S : set of vectors; k : integer) {
  let C[1] ... C[k] be a random partition of S into k parts;
  repeat {
    for i := 1 to k {
      X[i] := centroid of C[i];
      C[i] := empty
    }
    for j := 1 to N {
      q := index of the closest to S[j] of X[1] ... X[k];
      add S[j] to C[q]
    }
  } until the change to C (or the change to X) is small enough.
}
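The pseudocode above can be sketched in Python as follows (names are illustrative; an empty cluster is reseeded with a random point, a detail the pseudocode leaves open):

```python
import random

def kmeans(points, k, iters=20):
    """Plain K-means; 'points' is a list of equal-length numeric vectors."""
    # Random initial partition of S into k parts.
    shuffled = points[:]
    random.shuffle(shuffled)
    clusters = [shuffled[i::k] for i in range(k)]
    for _ in range(iters):
        # X[i] := centroid of C[i]; reseed the centroid if a cluster went empty.
        centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else random.choice(points)
            for c in clusters
        ]
        # Reassign each point to its closest centroid.
        new_clusters = [[] for _ in range(k)]
        for p in points:
            q = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            new_clusters[q].append(p)
        if new_clusters == clusters:   # change to C is "small enough"
            break
        clusters = new_clusters
    return clusters
```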
Features:
The sum of squared distances from points to their cluster centroids steadily decreases over the iterations.
Hill-climbing algorithm.
Problems:
Variant 1: Update centroids incrementally. Claims to give better results.
Variant 2: Bisecting K-means algorithm:
for I := 1 to K-1 do {
  pick a leaf cluster C to split;
  for J := 1 to ITER do
    split C into two subclusters C1 and C2;
  choose the best of the above splits and make it permanent
}
Generates a binary clustering hierarchy. Works better (so claimed) than K-means.
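A self-contained Python sketch of the bisecting loop, under the assumption that the "best" split minimizes the summed squared error (SSE) of the two subclusters and that the leaf with the largest SSE is chosen for splitting (the pseudocode leaves both choices open):

```python
import random

def two_means(points, iters=10):
    """Split 'points' into two clusters with a basic 2-means pass."""
    c1, c2 = random.sample(points, 2)  # two seed centroids
    for _ in range(iters):
        a, b = [], []
        for p in points:
            da = sum((x - y) ** 2 for x, y in zip(p, c1))
            db = sum((x - y) ** 2 for x, y in zip(p, c2))
            (a if da <= db else b).append(p)
        if not a or not b:
            break
        c1 = [sum(v) / len(a) for v in zip(*a)]
        c2 = [sum(v) / len(b) for v in zip(*b)]
    return a, b

def sse(cluster):
    """Summed squared distance from the cluster's points to its centroid."""
    if not cluster:
        return 0.0
    centroid = [sum(v) / len(cluster) for v in zip(*cluster)]
    return sum(sum((x - y) ** 2 for x, y in zip(p, centroid)) for p in cluster)

def bisecting_kmeans(points, k, trials=5):
    clusters = [points]
    while len(clusters) < k:
        # Pick the leaf cluster with the largest SSE to split.
        clusters.sort(key=sse)
        target = clusters.pop()
        # Try several splits (the inner ITER loop) and keep the best one.
        best = min((two_means(target) for _ in range(trials)),
                   key=lambda ab: sse(ab[0]) + sse(ab[1]))
        clusters.extend(best)
    return clusters
```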
Variant 3: Allow overlapping clusters: If distances from P to C1 and to C2 are close enough then put P in both clusters.
The full mean vector would get very long (its length is the number of different words across all the documents). Solution: truncate after the first M terms. (Typically M = 50 or 100.)
{
  put every point in a cluster by itself;
  for I := 1 to N-1 do {
    let C1, C2 be the most mergeable pair of clusters;
    create C parent of C1, C2
  }
}
Various measures of "mergeable" are used.
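One runnable interpretation of the loop above, using single-link distance (minimum pairwise distance between members) as the "mergeable" measure; the tree representation is an assumption for illustration:

```python
def agglomerate(points):
    """Bottom-up clustering; returns the root of a binary merge tree.
    Each node is a dict with its member 'points' and its two 'children'."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Put every point in a cluster by itself.
    nodes = [{"points": [p], "children": []} for p in points]
    while len(nodes) > 1:
        # Most "mergeable" pair: here, minimum single-link distance.
        i, j = min(
            ((i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))),
            key=lambda ij: min(dist(p, q)
                               for p in nodes[ij[0]]["points"]
                               for q in nodes[ij[1]]["points"]))
        # Create C, the parent of C1 and C2.
        parent = {"points": nodes[i]["points"] + nodes[j]["points"],
                  "children": [nodes[i], nodes[j]]}
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [parent]
    return nodes[0]
```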
Characteristics
pick a starting point D in S;
CC := { { D } };  /* set of clusters: initially one cluster containing D */
for Di in S do {
  C := the cluster in CC "closest" to Di;
  if similarity(C, Di) > threshold
    then add Di to C
    else add { Di } to CC;
}
Features:
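The single-pass procedure can be sketched as follows, assuming cosine similarity against a cluster centroid for the "closest" test (the pseudocode does not fix either choice):

```python
def single_pass(vectors, threshold):
    """One-pass clustering: each vector joins the closest existing cluster
    if similar enough, otherwise it starts a new cluster."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv) if nu and nv else 0.0

    clusters = []  # each cluster: {"centroid": [...], "members": [...]}
    for v in vectors:
        best = max(clusters, key=lambda c: cosine(c["centroid"], v), default=None)
        if best is not None and cosine(best["centroid"], v) > threshold:
            best["members"].append(v)
            n = len(best["members"])
            best["centroid"] = [sum(col) / n for col in zip(*best["members"])]
        else:
            clusters.append({"centroid": list(v), "members": [v]})
    return clusters
```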
Example: S = { "cat ate cheese", "mouse ate cheese too", "cat ate mouse too" }
Suffix tree can be constructed in linear time.
Step 2: Score nodes. For node N in suffix tree, let D(N) = set of documents in subtree of N. Let P(N) be the phrase labelling N. Define score of N, s(N) = |D(N)| * f(|P(N)|). f(1) is small; f(K) = K for K = 2 ... 6; f(K) = 6 for K > 6.
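The scoring rule can be written directly; the value returned for single-word phrases is an assumption, since the text only says f(1) is "small":

```python
def phrase_weight(k):
    """f(|P(N)|): single words score low, phrases of length 2..6 score
    linearly, and longer phrases plateau at 6."""
    if k == 1:
        return 0.5   # "small" -- the exact value is an assumption
    return min(k, 6)

def node_score(num_docs, phrase_len):
    # s(N) = |D(N)| * f(|P(N)|)
    return num_docs * phrase_weight(phrase_len)
```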
Step 3: Find clusters.
A. Construct an undirected graph whose vertices are nodes of the suffix tree. There is an edge between N1 and N2 if the following conditions hold:
B. Each connected component of this graph is a cluster. Score of cluster computed from scores of nodes, overlap function. Top 10 clusters returned. Cluster can be described using phrases of its ST nodes.
Shingle: a sequence of K consecutive words.
Fix a shingle size K. Let S(A) be the set of shingles in A and let S(B) be the set of shingles in B.
The resemblance of A and B is defined as | S(A) intersect S(B) | / | S(A) union S(B) |.
The containment of A in B is defined as | S(A) intersect S(B) | / | S(A) |.
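These definitions translate directly into Python (shingles represented as tuples of words):

```python
def shingles(text, k):
    """Set of all runs of k consecutive words in the text."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a, b, k):
    # |S(A) intersect S(B)| / |S(A) union S(B)|
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

def containment(a, b, k):
    # |S(A) intersect S(B)| / |S(A)|
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa)
```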
Estimate the resemblance of sets A and B using random sampling:
Step 1: Choose a value m and a random function P(W) from elements to integers.
Step 2: For set X, let V(X) = { x in X | P(x) mod m = 0 } be a sample of X. This is called the sketch of X.
Step 3: | V(A) intersect V(B) | / |V(A) union V(B) | is an estimate of the resemblance of A and B.
| V(A) intersect V(B) | / | V(A) | is an estimate of the containment of A in B.
Note: if you just sample A and sample B, get a substantial underestimate of the resemblance.
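A small sketch of the sampling scheme; MD5 stands in here for the random function P, which is an implementation assumption:

```python
import hashlib

def sketch(shingle_set, m):
    """V(X): keep the shingles whose hash is 0 mod m."""
    def p(x):
        return int(hashlib.md5(repr(x).encode()).hexdigest(), 16)
    return {x for x in shingle_set if p(x) % m == 0}

def estimated_resemblance(va, vb):
    """Estimate resemblance from the two sketches, not the full sets."""
    union = va | vb
    return len(va & vb) / len(union) if union else 0.0
```

Note that both sets are sampled with the same P and m, so shared shingles survive in both sketches together; sampling each set independently is what produces the underestimate mentioned above.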
The implementation uses shingle size 10, m = 25, and 40-bit sketches.
High-level algorithm
Step 1: Normalize by removing HTML and converting to lower-case.
Step 2: Calculate sketch of shingle set of each document.
Step 3: For each pair of documents, compare sketches to see if they exceed the resemblance threshold.
Step 4: Find connected components.
Implementation of Step 3. Obviously an all-pairs comparison is not feasible; however, most pairs have no shingles in common. We also want to reduce disk I/O, since this cannot be done in memory.
3.1. Use merge-sort to generate a sorted file F1 of < shingle, docID > pairs.
3.2. Eliminate shingles that appear in only one docID (most of them).
3.3. For each shingle S in F1, for each pair of docs docID1, docID2 associated with S in F1, write the record < docID1, docID2 > to F2. F2 now has one record of the form < docID1, docID2 > for each shingle shared by docID1 and docID2.
3.4. Merge-sort F2, combining and keeping count of identical elements. Generate file F3 of records < docID1, docID2, count of common shingles >.
Further issues:
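An in-memory Python illustration of steps 3.1-3.4 (the real implementation works on sorted disk files, but the record flow is the same):

```python
from collections import Counter
from itertools import combinations

def common_shingle_counts(doc_shingles):
    """doc_shingles maps docID -> set of shingles.
    Returns F3-style records (docID1, docID2, count of shared shingles)."""
    # 3.1: sorted "file" F1 of <shingle, docID> pairs.
    f1 = sorted((s, d) for d, ss in doc_shingles.items() for s in ss)
    by_shingle = {}
    for s, d in f1:
        by_shingle.setdefault(s, []).append(d)
    # 3.2: drop shingles that appear in only one docID.
    # 3.3: emit one <docID1, docID2> record per shared shingle (F2),
    # 3.4: combining identical records with a count.
    f2 = Counter()
    for s, docs in by_shingle.items():
        if len(docs) > 1:
            for d1, d2 in combinations(sorted(docs), 2):
                f2[(d1, d2)] += 1
    return [(d1, d2, n) for (d1, d2), n in sorted(f2.items())]
```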
In machine learning, this is known as classification rather than clustering. Lots of algorithms.
Formal measures: Normalize all vectors to length 1. Assume fixed number of clusters.
Variable number of clusters: Any of the above + suitable penalty for more clusters.
Formal measures test the adequacy of the clustering algorithm, but not the relevance of the measures to actual significance.
Ask subjects to cluster, compare systems of clusterings.
Ask subjects to evaluate similarity of all pairs of documents. Correlate these similarities with clustering (e.g. average similarity within cluster / average similarity between clusters)
Ask subjects whether system of clustering seems natural or useful.
For clustering of responses to query: