First, the computation of the new centroid requires only the old centroid, the points added to the cluster, the points removed from the cluster, and the number of points in the cluster.
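
For instance, a minimal sketch in Python (the function and variable names are my own) of updating a centroid incrementally from exactly those quantities:

```python
import numpy as np

def update_centroid(old_centroid, old_count, added, removed):
    """Incrementally recompute a cluster's centroid without touching
    the unchanged members. `added`/`removed` are lists of point vectors."""
    new_count = old_count + len(added) - len(removed)
    total = old_centroid * old_count          # sum of the old members
    for p in added:
        total += p
    for p in removed:
        total -= p
    return total / new_count, new_count
```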

Second, computing the change to the clusters caused by moving a centroid can be made more efficient as follows:

Let C_{J}(T) be the centroid of the Jth cluster on the Tth iteration,
and let X_{I} be the Ith point. By the triangle inequality,

dist(X_{I},C_{J}(T+1)) <= dist(X_{I},C_{J}(T)) + dist(C_{J}(T),C_{J}(T+1)).

Therefore you can maintain two arrays:

RADIUS(J) = an overestimate of the maximum value of
dist(X_{I},C_{J}), where X_{I} is in cluster J.

DIST(J,Z) = an underestimate of the minimum value of
dist(X_{I},C_{Z}), where X_{I} is in cluster J.

which you update as follows:

RADIUS(J) := RADIUS(J) + dist(C_{J}(T),C_{J}(T+1));

DIST(J,Z) := DIST(J,Z) - dist(C_{Z}(T),C_{Z}(T+1));
then as long as RADIUS(J) < DIST(J,Z), you can be sure that none of
the points in cluster J should be moved to cluster Z. (You update these
values more exactly whenever a point *is* moved.)
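
A sketch of this bookkeeping (names and data layout are my own assumptions; the lower bound is clamped at zero so that it remains a valid underestimate):

```python
import numpy as np

def update_bounds(radius, dist, old_c, new_c):
    """radius[j]: overestimate of max dist(X_i, C_j) over X_i in cluster j.
    dist[j][z]: underestimate of min dist(X_i, C_z) over X_i in cluster j."""
    k = len(old_c)
    move = [float(np.linalg.norm(new_c[j] - old_c[j])) for j in range(k)]
    for j in range(k):
        radius[j] += move[j]                             # upper bound can only grow
        for z in range(k):
            dist[j][z] = max(0.0, dist[j][z] - move[z])  # lower bound can only shrink

def cluster_is_safe(radius, dist, j, z):
    # If this holds, no point in cluster j should be moved to cluster z.
    return radius[j] < dist[j][z]
```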

**Formal measures:**
Normalize all vectors to length 1. Assume fixed number of clusters.

- Minimize average diameter of clusters. (Diameter of cluster C = max distance between two points in C)
- Minimize average distance between points in cluster.
- Minimize mean squared distance from point to centroid. This is the maximum likelihood estimate if clusters are generated by normal distributions around their centroids.
- Average over cluster C of longest edge in minimum spanning tree for C.

Variable number of clusters: Any of the above + suitable penalty for more clusters.
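
As a concrete illustration, a sketch (in Python with numpy; names are my own) of two of the fixed-cluster measures above:

```python
import numpy as np

def diameter(cluster):
    # max distance between any two points in the cluster (O(n^2))
    return max(np.linalg.norm(p - q) for p in cluster for q in cluster)

def avg_diameter(clusters):
    return np.mean([diameter(c) for c in clusters])

def mean_sq_dist_to_centroid(cluster):
    c = np.mean(cluster, axis=0)
    return np.mean([np.linalg.norm(p - c) ** 2 for p in cluster])
```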

Formal measures test the adequacy of a clustering algorithm relative to the measure, but not whether the measure itself reflects what is actually significant.

Ask subjects to cluster; compare systems of clusterings.

- Let E1, E2 be the sets of within-cluster arcs in cluster systems 1, 2. Then measure |E1 intersect E2| / |E1 union E2|. (Note that, if the two systems both contain 2 clusters, then the above is at least 1/3; on average 1/2.) See the sketch after this list.
- For any cluster C in system 1, let m(C) be closest cluster in system 2. Compute weighted average over C of similarity(m(C),C).
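
A sketch of the first measure, treating each clustering as the set of unordered within-cluster pairs (names are hypothetical):

```python
from itertools import combinations

def within_cluster_arcs(clustering):
    """clustering: list of clusters, each a collection of item ids."""
    arcs = set()
    for cluster in clustering:
        arcs.update(combinations(sorted(cluster), 2))
    return arcs

def agreement(sys1, sys2):
    e1, e2 = within_cluster_arcs(sys1), within_cluster_arcs(sys2)
    return len(e1 & e2) / len(e1 | e2)
```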

Ask subjects to evaluate the similarity of all pairs of documents. Correlate these similarities with the clustering (e.g. average similarity within clusters / average similarity between clusters).
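
For example, the ratio could be computed like this (a sketch; `sim` holds the subjects' pairwise similarity judgments and is my own assumed data structure):

```python
def similarity_ratio(sim, clusters):
    """sim[(a, b)]: judged similarity of documents a and b.
    Returns avg within-cluster similarity / avg between-cluster similarity."""
    label = {d: i for i, c in enumerate(clusters) for d in c}
    within, between = [], []
    for (a, b), s in sim.items():
        (within if label[a] == label[b] else between).append(s)
    return (sum(within) / len(within)) / (sum(between) / len(between))
```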

Ask subjects whether system of clustering seems natural or useful.

For clustering of responses to query:

- Max precision over all clusters. (Hearst and Pedersen) (User model: User can easily identify most relevant cluster, only examines that one.)
- Variant: Sort clusters in decreasing order of precision. Examine them in order, down to a fixed number of documents. Compute precision over these.
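
A sketch of both evaluation measures above (assuming a known set of relevant documents):

```python
def precision(cluster, relevant):
    return sum(d in relevant for d in cluster) / len(cluster)

def max_precision(clusters, relevant):
    # User model: the user examines only the single most relevant cluster.
    return max(precision(c, relevant) for c in clusters)

def precision_at_n(clusters, relevant, n):
    # Variant: take clusters in decreasing order of precision,
    # down to a fixed number n of documents.
    ranked = sorted(clusters, key=lambda c: precision(c, relevant), reverse=True)
    docs = [d for c in ranked for d in c][:n]
    return sum(d in relevant for d in docs) / len(docs)
```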

Reasons to believe that document D is relevant to query term T:

- D contains T.
- Documents that are similar to D contain T. Document D1 is "similar" to D if D1 contains many terms that are in D.
- D contains terms "related" to T. Term T1 is related to T if T1 appears in many documents that contain T.
- Many documents similar to D contain terms related to T.

Task: Predict whether user U likes movie M.

**Algorithm 1: Nearest neighbors over users:**

Find the K users whose taste in movies is closest to U's and who have
expressed an opinion on M. Combine
their votes about M (either simple sum or weighted sum by closeness).
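
A minimal sketch (votes are +1/-1, and similarity is taken as the fraction of shared movies rated the same; both choices are mine, not prescribed above):

```python
def predict(ratings, u, m, k):
    """ratings: dict user -> dict movie -> +1 or -1."""
    def sim(a, b):
        shared = set(ratings[a]) & set(ratings[b])
        if not shared:
            return 0.0
        return sum(ratings[a][x] == ratings[b][x] for x in shared) / len(shared)

    # users who have expressed an opinion on m, nearest first
    voters = sorted((v for v in ratings if v != u and m in ratings[v]),
                    key=lambda v: sim(u, v), reverse=True)[:k]
    score = sum(sim(u, v) * ratings[v][m] for v in voters)  # weighted sum
    return +1 if score > 0 else -1
```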

**Algorithm 1.A: Clustering users**

Divide users into clusters, based on their movie preferences.

Take vote on M over U's cluster.

**Algorithm 2 (dual of algorithm 1): Nearest neighbors over movies**

Find the K movies whose fan-lists are most similar to M's fan-list
(presumably, movies that are similar to M in some way) and about which
U has expressed an opinion. Combine U's opinions about these K movies
(either simple sum or weighted sum by closeness).

**Algorithm 2.A: Clustering movies**

Divide movies into clusters, based on their fan-list.

Average U's votes over M's cluster.

General issues in either of these:

- Are votes Boolean or numeric?
- Are there negative votes, or only positive votes and non-votes? (As always, induction from only positive examples is harder, both to do and to evaluate, than induction from positive and negative examples.)
- What is the measure of closeness?

In practice, one generally uses NN over the larger of the two sets (e.g. if your database has more users than movies, use algorithm 1). That way, you are learning in a lower-dimensional space over a larger training set; the other way, you would be learning in a higher-dimensional space over a smaller training set.

On the other hand, it is not *impossible* that learning over
the smaller set could work better than learning over the larger one.
Consider a dataset
constructed as follows:

1. Create a user × movie matrix and assign ratings at random.

2. Duplicate each movie row.

Then 1-NN over movies predicts ratings perfectly (every movie has an exact
duplicate), while the users are entirely uncorrelated.

**Algorithm 3: Cluster both movies and users.**

1. Choose a fixed number #M of movie clusters and a number #U of user clusters.
2. Use algorithm 1A to cluster users and algorithm 2A to cluster movies.
3. Repeat until nothing changes:
   - For each user U_{I}, define a vector UU_{I}[1...#M] s.t. UU_{I}[J] = number of movies in the Jth movie cluster that U_{I} likes. Cluster the vectors UU_{I}.
   - For each movie M_{J}, define a vector MM_{J}[1...#U] s.t. MM_{J}[I] = number of users in the Ith user cluster that like M_{J}. Cluster the vectors MM_{J}.
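
A sketch of the alternation, using scikit-learn's k-means as the clustering subroutine and a fixed iteration count in place of "until nothing changes" (both choices are mine):

```python
import numpy as np
from sklearn.cluster import KMeans

def cocluster(likes, n_uclust, n_mclust, iters=10):
    """likes: 0/1 array, likes[i, j] = 1 if user i likes movie j."""
    n_users, n_movies = likes.shape
    uclust = KMeans(n_clusters=n_uclust).fit_predict(likes)      # algorithm 1A
    mclust = KMeans(n_clusters=n_mclust).fit_predict(likes.T)    # algorithm 2A
    for _ in range(iters):
        # UU[i, j] = number of movies in movie-cluster j that user i likes
        UU = np.zeros((n_users, n_mclust))
        for j in range(n_movies):
            UU[:, mclust[j]] += likes[:, j]
        uclust = KMeans(n_clusters=n_uclust).fit_predict(UU)
        # MM[j, i] = number of users in user-cluster i who like movie j
        MM = np.zeros((n_movies, n_uclust))
        for i in range(n_users):
            MM[:, uclust[i]] += likes[i, :]
        mclust = KMeans(n_clusters=n_mclust).fit_predict(MM)
    return uclust, mclust
```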

**Algorithm 4: Probabilistic clustering.** (adapted from Chakrabarti, p. 116.)

Assume that there are Z movie clusters and N user clusters.

Let A_{U,C} be the probability that user U belongs to cluster C.

Let B_{M,D} be the probability that movie M belongs to cluster D.

Let P_{C,D} be the probability that a user in cluster C likes
a movie in cluster D.

Then the probability that person U likes movie M is proportional to

sum_{C,D} A_{U,C} P_{C,D} B_{M,D}

Use a hill-climbing technique to find the values of A,B,P that best fit the data. Use these values of A,B,P to predict whether U likes M.
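
With A, B, P stored as matrices, the sum is just a bilinear form; a one-line sketch:

```python
import numpy as np

def prob_likes(A, P, B, u, m):
    """A[u, c] = P(user u in user-cluster c); P[c, d] = P(user in c likes movie in d);
    B[m, d] = P(movie m in movie-cluster d). Unnormalized probability that u likes m."""
    return A[u, :] @ P @ B[m, :]
```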

Note: Danger of overfitting with all these parameters to play with.

**Algorithm 5: Weighted majority**

Imagine a space of experts. Each expert implements a very simple recommendation
strategy. Let R be a number (learning rate) between 0 and 1. If an
expert E has thus far been right P times and wrong Q times, then the
credibility of E is (2-R)^{P} * R^{Q}.

(?? Doesn't seem right, as this can go to zero even for experts with some
informational value. I would have thought (1/R)^{P} * R^{Q}.
The above is the formula from Cohen and Pan.)

Sum the experts' votes weighted by credibility; recommend the item with the highest score.
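
A sketch of the update and the vote, using the Cohen-and-Pan credibility formula exactly as stated above, caveat and all (names are hypothetical):

```python
def update_credibility(weight, correct, r):
    # After each outcome: multiply by (2 - r) if right, by r if wrong,
    # so an expert right P times and wrong Q times has weight (2-r)^P * r^Q.
    return weight * ((2 - r) if correct else r)

def recommend(experts, weights, items):
    """experts: list of functions item -> +1/-1; weights: credibilities."""
    return max(items, key=lambda it: sum(w * e(it) for e, w in zip(experts, weights)))
```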

Also, some users are much easier to recommend for than others. Users who like popular movies (by definition, there are a lot of these) are much easier to recommend for than users with odd tastes. Users who like a lot of movies are easier to recommend for than users who are picky. Hence, there is a lot of variance in success rates from one user to another. Dually, movies that are popular are much safer recommendations than movies that are unpopular.

*Web Collaborative Filtering: Recommending Music by Spidering the Web*, William W. Cohen and Wei Fan, 2000.

Input: Downloads of music at AT&T Labs over 3 months, taken from the server log.

Test set =
IP addresses that first connected in August. Training set = rest of data.

User U likes artist A iff U has downloaded at least one file of A's music.

Thus, Boolean and only positive data. (Problematic, since many users download
only a few files.)

Test set: 23,438 downloads to 1028 IP addresses. 981 artists.

Training set: 5095 downloads to 353 IP addresses.

The usual difficulty of identifying users with IP addresses applies.

Conceptually, each time the user is recorded in the log as downloading a file, you imagine that you compute a recommendation for him. If he "accepts" the recommendation -- i.e. it is recorded further on in the log that he downloaded this same artist -- then that's good; if he "rejects" the recommendation -- he never downloads this artist -- that's bad.

    AA := set of artists that U downloads in the log;
    LEN := 0;
    repeat working forward through the log {
        let T be the next time U downloads a file;
        based on log data up to T, generate a set of recommendations RR;
        pick an artist A in RR (that has not yet been recommended or downloaded?) at random;
        if A in AA then LEN++;
    } until all artists in AA have either been recommended or downloaded.

Note: the maximum number of different artists downloaded by any user is small: median = 10.
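
A rough Python rendering of this loop for one user (`recommend` stands in for whatever method is being evaluated; the termination condition is simplified to the end of the log):

```python
import random

def simulate(log, user, recommend):
    """log: time-ordered list of (user, artist) download events.
    recommend(prefix): set of recommended artists, given the log so far."""
    AA = {a for u, a in log if u == user}   # artists the user ever downloads
    tried, accepted = set(), 0
    for t, (u, artist) in enumerate(log):
        if u != user:
            continue                        # only act when U downloads a file
        rr = recommend(log[:t]) - tried     # candidates not yet recommended
        if not rr:
            continue
        a = random.choice(sorted(rr))
        tried.add(a)
        if a in AA:                         # "accepted": downloaded somewhere in the log
            accepted += 1
    return accepted, len(tried)
```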

Accuracy up to the Kth trial = number of recommendations accepted within the first K trials / K. Average over all users who download at least K files.
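
That is, as a sketch:

```python
def acc_up_to_k(outcomes_by_user, k):
    """outcomes_by_user: per user, a list of booleans (was each recommendation accepted?).
    Average, over users with at least k trials, of (accepted within first k) / k."""
    vals = [sum(o[:k]) / k for o in outcomes_by_user if len(o) >= k]
    return sum(vals) / len(vals)
```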

Plot accuracy up to the Kth trial against K. It tends to increase, for two reasons:

A. More information is available about the user's tastes.

B. The pool of users being averaged over continually changes.
At time K, we are dealing with only users who download at least K files,
and who are therefore easier to match.

Overall accuracy for U: Number of recommendations accepted / Number of recommendations made.

*Any* such evaluation method, based on user logs, must systematically and
substantially underestimate accuracy: a recommendation is counted as "rejected"
whenever the user never happens to download that artist, even if he would in
fact have liked it.

Recommendation methods compared:

- POP: Popularity of artist A = number of users who download A. Recommend the most popular artist that U has not yet seen. Note that this is pretty much independent of U. It gives a baseline that is actually not easy to beat.
- K-NN: K nearest neighbors. Find the K users closest to U and sum their votes. Distance(U1,U2) = number of artists on which U1 and U2 disagree.
- Weighted majority, using the following experts:
  - For each pair of artists A,B, an expert E[A,B] that predicts that all users will rate A and B identically.
  - For each pair of users U,V, an expert E[U,V] that predicts that U and V will rate all artists identically.
  - For each artist A, an expert E[A] that predicts that all users will rate A positively.
- Extended Direct Bayesian prediction: XDB. I omit the explanation.

"Genre" = large category: e.g. jazz, country.

"Musical style" = small category e.g. "country boogie".

Artist B is related to artist A if A's page has a reference to B.

(Nothing clever here: Allmusic.com has all this information laid out neatly.)

**Genre method:**

Construct a "pseudo-user" for each genre, which rates every artist in
the genre positively and every artist not in the genre negatively.
Use K-NN.

**Related artist method:**

For each artist A, construct a pseudo-user that rates A and every artist
related to A positively, and all other artists negatively.
Use K-NN.
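
Both kinds of pseudo-user are easy to construct; a sketch (the data structures are my own assumptions):

```python
def genre_pseudo_user(genre_members, all_artists):
    # +1 for every artist in the genre, -1 for every artist outside it
    return {a: (+1 if a in genre_members else -1) for a in all_artists}

def related_pseudo_user(artist, related, all_artists):
    # +1 for A and every artist related to A, -1 for all others
    positive = {artist} | set(related.get(artist, ()))
    return {a: (+1 if a in positive else -1) for a in all_artists}
```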

**Phase 1:**

1. For each artist A (in log? not clear), call AltaVista with artist's name.
Retrieve top 100 URL's.

2. Extract URL's that appear at least twice (i.e. for two different artists).

3. Apply list-extractor.

Results: 2519 pseudo-users with 13527 positive ratings.

List extractor: Look for the same position in the HTML structure (e.g. http://lusers.com/~biff+body.ul.li) associated with text that is close to the name of an artist.

Problem with the phase 1 results: the method tends to weight all artists equally. It would do better to weight popular artists more heavily than unpopular artists.

**Phase 2:**

1. Find 1000 pairs of entity names co-occurring most frequently in output of
Phase 1.

2. Search Northern Light for each such pair of names. Retrieve the top 10 URL's.
3. Apply list extractor.

Results: 1894 pseudo-users. 48,878 positive ratings.

Random choice: accuracy = 0.02.

POP: ACC(0)=0.15, ACC(7)=0.17, ACC(25)=0.22, ACC(50)=0.29.

K-NN and XDB are better than POP; statistically significant.

K-NN: ACC(0)=0.15, ACC(7)=0.19, ACC(25)=0.25, ACC(50)=0.32.

WM is slightly worse (not significantly) than POP.

Genres and Related artists are much worse than the user-log based methods.

Genres: ACC(1)=0.04, ACC(7)=0.07, ACC(25)=0.11, ACC(50)=0.17.

Related artists starts out at ACC(1)=0.14 (why?), but declines sharply
and converges to the same behavior as Genres.

Spider: Better than Related artists, not as good as user logs. Phase 2 is
somewhat better than Phase 1.

SPIDER phase 2: ACC(1)=0.14, ACC(7)=0.12, ACC(25)=0.18, ACC(50)=0.25.

SPIDER phase 1: ACC(1)=0.01, ACC(7)=0.08, ACC(25)=0.15, ACC(50)=0.22.

**Moral:** Recommendations based on web searches can be effective,
but are not as effective as recommendations based on user data. Maybe.

Doubts:

- Generally wonky evaluation method.
- In the user-log methods, the evaluation uses the same data set as the training data.
- Quite possible that different IP addresses represent same user, giving user-log data further unfair advantage.
- Atypical user population (AT&T Labs employees) will disfavor Web-based techniques.
- Web-based methods are by no means obvious. How much fiddling was needed to find them? To what extent do they generalize? Can you do better?