## Lecture 8: Collaborative Filtering

### Efficient implementation of K-means clustering

As we wrote the algorithm before, on each iteration you have to compute the distance from every point to every centroid. If there are many points, and K is reasonably large, and the division into clusters has become fairly stable, then the update procedure can be made more efficient as follows:

First, the computation of the new centroid requires only the points added to the cluster, the points taken out of the cluster, and the number of points in the cluster.
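
A minimal sketch of this incremental update, assuming points are lists of coordinates; the function name `update_centroid` and the convention for an emptied cluster are illustrative choices, not part of the notes.

```python
def update_centroid(centroid, count, added, removed):
    """Return (new_centroid, new_count) after points join or leave a cluster.

    centroid: current mean of the cluster, as a list of floats
    count:    number of points currently in the cluster
    added:    points moved into the cluster on this iteration
    removed:  points moved out of the cluster on this iteration
    """
    new_count = count + len(added) - len(removed)
    if new_count == 0:
        # Assumed convention: an emptied cluster keeps its old centroid.
        return list(centroid), 0
    # Recover the coordinate sums from the mean, then adjust them.
    total = [c * count for c in centroid]
    for p in added:
        total = [t + x for t, x in zip(total, p)]
    for p in removed:
        total = [t - x for t, x in zip(total, p)]
    return [t / new_count for t in total], new_count
```

The cost is proportional to the number of points that changed cluster, not to the size of the cluster.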

Second, computing the changes to the clusters caused by moving the centroids can be made more efficient as follows:

Let $C_J^{(T)}$ be the centroid of the Jth cluster on the Tth iteration, and let $X_I$ be the Ith point. By the triangle inequality,

$$\mathrm{dist}(X_I, C_J^{(T)}) - \mathrm{dist}(C_J^{(T)}, C_J^{(T+1)}) \le \mathrm{dist}(X_I, C_J^{(T+1)}) \le \mathrm{dist}(X_I, C_J^{(T)}) + \mathrm{dist}(C_J^{(T)}, C_J^{(T+1)}).$$

Therefore you can maintain two arrays:

RADIUS(J) = an overestimate of the maximum value of $\mathrm{dist}(X_I, C_J)$, where $X_I$ is in cluster J.

DIST(J,Z) = an underestimate of the minimum value of $\mathrm{dist}(X_I, C_Z)$, where $X_I$ is in cluster J.

which you update as follows:

RADIUS(J) := RADIUS(J) + $\mathrm{dist}(C_J^{(T)}, C_J^{(T+1)})$ and DIST(J,Z) := DIST(J,Z) - $\mathrm{dist}(C_Z^{(T)}, C_Z^{(T+1)})$. Then as long as RADIUS(J) < DIST(J,Z) you can be sure that none of the points in cluster J should be moved to cluster Z. (You update these values more exactly whenever a point is moved.)
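
A sketch of this bookkeeping, assuming Euclidean distance and the triangle-inequality bounds: when centroids move, the overestimate RADIUS(J) grows by the movement of centroid J, and the underestimate DIST(J,Z) shrinks by the movement of centroid Z. The names `shift_bounds` and `clusters_to_recheck` are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as coordinate lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def shift_bounds(radius, lower, old_centroids, new_centroids):
    """Loosen the bounds in place after the centroids have moved.

    radius[j]   : upper bound on dist(x, c_j) for every x in cluster j
    lower[j][z] : lower bound on dist(x, c_z) for every x in cluster j
    """
    k = len(old_centroids)
    moved = [dist(old_centroids[j], new_centroids[j]) for j in range(k)]
    for j in range(k):
        radius[j] += moved[j]          # points may now be farther from c_j
        for z in range(k):
            if z != j:                 # points may now be closer to c_z
                lower[j][z] = max(0.0, lower[j][z] - moved[z])

def clusters_to_recheck(radius, lower):
    """Pairs (j, z) for which the bounds cannot rule out a move from j to z."""
    k = len(radius)
    return [(j, z) for j in range(k) for z in range(k)
            if z != j and not radius[j] < lower[j][z]]
```

Only the pairs returned by `clusters_to_recheck` need exact point-to-centroid distances on that iteration.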

### Evaluating Clusters

Formal measures: Normalize all vectors to length 1. Assume fixed number of clusters.

• Minimize average diameter of clusters. (Diameter of cluster C = max distance between two points in C)
• Minimize average distance between points in cluster.
• Minimize mean-squared distance from point to centroid. Maximum likelihood estimate if clusters generated by normal distribution around centroid.
• Average over cluster C of longest edge in minimum spanning tree for C.
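
Two of the measures above (average diameter and mean-squared distance to centroid) might be computed as follows. This is a sketch that assumes each cluster is a non-empty list of equal-length coordinate tuples and that any normalization to length 1 has already been done; the function names are illustrative.

```python
import math
from itertools import combinations

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def diameter(cluster):
    """Largest distance between any two points in the cluster (0 for a singleton)."""
    return max((dist(a, b) for a, b in combinations(cluster, 2)), default=0.0)

def average_diameter(clusters):
    """Unweighted average of the cluster diameters."""
    return sum(diameter(c) for c in clusters) / len(clusters)

def mean_squared_distance(clusters):
    """Mean, over all points, of the squared distance to the point's own centroid."""
    total, n = 0.0, 0
    for c in clusters:
        centroid = [sum(col) / len(c) for col in zip(*c)]
        total += sum(dist(p, centroid) ** 2 for p in c)
        n += len(c)
    return total / n
```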

Variable number of clusters: Any of the above + suitable penalty for more clusters.

Formal measures test adequacy of clustering algorithm, but not relevance of measures to actual significance.

Ask subjects to cluster, compare systems of clusterings.

• Let E1, E2 be within-cluster arcs in cluster systems 1, 2. Then measure |E1 intersect E2| / |E1 union E2|. (Note that, if two systems both contain 2 clusters, then for large numbers of points the above is at least about 1/3; on average 1/2.)
• For any cluster C in system 1, let m(C) be closest cluster in system 2. Compute weighted average over C of similarity(m(C),C).
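
The first comparison above can be sketched as follows, reading a "within-cluster arc" as an unordered pair of items that share a cluster. The representation (each system is a list of sets of sortable item identifiers) and the function names are assumptions made for illustration.

```python
from itertools import combinations

def within_cluster_pairs(clusters):
    """All unordered pairs of items that fall in the same cluster."""
    pairs = set()
    for c in clusters:
        pairs.update(combinations(sorted(c), 2))
    return pairs

def pair_overlap(system1, system2):
    """|E1 intersect E2| / |E1 union E2| for two systems of clusters."""
    e1 = within_cluster_pairs(system1)
    e2 = within_cluster_pairs(system2)
    union = e1 | e2
    # Assumed convention: two systems with no within-cluster pairs agree fully.
    return len(e1 & e2) / len(union) if union else 1.0
```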

Ask subjects to evaluate similarity of all pairs of documents. Correlate these similarities with clustering (e.g. average similarity within cluster / average similarity between clusters)

Ask subjects whether system of clustering seems natural or useful.

For clustering of responses to query:

• Max precision over all clusters. (Hearst and Pedersen) (User model: User can easily identify most relevant cluster, only examines that one.)
• Variant: Sort clusters in decreasing order of precision. Examine them in order, down to a fixed number of documents. Compute precision over these.
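
Both query-response measures can be sketched in a few lines, assuming each cluster is a list of document identifiers and `relevant` is the set of documents judged relevant; these names are hypothetical.

```python
def precision(docs, relevant):
    """Fraction of docs that are relevant (0 for an empty list)."""
    return sum(1 for d in docs if d in relevant) / len(docs) if docs else 0.0

def max_cluster_precision(clusters, relevant):
    """Precision of the best cluster: the user is assumed to find and read only that one."""
    return max(precision(c, relevant) for c in clusters)

def precision_in_cluster_order(clusters, relevant, limit):
    """Read clusters in decreasing order of precision, stopping after `limit` documents."""
    ordered = sorted(clusters, key=lambda c: precision(c, relevant), reverse=True)
    examined = [d for c in ordered for d in c][:limit]
    return precision(examined, relevant)
```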

### Collaborative Filtering

#### Example: Terms and Documents

We say that document D is relevant to query term T if
• D contains T.
• Documents that are similar to D contain T. Document D1 is "similar" to D if D1 contains many terms that are in D.
• D contains terms "related" to T. Term T1 is related to T if T1 appears in many documents that contain T.
• Many documents similar to D contain terms related to T.

#### Example: Personal preferences

Database of which users like which movies (M x N matrix).
Task: Predict whether user U likes movie M.

Algorithm 1: Nearest neighbors over users:
Find K users whose taste in movies is closest to U and who have expressed an opinion on M. Combine their votes about M. (Either simple sum or weighted sum by closeness.)
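
A minimal sketch of Algorithm 1 with Boolean votes. The data layout (`ratings[user][movie]` is True or False) and the closeness measure (fraction of co-rated movies on which two users agree) are assumptions; the notes leave the measure of closeness open.

```python
def similarity(r1, r2):
    """Fraction of co-rated movies on which two users agree (0 if none in common)."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    return sum(1 for m in common if r1[m] == r2[m]) / len(common)

def predict_user_knn(ratings, user, movie, k=3):
    """Weighted vote on `movie` by the k users closest to `user` who have rated it."""
    candidates = [(similarity(ratings[user], r), r[movie])
                  for u, r in ratings.items()
                  if u != user and movie in r]
    nearest = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    # Each neighbor votes +weight for "likes" and -weight for "dislikes".
    score = sum(w if likes else -w for w, likes in nearest)
    return score > 0
```

Replacing the weight `w` by 1 gives the simple (unweighted) sum.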

Algorithm 1.A: Clustering users
Divide users into clusters, based on their movie preferences.
Take vote on M over U's cluster.

Algorithm 2 (dual of algorithm 1): Nearest neighbors over movies
Find K movies whose fan-lists are most similar to M's fan-list (presumably, movies that are similar to M in some way) and about which U has expressed an opinion. Combine U's opinions about these K movies (either simple sum or weighted sum by closeness).
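
The duality can be made concrete by transposing the data and running the same vote over movies. As before, this is a sketch: `ratings[user][movie]` holding True or False and the agreement-based closeness measure are assumptions, and the function names are hypothetical.

```python
def transpose(ratings):
    """Turn {user: {movie: vote}} into {movie: {user: vote}}."""
    by_movie = {}
    for user, votes in ratings.items():
        for movie, vote in votes.items():
            by_movie.setdefault(movie, {})[user] = vote
    return by_movie

def agreement(v1, v2):
    """Fraction of common raters who gave two movies the same vote (0 if none)."""
    common = set(v1) & set(v2)
    if not common:
        return 0.0
    return sum(1 for u in common if v1[u] == v2[u]) / len(common)

def predict_movie_knn(ratings, user, movie, k=3):
    """Weighted vote of `user`'s opinions on the k movies closest to `movie`."""
    by_movie = transpose(ratings)
    target = by_movie.get(movie, {})
    candidates = [(agreement(target, fans), fans[user])
                  for m, fans in by_movie.items()
                  if m != movie and user in fans]
    nearest = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    score = sum(w if likes else -w for w, likes in nearest)
    return score > 0
```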

Algorithm 2.A: Clustering movies
Divide movies into clusters, based on their fan-list.
Average U's votes over M's cluster.

General issues in either of these:

• Are votes Boolean or numeric?
As always, induction from only positive examples is harder (both to do and to evaluate) than induction from positive and negative examples.
• What is the measure of closeness?

In practice, one generally uses NN over the larger of the two sets (e.g. if your database has more users than movies, use algorithm 1). That way, you are learning in a lower-dimensional space over a larger training set, whereas the other way you are learning in a higher-dimensional space over a smaller training set.

On the other hand, it is not impossible that learning over the smaller set could work better than learning over the larger one. Consider a dataset constructed as follows:
1. Create a user × movie matrix and assign ratings at random.
2. Duplicate each movie, so that every movie has an exact copy with the same ratings.
Then 1-NN over movies predicts ratings perfectly, while users are entirely uncorrelated.
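
The construction can be checked with a small experiment. This sketch assumes Boolean ratings, stores the matrix as one list per user, and places the copy of movie j at column j + n_movies; the function names and the leave-one-out protocol are illustrative choices.

```python
import random

def make_dataset(n_users, n_movies, seed=0):
    """Random Boolean ratings, with every movie column duplicated."""
    rng = random.Random(seed)
    base = [[rng.random() < 0.5 for _ in range(n_movies)] for _ in range(n_users)]
    return [row + row for row in base]

def movie_column(matrix, j, skip_user):
    """Ratings of movie j by every user except the one being predicted."""
    return [row[j] for u, row in enumerate(matrix) if u != skip_user]

def predict_1nn_movies(matrix, user, movie):
    """Predict matrix[user][movie] from the user's rating of the most similar other movie."""
    target = movie_column(matrix, movie, user)
    best, best_agree = None, -1
    for j in range(len(matrix[0])):
        if j == movie:
            continue
        agree = sum(a == b for a, b in zip(target, movie_column(matrix, j, user)))
        if agree > best_agree:
            best, best_agree = j, agree
    return matrix[user][best]
```

With enough users, the nearest movie is almost surely the exact copy, so the prediction equals the hidden rating.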

Algorithm 3: Cluster both movies and users.

```
Choose a fixed number #M of movie clusters and a fixed number #U of user clusters.
Use algorithm 1.A to cluster users and algorithm 2.A to cluster movies.
repeat {
    for each user UI,
        define a vector UUI[1...#M] s.t.
            UUI[J] = number of movies in Jth cluster that UI likes;
    cluster the vectors UUI;
    for each movie MJ,
        define a vector MMJ[1...#U] s.t.
            MMJ[I] = number of users in Ith cluster that like MJ;
    cluster the vectors MMJ;
} until nothing changes.
```
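
A runnable sketch of Algorithm 3 under several assumptions: `likes[u][m]` is 1 or 0, the clustering step is a tiny k-means seeded deterministically with the first k vectors (any clustering method could be substituted), and a `max_rounds` cap guards against label oscillation. None of these choices come from the notes.

```python
def kmeans(vectors, k, iters=20):
    """Tiny k-means seeded with the first k vectors; returns one label per vector."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centroids[c])))
                  for v in vectors]
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def co_cluster(likes, n_user_clusters, n_movie_clusters, max_rounds=10):
    """Alternately re-cluster users and movies until the labels stop changing."""
    n_users, n_movies = len(likes), len(likes[0])
    user_labels = kmeans(likes, n_user_clusters)                      # algorithm 1.A
    movie_labels = kmeans([list(col) for col in zip(*likes)],
                          n_movie_clusters)                           # algorithm 2.A
    for _ in range(max_rounds):
        # uu[u][j] = number of movies in movie cluster j that user u likes
        uu = [[sum(likes[u][m] for m in range(n_movies) if movie_labels[m] == j)
               for j in range(n_movie_clusters)] for u in range(n_users)]
        new_user_labels = kmeans(uu, n_user_clusters)
        # mm[m][i] = number of users in user cluster i that like movie m
        mm = [[sum(likes[u][m] for u in range(n_users) if new_user_labels[u] == i)
               for i in range(n_user_clusters)] for m in range(n_movies)]
        new_movie_labels = kmeans(mm, n_movie_clusters)
        if new_user_labels == user_labels and new_movie_labels == movie_labels:
            break
        user_labels, movie_labels = new_user_labels, new_movie_labels
    return user_labels, movie_labels
```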

Algorithm 4: Probabilistic clustering. (adapted from Chakrabarti p. 116.)
Assume that there are Z movie clusters and N user clusters.
Let $A_{U,C}$ be the probability that user U belongs to cluster C.
Let $B_{M,D}$ be the probability that movie M belongs to cluster D.
Let $P_{C,D}$ be the probability that a user in cluster C likes a movie in cluster D.
Then the probability that person U likes movie M is proportional to

$$\sum_{C,D} A_{U,C} \, P_{C,D} \, B_{M,D}$$

Use a hill-climbing technique to find the values of A,B,P that best fit the data. Use these values of A,B,P to predict whether U likes M.
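
Once values of A, B, and P are available, the score for a (user, movie) pair is a double sum over clusters. The sketch below shows only that scoring step, not the hill-climbing fit; the argument names are hypothetical, with `a_u` the row of A for user U, `b_m` the row of B for movie M, and `p[c][d]` the cluster-to-cluster probability.

```python
def like_score(a_u, p, b_m):
    """Sum over user clusters c and movie clusters d of A[u][c] * P[c][d] * B[m][d]."""
    return sum(a_u[c] * p[c][d] * b_m[d]
               for c in range(len(a_u))
               for d in range(len(b_m)))
```

If U belongs to one user cluster with certainty and M to one movie cluster with certainty, the score reduces to the single entry of P for that pair of clusters.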

Note: Danger of overfitting with all these parameters to play with.

Algorithm 5: Weighted majority
Imagine a space of experts. Each expert implements a very simple recommendation strategy. Let R be a number (learning rate) between 0 and 1. If an expert E has thus far been right P times and wrong Q times, then the credibility of E is $(2-R)^P \cdot R^Q$.
(?? Doesn't seem right, as this can go to zero even for experts with some informational value. I would have thought $(1/R)^P \cdot R^Q$. The above is the formula from Cohen and Fan.)

Sum the experts' votes weighted by credibility; recommend the item with the highest score.
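
A sketch of the credibility-weighted vote using the credibility (2-R)^P * R^Q given above. The data layout is an assumption: `expert_votes` maps each expert to the item it recommends, and `records` maps each expert to its (right, wrong) counts.

```python
def credibility(right, wrong, r):
    """Credibility (2 - R)^P * R^Q of an expert right P times and wrong Q times."""
    return (2 - r) ** right * r ** wrong

def recommend(expert_votes, records, r=0.5):
    """Return the item with the highest total credibility among the experts' votes."""
    scores = {}
    for expert, item in expert_votes.items():
        p, q = records[expert]
        scores[item] = scores.get(item, 0.0) + credibility(p, q, r)
    return max(scores, key=scores.get)
```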

#### Evaluation

Complex, because the "training set" has three parts: the movies that U likes, the users that like M, and all the other data.

Also, some users are much easier to recommend for than others. Users who like popular movies (by definition, there are a lot of these) are much easier to recommend for than users with odd tastes. Users who like a lot of movies are easier to recommend for than users who are picky. Hence, there is a lot of variance in success rates from one user to another. Dually, movies that are popular are much safer recommendations than movies that are unpopular.

Web Collaborative Filtering: Recommending Music by Spidering the Web, William W. Cohen and Wei Fan, 2000.

Input: Downloads of music at ATT Labs over 3 months from server log.
Test set = IP addresses that first connected in August. Training set = rest of data.
User U likes artist A iff U has downloaded at least one file of A's music.
Thus, Boolean and only positive data. (Problematic, since many users download only a few files.)

Usual difficulty identifying user with IP address.

#### Evaluation:

Simulate new user U from the time he enters.
Conceptually, each time the user is recorded in the log as downloading a file, you imagine that you compute a recommendation for him. If he "accepts" the recommendation -- i.e. it is recorded further on in the log that he downloaded this same artist -- then that's good; if he "rejects" the recommendation -- he never downloads this artist -- that's bad.
```
AA := set of artists that U downloads in log;
LEN := 0;
repeat, working forward through the log, at each time T that U downloads a file {
    based on log data up to T, generate a set of recommendations RR;
    pick artist A in RR (that has not yet been recommended or downloaded?) at random;
    if A in AA then LEN++;
}
```
Note: the maximum number of different artists downloaded by any user is small: median = 10.

Accuracy up to Kth trial = number of recommendations accepted within first K trials / K. Average over all users who download at least K files.
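
The definition can be sketched as follows, assuming one trial per download so that a user with at least K downloads has at least K trials. `trials_by_user` is a hypothetical mapping from each user to a list of Booleans, one per trial, recording whether that recommendation was accepted.

```python
def accuracy_up_to(trials_by_user, k):
    """Average, over users with at least k trials, of accepted-in-first-k / k."""
    eligible = [t for t in trials_by_user.values() if len(t) >= k]
    if not eligible:
        return None  # no user has k trials, so the measure is undefined
    return sum(sum(t[:k]) / k for t in eligible) / len(eligible)
```

The pool of eligible users shrinks as k grows, which is why the curve is not an average over a fixed population.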

Plot accuracy up to Kth trial against K. Tends to increase:
B. The pool of users being averaged over continually changes. At time K, we are dealing with only users who download at least K files, and who are therefore easier to match.

Overall accuracy for U: Number of recommendations accepted / Number of recommendations made.

Any such evaluation method, based on user logs, necessarily must systematically and hugely underestimate accuracy.

#### Methods based on user logs

• POP: Popularity of artist A = number of users who download A.
Recommend most popular artist that U has not yet seen.
Note that this is pretty much independent of U. Gives a baseline that is actually not easy to beat.
• K-NN: K nearest neighbors. Find K users closest to U and sum their votes.
Distance(U1,U2) = number of artists where U1, U2 disagree.
• Weighted majority using the following experts:
  • For each pair of artists A,B, an expert E[A,B] that predicts that all users will rate A and B identically.
  • For each pair of users U,V, an expert E[U,V] that predicts that U and V will rate all artists identically.
  • For each artist A, an expert E[A] that predicts that all users will rate A positively.
  Number of successes = just number of positive successes, not negative successes.
• Extended Direct Bayesian prediction: XDB. I omit the explanation.
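
The POP baseline and K-NN with the disagreement distance can be sketched as follows. The layout `downloads[user]` = set of artists is an assumption; with positive-only Boolean data, the number of artists on which two users disagree is the size of the symmetric difference of their sets. Tie-breaking by artist name is an arbitrary choice made to keep the result deterministic.

```python
from collections import Counter

def pop_recommend(downloads, user):
    """Most popular artist (by number of downloading users) the user has not downloaded."""
    counts = Counter(a for artists in downloads.values() for a in artists)
    seen = downloads.get(user, set())
    unseen = [(n, a) for a, n in counts.items() if a not in seen]
    return max(unseen)[1] if unseen else None

def disagreement(a1, a2):
    """Number of artists on which two users disagree."""
    return len(a1 ^ a2)

def knn_recommend(downloads, user, k=2):
    """Sum the votes of the k users closest to `user`; return the top unseen artist."""
    mine = downloads[user]
    neighbors = sorted((u for u in downloads if u != user),
                       key=lambda u: disagreement(mine, downloads[u]))[:k]
    votes = Counter(a for u in neighbors for a in downloads[u] if a not in mine)
    return max(votes, key=lambda a: (votes[a], a)) if votes else None
```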

#### Methods based on domain-specific spider (or, more precisely, a large, manually constructed index)

Spider traverses Allmusic.com, extracts "genre" and "musical style" of each artist and related artists.
"Genre" = large category: e.g. jazz, country.
"Musical style" = small category e.g. "country boogie".
B related to A if A's page has reference to B.
(Nothing clever here: Allmusic.com has all this information laid out neatly.)

Genre method:
Construct "pseudo-user" for each genre, that ranks every artist in the genre positively and every artist not in the genre negatively. Use K-NN.

Related artist method:
For each artist A, construct a pseudo-user that ranks positively A and every artist related to A and negatively all others. Use K-NN.
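
Both kinds of pseudo-user can be represented as sets of positively rated artists, with every other artist implicitly rated negatively. The input layouts (`genre_of` mapping artist to genre, `related` mapping artist to related artists) and the key prefixes are hypothetical; the resulting sets could then be fed to K-NN in place of real users.

```python
def genre_pseudo_users(genre_of):
    """One pseudo-user per genre, liking exactly the artists in that genre."""
    users = {}
    for artist, genre in genre_of.items():
        users.setdefault("genre:" + genre, set()).add(artist)
    return users

def related_pseudo_users(related):
    """One pseudo-user per artist A, liking A and every artist related to A."""
    return {"related:" + a: {a, *others} for a, others in related.items()}
```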

#### Method using general purpose search engines (somewhat simplified).

General idea: Look for lists of artists, use each such list as a pseudo-user in K-NN.

Phase 1:
1. For each artist A (in log? not clear), call AltaVista with artist's name. Retrieve top 100 URL's.
2. Extract URL's that appear at least twice (two different artists).
3. Apply list-extractor.
Results: 2519 pseudo-users with 13527 positive ratings.

List extractor: Look for same position in HTML-structure (e.g. http://lusers.com/~biff+body.ul.li) associated with text that is close to name of artist.

Problem with phase 1 results: tends to weight all artists equally. Do better to have popular artists weighted more heavily than unpopular artists.

Phase 2:
1. Find 1000 pairs of entity names co-occurring most frequently in output of Phase 1.
2. Search Northern Light for each such pair of names. Retrieve top 10 URL's.
3. Apply list extractor.
Results: 1894 pseudo-users with 48,878 positive ratings.

#### Combined methods

Various combinations of real users and pseudo-users.

#### Results

Comparing accuracy after K trials: ACC(K). (These figures are eyeballed off a graph.)

Random choice: accuracy = 0.02.

POP: ACC(0)=0.15. ACC(7)=0.17. ACC(25)=0.22. ACC(50)=0.29.
K-NN and XDB are better than POP; statistically significant.
K-NN: ACC(0)=0.15, ACC(7)=0.19. ACC(25)=0.25. ACC(50)=0.32
WM is slightly worse (not significantly) than POP.

Genres and Related artists are much worse than the user-log based methods.
Genres: ACC(1) = 0.04, ACC(7) = 0.07. ACC(25)=0.11. ACC(50)=0.17.
Related artists starts out at ACC(1)=0.14 (why?), but declines sharply, and converges to same behavior as genres.

Spider: Better than related artists, not as good as User logs. Phase 2 is somewhat better than phase 1.
SPIDER phase 2: ACC(1)=0.14. ACC(7)=0.12. ACC(25)=0.18. ACC(50)=0.25
SPIDER phase 1: ACC(1)=0.01. ACC(7)=0.08. ACC(25)=0.15. ACC(50)=0.22

Moral: Recommendations based on web searches can be effective, but are not as effective as recommendations based on user data. Maybe.
Doubts:

• Generally wonky evaluation method.
• In user-log, the evaluation uses the same data set as the training data.
• Quite possible that different IP addresses represent same user, giving user-log data further unfair advantage.
• Atypical user population (Bell Lab employees) will disfavor Web-based techniques.
• Web-based methods are by no means obvious. How much fiddling was needed to find them? To what extent do they generalize? Can you do better?