Lecture 6: Collaborative Filtering / Information Extraction

Tao Yang's Lecture

ExpertRank: Ranking system for Ask.com. See US Patent Application 7028026 by Tao Yang, Wei Wang, and Apostolos Gerasoulis. Another interesting observation is that a search engine should try to return pages that cover all the meanings of the query terms. E.g. (Yang's example) the query "CMU" should return Central Michigan University toward the top as well as Carnegie-Mellon, even though Carnegie-Mellon pages have higher page rank. (Both Ask.com and Google return Central Michigan U. as the second answer.)

Required Reading:
Chakrabarti, sec 4.5
Evaluating Collaborative Filtering Recommender Systems by Jonathan Herlocker, Joseph Konstan, Loren Terveen, and John Riedl, ACM Transactions on Information Systems, vol. 22, no. 1, 2004, pp. 5-53. One of the best papers I have ever read on the evaluation of any kind of intelligent system.

Unsupervised Named-Entity Extraction from the Web. Oren Etzioni et al.

Additional Reading: Amazon.com Recommendations: Item-to-Item Collaborative Filtering by Greg Linden, Brent Smith, and Jeremy York, IEEE Internet Computing, January-February 2003.

Collaborative Filtering

Example: Terms and Documents

We say that document D is relevant to query term T if

Example: Personal preferences

Database of what users like which movies (M x N matrix).
Task: Predict whether user U likes movie M.

Algorithm 1: Nearest neighbors over users:
User-based collaborative filtering
Find K users whose taste in movies is closest to U and who have expressed an opinion on M. Combine their votes about M. (Either simple sum or weighted sum by closeness.)
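
A minimal sketch of this in Python (my own code, not from the lecture; it assumes a users-by-movies ratings matrix in which 0 means "no opinion", and uses cosine similarity as the closeness measure):

import numpy as np

def predict_user_based(ratings, u, m, k=5):
    """Predict user u's rating of movie m by a similarity-weighted vote of the
    k users closest to u who have expressed an opinion on m. Illustrative only."""
    candidates = np.where(ratings[:, m] != 0)[0]        # users with an opinion on m
    candidates = candidates[candidates != u]
    if len(candidates) == 0:
        return 0.0                                      # no one to vote
    # cosine similarity between u's rating row and each candidate's rating row
    norms = np.linalg.norm(ratings[candidates], axis=1) * np.linalg.norm(ratings[u]) + 1e-12
    sims = ratings[candidates] @ ratings[u] / norms
    order = np.argsort(-sims)[:k]                       # the k closest such users
    neighbors, weights = candidates[order], sims[order]
    return float(weights @ ratings[neighbors, m] / (np.abs(weights).sum() + 1e-12))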

Algorithm 1.A: Clustering users
Divide users into clusters, based on their movie preferences.
Take vote on M over U's cluster.

Algorithm 2 (dual of algorithm 1): Nearest neighbors over movies
Item-based collaborative filtering
Find K movies whose fan-lists are most similar to M's fan-list (presumably, movies that are similar to M in some way) and about which U has expressed an opinion. Combine U's opinions about these K movies (either simple sum or weighted sum by closeness).
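
The dual sketch (same assumptions as the user-based sketch above; now the columns of the ratings matrix play the role of fan-lists):

import numpy as np

def predict_item_based(ratings, u, m, k=5):
    """Predict user u's rating of movie m from u's opinions of the k movies whose
    rating columns ("fan-lists") are most similar to m's. Illustrative only."""
    candidates = np.where(ratings[u] != 0)[0]           # movies u has an opinion on
    candidates = candidates[candidates != m]
    if len(candidates) == 0:
        return 0.0
    cols = ratings[:, candidates]                       # one column per candidate movie
    norms = np.linalg.norm(cols, axis=0) * np.linalg.norm(ratings[:, m]) + 1e-12
    sims = cols.T @ ratings[:, m] / norms               # cosine similarity of fan-lists
    order = np.argsort(-sims)[:k]
    neighbors, weights = candidates[order], sims[order]
    return float(weights @ ratings[u, neighbors] / (np.abs(weights).sum() + 1e-12))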

Algorithm 2.A: Clustering movies
Divide movies into clusters, based on their fan-list.
Average U's votes over M's cluster.

General issues in either of these:

I would think that it would generally be better to use NN over the larger of the two sets (e.g. if your database has more users than movies, use algorithm 1). That way, you are learning in a lower-dimensional space over a larger training set; the other way around, you are learning in a higher-dimensional space over a smaller training set.

On the other hand, it is not impossible that learning over the smaller set could work better than learning over the larger one. Consider a dataset constructed as follows:
1. Create a user * movie matrix and assign ratings at random.
2. Duplicate each movie row.
Then 1-NN over movies predicts ratings perfectly (every movie's nearest neighbor is its exact duplicate), while the users are entirely uncorrelated, so nearest neighbors over users is useless.
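
A quick numerical check of this construction (my own code; it assumes 0/1 "like" ratings):

import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(0, 2, size=(50, 200))     # 50 movies x 200 users, random 0/1 likes
R2 = np.vstack([R, R])                     # duplicate each movie row

# 1-NN over movies: movie i's nearest neighbor is its exact copy, row (i + 50) % 100,
# so every rating is predicted perfectly; any two user columns agree only by chance.
i = 7
print(np.array_equal(R2[i], R2[(i + 50) % 100]))    # True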

However, amazon.com does item-based CF, presumably for efficiency reasons. (Linden, Smith, and York, 2003). As of 2003, they report a database with 29 million customers and several million catalogue items. The following algorithm is performed offline:

for each item I1
  for each customer C who bought I1
     for each item I2 that C bought
        record that a customer bought both I1 and I2;
for each item I1
   for each I2 that has a customer in common with I1
     compute the similarity between I1 and I2;
If "similarity" is measured as the normalized dot product then the similarity between I1 and I2 is equal to
[the number of customers who bought both I1 and I2] divided by
sqrt([the number of customers who bought I1]) * sqrt([the number of customers who bought I2]).
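
A sketch of this offline pass (my own code; it assumes purchases are given as a dict mapping each customer to the set of items they bought):

from collections import defaultdict
from math import sqrt

def item_similarities(purchases):
    bought = defaultdict(int)            # item -> number of buyers
    common = defaultdict(int)            # (I1, I2) -> number of customers who bought both
    for customer, items in purchases.items():
        for i1 in items:
            bought[i1] += 1
            for i2 in items:
                if i2 != i1:
                    common[(i1, i2)] += 1
    # normalized dot product, as described above
    return {(i1, i2): c / sqrt(bought[i1] * bought[i2]) for (i1, i2), c in common.items()}

sims = item_similarities({"alice": {"A", "B"}, "bob": {"A", "B", "C"}})
print(sims[("A", "B")])                  # 2 / sqrt(2 * 2) = 1.0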

Let C be the number of customers, N be the number of items, and K be the average number of items a customer has bought. Thus the total number of purchases P = KC. Assuming (which is certainly not true) that the transactions are uniformly distributed over both customers and items, this takes time O(PK) = O(CK^2) = O(P^2/C). If the distribution is non-uniform then the time required can be substantially larger.

When the customer logs on, the system accesses his purchase record and computes the items that are marked as most similar to the items he has purchased (probably weighted by recency of purchase, though the article does not say so.) This runs in time proportional to the number of items that particular customer has purchased.

The article claims that the advantage of this approach is that the similarity can be computed offline, but of course there's no reason that a user-based similarity measure can't be computed offline. The real difference, I suppose, is that the analogous algorithm over users would take time O(P^2/N). It's also quite likely that departures from uniformity have a more adverse effect here, as the non-uniformity over items is almost certainly much greater than the non-uniformity over users (though the article says specifically that a few users have bought a significant fraction of the items in the catalogue; hard to imagine, unless there are large retail stores stocking inventory from amazon.com, which seems implausible).

Linden et al. say "The click-through and conversion rates [for user-specific recommendations] ... vastly exceed those of untargeted content such as banner advertisements and top-seller lists."

Another advantage of item-based methods is that one can combine the similarity measure based on users with a content-based similarity measure; e.g. on the basis of the catalogue description, either structured or textual. This gets around the new-item problem: the problem that, until someone has ordered or at least looked at an item, there's no basis on which to recommend it. (The dual is the "new user" problem, and the dual fix is to use user profiles; people have done that too, but that's obviously much more problematic in a number of ways.) Systems that have used this hybrid approach are described in Semantically Enhanced Collaborative Filtering on the Web by Bamshad Mobasher, Xin Jin, and Yanzan Zhou, and Building Recommender Systems using a Knowledge Base of Product Semantics by Rayid Ghani and Andrew Fano.

Algorithm 3: Cluster both movies and users.

1. Choose a fixed number #M of movie clusters and a number #U of user clusters.
2. Use algorithm 1A to cluster users and algorithm 2A to cluster movies.
3. repeat {
       for each user UI,
          define a vector UUI[1...#M] s.t.
              UUI[J] = number of movies in the Jth cluster that UI likes;
       cluster the vectors UUI;
       for each movie MJ,
          define a vector MMJ[1...#U] s.t.
              MMJ[I] = number of users in the Ith cluster that like MJ;
       cluster the vectors MMJ;
   }
   until nothing changes.
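
A minimal sketch of this alternation (my own code; k-means, via scikit-learn, stands in for the unspecified clustering step of algorithms 1A and 2A, and L is assumed to be a 0/1 users-by-movies "likes" matrix):

import numpy as np
from sklearn.cluster import KMeans

def cocluster(L, n_user_clusters, n_movie_clusters, n_rounds=10):
    # initial movie clustering based directly on fan-lists (the columns of L)
    movie_cl = KMeans(n_clusters=n_movie_clusters, n_init=10).fit_predict(L.T)
    for _ in range(n_rounds):
        # describe each user by how many liked movies fall in each movie cluster
        UU = np.stack([np.bincount(movie_cl, weights=L[u], minlength=n_movie_clusters)
                       for u in range(L.shape[0])])
        user_cl = KMeans(n_clusters=n_user_clusters, n_init=10).fit_predict(UU)
        # describe each movie by how many of its fans fall in each user cluster
        MM = np.stack([np.bincount(user_cl, weights=L[:, m], minlength=n_user_clusters)
                       for m in range(L.shape[1])])
        new_movie_cl = KMeans(n_clusters=n_movie_clusters, n_init=10).fit_predict(MM)
        # "until nothing changes" -- a rough test, since cluster labels can permute
        if np.array_equal(new_movie_cl, movie_cl):
            break
        movie_cl = new_movie_cl
    return user_cl, movie_cl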

Algorithm 4: Probabilistic clustering. (adapted from Chakrabarti p. 116.)
Assume that there are Z movie clusters and N user clusters.
Let A_{U,C} be the probability that user U belongs to cluster C.
Let B_{M,D} be the probability that movie M belongs to cluster D.
Let P_{C,D} be the probability that a user in cluster C likes a movie in cluster D.
Then the probability that person U likes movie M is proportional to

    Sum_{C,D} A_{U,C} * P_{C,D} * B_{M,D}
Use a hill-climbing technique to find the values of A,B,P that best fit the data. Use these values of A,B,P to predict whether U likes M.
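
As a concrete rendering of that formula (my own sketch, not Chakrabarti's notation): if A, P, and B are stored as matrices, the sum is just a bilinear form.

import numpy as np

# Illustrative only: A is (#users x #user-clusters), P is (#user-clusters x #movie-clusters),
# and B is (#movies x #movie-clusters).
def likes_score(A, P, B, U, M):
    # Sum_{C,D} A[U,C] * P[C,D] * B[M,D], proportional to Prob(U likes M)
    return float(A[U] @ P @ B[M])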

Note: Danger of overfitting with all these parameters to play with.

Evaluation

The paper (Herlocker et al. 2004) is remarkably thorough.

End-user tasks

Offline evaluation vs. Live User evaluation

Suppose you have a data set of preference information, and want to use the dataset to evaluate a CF algorithm. In evaluating classification algorithms (Does the patient have TB? Is the borrower a good credit risk? etc.) the standard procedure is: hide the classification attribute for some of the data instances (the test set), and try to predict the value of I.C (attribute C of instance I) based on the remaining data (the training set) plus the predictive attributes of instance I.

The success rate is the percentage correct (or weighted sum, or precision/recall etc.)

In CF, however, things are murkier:

So, for example, there are a number of different ways to go about picking a test set; these correspond to quite different distributions, and measure quite different kinds of success.

The problem of evaluating Boolean votes

Suppose you have a data set like Amazon's which shows for each user the list of what they've bought (also what they've looked at, but never mind that). How do you do offline evaluation of the recommendation algorithm? There is no really adequate solution. The problem is that there's never any way to detect false positives (recommendations of the system that the user will not actually like). If in an offline test the system recommends item I to user U and the data set shows that U has not bought I, you can't conclude that the recommendation is not correct.

If the recommendation system has been deployed, then one can evaluate to some extent based on the frequency with which a user browses/buys a recommended object. But:
A. This is a "bottom line" measure; in some ways more important than the accuracy, but one would also like to know the accuracy.
B. Obviously deploying a system is an expensive undertaking, and deploying a substandard system may be a very costly one, in terms of user dissatisfaction.

Or one can do systematic experimentation on user groups, but that has its own obvious difficulties.

The sparse data problem

Suppose that the users are rating the items on a scale from 1 to 10, so that non-votes and negative votes are no longer confounded. Still, the data is very sparse (most users have not rated most objects). This raises its own problems for offline evaluation.

1. The recorded votes are generally not a representative sample of all potential votes, since users tend to be more interested in rating items they like than items they dislike. If algorithms are evaluated for their accuracy over the votes recorded in the dataset, this creates a bias in favor of algorithms that tend to give unduly many favorable votes. If you try to fix this by treating a non-vote as a somewhat negative vote, then that creates a bias in favor of algorithms that tend to produce somewhat negative votes.

2. There are other, subtler biases. For example, suppose that item I has only been evaluated by one user U. What should an algorithm do about recommending I to other users? Well, some algorithms will recommend it to users who resemble U, and some will not, but we have no way of measuring which is the better strategy. If [U,I] is in the training set, then we never test whether [U1,I] is a valid recommendation, because we don't have its value for any U1 != U. And if [U,I] is in the test set, then we have no basis for recommending it to U, because we have no evaluations of I in the training set.

Dataset Properties

Domain features: Content topic, User tasks, Importance of novelty, Cost/benefit of false/true positives/negatives, Granularity of actual preferences

Sample features: Density and distribution of ratings.

Available datasets

(As of 2004).
EachMovie dataset. 2.8 million ratings from over 70,000 users.
Extracts of the MovieLens datasets (100,000 ratings and 1 million ratings).
Jester dataset: user ratings of jokes.

The main point is that there are very few such datasets, and those that exist are quite similar in structure (as contrasted with, say, the large number of datasets available for classification or IR).

Numerical Accuracy Measure

Not much new here. Mean-squared error, precision/recall curves, prediction-rating correlation, evaluation of ranked lists. Analysis of the correlation between these various measures: they tend to be fairly well correlated, but with exceptions.
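
For concreteness, a small sketch of the simplest of these measures (my own code; the metric definitions are standard and not taken from the paper). It assumes parallel arrays of predicted and actual held-out ratings.

import numpy as np

def accuracy_measures(predicted, actual):
    predicted, actual = np.asarray(predicted, float), np.asarray(actual, float)
    mae = np.mean(np.abs(predicted - actual))            # mean absolute error
    rmse = np.sqrt(np.mean((predicted - actual) ** 2))   # root mean squared error
    corr = np.corrcoef(predicted, actual)[0, 1]          # prediction-rating correlation
    return mae, rmse, corr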

Beyond accuracy

Coverage:
"Prediction coverage": For what fraction of pairs [U,I] does the system provide a recommendation?
"Catalog coverage": What fraction of items I are recommended to someone?
"Useful coverage" (analogous to recall): What is the likelihood that an item actually useful to a given user will be recommended to him?

Learning rate: How soon can the system start to make recommendations to a user?

Novelty: There is no point in recommending to grocery shoppers that they buy milk, bread, and bananas, or in recommending the Beatles' "White Album" to music shoppers. Distinguish novelty from serendipity: a recommendation is serendipitous if the user would have been unlikely to find the item otherwise.

Strength: How much does the system think the user will like the item? vs.
Confidence: How sure is the system of its own recommendation?

User Interface: Additional material about the item (picture, snippet etc.); explanation of the recommendation (similar items bought by user etc.)

Information Extraction

General idea: To leverage the redundancy of the web against the difficulty of natural language interpretation. Somebody, somewhere, will have stated the fact you want in a form that your program can recognize.

General bootstrapping algorithm:

{ EXAMPLES := seed set of examples of the kind of thing you want to collect.
  repeat { EXAMPLEPAGES := retrieve pages containing the examples in EXAMPLES;
           PATTERNS := patterns of text surrounding the examples in EXAMPLES
                         in EXAMPLEPAGES;
           PATTERNPAGES := retrieve pages containing patterns in PATTERNS;
           EXAMPLES := extract examples from PATTERNPAGES matching PATTERNS
         }
  until (some stopping condition: e.g. enough iterations, enough examples, 
         some measure of accuracy too low, etc.)
  return(EXAMPLES)
}
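
A Python rendering of this loop (my own sketch; search_pages, induce_patterns, and extract_matches are hypothetical placeholders passed in by the caller, since the retrieval and pattern-induction steps are the real work):

def bootstrap(seed_examples, search_pages, induce_patterns, extract_matches,
              max_rounds=5, max_examples=10000):
    examples = set(seed_examples)
    for _ in range(max_rounds):
        example_pages = search_pages(examples)                # pages mentioning the examples
        patterns = induce_patterns(examples, example_pages)   # text surrounding the examples
        pattern_pages = search_pages(patterns)                # pages containing the patterns
        examples |= extract_matches(patterns, pattern_pages)  # harvest new examples
        if len(examples) >= max_examples:                     # one possible stopping condition
            break
    return examples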

KnowItAll (Etzioni et al.)

Task: To collect as many instances as possible of various categories (cities, states, countries, actors, and films).

Domain-independent extractor and assessor rules.

Extractor rule:

Predicate: Class1
Pattern: NP1 "such as" NPList2 
Constraints: head(NP1) = plural(label(Class1))
             properNoun(head(each(NPList2)))
Bindings: Class1(head(each(NPList2)))
E.g., for the class "City" the pattern is "cities such as NPList2", and "cities such as" can be used as a search string. The pattern would match "cities such as Providence, Pawtucket, and Cranston" and would label each of Providence, Pawtucket, and Cranston as a city.
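
As a toy illustration (my own, far cruder than KnowItAll's noun-phrase analysis): a regular-expression version of the "cities such as ..." rule, using capitalization as a stand-in for the properNoun test, so multi-word names are not handled.

import re

def extract_cities(text):
    cities = set()
    pattern = r"cities such as ((?:[A-Z][\w.-]*(?:, and |, | and )?)+)"
    for m in re.finditer(pattern, text):
        for name in re.split(r", and |, | and ", m.group(1)):
            name = name.strip()
            if name:
                cities.add(name)
    return cities

print(extract_cities("cities such as Providence, Pawtucket, and Cranston"))
# {'Providence', 'Pawtucket', 'Cranston'}  (set order may vary)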

Subclass extractors: Look for instances of a subclass rather than the superclass. E.g. it is easier to find people described as "physicists", "biologists", "chemists", etc. than as "scientists".

List extractor rules.

Assessor: A collection of high-precision, searchable patterns. E.g. "[INSTANCE] is a [CATEGORY]" ("Kalamazoo is a city.") There will not be very many of these on the Web, but if there are a few, that is sufficient evidence.

Learning

Learning synonyms for category names: E.g. learn "town" as a synonym for "city"; "nation" as a synonym for "country" etc.
Method: Run the extractor rules in the opposite direction. E.g. Look for patterns of the form "[CLASS](pl.) such as [INST1], [INST2] ..." where some of the instances are known to be cities.

Learning patterns

Some of the best patterns learned:
the cities of [CITY]
headquartered in [CITY]
for the city of [CITY]
in the movie [FILM]
[FILM] the movie starring
movie review of [FILM]
and physicist [SCIENTIST]
physicist [SCIENTIST]
[SCIENTIST], a British scientist

Learning Subclasses

Subclass patterns:
[SUPER] such as [SUB]
such [SUPER] as [SUB]
[SUB] and other [SUPER]
[SUPER] especially [SUB]
[SUB1] and [SUB2] (e.g. "physicists and chemists")

Learning list-pattern extractors

Looks for repeated substructures within an HTML subtree with many instances of the category in a particular place.
E.g. a pattern of the form "<tr> <td> CITY" will detect CITY as the first element of a row in a table. (Wildcards are allowed in the arguments of an HTML tag.) Predict that the leaves of the subtree are all instances.
Hugely effective strategy; increases overall number retrieved by a factor of 7, and increases "extraction rate" (number retrieved per query) by a factor of 40.
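
A much simplified illustration of the list-extraction idea (my own sketch, not KnowItAll's actual wrapper learning), using Python's standard HTML parser to harvest the first cell of every table row:

from html.parser import HTMLParser

class FirstCellExtractor(HTMLParser):
    """Collect the text of the first <td> in each <tr>."""
    def __init__(self):
        super().__init__()
        self.cells, self.in_first_cell, self.cell_seen_in_row = [], False, False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cell_seen_in_row = False
        elif tag == "td" and not self.cell_seen_in_row:
            self.in_first_cell, self.cell_seen_in_row = True, True
    def handle_endtag(self, tag):
        if tag == "td":
            self.in_first_cell = False
    def handle_data(self, data):
        if self.in_first_cell and data.strip():
            self.cells.append(data.strip())

p = FirstCellExtractor()
p.feed("<table><tr><td>Providence</td><td>RI</td></tr>"
       "<tr><td>Kalamazoo</td><td>MI</td></tr></table>")
print(p.cells)    # ['Providence', 'Kalamazoo']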

Results: Found 151,016 candidate cities, of which 78,157 were correct: precision = 0.52. At precision = 0.8, it found 33,000; at precision = 0.9, it found 20,000.