Course Evaluations

Colloquium Tomorrow

COMPUTER SCIENCE COLLOQUIUM
Friday, December 6, 2002
11:30 a.m.
Room 1302 WWH
251 Mercer Street
New York, NY 10012-1185

Speaker: Thomas Funkhouser, Princeton University

Title: A Search Engine for 3D Models

Abstract:

As the number of 3D models available on the Web grows, there is an increasing need for a search engine to help people find them (e.g., a Google for 3D models). Unfortunately, traditional text-based search techniques are not always effective for 3D data. In this talk, we investigate new shape-based search methods. A key challenge is to find a computational representation of shape (a "shape descriptor") that is concise, robust, quick to compute, efficient to match, and discriminating between similar and dissimilar shapes.

In this talk, I will describe shape descriptors designed for computer graphics models commonly found on the Web (i.e., they may contain arbitrary degeneracies and alignments). We have experimented with them in a Web-based search engine that allows users to query for 3D models based on similarities to 3D sketches, 3D models, 2D sketches, and/or text keywords. We find our best shape matching methods provide better precision-recall performance than related approaches and are fast enough to return query results from a repository of 20,000 polygonal models in under a second. You can try them out at: http://shape.cs.princeton.edu.

Recommendations and Collaborative Filtering

Personalization on the Web Alexander Pretschner and Susan Gauch, 1999.

Taxonomy of issues in recommendation (adapted from Pretschner and Gauch, who are looking at a larger class.)

Collaborative Filtering

Database of which users like which movies (an M x N matrix).
Task: Predict whether user U likes movie M.

Algorithm 1: Nearest neighbors over users:
Find K users whose taste in movies is closest to U and who have expressed an opinion on M. Combine their votes about M. (Either simple sum or weighted sum by closeness.)
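
A minimal sketch of Algorithm 1 in Python (the +1/-1/0 ratings matrix, cosine similarity, and the variable names are illustrative assumptions, not taken from any of the papers discussed here):

    import numpy as np

    def predict_user_knn(R, u, m, K=5):
        """Algorithm 1 sketch: weighted vote of the K users most similar to u
        (cosine similarity) among those who have expressed an opinion on movie m.
        R is a users x movies matrix: +1 = likes, -1 = dislikes, 0 = no opinion."""
        def cosine(a, b):
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            return a.dot(b) / denom if denom > 0 else 0.0
        rated = [v for v in range(R.shape[0]) if v != u and R[v, m] != 0]
        nearest = sorted(((cosine(R[u], R[v]), v) for v in rated), reverse=True)[:K]
        score = sum(s * R[v, m] for s, v in nearest)   # sum weighted by closeness
        return score > 0                               # True = predict "U likes M"

    # Toy usage: 4 users x 3 movies; predict user 0's opinion of movie 2.
    R = np.array([[ 1, -1,  0],
                  [ 1, -1,  1],
                  [-1,  1, -1],
                  [ 1, -1,  1]])
    print(predict_user_knn(R, u=0, m=2, K=2))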

Algorithm 1.A: Clustering users
Divide users into clusters, based on their movie preferences.
Take vote on M over U's cluster.

Algorithm 2 (dual of algorithm 1): Nearest neighbors over movies
Find the K movies whose fan-lists are most similar to M's fan-list (presumably, movies that are similar to M in some way) and about which U has expressed an opinion. Combine U's opinions about these K movies (either a simple sum or a sum weighted by closeness).

Algorithm 2.A: Clustering movies
Divide movies into clusters, based on their fan-list.
Average U's votes over M's cluster.

General issues in either of these:

In practice, everyone uses NN over users, not over movies, and for a good reason: there are many more users than movies. Hence NN (or any other learning algorithm) over users learns in a lower-dimensional space (one dimension per movie) from a larger training set (one example per user), while learning over movies takes place in a higher-dimensional space from a smaller training set.

On the other hand, it is not impossible that learning over movies could work better than learning over users. Consider a dataset constructed as follows:
1. Create a user x movie matrix and assign ratings at random.
2. Duplicate each movie (so each movie's column appears twice).
Then 1-NN over movies predicts ratings perfectly (each movie's duplicate is an exact match), while the users are entirely uncorrelated.

Algorithm 3: Cluster both movies and users.

1. Choose a fixed number #M of movie clusters and a fixed number #U of user clusters.
2. Use Algorithm 1.A to cluster the users and Algorithm 2.A to cluster the movies.
3. repeat {
       for each user UI,
           define a vector UUI[1...#M] s.t.
               UUI[J] = number of movies in the Jth movie cluster that UI likes;
       cluster the vectors UUI;
       for each movie MJ,
           define a vector MMJ[1...#U] s.t.
               MMJ[I] = number of users in the Ith user cluster that like MJ;
       cluster the vectors MMJ;
   } until nothing changes.
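
A sketch of the alternating loop in Python, using k-means for each clustering step (k-means itself, the 0/1 likes matrix L, and the fixed iteration cap are my assumptions; the notes do not specify the clustering method):

    import numpy as np
    from sklearn.cluster import KMeans

    def cocluster(L, num_movie_clusters, num_user_clusters, iters=10):
        """Algorithm 3 sketch. L is a 0/1 users x movies matrix (1 = likes).
        Alternately re-describe users by counts over the current movie clusters
        and movies by counts over the current user clusters, then re-cluster."""
        n_users, n_movies = L.shape
        # Initial clusterings from the raw rows/columns (Algorithms 1.A / 2.A).
        user_cl = KMeans(n_clusters=num_user_clusters, n_init=10).fit_predict(L)
        movie_cl = KMeans(n_clusters=num_movie_clusters, n_init=10).fit_predict(L.T)
        for _ in range(iters):
            # UU[i, j] = number of movies in movie cluster j that user i likes.
            UU = np.array([[L[i, movie_cl == j].sum() for j in range(num_movie_clusters)]
                           for i in range(n_users)])
            new_user_cl = KMeans(n_clusters=num_user_clusters, n_init=10).fit_predict(UU)
            # MM[m, c] = number of users in user cluster c that like movie m.
            MM = np.array([[L[new_user_cl == c, m].sum() for c in range(num_user_clusters)]
                           for m in range(n_movies)])
            new_movie_cl = KMeans(n_clusters=num_movie_clusters, n_init=10).fit_predict(MM)
            # "Until nothing changes" -- note that cluster labels can be permuted
            # between runs, so a real implementation would compare the partitions.
            if (new_user_cl == user_cl).all() and (new_movie_cl == movie_cl).all():
                break
            user_cl, movie_cl = new_user_cl, new_movie_cl
        return user_cl, movie_cl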

Algorithm 4: Probabilistic clustering. (Adapted from Soumen Chakrabarti, Mining the Web, p. 116.)
Assume that there are M movie clusters and N user clusters.
Let A_{U,C} be the probability that user U belongs to cluster C.
Let B_{M,D} be the probability that movie M belongs to cluster D.
Let P_{C,D} be the probability that a user in cluster C likes a movie in cluster D.
Then the probability that user U likes movie M is proportional to

    sum over C,D of A_{U,C} * P_{C,D} * B_{M,D}

Use a hill-climbing technique to find the values of A, B, P that best fit the data. Use these values of A, B, P to predict whether U likes M.

Note: Danger of overfitting with all these parameters to play with.
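
A toy illustration of the scoring formula and of one possible hill climb in Python (the squared-error objective, the random-perturbation scheme, and the unnormalized "probabilities" are all my assumptions; the source says only "hill climbing"):

    import numpy as np

    rng = np.random.default_rng(0)

    def score(A, P, B):
        """score[u, m] proportional to sum over C,D of A[u,C] * P[C,D] * B[m,D]."""
        return A @ P @ B.T

    def fit(L, n_user_cl=3, n_movie_cl=3, steps=2000, step_size=0.05):
        """Crude hill climbing: perturb A, P, B at random and keep the change
        whenever the squared error against the observed 0/1 matrix L improves."""
        n_users, n_movies = L.shape
        A = rng.random((n_users, n_user_cl))
        B = rng.random((n_movies, n_movie_cl))
        P = rng.random((n_user_cl, n_movie_cl))
        best_err = ((score(A, P, B) - L) ** 2).sum()
        for _ in range(steps):
            A2, P2, B2 = (np.clip(X + step_size * rng.standard_normal(X.shape), 0, 1)
                          for X in (A, P, B))
            err = ((score(A2, P2, B2) - L) ** 2).sum()
            if err < best_err:
                A, P, B, best_err = A2, P2, B2, err
        return A, P, B

    # Toy usage: a 6-user x 8-movie 0/1 likes matrix.
    L = rng.integers(0, 2, size=(6, 8))
    A, P, B = fit(L)
    print(score(A, P, B)[0, 0])   # predicted score that user 0 likes movie 0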

Algorithm 5: Weighted majority
Imagine a space of experts. Each expert implements a very simple recommendation strategy. Let R be a number (the learning rate) between 0 and 1. If an expert E has thus far been right P times and wrong Q times, then the credibility of E is (2-R)^P * R^Q.
(?? Doesn't seem right, as this can go to zero even for experts with some informational value. I would have thought (1/R)^P * R^Q. The above is the formula from Cohen and Fan.)

Sum the experts' votes weighted by credibility; recommend the item with the highest score.
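
A sketch of the weighted-majority idea in Python, simplified to binary votes on a stream of items and using the credibility formula quoted above (the expert and outcome representations are assumptions):

    def weighted_majority(experts, outcomes, R=0.5):
        """Each expert is a function from an item to a vote in {+1, -1}.
        Credibility follows the formula quoted above, (2-R)^P * R^Q
        (see the caveat about that formula)."""
        right = [0] * len(experts)
        wrong = [0] * len(experts)
        predictions = []
        for item, truth in outcomes:            # truth in {+1, -1}
            votes = [e(item) for e in experts]
            creds = [(2 - R) ** r * R ** w for r, w in zip(right, wrong)]
            total = sum(c * v for c, v in zip(creds, votes))
            predictions.append(+1 if total >= 0 else -1)
            for i, v in enumerate(votes):       # update each expert's record
                if v == truth:
                    right[i] += 1
                else:
                    wrong[i] += 1
        return predictions

    # Toy usage: two constant experts scored on three items.
    experts = [lambda item: +1, lambda item: -1]
    outcomes = [("a", +1), ("b", +1), ("c", -1)]
    print(weighted_majority(experts, outcomes))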

Evaluation

Complex, because the "training set" has three parts: the movies that U likes, the users that like M, and all the other data.

Also, some users are much easier to recommend for than others. Users who like popular music (by definition, there are a lot of these) are much easier to recommend for than users with odd tastes. Users who like a lot of music are easier to recommend for than users who are picky. Hence, there is a lot of variance in success rates from one user to another.

Web Collaborative Filtering: Recommending Music by Spidering the Web William W. Cohen and Wei Fan, 2000.

Input: Downloads of music at ATT Labs over 3 months from server log.
Test set = IP addresses that first connected in August. Training set = rest of data.
User U likes artist A iff U has downloaded at least one file of A's music.
Thus the data are Boolean and positive-only. (Problematic, since many users download only a few files.)

Test set: 23,438 downloads to 1028 IP addresses. 981 artists.
Training set: 5095 downloads to 353 IP addresses.
The usual difficulty of identifying a user with an IP address applies.

Evaluation:

Simulate new user U from the time he enters.
Conceptually, each time the user is recorded in the log as downloading a file, you imagine that you compute a recommendation for him. If he "accepts" the recommendation -- i.e. it is recorded further on in the log that he downloaded this same artist -- then that's good; if he "rejects" the recommendation -- he never downloads this artist -- that's bad.
AA := set of artists that U downloads in the log;
LEN := 0;
repeat, working forward through the log, {
    let T be the next time U downloads a file;
    based on log data up to T, generate a set of recommendations RR;
    pick an artist A in RR (that has not yet been recommended or downloaded?)
        at random;
    if A in AA then LEN++    /* count of accepted recommendations */
} until all artists in AA have either been recommended or downloaded.
Note: the number of different artists downloaded by any one user is small: median = 10.
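
The trial-by-trial simulation above, rendered in Python (the log format, the recommend() function, and random tie-breaking are illustrative assumptions, not from the paper):

    import random

    def simulate_user(log, user, recommend, rng=random.Random(0)):
        """log is a chronological list of (user, artist) download events;
        recommend(history) returns a set of artists given the events so far.
        Returns the list of accept/reject results (True = accepted)."""
        AA = {a for u, a in log if u == user}   # artists the user ever downloads
        seen = set()                            # recommended or downloaded so far
        results = []
        for t, (u, artist) in enumerate(log):
            if u != user:
                continue
            seen.add(artist)
            if AA <= seen:                      # all of AA recommended or downloaded
                break
            candidates = [a for a in recommend(log[:t]) if a not in seen]
            if not candidates:
                continue
            rec = rng.choice(candidates)
            seen.add(rec)
            results.append(rec in AA)           # accepted iff the user downloads it in the log
        return results

    def accuracy_up_to(results, K):
        """Per-user accuracy up to the Kth trial (the paper then averages this
        over all users who make at least K trials)."""
        return sum(results[:K]) / K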

Accuracy up to the Kth trial = (number of recommendations accepted within the first K trials) / K. Average over all users who download at least K files.

Plot accuracy up to the Kth trial against K. It tends to increase, for two reasons:
A. More information is available about the users' tastes.
B. The pool of users being averaged over continually changes: at trial K, we are dealing only with users who download at least K files, who are therefore easier to match.

Overall accuracy for U: Number of recommendations accepted / Number of recommendations made.

Any such evaluation method based on user logs necessarily underestimates accuracy, systematically and hugely: a recommendation is counted as rejected whenever the user never downloads that artist later in the log, even if he would in fact have liked it had he encountered it.

Methods based on user logs

Methods based on a domain-specific spider (or, more precisely, a large, manually constructed index)

The spider traverses Allmusic.com and extracts the "genre" and "musical style" of each artist, plus its related artists.
"Genre" = large category: e.g. jazz, country.
"Musical style" = small category e.g. "country boogie".
B is related to A if A's page has a reference to B.
(Nothing clever here: Allmusic.com has all this information laid out neatly.)

Genre method:
Construct "pseudo-user" for each genre, that ranks every artist in the genre positively and every artist not in the genre negatively. Use K-NN.

Related artist method:
For each artist A, construct a pseudo-user that rates A and every artist related to A positively, and all other artists negatively. Use K-NN.
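
A sketch of both pseudo-user constructions in Python (the dictionaries for genres and related-artist links are assumed data structures; the resulting pseudo-users would simply be added to the pool of real users before running K-NN as in Algorithm 1):

    def genre_pseudo_users(genres, all_artists):
        """One pseudo-user per genre: +1 for every artist in the genre,
        -1 for every other artist. genres maps a genre name to a set of artists."""
        return {g: {a: (+1 if a in members else -1) for a in all_artists}
                for g, members in genres.items()}

    def related_artist_pseudo_users(related, all_artists):
        """One pseudo-user per artist A: +1 for A and every artist A's page
        refers to, -1 for all others. related maps an artist to a set of artists."""
        return {a: {b: (+1 if b == a or b in related.get(a, set()) else -1)
                    for b in all_artists}
                for a in all_artists}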

Method using general-purpose search engines (somewhat simplified).

General idea: Look for lists of artists on the Web; use each such list as a pseudo-user in K-NN.

Phase 1:
1. For each artist A (in the log? not clear), call AltaVista with the artist's name. Retrieve the top 100 URLs.
2. Extract the URLs that appear at least twice (i.e., for two different artists).
3. Apply list-extractor.
Results: 2519 pseudo-users with 13527 positive ratings.

List extractor: Look for the same position in the HTML structure (e.g. http://lusers.com/~biff+body.ul.li) associated with text that is close to the name of an artist.

Problem with the Phase 1 results: it tends to weight all artists equally. It would do better to weight popular artists more heavily than unpopular artists.

Phase 2:
1. Find the 1000 pairs of entity names that co-occur most frequently in the output of Phase 1.
2. Search Northern Light for each such pair of names. Retrieve the top 10 URLs.
3. Apply the list extractor.
Results: 1894 pseudo-users. 48,878 positive ratings.

Combined methods

Various combinations of real users and pseudo-users.

Results

Comparing accuracy after K trials: ACC(K). (These figures are eyeballed off a graph.)

Random choice: accuracy = 0.02.

POP: ACC(0)=0.15. ACC(7)=0.17. ACC(25)=0.22. ACC(50)=0.29.
K-NN and XDB are better than POP; statistically significant.
K-NN: ACC(0)=0.15, ACC(7)=0.19. ACC(25)=0.25. ACC(50)=0.32
WM is slightly worse (not significantly) than POP.

Genres and Related artists are much worse than the user-log based methods.
Genres: ACC(1) = 0.04, ACC(7) = 0.07. ACC(25)=0.11. ACC(50)=0.17.
Related artists starts out at ACC(1)=0.14 (why?), but declines sharply, and converges to same behavior as genres.

Spider: Better than related artists, not as good as User logs. Phase 2 is somewhat better than phase 1.
SPIDER phase 2: ACC(1)=0.14. ACC(7)=0.12. ACC(25)=0.18. ACC(50)=0.25
SPIDER phase 1: ACC(1)=0.01. ACC(7)=0.08. ACC(25)=0.15. ACC(50)=0.22

Moral: Recommendations based on web searches can be effective, but are not as effective as recommendations based on user data. Maybe.
Doubts:

Domain-Specific Search Engines

DEADLINER: Search engine for conference announcements

DEADLINER: Building a New Niche Search Engine Andries Kruger et al. 2000

Objectives: (A) to build a tool; (B) to work toward a toolkit.

Document retrieval

Filtering

Finding proper paragraph

Rich collection of individually weak binary filters. A Neyman-Pearson procedure finds rules for combining K filters that optimally trade off precision vs. recall.
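
One simple way to realize such a combination (assuming, purely for illustration, conditionally independent filters and a labeled set of paragraphs): score each paragraph by a likelihood ratio over its filter outputs and sweep a threshold to trade recall against false positives.

    import numpy as np

    def likelihood_ratio_scores(F, y, F_new, eps=1.0):
        """Neyman-Pearson style combination: score each new paragraph by
        log P(filter outputs | positive) - log P(filter outputs | negative),
        treating the K binary filters as independent (an illustrative assumption).
        F is an examples x filters 0/1 matrix, y is a 0/1 label vector."""
        pos, neg = F[y == 1], F[y == 0]
        # Smoothed per-filter firing rates under each class.
        p1 = (pos.sum(axis=0) + eps) / (len(pos) + 2 * eps)
        p0 = (neg.sum(axis=0) + eps) / (len(neg) + 2 * eps)
        return (F_new @ (np.log(p1) - np.log(p0))
                + (1 - F_new) @ (np.log(1 - p1) - np.log(1 - p0)))

    # Threshold choice: take the largest threshold whose false-positive rate on
    # held-out negatives stays under the target (e.g. 2%), and report the recall
    # achieved on held-out positives at that threshold.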

Sample Filters:

Title:
Match keywords in paragraph.
Match "on" in paragraph.
At most 30 words in paragraph. (Similar filters for other counts)
At least 75% of words are capitalized.
Change in indentation from previous/next paragraph
Date found in this paragraph.
Date found within next 5 paragraphs.
One of the first 7 paragraphs
Country name found in this paragraph
At least 20 white-space or non-alphanumeric characters (??)

Deadline:
Match the words "deadline", "by", "before", "later than", "on", "submit", "paper", ":", "important date", "deadline qualifiers" (each of these is a separate filter).
Find a date.

Representation of filter: Regular expression augmented with counting operators, arithmetic operators, format detection operators.

min(indentation(cur_par)) != min(indentation(prev_par))
#(cur_par, /[A-Z]\w*@+/) / #(cur_par, /\w+@+/) >= 0.75
(The number of capitalized words divided by the number of words is >= 0.75. Here @ denotes whitespace and #(p, r) counts matches of regular expression r in paragraph p.)
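
The capitalization filter, roughly, as a Python check (the translation of the paper's @ and # operators into Python regular expressions is mine):

    import re

    def mostly_capitalized(paragraph, threshold=0.75):
        """True if at least `threshold` of the whitespace-delimited words in the
        paragraph start with a capital letter (cf. the filter written above)."""
        words = re.findall(r"\S+", paragraph)
        if not words:
            return False
        capitalized = [w for w in words if re.match(r"[A-Z]\w*", w)]
        return len(capitalized) / len(words) >= threshold

    print(mostly_capitalized("Call For Papers: WWW 2003 Conference"))   # True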

Extracting information from paragraph

Specialized heuristics. E.g.

Program committee

Performance

SVM Filtering.
Corpus: 592 Calls for Papers, 2269 negative examples, mostly conference related pages.
Training set: 249 positive, 1250 negative examples.
Test set: 343 positive, 1019 negative examples.
Results: Linear SVM: Positive accuracy = 88.1%. Negative accuracy = 98.7%.
Gaussian SVM: Positive accuracy = 95.9%. Negative accuracy = 98.6%.
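
A sketch of this kind of CFP filter with scikit-learn (the bag-of-words features and the variable names are assumptions; the paper's actual feature representation is not reproduced here):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    def train_cfp_filter(docs_train, labels_train, kernel="rbf"):
        """Bag-of-words + SVM classifier over page texts.
        kernel='linear' or 'rbf' (Gaussian); labels are 1 = CFP, 0 = other.
        docs_train / labels_train are hypothetical stand-ins for the corpus above."""
        model = make_pipeline(TfidfVectorizer(), SVC(kernel=kernel))
        model.fit(docs_train, labels_train)
        return model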

Finding correct paragraph
The paper provides only a hard-to-read graph, with no exact figures. But doing the best I can:

For deadline:
Best single filter at 2% false positives gives 88% recall rate. (They don't say which this is, but probably the word "deadline".)
Best 4 filters combined at 2% false positives gives 92% recall rate.

Program committee detection is even more successful.

Information extraction
Target                       Total   Detected/Extracted   Detected/Not Extracted   Extraneous
Deadline                       300                  214                        2           31
Committee and Affiliation     1455                 1252                       72          136

CORA

Building Domain-Specific Search Engines with Machine Learning Techniques Andrew McCallum et al. 1999

CORA -- a domain-specific search engine for computer science research papers.
(Not currently running.) 33,000 papers as of 1999; 50,000 as of some later date.

Stage 1: Spidering.
Stage 2: Classification into Hierarchy.
Stage 3: Information extraction
Stage 4: User Interface: Keyword search, Field search, Hierarchy

Spidering

Starting crawl from home pages of CS departments and labs.

Learn to choose which links to follow based on the words surrounding the anchor.
1. Assign values to the pages spidered so far by discounted future reward (reinforcement learning).
2. Use Naive Bayes to predict the discretized values from the bag of words surrounding the anchor.
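
A sketch of the two steps in Python (the crawl-path representation, the discount factor, and the value bins are my assumptions):

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def discounted_values(rewards, gamma=0.5):
        """Value of each position on a crawl path = discounted sum of future
        rewards (reward 1 when the fetched page is an on-topic paper, else 0)."""
        values = np.zeros(len(rewards))
        future = 0.0
        for i in range(len(rewards) - 1, -1, -1):
            future = rewards[i] + gamma * future
            values[i] = future
        return values

    def train_link_scorer(anchor_texts, values, bins=(0.1, 1.0)):
        """Naive Bayes from the bag of words around each anchor to a discretized
        value bin; at crawl time, prefer links predicted to fall in high bins."""
        labels = np.digitize(values, bins)
        model = make_pipeline(CountVectorizer(), MultinomialNB())
        model.fit(anchor_texts, labels)
        return model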

Experimental results: Finds relevant papers in test set much more quickly than breadth-first search. E.g. finds 75% of relevant papers after searching 11% of links vs. 30% of links for breadth-first search.
Training set = about 40,000 pages, 450,000 links.

Hierarchical classification

Hand constructed fixed hierarchy with keywords. 51-leaf hierarchy with a few keywords per node. Construction took 3 hours per node, examining conference proceedings and CS websites.

Classification:

Information extraction

Estimate the probability that a given string in the text expresses a particular field using a Hidden Markov Model.

Collect 500 research papers.
Manually tag 500 references with one of 13 classes: title, author, institution, etc. Use 300 as training set, 200 as test set.
Construct HMMs using a series of increasingly complex techniques.
Final accuracy: about 93%.
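
A sketch of the decoding step in Python: given an HMM whose parameters have been estimated from the hand-tagged references, Viterbi decoding assigns a field label to each token of a new reference (the table and callable representations of the parameters are my assumptions):

    import numpy as np

    def viterbi(tokens, states, log_start, log_trans, log_emit):
        """Most likely sequence of field labels (title, author, institution, ...)
        for the tokens of one reference. log_start[s], log_trans[s][t], and
        log_emit[s](token) are log-probabilities estimated from training data."""
        n, m = len(tokens), len(states)
        score = np.full((n, m), -np.inf)
        back = np.zeros((n, m), dtype=int)
        for j, s in enumerate(states):
            score[0, j] = log_start[s] + log_emit[s](tokens[0])
        for i in range(1, n):
            for j, s in enumerate(states):
                cands = [score[i - 1, k] + log_trans[states[k]][s] for k in range(m)]
                back[i, j] = int(np.argmax(cands))
                score[i, j] = cands[back[i, j]] + log_emit[s](tokens[i])
        path = [int(np.argmax(score[-1]))]          # best final state
        for i in range(n - 1, 0, -1):               # follow back-pointers
            path.append(int(back[i, path[-1]]))
        return [states[j] for j in reversed(path)]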