Lecture 9: Result Diversity and Query Log Mining

Result Diversity

Optional reading

Increasing Diversity in Web Search Results, Dimitrios Skoutas, Enrico Minack, and Wolfgang Nejdl, Web Science 2010.

Diversifying Search Results, Rakesh Agrawal et al., WSDM 2009.

Similar problem to clustering, with similar techniques and evaluation measures. The differences are:

Measures of diversity: Let q be query; Rel(d,q) be the relevance of doc d to q; L a positive parameter.

Skoutas: Distance-based. Choose points far apart but close to the query. Choose the set S that maximizes
    L * min_{d in S} Rel(d,q) + min_{d1,d2 in S} Dist(d1,d2).
Problem: This favors extreme documents.
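
As a sketch (the variable names and the brute-force search are mine, not Skoutas's), the distance-based objective L * min_{d in S} Rel(d,q) + min_{d1,d2 in S} Dist(d1,d2) can be evaluated over candidate subsets directly:

```python
from itertools import combinations

def distance_objective(S, rel, dist, L):
    """L * (min relevance in S) + (min pairwise distance in S)."""
    min_rel = min(rel[d] for d in S)
    min_dist = min(dist[d1][d2] for d1, d2 in combinations(S, 2))
    return L * min_rel + min_dist

def best_subset(docs, rel, dist, L, k):
    # Brute force over all size-k subsets; a practical system would use a
    # greedy approximation instead.
    return max(combinations(docs, k),
               key=lambda S: distance_objective(S, rel, dist, L))
```

With documents at positions 0, 0.05, 1.0, 2.0 on a line and relevances 1.0, 0.9, 0.8, 0.1, the winner with L = 1 pairs the most relevant document with the barely relevant outlier at 2.0 — exactly the "extreme documents" problem noted above.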

Skoutas: Coverage-based. Make sure that every relevant document is close to some returned document.
Choose the set S that minimizes
    max_{d relevant to q} Rel(d,q)^L * Dist(d, nearest doc in S)
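
A sketch of the coverage-based objective (again brute force, with my own data layout): the cost of a result set S is the worst-covered relevant document, weighted by its relevance raised to the power L.

```python
from itertools import combinations

def coverage_cost(S, rel, dist, L):
    # max over relevant d of Rel(d,q)^L * Dist(d, nearest doc in S)
    return max(rel[d] ** L * min(dist[d][s] for s in S) for d in rel)

def best_cover(docs, rel, dist, L, k):
    # Choose the size-k set minimizing the coverage cost (brute force).
    return min(combinations(docs, k),
               key=lambda S: coverage_cost(S, rel, dist, L))
```

Note that when some relevant document must be left uncovered, it is the least relevant one that gets skipped.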

Agrawal: Choose set S that maximizes the probability that a searcher will find some document relevant to their information need. Note that this is weighted by number of queries in each category, not by number of documents.
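
A greedy sketch in the spirit of this objective (the data structures and probabilities are illustrative, not from the paper): assuming each document satisfies an intent category independently with some probability, repeatedly pick the document with the largest marginal probability of satisfying a still-unsatisfied intent.

```python
def diversify(docs, cat_weight, quality, k):
    # cat_weight[c]: P(category c | query), estimated from query-log counts.
    # quality[d][c]: P(document d satisfies a user with intent c).
    residual = dict(cat_weight)   # probability mass not yet covered per category
    S = []
    for _ in range(k):
        best = max((d for d in docs if d not in S),
                   key=lambda d: sum(residual[c] * quality[d].get(c, 0.0)
                                     for c in residual))
        S.append(best)
        # Discount each category by the chance this document already covered it.
        for c in residual:
            residual[c] *= 1.0 - quality[best].get(c, 0.0)
    return S
```

With intent A at weight 0.8 and B at 0.2, a second good A-document loses out to a B-document: categories are weighted by query mass, not by document count.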

Query Log Mining

Important reading

Mining query logs: Turning search usage data into knowledge, Fabrizio Silvestri. Book-length survey. Lots of material. The technical presentation is not always very clear, so you often have to go back to the original paper. But generally an amazing source.

Data Preparation for Mining World Wide Web Browsing Patterns Cooley, Mobasher, and Srivastava

Mining Web Logs to Improve Website Organization Ramakrishnan Srikant and Yinghui Yang

Query expansion using associated queries Bodo Billerbeck et al., CIKM 2003.

Search advertising using web relevance feedback Andrei Broder et al., CIKM 2008.

Data sources

(Cooley, Mobasher, and Srivastava)

Server Log

Most common source.

Difficulties and limitations

Reconstructing the search path: Pages previously looked at are cached on the client side and not recorded in the server log. Hence, if the user requests page A and has previously requested a page C with a link to A, infer that the user has returned to C (via the Back button) and followed the link to A.
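
A sketch of this path-completion heuristic (simplified from Cooley et al.; the data format is mine): given the logged request sequence and the site's link graph, insert the inferred cached revisits.

```python
def complete_path(requests, links):
    # links[p]: set of pages that page p links to.
    path = []
    for page in requests:
        if path and page not in links.get(path[-1], set()):
            # The last page doesn't link here, so the user must have gone
            # back (through cached pages) to an earlier page that does.
            for prior in reversed(path[:-1]):
                path.append(prior)          # inferred revisit via Back button
                if page in links.get(prior, set()):
                    break
        path.append(page)
    return path
```

For the example above: if the log shows C, D, A and only C links to A, the reconstructed path is C, D, C, A.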

Server Application

Integrating E-Commerce and Data Mining: Architecture and Challenges S. Ansari et al.

Advantages: Record [a description of] dynamically generated content. Sessionize and identify users using cookies. Save information missing from server logs: the Stop button, the local time of the user, the speed of the user's connection.

Disadvantage: Have to either write or modify web server code.

Proxy server logs

Advantage: many servers, many users (though generally not a representative collection of users).

Disadvantage: Low quality information.

Client level

Induce client to use "bugged" browser. Get all the information you want (though the analysis of content is generally easier at the server side.)

Better yet, you can bug some of their other programs as well and get even more information. E.g. if you monitor the browser, email, and text editing programs, you can see how often the user is stuffing information from the browser into email and text files.


Any usage collection runs into privacy issues; the more complete the data, the more serious the issue.

Information to be extracted

Statistical measures


(Mostly but not entirely from Silvestri)

Information about queries

Distribution: Follows a power-law distribution with exponent 2.04. That is, the Kth most popular query has frequency proportional to (K+C)^(-2.04), where C is a constant. A small number of queries accounts for most of the volume; a large number of rare queries forms the long tail.
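
A quick numeric illustration (the value of C and the cutoffs are my choices):

```python
def query_shares(n_queries, exponent=2.04, C=0.0):
    # Frequency of the k-th most popular query is proportional to (k+C)^-exponent.
    weights = [(k + C) ** -exponent for k in range(1, n_queries + 1)]
    total = sum(weights)
    return [w / total for w in weights]

shares = query_shares(1_000_000)
head = sum(shares[:100])        # traffic share of the 100 most popular queries
tail = sum(shares[100_000:])    # combined share of the 900,000 rarest queries
```

Even though the tail contains 90% of the distinct queries, it carries only a sliver of the total volume.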

Subject matter: Remarkably unstable over time. See Silvestri.

Spatial demographic tracking. E.g. Famous Google tracking of swine flu of 2009 in advance of CDC. Google trends.

Associations (Cooley etc.)

Search patterns and paths.

Mining techniques

Statistical analysis, classification techniques, clustering, Markov models, sequential patterns.

Association rules:
Examples from 1996 Olympics Web site: (Cooley et al.)
Indoor volleyball => Handball (Confidence: 45%)
Badminton, Diving => Table Tennis (Confidence: 59.7%)

Sequential patterns: Atlanta home page followed by Sneakpeek main page (Support: 9.81%) Sports main page followed by Schedules main page (Support: 0.42%)
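
Support and confidence over user sessions can be computed directly; a minimal sketch (the session data is invented for illustration):

```python
def support(sessions, itemset):
    # Fraction of sessions containing every item in the itemset.
    itemset = set(itemset)
    return sum(itemset <= s for s in sessions) / len(sessions)

def confidence(sessions, lhs, rhs):
    # Confidence of the rule lhs => rhs: P(rhs | lhs) over sessions.
    lhs, both = set(lhs), set(lhs) | set(rhs)
    n_lhs = sum(lhs <= s for s in sessions)
    return sum(both <= s for s in sessions) / n_lhs
```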

User characteristics

General characteristics: Users look only at the first page of results 78% of the time. The conditional probability that they look at the (K+1)st page, given that they have looked at the Kth, is an increasing function of K.

Statistics on re-retrieval: submitting the same query some time apart.

Relate web activities to user profile.
E.g. 30% of users who ordered music are 18-25 and live on the West Coast.

Query expansion

Adding terms to a query to gain precision. Variant of relevance feedback.

Digression on relevance feedback Old IR technique. (MR&S, chapter 9)

[Standard] relevance feedback: The user issues a query, gets documents back, and marks some of them as relevant (or is observed clicking through to some of them). Words that are common in the relevant documents are added to the query, and the modified query is reissued.

In a Boolean model, OR'ing them improves recall and AND'ing them improves precision. In a vector model adding these tends to improve both recall (because a new relevant document may match words in the marked documents which were not in the query, such as synonyms) and precision (because words associated with unintended meanings of the query words, or unintended classes of documents will tend to be disfavored).
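
The standard vector-model realization of this is Rocchio's formula; a sketch over toy term vectors (the alpha, beta, gamma values are conventional defaults, not from the lecture):

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # New query vector = alpha*q + beta*(centroid of relevant docs)
    #                            - gamma*(centroid of nonrelevant docs).
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)
    new_q = {}
    for t in terms:
        centroid_rel = sum(d.get(t, 0) for d in relevant) / max(len(relevant), 1)
        centroid_non = sum(d.get(t, 0) for d in nonrelevant) / max(len(nonrelevant), 1)
        w = alpha * query.get(t, 0) + beta * centroid_rel - gamma * centroid_non
        new_q[t] = max(w, 0.0)   # common practice: clip negative weights to zero
    return new_q
```

Terms from relevant documents (e.g. synonyms the user didn't type) gain weight, while terms tied to unintended senses are pushed toward zero.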

[Ordinary] pseudo-relevance feedback or blind relevance feedback: There is no actual feedback from the user; rather, the common words in all the retrieved documents are added to the query and the query is reissued.

This can also often improve retrieval. Suppose that 50% of the documents retrieved are relevant to the intended subject and 50% are irrelevant, but deal with a variety of different subjects. (This is particularly apt to happen with multi-word queries.) Then the words that are common in many of the retrieved documents will be those that are actually relevant to the intended subject, and again, in a vector model, both recall and precision will go up in the modified query.

This has to be used with care with web search engines, because they are generally less happy with very long queries than conventional IR engines are.

End of digression Getting back to query logs. Billerbeck et al. do query expansion using associated queries:

Offline: With each document D, associate the list of query words Q for which D is on the results page. (Note that this is different from the words in D, because these are drawn from the space of queries that people actually ask).

Query time: Retrieve the top documents for the query; take expansion terms from the queries associated with those documents, rather than from the document text as in ordinary pseudo-relevance feedback, and reissue the expanded query.
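
A sketch of the two phases (the data structures are my own simplification of the idea):

```python
from collections import Counter, defaultdict

def build_associations(log):
    # Offline: log is an iterable of (query, [result_doc_ids]) pairs.
    # Map each document to the queries for which it was retrieved.
    assoc = defaultdict(list)
    for query, results in log:
        for doc in results:
            assoc[doc].append(query)
    return assoc

def expand(query, top_docs, assoc, n_terms=3):
    # Query time: expand with the words most common in the associated
    # queries of the current top-ranked documents.
    counts = Counter()
    for doc in top_docs:
        for q in assoc.get(doc, []):
            counts.update(w for w in q.split() if w not in query.split())
    return query.split() + [w for w, _ in counts.most_common(n_terms)]
```

Because the expansion terms come from queries people actually asked, they tend to be in the searchers' vocabulary rather than the documents' vocabulary.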

Query suggestion

Suggest an alternative query to the user. If many previous users first issued Q1 and then Q2, and the current user has issued Q1, suggest Q2. There are various ways to implement this.
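
One simple realization (the threshold is my choice): count consecutive query pairs within sessions and suggest the most frequent follow-up.

```python
from collections import Counter

def suggestion_table(sessions, min_count=2):
    # sessions: list of query sequences, one per user session.
    follows = Counter()
    for queries in sessions:
        for q1, q2 in zip(queries, queries[1:]):
            if q1 != q2:
                follows[(q1, q2)] += 1
    best = {}
    for (q1, q2), n in follows.items():
        # Keep only the most frequent follow-up for each query.
        if n >= min_count and n > best.get(q1, ("", 0))[1]:
            best[q1] = (q2, n)
    return {q1: q2 for q1, (q2, _) in best.items()}
```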

Personalized results

Rerank search results according to user tastes as gathered from a query log. The method described in Silvestri, from Liu, is very complicated, but basically you have a fixed set of categories, you categorize web pages off-line, you learn which categories a user is interested in using a moving time window (because tastes change), and you give extra ranking points to categories the user likes.

An alternative, by Boydell and Smyth, is to rerank on the client side using the information in the snippets rather than the full documents. Safer in terms of privacy.

Learning the ranking function / Search engine evaluation

Essentially, learning comes down to evaluation. I discussed this in the evaluation lecture, but Silvestri has some more details.

Interesting statistic from Joachims and Radlinski: In a test on results from a search engine, people click on the first result 40% of the time and on the second about 10% of the time. If you swap the top two results, then they click on the first 30% of the time and still click on the second 10% of the time. Eyetracking shows that they're looking at both, so it isn't that they aren't looking at the second.

The point is that there is a strong bias toward clicking on results early in the page, so in evaluation, it is not even close to safe to say that a page is more relevant if it is clicked on than if it is not. If you do that, you are just endorsing whatever order the search engine has already generated.

What you can say is that if A comes after B on the results page, and the user clicks on A but not B then on average A is more relevant than B. Using that fact, or variants, you can come up with an evaluation of search engine results from query logs, and thus with weights on the ranking function that optimize that evaluation.

Query spelling correction

Query spelling correction is similar to query suggestion. If query Q1 is frequently followed by query Q2 in the query logs, and Q1 is plausibly a misspelling of Q2, then in future suggest Q2 as a correction for Q1. "Plausibly a misspelling" can be much looser here than in a context without query logs. Phonetic similarity. Note that the correction may involve changing the number of words; users may have combined two words or split a single word.


I. What do you cache? Answer: You do all three.

II. Cache replacement strategies

Cache miss frequency is not the only issue. A disadvantage of a dynamic scheme is that, since parallel processes are accessing the cache, you need to enforce a mutex, which slows things down. (I would have thought that you could get away with a single write process and avoid the need for mutex, but Silvestri says not, so I believe him.)

III. Prefetching For K >= 2, if a user asks for the Kth result page, he will probably ask for the K+1st. (Emphatically not true for K=1). So once the user asks for the Kth page, you can prefetch (i.e. precompute and store in the cache) the K+1st results page.

Web Site Reorganization

Srikant and Yang.

Insight: If a user searches down a hierarchy to index page B, backtracks, and then ends up at target page T, and T is the first target page looked at in the search, then it seems likely that the user expected to find T under B, and therefore one can suggest that it might be good to add a link from B to T.

Test case: Wharton School of Business web site. The web site has 240 leaf pages. Based on 6 days worth of server logs with 15,000 visitors and 3,000,000 records (200 records per visitor seems like a large number, but of course that includes imbedded image files and other junk), the program suggested new links for 25 pages. Some examples:

Visitors expect to find the answer to "Why choose Wharton?" under "Student to Student Program's Question and Answer Session" directory instead of "Student to Student Program's General Description"

Visitors expect to find "MBA Student Profiles" under "Student" instead of "MBA Admission".

Visitors expect to find "Calendar" under "Programs" instead of "WhartonNow".

Visitors expect to find "Concentrations" and "Curriculum" under "Students" instead of "Programs" (less convincing).

The program also made 20 other suggestions.


I couldn't find any good papers on this. But I'd guess the main applications are:

Customer Profiling: You can find out what kind of customers are buying what kind of items if you can get demographic information (which, apparently, you often can get just by asking --- one writer was shocked at how readily online shoppers provided personal information that the company had no business asking.)

Advertisement placement (my own guess). The "referred" place in the server log tells you what advertisements are attracting what kind of business.

Other reading

Caching search engine results over incremental indices Roi Blanco et al., WWW-10.

A Survey of Web Cache Replacement Strategies Podlipnig and Boszormenyi

Mining search association patterns from search logs Xuanhui Wang and ChengXiang Zhai, CIKM 2008

Analyzing and evaluating query reformulation strategies in web search logs Jeff Huang and Efthimis Efthimiadis, CIKM 09.

Web Search Result Diversification, Rodrygo L.T. Santos, Craig Macdonald, Iadh Ounisr. WWW 10.

Architecture of the Internet Archive Elliot Jaffe and Scott Kirkpatrick, SYSTOR '09.

What can history tell us?: Toward different models of interaction with document histories Adam Jatowt et al., HT '08.

Query-log mining for detecting spam

Web search/browse log mining: challenges, methods, and applications

Temporal query log profiling to improve web search ranking

Learning about the world through long-term query logs

Search Engines that Learn from Implicit Feedback Thorsten Joachims and Filip Radlinkski, Computer August, 2007.