Increasing Diversity in Web Search Results, Dimitrios Skoutas, Enrico Minack, and Wolfgang Nejdl, Web Science 2010.
Diversifying Search Results, Rakesh Agrawal et al., WSDM 2009.
Similar problem to clustering, with similar techniques and evaluation measures. The differences are:
Skoutas: Distance-based. Choose points far apart but close to the query.
Choose the set S that maximizes
λ · min_{d in S} Rel(d,q) + min_{d1,d2 in S} Dist(d1,d2).
Problem: This favors extreme documents.
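As a sketch, the max-min objective above can be optimized greedily (the names Rel, Dist, and the trade-off weight lam are illustrative, not from the paper):

```python
def greedy_diversify(docs, rel, dist, k, lam=0.5):
    """Greedy approximation: docs is a list of doc ids, rel maps doc ->
    relevance to the (fixed) query, dist(d1, d2) gives pairwise distance."""
    def objective(s):
        min_rel = min(rel[d] for d in s)
        pairs = [dist(d1, d2) for i, d1 in enumerate(s) for d2 in s[i + 1:]]
        min_dist = min(pairs) if pairs else 1.0
        return lam * min_rel + min_dist
    selected = [max(docs, key=lambda d: rel[d])]  # seed with most relevant doc
    while len(selected) < k:
        best = max((d for d in docs if d not in selected),
                   key=lambda d: objective(selected + [d]))
        selected.append(best)
    return selected
```

Note how the min over pairwise distances pushes the selection toward documents far from everything already chosen, which is exactly why extreme documents get favored.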
Skoutas: Coverage-based. Make sure that every relevant document is close to some returned document.
Choose the set S that minimizes
max_{d relevant to q} Rel(q,d)^λ · Dist(d, nearest doc in S)
Agrawal: Choose set S that maximizes the probability that a searcher will find some document relevant to their information need. Note that this is weighted by number of queries in each category, not by number of documents.
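A minimal sketch of Agrawal et al.'s greedy idea, with invented toy data: p_cat[c] is the share of queries falling in category c, and v[d][c] the probability that document d satisfies a user with intent c. At each step we pick the document with the largest probability of satisfying a still-unsatisfied intent:

```python
def ia_select(docs, p_cat, v, k):
    """Greedily pick docs maximizing the probability that some selected
    document satisfies the user's (unknown) categorical intent."""
    # u[c] = probability that intent c is still unsatisfied by the chosen set
    u = dict(p_cat)
    selected = []
    for _ in range(k):
        remaining = [d for d in docs if d not in selected]
        if not remaining:
            break
        # marginal gain of d = sum over categories of u[c] * V(d|c)
        best = max(remaining,
                   key=lambda d: sum(u[c] * v[d].get(c, 0.0) for c in u))
        selected.append(best)
        for c in u:
            u[c] *= 1.0 - v[best].get(c, 0.0)
    return selected
```

Because u[c] shrinks once a category is covered, a second document from the dominant category loses to a document from a smaller, still-uncovered category; the weighting comes from query shares, not document counts.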
Mining query logs: Turning search usage data into knowledge, Fabrizio Silvestri. Book-length survey. Lots of material. The technical presentation is not always very clear, so you often have to go back to the original paper, but generally an amazing source.
Data Preparation for Mining World Wide Web Browsing Patterns Cooley, Mobasher, and Srivastava
Mining Web Logs to Improve Website Organization Ramakrishnan Srikant and Yinghui Yang
Query expansion using associated queries Bodo Billerbeck et al., CIKM 2003.
Search advertising using web relevance feedback Andrei Broder et al., CIKM 2008.
Difficulties and limitations
Reconstructing the browsing path: Pages previously looked at are cached on the client side and not recorded in the server log. Hence, if the user requests page A, and has previously requested a page C that contains a link to A, infer that the user has gone back to C and followed the link to A.
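A toy sketch of this path-completion inference (the log and link-graph formats are invented): when a requested page is not linked from the previous page, walk back through the recorded path to the most recent page that does link to it, and re-insert the cached back-steps.

```python
def complete_path(requests, links):
    """requests: pages in server-log order; links: page -> set of linked pages.
    Returns the inferred browsing path, including cached back-steps."""
    path = []
    for page in requests:
        if path and page not in links.get(path[-1], set()):
            # user must have hit Back (cached pages are not re-requested):
            # backtrack to the most recent page that links to this one
            i = len(path) - 1
            while i >= 0 and page not in links.get(path[i], set()):
                i -= 1
            if i >= 0:
                path.extend(reversed(path[i:-1]))  # the cached back-steps
        path.append(page)
    return path
```

E.g. a logged sequence A, B, D with links A -> {B, D} is completed to A, B, A, D: the return to A came from the browser cache and never hit the server.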
Advantages: Record [description of] dynamically generated content. Sessionize, identify users, using cookies. Save information missing from server logs: Stop button, local time of user, speed of user's connection.
Disadvantage: Have to either write or modify web server code.
Disadvantage: Low quality information.
Better yet, you can bug some of their other programs as well and get even more information. E.g. if you monitor the browser, email, and text editing programs, you can see how often the user is stuffing information from the browser into email and text files.
Any usage collection runs into privacy issues; the more complete the data, the more serious the issue.
Subject matter: Remarkably unstable over time. See Silvestri.
Spatial demographic tracking. E.g. Google's famous tracking of the 2009 swine flu outbreak in advance of the CDC; Google Trends.
Associations (Cooley etc.)
Search patterns and paths.
Examples from 1996 Olympics Web site: (Cooley et al.)
Indoor volleyball => Handball (Confidence: 45%)
Badminton, Diving => Table Tennis (Confidence: 59.7%)
Sequential patterns:
Atlanta home page followed by Sneakpeek main page (Support: 9.81%)
Sports main page followed by Schedules main page (Support: 0.42%)
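The support and confidence numbers above can be computed from sessionized logs roughly as follows (the session representation as sets of pages is invented). For a rule X => Y, support is the fraction of sessions containing both X and Y, and confidence is that fraction divided by the support of X alone:

```python
def support(sessions, items):
    """Fraction of sessions (sets of pages) containing all the given items."""
    return sum(1 for s in sessions if items <= s) / len(sessions)

def confidence(sessions, lhs, rhs):
    """Of the sessions containing lhs, the fraction that also contain rhs."""
    return support(sessions, lhs | rhs) / support(sessions, lhs)
```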
Statistics on re-retrieval: submitting the same query some time apart.
Relate web activities to user profile.
E.g. 30% of users who ordered music are 18-25 and live on the West Coast.
Digression on relevance feedback: an old IR technique (MR&S, chapter 9).
[Standard] relevance feedback: The user does a query, gets documents back, and marks some of them as relevant (or is observed clicking through to some of them). Words that are common in the relevant documents are added to the query, and the modified query is reissued.
In a Boolean model, OR'ing them improves recall and AND'ing them improves precision. In a vector model adding these tends to improve both recall (because a new relevant document may match words in the marked documents which were not in the query, such as synonyms) and precision (because words associated with unintended meanings of the query words, or unintended classes of documents will tend to be disfavored).
[Ordinary] pseudo-relevance feedback or blind relevance feedback: There is no actual feedback from the user; rather, the common words in all the retrieved documents are added to the query and the query is reissued.
This can also often improve retrieval. Suppose that 50% of the documents retrieved are relevant to the intended subject and 50% are irrelevant, but deal with a variety of different subjects. (This is particularly apt to happen with multi-word queries.) Then the words that are common in many of the retrieved documents will be those that are actually relevant to the intended subject, and again, in a vector model, both recall and precision will go up in the modified query.
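A minimal Rocchio-style sketch of this blind feedback (the weight beta and the term cutoff are arbitrary illustrative choices, not from any particular paper): the expanded query up-weights terms common across the top-ranked documents.

```python
from collections import Counter

def expand_query(query_terms, top_docs, beta=0.5, n_terms=3):
    """query_terms: list of words; top_docs: token lists of the top-ranked
    documents, assumed relevant. Returns a weighted query vector."""
    q = Counter({t: 1.0 for t in query_terms})
    centroid = Counter()
    for doc in top_docs:
        centroid.update(doc)  # term frequencies over the retrieved set
    # add the most common retrieved terms, weighted by average frequency
    for term, count in centroid.most_common(n_terms):
        q[term] += beta * count / len(top_docs)
    return dict(q)
```

E.g. for the query "jaguar" over top documents about cars, "car" enters the query vector and documents about the animal get disfavored on the reissue.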
This has to be used with care with web search engines, because they are generally more unhappy with really long queries than conventional IR engines are.
End of digression. Getting back to query logs: Billerbeck et al. do query expansion using associated queries.
Offline: With each document D, associate the list of query words Q for which D is on the results page. (Note that this is different from the words in D, because these are drawn from the space of queries that people actually ask).
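A rough sketch of this offline association plus expansion (the log format and parameter choices are invented): each document accumulates the words of the queries for which it was returned, and a new query is expanded with the terms most associated with its top documents.

```python
from collections import Counter, defaultdict

def build_query_index(log):
    """log: (query_terms, result_docs) pairs. Returns doc -> Counter of
    associated query words (drawn from queries people actually ask)."""
    assoc = defaultdict(Counter)
    for terms, docs in log:
        for d in docs:
            assoc[d].update(terms)
    return assoc

def expand(query_terms, top_docs, assoc, n_terms=2):
    """Add the most frequent associated query words of the top documents."""
    pool = Counter()
    for d in top_docs:
        pool.update(assoc[d])
    extra = [t for t, _ in pool.most_common() if t not in query_terms][:n_terms]
    return list(query_terms) + extra
```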
An alternative, by Boydell and Smyth, is to rerank on the client side using the information in the snippets rather than the full documents. Safer in terms of privacy.
Interesting statistic from Joachims and Radlinski: In a test on results from a search engine, people click on the first result 40% of the time and on the second about 10% of the time. If you swap the top two results, then they click on the first 30% of the time and still click on the second 10% of the time. Eyetracking shows that they look at both, so it isn't that they fail to see the second.
The point is that there is a strong bias toward clicking on results early in the page, so in evaluation, it is not even close to safe to say that a page is more relevant if it is clicked on than if it is not. If you do that, you are just endorsing whatever order the search engine has already generated.
What you can say is that if A comes after B on the results page, and the user clicks on A but not B then on average A is more relevant than B. Using that fact, or variants, you can come up with an evaluation of search engine results from query logs, and thus with weights on the ranking function that optimize that evaluation.
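The "clicked A but skipped a higher-ranked B" extraction can be sketched as follows (the data format is invented):

```python
def click_preferences(ranking, clicked):
    """ranking: doc ids in result-page order; clicked: set of clicked ids.
    Returns (preferred, over) pairs usable as training data for a ranker."""
    prefs = []
    for i, doc in enumerate(ranking):
        if doc in clicked:
            # every unclicked result ranked above a clicked one was examined
            # and passed over, so the clicked result is preferred to it
            for skipped in ranking[:i]:
                if skipped not in clicked:
                    prefs.append((doc, skipped))
    return prefs
```

Pairs like these, aggregated over many sessions, give the position-bias-corrected evaluation signal described above.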
II. Cache replacement strategies
III. Prefetching For K >= 2, if a user asks for the Kth result page, they will probably ask for the (K+1)st. (Emphatically not true for K=1.) So once the user asks for the Kth page, you can prefetch (i.e. precompute and store in the cache) the (K+1)st results page.
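A toy sketch of this prefetching rule (the cache and page-computation interfaces are invented):

```python
def serve_results_page(query, k, cache, compute_page):
    """Serve page k of a query's results, prefetching page k + 1 when k >= 2.
    cache: dict keyed by (query, page); compute_page: does the real work."""
    key = (query, k)
    if key not in cache:
        cache[key] = compute_page(query, k)
    if k >= 2:  # requests for page 1 rarely lead to page 2,
        nxt = (query, k + 1)  # but a request for page k >= 2 usually leads to k + 1
        if nxt not in cache:
            cache[nxt] = compute_page(query, k + 1)
    return cache[key]
```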
Insight: If a user searches down a hierarchy to index page B, backtracks, and then ends up at target page T, and T is the first target page looked at in the search, then it seems likely that the user expected to find T under B, and therefore one can suggest that it might be good to add a link from B to T.
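This heuristic might be sketched as follows, assuming the paths have already been path-completed so that backtracks are visible (all formats here are invented):

```python
from collections import Counter

def suggest_links(paths, is_target):
    """paths: path-completed browsing sessions; is_target(p): True for content
    (leaf) pages. Returns Counter of (expected_index_page, target) votes."""
    votes = Counter()
    for path in paths:
        candidate, visited = None, set()
        for prev, page in zip(path, path[1:]):
            if is_target(page):
                if candidate is not None:  # user backtracked before reaching it
                    votes[(candidate, page)] += 1
                break  # only the first target page in the session counts
            if page in visited:  # a revisit means the user hit Back from prev
                candidate = prev
            visited.add(prev)
    return votes
```

E.g. the completed path Home, B, Home, B2, T yields a vote for adding a link from B to T: the user expected T under B, backtracked, and found it elsewhere.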
Test case: the Wharton School of Business web site, which has 240 leaf pages. Based on six days' worth of server logs with 15,000 visitors and 3,000,000 records (200 records per visitor seems like a large number, but of course that includes embedded image files and other junk), the program suggested new links for 25 pages. Some examples:
Visitors expect to find the answer to "Why choose Wharton?" under the "Student to Student Program's Question and Answer Session" directory instead of "Student to Student Program's General Description".
Visitors expect to find "MBA Student Profiles" under "Student" instead of "MBA Admission".
Visitors expect to find "Calendar" under "Programs" instead of "WhartonNow".
Visitors expect to find "Concentrations" and "Curriculum" under "Students" instead of "Programs" (less convincing).
The program also made 20 other suggestions.
Customer Profiling: You can find out what kind of customers are buying what kind of items if you can get demographic information (which, apparently, you often can get just by asking --- one writer was shocked at how readily online shoppers provided personal information that the company had no business asking.)
Advertisement placement (my own guess). The "referred" place in the server log tells you what advertisements are attracting what kind of business.
Caching search engine results over incremental indices Roi Blanco et al., WWW-10.
A Survey of Web Cache Replacement Strategies Podlipnig and Boszormenyi
Mining search association patterns from search logs Xuanhui Wang and ChengXiang Zhai, CIKM 2008
Analyzing and evaluating query reformulation strategies in web search logs Jeff Huang and Efthimis Efthimiadis, CIKM 09.
Web Search Result Diversification, Rodrygo L.T. Santos, Craig Macdonald, Iadh Ounis. WWW 10.
Architecture of the Internet Archive Elliot Jaffe and Scott Kirkpatrick, SYSTOR '09.
What can history tell us?: Toward different models of interaction with document histories Adam Jatowt et al., HT '08.
Query-log mining for detecting spam
Web search/browse log mining: challenges, methods, and applications
Temporal query log profiling to improve web search ranking
Learning about the world through long-term query logs
Search Engines that Learn from Implicit Feedback, Thorsten Joachims and Filip Radlinski, Computer, August 2007.