Format of Final Exam

The final exam will be an in-class exam. It will be given on Wednesday, May 11, 5:00 - 6:50 in WWH 102. It will be open book and open notes. No electronic devices allowed.

The exam will be divided into two parts: web search engines and web mining.

Part I: Web Search Engines

Part I, Web Search Engines, will be worth 60% of the exam grade. It will cover the subjects listed below. In this section I will ask only about issues that I have discussed in class. In particular, substantial parts of the discussions in MR&S are applicable only to traditional IR systems and not to web search; you may skip these.

Part II: Web Mining

Part II, Web Mining, will be worth 40% of the grade. For this part, you will choose ONE of the six areas listed below and answer questions based on the specified readings. Please note: advance preparation (that is, carefully reading and thinking about the papers in the area you choose) is essential for this part of the exam. If you have prepared the papers in your chosen area, the questions should be quite straightforward; if you are trying frantically to read them during the exam, that will probably not work well.

On the exam, I will accept answers from only ONE area. If you answer questions on Part II from more than one area, I will choose one of these at random, grade you on that, and ignore the others.

Be sure to BRING A COPY of the papers in your area to the exam. The questions will refer to quite specific points in the papers, and you will need to have the actual papers in hand to answer them. In this part of the exam, I may ask about aspects of the papers that I did not discuss in class.

It might be a good strategy to prepare the papers in one back-up area in case you don't like the questions on your favorite area. I ADVISE STRONGLY AGAINST PREPARING ALL THE AREAS UNLESS YOU HAVE A LOT OF FREE TIME BETWEEN NOW AND MAY 11TH.

Also: I will try to make the questions from the different areas of equal difficulty. But the papers themselves are of quite different length and inherent difficulty; there is nothing I can do about that. So you might want to check out some of the areas that appeal to you and see which papers seem easiest to you.

Finally, quite a few of these papers do a statistical analysis of their results. I am not assuming a knowledge of statistics, and therefore will not ask about these statistical analyses. Generally, if one of the papers you are preparing draws on highly technical material outside the scope of this course, don't hesitate to check with me whether you really need to know this in detail; it is quite likely that I will not be asking you about it.

Invisible Web and WebTables

Google's Deep Web crawl, Madhavan et al., Proc. VLDB Endowment, 1:2, August 2008.
WebTables: Exploring the Power of Tables on the Web, Michael Cafarella et al.

Sentiment Analysis

Opinion Mining and Sentiment Analysis, Bo Pang and Lillian Lee, Chap. 4 (pp. 23-60).
Sentiment Analysis and Subjectivity, Bing Liu, in Handbook of Natural Language Processing 2nd edn. ed. N. Indurkhya and F. J. Damerau.

Query Log Mining

Mining query logs: Turning search usage data into knowledge, Fabrizio Silvestri, Chaps. 4 and 5.
Data Preparation for Mining World Wide Web Browsing Patterns, Cooley, Mobasher, and Srivastava.


80 Million Tiny Images: A Large Dataset for Non-Parametric Object and Scene Recognition, Antonio Torralba, Rob Fergus, and William Freeman, IEEE PAMI, 30:11, Nov. 2008.
A survey of browsing models for content based image retrieval, Daniel Heesch, Multimedia Tools and Applications, 40:2, 2008, 261-284.

Multi-lingual Web

A corpus factory for many languages, Adam Kilgarriff et al., LREC 2010.
The Crubadan Project: Corpus building for under-resourced languages, Kevin Scannell.

Information extraction

Preemptive information extraction using unrestricted relation discovery, Yusuke Shinyama and Satoshi Sekine.
Open Information Extraction from the Web, Michele Banko et al.