Web Search Engines (CSCI-GA 2580)

Spring 2013, Department of Computer Science, NYU


Current News:

2013/05/13: Project report is due 9am May 15th, and the project code is due 9am May 18th, both via NYU Classes.

2013/05/13: HW3 grades are released. You have until 7pm on May 17 to check with the TAs if you have any issues with the grading. As usual, the 50% penalty applies.

2013/05/06: Please note that the project demo location is CIWW605, not the classroom.

2013/05/02: Instructions for project report submission and project code submission will be available in NYU Classes on May 8th. The late submission policy of 1 hour (20% penalty) / 3 hours (50% penalty) applies to both.

2013/04/29: HW2 grades are released.

2013/04/02: Nitish Korula will give a guest lecture on Internet Advertising for the May 1st class!

Brief Description:

Search engines have become a core part of our daily lives. In this course, we will study the foundations of information retrieval and the technical aspects of modern Web search engines. We will also explore a few advanced topics that have become highly influential in Web search.

You are expected to study the course material (textbook and research papers), participate in class discussion, and work on a class project that involves system design and implementation.

Instructors and Logistics:

Dr. Fernando Diaz (Microsoft Research), first_initial_and_last_name [AT] cs dot nyu dot edu
Dr. Cong Yu (Google Research), full_name [AT] cs dot nyu dot edu

Teaching Assistants (Questions regarding homeworks should be sent to TAs first):
Bowen Li, first_initial_and_last_initial_and_1182 [at] nyu dot edu
Yixia Mao, first_initial_and_last_initial_and_943 [at] nyu dot edu

Prerequisite: It is expected that you have a good knowledge of algorithms and at least one of the major programming languages.
Although not a strict prereq, having taken UA.0310 is a good proxy.

Time and Location: Wed 5:00p - 6:50p, CIWW102
Office Hours: Wed 3:50p - 4:50p, WWH328
Mailing List: Csci_ga_2580_001_sp13 [AT] cs dot nyu dot edu

Textbook: Search Engines - Information Retrieval in Practice, by W. Bruce Croft, Donald Metzler, Trevor Strohman. Addison-Wesley. 2009.

Grading:
Participation 10%;
Exams 40%: Midterm 15%, Final 25%;
Project 50%: 3 Homeworks 30% (10% each), Project Report 10%, Project Demo 10%.

Course Schedule (tentative)

Notations: FD = Fernando Diaz; CY = Cong Yu.
Reading materials will be provided on the web site approximately one week before the lecture date.

Date | Topic (Instructor) | Reading Material | Deadlines
Lec 00 (a, b) (01/30) | Introduction (CY + FD) | Chapter 1–2 | HW0 out.
Lec 01 (02/06) | Evaluation (FD) | Chapter 8 |
Lec 02 (02/13) | Ranking (FD) | Chapter 7 | HW0 due; HW1 out. HW1 FAQ
Lec 03 (02/20) | Indexing (CY) | Chapter 5 |
Lec 04 (02/27) | Document Processing (CY) | Chapter 4 | HW1 due; HW2 out.
Lec 05 (03/06) | Crawl (CY) | Chapter 3 |
Lec 06 (03/13) | Query Mining (FD) | Chapter 6.1, 6.2, [4], [5], [6] | HW2 due; Midterm out.
03/20 | Spring Break (no class) | |
Lec 07 (03/27) | Big Data (CY) | [2], [3] | Midterm due; HW3 out.
Lec 08 (04/03) | Search Personalization (FD) | |
Lec 09 (04/10) | Realtime Search 1 (FD) | | HW3 due.
Lec 10 (04/17) | Realtime Search 2 (FD) | |
Lec 11 (04/24) | Knowledge Search (CY) | [7], [8] |
Lec 12 (05/01) | Internet Advertising (Nitish Korula) | article 1; article 2 |
05/08 | Final Exam (CY + FD) | |
05/15-18 | Project Demo Days, CIWW605 | | Project Report due 5/15 9am; Project Code due 5/18 9am (both via NYU Classes).

[1] Data-Intensive Text Processing with MapReduce by Lin and Dyer. (Supplemental reading on Big Data)
[2] MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean and Sanjay Ghemawat, OSDI 2004.
[3] Distributed Cube Materialization on Holistic Measures, by Arnab Nandi, et al., ICDE 2011.
[4] Donald Metzler, Susan Dumais, and Christopher Meek. 2007. Similarity measures for short segments of text. In Proceedings of the 29th European conference on IR research (ECIR'07), Giambattista Amati, Claudio Carpineto, and Giovanni Romano (Eds.). Springer-Verlag, Berlin, Heidelberg, 16-27.
[5] Rosie Jones and Kristina Lisa Klinkner. 2008. Beyond the session timeout: automatic hierarchical segmentation of search topics in query logs. In Proceedings of the 17th ACM conference on Information and knowledge management (CIKM '08). ACM, New York, NY, USA, 699-708. DOI=10.1145/1458082.1458176 http://doi.acm.org/10.1145/1458082.1458176
[6] Marius Pasca and Benjamin Van Durme. 2007. What you seek is what you get: extraction of class attributes from query logs. In Proceedings of the 20th international joint conference on Artificial intelligence (IJCAI'07), Rajeev Sangal, Harish Mehta, and R. K. Bagga (Eds.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2832-2837.
[7] A Web of Concepts, by Dalvi et al., PODS 2009.
[8] Web 3.0: The Dawn of Semantic Search, by James Hendler, IEEE Computer, 43(1), 2010.

Course Projects

A big component of the course is a group project. In the first part of the project, each group will design and implement a mini search engine through a series of homeworks. In the second part, the group will build an advanced component on top of that engine.

Group ID | Group Members | Group ID | Group Members
G01 | xc432, xh379, zz477 | G02 | ssb402, pvb221, nav237
G03 | rc1972, alg489, ssw288 | G04 | jj1233, jyh300, ss6321
G05 | hm1021, ka1042, hj601 | G06 | ql337, yl1258, yl1404
G07 | cc3263, sj1167, hl1115 | G08 | ly544, zz491, mg3658
G09 | fw454, bz465, jl4550 | G10 | zj285, ys1024, qh237
G11 | td859, jl4527, dx262 | G12 | cl1934, rz557, zc440
G13 | al3096, ao925, ys1155 | G14 | sl3268, hz575, yl1766
G15 | ml3329, zl527, wx277 | G16 | aps398, sdb359, pk1094
G17 | kb1573, mkv218, sp2619 | G18 | bs1781, qt224, tyw239
G19 | am5156, kp1264, yg657 | |

Project Demo Slot Assignments:
Time (pm ET) | May 15 | May 16 | May 17
5:00 | G01 (CY) | G02 (FD) |
5:15 | G15 (CY) | G09 (CY) |
5:30 | G08 (FD) | G11 (FD) |
5:45 | reserved | G10 (CY) | G18 (CY)
6:00 | G04 (CY) | G16 (CY) | G03 (FD)
6:15 | G19 (FD) | G06 (FD) | G17 (FD)
6:30 | G14 (CY) | G12 (FD) | G14 (CY) - 2nd
6:45 | G13 (CY) | G05 (FD) | G07 (FD)