Realtime and Big Data Analytics

CSCI-GA.3033-008

 

NYU Courant Institute of Mathematical Sciences

Computer Science Department, Graduate Division

Spring 2014

 


 

General Information

 

Lecturer: Suzanne McIntosh (mcintosh@cs.nyu.edu)

 

Office Hours: Evenings by appointment in WWH 328, and after class.

 

Semester: Spring 2014

 

Room: CIWW (Courant Institute, Warren Weaver Hall) 1302

 

Day and Time: Thursday, 7:10-9:00 pm

 

Teaching Assistants: Kan Wang and Swathi Gillela

 


Prerequisites

 

CSCI-GA 2250 or an equivalent Operating Systems course; programming experience in Java, Python, or C/C++ for assignments and the final project; CSCI-GA 2262, CSCI-GA 2620, or an undergraduate course in networks. Familiarity with databases will be useful.

 


Texts

 

Hadoop: The Definitive Guide, by Tom White

Hadoop Operations, by Eric Sammer (optional)

Programming Pig, by Alan Gates (optional)

 


Tools

Cloudera Distribution for Apache Hadoop (CDH). A fully configured QuickStart VM is available at:

http://www.cloudera.com/content/support/en/downloads/download-components/download-products.html?productID=F6mO278Rvo

 


Description

 

This course will introduce technologies at the foundation of the Big Data movement that have facilitated scalable management of vast quantities of data collected through realtime and near realtime sensing. We will also explore the tools enabling the acquisition of near realtime data in the social domain, the fusion of those data when in flight and at rest, and their meaningful representation in graphical visualizations.

 

Students are required to complete weekly reading and programming assignments, and to demonstrate mastery of course topics by developing and demonstrating an analytics project of their own design. Class time will be set aside for project proposals and final demos.

 


Grading

 

Grades are based on the following approximate weighting:

 

Readings, lab assignments, class participation    25%
Midterm                                           25%
Final                                             20%
Project                                           30%

 


Syllabus (tentative)

Class   Topic
1       Introduction to Hadoop and Big Data
2       Distributed File Systems, MapReduce
3       HDFS and MapReduce Architecture
4       Introduction to Pig
5       Project Tee-up, Analytics Examples, Realtime Systems
6       New Alternatives to Traditional Database Systems and Access Methods, NoSQL, Intro. to Flume
7       Midterm Exam
8       Project Team meetings, Managing Big Data, Intro. to Hive
9       Hadoop in the Cloud
10      Realtime and Big Data in The Cloud: Autonomic Systems
11      Realtime and Big Data in The Cloud: Distributed Coordination
12      Fault Tolerance in Hadoop
13      Project Demo Day!
14      Final Exam Review
15      Final Exam