Realtime and Big Data Analytics

CSCI-GA.3033-005

 

NYU Courant Institute of Mathematical Sciences

Computer Science Department, Graduate Division

Summer 2015

 


 

General Information

 

Lecturer: Suzanne McIntosh (mcintosh@cs.nyu.edu)

 

Office Hours: Evenings by appointment in WWH 328, and after class.

 

Semester: Summer 2015

 

Room: CIWW (Courant Institute, Warren Weaver Hall) room 1302

 

Day and Time: Thursday, 6:00-8:20 pm

 


Prerequisites

 

CSCI-GA 2250 or an equivalent Operating Systems course; programming experience in Java, C/C++, or Python (needed for the assignments and final project); and CSCI-GA 2262, CSCI-GA 2620, or an undergraduate course in networks. Familiarity with databases and Linux is helpful but not required.

 


Texts           

 

Hadoop: The Definitive Guide, 4th edition, by Tom White

Hadoop Operations, by Eric Sammer (optional)

Programming Pig, by Alan Gates (optional)

Programming Hive, by Edward Capriolo, Dean Wampler, and Jason Rutherglen (optional)

HBase in Action, by Nick Dimiduk and Amandeep Khurana (optional)

HBase: The Definitive Guide, by Lars George (optional)

 

 


Tools

The Cloudera QuickStart VM is available at:

 http://www.cloudera.com/content/support/en/downloads/download-components/download-products.html?productID=F6mO278Rvo

 


Description

 

This course introduces architectures and technologies at the foundation of the Big Data movement. These technologies facilitate scalable management and processing of vast quantities of data collected through realtime and near-realtime sensing. We explore tools for acquiring data in the social domain and for fusing those data, both in flight and at rest, using Hadoop and related tools.

 

The material covered in this course tracks the current state of the art in Big Data technologies, a landscape that continues to change rapidly as new technologies emerge and existing ones mature.

 

Students are required to complete weekly reading and/or programming assignments and to show mastery of course topics by designing, developing, and demonstrating an analytics project of their choosing. Class time will be set aside for project proposals and final demos.

 


Acknowledgement

 

We are grateful to Amazon for supporting this course through the Amazon Web Services (AWS) in Education Grant.

 


Grading

 

Grades are based on the following approximate weighting:

 

Readings, lab assignments, class participation    25%
Midterm                                           25%
Quizzes                                           20%
Project                                           30%

 


Syllabus

        

Class  Topic
1      Introduction to Distributed and Parallel Compute Systems
2      Distributed File Systems, HDFS, MapReduce
3      HDFS and MapReduce Architecture
4      Introduction to Pig, Analytics Examples
5      Programming in Pig
6      New Alternatives to Traditional Database Systems and Access Methods, NoSQL, Introduction to Flume
7      Midterm Exam
8      Hadoop in the Cloud, Programming with Hive, Autonomic Systems
9      Distributed Coordination, ZooKeeper
10     YARN/MR2, Fault Tolerance
11     Introduction to Spark, Sqoop, Oozie, and Kafka
12     Project Demo Day