Realtime and Big Data Analytics
CSCI-GA.3033-003
NYU Courant Institute of Mathematical Sciences
Computer Science Department, Graduate School
Summer 2017
General Information
Lecturer: Suzanne McIntosh (mcintosh@cs.nyu.edu)
Office Hours: Evenings by appointment in WWH 328, and after class.
Semester: Summer 2017
Room: CIWW (Courant Institute, Warren Weaver Hall) room 317
Day and Time: Wednesday, 6:00-8:20 pm
Prerequisites
CSCI-GA 2250 or an equivalent Operating Systems course; programming experience in Java,
C/C++, or Python for assignments and the final project; CSCI-GA 2262, CSCI-GA 2620,
or an undergraduate course in networks. Familiarity with databases and Linux is
helpful but not required.
Texts
Hadoop: The Definitive Guide, 4th edition, by Tom White
Hadoop Operations, by Eric Sammer (optional)
Programming Pig, by Alan Gates (optional)
Programming Hive, by Edward Capriolo, Dean Wampler, and Jason Rutherglen (optional)
Architecting HBase Applications, by Jean-Marc Spaggiari and Kevin O'Dell (optional)
HBase: The Definitive Guide, by Lars George (optional)
Tools
Cloudera QuickStart VM is available at:
Description
This course introduces architectures and technologies at the foundation of the Big
Data movement. These technologies facilitate scalable management and processing
of vast quantities of data collected through realtime and near realtime
sensing. We explore tools enabling the acquisition of data in the social domain
and the fusion of those data when in flight and at rest using Hadoop and
Hadoop-related tools.
The material covered in this course aligns with the prevailing state of the art in
Big Data technologies, a landscape that continues to change rapidly as
new technologies emerge and existing ones mature.
Students are required to complete weekly reading and/or programming assignments and
demonstrate mastery of course topics by designing, developing, and
demonstrating an analytics project of their choosing. Class time will be set
aside for project proposal and final demo.
Acknowledgement
We are grateful to Amazon for supporting this course
through the Amazon Web Services (AWS) in Education Grant.
Grading
Grades are based on the following approximate weighting:
Readings, lab assignments, class participation | 20%
Midterm | 25%
Final | 25%
Project | 30%
Syllabus
Class | Topic
1 | Introduction to Hadoop and Big Data
2 | Distributed File Systems, HDFS, MapReduce
3 | HDFS and MapReduce Architecture
4 | Introduction to Pig, Analytics Examples
5 | Project Tee-up, Realtime Systems, Introduction to Flume
6 | New Alternatives to Traditional Database Systems and Access Methods, NoSQL, HBase
7 | Midterm Exam
8 | Hadoop in the Cloud
9 | Hive
10 | Autonomic Systems
11 | Distributed Coordination, ZooKeeper
12 | Hadoop Fault Tolerance
13 | YARN, Spark, Final Exam Review
14 | Project Demo Day
15 | Final Exam