CSCI-GA.3033-001
NYU Courant Institute of Mathematical Sciences
Computer Science Department, Graduate School
Summer 2017
Lecturer: Suzanne McIntosh (mcintosh@cs.nyu.edu)
Office Hours: Evenings by appointment in WWH 328, and after class.
Semester: Summer 2017
Room: CIWW (Courant Institute, Warren Weaver Hall), room 517
Day and Time: Thursday, 6:00-8:20 pm
This course is designed for students who have successfully completed the
Realtime and Big Data Analytics graduate course.
Requirements:
· Strong programming skills in Java, Python, or C++
· Experience using Hadoop
· Coursework in operating systems, networking, and algorithms
· Familiarity with Linux commands
Required
· Learning Spark, by Karau, Konwinski, Wendell, and Zaharia
Optional
· Advanced Analytics with Spark, by Ryza, Laserson, Owen, and Wills
· Hadoop: The Definitive Guide (third edition), by Tom White
Students may choose one of two platforms for completing homework
assignments:
1. A fully configured virtual machine that includes Spark and other course tools, such as the Cloudera QuickStart VM available at:
2. The NYU HPC Hadoop cluster - ‘Dumbo’
This course covers Scala and Spark programming, Spark architecture, Spark
Streaming, and the integration of Spark with the Hadoop ecosystem for
developing Big Data applications. It also covers some of the technologies
that integrate well with Spark when building such applications.
Students are required to complete weekly reading and programming
assignments and demonstrate mastery of course topics by developing a final
project using Scala, Spark, and complementary Hadoop tools.
Grades are based on the following approximate weighting:
Readings, lab assignments, class participation | 25% |
Midterm | 25% |
Final | 20% |
Project | 30% |
Class | Topic |
1 | Course Introduction, Programming with Scala |
2 | Advanced Scala, Distributed Processing |
3 | Spark Overview, Distributed Storage |
4 | Cluster Resource Management, Data Ingest |
5 | Data Management, Data Formats |
6 | Data Management with Partitioning |
7 | Midterm Exam, NRT Data Ingest, Distributed Messaging |
8 | Spark RDDs (Resilient Distributed Datasets), Spark Applications |
9 | Spark Parallelization, Spark RDD Persistence, Spark Patterns |
10 | Spark Streaming, Spark SQL |
11 | Project Demos |
12 | Final Exam |