Big Data Application Development

CSCI-GA.3033-001

 

NYU Courant Institute of Mathematical Sciences

Computer Science Department, Graduate School

 

Summer 2017

 


General Information

 

Lecturer: Suzanne McIntosh (mcintosh@cs.nyu.edu)

 

Office Hours: Evenings by appointment in WWH 328, and after class.

 

Semester: Summer 2017

 

Room: CIWW (Courant Institute, Warren Weaver Hall) room 517

 

Day and Time: Thursday, 6:00-8:20 pm

 


Prerequisites

 

This course is designed for students who have successfully completed the Realtime and Big Data Analytics graduate course.

 

    Requirements:

 

·  Strong programming skills in Java, Python, or C++

·  Experience using Hadoop

·  Coursework in operating systems, networking, and algorithms

·  Familiarity with Linux commands

 


Texts

 

    Required

 

·  Learning Spark, by Karau, Konwinski, Wendell, and Zaharia

 

    Optional

 

·  Advanced Analytics with Spark, by Ryza, Laserson, Owen, and Wills

·  Hadoop: The Definitive Guide (third edition), by Tom White

 


Tools

 

Students may choose one of two platforms for completing homework assignments:

 

1.     A fully configured virtual machine that includes Spark and the other course tools, such as the Cloudera QuickStart VM, available at:

http://www.cloudera.com/content/support/en/downloads/download-components/download-products.html?productID=F6mO278Rvo

 

2.     The NYU HPC Hadoop cluster - ‘Dumbo’

 


Course Description

 

This course covers Scala and Spark programming, Spark architecture, Spark Streaming, and the integration of Spark with the Hadoop ecosystem for developing Big Data applications. It also covers some of the complementary technologies that integrate well with Spark in such applications.

 

Students are required to complete weekly reading and programming assignments and demonstrate mastery of course topics by developing a final project using Scala, Spark, and complementary Hadoop tools.

 


Grading

 

Grades are based on the following approximate weighting:

 

Component                                         Weight
Readings, lab assignments, class participation    25%
Midterm                                           25%
Final                                             20%
Project                                           30%

 


Syllabus

        

Class   Topic
1       Course Introduction, Programming with Scala
2       Advanced Scala, Distributed Processing
3       Spark Overview, Distributed Storage
4       Cluster Resource Management, Data Ingest
5       Data Management, Data Formats
6       Data Management with Partitioning
7       Midterm Exam; NRT Data Ingest, Distributed Messaging
8       Spark RDDs (Resilient Distributed Datasets); Spark Applications
9       Spark Parallelization; Spark RDD Persistence, Spark Patterns
10      Spark Streaming; Spark SQL
11      Project Demos
12      Final Exam