Big Data Application Development

CSCI-GA.3033-005

 

NYU Courant Institute of Mathematical Sciences

Computer Science Department, Graduate School

 

Spring 2018

 


General Information

 

Lecturer: Suzanne McIntosh (mcintosh@cs.nyu.edu)

 

Office Hours: Thursday, 9:00-10:00 pm, and by appointment

 

Semester: Spring 2018

 

Room: CIWW (Courant Institute, Warren Weaver Hall) room 101

 

Day and Time: Thursday, 7:10-9:00 pm

 


Prerequisites

 

This course is designed for students who have successfully completed the Realtime and Big Data Analytics graduate course, or have experience using Hadoop MapReduce and HDFS.

 

    Requirements:

 

·  Strong programming skills in Java, Python, or C++

·  Experience using Hadoop

·  Coursework in operating systems, networking, and algorithms

·  Familiarity with Linux commands

 


Texts

 

    Required

 

·  Learning Spark, by Karau, Konwinski, Wendell, and Zaharia

 

    Optional

 

·  Advanced Analytics with Spark, by Ryza, Laserson, Owen, and Wills

·  Hadoop: The Definitive Guide (Fourth edition), by Tom White

 


Tools

 

Students will use two platforms for completing homework assignments:

 

1.     A fully configured virtual machine that includes Spark and other course tools

 

2.     The NYU HPC Hadoop cluster - ‘Dumbo’

 


Course Description

 

This is an introductory course on Spark programming, Spark architecture, Spark SQL, Spark Streaming, and the integration of Spark with the Hadoop ecosystem for developing Big Data analytics applications. It also covers technologies that integrate well with Spark when building such applications. The course project is built on Spark and can be written in either Scala or Python.

 

Students are required to complete weekly reading and programming assignments and demonstrate mastery of course topics by developing a final project using Spark.

 


Grading

 

Grades are based on the following approximate weighting:

 

Readings, lab assignments, class participation    20%
Midterm                                           30%
Final                                             30%
Project                                           20%

 


Syllabus

        

Class   Topic
1       Course Introduction, Hadoop Review, Scala Intro
2       Scala Programming, Motivation for Scala with Spark
3       Spark Overview and Architecture, Spark Programming
4       Spark RDDs, Spark Applications
5       Spark Collections
6       Spark Flow Control, Midterm Exam Review
7       Midterm Exam
8       Project Discussion, Spark Pair RDDs
9       Spark Parallel Processing, Spark RDD Persistence, Project Roundtable
10      Spark Algorithms, SparkSQL
11      Spark Streaming, Project Scrum
12      Spark Data Formats, Lambda Architecture
13      Advanced Topics, Final Exam Review, Team Breakout Session
14      Big Data Application Development Symposium
15      Final Exam