Big Data and ML — Spring 2021

Instructor Aurojit Panda (E-mail)
Help? Campuswire
When? Tuesday 5:10pm to 7pm
Where? Zoom
Day       | Time        | Who   | Where
Wednesday | 10-11am ET  | Panda | Zoom

Course Aims

This course looks at trends in cluster computing, specifically those driven by changes in hardware, applications, and privacy requirements, and at how these changes affect the systems that run modern datacenters. It aims to introduce students to recent work and to enable them to:

  • Explain the design and architecture of these systems.
  • Analyze the tradeoffs among the designs of these systems, and decide which is most appropriate for a given use case.
  • Gain experience with using and building big data systems.

Tentative Schedule and Syllabus

Date  | Topic & Readings                            | Other
02/02 | Introduction: Course Mechanics and Overview | Lab 0: Administrative
02/09 | Introduction and Overview                   | Lab 1: Setup HDFS and Spark
02/16 | Introduction and Trends                     | Whiteboard
02/23 | Scheduling                                  | Lab 1 Due
03/02 | Scheduling continued                        | Project Proposal Due.
03/09 | Storage: NVMe                               | Project Proposal Due.
03/16 | Storage: Privacy and Policies               | Midterm. Due 03/19 5pm ET.
03/23 | Communication: Introduction and Performance | Midterm. Due 03/26 5pm ET.
03/30 | Communication: Applications and Privacy     | Final Project Checkin - I
04/06 | Programming Models                          | Whiteboard
04/13 | Programming Models: Serverless              | Whiteboard
04/20 | Applications: Machine Learning              | Final Project Checkin - II
04/27 | Applications: Reinforcement Learning        | Whiteboard
05/04 | What we missed.                             | Whiteboard
05/11 |                                             | Final out. Due 05/16 5pm ET.


Grading will be based on the quality of your work and its presentation. The grade breakdown is as follows (it may change before the beginning of the semester):

  • 15% for the first lab project: This is designed to introduce you to the CloudLab infrastructure and help you set up a basic cluster.
  • 25% for the final project: This should be done in groups of 2 or 3 people. You can either (a) explore a new research idea, or (b) work on a significant implementation project. For (a), you should work on a project that could eventually lead to a paper at SoCC, OSDI, SOSP, or a similar conference. For (b), we recommend either finding an existing open source project and extending or contributing to it (e.g., developing a new scheduling policy for Kubernetes or Apache YARN), or developing a sufficiently large project of your own.

    We will have two intermediate project check-ins to give you early feedback on project progress. You are encouraged to use Campuswire and other class communication channels to ask questions and get help from others in the class.

  • 20% synthesized notes: Each student needs to sign up to produce notes for four lectures (each set of notes is worth 5%). The notes will be posted for the rest of the class. They should discuss the motivation, trends, and tradeoffs in the papers for that lecture, and may look beyond these papers. Notes are due one week after the lecture.
  • 10% class participation: Participation will be judged by responses on Campuswire and comments on synthesized notes.
  • 10% Midterm, 20% Final exam: Both are take-home.
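
As a rough illustration of how the percentages above combine, here is a minimal Python sketch. It assumes every component is scored on a 0-100 scale and that the weights listed above are final; the component names and the `course_score` helper are hypothetical and not part of any course tooling.

```python
# Weights (percent of the course grade) copied from the breakdown above.
# The dictionary keys are illustrative names, not official ones.
WEIGHTS = {
    "cluster_setup_project": 15,
    "final_project": 25,
    "synthesized_notes": 20,  # four sets of notes, 5% each
    "participation": 10,
    "midterm": 10,
    "final_exam": 20,
}


def course_score(scores):
    """Combine per-component scores (each assumed 0-100) into a total out of 100."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Sanity check: the listed percentages account for the whole grade.
assert sum(WEIGHTS.values()) == 100
```

Calling `course_score` with a dictionary holding one score per component returns the weighted total; a missing component raises an error rather than silently counting as zero.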