Distributed Computing G22.2631 (Multicore Programming) - Spring 2011
Schedule: Mondays 5 to 6:50pm
Room: CIWW 312
Instructor: Alberto Lerner (lerner@cs.nyu.edu)
Throughout this class, you will build a simple yet scalable web server. The goal in life of such a server is to respond to HTTP requests. HTTP is a human-readable protocol that your browser uses to tell a server which document you'd like to retrieve. For instance, if you typed in your browser:
It would issue an HTTP request that would look like this (in simplified form):
GET /index.html HTTP/1.1
Host: cs.nyu.edu
User-Agent: Mozilla/5.0

The NYU server would respond with (again, simplified):
HTTP/1.1 200 OK
Date: Tue, 12 Jan 2011 15:23:51 GMT
Content-Length: 14459
Content-Type: text/html

... remainder of the document ...
And finally your browser would process the HTML. (If you're curious to see this working, you could try using Firebug, described in the tools section below.)
We are interested in an HTTP server because it is arguably easy to parallelize. Requests are mostly independent, right? Not always. Take HTML document caching, for instance: it imposes coordination that may or may not scale. We'll expose and study these scalability problems.
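To make that coordination concrete, here is a rough sketch, not the labs' actual code, of a shared document cache where a single pthread mutex serializes every lookup. The names (`cache_get`, `cache_put`) and the fixed-size table are invented for illustration:

```c
#include <pthread.h>
#include <string.h>

/* Illustrative only: a tiny shared document cache guarded by one mutex. */
#define SLOTS 8

static struct {
    char path[64];
    char doc[256];
    int used;
} slots[SLOTS];

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns 1 and copies the cached document into 'out' on a hit, 0 on a miss. */
int cache_get(const char *path, char *out, size_t outlen) {
    int hit = 0;
    pthread_mutex_lock(&cache_lock);   /* every request serializes here */
    for (int i = 0; i < SLOTS; i++) {
        if (slots[i].used && strcmp(slots[i].path, path) == 0) {
            strncpy(out, slots[i].doc, outlen - 1);
            out[outlen - 1] = '\0';
            hit = 1;
            break;
        }
    }
    pthread_mutex_unlock(&cache_lock);
    return hit;
}

/* Stores a document in the first free slot (silently drops it if full). */
void cache_put(const char *path, const char *doc) {
    pthread_mutex_lock(&cache_lock);
    for (int i = 0; i < SLOTS; i++) {
        if (!slots[i].used) {
            strncpy(slots[i].path, path, sizeof slots[i].path - 1);
            strncpy(slots[i].doc, doc, sizeof slots[i].doc - 1);
            slots[i].used = 1;
            break;
        }
    }
    pthread_mutex_unlock(&cache_lock);
}
```

Notice that with one global lock, concurrent requests contend on `cache_lock` even for read-only hits; whether that scales is exactly the kind of question the labs will examine.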
In the process, we will also discuss the tools we use to program and debug such parallel code. There is a growing consensus that we need better tools. We will see why.
The practical assignments will follow the steps below.
Lab 1 - Profiling exercise
Lab 2 - Introduce concurrency in the server
Lab 3 - A thread pool for an asynchronous HTTP server
Lab 4 - Implement a shared document cache
Lab 5 - Implement a spinning lock
Lab 6 - Collect statistics
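As a taste of what's ahead, the spinning lock of Lab 5 could be sketched as a test-and-test-and-set loop built on GCC's `__sync` atomic builtins. This is a generic pattern, not the lab's required implementation:

```c
#include <pthread.h>

/* Illustrative test-and-test-and-set spinlock using GCC's __sync builtins. */
typedef struct { volatile int locked; } spinlock_t;

void spin_lock(spinlock_t *l) {
    for (;;) {
        if (__sync_lock_test_and_set(&l->locked, 1) == 0)
            return;            /* we atomically flipped 0 -> 1: lock acquired */
        while (l->locked)      /* spin on the cached value until it looks free */
            ;
    }
}

void spin_unlock(spinlock_t *l) {
    __sync_lock_release(&l->locked);   /* write 0, releasing the lock */
}

/* Demo: two threads bump a shared counter under the lock. */
static spinlock_t lk;
static long counter;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock(&lk);
        counter++;             /* critical section: one thread at a time */
        spin_unlock(&lk);
    }
    return NULL;
}

long run_demo(void) {
    pthread_t a, b;
    counter = 0;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;   /* 200000 if the lock provides mutual exclusion */
}
```

The inner `while` loop spins on an ordinary read instead of hammering the atomic instruction, which keeps cache-coherence traffic down; we'll measure why that matters.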
Please take a look at the department's academic integrity policy. You can show a colleague how to use a given tool. You can discuss strategies for solving the problems with a colleague, as long as you mention it in writing in your assignment handin. This kind of collaboration is encouraged. However, each student is expected to type, compile, debug, and benchmark their own code. You are not allowed to look at a colleague's code.
Development environment and tools
We are going to be doing all our development and benchmarking in Linux.
For convenience, a development virtual machine is available with all the packages we need pre-installed (the username with admin privileges is dev, password: dev). It runs Ubuntu 9.10, a very user-friendly distribution. Use the VM Player to run the VM on your laptop, for instance. If you already have a Linux machine you'd prefer to use, please make sure to install all the necessary packages.
Please take some time to get comfortable with the tools. The initial labs will allow for that, but you should be proactive. By Lab 3, when the true fun starts, you will want to focus on "things multicore," not on the underlying tools.
The tools in question are:
GCC and GDB are the compiler and debugger we will be using. The latter in particular is your new best friend. If you've never used it, start with this simple and excellent tutorial.
Git is the tool we will use to manage our source code. Because each lab builds on the previous one, we want to keep track of the code changes we make. Git will also be the tool I'll use to distribute code, when needed. Here is a primer on Git and a very nice cheat sheet.
Libraries such as the STL, pthreads, and sockets are going to be used quite a lot. The course assumes you know the first and will introduce the others as needed. A very accessible guide to programming with threads is Blaise Barney's. As for socket programming, do not worry if you haven't seen it. Networking code is not a prerequisite and will be given to you. I would expect you to feel a bit curious about it anyway, in which case I'd point you to Stevens et al.'s "Unix Network Programming" (from our book reserve). In more than one lab, we will use code almost straight from that book.
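If pthreads are new to you, the core of the API is small. Here is a minimal sketch of creating and joining a thread; the function names besides the pthread calls are invented for illustration:

```c
#include <pthread.h>
#include <stdio.h>

/* A thread's start routine takes and returns a void pointer. */
static void *greet(void *arg) {
    const char *who = arg;
    printf("hello from a thread, %s\n", who);
    return (void *)42;   /* the thread's exit value, collected by pthread_join */
}

int spawn_and_join(void) {
    pthread_t tid;
    void *ret;
    if (pthread_create(&tid, NULL, greet, "world") != 0)
        return -1;              /* creation failed */
    pthread_join(tid, &ret);    /* blocks until greet() returns */
    return (int)(long)ret;
}
```

Compile with `gcc -pthread`; everything else in the pthreads API (mutexes, condition variables) follows the same create/use/destroy shape.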
Code Profiling will be done in almost all the labs. There are two tools that will amaze you, I'm sure. One is Google's Performance Tools, which gives us a glimpse of where time is being spent inside our running code without disturbing it much. The other is a browser plug-in called Firebug, which adds a network activity monitor to your browser and lets you profile and inspect HTTP requests and responses.
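To give you an idea of the workflow, the CPU profiler in Google's Performance Tools is typically driven like this (the binary and file names below are placeholders):

```shell
# Link the server against the profiler library (placeholder names).
g++ -o myserver server.cc -lprofiler

# Run it with profiling on; samples go to the file named in CPUPROFILE.
CPUPROFILE=/tmp/myserver.prof ./myserver

# After the run, summarize where time was spent.
pprof --text ./myserver /tmp/myserver.prof
```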
Load Generator is a tool that creates artificial HTTP requests to hit our server. We'll be using a generator called HTTPerf. I understand you'll be thrilled by how efficiently your server handles these requests. But when running the generator on a shared machine, please be mindful of others. (Or we might get kicked out of the more powerful machines...)
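A typical httperf invocation looks like the following; the server, port, and rates are placeholders you'd adapt to your setup:

```shell
# Open 100 connections total, at 10 per second, each fetching /index.html.
httperf --server localhost --port 8080 --uri /index.html \
        --num-conns 100 --rate 10
```

Start with a low rate and raise it gradually while watching your server's response times.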
Our mailing list g22_2631_001_sp11 can be a great source of help. Don't be ashamed to use it.