CS439 Spring 2013 Lab 8: Web Proxy

Handed out Saturday, April 20, 2013
Part A due Monday, April 29, 2013, 11:59 PM (extended from Friday, April 26)
Part B due Monday, May 6, 2013, 11:59 PM (extended from Friday, May 3)
UPDATE: NOTE THAT LATE PART Bs CANNOT BE ACCEPTED.

Introduction

In this lab, you will write a concurrent Web proxy that logs requests. In the first part of the lab, you will write a simple sequential proxy that repeatedly waits for a request, forwards the request to the end server, and returns the result back to the browser, keeping a log of such requests in a disk file. This part will help you understand basics about network programming and the HTTP protocol.

In the second part of the lab, you will upgrade your proxy so that it uses threads to deal with multiple clients concurrently. This part will exercise your experience with concurrency and synchronization.

You are programming in pairs for this assignment.

Late policy. Part A of the lab will be treated as every other lab has been: you are welcome to use late hours, if you have them, and if you turn in part A late, your grade on that part of the lab decreases each hour (as usual). However, you cannot use late hours for part B, and furthermore, late part Bs will not be accepted (this is an exception to the usual lateness policy).

Getting Started

Use Git to commit your Lab 7 source, fetch the latest version of the course repository, and then create a local branch called lab8 based on our lab8 branch, origin/lab8:

tig% cd ~/cs439/labs
tig% git commit -am 'my solution to lab7'
Created commit 734fab7: my solution to lab7
 4 files changed, 42 insertions(+), 9 deletions(-)
tig% git pull
Already up-to-date.
tig% git checkout -b lab8 origin/lab8
Branch lab8 set up to track remote branch refs/remotes/origin/lab8.
Switched to a new branch "lab8"
tig% make tidy
Removing ...
tig% 

You should find the following files after checking out the lab:

Lab Requirements

As before, you will need to do all of the regular exercises described in the lab. You do not need to do any of the challenge problems, though we suggest you try to do one if you have the time. The challenge problems here, however, can be done for extra credit. Follow the instructions given in the challenge problems if you implement them so that we, the teaching staff, know to look at and grade them. Note that you can never lose points from attempting to do a challenge problem. We will only check the challenges at the part B turn-in deadline, so you have the full two weeks to implement a challenge if you so choose.

Hand-In Procedure

When you are ready to hand in your lab code, run make turnin-parti in the labs directory, where i is the part you want to turn in (A or B). This will first do a make clean to clean out any .o files and executables, and then create a tar file called lab8i-handin.tar.gz with the entire contents of your lab directory and submit it via the CS turnin utility. If you submit multiple times, we will take the latest submission and count lateness accordingly.

Background

A Web proxy is a program that acts as a middleman between a Web browser and an end server. Instead of contacting the end server directly to get a Web page, the browser contacts the proxy, which forwards the request on to the end server. When the end server replies to the proxy, the proxy sends the reply on to the browser.

Proxies are used for many purposes. Sometimes proxies are used in firewalls, such that the proxy is the only way for a browser inside the firewall to contact an end server outside. The proxy may do translation on the page, for instance, to make it viewable on a Web-enabled cell phone. Proxies are also used as anonymizers. By stripping a request of all identifying information, a proxy can make the browser anonymous to the end server. Proxies can even be used to cache Web objects, by storing a copy of, say, an image when a request for it is first made, and then serving that image in response to future requests rather than going to the end server. Proxies can also be used to turn the Web upside-down.

Overview of HTTP/1.0

The HyperText Transfer Protocol (HTTP) is a simple protocol for requesting and receiving documents. In version 1.0 of HTTP, every request/response pair gets its own network connection. A request message has four parts:

  1. Request line. The request line is ASCII text that identifies a method, a resource ID, and a protocol. For your proxy, the method will always be "GET", the resource ID will always be a full URL, and for HTTP/1.0 the protocol will always be "HTTP/1.0". Each request line ends with a carriage return (<CR>, a.k.a. '\r') followed by a linefeed (<LF>, a.k.a. '\n'). So, a request line might look like this:
    GET http://www.utexas.edu/index.html HTTP/1.0 <CR><LF>
    
  2. Headers. Headers are ASCII text that provide parameters to the server to guide the processing of the request. Each header line has a tag, a colon (:), and a value followed by <CR><LF>.

    You will pass each header you receive from the client to the server unmodified, so you don't need to worry about the meaning of tag/value pairs.

  3. Empty line. Each request is terminated with an empty line:
    <CR><LF>
    
  4. Optional message body. Some HTTP requests such as POST include a message body. You will only worry about GET requests, so the message body will never be present.
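
The parts listed above can be assembled into a single buffer before sending. The following is a minimal sketch, assuming a hypothetical helper named build_get_request (not part of the provided lab code) and a caller that supplies any header lines already terminated with <CR><LF>:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of the lab code: writes
 * "GET <url> HTTP/1.0\r\n", then any header lines supplied by the caller
 * (each already ending in \r\n), then the terminating empty line.
 * Returns the request length, or -1 if it does not fit in out. */
int build_get_request(char *out, size_t outsz, const char *url,
                      const char *headers)
{
    int n = snprintf(out, outsz, "GET %s HTTP/1.0\r\n%s\r\n",
                     url, headers ? headers : "");
    if (n < 0 || (size_t)n >= outsz)
        return -1;   /* output was truncated or formatting failed */
    return n;
}
```

Passing NULL for headers yields just the request line followed by the empty line, which is what the telnet session in Exercise 1 sends by hand.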

By default, HTTP servers use port number 80. So, if you receive a request for http://www.utexas.edu/index.html, this means connect to machine www.utexas.edu using port 80 to request the document /index.html. Sometimes a server will use a different port number. In this case, the port number follows the host name in the URL, separated by a colon. For example, http://www.utexas.edu:3000/index.html would say to use port 3000 on machine www.utexas.edu.
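
To make the host and port rules concrete, here is a rough sketch of how a URL could be split into host, port, and path. The name split_url is hypothetical and is shown only for illustration; in the lab itself you should rely on the provided parse_uri routine, whose interface may differ from this one.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration only; the lab provides parse_uri for real use.
 * Splits "http://host[:port]/path" into its parts. Returns 0 on success,
 * -1 on a malformed URL or if host or path does not fit. */
int split_url(const char *url, char *host, size_t hostsz,
              int *port, char *path, size_t pathsz)
{
    const char *prefix = "http://";
    if (strncmp(url, prefix, strlen(prefix)) != 0)
        return -1;
    const char *h = url + strlen(prefix);
    size_t hlen = strcspn(h, ":/");      /* host ends at ':', '/', or end */
    if (hlen == 0 || hlen >= hostsz)
        return -1;
    memcpy(host, h, hlen);
    host[hlen] = '\0';

    const char *rest = h + hlen;
    *port = 80;                           /* HTTP default port */
    if (*rest == ':') {
        char *end;
        long p = strtol(rest + 1, &end, 10);
        if (end == rest + 1 || p < 1 || p > 65535)
            return -1;
        *port = (int)p;
        rest = end;
    }
    if (*rest == '\0')
        rest = "/";                       /* no path given: use the root */
    else if (*rest != '/')
        return -1;
    if (strlen(rest) >= pathsz)
        return -1;
    strcpy(path, rest);
    return 0;
}
```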

After receiving a request, a server sends a response. Your proxy will just send the response bytes back to the client unmodified, so you don't need to parse the response. A response message looks like this:

HTTP/1.0 200 OK<CR><LF>
Date: Thu, 19 May 2011 14:21:19 GMT
Last-Modified: Thu, 12 May 2011 11:32:26 GMT
Content-length: 1249
Content-type: text/html; charset=UTF-8

... [1249 bytes that represent the requested document] ...

The first line gives the protocol (HTTP/1.0) and status (code 200, which means "OK"). The next four lines (in this example) are headers that describe aspects of the response. They are followed by a blank line and then the requested data.

You can find additional details on the protocol here: http://www.w3.org/Protocols/HTTP/1.0/spec.html

Changes in HTTP/1.1

HTTP/1.1 added a significant performance optimization to HTTP. Instead of opening and closing a connection for each request, multiple requests can be pipelined on a single connection: a client sends a series of requests on a connection without waiting for any responses, then the server sends a series of replies (in the same order as the requests) to the client.

HTTP/1.1 is supported by all major browsers, so your proxy will see many HTTP/1.1 requests from clients. Fortunately, for backwards compatibility, clients running HTTP/1.1 must be prepared to deal with servers that only know about HTTP/1.0, so your proxy can simply read one request, send it to the server, receive the response, and close the connection. The client will then know to send its next request on a new connection.

You can find additional details on the protocol here: http://www.w3.org/Protocols/HTTP/1.1/spec.html
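
Since clients will often send HTTP/1.1 request lines while the proxy speaks HTTP/1.0 to servers, one possible approach is to rewrite the version token before forwarding the request line. This is only a sketch of that idea; downgrade_request_line is a hypothetical name, and whether you rewrite the line or rebuild it from parsed pieces is a design choice left to you.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies a request line such as
 * "GET http://host/ HTTP/1.1\r\n" into out with the version replaced by
 * HTTP/1.0. Returns 0 on success, -1 if the line has no " HTTP/" token
 * or the result does not fit in out. */
int downgrade_request_line(const char *line, char *out, size_t outsz)
{
    const char *ver = strstr(line, " HTTP/");
    if (ver == NULL)
        return -1;
    int prefix_len = (int)(ver - line);   /* method and URL, unchanged */
    int n = snprintf(out, outsz, "%.*s HTTP/1.0\r\n", prefix_len, line);
    if (n < 0 || (size_t)n >= outsz)
        return -1;
    return 0;
}
```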

Part A: Implementing a Sequential Web Proxy

In this part you will implement a sequential logging proxy. Your proxy should open a socket and listen for a connection request. When it receives a connection request, it should accept the connection, read the HTTP request, and parse it to determine the name of the end server. It should then open a connection to the end server, send it the request, receive the reply, forward the reply to the browser, and close the connection.

Since your proxy is a middleman between client and end server, it will have elements of both. It will act as a server to the web browser, and as a client to the end server. Thus you will get experience with both client and server programming.

Unlike in previous labs, we have left the design of your proxy largely up to you. While you must conform to the multi-threaded coding standards we have given you and while you must write clean and readable C code, we will not generally dictate how you should implement your proxy, only what it must be capable of doing and what format its outputs must be in.

Before going any further, read every word of chapter 12 of Bryant and O'Hallaron. This chapter not only explains concepts that are extremely relevant to this project, but also provides code snippets that you may find useful to use or to model your code after.

Logging Requirements

Your proxy should keep track of all requests in a log file named proxy.log. Each log file entry should be a line of the form:

Date: browserIP URL size
where browserIP is the IP address of the browser, URL is the URL asked for, and size is the size in bytes of the response that was returned (including everything in the response message: protocol, status, headers, blank line, and body of the response.) Here's an example log entry:
Thu 18 Apr 2013 12:55:01 CDT: 128.83.130.113 http://www.utexas.edu/ 62189

Note that size is essentially the number of bytes received from the end server, from the time the connection is opened to the time it is closed. Only requests that are met by a response from an end server should be logged. We have provided the function format_log_entry in csapp.c to create a log entry in the required format.
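
For orientation, the entry format above can be produced roughly as follows. This is a stand-in written for illustration, with the hypothetical name sketch_log_entry; the provided format_log_entry in csapp.c is the authoritative formatter and its signature may differ.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for illustration; use the lab's format_log_entry
 * in real code. Writes "Date: browserIP URL size" into out.
 * Returns 0 on success, -1 on failure or truncation. */
int sketch_log_entry(char *out, size_t outsz, time_t when,
                     const char *browser_ip, const char *url, size_t size)
{
    char date[64];
    /* localtime is not thread-safe; a threaded proxy would need a
     * reentrant variant or a lock around this call. */
    struct tm *tm_ptr = localtime(&when);
    if (tm_ptr == NULL)
        return -1;
    /* Produces text shaped like "Thu 18 Apr 2013 12:55:01 CDT". */
    if (strftime(date, sizeof date, "%a %d %b %Y %H:%M:%S %Z", tm_ptr) == 0)
        return -1;
    int n = snprintf(out, outsz, "%s: %s %s %zu", date, browser_ip, url, size);
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

The size argument would be the running total of bytes read from the end server for that request.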

Port Numbers

Your proxy should listen for its connection requests on the port number passed in on the command line, like so:

tig% ./proxy 15213
You may use any port number p, where 1024 ≤ p ≤ 65535, and where p is not currently being used by any other system or user services (including other students' proxies). See /etc/services for a list of the port numbers reserved by other system services.
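
The command-line argument should be validated before you try to bind to it. Below is a minimal sketch, assuming a hypothetical helper named parse_port; because TCP port numbers are 16-bit values, the largest port it accepts is 65535.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: returns the port as an int, or -1 if arg is not a
 * plain decimal number in the unprivileged range 1024..65535. */
int parse_port(const char *arg)
{
    char *end;
    errno = 0;
    long p = strtol(arg, &end, 10);
    if (errno != 0 || end == arg || *end != '\0')
        return -1;              /* empty, non-numeric, or trailing junk */
    if (p < 1024 || p > 65535)
        return -1;              /* reserved or out of range */
    return (int)p;
}
```

In main, something like parse_port(argv[1]) returning -1 would be a reasonable point to print a usage message and exit.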

You will develop your sequential proxy in three stages. In the first two stages, you will implement some of the core networking functionality for the proxy to handle HTTP/1.0 communication with remote servers and to accept incoming connections from clients. In the last stage, you will combine the functionality from the first two stages to complete your sequential proxy.

Exercise 1. Write the client-side portion of your proxy. You should be able to connect your proxy to a remote HTTP server, send an HTTP/1.0 GET request for the root web page of that server, and then dump the returned HTTP response to a file (separate from the log file) or to the console. The particular remote server doesn't matter in this case (in the full proxy, it depends on what the proxy's client requests); you can hard-code one for now.

For example, your client should be able to send a GET request to Google's web servers that looks like the following:

GET http://www.google.com/ HTTP/1.0
and then print out the response you get to the console. You can check to see if your code is working by using telnet to emulate what your code is doing (or should be doing) and comparing the output of the two programs. Here's an example of how to test with telnet:
tig% telnet www.google.com 80
Trying 74.125.227.81...
Connected to www.google.com.
Escape character is '^]'.
GET http://www.google.com/ HTTP/1.0

HTTP/1.0 200 OK
Date: Wed, 17 Apr 2013 20:07:32 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
... [more of the header, then a blank line, then a bunch of HTML] ...
Note that you need to press enter twice after typing the GET line.
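
When your own code sends the request instead of telnet, keep in mind that write on a socket may accept fewer bytes than you asked it to send. The sketch below shows one way to loop until everything has been written; write_all is a hypothetical name, and the textbook's robust I/O routines address the same problem.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: writes exactly n bytes from buf to fd, retrying
 * after short writes and EINTR. Returns 0 on success, -1 on error. */
int write_all(int fd, const void *buf, size_t n)
{
    const char *p = buf;
    while (n > 0) {
        ssize_t w = write(fd, p, n);
        if (w < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -1;
        }
        p += w;                 /* skip past the bytes already sent */
        n -= (size_t)w;
    }
    return 0;
}
```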

You may find the following hints helpful as you embark on this part of the project:

Exercise 2. Write the server-side code of your proxy. Your proxy should be able to listen for connections on a socket, wait for a client to send a GET request, and then send the client an error message thereafter. Your error message should have the format of an HTTP response, with a bit of HTML for a browser to display a nicely-formatted error message to the user. For the time being, you may disable the client-side code of the proxy to develop the server-side code.

A nicely-formatted error response might look something like this example modified from the IANA website (www.iana.org):

HTTP/1.0 400 Bad Request
Content-Length: 162
Content-Type: text/html

<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not
understand.<br />
</p>
</body></html>
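
An error response like the one above is easiest to get right if the Content-Length header is computed from the body rather than typed in by hand. Here is a sketch under that assumption; build_error_response is a hypothetical helper, and the HTML it emits is only one possible layout.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: builds a complete HTTP/1.0 error response whose
 * Content-Length is computed from the generated body. Returns the
 * response length, or -1 if it does not fit. */
int build_error_response(char *out, size_t outsz, int code,
                         const char *reason, const char *detail)
{
    char body[512];
    int blen = snprintf(body, sizeof body,
        "<html><head>\r\n<title>%d %s</title>\r\n</head><body>\r\n"
        "<h1>%s</h1>\r\n<p>%s</p>\r\n</body></html>\r\n",
        code, reason, reason, detail);
    if (blen < 0 || (size_t)blen >= sizeof body)
        return -1;
    int n = snprintf(out, outsz,
        "HTTP/1.0 %d %s\r\n"
        "Content-Length: %d\r\n"
        "Content-Type: text/html\r\n"
        "\r\n%s",
        code, reason, blen, body);
    if (n < 0 || (size_t)n >= outsz)
        return -1;
    return n;
}
```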

You can test the server-side code of your proxy by using telnet. In one terminal, run ./proxy <port number>, where the port number is any number p in the range 1024 ≤ p ≤ 65535. In another terminal, on the same machine, run telnet localhost <port number>. Once telnet is running, you can submit a GET request (much like you did for testing in exercise 1) to your proxy. Once you submit the GET request, you should see your error message print on the console, and then the connection should close.

You should continue to pay heed to the hints given in exercise 1 with respect to doing I/O on sockets.

Exercise 3. Implement an HTTP/1.0 proxy which can handle a single client at a time and log all requests it receives. In the previous exercises, you've built up the client and server components of your proxy. Now, you need to fuse those two parts to accept incoming connections from clients, parse GET requests from them, forward those GET requests to the correct remote server, and then siphon the response from that remote server back to the client. If there's an error in processing a request, you can send the client back an error message. Remember to log all requests you get!

Since we want you to focus on network programming issues for this lab, we have provided you with two additional helper routines: parse_uri, which extracts the hostname, path, and port components from a URI, and format_log_entry, which constructs an entry for the log file in the proper format.
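
The step of siphoning the response back to the client, while counting bytes for the log, might look roughly like the following. This is a sketch under the assumption that the server signals the end of an HTTP/1.0 response by closing its connection; relay_until_eof is a hypothetical name.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: copies bytes from from_fd to to_fd until EOF.
 * Returns the total number of bytes relayed, or -1 on an I/O error. */
long relay_until_eof(int from_fd, int to_fd)
{
    char buf[4096];
    long total = 0;
    for (;;) {
        ssize_t r = read(from_fd, buf, sizeof buf);
        if (r == 0)
            return total;           /* peer closed: response complete */
        if (r < 0) {
            if (errno == EINTR)
                continue;           /* interrupted: retry the read */
            return -1;
        }
        ssize_t off = 0;
        while (off < r) {           /* handle short writes */
            ssize_t w = write(to_fd, buf + off, (size_t)(r - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += r;
    }
}
```

The return value is a natural candidate for the size field of the log entry, since it counts every byte received from the end server.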

Once you've written some more functionality into your proxy, you should test your proxy by pointing a real browser to it. You should make sure that your proxy properly redirects HTTP traffic between you, the end-user, and the remote server you connect to through the proxy. Here are pointers to instructions for how to use a proxy in a variety of modern browsers: Chrome, Firefox, Internet Explorer, and Opera.

Challenge! (15 points) Update your proxy so that if launched with the option "-persistent" it will support HTTP/1.1 persistent connections, pipelining multiple requests on a single TCP connection. Your proxy should continue to log requests as in the cases above.

Note that for grading purposes, if the "-persistent" option is not passed in on the command line, the proxy should accept HTTP/1.0 or HTTP/1.1 requests from clients, but it should only emit HTTP/1.0 requests to servers as described in the earlier part of the lab.

To get credit for this extra credit portion of the lab, you must add a line to the initial comment in the proxy.c file:

* EXTRA CREDIT: HTTP/1.1 PERSISTENT CONNECTIONS IMPLEMENTED AND TESTED

Challenge! (10 points) Update your proxy so that if launched with the option "-upsidedown" it will flip all .jpg, .gif, .png, and other image files downloaded so that the user sees all images as upside-down on their display. (If this option is not passed, then transformation should not happen.)

Hint: check out the man page for the mogrify program, which is available on the CS Linux machines in /usr/bin/mogrify.

Feel free to add additional options so that your proxy can distort the Internet in other amusing ways. How about Pig-Latin-Net, Elmer-Fudd-Net, or 5up4-1337-h@X0r-Net? Perhaps you can modify hyperlinks to point to totally different web pages? Be creative!

To get credit for this extra credit portion of the lab, you must add a line to the initial comment in the proxy.c file:

* EXTRA CREDIT: UPSIDE DOWN INTERNET IMPLEMENTED AND TESTED
*   OTHER SUPPORTED TRANSFORMATIONS: [list command line options here]

This completes part A of the lab. Make sure that any files you have added to the proxy have also been git added and that your proxy builds just by running make. Then make sure you have committed your changes with git commit, then run make turnin-partA.

Part B: Handling Concurrent Requests

Real proxies do not process requests sequentially. They deal with multiple requests concurrently. Once you have a working sequential logging proxy, you should alter it to handle multiple requests concurrently. Your solution should use a thread pool of NTHREADS threads (where NTHREADS has been defined in proxy.c). In particular, you must not be creating threads on-demand to handle requests. This is one design point which you must adhere to, though we otherwise have left the design of the proxy up to you.
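
One common way to feed a fixed pool of threads is a bounded queue of accepted connection descriptors, protected by a mutex and two condition variables. The sketch below assumes that design; struct ConnQueue, the connq_ functions, and the capacity of 16 are all hypothetical choices, not requirements of the lab.

```c
#include <pthread.h>

#define CONNQ_CAP 16   /* assumed capacity, not specified by the lab */

struct ConnQueue {
    pthread_mutex_t lock;       /* protects every field below */
    pthread_cond_t  not_empty;  /* signaled when a descriptor is added */
    pthread_cond_t  not_full;   /* signaled when a descriptor is removed */
    int fds[CONNQ_CAP];
    int head;                   /* index of the oldest entry */
    int count;                  /* number of queued descriptors */
};

void connq_init(struct ConnQueue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
    q->head = 0;
    q->count = 0;
}

/* Called by the accepting thread; blocks while the queue is full. */
void connq_put(struct ConnQueue *q, int fd)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == CONNQ_CAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->fds[(q->head + q->count) % CONNQ_CAP] = fd;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Called by each worker thread; blocks while the queue is empty. */
int connq_get(struct ConnQueue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    int fd = q->fds[q->head];
    q->head = (q->head + 1) % CONNQ_CAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return fd;
}
```

In this arrangement the main thread would create NTHREADS workers once at startup, then loop on accept and connq_put, while each worker loops on connq_get, handles one request, and closes the descriptor. The while loops around pthread_cond_wait recheck the condition after every wakeup.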

With this approach, it is possible for multiple peer threads to access the log file concurrently. Thus, you will need to synchronize access to the file such that only one peer thread can modify it at a time. If you do not synchronize the threads, the log file might be corrupted. For instance, one line in the file might begin in the middle of another.

You must follow all multi-threaded coding conventions that are standard in this class. These conventions are codified in Coding Standards for Programming with Threads, by Mike Dahlin. Note that you will have to adopt an object-oriented style of programming with the non-object-oriented C language you are using for this project.

For example, your implementation should create a struct Log representing the log file and its synchronization variables, and you should define a set of functions that operate on that struct. Each of these functions should be named log_<something> and each should take a pointer to a struct Log as its first argument.
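
As a rough illustration of that shape, a struct Log might bundle the file handle with the mutex that guards it. The field names and the particular functions below (log_init, log_append, log_destroy) are assumptions made for this sketch; only the struct name and the log_<something> naming pattern come from the requirements above.

```c
#include <pthread.h>
#include <stdio.h>

struct Log {
    pthread_mutex_t lock;   /* protects fp */
    FILE *fp;               /* the open log file */
};

/* Opens path for appending. Returns 0 on success, -1 on failure. */
int log_init(struct Log *log, const char *path)
{
    log->fp = fopen(path, "a");
    if (log->fp == NULL)
        return -1;
    pthread_mutex_init(&log->lock, NULL);
    return 0;
}

/* Appends one entry as a single line; the lock ensures that only one
 * thread writes at a time, so lines cannot interleave. */
void log_append(struct Log *log, const char *entry)
{
    pthread_mutex_lock(&log->lock);
    fprintf(log->fp, "%s\n", entry);
    fflush(log->fp);
    pthread_mutex_unlock(&log->lock);
}

/* Closes the file and releases the mutex; call once, after all worker
 * threads have stopped using the log. */
void log_destroy(struct Log *log)
{
    pthread_mutex_lock(&log->lock);
    fclose(log->fp);
    log->fp = NULL;
    pthread_mutex_unlock(&log->lock);
    pthread_mutex_destroy(&log->lock);
}
```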

Exercise 4. Extend your proxy to handle multiple concurrent requests from many clients, using a pool of NTHREADS threads to handle requests as they arrive. Do not spawn a new thread for every client request. You may find the following hints helpful as you work to extend your proxy:

This completes part B of the lab. Make sure that any files you have added to the proxy have also been git added and that your proxy builds just by running make. Then make sure you have committed your changes with git commit, then run make turnin-partB.


Last updated: Thu Apr 25 19:15:28 -0500 2013