Lab 5: Proxy lab

Aaron Bauer

May 18, 2022

Lab 5: Concurrent Caching Web Proxy [1]

Introduction

A Web proxy is a program that acts as a middleman between a Web browser and an end server. Instead of contacting the end server directly to get a Web page, the browser contacts the proxy, which forwards the request on to the end server. When the end server replies to the proxy, the proxy sends the reply on to the browser.

Proxies are useful for many purposes. Sometimes proxies are used in firewalls, so that browsers behind a firewall can only contact a server beyond the firewall via the proxy. Proxies can also act as anonymizers: by stripping requests of all identifying information, a proxy can make the browser anonymous to Web servers. Proxies can even cache web objects: a proxy stores local copies of the objects it fetches from servers and then answers future requests from its cache rather than contacting the remote servers again.

In this lab, you will write a simple HTTP proxy that caches web objects. For the first part of the lab, you will set up the proxy to accept incoming connections, read and parse requests, forward requests to web servers, read the servers’ responses, and forward those responses to the corresponding clients. This first part will involve learning about basic HTTP operation and how to use sockets to write programs that communicate over network connections. In the second part, you will add caching to your proxy using a simple main-memory cache of recently accessed web content. In the third part, you will upgrade your proxy to handle multiple concurrent connections, which will introduce you to concurrency, a crucial systems concept. Finally, you will implement additional test cases to test aspects of your proxy not covered by the autograder.

Logistics

You can download the files you need for this lab with
    wget http://cs.carleton.edu/faculty/awb/cs208/s22/handouts/lab5-handout.tar
Then run the command
    tar xvf lab5-handout.tar

This will generate a handout directory called lab5-handout. The README file describes the various files.

proxy.c contains starter code and comments for Parts I-III described below. This starter code is simply guidance and not a requirement—you are free to modify and/or ignore it in any way you wish.

Part I: Implementing a sequential web proxy

The first step is implementing a basic sequential proxy that handles HTTP/1.0 GET requests. Other request types, such as POST, are strictly optional. Implement the handle_request function in the provided starter code (and any helper functions you decide to add) to complete this part.

When started, your proxy should listen for incoming connections on a port whose number will be specified on the command line. Once a connection is established, your proxy should read the entirety of the request from the client and parse the request. It should determine whether the client has sent a valid HTTP request; if so, it can establish its own connection to the appropriate web server and request the object the client specified. Finally, your proxy should read the server’s response and forward it to the client.

The starter code in main takes care of listening on the port passed in on the command line. When it receives a connection, it creates a new socket and passes the associated file descriptor to handle_request. Your code in handle_request should use the RIO functions to read the request from the client. For example, the Tiny server’s doit function uses the code below to initialize buffered reading on the socket file descriptor fd and read the first line from the client:

    rio_t rio;

    /* Read request line and headers */
    Rio_readinitb(&rio, fd);
    if (!Rio_readlineb(&rio, buf, MAXLINE))
        return;

Note that RIO has functions for buffered reading (Rio_readlineb, Rio_readnb) and functions for unbuffered reading and writing (Rio_readn, Rio_writen). You should not interleave buffered and unbuffered calls on the same file descriptor: the buffered functions read ahead into an internal buffer, so an unbuffered read on the same descriptor can miss data that has already been consumed.

Once you have read and parsed the first line of the request from the client, you will need to open a socket to the server. Use the provided Open_clientfd function for this. It takes two string arguments, the hostname and the port, and returns a file descriptor.

HTTP/1.0 GET requests

When an end user enters a URL such as http://www.carleton.edu/index.html into the address bar of a web browser, the browser will send an HTTP request to the proxy that begins with a line that might resemble the following:

    GET http://www.carleton.edu/index.html HTTP/1.1

In that case, the proxy should parse the request into the following fields: the hostname, www.carleton.edu; and the path or query and everything following it, /index.html. That way, the proxy can determine that it should open a connection to www.carleton.edu and send an HTTP request of its own starting with a line of the following form:

    GET /index.html HTTP/1.0

Note that all lines in an HTTP request end with a carriage return, '\r', followed by a newline, '\n'. Also important is that every HTTP request is terminated by an empty line: "\r\n".
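The rewrite from the browser's request line to the proxy's own can be sketched with snprintf. This is only an illustration: build_request_line is a hypothetical helper, not part of the starter code, and it assumes the path has already been extracted from the URL.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not in the starter code): writes
 * "GET <path> HTTP/1.0\r\n" into out. Returns the number of bytes
 * written, or -1 if the line does not fit in outlen bytes. */
int build_request_line(char *out, size_t outlen, const char *path)
{
    int n = snprintf(out, outlen, "GET %s HTTP/1.0\r\n", path);
    if (n < 0 || (size_t)n >= outlen)
        return -1;                  /* output error or truncated line */
    return n;
}
```

After the request line and any headers have been sent, the request still needs the terminating empty line, "\r\n".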

You should notice in the above example that the web browser’s request line ends with HTTP/1.1, while the proxy’s request line ends with HTTP/1.0. Modern web browsers will generate HTTP/1.1 (or HTTP/2 or even HTTP/3) requests, but your proxy should handle them and forward them as HTTP/1.0 requests.

If a browser sends any request headers as part of an HTTP request, your proxy should forward them unchanged.

Remember that not all content on the web is ASCII text. Much of the content on the web is binary data, such as images and video. Ensure that you account for binary data when selecting and using functions for network I/O.
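To make the binary-data point concrete, here is a minimal sketch of a copy loop that tracks lengths explicitly. It uses plain POSIX read and write rather than the RIO package so that it stands alone; forward_bytes is a hypothetical name. RIO functions that take a byte count, such as Rio_readnb and Rio_writen, serve the same purpose in your proxy.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: copies bytes from src to dst until EOF on src.
 * Lengths come from read()'s return value, never from strlen(), so
 * embedded '\0' bytes in images or video are forwarded intact.
 * Returns the total number of bytes copied, or -1 on error. */
ssize_t forward_bytes(int src, int dst)
{
    char buf[4096];
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(src, buf, sizeof buf);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted: retry the read */
            return -1;
        }
        if (n == 0)
            return total;           /* EOF: the sender closed its end */

        ssize_t off = 0;
        while (off < n) {           /* write() may accept only part */
            ssize_t w = write(dst, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;       /* interrupted: retry the write */
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

String functions such as strlen or strcpy would stop at the first zero byte of an image, which is why the loop never uses them on response data.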

Port numbers

There are two significant classes of port numbers for this lab: HTTP request ports and your proxy’s listening port.

The HTTP request port is an optional field in the URL of an HTTP request. That is, the URL may be of the form http://www.carleton.edu:8080/index.html, in which case your proxy should connect to the host www.carleton.edu on port 8080 instead of the default HTTP port, which is port 80.
The provided autograder always supplies a port number with its requests.

The listening port is the port on which your proxy should listen for incoming connections. Your proxy should accept a command line argument specifying the listening port number for your proxy. For example, with the following command, your proxy should listen for connections on port 15213:

    $ ./proxy 15213

You may select any non-privileged listening port (greater than 1,024 and less than 65,536) as long as it is not used by other processes. Since each proxy must use a unique listening port and many people may be simultaneously working on mantis, the script port-for-user.pl is provided to help you pick your own personal port number. Use it to generate a port number based on your user ID:

    $ ./port-for-user.pl awb
    awb: 16362

The port, \(p\), returned by port-for-user.pl is always an even number. So if you need an additional port number, say for the Tiny server, you can safely use ports \(p\) and \(p+1\).

Please don’t pick your own random port. If you do, you run the risk of interfering with another user.
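If you want to check the listening-port rules above in code, a sketch along the following lines is one option. The starter code already handles listening, so this is purely illustrative, and parse_listen_port is a hypothetical helper.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: converts a command-line argument to a listening
 * port. Returns the port, or -1 if arg is not a whole decimal number
 * strictly between 1024 and 65536 (the non-privileged range). */
int parse_listen_port(const char *arg)
{
    char *end;
    errno = 0;
    long port = strtol(arg, &end, 10);

    if (errno != 0 || end == arg || *end != '\0')
        return -1;                  /* not a clean decimal number */
    if (port <= 1024 || port >= 65536)
        return -1;                  /* privileged or out of range */
    return (int)port;
}
```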

String parsing

This part is essentially a string parsing problem—given a URL, extract the hostname, the path, and (potentially) a port. To help you devise your own approach to this, here are some of the tools available in C for string parsing:
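As one illustration, the sketch below combines strncmp, strcspn, strspn, and memcpy to split a URL into its parts. It is not the only reasonable approach, and parse_url with its signature is a hypothetical helper rather than something the starter code defines. It assumes the URL starts with http:// and treats anything else as malformed.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: splits "http://host[:port]/path" into its parts.
 * Each output buffer must hold at least len bytes. The port defaults
 * to "80" and the path to "/" when the URL omits them.
 * Returns 0 on success, -1 if the URL is malformed or a part is too long. */
int parse_url(const char *url, char *host, char *port, char *path, size_t len)
{
    static const char scheme[] = "http://";

    if (len < 3 || strncmp(url, scheme, strlen(scheme)) != 0)
        return -1;
    const char *p = url + strlen(scheme);

    size_t hostlen = strcspn(p, ":/");      /* host ends at ':' or '/' */
    if (hostlen == 0 || hostlen >= len)
        return -1;
    memcpy(host, p, hostlen);
    host[hostlen] = '\0';
    p += hostlen;

    if (*p == ':') {                        /* explicit port follows */
        p++;
        size_t portlen = strspn(p, "0123456789");
        if (portlen == 0 || portlen >= len)
            return -1;
        memcpy(port, p, portlen);
        port[portlen] = '\0';
        p += portlen;
    } else {
        strcpy(port, "80");                 /* default HTTP port */
    }

    if (*p == '\0') {                       /* no path given */
        strcpy(path, "/");
        return 0;
    }
    if (*p != '/' || strlen(p) >= len)      /* junk after port, or too long */
        return -1;
    strcpy(path, p);                        /* path plus any query string */
    return 0;
}
```

The host and port strings produced this way have the form that Open_clientfd expects for its two arguments.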

Part II: Caching web objects

For the next part of the lab, you will add a cache to your proxy that stores recently-used web objects in memory. Here a web object just means a file sent by a web server. HTTP actually defines a fairly complex model by which web servers can give instructions as to how the objects they serve should be cached and clients can specify how caches should be used on their behalf. However, your proxy will adopt a simplified approach.

When your proxy receives a web object from a server, it should cache it in memory as it transmits the object to the client. If another client requests the same object from the same server, your proxy need not reconnect to the server; it can simply resend the cached object.

The starter code in proxy.c provides struct definitions and a set of cache functions you can use to get started. Under that design, when the proxy receives a request, it should look up the URL in the cache. If an entry is found, the proxy should send the associated item to the client. If an entry isn’t found, when the proxy receives a response from the server, it should buffer that response in memory, and then insert an entry into the cache with the request’s URL as the url and the contents of the buffer as the item.
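The lookup-then-insert flow can be sketched with a singly linked list. The struct layout and the names cache_lookup and cache_insert below are assumptions made for illustration; the definitions in proxy.c may differ. The sketch also omits eviction, size limits, and any synchronization needed once requests are handled concurrently.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical structures; the definitions in proxy.c may differ. */
typedef struct cache_entry {
    char *url;                  /* key: the full request URL */
    char *item;                 /* value: raw response bytes */
    size_t size;                /* item length (item may contain '\0') */
    struct cache_entry *next;
} cache_entry;

typedef struct {
    cache_entry *head;
} cache;

/* Returns the entry stored for url, or NULL on a miss. */
const cache_entry *cache_lookup(const cache *c, const char *url)
{
    for (const cache_entry *e = c->head; e != NULL; e = e->next)
        if (strcmp(e->url, url) == 0)
            return e;
    return NULL;
}

/* Copies url and item into a new entry at the front of the list.
 * Returns 0 on success, -1 if allocation fails. */
int cache_insert(cache *c, const char *url, const char *item, size_t size)
{
    cache_entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->url = malloc(strlen(url) + 1);
    e->item = malloc(size > 0 ? size : 1);
    if (e->url == NULL || e->item == NULL) {
        free(e->url);
        free(e->item);
        free(e);
        return -1;
    }
    strcpy(e->url, url);
    memcpy(e->item, item, size);    /* byte copy: safe for binary data */
    e->size = size;
    e->next = c->head;
    c->head = e;
    return 0;
}
```

Storing the size alongside the item matters because a cached image cannot be measured with strlen when it is later sent to a client.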

Part III: Dealing with multiple concurrent requests

Once you have a working sequential caching proxy, you should alter it to simultaneously handle multiple requests. The simplest way to implement a concurrent server is to spawn a new thread to handle each new connection request, using the technique discussed in class. Other designs are also possible, such as the prethreaded server described in Section 12.5.5 of your textbook.

Note that threads should run in detached mode to avoid memory leaks.
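One common thread-per-connection pattern with detached threads is sketched below; it may differ in detail from the version shown in class. The names connection_thread and spawn_connection_thread are hypothetical, and a counter stands in for the call to handle_request so that the sketch is self-contained.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Counts finished handlers so a caller can observe progress. In a real
 * proxy, the thread would call handle_request(connfd) instead. */
static atomic_int handled = 0;

/* Thread routine: frees the heap-allocated argument, then detaches
 * itself so its resources are reclaimed automatically when it exits. */
static void *connection_thread(void *arg)
{
    int connfd = *(int *)arg;
    free(arg);
    pthread_detach(pthread_self());
    (void)connfd;                   /* placeholder for handle_request(connfd) */
    atomic_fetch_add(&handled, 1);
    return NULL;
}

/* Hypothetical helper: starts one thread for an accepted connection.
 * The descriptor is copied to the heap so each thread gets its own
 * copy rather than a pointer to a variable the accept loop reuses.
 * Returns 0 on success, -1 on failure. */
int spawn_connection_thread(int connfd)
{
    pthread_t tid;
    int *arg = malloc(sizeof *arg);
    if (arg == NULL)
        return -1;
    *arg = connfd;
    if (pthread_create(&tid, NULL, connection_thread, arg) != 0) {
        free(arg);
        return -1;
    }
    return 0;
}
```

Because the threads are detached, the main loop never calls pthread_join on them; it simply goes back to accepting the next connection.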

Part IV: Breaking things

The provided autograder is very simple. It checks that the proxy can successfully serve several different files from the CSPP Tiny web server (Basic), that the proxy will serve a previously requested file even when the server it came from is no longer available (Cache), and that the proxy will service a second client even when the first hasn’t finished (Concurrency). These verify only fairly basic aspects of the functionality of each part. For example, the autograder does not test whether the proxy gracefully handles an invalid or malformed HTTP request. It’s important that a good, robust proxy not crash in this case, but instead send an appropriate response to the client and stop processing the request.
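As a rough sketch of the robustness described above, the code below validates a request line and formats a minimal error response instead of crashing. The helper names are hypothetical, and the choice of status codes (400 for a malformed line, 501 for a method other than GET) is an assumption; the lab only asks for an appropriate response.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: checks that line looks like
 * "<METHOD> <URL> HTTP/<version>" and that the method is GET.
 * Returns 0 if the request can be served, otherwise the HTTP status
 * code to report (400 or 501). */
int check_request_line(const char *line)
{
    char method[16], url[2048], version[16];

    /* Field widths keep sscanf from overflowing the buffers. */
    if (sscanf(line, "%15s %2047s %15s", method, url, version) != 3)
        return 400;                 /* one or more fields missing */
    if (strncmp(version, "HTTP/", 5) != 0)
        return 400;                 /* third field is not a version */
    if (strcmp(method, "GET") != 0)
        return 501;                 /* only GET is required here */
    return 0;
}

/* Formats a minimal error response with an empty body into out.
 * Any status other than 501 is reported with the 400 reason phrase.
 * Returns the response length, or -1 if it does not fit. */
int build_error_response(char *out, size_t outlen, int status)
{
    const char *reason = (status == 501) ? "Not Implemented" : "Bad Request";
    int n = snprintf(out, outlen,
                     "HTTP/1.0 %d %s\r\nContent-length: 0\r\n\r\n",
                     status, reason);
    return (n < 0 || (size_t)n >= outlen) ? -1 : n;
}
```

A proxy using helpers like these would send the formatted response to the client, close the connection, and return from the handler without contacting any server.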

For this part of the lab, you will devise additional test cases that go beyond the functionality checked by the autograder. Your goal will be to demonstrate test cases that expose flaws in a proxy server that would pass the provided autograder. A test case should consist of

This is by design a very open-ended task. Coming up with your own tests and perhaps even your own testing harness is a valuable skill in the real world, where exact operating conditions are rarely known and reference solutions are often unavailable. Test cases that are automated (like the autograder) will earn more credit. Step-by-step instructions should be sufficiently detailed that I can perform the test on a proxy and easily determine whether it passed or failed the test. Test cases should not rely on extra information being printed by the proxy.

Below are some ideas to get you started:

Testing

Autograder

Your handout materials include an autograder, called driver.sh, that will assign scores for Basic Correctness, Concurrency, and Cache. To run the autograder, run

    $ ./driver.sh

from the lab5-handout directory. You must run the driver on a Linux machine.

Tiny web server

Your handout directory contains the source code for the CSPP Tiny web server in the tiny subdirectory. While not as powerful as thttpd (a small, open-source web server written in C), the CSPP Tiny web server will be easy for you to modify as you see fit. It’s also a reasonable starting point for your proxy code. And it’s the server that the autograder uses to fetch pages.

A general pattern for testing might be to start the Tiny server in one terminal, start your proxy in another, and use curl (see below) to send a request to the Tiny server via your proxy. In VS Code, the little + in the upper right of the terminal section will open another terminal.

telnet

As described in your textbook (Section 11.5.3), you can use telnet to open a connection to your proxy and send it HTTP requests.

curl

You can use curl to generate HTTP requests to any server, including your own proxy. It is an extremely useful debugging tool. For example, if your proxy and Tiny are both running on the local machine, Tiny is listening on port 15213, and your proxy is listening on port 15214, then you can request a page from Tiny via your proxy using the following curl command:

linux> curl -v --proxy http://localhost:15214 http://localhost:15213/home.html
* About to connect() to proxy localhost port 15214 (#0)
*   Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 15214 (#0)
> GET http://localhost:15213/home.html HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu)...
> Host: localhost:15213
> Accept: */*
> Proxy-Connection: Keep-Alive
> 
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Server: Tiny Web Server
< Content-length: 120
< Content-type: text/html
< 
<html>
<head><title>test</title></head>
<body> 
<img align="middle" src="godzilla.gif">
Dave O'Hallaron
</body>
</html>
* Closing connection #0

netcat

netcat, also known as nc, is a versatile network utility. You can use netcat just like telnet, to open connections to servers. In addition to being able to connect to Web servers, netcat can also operate as a server itself. With the following command, you can run netcat as a server listening on port 12345:

    $ nc -l 12345

Once you have set up a netcat server, you can generate a request to a phony object on it through your proxy, and you will be able to inspect the exact request that your proxy sent to netcat.

Grading

This assignment will be graded out of a total of 100 points:


  1. This lab is adapted from the Proxy Lab developed for Computer Systems: A Programmer’s Perspective by Randal E. Bryant and David R. O’Hallaron, Carnegie Mellon University.