Lab 5: Concurrent Caching Web Proxy¹

Assigned: Friday, May 20
Check-in Post: Before 9pm Friday, May 27 (check-in forum)
Due: 9pm Wednesday, June 1 (No late days can be used on this lab)
Collaboration: This assignment can be done with a partner or individually (I will assume same as lab 4 unless you tell me otherwise)
Handout: lab5-handout.tar
Submit: Upload your proxy.c file and any files related to your test cases to Lab 5 on Moodle
References:
- System-Level I/O
- Network Programming
- Concurrent Programming
- CSPP 10.5 (RIO functions for network I/O)
- CSPP 11.6 (Tiny web server)
- 208 Style Guide

Introduction

A Web proxy is a program that acts as a middleman between a Web browser and an end server. Instead of contacting the end server directly to get a Web page, the browser contacts the proxy, which forwards the request on to the end server. When the end server replies to the proxy, the proxy sends the reply on to the browser.

Proxies are useful for many purposes. Sometimes proxies are used in firewalls, so that browsers behind a firewall can only contact a server beyond the firewall via the proxy. Proxies can also act as anonymizers: by stripping requests of all identifying information, a proxy can make the browser anonymous to Web servers. A proxy can even cache web objects, storing local copies of objects from servers and then answering future requests from its cache rather than communicating again with the remote servers.

In this lab, you will write a simple HTTP proxy that caches web objects. For the first part of the lab, you will set up the proxy to accept incoming connections, read and parse requests, forward requests to web servers, read the servers’ responses, and forward those responses to the corresponding clients. This first part will involve learning about basic HTTP operation and how to use sockets to write programs that communicate over network connections. In the second part, you will add caching to your proxy using a simple main memory cache of recently accessed web content. In the third part, you will upgrade your proxy to deal with multiple concurrent connections. This will introduce you to dealing with concurrency, a crucial systems concept. Finally, you will implement additional test cases to test aspects of your proxy not covered by the autograder.

Logistics

You can download the files you need for this lab with

    wget http://cs.carleton.edu/faculty/awb/cs208/s22/handouts/lab5-handout.tar

Then run the command

    tar xvf lab5-handout.tar

This will generate a handout directory called lab5-handout. The README file describes the various files.

proxy.c contains starter code and comments for Parts I-III described below. This starter code is simply guidance and not a requirement—you are free to modify and/or ignore it in any way you wish.

Part I: Implementing a sequential web proxy

The first step is implementing a basic sequential proxy that handles HTTP/1.0 GET requests. Other request types, such as POST, are strictly optional. Implement the handle_request function in the provided starter code (and any helper functions you decide to add) to complete this part.

When started, your proxy should listen for incoming connections on a port whose number will be specified on the command line. Once a connection is established, your proxy should read the entirety of the request from the client and parse the request. It should determine whether the client has sent a valid HTTP request; if so, it can then establish its own connection to the appropriate web server and request the object the client specified. Finally, your proxy should read the server’s response and forward it to the client.

The starter code in main takes care of listening on the port passed in on the command line. When it receives a connection, it creates a new socket and passes the associated file descriptor to handle_request. Your code in handle_request should use the RIO functions to read the request from the client. For example, the Tiny server’s doit function uses the code below to initialize buffered reading on the socket file descriptor fd and read the first line from the client:

    rio_t rio;

    /* Read request line and headers */
    Rio_readinitb(&rio, fd);
    if (!Rio_readlineb(&rio, buf, MAXLINE))
        return;

Note that RIO has functions for buffered reading (Rio_readlineb, Rio_readnb) and functions for unbuffered reading and writing (Rio_readn, Rio_writen). You should not interleave buffered and unbuffered reads on the same file descriptor: the buffered functions read ahead into an internal buffer, so a later unbuffered read can miss data that has already been consumed from the socket.

Once you have read and parsed the first line of the request from the client, you will need to open a socket to the server. Use the provided Open_clientfd function for this. It takes two string arguments, the hostname and the port, and returns a file descriptor.

HTTP/1.0 GET requests

When an end user enters a URL such as http://www.carleton.edu/index.html into the address bar of a web browser, the browser will send an HTTP request to the proxy that begins with a line that might resemble the following:

    GET http://www.carleton.edu/index.html HTTP/1.1

In that case, the proxy should parse the request into the following fields: the hostname, www.carleton.edu; and the path or query and everything following it, /index.html. That way, the proxy can determine that it should open a connection to www.carleton.edu and send an HTTP request of its own starting with a line of the following form:

    GET /index.html HTTP/1.0

Note that all lines in an HTTP request end with a carriage return, '\r', followed by a newline, '\n'. Also important is that every HTTP request is terminated by an empty line: "\r\n".
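The line-ending rules above can be sketched as follows. This is only an illustration: the helper name build_request and its parameters are hypothetical, and it produces just the request line plus the terminating empty line, without any forwarded headers.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes "GET <path> HTTP/1.0\r\n\r\n" into out.
 * Returns the length of the request, or -1 if out is too small. */
int build_request(char *out, size_t outsize, const char *path)
{
    /* The first "\r\n" ends the request line; the second is the
     * empty line that terminates the whole request. */
    int n = snprintf(out, outsize, "GET %s HTTP/1.0\r\n\r\n", path);
    if (n < 0 || (size_t)n >= outsize)
        return -1;
    return n;
}
```

snprintf is used here instead of sprintf so that an unexpectedly long path cannot overflow the output buffer.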

You should notice in the above example that the web browser’s request line ends with HTTP/1.1, while the proxy’s request line ends with HTTP/1.0. Modern web browsers will generate HTTP/1.1 (or HTTP/2 or even HTTP/3) requests, but your proxy should handle them and forward them as HTTP/1.0 requests.

If a browser sends any request headers as part of an HTTP request, your proxy should forward them unchanged.

Remember that not all content on the web is ASCII text. Much of the content on the web is binary data, such as images and video. Ensure that you account for binary data when selecting and using functions for network I/O.
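One way to stay safe with binary data is to track lengths explicitly and never rely on string functions such as strlen. The sketch below assumes plain read and write rather than the RIO wrappers, and the helper name copy_bytes is hypothetical.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: copies bytes from src_fd to dst_fd until EOF.
 * Lengths come from read(), never from strlen(), so NUL bytes in
 * binary content are preserved. Returns total bytes copied, or -1. */
ssize_t copy_bytes(int src_fd, int dst_fd)
{
    char buf[4096];
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(src_fd, buf, sizeof buf);
        if (n == 0)
            return total;               /* EOF */
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted: retry the read */
            return -1;
        }
        /* write() may accept fewer bytes than requested; loop until done. */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(dst_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;           /* interrupted: retry the write */
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

The same idea applies when using the RIO functions: use the byte counts they return rather than treating the buffer as a C string.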

Port numbers

There are two significant classes of port numbers for this lab: HTTP request ports and your proxy’s listening port.

The HTTP request port is an optional field in the URL of an HTTP request. That is, the URL may be of the form, http://www.carleton.edu:8080/index.html, in which case your proxy should connect to the host www.carleton.edu on port 8080 instead of the default HTTP port, which is port 80.
The provided autograder always supplies a port number with its requests.
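A minimal sketch of handling the optional port, assuming the "host[:port]" part of the URL has already been isolated. The helper name split_host_port and its buffer-size parameters are hypothetical.

```c
#include <string.h>

/* Hypothetical helper: splits "host[:port]" into host and port strings.
 * The port defaults to "80" when none is given.
 * Returns 0 on success, -1 if either output buffer is too small. */
int split_host_port(const char *hostport,
                    char *host, size_t hostsize,
                    char *port, size_t portsize)
{
    const char *colon = strchr(hostport, ':');
    size_t hostlen = colon ? (size_t)(colon - hostport) : strlen(hostport);
    const char *portstr = (colon && colon[1] != '\0') ? colon + 1 : "80";

    if (hostlen >= hostsize || strlen(portstr) >= portsize)
        return -1;

    memcpy(host, hostport, hostlen);
    host[hostlen] = '\0';
    strcpy(port, portstr);              /* length checked above */
    return 0;
}
```

Keeping the port as a string is convenient because Open_clientfd takes the hostname and the port as two strings.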

The listening port is the port on which your proxy should listen for incoming connections. Your proxy should accept a command line argument specifying the listening port number for your proxy. For example, with the following command, your proxy should listen for connections on port 15213:

    $ ./proxy 15213

You may select any non-privileged listening port (greater than 1,024 and less than 65,536) as long as it is not used by other processes. Since each proxy must use a unique listening port and many people may be simultaneously working on mantis, the script port-for-user.pl is provided to help you pick your own personal port number. Use it to generate a port number based on your user ID:

    $ ./port-for-user.pl awb
    awb: 16362

The port, \(p\), returned by port-for-user.pl is always an even number. So if you need an additional port number, say for the Tiny server, you can safely use ports \(p\) and \(p+1\).

Please don’t pick your own random port. If you do, you run the risk of interfering with another user.

String parsing

This part is essentially a string parsing problem—given a URL, extract the hostname, the path, and (potentially) a port. To help you devise your own approach to this, here are some of the tools available in C for string parsing:

sscanf is a good tool for extracting parts of a string when its structure is known. For example, line 60 of tiny/tiny.c uses sscanf to separate the request line consisting of the request method, the URL, and the HTTP version into separate variables:
```
sscanf(buf, "%s %s %s", method, url, version); 
```
This line will take the string in buf and copy the characters before the first space to method, copy the characters after the first space and before the second space to url, and copy the next run of non-whitespace characters to version (so the trailing \r\n is not included). So GET www.carleton.edu/index.html HTTP/1.1 would be split into GET, www.carleton.edu/index.html, and HTTP/1.1. sscanf is also helpful in dealing with a known prefix, such as http://. For example
```
sscanf(url, "http://%s", url_trim);
```
will extract the part of url after the http:// and copy it to url_trim. Note that nothing will be copied to url_trim if url does not start with exactly http://. Recall that sscanf returns the number of items that were successfully matched.
char *strchr(char *cs, char c) returns a pointer to the first occurrence of character c in cs, or NULL if c is not present. For example, if url is "www.carleton.edu/index.html",
```
char *temp = strchr(url, '/');
strncpy(filename, temp, 100);
```
would set temp to point to the '/' character within url, and then copy from that point in url to the end (up to 100 characters) to filename (after which filename would contain "/index.html"). If you then did
```
*temp = '\0';
```
to insert a null terminator, url would become "www.carleton.edu".
char *strstr(char *cs, char *ct) returns a pointer to the first occurrence of string ct in cs, or NULL if ct is not present.
char *strtok(char *s, char *ct) searches s for tokens delimited by characters from ct. A sequence of calls of strtok(s, ct) splits s into tokens, each delimited by a character from ct. The first call in a sequence has a non-NULL s. It finds the first token in s consisting of characters not in ct; it terminates that by overwriting the next character of s with '\0' and returns a pointer to the token. Each subsequent call, indicated by a NULL value of s, returns the next such token, searching from just past the end of the previous one. strtok returns NULL when no further token is found. The string ct may be different on each call.
Not parsing per se, but sprintf can be a nice way to assemble a string from multiple pieces of data in C. It works exactly like printf, except the result is copied to another string rather than printed. See the documentation here.
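The tools above can be combined in many ways. The following is one possible sketch, not the required approach: the helper name parse_request_line and the FIELD_MAX buffer size are assumptions made for this example. It uses sscanf return values to validate the request line and the http:// prefix, then strchr to separate the host part from the path.

```c
#include <stdio.h>
#include <string.h>

#define FIELD_MAX 256   /* assumed buffer size for this sketch */

/* Hypothetical helper: extracts "host[:port]" and the path from a request
 * line such as "GET http://www.carleton.edu/index.html HTTP/1.1".
 * hostport and path must each have room for FIELD_MAX bytes.
 * Returns 0 on success, -1 if the line is not a GET for an http:// URL. */
int parse_request_line(const char *line, char *hostport, char *path)
{
    char method[16], url[FIELD_MAX], version[16], rest[FIELD_MAX];

    /* Field widths keep sscanf from overflowing the buffers. */
    if (sscanf(line, "%15s %255s %15s", method, url, version) != 3)
        return -1;
    if (strcmp(method, "GET") != 0)
        return -1;

    /* Matches only when url starts with exactly "http://". */
    if (sscanf(url, "http://%255s", rest) != 1)
        return -1;

    char *slash = strchr(rest, '/');
    if (slash == NULL) {
        strcpy(path, "/");          /* no path given: request the root */
    } else {
        strcpy(path, slash);        /* the path and everything after it */
        *slash = '\0';              /* terminate the host[:port] part */
    }
    if (rest[0] == '\0')
        return -1;                  /* empty hostname */
    strcpy(hostport, rest);
    return 0;
}
```

Returning -1 for anything unexpected gives the caller a single place to decide how to respond to an invalid request.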

Part II: Caching web objects

For the next part of the lab, you will add a cache to your proxy that stores recently-used web objects in memory. Here a web object just means a file sent by a web server. HTTP actually defines a fairly complex model by which web servers can give instructions as to how the objects they serve should be cached and clients can specify how caches should be used on their behalf. However, your proxy will adopt a simplified approach.

When your proxy receives a web object from a server, it should cache it in memory as it transmits the object to the client. If another client requests the same object from the same server, your proxy need not reconnect to the server; it can simply resend the cached object.

The starter code in proxy.c provides struct definitions and a set of cache functions you can use to get started. Under that design, when the proxy receives a request, it should look up the URL in the cache. If an entry is found, the proxy should send the associated item to the client. If an entry isn’t found, when the proxy receives a response from the server, it would buffer that response in memory, and then insert an entry into the cache with the request’s URL as the url and the contents of the buffer as the item.
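To make the lookup-then-insert flow concrete, here is a minimal sketch of a linked-list cache. The struct layout and the function names cache_lookup and cache_insert are hypothetical and will likely differ from the definitions in the starter code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical cache entry; the starter code's structs may differ. */
struct cache_entry {
    char *url;                  /* key: the request URL */
    char *item;                 /* value: the buffered response bytes */
    size_t size;                /* number of bytes in item */
    struct cache_entry *next;
};

struct cache {
    struct cache_entry *head;   /* most recently inserted entry first */
};

/* Returns the entry for url, or NULL on a miss. */
struct cache_entry *cache_lookup(struct cache *c, const char *url)
{
    for (struct cache_entry *e = c->head; e != NULL; e = e->next)
        if (strcmp(e->url, url) == 0)
            return e;
    return NULL;
}

/* Copies url and item into a new entry at the front of the list.
 * Returns 0 on success, -1 on allocation failure. */
int cache_insert(struct cache *c, const char *url,
                 const char *item, size_t size)
{
    struct cache_entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->url = malloc(strlen(url) + 1);
    e->item = malloc(size ? size : 1);
    if (e->url == NULL || e->item == NULL) {
        free(e->url);
        free(e->item);
        free(e);
        return -1;
    }
    strcpy(e->url, url);
    memcpy(e->item, item, size);    /* memcpy: the item may be binary */
    e->size = size;
    e->next = c->head;
    c->head = e;
    return 0;
}
```

This sketch has no locking, no size limits, and never frees entries, so it is a starting point only.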

Part III: Dealing with multiple concurrent requests

Once you have a working sequential caching proxy, you should alter it to simultaneously handle multiple requests. The simplest way to implement a concurrent server is to spawn a new thread to handle each new connection request, using the technique discussed in class. Other designs are also possible, such as the prethreaded server described in Section 12.5.5 of your textbook.

Note that threads should run in detached mode to avoid memory leaks.
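A minimal sketch of the thread-per-connection pattern with detached threads, written against plain Pthreads calls. The names spawn_handler, conn_handler, struct job, and send_greeting are hypothetical; send_greeting merely stands in for real request handling.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

typedef void (*conn_handler)(int connfd);   /* hypothetical handler type */

struct job {
    int connfd;
    conn_handler handle;
};

static void *thread_main(void *vargp)
{
    struct job j = *(struct job *)vargp;
    free(vargp);                        /* heap copy is no longer needed */
    pthread_detach(pthread_self());     /* resources reclaimed at exit */
    j.handle(j.connfd);
    close(j.connfd);
    return NULL;
}

/* Starts a detached thread that runs handle(connfd) and then closes
 * connfd. Returns 0 on success, -1 on failure. */
int spawn_handler(int connfd, conn_handler handle)
{
    struct job *j = malloc(sizeof *j);
    if (j == NULL)
        return -1;
    j->connfd = connfd;
    j->handle = handle;

    pthread_t tid;
    if (pthread_create(&tid, NULL, thread_main, j) != 0) {
        free(j);
        return -1;
    }
    return 0;
}

/* Stand-in for real request handling: sends a fixed greeting. */
void send_greeting(int connfd)
{
    const char *msg = "hello";
    if (write(connfd, msg, strlen(msg)) < 0)
        return;                         /* peer went away: nothing to do */
}
```

Each thread gets its own heap-allocated copy of the descriptor so that the accept loop cannot overwrite it before the new thread has read it.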

Part IV: Breaking things

The provided autograder is very simple. It checks that the proxy can successfully serve several different files from the CSPP Tiny web server (Basic), that the proxy will serve a previously requested file even when the server it came from is no longer available (Cache), and that the proxy will service a second client even when the first hasn’t finished (Concurrency). These verify only fairly basic aspects of the functionality of each part. For example, the autograder does not test whether the proxy gracefully handles an invalid or malformed HTTP request. It’s important that a good, robust proxy not crash in this case, but instead send an appropriate response to the client and stop processing the request.

For this part of the lab, you will devise additional test cases that go beyond the functionality checked by the autograder. Your goal will be to demonstrate test cases that expose flaws in a proxy server that would pass the provided autograder. A test case should consist of

- A piece of code to run, or a description of steps to follow, that test some functionality the autograder does not. This can be a shell script like the autograder (driver.sh), a C program, a Python program, a program in another language, or step-by-step instructions.
- A description of the problem your test case reveals. This can be in a separate file, or in comments.
- (For extra credit) A proxy server implementation that passes your test.

This is by design a very open-ended task. Coming up with your own tests and perhaps even your own testing harness is a valuable skill in the real world, where exact operating conditions are rarely known and reference solutions are often unavailable. Test cases that are automated (like the autograder) will earn more credit. Step-by-step instructions should be sufficiently detailed that I can perform the test on a proxy and easily determine whether it passed or failed the test. Test cases should not rely on extra information being printed by the proxy.

Below are some ideas to get you started:

- Valid HTTP requests not covered by the autograder (e.g., a request without a port in the URL).
- Invalid or malformed HTTP requests (i.e., a request that does not conform to the expected format due to missing or incorrect parts).
- As discussed in the Aside on page 964 of the CSPP text, a proxy should ignore SIGPIPE signals and should deal gracefully with write operations that return EPIPE errors.
- Sometimes, calling read to receive bytes from a socket that has been prematurely closed will cause read to return -1 with errno set to ECONNRESET. A proxy should not terminate due to this error.
- A proxy should have a thread-safe cache, ensuring that cache access is free of race conditions. Be aware you may have to send many simultaneous requests to trigger a concurrency bug in your proxy’s cache. One solution would be to protect accesses to the cache with one large exclusive lock. A more efficient solution would allow multiple threads to simultaneously read from the cache, while still allowing only one thread to write to the cache at a time. You may want to explore options such as partitioning the cache, using Pthreads readers-writers locks, or using semaphores to implement your own readers-writers solution.
- Obviously, if your proxy were to cache every object that is ever requested, it would require an unlimited amount of memory. Moreover, because some web objects are larger than others, it might be the case that one giant object will consume the entire cache, preventing other objects from being cached at all. To avoid those problems, your proxy should have both a maximum cache size and a maximum cache object size. When designing a test case for this issue, you can stipulate the proxy should enforce specific values for these limits, and design a test to verify that it does. This could include an eviction policy that approximates a least-recently-used (LRU) eviction policy. It doesn’t have to be strictly LRU, but it should be something reasonably close. Note that both reading an object and writing would count as using the object.
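As an illustration of the SIGPIPE and EPIPE idea above, the sketch below ignores SIGPIPE and reports a failed write to the caller instead of terminating. The helper names ignore_sigpipe and write_all are hypothetical, and the sketch covers only the write side, not the ECONNRESET case for read.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Call once at startup so that writing to a closed connection makes
 * write() return -1 with errno set to EPIPE instead of killing the
 * process. */
void ignore_sigpipe(void)
{
    signal(SIGPIPE, SIG_IGN);
}

/* Hypothetical helper: writes all n bytes of buf to fd.
 * Returns 0 on success, or -1 on error (for example EPIPE when the
 * peer has closed the connection), leaving errno set so the caller
 * can drop this one connection and keep serving others. */
int write_all(int fd, const char *buf, size_t n)
{
    while (n > 0) {
        ssize_t w = write(fd, buf, n);
        if (w < 0) {
            if (errno == EINTR)
                continue;           /* interrupted: retry the write */
            return -1;
        }
        buf += w;
        n -= (size_t)w;
    }
    return 0;
}
```

A test case for this behavior could close the client side of a connection early and then check that the proxy still answers a later request.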

Testing

Autograder

Your handout materials include an autograder, called driver.sh, that will assign scores for Basic Correctness, Concurrency, and Cache. To run the autograder, run

    $ ./driver.sh

from the lab5-handout directory. You must run the driver on a Linux machine.

Tiny web server

Your handout directory contains the source code for the CSPP Tiny web server in the tiny subdirectory. While not as powerful as thttpd (a small, open-source web server written in C), the CSPP Tiny web server will be easy for you to modify as you see fit. It’s also a reasonable starting point for your proxy code. And it’s the server that the autograder uses to fetch pages.

A general pattern for testing might be to start the Tiny server in one terminal, start your proxy in another, and use curl (see below) to send a request to the Tiny server via your proxy. In VS Code, the little + in the upper right of the terminal section will open another terminal.

`telnet`

As described in your textbook (11.5.3), you can use telnet to open a connection to your proxy and send it HTTP requests.

`curl`

You can use curl to generate HTTP requests to any server, including your own proxy. It is an extremely useful debugging tool. For example, if your proxy and Tiny are both running on the local machine, Tiny is listening on port 15213, and proxy is listening on port 15214, then you can request a page from Tiny via your proxy using the following curl command:

    linux> curl -v --proxy http://localhost:15214 http://localhost:15213/home.html
    * About to connect() to proxy localhost port 15214 (#0)
    *   Trying 127.0.0.1... connected
    * Connected to localhost (127.0.0.1) port 15214 (#0)
    > GET http://localhost:15213/home.html HTTP/1.1
    > User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu)...
    > Host: localhost:15213
    > Accept: */*
    > Proxy-Connection: Keep-Alive
    > 
    * HTTP 1.0, assume close after body
    < HTTP/1.0 200 OK
    < Server: Tiny Web Server
    < Content-length: 120
    < Content-type: text/html
    < 
    <html>
    <head><title>test</title></head>
    <body> 
    <img align="middle" src="godzilla.gif">
    Dave O'Hallaron
    </body>
    </html>
    * Closing connection #0

`netcat`

netcat, also known as nc, is a versatile network utility. You can use netcat just like telnet, to open connections to servers. In addition to being able to connect to Web servers, netcat can also operate as a server itself. With the following command, you can run netcat as a server listening on port 12345:

    $ nc -l 12345

Once you have set up a netcat server, you can generate a request to a phony object on it through your proxy, and you will be able to inspect the exact request that your proxy sent to netcat.

Grading

This assignment will be graded out of a total of 100 points:

- Basic Correctness: 40 points for basic proxy operation (autograded).
- Concurrency: 15 points for handling concurrent requests (autograded).
- Cache: 15 points for a working cache (autograded).
- Test Cases: up to 8 points for each test case, depending on how automated it is (see the breakdown below). This does not include potential extra credit for an implementation that passes the test. You can submit up to five test cases and score over 100 points, though each test case will need to target a different problem/feature.
  - A test case that is performed via manual step-by-step instructions: 2 points
  - A test case that is executed automatically (e.g., via a script), but whose correctness must be checked manually (e.g., looking for particular output): 5 points
  - A fully automatic test case (e.g., a script both performs the test and determines whether it passed or failed): 8 points
- Style: 10 points for style. Style guidelines can be found in the 208 Style Guide. These style points will include whether your proxy implementation correctly frees all allocated memory.
- Check-in Post: 5 points.

¹ This lab is adapted from the Proxy Lab developed for Computer Systems: A Programmer’s Perspective by Randal E. Bryant and David R. O’Hallaron, Carnegie Mellon University, available here.
