CS 332 w22 — Files and Directories

Table of Contents

1 In the Beginning

When a computer system boots, it sets the machine's program counter to start executing at a pre-determined position in memory. Since the computer is not yet running, the initial machine instructions must be fetched and executed immediately after the power is turned on before the system has had a chance to initialize its DRAM. Instead, systems typically use a special read-only hardware memory (Boot ROM) to store their boot instructions. On most x86 personal computers, the boot program is called the BIOS for Basic Input/Output System.

Why only put the BIOS in this special memory and not parts of the actual operating system? The most significant reason is that the operating system would be hard to update. ROM instructions are fixed when the computer is manufactured and are typically never changed. If an error occurs while the BIOS is being updated, the machine can be left in a permanently unusable state—unable to boot and unable to complete the update of the BIOS.

Instead, the BIOS provides a level of indirection, as shown here:

bios.png

The BIOS reads a fixed-size block of bytes from a fixed position on disk into memory. This block of bytes is called the bootloader. Once the BIOS has copied the bootloader into memory, it jumps to the first instruction in the block. On some newer machines, the BIOS checks that the bootloader has not been corrupted by a computer virus. (If a virus could change the bootloader and get the BIOS to jump to it, the virus would then be in control of the machine.) As a check, the bootloader is stored with a cryptographic signature to verify that an authorized entity produced the file.

The bootloader in turn loads the operating system kernel into memory and jumps to it. The kernel is a part of the operating system that is always present in memory while the OS is running. It is completely trusted code, and has full access to all parts of the machine. When a user program needs to access some privileged functionality that only the kernel has access to, it does so using a special kind of procedure called a system call. The kernel's executable image is usually stored in the file system. Thus, to find the bootloader, the BIOS need to read a block of raw bytes from disk; the bootloader, in turn, needs to know how to read from the file system to find and read the operating system image.

This bring us to the focus for today and Friday: the file system and how the operating system bridges the gap between the disk device interface and the flexible reading, writing, etc. that users need.

2 Reading: Background on Storage Technology

Read the first five pages of the introductory reading on file systems. The essential piece is the section on Nonvolatile storage and file systems on the second page. It's important context to understand the limitations of the interface to storage devices, as this impacts the design of file systems. Key questions to focus on as you read:

  • What makes disk storage "persistent"?
  • What additional constraints apply to disk access as compared to memory access?

3 Reading: The File System Abstraction and Interface

Read the remainder of introductory reading on file systems. You can skim pages 6–10 if you're already familiar with the structure of the Unix-type file systems. Basic file system terminology from this reading:

  • file
  • directory
  • home directory
  • root directory
  • absolute path
  • relative path

Section 11.3.2 (page 18) defines the key concept of a block device.

4 The File System API in Action

To provide some context for the file system API described in the reading, I go through some examples of how the Unix file system calls are used below.

4.1 Creating a File

We’ll start with the most basic of operations: creating a file. This can be accomplished with the open system call; by calling open() and passing it the O_CREAT flag, a program can create a new file. Here is some example code to create a file called "foo" in the current working directory:

int fd = open("foo", O_CREAT|O_WRONLY|O_TRUNC, S_IRUSR|S_IWUSR);

As an aside, you may wonder why it's O_CREAT instead of O_CREATE. When Ken Thompson was asked what he would do differently if he were redesigning UNIX, he replied: "I'd spell creat with an e."

The routine open() takes a number of different flags. In this example, the second parameter creates the file (O_CREAT) if it does not exist, ensures that the file can only be written to (O_WRONLY), and, if the file already exists, truncates it to a size of zero bytes thus removing any existing content (O_TRUNC). The third parameter specifies permissions, in this case making the file readable and writable by the owner. One important aspect of open() is what it returns: a file descriptor. A file descriptor is just an integer, private per running program, and is used in UNIX systems to access files; thus, once a file is opened, you use the file descriptor to read or write the file, assuming you have permission to do so. Another way to think of a file descriptor is as a pointer to an object of type file; once you have such an object, you can call other "methods" to access the file, like read() and write() (we’ll see how to do so below). Running programs can even share a file by using the same description as shown here (we'll go over what an inode is on Friday):

file-descriptor.png

4.2 Reading a File

Once we have some files, of course we might like to read or write them. Let’s start by reading an existing file. If we were typing at a command line, we might just use the program cat to dump the contents of the file to the screen.

$ echo hello > foo
$ cat foo
hello
$

In this code snippet, we redirect the output of the program echo to the file foo, which then contains the word "hello" in it. We then use cat to see the contents of the file. But how does the cat program access the file foo?

To find this out, we’ll use an incredibly useful tool to trace the system calls made by a program. On Linux, the tool is called strace; other systems have similar tools (see dtruss on a Mac, or truss on some older UNIX variants). What strace does is trace every system call made by a program while it runs, and dump the trace to the screen for you to see. Here is an example of using strace to figure out what cat is doing (some calls removed for readability):

$ strace cat foo
...
open("foo", O_RDONLY) = 3
read(3, "hello\n", 4096) = 6
write(1, "hello\n", 6) = 6
hello
read(3, "", 4096) = 0
close(3) = 0
...
$

The first thing that cat does is open the file for reading. A couple of things we should note about this; first, that the file is only opened for reading (not writing), as indicated by the O_RDONLY flag; second, that the call to open() succeeds and returns a file descriptor, which has the value of 3.

Why does the first call to open() return 3, not 0 or perhaps 1 as you might expect? As it turns out, each running process already has three files open, standard input (which the process can read to receive input), standard output (which the process can write to in order to dump information to the screen), and standard error (which the process can write error messages to). These are represented by file descriptors 0, 1, and 2, respectively. Thus, when you first open another file (as cat does above), it will almost certainly be file descriptor 3.

After the open succeeds, cat uses the read() system call to repeatedly read some bytes from a file. The first argument to read() is the file descriptor, thus telling the file system which file to read; a process can of course have multiple files open at once, and thus the descriptor enables the operating system to know which file a particular read refers to. The second argument points to a buffer where the result of the read() will be placed; in the system-call trace above, strace shows the results of the read in this spot ("hello"). The third argument is the size of the buffer, which in this case is 4 KB. The call to read() returns successfully as well, here returning the number of bytes it read (6, which includes 5 for the letters in the word "hello" and one for an end-of-line marker).

At this point, you see another interesting result of the strace: a single call to the write() system call, to the file descriptor 1. As we mentioned above, this descriptor is known as the standard output, and thus is used to write the word "hello" to the screen as the program cat is meant to do. But does it call write() directly? Maybe (if it is highly optimized). But if not, what cat might do is call the library routine printf(); internally, printf() figures out all the formatting details passed to it, and eventually writes to standard output to print the results to the screen. The cat program then tries to read more from the file, but since there are no bytes left in the file, the read() returns 0 and the program knows that this means it has read the entire file. Thus, the program calls close() to indicate that it is done with the file "foo", passing in the corresponding file descriptor. The file is thus closed, and the reading of it thus complete.

4.3 Writing Immediately with fsync()

Most times when a program calls write(), it is just telling the file system: please write this data to persistent storage, at some point in the future. The file system, for performance reasons, will buffer such writes in memory for some time (say 5 seconds, or 30); at that later point in time, the writes will actually be issued to the storage device. From the perspective of the calling application, writes seem to complete quickly, and only in rare cases (e.g., the machine crashes after the write() call but before the write to disk) will data be lost.

However, some applications require something more than this eventual guarantee. For example, in a database management system (DBMS), development of a correct recovery protocol requires the ability to force writes to disk from time to time.

To support these types of applications, most file systems provide some additional control APIs. In the UNIX world, the interface provided to applications is known as fsync(int fd). When a process calls fsync() for a particular file descriptor, the file system responds by forcing all dirty (i.e., not yet written) data to disk, for the file referred to by the specified file descriptor. The fsync() routine returns once all of these writes are complete.

Here is a simple example of how to use fsync(). The code opens the file foo, writes a single chunk of data to it, and then calls fsync() to ensure the writes are forced immediately to disk. Once the fsync() returns, the application can safely move on, knowing that the data has been persisted (if fsync() is correctly implemented, that is).

int fd = open("foo", O_CREAT|O_WRONLY|O_TRUNC, S_IRUSR|S_IWUSR);
assert(fd > -1);
int rc = write(fd, buffer, size);
assert(rc == size);
rc = fsync(fd);
assert(rc == 0);

Interestingly, this sequence does not guarantee everything that you might expect; in some cases, you also need to fsync() the directory that contains the file foo. Adding this step ensures not only that the file itself is on disk, but that the file, if newly created, also is durably a part of the directory. Not surprisingly, this type of detail is often overlooked, leading to many application-level bugs.

4.4 Renaming Files

Once we have a file, it is sometimes useful to be able to give a file a different name. When typing at the command line, this is accomplished with mv command; in this example, the file foo is renamed bar:

$ mv foo bar

Using strace, we can see that mv uses the system call rename(char* old, char * new), which takes precisely two arguments: the original name of the file (old) and the new name (new). One interesting guarantee provided by the rename() call is that it is (usually) implemented as an atomic call with respect to system crashes; if the system crashes during the renaming, the file will either be named the old name or the new name, and no odd in-between state can arise. Thus, rename() is critical for supporting certain kinds of applications that require an atomic update to file state.

Let’s be a little more specific here. Imagine that you are using a file editor (e.g., emacs), and you insert a line into the middle of a file. The file's name, for the example, is foo.txt. The way the editor might update the file to guarantee that the new file has the original contents plus the line inserted is as follows (ignoring error-checking for simplicity):

int fd = open("foo.txt.tmp", O_WRONLY|O_CREAT|O_TRUNC, S_IRUSR|S_IWUSR);
write(fd, buffer, size); // write out new version of file
fsync(fd);
close(fd);
rename("foo.txt.tmp", "foo.txt");

What the editor does in this example is simple: write out the new version of the file under a temporary name (foo.txt.tmp), force it to disk with fsync(), and then, when the application is certain the new file metadata and contents are on the disk, rename the temporary file to the original file's name. This last step atomically swaps the new file into place, while concurrently deleting the old version of the file, and thus an atomic file update is achieved.

4.5 Further Examples

I selected the examples above as the best subset of those presented in Chapter 39 of the OSTEP book. There are many more there if you are interested.