CS 332 w22 — Files and Directories
Table of Contents
1 In the Beginning
When a computer system boots, it sets the machine's program counter to start executing at a pre-determined position in memory. Since the computer is not yet running, the initial machine instructions must be fetched and executed immediately after the power is turned on before the system has had a chance to initialize its DRAM. Instead, systems typically use a special read-only hardware memory (Boot ROM) to store their boot instructions. On most x86 personal computers, the boot program is called the BIOS for Basic Input/Output System.
Why only put the BIOS in this special memory and not parts of the actual operating system? The most significant reason is that the operating system would be hard to update. ROM instructions are fixed when the computer is manufactured and are typically never changed. If an error occurs while the BIOS is being updated, the machine can be left in a permanently unusable state—unable to boot and unable to complete the update of the BIOS.
Instead, the BIOS provides a level of indirection, as shown here:
The BIOS reads a fixed-size block of bytes from a fixed position on disk into memory. This block of bytes is called the bootloader. Once the BIOS has copied the bootloader into memory, it jumps to the first instruction in the block. On some newer machines, the BIOS checks that the bootloader has not been corrupted by a computer virus. (If a virus could change the bootloader and get the BIOS to jump to it, the virus would then be in control of the machine.) As a check, the bootloader is stored with a cryptographic signature to verify that an authorized entity produced the file.
The bootloader in turn loads the operating system kernel into memory and jumps to it. The kernel is a part of the operating system that is always present in memory while the OS is running. It is completely trusted code, and has full access to all parts of the machine. When a user program needs to access some privileged functionality that only the kernel has access to, it does so using a special kind of procedure called a system call. The kernel's executable image is usually stored in the file system. Thus, to find the bootloader, the BIOS need to read a block of raw bytes from disk; the bootloader, in turn, needs to know how to read from the file system to find and read the operating system image.
This bring us to the focus for today and Friday: the file system and how the operating system bridges the gap between the disk device interface and the flexible reading, writing, etc. that users need.
2 Reading: Background on Storage Technology
Read the first five pages of the introductory reading on file systems. The essential piece is the section on Nonvolatile storage and file systems on the second page. It's important context to understand the limitations of the interface to storage devices, as this impacts the design of file systems. Key questions to focus on as you read:
- What makes disk storage "persistent"?
- What additional constraints apply to disk access as compared to memory access?
3 Reading: The File System Abstraction and Interface
Read the remainder of introductory reading on file systems. You can skim pages 6–10 if you're already familiar with the structure of the Unix-type file systems. Basic file system terminology from this reading:
- file
- directory
- home directory
- root directory
- absolute path
- relative path
Section 11.3.2 (page 18) defines the key concept of a block device.
4 The File System API in Action
To provide some context for the file system API described in the reading, I go through some examples of how the Unix file system calls are used below.
4.1 Creating a File
We’ll start with the most basic of operations: creating a file. This can be
accomplished with the open
system call; by calling open()
and passing
it the O_CREAT
flag, a program can create a new file. Here is some example
code to create a file called "foo" in the current working directory:
int fd = open("foo", O_CREAT|O_WRONLY|O_TRUNC, S_IRUSR|S_IWUSR);
As an aside, you may wonder why it's O_CREAT
instead of O_CREATE
. When Ken Thompson was asked what he would do differently if he were redesigning UNIX,
he replied: "I'd spell creat with an e."
The routine open()
takes a number of different flags. In this example, the second parameter creates the file (O_CREAT
) if it does not exist,
ensures that the file can only be written to (O_WRONLY
), and, if the file
already exists, truncates it to a size of zero bytes thus removing any existing content (O_TRUNC
).
The third parameter specifies permissions, in this
case making the file readable and writable by the owner.
One important aspect of open()
is what it returns: a file descriptor. A
file descriptor is just an integer, private per running program, and is used in UNIX
systems to access files; thus, once a file is opened, you use the file descriptor to read or write the file, assuming you have permission to do so.
Another way to think of a file descriptor is as a pointer to an object of type file; once you
have such an object, you can call other "methods" to access the file, like
read()
and write()
(we’ll see how to do so below). Running programs can even share a file by using the same description as shown here (we'll go over what an inode is on Friday):
4.2 Reading a File
Once we have some files, of course we might like to read or write them.
Let’s start by reading an existing file. If we were typing at a command
line, we might just use the program cat
to dump the contents of the file
to the screen.
$ echo hello > foo $ cat foo hello $
In this code snippet, we redirect the output of the program echo
to
the file foo
, which then contains the word "hello" in it. We then use cat
to see the contents of the file. But how does the cat
program access the
file foo
?
To find this out, we’ll use an incredibly useful tool to trace the system calls made by a program.
On Linux, the tool is called strace
; other
systems have similar tools (see dtruss
on a Mac, or truss
on some older
UNIX variants). What strace
does is trace every system call made by a
program while it runs, and dump the trace to the screen for you to see.
Here is an example of using strace
to figure out what cat
is doing
(some calls removed for readability):
$ strace cat foo ... open("foo", O_RDONLY) = 3 read(3, "hello\n", 4096) = 6 write(1, "hello\n", 6) = 6 hello read(3, "", 4096) = 0 close(3) = 0 ... $
The first thing that cat
does is open the file for reading. A couple
of things we should note about this; first, that the file is only opened for
reading (not writing), as indicated by the O_RDONLY
flag; second, that the call to open()
succeeds and returns a file descriptor, which has the value of 3.
Why does the first call to open()
return 3, not 0 or perhaps 1 as you
might expect? As it turns out, each running process already has three
files open, standard input (which the process can read to receive input),
standard output (which the process can write to in order to dump information to the screen), and standard error (which the process can write
error messages to). These are represented by file descriptors 0, 1, and 2,
respectively. Thus, when you first open another file (as cat does above),
it will almost certainly be file descriptor 3.
After the open succeeds, cat uses the read()
system call to repeatedly read some bytes from a file.
The first argument to read()
is the file
descriptor, thus telling the file system which file to read; a process can of
course have multiple files open at once, and thus the descriptor enables
the operating system to know which file a particular read refers to. The
second argument points to a buffer where the result of the read()
will be
placed; in the system-call trace above, strace shows the results of the read
in this spot ("hello"). The third argument is the size of the buffer, which
in this case is 4 KB. The call to read()
returns successfully as well, here
returning the number of bytes it read (6, which includes 5 for the letters
in the word "hello" and one for an end-of-line marker).
At this point, you see another interesting result of the strace: a single
call to the write()
system call, to the file descriptor 1. As we mentioned
above, this descriptor is known as the standard output, and thus is used
to write the word "hello" to the screen as the program cat
is meant to
do. But does it call write()
directly? Maybe (if it is highly optimized).
But if not, what cat might do is call the library routine printf()
;
internally, printf()
figures out all the formatting details passed to it, and
eventually writes to standard output to print the results to the screen.
The cat
program then tries to read more from the file, but since there
are no bytes left in the file, the read()
returns 0 and the program knows
that this means it has read the entire file. Thus, the program calls close()
to indicate that it is done with the file "foo", passing in the corresponding
file descriptor. The file is thus closed, and the reading of it thus complete.
4.3 Writing Immediately with fsync()
Most times when a program calls write()
, it is just telling the file
system: please write this data to persistent storage, at some point in the
future. The file system, for performance reasons, will buffer such writes
in memory for some time (say 5 seconds, or 30); at that later point in
time, the writes will actually be issued to the storage device. From the
perspective of the calling application, writes seem to complete quickly,
and only in rare cases (e.g., the machine crashes after the write()
call
but before the write to disk) will data be lost.
However, some applications require something more than this eventual guarantee. For example, in a database management system (DBMS), development of a correct recovery protocol requires the ability to force writes to disk from time to time.
To support these types of applications, most file systems provide some
additional control APIs. In the UNIX world, the interface provided to applications is known as fsync(int fd)
. When a process calls fsync()
for a particular file descriptor, the file system responds by forcing all dirty
(i.e., not yet written) data to disk, for the file referred to by the specified
file descriptor. The fsync()
routine returns once all of these writes are
complete.
Here is a simple example of how to use fsync()
. The code opens
the file foo
, writes a single chunk of data to it, and then calls fsync()
to ensure the writes are forced immediately to disk. Once the fsync()
returns, the application can safely move on, knowing that the data has
been persisted (if fsync()
is correctly implemented, that is).
int fd = open("foo", O_CREAT|O_WRONLY|O_TRUNC, S_IRUSR|S_IWUSR); assert(fd > -1); int rc = write(fd, buffer, size); assert(rc == size); rc = fsync(fd); assert(rc == 0);
Interestingly, this sequence does not guarantee everything that you
might expect; in some cases, you also need to fsync()
the directory that
contains the file foo
. Adding this step ensures not only that the file itself
is on disk, but that the file, if newly created, also is durably a part of the
directory. Not surprisingly, this type of detail is often overlooked, leading
to many application-level bugs.
4.4 Renaming Files
Once we have a file, it is sometimes useful to be able to give a file a
different name. When typing at the command line, this is accomplished
with mv
command; in this example, the file foo
is renamed bar
:
$ mv foo bar
Using strace
, we can see that mv
uses the system call
rename(char* old, char * new)
, which takes precisely two arguments: the original
name of the file (old
) and the new name (new
).
One interesting guarantee provided by the rename()
call is that it is
(usually) implemented as an atomic call with respect to system crashes;
if the system crashes during the renaming, the file will either be named
the old name or the new name, and no odd in-between state can arise.
Thus, rename()
is critical for supporting certain kinds of applications
that require an atomic update to file state.
Let’s be a little more specific here. Imagine that you are using a file editor (e.g., emacs),
and you insert a line into the middle of a file. The file's
name, for the example, is foo.txt
. The way the editor might update the
file to guarantee that the new file has the original contents plus the line
inserted is as follows (ignoring error-checking for simplicity):
int fd = open("foo.txt.tmp", O_WRONLY|O_CREAT|O_TRUNC, S_IRUSR|S_IWUSR); write(fd, buffer, size); // write out new version of file fsync(fd); close(fd); rename("foo.txt.tmp", "foo.txt");
What the editor does in this example is simple: write out the new
version of the file under a temporary name (foo.txt.tmp
), force it to
disk with fsync()
, and then, when the application is certain the new
file metadata and contents are on the disk, rename the temporary file to
the original file's name. This last step atomically swaps the new file into
place, while concurrently deleting the old version of the file, and thus an
atomic file update is achieved.
4.5 Further Examples
I selected the examples above as the best subset of those presented in Chapter 39 of the OSTEP book. There are many more there if you are interested.