CS 332 w22 — File System Reliability

1 Reading: Crash Consistency: FSCK and Journaling

Read OSTEP Chapter 42 for a nice walkthrough of two approaches to recovering from a mid-file-system-operation crash.

2 The Problem

  • you're writing the file system
  • then the power fails
  • you reboot
  • is your file system still usable?
  • crash during multi-step operation
  • may leave FS in an inconsistent state
  • after reboot:
    • bad: crash again due to corrupt FS
    • worse: no crash, but reads/writes incorrect data

2.1 Example

This tiny example includes an inode bitmap (with just 8 bits, one per inode), a data bitmap (also 8 bits, one per data block), inodes (8 total, numbered 0 to 7, and spread across four blocks), and data blocks (8 total, numbered 0 to 7).

[Figure: fs-update-before.png]

Here's what's inside our inode:

owner : awb
permissions : read-write
size : 1
pointer : 4
pointer : null
pointer : null
pointer : null

What needs to be updated to append a block's worth of data to the existing file? [1]

[Figure: fs-update-after.png]

2.1.1 Crash Scenarios

  • Just data block is written
  • Just inode is written
  • Just bitmap is written
  • inode and data block are written, but not bitmap
  • inode and bitmap are written, but not data
  • data block and bitmap are written, but not inode
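
The six scenarios above can be enumerated mechanically. The following is a minimal sketch, assuming the append needs exactly the three writes listed in the footnote (data block, inode, data bitmap); the names WRITES, classify, and crash_scenarios are hypothetical and exist only for this illustration.

```python
from itertools import combinations

# The three disk writes needed by the append (names are illustrative).
WRITES = ("data block", "inode", "bitmap")

def classify(done):
    """Describe the on-disk state if only the writes in `done` reached disk."""
    inode = "inode" in done
    bitmap = "bitmap" in done
    data = "data block" in done
    if inode and not bitmap:
        return "inconsistent: inode points to a block the bitmap says is free"
    if bitmap and not inode:
        return "inconsistent: bitmap marks a block allocated that no inode uses"
    if inode and not data:
        return "metadata consistent, but the file's new block holds garbage"
    if data and not inode:
        return "consistent, but the appended data is lost"
    return "consistent"

def crash_scenarios():
    """Yield (writes that completed, outcome) for every partial set of writes."""
    for count in (1, 2):
        for done in combinations(WRITES, count):
            yield done, classify(set(done))

for done, outcome in crash_scenarios():
    print(" + ".join(done), "->", outcome)
```

Note that only the "just data block" case leaves the file system looking as if the append never happened; every other partial outcome is either inconsistent metadata or a file that silently contains garbage.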

3 Solution #1: The File System Checker

  • Unix tool fsck
  • On reboot, check every part of the file system for consistency
    • Before file system is mounted (ensures no other file system operations are taking place)
    • Perform repairs as needed
  • Problem
    • Very slow to check the entire disk!
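
To make "check for consistency" concrete, here is a rough sketch of one kind of check a file system checker might perform: cross-checking the data bitmap against the block pointers held by inodes. It is not how fsck is actually implemented (a real checker examines far more, such as superblocks, link counts, and directories). The check function, the in-memory layout, and the inode number 2 are all assumptions made for illustration.

```python
def check(data_bitmap, inodes):
    """Return a list of inconsistencies between the bitmap and inode pointers.

    data_bitmap: list of bools, one per data block (True = allocated).
    inodes: dict mapping inode number -> list of data block numbers it points to.
    """
    # Build the reverse map: which inodes point at each block?
    owners = {}
    for ino, pointers in inodes.items():
        for block in pointers:
            owners.setdefault(block, []).append(ino)

    problems = []
    for block, allocated in enumerate(data_bitmap):
        users = owners.get(block, [])
        if users and not allocated:
            problems.append(f"block {block} is used by inode {users[0]} but marked free")
        if allocated and not users:
            problems.append(f"block {block} is marked allocated but no inode uses it")
        if len(users) > 1:
            problems.append(f"block {block} is claimed by more than one inode: {users}")
    return problems

# State before the append: one file (hypothetical inode 2) using data block 4.
bitmap = [False] * 8
bitmap[4] = True
print(check(bitmap, {2: [4]}))
# Crash after the inode was updated to also point at block 5, bitmap not updated.
print(check(bitmap, {2: [4, 5]}))
```

Even this single check has to visit every inode and every bitmap entry, which hints at why checking an entire disk is slow.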

4 Solution #2: Write-Ahead Logging (a.k.a. Journaling)

4.1 Overview

  • what can we hope for?
    • after rebooting and running recovery code
    • FS internal consistency maintained, e.g., no block is both in the free list and in a file
    • all but the last few operations preserved on disk, e.g., data I wrote yesterday is preserved, but perhaps not data I was writing at the time of the crash, so the user might have to check the last few operations
    • no order anomalies, e.g., after echo 99 > result ; echo done > status, a crash should never leave status containing "done" while result lacks 99
  • correctness and performance often conflict
    • disk writes are slow!
    • safety => write to disk ASAP
    • speed => don't write to the disk (batch, write-back cache, sort by track, etc.)
  • crash recovery is a recurring problem
    • arises in all storage systems, e.g. databases
    • a lot of work has gone into solutions over the years
    • many clever performance/correctness tradeoffs

4.2 Implementation

  • most popular solution: logging (== journaling)
    • goal: atomic system calls w.r.t. crashes
    • goal: fast recovery (no hour-long fsck)
  • the basic idea behind logging
    • you want atomicity: all of a system call's writes, or none
      • let's call an atomic operation a "transaction"
    • record all writes the sys call will do in the log on disk (log)
    • then record "done" on disk (commit)
    • then do the FS disk writes (install)
    • on crash+recovery:
      • if "done" in log, replay all writes in log
      • if no "done", ignore log
    • this is a WRITE-AHEAD LOG
    • After install, clean the log so the blocks for the installed entries can be reused
  • write-ahead log rule
    • install none of a transaction's writes to disk until all writes are in the log on disk, and the logged writes are marked committed.
  • why the rule?
    • once we've installed one write to the on-disk FS, we have to do all of the transaction's other writes — so the transaction is atomic
    • we have to be prepared for a crash after the first installation write, so the other writes must be still available after the crash — in the log.
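
The log, commit, install, and recovery steps above can be sketched in a few lines. This is a toy model, not a real journaling implementation: the "disk" is a pair of in-memory Python structures, the log holds one transaction at a time, and the names Disk, Crash, run_transaction, recover, and the crash_after parameter are all hypothetical.

```python
class Crash(Exception):
    """Simulated power failure."""

class Disk:
    def __init__(self):
        self.blocks = {}  # installed FS blocks: block name -> contents
        self.log = []     # on-disk log records: ("write", block, data) or ("commit",)

def run_transaction(disk, writes, crash_after=None):
    """Apply `writes` atomically; `crash_after` optionally names a step to crash after."""
    for block, data in writes.items():
        disk.log.append(("write", block, data))        # log
    if crash_after == "log":
        raise Crash("crashed before commit")
    disk.log.append(("commit",))                       # commit ("done")
    if crash_after == "commit":
        raise Crash("crashed before install")
    for block, data in writes.items():
        disk.blocks[block] = data                      # install
        if crash_after == "first install":
            raise Crash("crashed midway through install")
    disk.log.clear()                                   # clean the log for reuse

def recover(disk):
    """Run after reboot: replay a committed log, ignore an uncommitted one."""
    if ("commit",) in disk.log:
        for record in disk.log:
            if record[0] == "write":
                _, block, data = record
                disk.blocks[block] = data
    disk.log.clear()
```

Replaying a committed log is safe even if some writes were already installed, because installing the same block contents twice leaves the same result. A crash before the commit record leaves the file system untouched, since nothing was installed yet.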

4.3 Challenges

  • challenge: prevent write-back from cache
    • a system call can safely update a cached block, but the block cannot be written to the FS until the transaction commits
    • tricky because e.g. cache may run out of space, and be tempted to evict some entries in order to read and cache other data.
    • consider create example:
      • write dirty inode to log
      • write dir block to log
      • evict dirty inode (this writes it to the FS before the commit, violating the write-ahead rule)
      • commit
    • solutions:
      • ensure buffer cache is big enough
      • pin dirty blocks in buffer cache
      • after commit, unpin block
  • challenge: each block gets written to disk twice (once for log, once for install)
    • this is more overhead than we'd like
    • what if we only logged metadata?
      • so only the updates to e.g., the bitmap and the inode go in the log
    • when would we write the data? [2]
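
The "pin dirty blocks" solution from the first challenge above might look roughly like the following. This is a simplified sketch of an LRU buffer cache that refuses to evict pinned blocks; the BufferCache class and its method names are hypothetical, and write-back of evicted dirty blocks is omitted to keep the focus on pinning.

```python
from collections import OrderedDict

class BufferCache:
    """Tiny LRU cache in which pinned blocks are never chosen for eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block -> data, least recently used first
        self.pinned = set()          # blocks dirtied by the uncommitted transaction

    def put(self, block, data, pin=False):
        """Cache `data` for `block`; pin it if it belongs to an open transaction."""
        if block not in self.blocks and len(self.blocks) >= self.capacity:
            self._evict()
        self.blocks[block] = data
        self.blocks.move_to_end(block)  # mark as most recently used
        if pin:
            self.pinned.add(block)

    def _evict(self):
        # Pick the least recently used block that is not pinned.
        victim = next((b for b in self.blocks if b not in self.pinned), None)
        if victim is None:
            raise RuntimeError("cache is full of pinned blocks")
        del self.blocks[victim]

    def commit(self):
        """After the transaction commits, its blocks may be evicted again."""
        self.pinned.clear()
```

The RuntimeError case is the reason the notes also list "ensure buffer cache is big enough": if a transaction dirties more blocks than the cache can hold, pinning alone cannot help.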

Footnotes:

[1] The data block, the inode, and the data block bitmap (three separate disk writes!)

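[2] Writing the data after the metadata commits could leave us in an inconsistent state if there were a crash (the inode could point to a block of garbage), so the data must be written before the metadata is committed.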