CS 332 w22 — File System Reliability

1 Reading: Crash Consistency: FSCK and Journaling

Read OSTEP Chapter 42 for a nice walkthrough of two approaches to recovering from a crash in the middle of a file system operation.

2 The Problem

  • you're writing to the file system
  • then the power fails
  • you reboot
  • is your file system still usable?
  • crash during multi-step operation
  • may leave FS in an inconsistent state
  • after reboot:
    • bad: crash again due to corrupt FS
    • worse: no crash, but reads/writes incorrect data

2.1 Example

This tiny example includes an inode bitmap (with just 8 bits, one per inode), a data bitmap (also 8 bits, one per data block), inodes (8 total, numbered 0 to 7, and spread across four blocks), and data blocks (8 total, numbered 0 to 7).


Here's what's inside our inode:

owner : awb
permissions : read-write
size : 1
pointer : 4
pointer : null
pointer : null
pointer : null

What needs to be updated to append a block's worth of data to the existing file? [1]


2.1.1 Crash Scenarios

  • Just data block is written
  • Just inode is written
  • Just bitmap is written
  • inode and data block are written, but not bitmap
  • inode and bitmap are written, but not data
  • data block and bitmap are written, but not inode
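
The scenarios above can be enumerated mechanically. The following is a minimal sketch, not part of the original notes: it assumes the append allocates data block 5, models the disk as a Python dictionary, and treats a block that was never written as "garbage". All names (initial_disk, WRITES, problems, crash_after) are hypothetical.

```python
from itertools import combinations

def initial_disk():
    # Hypothetical model of the tiny example: one file whose inode points to block 4.
    return {
        "data_bitmap": [0, 0, 0, 0, 1, 0, 0, 0],   # only data block 4 is in use
        "inode": {"size": 1, "pointers": [4, None, None, None]},
        "data": {4: "old contents"},               # block number -> contents
    }

# The three disk writes needed to append one block (assumed to land in block 5).
def write_bitmap(disk):
    disk["data_bitmap"][5] = 1

def write_inode(disk):
    disk["inode"].update(size=2, pointers=[4, 5, None, None])

def write_data(disk):
    disk["data"][5] = "new contents"

WRITES = {"bitmap": write_bitmap, "inode": write_inode, "data": write_data}

def problems(disk):
    """Return the inconsistencies visible after reboot (single-inode model)."""
    found = []
    pointed = {p for p in disk["inode"]["pointers"] if p is not None}
    marked = {i for i, bit in enumerate(disk["data_bitmap"]) if bit}
    if pointed - marked:
        found.append("inode points to a block the bitmap says is free")
    if marked - pointed:
        found.append("bitmap marks a block no inode uses (space leak)")
    if any(p not in disk["data"] for p in pointed):
        found.append("inode points to a block that was never written (garbage)")
    return found

def crash_after(done):
    """Apply only the writes named in `done`, then report what a reboot would see."""
    disk = initial_disk()
    for name in done:
        WRITES[name](disk)
    return problems(disk)

# Try every subset of the three writes reaching the disk before the crash.
for k in range(4):
    for done in combinations(WRITES, k):
        print(sorted(done), "->", crash_after(done) or "consistent")
```

In this model, "just the data block" is reported as consistent: the new data is lost, but the file system's own structures still agree with one another.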

3 Solution #1: The File System Checker

  • Unix tool fsck
  • On reboot, check every part of the file system for consistency
    • Before file system is mounted (ensures no other file system operations are taking place)
    • Perform repairs as needed
  • Problem
    • Very slow to check the entire disk!
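
To give a feel for one kind of check, here is a rough sketch of a bitmap consistency pass. It is not the real fsck: it assumes inodes are given as lists of block pointers, and it resolves disagreements by trusting the inodes over the bitmap (as OSTEP describes). The function name fsck_data_bitmap is made up for illustration.

```python
def fsck_data_bitmap(inodes, data_bitmap):
    """Recompute which data blocks are in use from the inodes, then fix the bitmap.

    inodes: list of pointer lists, one per inode (None marks an unused slot).
    data_bitmap: list of 0/1 bits, modified in place.
    Returns a list of human-readable findings.
    """
    in_use = [0] * len(data_bitmap)
    findings = []
    # Pass 1: walk every pointer of every inode. This full scan is why fsck is slow.
    for ino, pointers in enumerate(inodes):
        for p in pointers:
            if p is None:
                continue
            if in_use[p]:
                findings.append(f"block {p} claimed twice (second claim by inode {ino})")
            in_use[p] = 1
    # Pass 2: make the bitmap agree with what the inodes actually reference.
    for block, (expected, actual) in enumerate(zip(in_use, data_bitmap)):
        if expected != actual:
            findings.append(f"bitmap bit {block}: {actual} -> {expected}")
            data_bitmap[block] = expected
    return findings

# The "inode and data block written, but not bitmap" scenario from the example.
bitmap = [0, 0, 0, 0, 1, 0, 0, 0]
print(fsck_data_bitmap([[4, 5, None, None]], bitmap))
print(bitmap)
```

Note what this cannot fix: if the inode points to block 5 but the data never reached the disk, the structures look consistent and the file simply contains garbage.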

4 Solution #2: Write-Ahead Logging (a.k.a. Journaling)

4.1 Overview

  • what can we hope for?
    • after rebooting and running recovery code
    • FS internal consistency maintained
      • e.g., no block is both in free list and in a file
    • all but last few operations preserved on disk
      • e.g., data I wrote yesterday are preserved
      • but perhaps not data I was writing at time of crash
      • so user might have to check last few operations
    • no order anomalies
      • e.g., echo 99 > result ; echo done > status
      • if status says "done" after the crash, result must contain 99
  • correctness and performance often conflict
    • disk writes are slow!
    • safety => write to disk ASAP
    • speed => don't write the disk (batch, write-back cache, sort by track, &c)
  • crash recovery is a recurring problem
    • arises in all storage systems, e.g. databases
    • a lot of work has gone into solutions over the years
    • many clever performance/correctness tradeoffs

4.2 Implementation

  • most popular solution: logging (== journaling)
    • goal: atomic system calls w.r.t. crashes
    • goal: fast recovery (no hour-long fsck)
  • the basic idea behind logging
    • you want atomicity: all of a system call's writes, or none
      • let's call an atomic operation a "transaction"
    • record all writes the sys call will do in the log on disk (log)
    • then record "done" on disk (commit)
    • then do the FS disk writes (install)
    • on crash+recovery:
      • if "done" in log, replay all writes in log
      • if no "done", ignore log
    • this is a WRITE-AHEAD LOG
    • After install, clean the log so the blocks for the installed entries can be reused
  • write-ahead log rule
    • install none of a transaction's writes to disk until all writes are in the log on disk, and the logged writes are marked committed.
  • why the rule?
    • once we've installed one write to the on-disk FS, we have to do all of the transaction's other writes — so the transaction is atomic
    • we have to be prepared for a crash after the first installation write, so the other writes must be still available after the crash — in the log.
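
The log / commit / install / recover sequence can be sketched in a few lines. This is a simplified model under stated assumptions, not a real implementation: the "disk" is a Python object, writing the whole log is treated as a single step, and the crash_before parameter is a hypothetical hook for simulating a power failure at a chosen point.

```python
class Crash(Exception):
    """Simulated power failure."""

class Disk:
    """State that survives a crash: the FS blocks, the log, and the "done" record."""
    def __init__(self):
        self.blocks = {}        # installed FS blocks: block number -> contents
        self.log = []           # logged (block number, contents) pairs
        self.committed = False  # the "done" record

def run_transaction(disk, writes, crash_before=None):
    def maybe_crash(point):
        if crash_before == point:
            raise Crash(point)

    maybe_crash("log")
    disk.log = list(writes)             # 1. log: record every write
    maybe_crash("commit")
    disk.committed = True               # 2. commit: record "done"
    maybe_crash("install")
    for i, (blockno, contents) in enumerate(disk.log):
        if i == 1:
            maybe_crash("second install write")
        disk.blocks[blockno] = contents  # 3. install: do the real FS writes
    maybe_crash("clean")
    disk.log, disk.committed = [], False  # 4. clean: log space can be reused

def recover(disk):
    """Run at reboot: replay the log if "done" is present, otherwise ignore it."""
    if disk.committed:
        for blockno, contents in disk.log:
            disk.blocks[blockno] = contents  # replay is idempotent
    disk.log, disk.committed = [], False

# Crash after the first install write: recovery finishes the job from the log.
disk = Disk()
try:
    run_transaction(disk, [(1, "inode"), (2, "bitmap")],
                    crash_before="second install write")
except Crash:
    pass
recover(disk)
print(disk.blocks)
```

Replaying a write that was already installed is harmless because each log entry holds the full new contents of a block, so applying it twice gives the same result as applying it once.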

4.3 Challenges

  • challenge: prevent write-back from cache
    • a system call can safely update a cached block, but the block cannot be written to the FS until the transaction commits
    • tricky because e.g. cache may run out of space, and be tempted to evict some entries in order to read and cache other data.
    • consider create example:
      • write dirty inode to log
      • write dir block to log
      • evict dirty inode
      • commit
    • solutions:
      • ensure buffer cache is big enough
      • pin dirty blocks in buffer cache
      • after commit, unpin block
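
Pinning can be sketched as a small LRU cache that refuses to evict dirty blocks belonging to an uncommitted transaction. This is an illustrative sketch only: the class and method names are hypothetical, the backing disk is a plain dictionary, and it assumes unpin_after_commit is called once the transaction's blocks are safely on disk, so dropping an unpinned block loses nothing.

```python
from collections import OrderedDict

class BufferCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block number -> contents, least recently used first
        self.pinned = set()          # dirty blocks of the uncommitted transaction

    def _make_room(self):
        while len(self.blocks) >= self.capacity:
            # Evict the least recently used block that is NOT pinned.
            victim = next((b for b in self.blocks if b not in self.pinned), None)
            if victim is None:
                raise RuntimeError("cache full of pinned blocks; transaction too large")
            del self.blocks[victim]

    def read(self, blockno, disk):
        if blockno not in self.blocks:
            self._make_room()
            self.blocks[blockno] = disk.get(blockno)
        self.blocks.move_to_end(blockno)
        return self.blocks[blockno]

    def write(self, blockno, contents):
        """Update a cached block and pin it until the transaction commits."""
        if blockno not in self.blocks:
            self._make_room()
        self.blocks[blockno] = contents
        self.blocks.move_to_end(blockno)
        self.pinned.add(blockno)

    def unpin_after_commit(self):
        # Assumption: the commit path has already written these blocks out.
        self.pinned.clear()

disk = {1: "a", 2: "b", 3: "c"}
cache = BufferCache(capacity=2)
cache.write(10, "dirty inode")   # pinned
cache.read(1, disk)
cache.read(2, disk)              # evicts block 1, not the older pinned block 10
print(list(cache.blocks))
```

The RuntimeError branch corresponds to the "ensure buffer cache is big enough" bullet: if a single transaction dirties more blocks than the cache holds, pinning alone cannot help.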
  • challenge: each block gets written to disk twice (once for log, once for install)
    • this is more overhead than we'd like
    • what if we only logged metadata?
      • so only the updates to e.g., the bitmap and the inode go in the log
    • when would we write the data? [2]
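
One possible ordering for metadata-only logging is sketched below: the data block goes straight to its final location first, and only then are the bitmap and inode updates logged and committed. This is a toy model with hypothetical names (new_disk, append_block, crash_before), reusing the append-to-block-5 example. It checks that no crash point leaves the inode pointing at a block that was never written.

```python
def new_disk():
    return {"data": {4: "old"}, "bitmap": {4}, "inode_ptrs": [4],
            "log": [], "committed": False}

def install(disk):
    """Apply logged metadata updates to the FS, then clean the log."""
    for kind, block in disk["log"]:
        if kind == "bitmap":
            disk["bitmap"].add(block)
        elif kind == "inode" and block not in disk["inode_ptrs"]:
            disk["inode_ptrs"].append(block)
    disk["log"], disk["committed"] = [], False

def append_block(disk, block, contents, crash_before=None):
    """Append with metadata-only logging; crash_before simulates a power failure."""
    for step in ["data", "log", "commit", "install"]:
        if step == crash_before:
            return                          # power fails here
        if step == "data":
            disk["data"][block] = contents  # data is written FIRST, and never logged
        elif step == "log":
            disk["log"] = [("bitmap", block), ("inode", block)]
        elif step == "commit":
            disk["committed"] = True
        else:
            install(disk)

def recover(disk):
    if disk["committed"]:
        install(disk)
    else:
        disk["log"] = []

def points_to_garbage(disk):
    return any(p not in disk["data"] for p in disk["inode_ptrs"])

for point in ["data", "log", "commit", "install", None]:
    disk = new_disk()
    append_block(disk, 5, "new", crash_before=point)
    recover(disk)
    print(point, disk["inode_ptrs"], "garbage!" if points_to_garbage(disk) else "ok")
```

If the "data" step were moved after "commit", a crash between the two would let recovery install an inode pointer to block 5 before block 5 holds the file's data.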



[1] The data block, the inode, and the data block bitmap (three separate disk writes!)


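[2] Writing the data after the metadata transaction commits could leave us in an inconsistent state if there was a crash (the inode could point to a block whose data never reached the disk), so the data must be written before the commit.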