CS 208 s20 — Caching: Basic Principles
Table of Contents
1 Video
Here is a video lecture for the material outlined below. It covers CSPP section 6.4.1 (p. 615–616). It contains sections on
- flickers (0:04)
- introduction (0:35)
- locality: different types (1:47)
- locality: stride examples (9:05)
- 3D array sum exercise (15:45)
- revisit the memory hierarchy (17:09)
- caching terminology (20:59)
- types of cache misses (25:49)
- real cache hierarchy (Core i7) (30:37)
- cache organization: block offset (34:07)
- cache organization: sets (43:19)
- cache organization: spreadsheet example (49:01)
- cache organization: tags (52:04)
- cache miss example (55:19)
- end notes (59:21)
The Panopto viewer has table of contents entries for these sections. Link to the Panopto viewer: https://carleton.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=23c6efcd-146a-4f65-8ec3-abc101039f8f
2 Locality
```c
sum = 0;
for (i = 0; i < n; i++) {
    sum += a[i];
}
return sum;
```
- sources of data locality:
  - temporal: `sum` referenced in each iteration
  - spatial: consecutive elements of array `a[]` accessed
- sources of instruction locality:
  - temporal: cycle through loop repeatedly
  - spatial: reference instructions in sequence
2.1 Code With Good Locality Accesses Adjacent Elements
```c
int sum_array_rows(int a[M][N]) {
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
```
This is a stride 1 access pattern because we shift by 1 element each access. In general, the lower the stride, the better the locality.
```c
int sum_array_cols(int a[M][N]) {
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
```
This is a stride N access pattern because we shift by N elements each access.
Recall the example from topic 1 where switching the order of nested loops caused summing an array to take almost 40 times longer. That was exactly this difference—going from a stride 1 access pattern to a stride N access pattern.
2.1.1 Exercise
What is wrong with this code (and how would you fix it)?[1]
```c
int sum_array_3D(int a[M][N][L]) {
    int i, j, k, sum = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < L; j++)
            for (k = 0; k < M; k++)
                sum += a[k][i][j];
    return sum;
}
```
3 Caching Basics
3.1 Revisiting the memory hierarchy
- each level acts as a cache for the level below
- registers are explicitly program-controlled
- hardware manages the caches transparently
3.2 Terminology
3.2.1 cache hits
- finding needed data in level \(k\) instead of \(k + 1\)
3.2.2 cache misses
- having to go to level \(k+1\) because the data is not in level \(k\)
- cache at level \(k\) now contains the data
- if the level \(k\) cache is full, it will overwrite an existing block (replacing or evicting it)
  - which block gets overwritten is governed by the cache's replacement policy
  - a random replacement policy would choose randomly
  - a least recently used (LRU) policy would choose the block with the oldest last access
3.2.3 kinds of cache misses
- if a cache is empty, every access will result in a cache miss (compulsory or cold misses on a cold cache)
  - these won't occur once the cache has been warmed up through repeated accesses
- a cache at level \(k\) must have a placement policy to decide where to place a block retrieved from the level \(k + 1\) cache
  - faster caches tend to implement a stricter policy, since it's important to be able to locate a block quickly
  - e.g., block \(i\) at level \(k+1\) must be placed in block \(i \mod 4\) at level \(k\)
  - a strict placement policy can lead to a conflict miss (there is enough room for the block, but the policy forces a miss, as in cache thrashing between two repeatedly accessed blocks that map to the same place)
- the set of blocks involved in the current phase of the program (e.g., a loop) is referred to as the working set
  - when the working set is too large for a cache, that cache will experience capacity misses
4 Anatomy of a Real Cache Hierarchy
- caches that hold only instructions are i-caches
- caches that hold only data are d-caches
- caches that hold both are unified caches
- modern processors include separate i- and d-caches
  - the processor can read an instruction and data at the same time
  - each cache can be optimized to its own access patterns
  - trades fewer conflict misses (data and instructions won't conflict) for potentially more capacity misses (each cache is smaller)
4.1 cache management
- each level must have some managing logic provided by software or hardware
- compiler manages registers
- L1/2/3 managed by hardware logic built into the caches
- usually no action is required on the part of the program to manage caches
5 Cache Organization
5.1 Fundamentals
- the block size \(B\) is the unit of transfer between memory and cache (in bytes, always a power of 2)
- we'll use the low-order \(\log_2 B = b\) bits of a memory address to specify a byte within a block
- the block offset
- for an \(m\)-bit address, the remaining \(m - b\) bits will be used to identify the block (block number)
5.1.1 Quick Check
If we have 6-bit addresses and block size \(B = 4\) bytes, which block and byte does `0x15` refer to?[2]
5.1.2 Where Should Data Go in the Cache?
- need the procedure for checking if a memory address is in the cache to be fast
- solution: mimic a hash table, a data structure that provides fast lookup
- spreadsheet example
- use the middle two bits of the address (the lower two bits of the block number) as the set index
- use \(s\) bits to index \(2^s=S\) sets
5.1.3 Example
A request for address `0x2a` results in a cache miss. Which index does this block get loaded into, and which 3 other addresses are loaded along with it?
- address: `0b 10 10 10`, so the index is `10`
- block offset is `10`, so the full block consists of `0x28`, `0x29`, and `0x2b` (in addition to `0x2a` itself)
5.1.4 What About Collisions?
- access an address not present in the cache, but with the same cache index
- we need to be able to tell these blocks apart
- solution: use leftover address bits as tag bits
- \(t = m - s - b\)
5.1.5 Cache Access
- CPU sends address request for chunk of data
- index field tells you where to look in the cache
- tag field lets you check whether the data is the block you want
- offset field selects start byte within block
6 Homework
- CSPP practice problem 6.9 (p. 616)
- The week 6 quiz is due 9pm Wednesday, May 20 (tonight)
- Lab 4 will be due 9pm Wednesday, May 27
Footnotes:
1. The loop ordering results in a very inefficient stride (\(N \times L\)). The loops can be reordered to achieve a stride-1 access pattern:

```c
int sum_array_3D(int a[M][N][L]) {
    int i, j, k, sum = 0;
    for (k = 0; k < M; k++)
        for (i = 0; i < N; i++)
            for (j = 0; j < L; j++)
                sum += a[k][i][j];
    return sum;
}
```
2. `0x15` is `0b 0101 01` (6-bit address). With a block size of 4, we need 2 bits to specify a byte within a block, leaving the 4 higher-order bits as the block number. Hence, `0x15` refers to byte `0b01` (1) within block `0b0101` (5).