CS 208 s20 — Caching: Basic Principles

Table of Contents

1 Video

Here is a video lecture for the material outlined below. It covers CSPP section 6.4.1 (p. 615–616). It contains sections on

  1. flickers (0:04)
  2. introduction (0:35)
  3. locality: different types (1:47)
  4. locality: stride examples (9:05)
  5. 3D array sum exercise (15:45)
  6. revisit the memory hierarchy (17:09)
  7. caching terminology (20:59)
  8. types of cache misses (25:49)
  9. real cache hierarchy (Core i7) (30:37)
  10. cache organization: block offset (34:07)
  11. cache organization: sets (43:19)
  12. cache organization: spreadsheet example (49:01)
  13. cache organization: tags (52:04)
  14. cache miss example (55:19)
  15. end notes (59:21)

The Panopto viewer has table of contents entries for these sections. Link to the Panopto viewer: https://carleton.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=23c6efcd-146a-4f65-8ec3-abc101039f8f

2 Locality

sum = 0;
for (i = 0; i < n; i++) {
    sum += a[i];
}
return sum;
  • sources of data locality:
    • temporal: sum referenced in each iteration
    • spatial: consecutive elements of array a[] accessed
  • sources of instruction locality:
    • temporal: cycle through loop repeatedly
    • spatial: reference instructions in sequence

2.1 Code With Good Locality Accesses Adjacent Elements

int sum_array_rows(int a[M][N]) {
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

stride1.png

This is a stride 1 access pattern because we shift by 1 element each access. In general, the lower the stride, the better the locality.

int sum_array_cols(int a[M][N]) {
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

strideN.png

This is a stride N access pattern because we shift by N elements each access.

Recall the example from topic 1 where switching the order of nested loops caused summing an array to take almost 40 times longer. That was exactly this difference—going from a stride 1 access pattern to a stride N access pattern.

2.1.1 Exercise

What is wrong with this code (and how would you fix it)?1

int sum_array_3D(int a[M][N][L]) {
    int i, j, k, sum = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < L; j++)
            for (k = 0; k < M; k++)
                sum += a[k][i][j];
    return sum;
}

3 Caching Basics

3.1 Revisiting the memory hierarchy

memhier.png

  • each level acts as a cache for the level below
  • registers explicitly program controlled
  • hardware manages caching transparently

3.2 Terminology

3.2.1 cache hits

  • finding needed data in level \(k\) instead of \(k + 1\)

3.2.2 cache misses

  • having to go to level \(k+1\) because the data is not in level \(k\)
    • cache at level \(k\) now contains the data
    • if level \(k\) cache is full, it will overwrite an existing block (replacing or evicting)
  • which block gets overwritten is governed by the cache's replacement policy
    • a random replacement policy would choose randomly
    • a least recently used (LRU) policy would choose the block with the oldest last access

3.2.3 kinds of cache misses

  • if a cache is empty, every access will result in a cache miss (compulsory or cold misses on a cold cache)
    • won't occur once a cache has been warmed up through repeated accesses
  • a cache at level \(k\) must have a placement policy to decide where to place a block retrieved from a \(k + 1\) cache
    • faster caches tend to implement a stricter placement policy since it's important to be able to locate a block quickly
      • e.g., block \(i\) at level \(k+1\) must be placed in block \(i \mod 4\) at level \(k\)
      • can lead to a conflict miss (enough room for a block, but placement policy causing a miss, as in cache thrashing between two repeatedly accessed blocks)
  • the set of blocks involved in the current phase of the program (e.g., a loop) is referred to as the working set
    • when the working set is too large for a cache, that cache will experience capacity misses

4 Anatomy of a Real Cache Hierarchy

corei7caches.png

  • caches that hold only instructions are i-caches
  • caches that hold only data are d-caches
  • caches that hold both are unified caches
  • modern processors include separate i- and d-caches
    • can read an instruction and data at the same time
    • caches are optimized to different access patterns
    • trades off fewer conflict misses (data and instructions won't conflict) for potentially more capacity misses (smaller caches)

4.1 cache management

  • each level must have some managing logic provided by software or hardware
  • compiler manages registers
  • L1/2/3 managed by hardware logic built into the caches
  • usually no action is required on the part of the program to manage caches

5 Cache Organization

5.1 Fundamentals

  • the block size \(B\) is the unit of transfer between memory and cache (in bytes, always a power of 2)
  • we'll use the low-order \(\log_2 B = b\) bits of a memory address to specify a byte within a block
    • the block offset
  • for an \(m\)-bit address, the remaining \(m - b\) bits will be used to identify the block (block number)

physaddr-simple.png

5.1.1 Quick Check

If we have 6-bit addresses and block size \(B = 4\) bytes, which block and byte does 0x15 refer to?2

5.1.2 Where Should Data Go in the Cache?

  • need the procedure for checking if a memory address is in the cache to be fast
  • solution: mimic a hash table, a data structure that provides fast lookup
  • spreadsheet example
    • use middle two bits from address (lower two bits of block number) as set index
    • use \(s\) bits to index \(2^s=S\) sets

5.1.3 Example

A request for address 0x2a results in a cache miss. Which index does this block get loaded into and which 3 other addresses are loaded along with it?

  • address: 0b 10 10 10, so index 10
  • block offset is 10, so the other 3 addresses in the block are 0x28, 0x29, and 0x2b

5.1.4 What About Collisions?

  • access an address not present in the cache, but with the same cache index
  • we need to be able to tell these blocks apart
  • solution: use leftover address bits as tag bits
    • \(t = m - s - b\)

physaddr.png

5.1.5 Cache Access

  • CPU sends address request for chunk of data
    • index field tells you where to look in the cache
    • tag field lets you check if the data is the block you want
    • offset field selects start byte within block

6 Homework

  1. CSPP practice problems 6.9 (p. 616)
  2. The week 6 quiz is due 9pm Wednesday, May 20 (tonight)
  3. Lab 4 will be due 9pm Wednesday, May 27

Footnotes:

1

The loop ordering results in a very inefficient stride (\(N \times L\)). The loops can be reordered to achieve a stride 1 access pattern.

int sum_array_3D(int a[M][N][L]) {
    int i, j, k, sum = 0;
    for (k = 0; k < M; k++)
        for (i = 0; i < N; i++)
            for (j = 0; j < L; j++)
                sum += a[k][i][j];
    return sum;
}
2

0x15 is 0b 01 0101 (6-bit address). With a block size of 4, we need 2 bits to specify a byte within a block, leaving the 4 higher order bits as the block number. Hence, 0x15 refers to byte 0b01 (1) and block 0b0101 (5).