CS 208 w20 lecture 26 outline

1 Definitions

  • concurrent computing: when the executions of multiple computations (or processes) overlap in time
    • we need to correctly control access to shared resources
  • parallel computing: when multiple computations (or processes) execute simultaneously
    • using additional resources to produce an answer faster
  • concurrency primarily about effective coordination, parallelism primarily about efficient division of labor

2 Why Parallelism?

cpu-history.png

  • running into physical limits:
    • power density roughly constant from 1958 to the late 1990s
      • grew rapidly starting in the late 1990s
    • since ~2000, transistor features too small to shrink much further (just a few atoms thick)
  • heat death
    • power scales with frequency
    • processors stopped around 4 GHz

cpu-power.png

  • solution: slap multiple processors on a single chip
    • continues the trend of more and more transistors, but…
    • requires new programming approach

3 Limits of Parallelism

  • is it faster to pick up a deck of cards with two people instead of one?
    • how about 100 instead of 50?
  • there are a finite number of independent tasks, putting a limit on how much can be parallelized
  • some tasks need to wait on others, making them inherently sequential
  • the amount of speedup that can be achieved through parallelism is limited by the non-parallel portion

4 Threads

  • previously, we defined a process as an instance of a running program
  • a thread is a single execution sequence that represents the minimal unit of scheduling
    • one process may contain multiple threads
  • four reasons to use threads:
    • program structure: expressing logically concurrent tasks
      • each thing a process is doing gets its own thread (e.g., drawing the UI, handling user input, fetching network data)
    • responsiveness: shifting work to run in the background
      • we want an application to continue responding to input in the middle of an operation
      • when saving a file, write contents to a buffer and return control, leaving a background thread to actually write it to disk
    • performance: exploiting multiple processors
    • performance: managing I/O devices

4.1 Hardware Support for Multithreading

  • two copies of the PC and registers inside the processor hardware:

cpu-multithread.png

  • looks like two processors to software
  • control logic decides which thread to execute an instruction from next

4.2 Threading Models

  • POSIX Threads (pthread)
    • #include <pthread.h>, low-level interface giving fine-grained control
  • Fork-Join model (OpenMP)
    • #include <omp.h>, higher-level interface for fork-join approach
    • pthreads can do fork-join manually

fork-join.png

5 Instruction-Level Parallelism

  • steps to execute the next x86-64 instruction:
    1. fetch the instruction     400540: 48 01 fe
    2. decode the instruction    addq %rdi, %rsi
    3. gather data values        read R[%rdi], R[%rsi]
    4. perform operation         calc R[%rdi] + R[%rsi]
    5. store result              save into %rsi
  • each of these steps is roughly associated with a part of the processor
    • each part can work independently

5.1 By Analogy to Doing Laundry

  • washer, dryer, folding, and putting away all take 30 minutes
    • full task takes 2 hours
  • four loads of laundry done in sequence take 8 hours
    • but the washer and dryer sit idle for 1.5 hours at a time
  • we can start the second load as soon as the first load is done with the washer
    • pipelined laundry takes 3.5 hours for 4 loads
  • does not improve the latency of a single load (still takes 2 hours), but improves throughput
  • processors pipeline instruction execution steps in a similar way

6 Why Concurrency?

  • a multi-threaded program's execution depends on how instructions are interleaved
  • threads sharing data may be affected by race conditions (or data races)
  Thread A     Thread B
  x = 1        x = 2
  • What are the final possible values of x? Result depends on which thread "wins."
  • Suppose y=12 initially:
  Thread A      Thread B
  x = y + 1     y = y * 2
  • What are the final possible values of x? Either 13 or 25 depending on which thread goes first
  • Suppose x = 0 initially:
  Thread A      Thread B
  x = x + 1     x = x + 2
  • What are the final possible values of x? Remember that each thread has its own private copy of the registers:
  Interleaving 1        Interleaving 2        Interleaving 3
  A: load x into r1     A: load x into r1     A: load x into r1
  A: add 1 to r1        B: load x into r1     B: load x into r1
  A: store r1 in x      A: add 1 to r1        A: add 1 to r1
  B: load x into r1     A: store r1 in x      B: add 2 to r1
  B: add 2 to r1        B: add 2 to r1        B: store r1 in x
  B: store r1 in x      B: store r1 in x      A: store r1 in x
  • interleaving 1 results in x == 3
  • interleaving 2 results in x == 2
  • interleaving 3 results in x == 1

7 Synchronization

  • prevent bad interleavings of read/write operations by forcing threads to coordinate their accesses
    • identify critical section of code where only one thread should operate at a time
  • pseudocode:
  check lock       // loop/idle here if locked
  acquire lock
  CRITICAL SECTION (e.g., change shared variables)
  release lock
  • problem:
  Thread A               Thread B
  read lock (check)
                         read lock (check)
  conditional jump
                         conditional jump
                         write lock (acquire)
  write lock (acquire)
                         CRITICAL SECTION…
  CRITICAL SECTION…
  • we need hardware support to make checking and acquiring the lock an atomic operation (impossible to interrupt)
  • we want to identify the minimal critical section to maintain parallelism