CS 208 s21 — Parallelism and Concurrency
1 Video
Here is a video lecture for the material outlined below. It contains sections on
- Definitions (0:00)
- Why Parallelism? (1:55)
- Limits of Parallelism (7:17)
- Threads (9:52)
- Instruction-Level Parallelism (17:34)
- Why Concurrency? (21:36)
- Synchronization (26:54)
The Panopto viewer has table of contents entries for these sections. Link to the Panopto viewer: https://carleton.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=6af81424-7cdc-465a-ac6a-ac72013d81bd
2 Definitions
- concurrent computing: when the executions of multiple computations (or processes) overlap in time
- we need to correctly control access to shared resources
- parallel computing: when the executions of multiple computations (or processes) occur simultaneously
- using additional resources to produce an answer faster
- concurrency primarily about effective coordination, parallelism primarily about efficient division of labor
3 Why Parallelism?
- running into physical limits:
- power densities over time roughly constant from 1958 to late 1990s
- grow rapidly starting in late 1990s
- since around 2000, some transistor features are just a few molecules thick, leaving little room to shrink further
- heat death
- power scales with frequency
- processor clock speeds plateaued around 4 GHz
- solution: slap multiple processors on a single chip
- continues the trend of more and more transistors, but…
- requires new programming approach
- First ARM processor created by Acorn Computers in 1985 for the BBC Micro computer vs Apple's recently announced M1 processor (also an ARM processor)
- ARM is an alternative processor architecture to the x86 architecture we've studied
- The M1 is over twice as large physically as the ARM1. It has 16 billion transistors vs 25,000 for the ARM1
- The ARM1 processor ran at 6 megahertz, while the M1 runs at 3.2 gigahertz, over 500 times faster
- While the ARM1 was a single processor, the M1 has 4 high-performance CPU cores, 4 efficiency CPU cores, a 16-core neural engine, and an 8-core GPU
- Looking at the ARM1 die, you see functional blocks such as 100 bytes of registers and a basic 32-bit ALU. On the M1 die, similar-sized functional blocks are a 12-megabyte cache and a complete 64-bit CPU core.
- ALU = Arithmetic Logic Unit, the part of the processor that performs arithmetic and logical operations
4 Limits of Parallelism
- is it faster to pick up a deck of cards with two people instead of one?
- how about 100 instead of 50?
- there are a finite number of independent tasks, putting a limit on how much can be parallelized
- some tasks need to wait on others, making them inherently sequential
- the amount of speedup that can be achieved through parallelism is limited by the non-parallel portion
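This limit is formalized by Amdahl's law. A sketch of the formula in LaTeX, where $p$ is the fraction of the work that can be parallelized and $s$ is the number of processors:

```latex
\text{speedup} = \frac{1}{(1 - p) + p/s},
\qquad
\lim_{s \to \infty} \text{speedup} = \frac{1}{1 - p}
```

Even with unlimited processors, a program that is 90% parallelizable ($p = 0.9$) can never run more than 10 times faster.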
5 Threads
- previously, we defined a process as an instance of a running program
- a thread is a single execution sequence that represents the minimal unit of scheduling
- one process may contain multiple threads
- four reasons to use threads:
- program structure: expressing logically concurrent tasks
- each thing a process is doing gets its own thread (e.g., drawing the UI, handling user input, fetching network data)
- responsiveness: shifting work to run in the background
- we want an application to continue responding to input in the middle of an operation
- when saving a file, write contents to a buffer and return control, leaving a background thread to actually write it to disk
- performance: exploiting multiple processors
- performance: managing I/O devices
5.1 Hardware Support for Multithreading
- two copies of the PC and registers inside the processor hardware:
- looks like two processors to software
- control logic decides which thread to execute an instruction from next
5.2 Threading Models
- POSIX Threads (`#include <pthread.h>`), a low-level interface giving fine-grained control
- Fork-Join model (OpenMP, `#include <omp.h>`), a higher-level interface for the fork-join approach
- pthreads can do fork-join manually (see the sketches below)
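To make the contrast concrete, here is a minimal sketch of each model (the worker function, thread count, and output are illustrative, not from the lecture). First pthreads, where the program creates and joins threads explicitly (compile with `-pthread`):

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

// illustrative worker: each thread just announces itself
void *worker(void *arg) {
    long id = (long) arg;
    printf("hello from thread %ld\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *) i); // fork
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);                        // join
    return 0;
}
```

The same fork-join pattern in OpenMP is a single pragma; the compiler and runtime handle thread creation and the implicit join at the end of the block (compile with `-fopenmp`):

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel  // fork: a team of threads each runs this block
    printf("hello from thread %d\n", omp_get_thread_num());
    // implicit join: all threads finish before main continues
    return 0;
}
```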
6 Instruction-Level Parallelism
- steps to execute the next x86-64 instruction:
- fetch the instruction (`400540: 48 01 fe`)
- decode the instruction (`addq %rdi, %rsi`)
- gather the data values (read `R[%rdi]` and `R[%rsi]`)
- perform the operation (calculate `R[%rdi] + R[%rsi]`)
- store the result (save `R[%rdi] + R[%rsi]` into `%rsi`)
- each of these steps is roughly associated with a part of the processor
- each part can work independently
6.1 By Analogy to Doing Laundry
- washer, dryer, folding, and putting away all take 30 minutes
- full task takes 2 hours
- four loads of laundry done in sequence take 8 hours
- but the washer and dryer sit idle for 1.5 hours at a time
- we can start the second load as soon as the first load is done with the washer
- pipelined laundry takes 3.5 hours for 4 loads
- does not improve the latency of a single load (still takes 2 hours), but improves throughput
- processors pipeline instruction execution steps in a similar way
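The laundry arithmetic generalizes to any pipeline. A sketch, with $k$ stages of length $t$ each and $n$ loads (here $k = 4$, $t = 0.5$ hours, $n = 4$):

```latex
T_{\text{sequential}} = n k t = 4 \cdot 4 \cdot 0.5 = 8 \text{ hours},
\qquad
T_{\text{pipelined}} = k t + (n - 1) t = 2 + 1.5 = 3.5 \text{ hours}
```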
7 Why Concurrency?
- a multi-threaded program's execution depends on how instructions are interleaved
- threads sharing data may be affected by race conditions (or data races)
| Thread A | Thread B |
| --- | --- |
| `x = 1` | `x = 2` |

- What are the final possible values of `x`? The result depends on which thread "wins."
- Suppose `y = 12` initially:

| Thread A | Thread B |
| --- | --- |
| `x = y + 1` | `y = y * 2` |

- What are the final possible values of `x`? Either 13 or 25, depending on which thread goes first
- Suppose `x = 0` initially:

| Thread A | Thread B |
| --- | --- |
| `x = x + 1` | `x = x + 2` |

- What are the final possible values of `x`? Remember that register values are preserved for each thread:
| Interleaving 1 | Interleaving 2 | Interleaving 3 |
| --- | --- | --- |
| A: `load x into r1` | A: `load x into r1` | A: `load x into r1` |
| A: `add 1 to r1` | B: `load x into r1` | B: `load x into r1` |
| A: `store r1 in x` | A: `add 1 to r1` | A: `add 1 to r1` |
| B: `load x into r1` | B: `add 2 to r1` | B: `add 2 to r1` |
| B: `add 2 to r1` | A: `store r1 in x` | B: `store r1 in x` |
| B: `store r1 in x` | B: `store r1 in x` | A: `store r1 in x` |
- interleaving 1 results in `x == 3`
- interleaving 2 results in `x == 2`
- interleaving 3 results in `x == 1`
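Here is the last example as a runnable sketch using pthreads (the thread functions and output are illustrative; compile with `-pthread`):

```c
#include <pthread.h>
#include <stdio.h>

int x = 0;  // shared variable, no synchronization

// each assignment compiles to a load, an add, and a store,
// so the two threads' instructions can interleave as shown above
void *thread_a(void *arg) { x = x + 1; return NULL; }
void *thread_b(void *arg) { x = x + 2; return NULL; }

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("x = %d\n", x);  // 3 in most runs, but 1 and 2 are possible
    return 0;
}
```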
8 Synchronization
- prevent bad interleavings of read/write operations by forcing threads to coordinate their accesses
- identify critical section of code where only one thread should operate at a time
- pseudocode:

```
check lock        // loop/idle here if locked
acquire lock
CRITICAL SECTION  // e.g., change shared variables
release lock
```
- problem: both threads can pass the check before either one acquires the lock

| Thread A | Thread B |
| --- | --- |
| read lock (check) | |
| | read lock (check) |
| conditional jump | |
| | conditional jump |
| write lock (acquire) | |
| | write lock (acquire) |
| CRITICAL SECTION… | |
| | CRITICAL SECTION… |
- we need hardware support to make checking and acquiring the lock an atomic operation (impossible to interrupt)
- we want to identify the minimal critical section, keeping as much of the program parallel as possible
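Putting the pieces together, here is a sketch of the race from section 7 fixed with a pthread mutex (the mutex calls are the standard pthread API; the rest is illustrative):

```c
#include <pthread.h>
#include <stdio.h>

int x = 0;  // shared variable, now protected by a lock
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *thread_a(void *arg) {
    pthread_mutex_lock(&lock);    // check + acquire as one atomic operation
    x = x + 1;                    // CRITICAL SECTION, kept minimal
    pthread_mutex_unlock(&lock);  // release
    return NULL;
}

void *thread_b(void *arg) {
    pthread_mutex_lock(&lock);
    x = x + 2;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("x = %d\n", x);  // always 3: only one update runs at a time
    return 0;
}
```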