CS 208 w20 lecture 26 outline
1 Definitions
- concurrent computing: when the execution of multiple computations (or processes) overlap
- we need to correctly control access to shared resources
- parallel computing: when the execution of multiple computations (or processes) occur simultaneously
- using additional resources to produce an answer faster
- concurrency primarily about effective coordination, parallelism primarily about efficient division of labor
2 Why Parallelism?
- running into physical limits:
- power densities over time roughly constant from 1958 to late 1990s
- grow rapidly starting in late 1990s
- since around 2000, some transistor features have been only a few molecules thick, leaving little room to shrink further
- heat death
- power scales with frequency
- processors stopped around 4 GHz
- solution: slap multiple processors on a single chip
- continues the trend of more and more transistors, but…
- requires new programming approach
3 Limits of Parallelism
- is it faster to pick up a deck of cards with two people instead of 1?
- how about 100 instead of 50?
- there are a finite number of independent tasks, putting a limit on how much can be parallelized
- some tasks need to wait on others, making them inherently sequential
- the amount of speedup that can be achieved through parallelism is limited by the non-parallel portion
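- this limit is often formalized as Amdahl's law; as a sketch, with $s$ the fraction of the work that must run sequentially and $N$ the number of processors:

$$\text{speedup} \le \frac{1}{s + \frac{1 - s}{N}} \le \frac{1}{s}$$

- e.g., if 10% of the work is sequential ($s = 0.1$), even unlimited processors give at most a 10x speedup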
4 Threads
- previously, we defined a process as an instance of a running program
- a thread is a single execution sequence that represents the minimal unit of scheduling
- one process may contain multiple threads
- four reasons to use threads:
- program structure: expressing logically concurrent tasks
- each thing a process is doing gets its own thread (e.g., drawing the UI, handling user input, fetching network data)
- responsiveness: shifting work to run in the background
- we want an application to continue responding to input in the middle of an operation
- when saving a file, write contents to a buffer and return control, leaving a background thread to actually write it to disk
- performance: exploiting multiple processors
- performance: managing I/O devices
4.1 Hardware Support for Multithreading
- two copies of PC and Registers inside processor hardware:
- looks like two processors to software
- control logic decides which thread to execute an instruction from next
4.2 Threading Models
- POSIX Threads (pthreads): `#include <pthread.h>`, a low-level interface giving fine-grained control (see the sketch below)
- Fork-Join model (OpenMP): `#include <omp.h>`, a higher-level interface for the fork-join approach
- pthreads can do fork-join manually
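A minimal sketch of manual fork-join with pthreads (the thread count and `worker` function are made up for illustration; compile with `-pthread`):

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 2   /* arbitrary count for this sketch */

/* work done by each thread; the argument identifies the thread */
void *worker(void *arg) {
    long id = (long) arg;
    printf("hello from thread %ld\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];

    /* "fork": start each thread running worker concurrently */
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *) i);

    /* "join": wait for every thread to finish before continuing */
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    return 0;
}
```

OpenMP expresses the same fork-join pattern with a `#pragma omp parallel` region instead of explicit create/join calls.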
5 Instruction-Level Parallelism
- steps to execute the next x86-64 instruction:
- fetch the instruction: `400540: 48 01 fe`
- decode the instruction: `addq %rdi, %rsi`
- gather the data values: read `R[%rdi]` and `R[%rsi]`
- perform the operation: calculate `R[%rdi] + R[%rsi]`
- store the result: save into `%rsi`
- then fetch the next instruction, and repeat
- each of these steps is roughly associated with a part of the processor
- each part can work independently
5.1 By Analogy to Doing Laundry
- washer, dryer, folding, and putting away all take 30 minutes
- full task takes 2 hours
- four loads (one per person) done in sequence take 8 hours
- but the washer and dryer sit idle for 1.5 hours at a time
- we can start the second load as soon as the first load is done with the washer
- pipelined laundry takes 3.5 hours for 4 loads
- does not improve the latency of a single load (still takes 2 hours), but improves throughput
- processors pipeline instruction execution steps in a similar way
6 Why Concurrency?
- a multi-threaded program's execution depends on how instructions are interleaved
- threads sharing data may be affected by race conditions (or data races)
| Thread A | Thread B |
|---|---|
| x = 1 |  |
|  | x = 2 |
- What are the final possible values of x? Result depends on which thread "wins."
- Suppose y = 12 initially:
| Thread A | Thread B |
|---|---|
| x = y + 1 |  |
|  | y = y * 2 |
- What are the final possible values of x? Either 13 or 25, depending on which thread goes first.
- Suppose x = 0 initially:
| Thread A | Thread B |
|---|---|
| x = x + 1 |  |
|  | x = x + 2 |
- What are the final possible values of x? Remember that register values are preserved for each thread (a runnable sketch of this race follows the table):
| Interleaving 1 | Interleaving 2 | Interleaving 3 |
|---|---|---|
| A: load x into r1 | A: load x into r1 | A: load x into r1 |
| A: add 1 to r1 | B: load x into r1 | B: load x into r1 |
| A: store r1 in x | A: add 1 to r1 | A: add 1 to r1 |
| B: load x into r1 | B: add 2 to r1 | B: add 2 to r1 |
| B: add 2 to r1 | A: store r1 in x | B: store r1 in x |
| B: store r1 in x | B: store r1 in x | A: store r1 in x |
- interleaving 1 results in x == 3
- interleaving 2 results in x == 2
- interleaving 3 results in x == 1
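A runnable sketch of this race using pthreads (names and the loop count are made up for illustration); each update is the non-atomic load/add/store sequence above, so lost updates make the final total vary from run to run. Compile with `-pthread`; building without optimization shows the lost updates most clearly:

```c
#include <pthread.h>
#include <stdio.h>

long x = 0;                     /* shared, unsynchronized counter */

/* each thread performs many non-atomic read-modify-write updates */
void *add_many(void *arg) {
    long amount = (long) arg;
    for (int i = 0; i < 1000000; i++)
        x = x + amount;         /* load x, add, store: can interleave badly */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, add_many, (void *) 1L);   /* thread A adds 1 */
    pthread_create(&b, NULL, add_many, (void *) 2L);   /* thread B adds 2 */
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* expected 3,000,000, but lost updates usually leave it smaller */
    printf("x = %ld\n", x);
    return 0;
}
```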
7 Synchronization
- prevent bad interleavings of read/write operations by forcing threads to coordinate their accesses
- identify critical section of code where only one thread should operate at a time
- pseudocode:
```
check lock        // loop/idle here if locked
acquire lock
CRITICAL SECTION (e.g., change shared variables)
release lock
```
- problem:
| Thread A | Thread B |
|---|---|
| read lock (check) |  |
|  | read lock (check) |
| conditional jump |  |
|  | conditional jump |
| write lock (acquire) |  |
|  | write lock (acquire) |
| CRITICAL SECTION… |  |
|  | CRITICAL SECTION… |
- we need hardware support to make checking and acquiring the lock an atomic operation (impossible to interrupt)
- we want to identify the minimal critical section to maintain parallelism
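A sketch of protecting that critical section with a pthread mutex, reusing the illustrative counter example from above; `pthread_mutex_lock`/`pthread_mutex_unlock` make checking and acquiring the lock atomic, relying on the hardware support mentioned above:

```c
#include <pthread.h>
#include <stdio.h>

long x = 0;                                   /* shared counter */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *add_many(void *arg) {
    long amount = (long) arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);            /* acquire: only one thread may proceed */
        x = x + amount;                       /* CRITICAL SECTION kept minimal */
        pthread_mutex_unlock(&lock);          /* release so the other thread can enter */
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, add_many, (void *) 1L);
    pthread_create(&b, NULL, add_many, (void *) 2L);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("x = %ld\n", x);                   /* now reliably 3000000 */
    return 0;
}
```

Keeping only the shared update inside the lock follows the point above: the smaller the critical section, the more of the work can still run in parallel.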