CS 332 w22 — Advanced Scheduling

1 Scheduling Sequential Tasks on a Multiprocessor

  • What would happen if we used MFQ on a multiprocessor?
    • Contention for scheduler spinlock
    • Cache slowdown due to the ready-list data structure ping-ponging between CPUs' caches
    • Limited cache reuse: a thread's data from the last time it ran is often still in its old processor's cache, but a single shared queue may dispatch the thread to a different processor

1.1 Solution: Per-Processor Affinity Scheduling

  • Each processor has its own scheduler
    • Protected by a per-processor spinlock
  • Put a thread back on the ready list of the processor where it most recently ran
    • Ex: when I/O completes, or on Condition->signal
  • Idle processors can steal work from other processors
  • Decreased lock contention and better cache performance (see the sketch below)

perProcessor.png
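
A minimal Python sketch of this design, assuming invented CPU and Scheduler classes (no real kernel API): each processor keeps its own locked ready list, and an idle processor steals from a victim's list.

    import collections
    import random
    import threading

    class CPU:
        def __init__(self, cpu_id):
            self.id = cpu_id
            self.lock = threading.Lock()      # stand-in for the per-processor spinlock
            self.ready = collections.deque()  # this processor's ready list

    class Scheduler:
        def __init__(self, num_cpus):
            self.cpus = [CPU(i) for i in range(num_cpus)]

        def make_ready(self, thread, last_cpu):
            # Affinity: requeue the thread where it most recently ran,
            # since its working set may still be in that processor's cache.
            cpu = self.cpus[last_cpu]
            with cpu.lock:
                cpu.ready.append(thread)

        def pick_next(self, cpu_id):
            cpu = self.cpus[cpu_id]
            with cpu.lock:
                if cpu.ready:
                    return cpu.ready.popleft()
            # Idle: steal from the tail of another processor's list,
            # taking only that victim's lock (no global lock needed).
            for victim in random.sample(self.cpus, len(self.cpus)):
                if victim is not cpu:
                    with victim.lock:
                        if victim.ready:
                            return victim.ready.pop()
            return None  # nothing runnable anywhere

Stealing from the opposite end of the victim's list is a common heuristic: it reduces contention with the victim and takes the entries least likely to still be warm in its cache.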

2 Scheduling Parallel Programs

  • What happens if one thread gets time-sliced while other threads from the same program are still running?
    • Assuming the program uses locks and condition variables, it will still be correct
  • What about performance?
    • The overall program performance may be limited by the slowdown of one thread on one processor
    • Multiply this by numerous parallel programs running at the same time, and the system slows to a crawl

2.1 Oblivious Scheduling

Oblivious Scheduling: each processor time-slices its ready list independently of the other processors, with no coordination to run threads belonging to the same process at the same time

oblivious.png

2.1.1 Bulk Synchronous Delay

  • Bulk Synchronous Parallelism is a commonly used technique
    • Split work up into equal-sized chunks
    • Loop at each processor:
      • Compute on local data (in parallel)
      • Barrier (synchronize)
      • Send (selected) data to other processors (in parallel)
      • Barrier (synchronize)
    • Examples:
      • MapReduce
      • Fluid flow over a wing
    • Most parallel algorithms can be recast in BSP, sacrificing only a small constant factor in performance
  • However, the slowest processor becomes a bottleneck: every other processor waits at the barrier for it, so time-slicing even one thread delays the whole program (see the sketch below)

bsp.png
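
A minimal sketch of this loop using Python threads; NUM_WORKERS, compute_on_local_data, and exchange_data are invented placeholders for real work.

    import threading

    NUM_WORKERS = 4
    ROUNDS = 3
    barrier = threading.Barrier(NUM_WORKERS)

    def compute_on_local_data(rank, step):
        pass  # placeholder: compute phase on this worker's chunk

    def exchange_data(rank, step):
        pass  # placeholder: send selected data to other workers

    def worker(rank):
        for step in range(ROUNDS):
            compute_on_local_data(rank, step)  # compute on local data (in parallel)
            barrier.wait()                     # barrier (synchronize)
            exchange_data(rank, step)          # send data to others (in parallel)
            barrier.wait()                     # barrier (synchronize)
            # If any one worker is time-sliced out during its compute
            # phase, all the others stall at the first barrier.

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()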

2.1.2 Producer-Consumer Delay

We've also discussed producer-consumer patterns—the results of one thread are fed to the next thread.

prodCon.png

Preempting a thread in the middle of the chain can stall all of the processors in the chain.
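
A minimal sketch of such a chain, assuming invented stage functions and small bounded queues between the threads.

    import queue
    import threading

    def stage(in_q, out_q, work):
        # One link in the chain: consume from in_q, produce into out_q.
        while True:
            item = in_q.get()
            if item is None:          # sentinel: propagate shutdown downstream
                if out_q is not None:
                    out_q.put(None)
                return
            result = work(item)
            if out_q is not None:
                out_q.put(result)
            # If this thread is preempted while the downstream queues are
            # empty, every later stage blocks until it runs again.

    q1, q2, q3 = (queue.Queue(maxsize=4) for _ in range(3))
    threads = [
        threading.Thread(target=stage, args=(q1, q2, lambda x: x + 1)),
        threading.Thread(target=stage, args=(q2, q3, lambda x: x * 2)),
        threading.Thread(target=stage, args=(q3, None, print)),
    ]
    for t in threads:
        t.start()
    for i in range(8):
        q1.put(i)
    q1.put(None)                      # shut the whole chain down
    for t in threads:
        t.join()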

2.2 Solution: Gang Scheduling

  • Designate set of threads that must run together
    • e.g., all the threads from one process
  • Modern OSes provide mechanisms for this, e.g., restricting an application to a reserved set of processors (see the sketch below)
  • Works well when there's one primary application that should take priority
    • A single-purpose web server
    • A database needing precise control over thread assignment
  • Suffers when trying to share reserved sets of processors across multiple applications
    • Especially when not all applications can take advantage of using all the processors at once
    • Implication: usually more efficient to run two parallel programs each with half the processors than to time slice them onto all the processors

cosched.png
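
On Linux, one such mechanism is CPU affinity. A sketch (Linux-only) that restricts the current process to a reserved set of processors, assuming those CPUs have been set aside for this application:

    import os

    RESERVED_CPUS = {0, 1, 2, 3}   # assumption: these CPUs are reserved for this app

    os.sched_setaffinity(0, RESERVED_CPUS)   # pid 0 means the calling process
    print("restricted to CPUs:", os.sched_getaffinity(0))

Affinity alone only confines the threads to the set; actually running them simultaneously also requires that nothing else be scheduled on those CPUs (e.g., via Linux cpusets or isolated cores).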

2.3 Solution: Space Sharing

  • Assign processors to specific programs
  • Use scheduler activations (upcalls from the kernel to the user-level program) to inform the program of its available processors
    • The kernel makes an asynchronous upcall to the process when a processor is added to its allocation or taken away (see the sketch below)

spaceshare.png
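
A sketch of the user-level side, assuming a hypothetical on_allocation_changed upcall handler; real scheduler-activation interfaces are kernel-specific.

    import threading

    workers, stop_flags = [], []

    def do_some_work():
        pass  # placeholder for the application's parallel work

    def worker_loop(stop):
        while not stop.is_set():
            do_some_work()

    def on_allocation_changed(num_processors):
        # Called (hypothetically) when the kernel grows or shrinks our
        # processor allocation; resize the worker pool to match.
        while len(workers) < num_processors:
            stop = threading.Event()
            t = threading.Thread(target=worker_loop, args=(stop,))
            stop_flags.append(stop)
            workers.append(t)
            t.start()
        while len(workers) > num_processors:
            stop_flags.pop().set()   # ask the extra worker to exit
            workers.pop().join()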

3 Energy-Aware Scheduling

  • When battery life is a critical resource (modern phones, laptops, etc.), we can no longer assume the processor runs at a fixed speed with fixed energy needs
  • Put the performance-energy tradeoff under OS control by including two kinds of processors: a fast, high-power one and a slow, low-power one
    • The scheduler can take power into account when deciding which to use for a given task (see the sketch below)
  • Turn off, or leave idle, cores and I/O devices we don't need at the moment
    • Ex: Wi-Fi is very energy intensive; many embedded systems activate it only at specific times or default to a low-power passive listening mode

responseTime-annotated-1.png

responseTime-annotated-2.png
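
A hypothetical sketch of that choice for a task with a deadline; the core speeds and power draws are invented numbers.

    FAST = {"speed_hz": 2.0e9, "watts": 4.0}   # fast, high-power core
    SLOW = {"speed_hz": 0.5e9, "watts": 0.5}   # slow, low-power core

    def energy_joules(core, work_cycles):
        return core["watts"] * (work_cycles / core["speed_hz"])

    def choose_core(work_cycles, deadline_sec):
        # Prefer the low-power core whenever it still meets the deadline.
        if work_cycles / SLOW["speed_hz"] <= deadline_sec:
            return SLOW
        return FAST

    # A 10^9-cycle task with a 3 s deadline fits on the slow core:
    # 1 J there versus 2 J on the fast core.
    core = choose_core(1e9, 3.0)
    print(energy_joules(core, 1e9), energy_joules(FAST, 1e9))

The same deadline test generalizes to dynamic voltage/frequency scaling: run just fast enough to finish on time, since energy per unit of work typically grows with speed.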

4 Linux Completely Fair Scheduler (CFS)

  • CFS is not based on queues; instead it implements fair-share scheduling
    • It is one of several Linux scheduling algorithms; by contrast, BSD's ULE scheduler does use queues
  • Each task t has a weight, \(w(t)\)
  • Each runnable task \(t\) should acquire CPU time at rate \(\frac{w(t)}{\sum_j w(j)}\), where the sum is over all runnable tasks \(j\)
  • Keep track of accumulated weighted runtime vs. fair share amount
  • Over a fixed interval, try to run each runnable task at least once
  • Set time slice according to its fair share of interval, based on weights
  • The next task to run is the one whose accumulated runtime is most behind its fair share
  • Since we aren't just taking the first item of a queue as in MFQ, we need an efficient data structure for finding the next task to run
    • Red-black tree (a balanced binary search tree), ordered by each task's accumulated weighted runtime; a simplified sketch follows
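
A simplified sketch of this accounting: real CFS orders tasks in a red-black tree keyed on weighted runtime, but Python's heapq gives the same pick-the-most-behind behavior for illustration.

    import heapq

    class Task:
        def __init__(self, name, weight):
            self.name = name
            self.weight = weight
            self.vruntime = 0.0   # accumulated runtime, scaled down by weight

    def run(tasks, interval_ms=100.0, rounds=6):
        total_weight = sum(t.weight for t in tasks)
        heap = [(t.vruntime, i, t) for i, t in enumerate(tasks)]
        heapq.heapify(heap)
        for _ in range(rounds):
            _, i, t = heapq.heappop(heap)      # task furthest behind its fair share
            slice_ms = interval_ms * t.weight / total_weight  # w(t) / sum of w(j)
            t.vruntime += slice_ms / t.weight  # heavier tasks accrue vruntime more slowly
            heapq.heappush(heap, (t.vruntime, i, t))
            print(f"run {t.name} for {slice_ms:.1f} ms")

    # With weights 1 and 2, B receives twice A's CPU time over any interval.
    run([Task("A", 1), Task("B", 2)])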