CS 332 w22 — Advanced Scheduling

1 Scheduling Sequential Tasks on a Multiprocessor

  • What would happen if we used MFQ on a multiprocessor?
    • Contention for scheduler spinlock
    • Cache slowdown due to the ready-list data structure ping-ponging between CPUs' caches
    • Limited cache reuse: a thread's data from the last time it ran is often still in its old processor's cache, but a single shared queue may dispatch the thread to a different processor

1.1 Solution: Per-Processor Affinity Scheduling

  • Each processor has its own scheduler
    • Protected by a per-processor spinlock
  • Put a thread back on the ready list of the processor where it most recently ran
    • Ex: when I/O completes, or on Condition->signal
  • Idle processors can steal work from other processors
  • Decreased lock contention and better cache performance (see the sketch below)

perProcessor.png
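
A minimal Python sketch of this design, assuming invented CPU and Scheduler classes (no real kernel API): each processor keeps its own locked ready list, and an idle processor steals from a victim's list.

    import collections
    import random
    import threading

    class CPU:
        def __init__(self, cpu_id):
            self.id = cpu_id
            self.lock = threading.Lock()      # stand-in for the per-processor spinlock
            self.ready = collections.deque()  # this processor's ready list

    class Scheduler:
        def __init__(self, num_cpus):
            self.cpus = [CPU(i) for i in range(num_cpus)]

        def make_ready(self, thread, last_cpu):
            # Affinity: requeue the thread where it most recently ran,
            # since its working set may still be in that processor's cache.
            cpu = self.cpus[last_cpu]
            with cpu.lock:
                cpu.ready.append(thread)

        def pick_next(self, cpu_id):
            cpu = self.cpus[cpu_id]
            with cpu.lock:
                if cpu.ready:
                    return cpu.ready.popleft()
            # Idle: steal from the tail of another processor's list,
            # taking only that victim's lock (no global lock needed).
            for victim in random.sample(self.cpus, len(self.cpus)):
                if victim is not cpu:
                    with victim.lock:
                        if victim.ready:
                            return victim.ready.pop()
            return None  # nothing runnable anywhere

Stealing from the opposite end of the victim's list is a common heuristic: it reduces contention with the victim and takes the entries least likely to still be warm in its cache.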

2 Scheduling Parallel Programs

  • What happens if one thread gets time-sliced while other threads from the same program are still running?
    • Assuming the program uses locks and condition variables, it will still be correct
  • What about performance?
    • The overall program performance may be limited by the slowdown of one thread on one processor
    • Multiply this by numerous parallel programs running at the same time, and the system slows to a crawl

2.1 Oblivious Scheduling

Oblivious Scheduling: each processor time-slices its ready list independently of the other processors, with no coordination to run threads belonging to the same process at the same time

oblivious.png

2.1.1 Bulk Synchronous Delay

  • Bulk Synchronous Parallelism is a commonly used technique
    • Split work up into equal-sized chunks
    • Loop at each processor:
      • Compute on local data (in parallel)
      • Barrier (synchronize)
      • Send (selected) data to other processors (in parallel)
      • Barrier (synchronize)
    • Examples:
      • MapReduce
      • Fluid flow over a wing
    • Most parallel algorithms can be recast in BSP, sacrificing only a small constant factor in performance
  • However, the slowest processor becomes a bottleneck: every other processor waits at the barrier for it, so time-slicing even one thread delays the whole program (see the sketch below)

bsp.png
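
A minimal sketch of this loop using Python threads; NUM_WORKERS, compute_on_local_data, and exchange_data are invented placeholders for real work.

    import threading

    NUM_WORKERS = 4
    ROUNDS = 3
    barrier = threading.Barrier(NUM_WORKERS)

    def compute_on_local_data(rank, step):
        pass  # placeholder: compute phase on this worker's chunk

    def exchange_data(rank, step):
        pass  # placeholder: send selected data to other workers

    def worker(rank):
        for step in range(ROUNDS):
            compute_on_local_data(rank, step)  # compute on local data (in parallel)
            barrier.wait()                     # barrier (synchronize)
            exchange_data(rank, step)          # send data to others (in parallel)
            barrier.wait()                     # barrier (synchronize)
            # If any one worker is time-sliced out during its compute
            # phase, all the others stall at the first barrier.

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()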

2.1.2 Producer-Consumer Delay

We've also discussed producer-consumer patterns—the results of one thread are fed to the next thread.

prodCon.png

Preempting a thread in the middle of the chain can stall all of the processors in the chain.
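
A minimal sketch of such a chain, assuming invented stage functions and small bounded queues between the threads.

    import queue
    import threading

    def stage(in_q, out_q, work):
        # One link in the chain: consume from in_q, produce into out_q.
        while True:
            item = in_q.get()
            if item is None:          # sentinel: propagate shutdown downstream
                if out_q is not None:
                    out_q.put(None)
                return
            result = work(item)
            if out_q is not None:
                out_q.put(result)
            # If this thread is preempted while the downstream queues are
            # empty, every later stage blocks until it runs again.

    q1, q2, q3 = (queue.Queue(maxsize=4) for _ in range(3))
    threads = [
        threading.Thread(target=stage, args=(q1, q2, lambda x: x + 1)),
        threading.Thread(target=stage, args=(q2, q3, lambda x: x * 2)),
        threading.Thread(target=stage, args=(q3, None, print)),
    ]
    for t in threads:
        t.start()
    for i in range(8):
        q1.put(i)
    q1.put(None)                      # shut the whole chain down
    for t in threads:
        t.join()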

2.2 Solution: Gang Scheduling

  • Designate set of threads that must run together
    • e.g., all the threads from one process
  • Modern OSes provide mechanisms for this, e.g., restricting an application to a reserved set of processors (see the sketch below)
  • Works well when there's one primary application that should take priority
    • A single-purpose web server
    • A database needing precise control over thread assignment
  • Suffers when trying to share reserved sets of processors across multiple applications
    • Especially when not all applications can take advantage of using all the processors at once
    • Implication: usually more efficient to run two parallel programs each with half the processors than to time slice them onto all the processors

cosched.png
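
On Linux, one such mechanism is CPU affinity. A sketch (Linux-only) that restricts the current process to a reserved set of processors, assuming those CPUs have been set aside for this application:

    import os

    RESERVED_CPUS = {0, 1, 2, 3}   # assumption: these CPUs are reserved for this app

    os.sched_setaffinity(0, RESERVED_CPUS)   # pid 0 means the calling process
    print("restricted to CPUs:", os.sched_getaffinity(0))

Affinity alone only confines the threads to the set; actually running them simultaneously also requires that nothing else be scheduled on those CPUs (e.g., via Linux cpusets or isolated cores).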

2.3 Solution: Space Sharing

  • Assign processors to specific programs
  • Use scheduler activations (upcalls from the kernel to the user-level program) to inform the program of its available processors
    • The kernel makes an asynchronous upcall to the process when a processor is added to its allocation or taken away (see the sketch below)

spaceshare.png
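
A sketch of the user-level side, assuming a hypothetical on_allocation_changed upcall handler; real scheduler-activation interfaces are kernel-specific.

    import threading

    workers, stop_flags = [], []

    def do_some_work():
        pass  # placeholder for the application's parallel work

    def worker_loop(stop):
        while not stop.is_set():
            do_some_work()

    def on_allocation_changed(num_processors):
        # Called (hypothetically) when the kernel grows or shrinks our
        # processor allocation; resize the worker pool to match.
        while len(workers) < num_processors:
            stop = threading.Event()
            t = threading.Thread(target=worker_loop, args=(stop,))
            stop_flags.append(stop)
            workers.append(t)
            t.start()
        while len(workers) > num_processors:
            stop_flags.pop().set()   # ask the extra worker to exit
            workers.pop().join()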

3 Energy-Aware Scheduling

  • When battery life is a critical resource (modern phones, laptops, etc.), we can no longer assume the processor runs at a fixed speed with fixed energy needs
  • Put the performance-energy tradeoff under OS control by including two kinds of processors: a fast, high-power one and a slow, low-power one
    • The scheduler can take power into account when deciding which to use for a given task (see the sketch below)
  • Turn off, or leave idle, cores and I/O devices we don't need at the moment
    • Ex: Wi-Fi is very energy intensive; many embedded systems activate it only at specific times or default to a low-power passive listening mode

responseTime-annotated-1.png

responseTime-annotated-2.png
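
A hypothetical sketch of that choice for a task with a deadline; the core speeds and power draws are invented numbers.

    FAST = {"speed_hz": 2.0e9, "watts": 4.0}   # fast, high-power core
    SLOW = {"speed_hz": 0.5e9, "watts": 0.5}   # slow, low-power core

    def energy_joules(core, work_cycles):
        return core["watts"] * (work_cycles / core["speed_hz"])

    def choose_core(work_cycles, deadline_sec):
        # Prefer the low-power core whenever it still meets the deadline.
        if work_cycles / SLOW["speed_hz"] <= deadline_sec:
            return SLOW
        return FAST

    # A 10^9-cycle task with a 3 s deadline fits on the slow core:
    # 1 J there versus 2 J on the fast core.
    core = choose_core(1e9, 3.0)
    print(energy_joules(core, 1e9), energy_joules(FAST, 1e9))

The same deadline test generalizes to dynamic voltage/frequency scaling: run just fast enough to finish on time, since energy per unit of work typically grows with speed.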

4 Linux Completely Fair Scheduler (CFS)

  • CFS is not based on queues; instead it implements fair-share scheduling
    • It is one of several Linux scheduling algorithms; by contrast, BSD's ULE scheduler does use queues
  • Each task t has a weight, \(w(t)\)
  • Each runnable task \(t\) should acquire CPU time at rate \(\frac{w(t)}{\sum_j w(j)}\), where the sum is over all runnable tasks \(j\)
  • Keep track of accumulated weighted runtime vs. fair share amount
  • Over a fixed interval, try to run each runnable task at least once
  • Set time slice according to its fair share of interval, based on weights
  • The next task to run is the one whose accumulated runtime is most behind its fair share
  • Since we aren't just taking the first item of a queue as in MFQ, we need an efficient data structure for finding the next task to run
    • Red-black tree (a balanced binary search tree), ordered by each task's accumulated weighted runtime; a simplified sketch follows
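
A simplified sketch of this accounting: real CFS orders tasks in a red-black tree keyed on weighted runtime, but Python's heapq gives the same pick-the-most-behind behavior for illustration.

    import heapq

    class Task:
        def __init__(self, name, weight):
            self.name = name
            self.weight = weight
            self.vruntime = 0.0   # accumulated runtime, scaled down by weight

    def run(tasks, interval_ms=100.0, rounds=6):
        total_weight = sum(t.weight for t in tasks)
        heap = [(t.vruntime, i, t) for i, t in enumerate(tasks)]
        heapq.heapify(heap)
        for _ in range(rounds):
            _, i, t = heapq.heappop(heap)      # task furthest behind its fair share
            slice_ms = interval_ms * t.weight / total_weight  # w(t) / sum of w(j)
            t.vruntime += slice_ms / t.weight  # heavier tasks accrue vruntime more slowly
            heapq.heappush(heap, (t.vruntime, i, t))
            print(f"run {t.name} for {slice_ms:.1f} ms")

    # With weights 1 and 2, B receives twice A's CPU time over any interval.
    run([Task("A", 1), Task("B", 2)])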