Parallel Query Processing
For this topic we'll use the lecture from Andy Pavlo embedded below. Here's some terminology he uses you may not be familiar with:
- process: when an operating system runs a program, it creates a process to do so. The OS provides the process with access to memory and other resources. Within a process, one or more threads will execute the program. A lot of operating system functionality is built around processes. The key things to know are:
  - Processes are managed by the operating system
  - They are heavy-weight, meaning creating a new process involves significantly more overhead than, say, creating a new thread
  - A process can be single- or multi-threaded
- round robin: when sharing a resource or dividing up work, a round robin approach is like dealing cards—you go around giving a turn or handing out a piece to each thread
- RAID [1]: stands for Redundant Array of Independent Disks, a technique for combining multiple storage devices into a single logical unit. That is, the database still reads and writes to a single "disk", but underneath there are actually multiple devices. There are many different RAID arrangements, with names like RAID 0 and RAID 1, that offer different trade-offs between parallelism and data redundancy.
- symlink: short for symbolic link, this is like a desktop shortcut—a file that simply points to another file.
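To make the round robin idea concrete, here is a minimal sketch in Python. It is only an illustration of the card-dealing analogy, not code from the lecture; the function name `round_robin` and the notion of "pages" being handed to numbered workers are assumptions made for this example.

```python
def round_robin(items, num_workers):
    """Deal items out to workers one at a time, like dealing cards."""
    # One (initially empty) pile per worker. Hypothetical helper, not a DBMS API.
    assignments = [[] for _ in range(num_workers)]
    for i, item in enumerate(items):
        # Item i goes to worker i mod num_workers, so the turns cycle 0, 1, 2, 0, 1, ...
        assignments[i % num_workers].append(item)
    return assignments


if __name__ == "__main__":
    pages = list(range(10))  # pretend these are page numbers to scan
    for worker_id, chunk in enumerate(round_robin(pages, 3)):
        print(f"worker {worker_id}: {chunk}")
```

With 10 pages and 3 workers, worker 0 gets pages 0, 3, 6, 9, worker 1 gets 1, 4, 7, and worker 2 gets 2, 5, 8. Every worker ends up with roughly the same amount of work, which is the point of the approach.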
A few notes about video timestamps. The video below starts at 1:24. You can stop watching at 1:05:40, when he starts going over the topics for their midterm. The sections on Parallel vs Distributed Databases and Process Models are optional, so you can skip from 5:30 to 27:15 unless you are interested in those details.
Footnotes:
1. Andy does explain this, but in a fairly fast and brief way.