Beyond CS 334
1 Distributed Databases
1.1 Architecture
- what is shared?
- shared everything (single node)
- shared memory: nodes access shared memory and disk (uncommon, too much overhead to access memory over the network)
- shared disk: nodes have their own memory, access a shared disk over the network
- shared nothing: each node has its own disk and memory, communication/coordination still occurs over the network
- what can nodes do?
- homogeneous nodes: all nodes perform the same tasks
- heterogeneous nodes: nodes have specialized roles/tasks
1.2 Partitioning
A distributed system must partition the database across multiple resources (disks, nodes, processors); this is sometimes called sharding in NoSQL systems. The goal is to maximize the number of transactions/queries that only need to access a single node.
- naive data partitioning: assign one table per node
- horizontal partitioning: distribute tuples across partitions according to some partition key (see the sketch below)
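A minimal sketch of hash-based horizontal partitioning (the `route` helper, key format, and node count are illustrative assumptions). Note the weakness it has when nodes are added or removed, which is what consistent hashing, below, addresses:

```python
import hashlib

NUM_NODES = 4  # assumed fixed cluster size for this sketch

def route(partition_key: str) -> int:
    """Map a tuple's partition key to a node (hash-based horizontal partitioning)."""
    digest = hashlib.sha1(partition_key.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

# Tuples with the same partition key land on the same node, so a
# single-key query only needs to touch one node. But if NUM_NODES
# changes, almost every key is remapped.
node_for_user = route("user:42")
```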
1.2.1 Consistent Hashing
- Goal: balance data across partitions
- Challenge: under standard hashing (e.g., \(\mathrm{hash}(o) \bmod n\)), when the number of partitions \(n\) changes, nearly all objects must move to new partitions
- Instead, use the hash function to map both partitions and objects onto a circle (e.g., \(\mathrm{hash}(o) \bmod 360\)), and assign each object to the nearest partition in clockwise order
- Now, when a partition is removed or a new one is added, only a small subset of the objects need to move (see the sketch below)
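A minimal Python sketch of this scheme, using the 360-position circle from above; the node names are hypothetical, and refinements such as virtual nodes and collision handling are omitted:

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 360  # positions on the circle, as in hash(o) mod 360

def ring_pos(name: str) -> int:
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % RING_SIZE

class ConsistentHashRing:
    def __init__(self, nodes):
        # Sorted (position, node) pairs so we can search clockwise.
        self.ring = sorted((ring_pos(n), n) for n in nodes)
        self.positions = [p for p, _ in self.ring]

    def lookup(self, obj_key: str) -> str:
        # First node at or after the object's position, wrapping around.
        i = bisect_left(self.positions, ring_pos(obj_key))
        return self.ring[i % len(self.ring)][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.lookup("user:42")
# Adding "node-d" only remaps objects that fall between node-d and its
# predecessor on the circle; all other objects keep their owner.
ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
```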
1.3 OLTP vs OLAP
As with single-node databases, distributed databases use different designs to support OLTP versus OLAP workloads. For OLTP, a major concern is supporting transactions that span multiple nodes (see below). For OLAP, since queries often touch a lot of data, minimizing the amount of data transferred between nodes is important.
1.4 Transactions
Executing a transaction that accesses multiple nodes is challenging because all of those nodes need to successfully commit the transaction. The DBMS must implement an additional protocol to ensure ACID guarantees (although many distributed systems instead provide the weaker BASE guarantees: Basically Available, Soft state, Eventual consistency).
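The notes above don't name a particular protocol; the classic one for atomic commitment across nodes is two-phase commit (2PC). A minimal sketch, with an in-process `Participant` class standing in for what would really be remote nodes reached over the network:

```python
class Participant:
    """Stand-in for a remote node; a real system would issue RPCs instead."""
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit

    def prepare(self, txn_id):
        # Durably log a prepare record, then vote yes/no.
        return self.can_commit

    def commit(self, txn_id):
        print(f"{self.name}: commit {txn_id}")

    def abort(self, txn_id):
        print(f"{self.name}: abort {txn_id}")

def two_phase_commit(txn_id, participants):
    """Commit on every node or on none of them."""
    # Phase 1 (prepare): collect votes from all participants.
    votes = [node.prepare(txn_id) for node in participants]
    # Phase 2 (commit/abort): broadcast the coordinator's decision.
    outcome = "commit" if all(votes) else "abort"
    for node in participants:
        getattr(node, outcome)(txn_id)
    return outcome

two_phase_commit("txn-1", [Participant("n1"), Participant("n2", can_commit=False)])
# -> both nodes abort, preserving atomicity
```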
2 In-Memory Databases
The first DBMSs in the 1970s were designed in an environment with the following characteristics:
- Uniprocessor (single core CPU)
- RAM was severely limited
- Database had to be stored on disk
- Disk is slow. Seriously slow
But now DRAM capacities are large and inexpensive enough that most structured databases (gigabytes to low terabytes) fit entirely in memory.
If disk I/O is no longer the slowest resource, much of the DBMS architecture will have to change to account for other bottlenecks:
- Locking/latching
- Cache-line misses
- Following long chains of pointers
- Predicate evaluation
- Data movement and copying
- Networking (between application and DBMS)
Data no longer needs to be stored contiguously, since random reads and writes are fast in memory.
The best strategy for executing a query plan changes when all the data is already in memory. Sequential scans are no longer significantly faster than random access, and the traditional tuple-at-a-time iterator model is too slow because of its per-tuple function call overhead (see the sketch below).
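A rough illustration of that overhead (the operators and data are hypothetical): the Volcano-style plan below makes a chain of Python function calls for every tuple, while the batch version evaluates the predicate over a whole column in one pass:

```python
# Tuple-at-a-time (Volcano/iterator model): one next() call chain per tuple.
class Scan:
    def __init__(self, rows):
        self.rows, self.i = rows, 0

    def next(self):
        if self.i < len(self.rows):
            row = self.rows[self.i]
            self.i += 1
            return row
        return None

class Filter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def next(self):
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None

rows = [{"id": i, "price": i % 100} for i in range(10_000)]
plan = Filter(Scan(rows), lambda r: r["price"] > 90)
count = 0
while plan.next() is not None:
    count += 1

# Batch/column-at-a-time alternative: amortizes the call overhead by
# processing the whole price column at once.
prices = [r["price"] for r in rows]
count_batch = sum(1 for p in prices if p > 90)
assert count == count_batch
```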
In-memory systems still need to write logs and checkpoints to disk in order to provide durability. This may change as new non-volatile memory technologies emerge.
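A toy sketch of that durability mechanism (the key-value store and log path are made up): each update is appended to a write-ahead log and flushed to disk before being applied in memory, and the log is replayed on restart; periodic checkpoints would bound how much log needs replaying:

```python
import json, os

class DurableKV:
    """In-memory key-value store made durable by a write-ahead log."""
    def __init__(self, log_path="wal.log"):
        self.data, self.log_path = {}, log_path
        self._recover()

    def _recover(self):
        # Rebuild the in-memory state by replaying the log.
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    record = json.loads(line)
                    self.data[record["key"]] = record["value"]

    def put(self, key, value):
        # Log first and force it to stable storage, then apply in memory.
        with open(self.log_path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.data[key] = value

db = DurableKV()
db.put("x", 1)  # survives a crash: recovered by replaying the log on restart
```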