Parallel Query Processing Practice Problems
What is the difference between inter-operator and intra-operator parallelism? Describe how each could improve the performance of the following query.1
SELECT artist.name, COUNT(DISTINCT language.name) FROM artist, artist_credit_name, artist_credit, release, language WHERE artist.id = artist_credit_name.artist AND artist_credit_name.artist_credit = artist_credit.id AND artist_credit.id = release.id AND release.language = language.id GROUP BY artist.name HAVING COUNT(DISTINCT language.name) > 1 ORDER BY COUNT(DISTINCT language.name) DESC;
- What form of parallelism (inter-query, inter-operator, or intra-operator) is likely the most important for each of the following tasks?2
- Increasing the throughput of a system with many small queries
- Increasing the throughput of a system with a few large queries when the number of disks and processors is large
- Suppose a large relation is horizontally partitioned onto two or more disks. Of the following operators, which would require additional work beyond simply concatenating the results of applying the operator to each parition? Consider a situation where the data is partitioned arbitrarily and a situation where the data is partitioned according to the value of the attribute being operated on (i.e., all the tuples with a particular value will be in the same partition).3
- Sequential scan (potentially with a filter)
- Distinct
- Sum
- Order by
- Projection
- How are partitioning and parallelism related? What is the potential problem if we have operator parallelism without any partitioning?4
- What kind of concurrency control might we need for each type of parallelism?5
- inter-query
- inter-operator
- intra-operator
Footnotes:
Inter-operator parallelism is when different operators within the query plan each have their own thread and run in parallel. Intra-operator parallelism is when an individual operator is made multi-threaded (i.e., the work of that operator is split among multiple threads). Inter-operator parallelism could improve that query by running the table or index scans of the five input relations in parallel. Intra-operator parallelism could speed up the these scans, the join, and/or the sorting by splitting the work among multiple threads.
- When there are many small queries, interquery parallelism gives good throughput. Parallelizing each of these small queries would increase the overhead of the query, without any significant reduction in response time.
- With a few large queries, intra-operator parallelism is essential to get fast response times. Given that there are large numbers of processors and disks, only intra-operator parallelism can take advantage of the parallel hardware (queries typially have few operators, but each one needs to proess a large number of tuples).
- No additional work in either case
- Would need to re-processes the data to eliminate duplicates across partitions in the first case. In the second case, no additional work would be needed.
- In both cases, the sums from the partitions would need to be added together
- In the first case, the sorted runs from each partition would need to be merged together. In the second case, it depends on whether the partitioning was done by sorting or by hashing. With sorting, nearby values will be in the same partition, and so we could just concatenate the output from each parition. With hashing, we can't make that assumption, and so we would need to do the extra work of merging the runs.
- No additional work needed
If queries or operators are run in parallel, there may be multiple simultaneous disk accesses. This could cause the disk to become a bottleneck (thus negating the performance benefits of parallelism), and potentially result in a lot of random I/O as the disk jumps back and forth between different requests. Partitioning the data across multiple disks can facilitate good performance for simultaneous accesses.
- The shared database data structures such as indexes, tuples, buffer pools, disk managers, etc. would need to be made thread-safe in order to allow concurrent access by multiple queries.
- Any shared buffers for passing tuples between operators running in parallel would need to be thread-safe.
- Temporary intra-operator data structures such as an output buffer or a hash table would need to be thread-safe.