Joins Practice Problems

Let relations \(R(A, B, C)\) and \(S(C, D, E)\) have the following properties: \(R\) has 20,000 tuples, \(S\) has 45,000 tuples, 25 tuples of \(R\) fit on one page, and 30 tuples of \(S\) fit on one page. Estimate the number of disk I/Os required using each of the following join strategies for \(R \bowtie S\):¹
1. Nested-loop join.
2. Block nested-loop join.
3. Sort-merge join.
4. Hash join.
What is the difference between a basic hash join and a grace (hybrid) hash join?²
When is recursive partitioning necessary as part of performing a hash join?³
The indexed nested-loop join algorithm described in the video and section 15.5.3 of the textbook excerpt can be inefficient if the index is a secondary (nonclustered) index and there are multiple tuples with the same value for the join attributes. Why is that? Describe a way, using sorting, to reduce the cost of retrieving tuples of the inner relation. Under what conditions would this algorithm be more efficient than hybrid sort-merge join (section 15.5.4.3)?⁴
The hash-join algorithm as described in the video and section 15.5.5 of the textbook excerpt computes the natural join of two relations. Describe how to extend the hash-join algorithm to compute the natural left outer join, the natural right outer join, and the natural full outer join. (Hint: Keep extra information with each tuple in the hash index to detect whether any tuple in the probe relation matches the tuple in the hash index.)⁵

Footnotes:

\(R\) needs 800 pages, and \(S\) needs 1500 pages. Let us assume \(B\) pages of memory. If \(B > 802\), the join can easily be done in \(1500 + 800\) disk accesses using a simple nested loop join. So we consider only the case where \(B \le 800\).

Nested-loop join:
Using \(R\) as the outer relation we need \(800 + 20000 \times 1500 = 30000800\) I/Os. If \(S\) is the outer relation, we need \(1500 + 45000 \times 800 = 36001500\) I/Os.
Block nested-loop join:
If \(R\) is the outer relation, we need \(800 + \lceil\frac{800}{B-2}\rceil \times 1500\). If \(S\) is the outer relation, we need \(1500 + \lceil\frac{1500}{B-2}\rceil \times 800\) I/Os.
Sort-merge join:
Assuming \(R\) and \(S\) are not initially sorted on the join key, the sorting cost is \(1500(2\lceil\log_{B-1}(1500/B)\rceil + 2) + 800(2\lceil\log_{B-1}(800/B)\rceil + 2)\). The merge cost is \(1500 + 800\).
Hash join:
Since \(R\) is smaller, we use it as the build relation and \(S\) as the probe relation. Assuming we don't need any recursive partitioning, the cost is \(3(1500 + 800) = 6900\) I/Os.

The basic hash join only builds a single hash table (for the outer relation) and then performs lookups (probes) into this table with tuples from the inner relation. The grace hash join accounts for the fact that a hash table for either relation may not fit entirely in memory. It builds a hash table for each relation, moving buckets out to disk as needed. It can then compute the join on a pair of buckets at a time.

In a grace hash join when a single hash table bucket gets too large to fit in memory, it can be recursively partitioned using additional hash functions until individual buckets can fit in memory.

⁴

If there are multiple tuples in the inner relation with the same value for the join attributes, each index lookup on that value may access many discontinuous pages of the inner relation (hence the inefficiency). To reduce this cost we can perform a join of the outer relation tuples with just the secondary index leaf entries (i.e., the key-pointer pairs), postponing the inner relation tuple retrieval. The result file obtained is then sorted on the inner relation addresses, allowing an efficient physical order scan to complete the join.

Hybrid sort-merge join requires the outer relation to be sorted, which the above algorithm does not (though it does require an index lookup for each tuple in the outer relation). If the outer relation is much larger than the inner relation, this index lookup cost will be less than the sorting cost.

⁵

For the probe relation tuple \(t_s\) under consideration, if no matching tuple is found in the build relation's hash partition, it is padded with nulls and included in the result. This will give us the natural right outer join \(t_r ⟖ t_s\). To get the natural left outer join \(t_r ⟕ t_s\), we can keep a boolean flag with each tuple in the current build relation partition \(R_i\) residing in memory, and set it whenever any probe relation tuple matches with it. When we are finished with \(R_i\), all the tuples in it which do not have their flag set are padded with nulls and included in the result. To get the natural full outer join, we do both the above operations together.