Sorting & Aggregation Practice Problems

  1. Why does a query planner need to be aware of the amount of available memory? Give an example of a decision this would impact.1
  2. Assume (for simplicity in this problem) that only one tuple fits in a page and memory holds at most three pages. Show the runs created on each pass of the sort-merge algorithm when applied to sort the following tuples on the first attribute: (kangaroo, 17), (wallaby, 21), (emu, 1), (wombat, 13), (platypus, 3), (lion, 8), (warthog, 4), (zebra, 11), (meerkat, 6), (hyena, 9), (hornbill, 2), (baboon, 12).2
  3. Why is it not desirable to force users to make an explicit choice of a query-processing strategy? Are there cases in which it is desirable for users to be aware of the costs of competing query-processing strategies? Explain your answer.3
  4. Suppose you have to compute a group by key \(A\), sum of attribute \(C\) as well as group by key \((A,B)\), sum of attribute \(C\). Describe how to compute these together using a single sorting of the input relation.4

Footnotes:

1

The query planner needs to choose a plan that fits within the available memory. For example, an in-memory sorting algorithm like quicksort will be faster than an external merge sort, but only if all the data fits in memory at once.

2

We will refer to the tuples (kangaroo, 17) through (baboon, 12) using tuple numbers \(t_1\) through \(t_{12}\). We refer to the jth run used by the ith pass, as \(r_{ij}\). The initial sorted runs have three pages each. They are:

\begin{align*} r_{11} &= \{t_3, t_1, t_2\}\\ r_{12} &= \{t_6, t_5, t_4\}\\ r_{13} &= \{t_9, t_8, t_7\}\\ r_{14} &= \{t_{12}, t_{11}, t_{10}\} \end{align*}

Each pass merges three runs. Therefore the runs after the end of the fist pass are:

\begin{align*} r_{21} &= \{t_3, t_1, t_6, t_9, t_5, t_2, t_7, t_4, t_8\}\\ r_{22} &= \{t_{12}, t_{11}, t_{10}\} \end{align*}

At the end of the seond pass, the tuples are completely sorted into one run:

\begin{align*} r_{31} = \{t_{12}, t_3, t_{11}, t_{10}, t_1, t_6, t_9, t_5, t_2, t_7, t_4, t_8\} \end{align*}
3

In general it is not desirable to force users to choose a query-processing strategy because users might select an inefficient strategy. The reason users would make poor choices about processing queries is that they would not know how a relation is stored, nor about its indices. It is unreasonable to force users to be aware of these details since ease of use is a major goal of database query languages. If users are aware of the costs of different strategies, they could write queries efficiently, thus helping performance. This could happen if experts were using the system.

4

Run the sorting operation on r, grouping by (A, B), as required for the second result. When evaluating the sum aggregate, keep running totals for both the (A, B) grouping as well as for just the A grouping.