Query Planning and Optimization Practice Problems

Connect to the large-university Postgres database on awb66333.mathcs.carleton.edu¹. Use the EXPLAIN feature to view the plan chosen by the database in the each of the following cases.
1. Write a query with an equality condition on student.name (which does not have an index), and view the plan chosen.
2. Write a query with an equality condition on student.id (which does have an index), and view the plan chosen.
3. Create simple queries joining two relations, or three relations, and view the plans chosen.
4. Create a query that computes an aggregate with grouping, and view the plan chosen.
5. Create an SQL query that uses a NOT IN clause, with a subquery using aggregation. Observe what plan is chosen.

Draw the optimitized logical plan for the following query using the rules from the lecture video:²

SELECT artist.name
FROM artist, artist_type, area
WHERE artist.area = area.id
AND artist.id = artist_type.id
AND area.name = 'Japan'
AND artist_type.name = 'Orchestra';

Why are statistics like histograms useful for query optimization?³
Suppose you want to get answers to r ⋈ s sorted on an attribute of r, and want only the top K answers for some relatively small K. Give a good way of evaluating the query:⁴
- When the join is on a foreign key of r referencing s, where the foreign key attribute is declared to be not null.
- When the join is not on a foreign key.
Consider a relation r(A, B, C), with an index on attribute A. Give an example of a query that can be answered by using the index only, without looking at the tuples in the relation. (Query plans that use only the index, without accessing the actual relation, are called index-only plans.)⁵

Footnotes:

Download pgAdmin, and create a new server with this information:

hostname: awb66333.mathcs.carleton.edu
database: large-university
username: lab334
password: strangelove

You will need to be on campus or on the Carleton VPN to connect. Go to Tools>Query Tool to bring up a text box where you can enter SQL queries. Use the highlighted buttons to run, explain, and explain analyze your query:

We can use EXPLAIN ANALYZE to have Postgres display the query plan and then execute it and report the actual performance.

                                  PROJECT(name)
                                       |
                                 JOIN(area=id)
                  ._____________/             \______.
                  |                                   \
             JOIN(id=id)                                \
            /          \________.                         \
           /                    |                           \
          /                  PROJECT(id)                PROJECT(id)
         /                         |                        |
PROJECT(name, area, id)    SELECT(name='Orchestra')    SELECT(name='Japan')
     /                             |                          \
  artist                      artist_type                    area

To reduce the number of results/tuples a query copies between operators, we want to apply more selective predicates first. This requires an estimate of predicate selectivity, which depends on the distribution of the underlying data. Histograms provide this distribution (or an approximation of it). In general, the number of tuples involved in various operators will be a key component of a cost model.

⁴

Sort r and collect the top K tuples. These tuples are guaranteed to be contained in r ⋈ s since the join is on a foreign key of r referencing s.
Execute r ⋈ s using a standard join algorithm until the first K results have been found. After K tuples have been computed in the result set, continue executing the join, but immediately discard any tuples from r that have attribute values less than all of the tuples in the result set. If a newly joined tuple t has an attribute value greater than at least one of the tuples in the result set, replace the lowest-valued tuple in the result set with t.

⁵

Any query that only involves the attribute A of r can be executed by only using the index. For example, the query

SELECT SUM(A)
FROM r

only needs to use the values of A, and thus does not need to look at r.