Lab 6: k-d Tree¹

Assigned: Friday, February 18
Check-in Post: Before 9pm Thursday, February 24 (check-in forum)
Due: 9pm Wednesday, March 02
Collaboration: Partner assignment (groups of 2 people—you may also work alone)
Handout: starter project
Submit: Upload KdTree.java to Lab 6. If you attempt any Challenges upload a text file named challenges.txt describing what you did.
Reference:
- Binary Search Trees
- Tree Traversals
- BST.java
- Trees for Multidimensional Data
- VS Code Live Share—useful for remote pair programming in VS Code

Overview

In this lab you will implement a binary search tree for two-dimensional data. In particular, you will implement a k-d tree that uses 2D points as the keys (i.e., k = 2). This 2d-tree will support efficient range search (find all of the points contained in a query rectangle) and nearest-neighbor search (find a closest point to a query point). 2d-trees have numerous applications, ranging from classifying astronomical objects and computer animation to speeding up neural networks and data mining.

The goals for this lab are

Practice writing recursive tree operations
Implement a useful data structure that standard libraries (including Java’s) do not provide
Sharpen your programming skills by implementing some clever optimizations

Preparatory Exercises

Work through the practice problems from Trees for Multidimensional Data. There is a nice video walkthrough you can watch.
Read this entire writeup
Download the starter project, unzip it into its own folder, and open that folder in VS Code.

Suggested Timeline

Implement contains, get, and an initial version of put by Wednesday, 2/23. Test by by writing your own main method in KdTree.java.
Implement the points method by Friday, 2/25. Test using KdTreeVisualizer.java.
Implement the range method by Monday, 2/28. Test using RangeSearchVisualizer.java.
Implement the nearest method by Wednesday, 3/2. Test using NearestNeighborVisualizer.java.

Geometric Objects

You will use the Point2D and RectHV classes from algs4 to represent two-dimensional points and rectangles for this lab.

The Point2D class (part of algs4) represents points in the plane. Here is the subset of its API that you may use:

public class Point2D implements Comparable<Point2D> {

    // construct the point (x, y)
    public Point2D(double x, double y)

    // x-coordinate 
    public double x()

    // y-coordinate 
    public double y()

    // square of Euclidean distance between two points 
    public double distanceSquaredTo(Point2D that)

    // does this point equal that object? 
    public boolean equals(Object that)

    // string representation with format (x, y) 
    public String toString()

}

The class RectHV (part of algs4) represents axis-aligned rectangles. Here is the subset of its API that you may use:

public class RectHV {
    // construct the rectangle [xmin, xmax] x [ymin, ymax] 
    public RectHV(double xmin, double ymin, double xmax, double ymax)

    // minimum x-coordinate of rectangle 
    public double xmin()

    // minimum y-coordinate of rectangle 
    public double ymin()

    // maximum x-coordinate of rectangle 
    public double xmax()

    // maximum y-coordinate of rectangle 
    public double ymax()

    // does this rectangle contain the point p (either inside or on boundary)? 
    public boolean contains(Point2D p)

    // does this rectangle intersect that rectangle (at one or more points)? 
    public boolean intersects(RectHV that)

    // square of Euclidean distance from point p to closest point in rectangle 
    public double distanceSquaredTo(Point2D p)

    // does this rectangle equal that object? 
    public boolean equals(Object that)

    // string representation with format [xmin, xmax] x [ymin, ymax] 
    public String toString()

}

Here is a diagram of some of the geometric operations RectHV provides:

Specification

Modify the file KdTree.java that uses a 2d-tree to implement the API below (these methods are also included in the starter file). A 2d-tree is a generalization of a BST to two-dimensional keys. The idea is to build a BST with points in the nodes, using the x- and y-coordinates of the points as keys in strictly alternating sequence, starting with the x-coordinates.

public class KdTree<Value> {

    // construct an empty 2d-tree of points 
    public KdTree()

    // is the tree empty? 
    public boolean isEmpty()

    // number of points
    public int size()

    // associate the value val with point p
    public void put(Point2D p, Value val)

    // value associated with point p 
    public Value get(Point2D p)

    // does the tree contain point p? 
    public boolean contains(Point2D p)

    // all points in the tree 
    public Iterable<Point2D> points()

    // all points that are inside the rectangle (or on the boundary) 
    public Iterable<Point2D> range(RectHV rect)

    // a nearest neighbor of point p; null if the tree is empty 
    public Point2D nearest(Point2D p)
}

Search and insert. The algorithms for search and insert are similar to those for BSTs, but at the root we use the x-coordinate (if the point to be inserted has a smaller x-coordinate than the point at the root, go left; otherwise go right); then at the next level, use the y-coordinate (if the point to be inserted has a smaller y-coordinate than the point in the node, go left; otherwise go right); then at the next level, use the x-coordinate; and so forth.

Level-order traversal. The points() method must return the points in level order: first the root, then all children of the root (from left/bottom to right/top), then all grandchildren of the root (from left to right), and so forth. The level-order traversal of the above 2d-tree is (0.7, 0.2), (0.5, 0.4), (0.9, 0.6), (0.2, 0.3), (0.4, 0.7). This method is useful to assist you (when debugging) and us (when grading).

The prime advantage of a 2d-tree over a BST is that it supports efficient implementation of range search and nearest-neighbor search. Each node corresponds to an axis-aligned rectangle, which encloses all of the points in its subtree. The root corresponds to the entire plane; the left and right children of the root correspond to the two rectangles split by the x-coordinate of the point at the root; and so forth.

Range search. To find all points contained in a given query rectangle, start at the root and recursively search for points in both subtrees using the following pruning rule: if the query rectangle does not intersect the rectangle corresponding to a node, there is no need to explore that node (or its subtrees). That is, search a subtree only if it might contain a point contained in the query rectangle.
Nearest-neighbor search. To find a closest point to a given query point, start at the root and recursively search in both subtrees using the following pruning rule: if the closest point discovered so far is closer than the distance between the query point and the rectangle corresponding to a node, there is no need to explore that node (or its subtrees). That is, search a node only if it might contain a point that is closer than the best one found so far.

Implementation Advice

Node data type. There are several reasonable ways to represent a node in a 2d-tree. One approach is to include the point, a link to the left/bottom subtree, a link to the right/top subtree, and an axis-aligned rectangle corresponding to the node. (Note the code below does not include a constructor.)

private class Node {
   private Point2D p;     // the point
   private Value val;     // the tree maps the point to this value
   private RectHV rect;   // the axis-aligned rectangle corresponding to this node
   private Node lb;       // the left/bottom subtree
   private Node rt;       // the right/top subtree
}

Here are some suggestions for how you might make progress on KdTree.java:

Write isEmpty() and size(). These should be very easy.

Write a simplified version of put() which does everything except set up the RectHV for each node. Write the get() and contains() methods, and use these to test that put() was implemented properly. Note that put() and get() are best implemented by using private helper methods analogous to those found in BST.java. I recommend using the orientation (vertical or horizontal) as an argument to these helper methods (a boolean like isVertical works nicely).

A common error is to not rely on your base case (or cases). For example, compare the following two get() methods for searching in a BST:

private Value get(Node root, Key key) {
    if (root == null) {
        return null;
    }
    int cmp = key.compareTo(root.key);
    if (cmp < 0) {
        return get(root.left, key);
    } else if (cmp > 0) {
        return get(root.right, key);
    } else {
        return root.value;
    }
}

private Value overlyComplicatedGet(Node root, Key key) {
    if (root == null) {
        return null;
    }
    int cmp = key.compareTo(root.key);
    if (cmp < 0) {
        if (root.left == null) {
            return null;
        } else {
            return overlyComplicatedGet(root.left, key);
        }
    } else if (cmp > 0) {
        if (root.right == null) {
            return null;
        } else {
            return overlyComplicatedGet(root.right, key);
        }
    } else {
        return root.value;
    }
}

In the latter method, extraneous null checks are made that would otherwise be caught by the base case. Trust in the base case. Your method may have additional base cases, and code like this becomes harder and harder to read and debug.

Implement the points() method. Use this to check the structure of your k-d tree.
Add code to put() which sets up the RectHV for each Node.
Write the range() method. Test your implementation using RangeSearchVisualizer.java, which is described in the Testing section.
Write the nearest() method. This is the hardest method. I recommend doing it in two stages, testing after each.
- First, find the nearest neighbor without pruning. That is, always explore both subtrees. Moreover, always explore the left subtree before the right subtree.
- Next, implementing the pruining rule: if the closest point discovered so far is closer than the distance between the query point and the rectangle corresponding to a node, there is no need to explore that node (or its subtrees).
Test your implementation using NearestNeighborVisualizer.java, which is described in the Testing section.

Q&A

Is a point on the boundary of a rectangle considered inside it? Do two rectangle intersect if they have just one point in common? Yes and yes, consistent with the implementation of RectHV.java.

Can I use the distanceTo() method in Point2D and RectHV? No, you may use only the subset of the methods listed. You should be able to accomplish the same result (more efficiently) with distanceSquaredTo().

What should I do if a point is inserted twice in the data structure? The data structure represents a map, so you should replace the old value with the new value.

What should points() return if there are no points in the data structure? What should range() return if there are no points in the range? The API says to return an Iterable<Point2D>, so you should return an iterable with zero points.

What should nearest() return if there are two (or more) nearest points? The API does not specify, so you may return any nearest point.

Testing

Testing put() and points() in KdTree. The file KdTreeVisualizer.java reads a sequence of points from a file (given as a command-line argument) and draws the corresponding k-d tree. It does so by reconstructing the k-d tree from the level-order traversal returned by points(). Note that it assumes all points are inside the unit square (coordinates between 0 and 1). Included in the starter project are a number of sample input files and corresponding png files showing the expected visualization. The starter project is configure to prompt you for a command line argument when running any of the testing files. You should see a box pop up at the top of the window prompting you to enter program arguments. Enter the name of one of the .txt files included in the starter project (e.g., input10.txt). For a number of them, the expected output is included as a .png of the same name.

Testing range() and nearest() in KdTree. Often a good way to test code is to perform the same sequence of operations on a new implementation (your KdTree) and an existing, trusted implementation (the provided PointMap), and identify any discrepancies. PointMap is a simple but inefficient map for 2d points that uses Java’s TreeMap to implement all the same methods you are implementing for KdTree. The key idea is to implement a reference solution in which you have confidence. The simplicity of PointMap allows it to serve this purpose in your testing.

The file RangeSearchVisualizer.java reads a sequence of points from a file (given as a command-line argument) and draws them to standard drawing. It also highlights the points inside the rectangle that the user selects by dragging the mouse. Specifically, it colors red the points returned by the method range() in PointMap and blue the points returned by the method range() in KdTree.
The client NearestNeighborVisualizer.java reads a sequence of points from a file (given as a command-line argument) and draws them to standard drawing. It also highlights the point closest to the mouse. Specifically, it colors red the point returned by the method nearest() in PointMap and blue the point returned by the method nearest() in KdTree.

Warning: both of these clients will be slow for large inputs because (1) the methods in the PointMap implementation are slow and (2) drawing the points is slow.

To test whether you are pruning correctly in range and nearest, you can write code in your KdTree main method to compare the speed of KdTree and PointMap. Correct pruning will cause KdTree to be significantly faster. In my own tests, KdTree’s range is twice as fast without pruning, and more than 10 times as fast with pruning. nearest is more dramatic, outperforming PointMap by a factor of more than 1000 with the pruning rule described in Specification. The first optimization described in the Challenges section should make nearest a further 20 times faster.

How do should you measure the performance of a method? Here is one reasonable approach.

Read the file input1M.txt and insert those 1 million points into the k-d tree (and/or PointMap), using integers as the values.
Count the number of calls m to nearest (or whatever method you’re testing) with random points in the unit square that occur in t seconds (i.e., while(timer.elapsedTime() < t)). You can use StopwatchCPU.
Use (m/t) as an estimate of the number of calls per second.

To get a reliable estimate, choose a t that is neither negligible (e.g., less than 1 second) or astronomical (e.g., more than 1 hour). When measuring the CPU time, Do not include the time to read in the 1 million points or construct the k-d tree.

How do you generate a uniformly random point in the unit square? Make two calls to StdRandom.uniform(0.0, 1.0)—one for the x-coordinate and one for the y-coordinate.

Style

You are expected to submit files that contains no checkstyle errors or warnings. You will lose 0.5 points per error (up to a maximum of 3) and per warning (up to a maximum of 2), as indicated in the Grading breakdown.

Challenges

Consider attempting any or all of these additional challenges if you have time:

Optimizations (up to 2 points).
- (1 point) The effectiveness of the pruning rule for nearest depends on quickly finding a nearby point. To do this, organize the recursive method so that when there are two possible subtrees to go down, you choose first the subtree that is on the same side of the splitting line as the query point; the closest point found while exploring the first subtree may enable pruning of the second subtree.
- (0.5 points) For range, instead of checking whether the query rectangle intersects the rectangle corresponding to a node, it suffices to check only whether the query rectangle intersects the splitting line segment: if it does, then recursively search both subtrees; otherwise, recursively search the one subtree where points intersecting the query rectangle could be.
- (0.5 points) You are not required to explicitly store a RectHV in each 2d-tree node (though it is probably wise in your first version). Save memory by only storing the splitting line segment (i.e., a min and a max coordinate, where the axis of those coordinates is implicitly determined by the level of the node in the tree).
Nearest k points (1 point). Add this method to KdTree.java: public Iterable<Point2D> nearest(Point2D p, int k). This method returns the k points that are closest to the query point (in any order); return all n points in the tree if n ≤ k. It must do this in an efficient manner, (i.e., using the technique from k-d tree nearest neighbor search, not from brute force). Once you’ve completed this method (and the first of the optimizations above), you’ll be able to run BoidSimulator.java (which depends upon both Boid.java and Hawk.java). Behold their flocking majesty.
Building a balanced k-d tree (1 point). Implement a method public static KdTree buildTree(List<Point2D> points) that takes in a list of points and constructs and returns a new balanced KdTree. An approach for constructing a balanced tree is described here. Include in your submission evidence that your buildTree constructs a correct and balanced k-d tree (e.g., write a main method that uses it, prints out the height, and shows a correct level-order traversal).

Grading

This lab will be graded out of 40 points as shown below. See the Testing section for advice on how to evaluate whether your implementation is correct. Partial credit will be awarded based on evidence of a good-faith effort to implement the related features. Comments explaining your approach can help earn partial credit.

Requirement	Points
`points` method returns correct level-order traversal	10 points
`range` method returns correct points	5 points
`nearest` method returns correct point	5 points
Correct pruning in `range` method	5 points
Correct pruning in `nearest` method	5 points
Submitted files do not have any `checkstyle` errors	3 points
Submitted files do not have any `checkstyle` warnings	2 points
Check-in post	2 points
`KdTree.java` compiles without warnings	0.5 points
Correct submission format (`KdTree.java`)	0.5 points
Challenges	up to 4 points

Adapted from Kevin Wayne and Josh Hug: https://www.cs.princeton.edu/courses/archive/fall20/cos226/assignments/kdtree/specification.php ↩︎

Lab 6: k-d Tree1