K-Dimensional tree for searching data

The KDTree facilitates the rapid querying of data stored in a DataSet. It can make querying by distance more efficient and faster than if you did it by manual lookup. Before a KDTree can be queried it has to be fitted to a DataSet containing the points that it will look up.

One use of the KDTree is for doing nearest neighbour lookups. The interactive example below contains a collection of data points, whose x and y values range between 0.0 and 1.0, which are plotted on to a canvas-like scatterplot. Using the coordinates of our mouse inside this scatterplot, we can search for the point, or group of points closest to our mouse. We can also constrain the radius of the search: only points within a certain distance of our query point, and not just whichever are nearest, are highlighted. Before you use the example you will have to fit the underlying KDTree by pressing the button labelled fit.

Number of Neighbours: 30
0 50
Radius: 0
0.0 1.0

You can change the number of neighbours and the radius constraint with the sliders at the bottom. When the radius or the number of neighbours are set to 0 the behaviour changes. For number of neighbours it will highlight every point. When the radius is 0 it is no longer a constraint and the distance of a nearest neighbour is not taken into account. Points that are highlighted with a red square are those that are deemed to be the nearest neighbours, given both the number of neighbours and radius constraint. Even if points fall outside the radius, a line will be drawn to them. Notice how working with both constraints allows you to home on different neighbourhoods of points. For example, by using a large number of neighbours (50) and a small radius (0.1) you can ensure that if there are actually 50 nearby points, only the ones within a tight proximity radius to the mouse will be highlighted.

Whilst k-d trees can offer very good performance relative to naïve search algorithms, they suffer from something called “the curse of dimensionality” (like many algorithms for multi-dimensional data). In practice, this means that as the number of dimensions of your data goes up, the relative performance gains of a k-d tree go down.