Finding the closest pair of points

Introduction

In this article, I am going to apply divide and conquer algorithm to find the closest pair of points from the given set of points. Two points are closest when the Euclidean distance between them is smaller than any other pair of points. The Euclidean distance between points $p_1(x_1, y_1)$ and $p_2(x_2, y_2)$ is given by the following mathematical expression
$$distance = \sqrt{(y_2 - y_1)^2 + (x_2 - x_1)^2}$$
We can find the closest pair of points using the brute force method in $O(n^2)$ time. Using the divide and conquer techniques we can reduce the time complexity to $O(n\log n)$. Even though algorithm described in this article takes $O(n \log^2 n)$ time, I will give you the idea on how to make the complexity $O(n\log n)$ towards the end of the article.

Algorithm

Before passing into the main procedure of the algorithm, the given set of points are first sorted by x-coordinate and y-coordinate respectively to create two separate point sets $P_x$ and $P_y$. The algorithm proceeds as follows.

If the number of points is less than 4, use the brute force method to find the closest pair in constant time.
Find a vertical line $l$ that bisects the point set into two halves - left half and right half. Let $Q$ denotes the left half and $R$ denotes the right half. Since we already have points sorted by $x$-coordinate, this can be done in a constant time.
Sort both $Q$ and $R$ by x-coordinate and y-coordinate respectively to produce $Q_x$, $Q_y$, $R_x$ and $R_y$. This operation can take from $O(n)$ to $O(n\log n)$ depending upon the implementation. Normal sorting takes $O(n\log n)$ but if there are techniques to divides sorted points into two sorted halves in linear time. In this article, I am going to use the normal sorting.
Recursively find the closest pair and distance in both the halves. Let the minimum distance in the left half be $\delta_l$ and minimum distance in the right half be $\delta_r$.
If the closest pair lies on either half, we are done. But sometimes the closest pair might be somewhere else. This happens when one point of the closest pair is in the left half and other in the right half. Find the closest pair of points such that one point is in the left half and other in right half. This step is a little bit tricky to do in a linear time. I explain this step in detail in the next section.
Now we have three distances and pairs. One from the left half, one from the right half and the last one from between the left and right half. Find the best in the three and return it.

More explanation on step 5

Consider a set of points divided by a vertical line $l$ (shown by a red line) as shown in the figure below.

In the figure, $\delta_l$ is the minimum distance in the left, $\delta_r$ is the minimum distance in the right and $\delta = \min (\delta_l, \delta_r)$.
To find the minimum distance between a point in the left-hand side with a point in the right-hand side, we do not need to check the distances between every point in the left to every point to the right. We only need to consider the points that lie inside the area that is $\delta$ units away from $l$ on either side. That means we need to consider the items shown in green color in the above figure.

Sometimes all the points may lie in this area (the area between two dotted lines). In that case, do we need to check the distance between every point on the left to every point on the right? The answer is NO. It is sufficient to find the distance between a point on the left to only a constant number of points on the right provided the points are sorted by y-coordinate.

That means, we use $P_y$ and find the closest pair of points by comparing the distance between a point with its at most 7 neighboring points (for the proof see here) for every point inside the area. In the worst case, if all the points lie in that region, the complexity will be $7n$, which is linear.

Implementation

Before calling the main procedure, we sort the points by $x$ and $y$ coordinates respectively.

1 2	px = sorted(points, key = lambda p: p.x) py = sorted(points, key = lambda p: p.y)

px and py are passed to the function closest_pair that find the closest pair of points.

# px : set of points sorted by x coordinates
# py: set of points sorted by y coordinates
def closest_pair(Px, Py):
    # base case
    if len(Px) < 4:
        return closest_pair_bruteforce(Px)

    # divide px into left and right halves
    Q = Px[:len(Px)/2]
    R = Px[len(Px)/2:]

    # sort Q and R both by x and y coordinates
    Qx = sorted(Q, key = lambda p: p.x) # complexity can be imporved
    Qy = sorted(Q, key = lambda p: p.y) # complexity can be imporved
    Rx = sorted(R, key = lambda p: p.x) # complexity can be imporved
    Ry = sorted(R, key = lambda p: p.y) # complexity can be imporved

    # conquer step, recursively find closest pairs on
    # left and right halves
    dl, p1, q1 = closest_pair(Qx, Qy)
    dr, p2, q2 = closest_pair(Rx, Ry)
    d = min(dl, dr)
    # closest pair split pair. One point is in left half
    # and other in right half
    dc, p3, q3 = closest_split_pair(Px, Py, d)

    # find the minimum and return the pair
    mind = min(dl, dr, dc)
    if mind == dl:
        return dl, p1, q1
    elif mind == dr:
        return dr, p2, q2
    else:
        return dc, p3, q3

Base case is straightforward. Use brute force when the number of points is less than 4.

# calculates distances from every point to every other point
# and compare the distance. Returns the minimum
def closest_pair_bruteforce(points):
    mind = sys.maxsize
    cpairs = None
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dist = distance_sq(points[i], points[j])
            if dist < mind:
                mind = dist 
                cpairs = (points[i], points[j])
    return mind, cpairs[0], cpairs[1]

Finally, the implementation of step 5 is


# returns the closest pair of points between 
# the left and the right half
def closest_split_pair(Px, Py, d):
    # get the point in the boundary x-coordinate
    xbar = Px[len(Px)/2].x

    # get all the points between xbar - d and xbar + d
    # sorted by y-coordinates
    Sy = []
    for point in Py:
        if xbar - d <= point.x <= xbar + d:
            Sy.append(point)

    # compare each point with at most 7 neighboring points
    best = d
    best_pair = Px[0].x, Px[0].y

    for i in range(len(Sy)):
        for j in range(1, min(8, len(Sy) - i)):
            if distance_sq(Sy[i], Sy[i + j]) < best:
                best = distance_sq(Sy[i], Sy[i + j])
                best_pair = Sy[i], Sy[i + j]

    return best, best_pair[0], best_pair[1]

Complexity

The recurrence relation for this divide and conquer algorithm can be written as
$$T(n) = 2T(n/2) + O(n\log n)$$
Since we are performing sorting operation inside the divide step, the algorithm spends $O(n\log n)$ in each recursive call. The overall complexity is, therefore, $O(n\log^2 n)$.

Improvement on Complexity.

Line numbers 13 - 16 in closest_pair function can be improved. Since px and py are already sorted, we can get the sorted halves in a linear time. This sounds like an opposite operation of merge procedure in merge sort where we find the sorted list from two sorted lists in linear time.

References

Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (n.d.). Introduction to algorithms (3rd ed.). The MIT Press.
Closest Pair of Points. (n.d.). Lecture. Retrieved August 26, 2018, from http://www.cs.ubc.ca/~liorma/cpsc320/files/closest-points.pdf