
O(n) Point Duplicate Detection

So Donald Knuth says that premature optimization is the root of all evil, which probably isn't too far off the truth. (Loving money is, I suppose, a sort of premature optimization.)

But an interesting problem came up at work: how do we quickly detect duplicate points that lie within some epsilon of each other? For the sake of simplicity, but without loss of generality, we can assume that we're using two-dimensional points for this exercise, though I think the proposed solution should be extensible.

The naive brute-force solution is to compare each point against every other point in the set and discard the duplicates. This runs in O(n^2) in the number of points.
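As a baseline, here's a minimal sketch of that brute-force pass; the Point struct, the epsilon comparison, and the function name are my own assumptions, not taken from the code discussed later:

    #include <cmath>
    #include <vector>

    struct Point { double x, y; };

    // Keep a point only if it is not within eps of any point already kept;
    // the nested loop over the kept points is what makes this O(n^2).
    std::vector<Point> dedupNaive( const std::vector<Point> &pts, double eps )
    {
        std::vector<Point> kept;
        for( const Point &p : pts ) {
            bool dup = false;
            for( const Point &q : kept ) {
                if( std::fabs(p.x - q.x) < eps && std::fabs(p.y - q.y) < eps ) {
                    dup = true;
                    break;
                }
            }
            if( !dup ) kept.push_back(p);
        }
        return kept;
    }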

The next trick in our toolbox is to sort the points by, e.g., their X coordinate. In the end this reduces to n + n·log(n): sorting runs in n·log(n) at best, and we then make a single pass through the array to find the duplicates.
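A sketch of that approach, checking exact equality of neighbours for clarity (handling the epsilon tolerance across the sorted order is fiddlier and is glossed over here):

    #include <algorithm>
    #include <vector>

    struct Point { double x, y; };

    // Sort by X (then Y) and keep a point only if it differs from the last
    // point kept: n·log(n) for the sort plus one O(n) pass.
    std::vector<Point> dedupSorted( std::vector<Point> pts )
    {
        std::sort( pts.begin(), pts.end(), []( const Point &a, const Point &b ) {
            return a.x < b.x || (a.x == b.x && a.y < b.y);
        } );

        std::vector<Point> kept;
        for( const Point &p : pts ) {
            if( kept.empty() || kept.back().x != p.x || kept.back().y != p.y )
                kept.push_back(p);
        }
        return kept;
    }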

The use of comparisons is what imposes the n·log(n) lower bound on sorting. Since we're dealing with numbers, we might be able to do a bit better using a counting sort, or a variation thereof.

Strictly speaking, a sort isn't necessary: all we really need is a hash for Points. I didn't think of it this way to start, and I'm not sure how much I like thinking about it now: hashing floats is hard work, because their limited precision makes a robust implementation difficult.

On the other hand, we could attempt to encode each point as a single value instead of as two values. The first idea I had was inspired by Cantor's diagonalization argument, or, better put, by the proof that you can map all of the points in the unit square onto the unit line. The essential argument is that you simply interleave the coordinates together. So, for example, interleaving the coordinates of the point (0.1232, 0.43205) maps it to 0.1423322005 on the unit line.

Precision costs us again here (we run out of available decimal places very quickly in this scheme), and in any case it's not immediately clear how to do such an interleaving efficiently.

On the other hand, it is possible to encode two bit strings as one using XOR: cryptographers do this all the time. It will rightly be pointed out that this fails as an equality test: the same XOR result can be produced by completely different pairs of bit strings. As a trivial example, 0x00 can be produced by 0x11^0x11 as well as by 0xFF^0xFF or 0x00^0x00. Moreover, in the two-dimensional point world order matters, but XOR doesn't care about order.

But what we can say for sure is that two XOR results that are not equal were definitely generated by pairs of bit strings that are not equal. If you're dealing with very few duplicate points, this early break could be fast enough to support a reasonable bucket-style approach.

How does this work? In a nutshell:

Upon creation, store bit-equivalent integers for the double values used in the point. Probably this needs to be truncated or rounded to the desired epsilon. Call these integer values xb and yb, for x-bits and y-bits respectively.
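A sketch of that construction step follows; the grid-snapping scheme, the default epsilon, and the memcpy conversion (borrowed from the footnote) are choices of mine, not necessarily what the original class does:

    #include <cmath>
    #include <cstdint>
    #include <cstring>

    struct Point {
        double  x, y;
        int64_t xb, yb;   // bit-equivalent integers, used for the XOR fast path

        Point( double x_, double y_, double eps = 1e-9 ) : x(x_), y(y_)
        {
            // Snap each coordinate to the epsilon grid and take the bits of the
            // snapped value; nearby points usually land in the same grid cell
            // (points straddling a cell boundary are a known weakness).
            double xs = std::round(x_ / eps) * eps;
            double ys = std::round(y_ / eps) * eps;
            std::memcpy( &xb, &xs, sizeof(double) );
            std::memcpy( &yb, &ys, sizeof(double) );
        }
    };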

When checking for point equivalence, compare the values like so:

    if( (xb ^ yb) != (other.xb ^ other.yb) ) {
        return false;
    }
    else { /* maybe they're equal, so check the point values */ }

Originally I concentrated a bit too much on byte-by-byte conversion (see the footnote below), but it's relatively trivial to get a bit-wise representation of a double in C++:

    #include <cstdint>

    double x, y;
    int64_t xb, yb;

    // Reinterpret the bytes of each double as a 64-bit integer via a reference cast.
    xb = (int64_t &)x;
    yb = (int64_t &)y;

In a short sample program I wrote with this implemented, we find:

    Point p(1.0, 2.0),
          q(1.0, 2.0),
          r(1.0, 1.0),
          s(2.0, 1.0);

    p == q;
    q == r;
    q == p;
    p == s;

results in:

    XOR passed, equality passed.
    XOR failed
    XOR passed, equality passed.
    XOR passed, equality failed.

To detect duplicate points, iterate through the list and maintain a map from the xor'd bit string to array indexes: when a collision occurs, compare the given point against the previously seen points in that bucket to determine whether it matches any of them. If not, append; otherwise discard.
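Here's a sketch of that loop, using std::unordered_map and a minimal stand-in for the Point class; the names and details are mine, not from the original program:

    #include <cstdint>
    #include <cstring>
    #include <unordered_map>
    #include <vector>

    // Minimal stand-in for the Point class described above.
    struct Point {
        double  x, y;
        int64_t xb, yb;

        Point( double x_, double y_ ) : x(x_), y(y_) {
            std::memcpy( &xb, &x, sizeof(double) );
            std::memcpy( &yb, &y, sizeof(double) );
        }

        bool operator==( const Point &other ) const {
            if( (xb ^ yb) != (other.xb ^ other.yb) ) return false;  // XOR fast path
            return xb == other.xb && yb == other.yb;                // full check on collision
        }
    };

    // Map the xor'd bits to indexes of points already kept; on a collision,
    // fall back to a full equality check against just that bucket.
    std::vector<Point> dedupXor( const std::vector<Point> &pts )
    {
        std::vector<Point> kept;
        std::unordered_map<int64_t, std::vector<std::size_t>> buckets;

        for( const Point &p : pts ) {
            std::vector<std::size_t> &bucket = buckets[p.xb ^ p.yb];
            bool dup = false;
            for( std::size_t i : bucket ) {
                if( kept[i] == p ) { dup = true; break; }
            }
            if( !dup ) {
                bucket.push_back( kept.size() );
                kept.push_back( p );
            }
        }
        return kept;
    }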

I haven't done a real time analysis here, but ignoring (for the moment) the time spent in conversion, the average time to detect duplicates is probably O(n): we scan through the points once and discard duplicates when we find a collision.

The worst-case scenario is when we collide every single time without finding a duplicate. This situation is roughly as bad as our naive solution since we iterate through a whole list of candidate points to check for equality, and we do it n times. While the constants are smaller in this solution, the overall worst-case running time is still O(n^2).

Further analysis is probably necessary to check how frequent collisions are in a given data set.


[1]: The conversion here is probably a little slow, and there are undoubtedly faster ways to do this with bit arithmetic and loop unrolling and the like, but it's late!

    #include <cstdint>
    #include <cstring>

    double x, y;
    int64_t xb, yb;
    // Copy the raw bytes of each double into a 64-bit integer.
    memcpy( (void *)(&xb), (void *)(&x), sizeof(double) );
    memcpy( (void *)(&yb), (void *)(&y), sizeof(double) );