This post was originally written on Codeforces; relevant discussion can be found here.
TL;DR
Use the following template (C++20) for efficient and near-optimal binary search (in terms of number of queries) on floating point numbers.
Template
#include <bit>
#include <climits>
#include <cmath>
#include <concepts>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <numeric>
#include <type_traits>

// smallest unsigned integer type that is at least N_BITS wide
template <std::size_t N_BITS>
using int_least_t = std::conditional_t<
    N_BITS <= 8, std::uint8_t,
    std::conditional_t<
        N_BITS <= 16, std::uint16_t,
        std::conditional_t<
            N_BITS <= 32, std::uint32_t,
            std::conditional_t<
                N_BITS <= 64, std::uint64_t,
                std::conditional_t<N_BITS <= 128, __uint128_t, void>>>>>;
// This should work for float and double, but for long double, std::bit_cast
// will not work reliably on most systems, since long double is typically 80
// bits wide (padded to 128 bits). To handle this, consider using double
// instead, or extract the long double's value bits into a 128-bit integer
// manually (e.g., via an intermediate std::bitset).
/*
* returns first x in [a, b] such that predicate(x) is false, conditioned on
* logical_predicate(a) && !logical_predicate(b) && logical_predicate(-inf) &&
* !logical_predicate(inf)
* here logical_predicate is the mathematical value of the predicate, not the
* machine value of the predicate
* it is guaranteed that non-nan, non-inf inputs are passed into the predicate
* if NaNs or infinities are passed to this function as argument, then the
* inputs to the predicate will start from smallest/largest representable
* floating point numbers of the input type - this can be a source of errors
* if you multiply the input by something > 1 for example
* strictly speaking, the predicate should also be perfectly monotonic, but if
* it gives out-of-order booleans in some small range [a, a + eps] (and the
* correct order elsewhere), then the answer will be somewhere in between
* the same holds for how denormals are handled by this code
*/
template <bool check_infinities = false,
          bool distinguish_plus_minus_zero = false,
          bool deal_with_nans_and_infs = false, std::floating_point T>
T partition_point_fp(T a, T b, auto&& predicate) {
    static constexpr std::size_t T_WIDTH = sizeof(T) * CHAR_BIT;
    using Int = int_least_t<T_WIDTH>;
    static constexpr auto is_negative = [](T x) {
        return static_cast<bool>((std::bit_cast<Int>(x) >> (T_WIDTH - 1)) & 1);
    };
    if constexpr (distinguish_plus_minus_zero) {
        if (a == T(0.0) && b == T(0.0) && is_negative(a) && !is_negative(b)) {
            if (!predicate(-T(0.0))) {
                return -T(0.0);
            } else {
                // predicate(0.0) is guaranteed to be false because b = 0.0
                return T(0.0);
            }
        }
    }
    if (a >= b) return NAN;  // empty range
    if constexpr (deal_with_nans_and_infs) {
        // get rid of NaNs as soon as possible
        if (std::isnan(a)) a = -std::numeric_limits<T>::infinity();
        if (std::isnan(b)) b = std::numeric_limits<T>::infinity();
        // deal with infinities
        if (a == -std::numeric_limits<T>::infinity()) {
            if constexpr (check_infinities) {
                if (predicate(-std::numeric_limits<T>::max())) {
                    a = -std::numeric_limits<T>::max();
                } else {
                    return -std::numeric_limits<T>::max();
                }
            } else {
                a = -std::numeric_limits<T>::max();
            }
        }
        if (b == std::numeric_limits<T>::infinity()) {
            if constexpr (check_infinities) {
                if (!predicate(std::numeric_limits<T>::max())) {
                    b = std::numeric_limits<T>::max();
                } else {
                    return std::numeric_limits<T>::infinity();
                }
            } else {
                b = std::numeric_limits<T>::max();
            }
        }
    }
    // now a and b are both finite - deal with differently signed a and b
    if (is_negative(a) && !is_negative(b)) {
        // check 0 once
        if constexpr (distinguish_plus_minus_zero) {
            if (!predicate(-T(0.0))) {
                b = -T(0.0);
            } else if (predicate(T(0.0))) {
                a = T(0.0);
            } else {
                return T(0.0);
            }
        } else {
            if (!predicate(T(0.0))) {
                b = -T(0.0);
            } else {
                a = T(0.0);
            }
        }
    }
    // in the case a and b are both 0 after the above check, return 0
    if (a == b) return T(0.0);
    // start actual binary search
    auto get_int = [](T x) { return std::bit_cast<Int, T>(x); };
    auto get_float = [](Int x) { return std::bit_cast<T, Int>(x); };
    if (b > 0) {
        // both endpoints are now >= 0: bit patterns are ordered like the values
        while (get_int(a) + 1 < get_int(b)) {
            auto m = std::midpoint(get_int(a), get_int(b));
            if (predicate(get_float(m))) {
                a = get_float(m);
            } else {
                b = get_float(m);
            }
        }
    } else {
        // both endpoints are <= 0: negate to reduce to the non-negative case
        while (get_int(-b) + 1 < get_int(-a)) {
            auto m = std::midpoint(get_int(-b), get_int(-a));
            if (predicate(-get_float(m))) {
                a = -get_float(m);
            } else {
                b = -get_float(m);
            }
        }
    }
    return b;
}
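As a quick usage sketch (assuming the template above is in scope; the driver below is illustrative), the following finds the smallest double whose square is at least 2, i.e., \(\sqrt{2}\) up to one ulp:

#include <iomanip>
#include <iostream>

int main() {
    // first x in [0, 2] such that (x * x < 2) is false
    double root = partition_point_fp(0.0, 2.0, [](double x) { return x * x < 2; });
    std::cout << std::setprecision(17) << root << '\n';  // prints 1.4142135623730951
}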
It is also possible to extend this to break early once a custom closeness condition holds (for example, min(absolute error, relative error) < 1e-9), but for the sake of simplicity, the template above does not do so; one possible extension is sketched below.
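A minimal sketch of such an early exit, restricted to a positive range for brevity (the function name, the tol parameter, and the exact closeness condition are illustrative choices, not part of the template above):

#include <algorithm>
#include <bit>
#include <cstdint>
#include <numeric>

// assumes 0 < a < b, predicate(a) == true and predicate(b) == false
double partition_point_fp_approx(double a, double b, auto&& predicate,
                                 double tol = 1e-9) {
    auto get_int = [](double x) { return std::bit_cast<std::uint64_t>(x); };
    auto get_float = [](std::uint64_t x) { return std::bit_cast<double>(x); };
    while (get_int(a) + 1 < get_int(b)) {
        // break early once min(absolute error, relative error) drops below tol
        if (std::min(b - a, (b - a) / b) < tol) break;
        auto m = std::midpoint(get_int(a), get_int(b));
        if (predicate(get_float(m))) a = get_float(m);
        else b = get_float(m);
    }
    return b;
}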
Introduction
It is well known that writing the condition for continuing the binary search as anything like while (r - l > 1) or while (l < r) is bug-prone on floating point numbers, so people tend to write the binary search with a fixed number of iterations instead. However, this is usually not the best method: you need \(O(\log L)\) iterations, where \(L\) is the length of the range, and the range of floating point numbers is very large.
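For concreteness, the fixed-iteration pattern referred to here looks something like this (a sketch; the iteration count 100 is the usual arbitrary guess, not a tuned constant):

// the usual pattern: fix the number of iterations and hope it is enough
// for the required precision over the given range
double partition_point_naive(double l, double r, auto&& predicate) {
    for (int iter = 0; iter < 100; ++iter) {
        double m = (l + r) / 2;
        if (predicate(m)) l = m;
        else r = m;
    }
    return r;
}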
So, I came up with a way to binary search on (IEEE754, i.e., practically all implementations of) floating point numbers that takes bit_width(floating_type) + O(1) calls to the predicate for binary search. I am pretty sure that this method has been explored before (feel free to drop references in the comments if you find them). Regardless of whether this is a novel algorithm or not, I wanted to share this since it teaches you a lot about floating points, and it also has the following advantages:
- No need to hard-code the number of iterations.
- It avoids issues that you get in other implementations (for example, using sqrt, you end up with a ton of cases with 0, negatives and so on).
- It is efficient (sqrt is expensive, and so is dealing with a lot of cases).
Arriving at the algorithm
I started out by thinking about this: floating point numbers have a very large range. Can we do better than \(O(\log L)\), where \(L\) is the length of the range? Recall that floating point numbers have a fixed width that is much smaller than the log of the range they represent, and that between two consecutive representable floating point numbers, it is their ratio that is nicely bounded (ignoring the boundary at 0, infinities, and NaNs), not their difference. It is also well known that to get to a certain relative error, it is better to query \(\sqrt{lr}\) than \(\frac{l + r}{2}\) in binary search: each such query replaces the ratio \(r/l\) of the endpoints by its square root, so for a fixed relative error the number of queries is doubly logarithmic in \(r/l\).
So the first idea that comes to mind is to use sqrt instead of midpoint in the usual binary search. However, it has a lot of issues: for example, if \(l = 0\), then \(\sqrt{lr} = 0\) and you never make progress (this is fixable by doing a midpoint search on \([0, 1]\) and a sqrt search on \([1, r]\)). Another issue is how to deal with negatives (again doable by splitting the input range into multiple ranges where sqrt works) and with overflows/underflows. And the main issue with this approach is that sqrt is expensive: if you are doing a problem where the predicate is fast and you need to do a lot of binary searches, most of your computation time would be spent in sqrt. A sketch of this approach on a positive range is below.
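This sketch assumes 0 < l < r, predicate(l) == true, and predicate(r) == false; the termination threshold is an illustrative choice:

#include <cmath>

double partition_point_sqrt(double l, double r, auto&& predicate) {
    while (r / l > 1 + 1e-15) {
        // geometric mean, written as a product of square roots to avoid overflow in l * r
        double m = std::sqrt(l) * std::sqrt(r);
        if (m <= l || m >= r) break;  // guard against rounding stalling the loop
        if (predicate(m)) l = m;
        else r = m;
    }
    return r;
}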
Inspired by these observations, we try to approximate sqrt in a way that preserves monotonicity. Here comes the main point: the IEEE754 representation of a floating point number separates the mantissa from the exponent, and the exponent of \(\sqrt{lr}\) is roughly the mean of the exponents of \(l\) and \(r\). Since the exponent occupies the top few bits (excluding the sign bit), we can just do the following (for positive floating point numbers at least): read the floating point representation as an integer (this can be done in a type-safe manner using std::bit_cast), take the midpoint of these two integers, and convert it back to a floating point number. This preserves monotonicity because, for positive IEEE754 numbers, comparing the bit patterns as unsigned integers is the same as comparing the values; this is easy to verify by hand. The same thing works when both range endpoints are negative: we simply invert the sign bit (i.e., negate both endpoints) and search on the mirrored range. The case where the endpoints have opposite signs is also simple (if you disregard the fact that +0 and -0 compare equal but have distinct representations and reciprocals): check the predicate at \(0\) (at both zeroes, if it matters for your predicate) to reduce to one of the previous cases.
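The order-preservation claim is easy to sanity-check; a minimal sketch for IEEE754 doubles:

#include <bit>
#include <cmath>
#include <cstdint>
#include <iostream>

int main() {
    // adjacent positive doubles have adjacent bit patterns, so comparing
    // bit patterns as unsigned integers agrees with comparing the values
    double x = 1.5;
    double y = std::nextafter(x, 2.0);  // smallest double greater than x
    auto xi = std::bit_cast<std::uint64_t>(x);
    auto yi = std::bit_cast<std::uint64_t>(y);
    std::cout << (yi == xi + 1) << '\n';  // prints 1
}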
Now if we want to do some more handling (infinities, NaNs, denormals), we can do it as we wish. In the implementation in the TL;DR above, we replace NaN endpoints with the infinity of the correct sign, and since an infinite endpoint cannot be bisected bit-wise, we bring such an endpoint down to the largest representable finite value instead (optionally checking the predicate there first).
Analysis
In all, there are at most \(w + O(1)\) calls to the predicate, where \(w\) is the bit width of the floating point type minus the length of the longest common prefix of the bit representations of the two endpoints.
This implementation is also robust to predicates which are noisy near the boundary (i.e., near the boundary, there is a small range where the true values of the predicate can come after the false values) — in this case the algorithm returns something in this range near the boundary.
Note that we did not need to hardcode the number of iterations, nor did we require some carefully-written predicate for loop termination.
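To see the \(w + O(1)\) bound in practice, one can count predicate calls with a wrapper (a sketch, assuming the template above is in scope; the counting lambda is illustrative):

#include <iostream>

int main() {
    int calls = 0;
    auto counted = [&calls](double x) { ++calls; return x * x < 1e18; };
    // the range [0, 1e18] is huge, yet the call count stays near the bit width
    double root = partition_point_fp(0.0, 1e18, counted);  // ~1e9
    std::cout << root << ' ' << calls << '\n';  // around 60 calls for double
}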
Also, since std::bit_cast is practically free compared to std::sqrt, it is much faster in cases where you need to do a lot of binary searches.
As a usage example, this submission uses the usual binary search with a fixed number of iterations, and this submission uses the template above.
Some problems
The following are some problems that use binary search on floating points, and should be solvable using this template. If you encounter any bugs in the template while solving these problems, do let me know in the comments below. Thanks to jeroenodb and PurpleCrayon for problem suggestions.