2025 Devlog

Also available as RSS. See archive for past years.

Fast-Quaternion-Normalization-Proof

2025-03-02

In my last post, I advocated for two different methods of accelerating inverse square roots when normalizing quaternions:

  1. An approach for values very near to 1
  2. An approach for all other values

I was able to show empirically that method 1 converges for a reasonable range of inputs, but wasn’t able to prove it.

Matthew Lugg just sent me a proof that it converges for values between 0 and ~1.56!

In practice I switch to method 2 if I expect to need more than one iteration. Having a proof of the range of values it eventually converges for potentially allows for more precise trade-offs, and at the very least gives me peace of mind that it’s correct within the range I’m using it in.

My description of their proof follows.

Consider the approximation:

1/sqrt(x) ~= -0.5x + 1.5

Using this to normalize a quaternion, you get:

q / sqrt(magSq(q))

If we substitute in our inverse square root approximation, we end up with the following expression:

q * (-0.5 * magSq(q) + 1.5)

Since all we’re interested in here is the magnitude, let’s rewrite this in terms of the magnitude. Take x to be the original magnitude, and y to be the approximately normalized magnitude:

y = |x * (-0.5x^2 + 1.5)|
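
To get a feel for this recurrence, here’s a throwaway Zig sketch (mine, not from the library) that applies it repeatedly to a starting magnitude and prints each iteration:

const std = @import("std");

pub fn main() void {
    // Repeatedly apply y = |x * (-0.5x^2 + 1.5)| to a starting magnitude
    // and watch whether it converges to 1.
    var x: f64 = 1.4;
    for (0..8) |i| {
        x = @abs(x * (-0.5 * x * x + 1.5));
        std.debug.print("iteration {}: {d}\n", .{ i, x });
    }
}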

If we can show that within a given range our result y is always closer to 1 than our input x, then values in that range must converge.

We can do this by overlaying the region |y - 1| < |x - 1| on our graph. Anywhere this region intersects our curve, the approximation must converge:

This is actually overly conservative: technically we could bounce further away before eventually converging. We’ll come back to this idea later.

The red line first crosses the blue area at 0, and then again near 1.55. Let’s calculate the exact intersection:

# System of equations; for x > 1, the boundary of the blue area that the curve crosses is y = 2 - x
y = 2 - x
y = |x * (-0.5x^2 + 1.5)|

# Substitute and simplify so we can solve for 0s
2 - x = |x * (-0.5x^2 + 1.5)|
0 = |x * (-0.5x^2 + 1.5)| - 2 + x
0 = |-0.5x^3 + 1.5x| - 2 + x

# Flatten the absolute value, discarding the second intersection
0 = -0.5x^3 + 2.5x - 2

Now we have a cubic equation. We can solve this ourselves, or throw it into a tool like Wolfram Alpha to get the intersection:
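
Or, if you’d rather work it out by hand, the cubic factors nicely (x = 1 is a root, since both sides of the system equal 1 there):

# Multiply through by -2
0 = x^3 - 5x + 4

# Factor out the root at x = 1
0 = (x - 1)(x^2 + x - 4)

# Take the positive root of the quadratic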

x = 0.5 * (sqrt(17) - 1) = 1.5615528128088303

So, our approximation must converge in the range (0, 1.5615528128088303)!

We could stop here, but my original claim was that this would converge for magnitudes up to ~2.24, which includes that suspicious area of the graph where the red line temporarily stops intersecting the blue area:

Remember when I said that checking for intersection with the blue area was overly conservative? It’s possible to get further from 1 before eventually converging.

If you consider that all of the points in the problem area have y values between 0 and 1, it’s clear that any magnitudes in the problem area immediately yeet themselves out of the problem area into the left side of the plot, and therefore aren’t an issue. (For example, x = 1.7 maps to |1.7 * (1.5 - 0.5 * 1.7^2)| ≈ 0.09, well inside the proven range.)

…except the point that bounces off the x axis. Our approximation maps exactly sqrt(3) to zero, and at zero it outputs zero and gets stuck. So this single point will fail to converge.

Do any other points eventually send us to sqrt(3)? If so, they’ll also diverge.

We don’t have to worry about points in the area we already proved converges since they always get closer to 1 and would have to get further from 1 to reach sqrt(3). However, this problem area might have more divergent points!

If we solve for where our approximation outputs sqrt(3), we get another divergent input. We can repeat this process iteratively to generate more divergent inputs:
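
Here’s a rough Zig sketch of that process (my own throwaway code, not from the library). On (sqrt(3), sqrt(5)) the magnitude map simplifies to 0.5x^3 - 1.5x and is strictly increasing, so each preimage can be found by bisection:

const std = @import("std");

/// Finds the x in (sqrt(3), sqrt(5)) that our approximation maps to `target`.
fn preimage(target: f64) f64 {
    var lo: f64 = @sqrt(@as(f64, 3.0));
    var hi: f64 = @sqrt(@as(f64, 5.0));
    for (0..64) |_| {
        const mid = 0.5 * (lo + hi);
        // On this interval, |x * (-0.5x^2 + 1.5)| simplifies to 0.5x^3 - 1.5x
        const g = 0.5 * mid * mid * mid - 1.5 * mid;
        if (g < target) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return 0.5 * (lo + hi);
}

pub fn main() void {
    // Each of these inputs eventually lands exactly on sqrt(3), bounces to
    // zero, and gets stuck there. They creep toward sqrt(5).
    var x: f64 = @sqrt(@as(f64, 3.0));
    for (0..8) |_| {
        x = preimage(x);
        std.debug.print("{d}\n", .{x});
    }
}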

There appear to be an infinite number of these divergent inputs approaching sqrt(5). I was never going to catch this empirically, which is why it’s good to have proofs of this stuff.

It’s probably not possible to represent any of these precisely enough with 32-bit floats to trigger the issue, but now we have peace of mind that no shenanigans like this are waiting for us as long as we stay in that lower range.

In conclusion, the proposed approximation is proven valid for all magnitudes in the range (0, 1.5615528128088303), or more precisely, (0, 0.5 * (sqrt(17) - 1)), which is more than large enough for our purposes.

Outside of this range, you’ll start to have problems.

Thanks again to Matthew Lugg for figuring this out and letting me share!


Fast-Quaternion-Normalization

2025-02-27

Floating point division is slow, and square root is slower. What if I told you that you could approximate 1/sqrt(x) as y = -0.5x + 1.5?

Looks pretty bad, right? Try zooming in.

Hmm. Not so bad when you’re near x = 1.
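
To put some rough numbers on it:

# x      1/sqrt(x)    -0.5x + 1.5
  1.02   0.99015      0.99
  1.10   0.95346      0.95
  1.50   0.81650      0.75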

If you’re using 1/sqrt(x) to normalize a quaternion after multiplication, this is likely all you need.

Multiplying two normalized quaternions produces a third normalized quaternion with only a tiny bit of rounding error, so you’re always dealing with magnitudes near one. You just need to keep the magnitude from drifting so that your eventual rotation matrix doesn’t inadvertently scale your object.

This function isn’t magic: it’s just the tangent line of y = 1/sqrt(x) at x = 1. You can evaluate it in one instruction if your target supports fused multiply-add:

/// Very fast approximate inverse square root that only produces usable
/// results when `f` is near 1.
pub fn invSqrtNearOne(f: anytype) @TypeOf(f) {
    return @mulAdd(@TypeOf(f), f, -0.5, 1.5);
}
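
For instance, here’s how you might use it to renormalize after multiplying (a minimal sketch using a hypothetical Quat struct of my own, not the library’s API):

const Quat = struct { x: f32, y: f32, z: f32, w: f32 };

/// Renormalizes a quaternion whose magnitude has only drifted slightly
/// from 1, e.g. from accumulated rounding error after multiplication.
fn renormalize(q: Quat) Quat {
    const mag_sq = q.x * q.x + q.y * q.y + q.z * q.z + q.w * q.w;
    const scale = invSqrtNearOne(mag_sq);
    return .{ .x = q.x * scale, .y = q.y * scale, .z = q.z * scale, .w = q.w * scale };
}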

(Side note: if this approximation is applied iteratively, the quaternion will approach normalcy if its initial magnitude is in the range (0, ~2.24); outside of that range it diverges. I verified this empirically but I’m not a good enough mathematician to prove it. If you manage to prove it let me know! Proof with corrected range is here!)

Alright, but what if you’re not near one? Maybe you lerped two quaternions together, or need a fast approximate inverse square root for some other purpose.

Turns out most CPUs have a hardware instruction for this now (e.g. on x86 it’s rsqrtss). I’m not aware of any languages with platform-independent intrinsics for this operation, but here’s a quick Zig snippet that will generate the instruction if your target supports it:

const std = @import("std");

/// Fast approximate inverse square root using a dedicated hardware
/// instruction when available.
pub fn invSqrt(f: anytype) @TypeOf(f) {
    // If we're positive infinity, just return infinity
    if (std.math.isPositiveInf(f)) return f;

    // If we're NaN, negative infinity, or zero, return NaN
    if (std.math.isNan(f) or
        std.math.isNegativeInf(f) or
        f == 0.0)
    {
        return std.math.nan(@TypeOf(f));
    }

    // Now that we've ruled out values that would cause issues in
    // optimized float mode, enable it and calculate the inverse square
    // root. This generates a hardware instruction approximating the
    // inverse square root on platforms that have one, and is therefore
    // significantly faster.
    {
        @setFloatMode(.optimized);
        return 1.0 / @sqrt(f);
    }
}

This may also generate some extra instructions to refine the approximation. You could sidestep this, if it’s undesired, by writing per-platform inline assembly, but it’s very fast regardless.

The above snippets are from my work-in-progress Zig geometry library for gamedev.


muladd

2025-02-22

Quick performance tip. If you’re comfortable requiring support for FMA3 (Intel & AMD CPUs have had it for the last 13 years!), you should probably be using your language’s fused multiply-add intrinsic where relevant.

For example, in Zig you’d replace a * b + c with @mulAdd(f32, a, b, c). This is much faster and more precise.
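
For instance, here’s a dot product where every step is written as a single fused multiply-add (a quick sketch of mine, not from any particular library):

/// Dot product of two 4-element vectors using fused multiply-adds.
fn dot4(a: [4]f32, b: [4]f32) f32 {
    var sum: f32 = 0.0;
    for (a, b) |x, y| {
        sum = @mulAdd(f32, x, y, sum);
    }
    return sum;
}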

This is an easy optimization, but the compiler can’t do it for you because replacing your arithmetic with more precise arithmetic isn’t standards-compliant.

(Unless you’re using -ffast-math, in which case all bets are off WRT precision and you can just let the compiler do its thing.)

The only catch: make sure you’re actually setting your target correctly! The compiler can’t ignore your request for muladd since that would change the precision of your result, so if you’re targeting a baseline CPU that doesn’t have it, you’ll end up emulating it in software and go slower instead of faster.