Nov 16, 2022 3 min read programming

Why floating point numbers don't form a ring

Floating point numbers don't form a ring. There are at least two reasons. First, addition isn't associative. Second, multiplication doesn't distribute over addition. This means you can't rely on associativity or distributivity holding when writing code that uses floating point numbers.

I'm going to illustrate these points using a vastly simplified version of floating point arithmetic. This system is base-10. The mantissa is just one digit. The exponent ranges from -499 to 500. So, \(9 \cdot 10^{50}\) is a valid exactly-representable number in this system, as is \(-3 \cdot 10^{-4}\), but \(3.3 \cdot 10^0\) isn't. I believe this system preserves the relevant features of IEEE 754 floating point numbers for the purposes of this discussion.

With that caveat in mind, I'm going to explain these two points with some examples.

Addition isn't associative. It's not associative because you can pick some "large" positive number x and a small-magnitude number \(\epsilon\) and write \(x + (-x) + \epsilon\). Associating one way gives you 0 while the other gives you \(\epsilon\).

Concretely, pick \(x = 5\cdot 10^{1}\) and \(\epsilon = 1 \cdot 10^{0}\). Our expression is then \(5 \cdot 10^1 + (-5 \cdot 10^1) + 1\cdot 10^{0}\). Associating to the left, we get \((5 \cdot 10^1 + (-5 \cdot 10^1)) + 1\cdot 10^{0} = 0 + 1 \cdot 10^{0} = 1 \cdot 10^{0}\). Associating to the right, we get \(5 \cdot 10^1 + ((-5 \cdot 10^1) + 1\cdot 10^{0} )= 5 \cdot 10^1 + (-5 \cdot 10^1) = 0\).

Here's why \(-5 \cdot 10^1 + 1 \cdot 10^{0} = 0\). When we add \(-5 \cdot 10^1\) to \(1 \cdot 10^{0}\), we have to rewrite them so they share a common exponent. We could pick either exponent, in theory, but we'll pick the larger one because otherwise the math makes no sense – small numbers eliminate large numbers rather than making no difference to them. That choice means we have to "right shift" the digits of \(1 \cdot 10^0\) so that the right side becomes \(0.1 \cdot 10^1\). But we can't actually represent \(0.1\) because we only have one digit of mantissa. So it becomes \(0 \cdot 10^1\) and the sum is \(-5 \cdot 10^1\).

The necessary feature of this example is that the difference between the exponents of \(x\) and of \(\epsilon\) (in this case \(1 - 0 = 1)\)) is larger than the size of the mantissa in digits (in IEEE 754, bits). In my made-up system, the mantissa is just one digit so this happens as soon as there is any difference at all in the exponents. A similar thing happens with multiplication distributing over addition.

Multiplication doesn't distribute over addition. The reason is similar to the reason that addition isn't associative.

For example, suppose we want to muliply \(x = 4 \cdot 10^0\) with \(y + z\) where \(y = 1 \cdot 10^1\) and \(z = 9 \cdot 10^0\). That is, we want to calculate \(4 \cdot 10^0 \cdot \left(1 \cdot 10^1 + 9 \cdot 10^0\right)\). The \(y + z\) here becomes \(1 \cdot 10^1\) for much the same reasons as before, and when we multiply with \(4 \cdot 10^0\) we get \(4 \cdot 10^1\). Meanwhile, if we calculate the distributed expression \(x \cdot y + x \cdot z\), we get \(4 \cdot 10^0 \cdot 9 \cdot 10^1 + 4 \cdot 10^0 \cdot 9 \cdot 10^0\). Now \(4 \cdot 10^0 \cdot 9 \cdot 10^0\) should be \(3.6 \cdot 10^1\), but we have to round to get rid of the extra digit, so we end up with \(4 \cdot 10^1\). The other expression is the same as before, so we get \(x \cdot y + x \cdot z = 4 \cdot 10^1 + 4 \cdot 10^1 = 8 \cdot 10^1\).

Here the relevant feature of the example is slightly different. We need two numbers, without loss of generality, \(y > z\) that have incompatible but nearly compatible exponents, such that when multiplied by \(x\) they become compatible. By compatible I mean we don't need to right-shift away all of the smaller number's mantissa when we add the numbers, and by incompatible I mean that we do have to shift away all those digits. It's probably possible to specify the conditions on \(x, \y\) and \(z\) more precisely, but that sounds like a lot of fiddly work that I have no reason or interest to do at the moment.

So, bottom line: If you write code that touches floating point numbers, you can't depend on associativity or distributivity. These laws that work for real numbers don't work for the finite numbers we can actually compute with.

You might also like...

Things I learned calling shm_open() on a Mac

Lessons learned from battling Godot 3 for twelve hours

Beware map indexing in C++

Eval game rule 10

You could have invented PEM encoding