Software internals: floating-point

Integers are the basis for all computer calculation, but they’re not the only kind of numbers out there. Floating-point numbers are just as important when interfacing with the real world. They represent decimals, fractional quantities, anything other than simple whole numbers. But too many programmers use them without understanding them, and that tends to go horribly wrong.

First off, let’s get one link out of the way. What Every Computer Scientist Should Know About Floating-Point Arithmetic describes everything you need in more technical and precise terms than I ever could. It’s a good read, and it’s important stuff, so check it out if you want the gritty details.

Now, we’re fortunate to live in today’s world, because floating-point numbers are essentially standardized. IEEE 754 (along with some later revisions) defines a common floating-point format that’s used pretty much everywhere in modern tech. If it isn’t, then it’s still assumed to be. So we’ll base our discussion on that.

The theory

Floating-point numbers work a bit like scientific notation. In decimal, you can write something like 1.42 × 10^4^, and that’s understood to be the same as 14,200. But computers work in binary, so we need a binary form of scientific notation. In this case, for example, we can write 14,200 in binary as 11011101111000, or 1.1011101111 × 2^13^.
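
If you want to see this notation straight from the machine, C's %a format (standard since C99) prints a double in exactly this binary scientific notation, just with the mantissa bits grouped into hex digits:

```c
#include <stdio.h>

int main(void) {
    /* %a prints the exact bits in hexadecimal scientific notation.
       The mantissa 1.1011101111 groups into hex digits as 1.bbc. */
    printf("%a\n", 14200.0);  /* prints 0x1.bbcp+13 */
    printf("%a\n", 2.5);      /* prints 0x1.4p+1, i.e., 1.01 x 2^1 */
    return 0;
}
```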

From there, we can create a way of packing this notation into a sequence of bits: floating-point numbers. What do we need? Well, each number can be either positive or negative, so we need some way of showing that. And we’ll have to store both the exponent (e.g., 13) and the mantissa (binary 1.1011101111). The base (2, for binary) can be implied, as we know we’re working with binary. Put those three parts—mantissa, exponent, and sign—together, and you’ve got floating-point.

The practice

But it’s not as easy as that, and that’s why we have standards. First, how many bits are you going to use? Too few, and you don’t have much range. Too many, and you waste space on inconsequential fractions. However, sometimes you need those less-significant bits, so you might want to have options. Luckily, the standard gives us two main options: 32 and 64 bits. Unluckily, a lot of programming languages (like JavaScript) limit you to the latter. Some, like C, give you the choice between float and double (the latter meaning “double precision”, because that’s about what you get with more bits), but high-level programmers often don’t have that luxury. Since the “big” high-level languages tend to use 64-bit floating-point, then, we’ll look at it first.
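
A quick way to see the difference in C, assuming the near-universal case where float is IEEE 754 32-bit and double is 64-bit:

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    /* FLT_DIG and DBL_DIG give the decimal digits each type can
       round-trip reliably: typically 6 and 15. */
    printf("float:  %zu bytes, %d safe decimal digits\n",
           sizeof(float), FLT_DIG);
    printf("double: %zu bytes, %d safe decimal digits\n",
           sizeof(double), DBL_DIG);
    return 0;
}
```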

Given our 64 bits, we need to divide them up among our three parts. The sign bit obviously only needs one, so that’s that. Of the remaining 63, the standard devotes 53 to the mantissa and 11 to the exponent. That lets us store binary exponents over 1000 and the equivalent of about 15 digits of precision. Add in a few special tricks (like denormalized numbers), and the total range runs from about 10^-324^ to 10^308^. Huge. (32-bit still nets you about 7 digits in a range of 10^±38^, which isn’t too shabby.)
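
Here’s a minimal sketch in C of pulling those three fields out of a double, again assuming IEEE 754 binary64. (You’ll notice only 52 mantissa bits physically fit; hold that thought.)

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    double d = 14200.0;  /* our 1.1011101111 x 2^13 from earlier */
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);  /* the safe way to reinterpret */

    uint64_t sign     = bits >> 63;                 /* 1 bit   */
    uint64_t exponent = (bits >> 52) & 0x7FF;       /* 11 bits */
    uint64_t mantissa = bits & 0xFFFFFFFFFFFFFULL;  /* 52 bits */

    /* Prints: sign=0 exponent=1036 mantissa=0xbbc0000000000
       (1036 looks strange for an exponent of 13; see below) */
    printf("sign=%llu exponent=%llu mantissa=%#llx\n",
           (unsigned long long)sign, (unsigned long long)exponent,
           (unsigned long long)mantissa);
    return 0;
}
```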

Now, those of you better at math may have noticed a calculation error above. That’s intentional. The way IEEE 754 works, it saves a bit by a clever ruse. In decimal scientific notation, as you may know, the number to the left of the decimal point can’t be zero, and it has to be less than 10. (Otherwise, you could shift the point left or right one more spot.) The same is true for binary, but with a binary 10, i.e., 2. But there’s only one binary number that fills that role: 1. With a few exceptions, you’re always going to have that 1, so why bother storing it?

The problem with this “implied” 1 comes when you have the one number that has no 1 anywhere in it. That, of course, is 0. But it’s okay, because the standard simply makes 0, well, 0. Exponent zero, mantissa zero. Sign…well, that’s different. Standard floating-point representation has two zeroes: negative and positive. They’re treated as equal essentially everywhere, but they do differ in that one sign bit.
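
A quick C sketch shows both faces of zero (signbit() is C99, from math.h):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double pos = 0.0, neg = -0.0;

    printf("0.0 == -0.0:   %d\n", pos == neg);         /* 1: equal */
    printf("signbit(-0.0): %d\n", signbit(neg) != 0);  /* 1: the bit is set */

    /* The sign can still leak out through division. */
    printf("1.0 / -0.0 = %f\n", 1.0 / neg);  /* -inf */
    return 0;
}
```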

The IEEE standard also does an odd thing with its exponents. Except for the case of a literal 0, every exponent is biased. For 64-bit numbers, the number 1023 is added to the exponent, so a number like 2.5 (binary 10.1 or 1.01 × 2^1^) would be stored as if it were 1.01 × 2^1024^. Why? Because it makes sorting and comparison easier, or so they claim.
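
You can watch the bias happen, under the same binary64 assumption as before:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    double d = 2.5;  /* 1.01 x 2^1 */
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);

    /* True exponent 1, plus the bias of 1023, should read 1024. */
    printf("stored exponent: %llu\n",
           (unsigned long long)((bits >> 52) & 0x7FF));  /* 1024 */
    return 0;
}
```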

In the rare event that you go outside the range, you get infinity. Like zero, it comes in two forms, one for each sign, but unlike the zeroes, the two infinities are genuinely different values: negative infinity compares less than everything else, and positive infinity greater.
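
There’s no wrapping like with unsigned integers, and no crash; the result just rounds off to infinity:

```c
#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void) {
    double big = DBL_MAX * 2.0;  /* past the top of the range */

    printf("DBL_MAX * 2.0 = %f\n", big);     /* inf */
    printf("isinf: %d\n", isinf(big) != 0);  /* 1 */

    /* The two infinities bracket everything else. */
    printf("-INFINITY < INFINITY: %d\n", -INFINITY < INFINITY);  /* 1 */
    return 0;
}
```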

And then there’s NaN. This is a special value used mainly to make programmers scream, but it also represents invalid results like dividing zero by zero or taking the square root of a negative number. NaN is special in that it’s a whole class of values (anything with all exponent bits set to 1 and a nonzero mantissa; all ones with a zero mantissa is infinity), but no NaN is ever considered equal to any other. NaN equals nothing, not even another NaN. It’s a null value and an error code at the same time, which is where things inevitably go wrong.
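
Here it is in action (you may need to link with -lm for sqrt):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double zero = 0.0;
    double nan1 = zero / zero;  /* invalid operation: NaN */
    double nan2 = sqrt(-1.0);   /* so is this */

    /* NaN compares unequal to everything, itself included. */
    printf("nan1 == nan1: %d\n", nan1 == nan1);  /* 0 */
    printf("nan1 == nan2: %d\n", nan1 == nan2);  /* 0 */

    /* isnan() is the reliable test. */
    printf("isnan(nan1): %d\n", isnan(nan1) != 0);  /* 1 */
    return 0;
}
```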

Care and feeding

NaN, though, is only one of the pitfalls of using floating-point. You also have to watch out for infinities, since they don’t play nice with finite numbers. Also, unless you have a really good reason for doing so (such as being John Carmack), you probably don’t want to mess with the bits themselves.
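
One practical habit: check values with isfinite(), which rejects NaN and both infinities in a single test, before they poison everything downstream. A sketch, with a hypothetical helper of my own invention:

```c
#include <stdio.h>
#include <stddef.h>
#include <math.h>

/* Hypothetical helper: average only the finite values in an array. */
static double finite_average(const double *xs, size_t n) {
    double sum = 0.0;
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (isfinite(xs[i])) {  /* skip NaN, +inf, -inf */
            sum += xs[i];
            count++;
        }
    }
    return count > 0 ? sum / count : 0.0;
}

int main(void) {
    double data[] = { 1.0, 2.0, NAN, INFINITY, 3.0 };
    printf("%f\n", finite_average(data, 5));  /* 2.000000 */
    return 0;
}
```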

More important than knowing how to use floating-point numbers is knowing when to use them. Or, rather, when not to. They do give you precision, often more than you need, but sometimes that’s not enough. Take the classic example of 1/3. In decimal, it’s an endless string of 3s. Binary changes that to a repeating pattern of 01, but the principle is the same. No matter how many digits or bits you’ve got, you’re never getting to the end. So the simple code 1.0 / 3.0 will never give you exactly 1/3. It can’t. The same goes for any other fraction whose denominator isn’t exactly a power of two. So, if you need exact representation of an arbitrary rational number, floating-point won’t help you.
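
You can see the shortfall by printing more digits than a double actually carries:

```c
#include <stdio.h>

int main(void) {
    printf("%.20f\n", 1.0 / 3.0);  /* 0.33333333333333331483 */
    printf("%.20f\n", 0.1 + 0.2);  /* 0.30000000000000004441 */

    /* Which is why this comparison famously fails: */
    printf("0.1 + 0.2 == 0.3: %d\n", 0.1 + 0.2 == 0.3);  /* 0 */
    return 0;
}
```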

For 1/100, it’s no different, and that’s why floating-point isn’t a great idea for money, either. Sure, for most simple purposes, it’s close enough, but those tiny errors do add up, especially when multiplication and division get involved. If you’re serious about your money, you won’t be storing how much you have in a floating-point number. Instead, you’ll likely want a decimal type, something a lot of business-oriented languages offer.
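
C has no built-in decimal type, but the classic workaround uses the same idea: store money as integer cents, so every value is exact. A minimal sketch:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    int64_t price_cents = 1999;       /* $19.99, stored exactly */
    int64_t total = 3 * price_cents;  /* integer math: no drift */

    printf("total: $%lld.%02lld\n",
           (long long)(total / 100),
           (long long)(total % 100));  /* total: $59.97 */
    return 0;
}
```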

In the general case, however, floating-point is the solution. You just have to know its limitations.
