Boot.Sys: Floating Point Binary

I've discussed how to use binary to represent integers on a computer. There are times, though, when we want to represent real numbers and not just integers. Scientific computing in particular uses real numbers extensively. The usual way to do this is floating point representation.

Floats and Doubles

You've probably used floats and doubles before in Java or some other language. You also probably know about their quirks - that they can't represent all real numbers. Pi, for instance, can only be approximated because its true value has infinitely many digits, and a computer only has a finite number of bits to store them in. You will also, on occasion, run into situations where a value is just slightly off. For example, if you type 1.1 + 1.1 + 1.1 + 4.0 into Python, you'll get 7.300000000000001. This is an artifact of the way real numbers are stored: 1.1 has no exact binary representation, so each operand is already slightly off before the addition happens.
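One way to see the approximation directly, as a small sketch using Python's standard decimal module: converting a float to a Decimal shows the exact value that was actually stored, which can then be compared with the value we meant to write. The variable names here are just for illustration.

```python
from decimal import Decimal

# Decimal(float) converts the stored binary value exactly, with no rounding.
stored = Decimal(1.1)
# Decimal(str) keeps the decimal digits exactly as written.
intended = Decimal("1.1")

print(stored)              # a long expansion close to, but not exactly, 1.1
print(stored == intended)  # False: the stored value is only an approximation
```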

A float uses 32 bits to represent a real number. It consists of 1 sign bit, 8 exponent bits, and 23 bits of mantissa or fraction. The sign bit is self-explanatory - a zero means positive, a one means negative. It is the leftmost bit just as in signed magnitude representation. The exponent bits are stored in a format called excess-k.
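To make the layout concrete, here is a minimal sketch that pulls the three fields out of a 32-bit float using Python's standard struct module. The function name float32_fields is my own, not something from a library.

```python
import struct

def float32_fields(x):
    """Return (sign, exponent, fraction) of x stored as a 32-bit float."""
    # Pack x as a big-endian 32-bit float, then reinterpret the same
    # four bytes as an unsigned integer so we can slice out the bits.
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    sign = raw >> 31               # leftmost bit
    exponent = (raw >> 23) & 0xFF  # next 8 bits, still in excess-127 form
    fraction = raw & 0x7FFFFF      # remaining 23 bits
    return sign, exponent, fraction

sign, exponent, fraction = float32_fields(-1.75)
print(sign, format(exponent, "08b"), format(fraction, "023b"))
# 1 01111111 11000000000000000000000
```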

Excess-k is yet another way to encode signed integers in binary. The k is a parameter, and in the case of 32-bit floats it is set to 127. To encode an integer in excess-k, add k to it and then encode the sum as an unsigned binary number. For example, if we want an exponent of -10, we would add 127 to get 117 and encode that in binary (01110101).
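The encoding step can be sketched in a few lines. The helper to_excess_k and its width parameter are names I made up for this example.

```python
def to_excess_k(value, k=127, width=8):
    """Encode a signed integer as an excess-k bit string of the given width."""
    biased = value + k  # shift the value so the result is never negative
    if not 0 <= biased < 2 ** width:
        raise ValueError("value does not fit in the requested width")
    return format(biased, "0{}b".format(width))

print(to_excess_k(-10))  # 01110101
```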

The mantissa/fraction works a bit differently. Instead of whole numbers, each bit represents a negative power of two. The leftmost bit is worth 0.5 (2^-1), the next 0.25 (2^-2) and so on. There is an implicit 1.0 added to this number, so if only the two leftmost bits are set the mantissa has a value of 1.75 (1.0 + 0.5 + 0.25).
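As a rough sketch of that rule, the function below (mantissa_value is a name chosen for this example) adds up the negative powers of two for each set bit and includes the implicit 1.0.

```python
def mantissa_value(bits):
    """Value of a mantissa bit string, including the implicit leading 1.0."""
    total = 1.0
    # The first bit is worth 2^-1, the second 2^-2, and so on.
    for position, bit in enumerate(bits, start=1):
        if bit == "1":
            total += 2.0 ** -position
    return total

print(mantissa_value("11" + "0" * 21))  # 1.75
```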

In total, the sign bit, exponent bits, and mantissa work together to represent a real number. The exponent is applied to 2 as its base (e.g. an exponent of 10 implies 2¹⁰), and the result is multiplied by the mantissa to recover the original value. Finally, the sign bit decides whether that value is positive or negative.
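Putting the three fields together might look like the sketch below. The function decode_float32 is a hypothetical helper, and it only handles the ordinary case described here. It deliberately ignores the special bit patterns (such as an all-zero or all-one exponent field) that real floats reserve for other purposes.

```python
def decode_float32(sign, exponent_bits, mantissa_bits):
    """Decode an ordinary 32-bit float from its three fields."""
    # Exponent: read as unsigned binary, then remove the excess of 127.
    exponent = int(exponent_bits, 2) - 127
    # Mantissa: the fraction bits scaled into [0, 1), plus the implicit 1.0.
    mantissa = 1.0 + int(mantissa_bits, 2) / 2 ** len(mantissa_bits)
    value = mantissa * 2.0 ** exponent
    return -value if sign == 1 else value

# Exponent field 10000001 is 129, so the exponent is 2; the mantissa is 1.75.
print(decode_float32(0, "10000001", "11" + "0" * 21))  # 7.0
```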

Conversions

To convert a binary number from excess-k to decimal, use the multiplication method as normal, then subtract k. For example, 00010010 is 18 as an 8-bit unsigned number, so in excess-127 it represents 18 - 127 = -109. To go in the other direction, add k and use the division method for unsigned binary.
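The same conversion as a short sketch. I'm assuming here that the multiplication method amounts to weighting each bit by its power of two, and from_excess_k is just a name for this example.

```python
def from_excess_k(bits, k=127):
    """Decode an excess-k bit string into a signed integer."""
    unsigned = 0
    # Doubling the running total before adding each bit gives every bit
    # its positional weight (the leftmost bit ends up worth the most).
    for bit in bits:
        unsigned = unsigned * 2 + int(bit)
    return unsigned - k

print(from_excess_k("00010010"))  # -109
```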

For the mantissa, the multiplication method still applies. The only difference is that we are now working with fractions (each bit is multiplied by a negative power of two), and we add 1.0 to the result. Going from a decimal number to the mantissa, however, takes a slightly different method than the one we have used for everything else.

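First we subtract 1.0 from the decimal number (which should already be scaled to lie between 1.0 and 2.0). This leaves the fractional part to be encoded in binary. After that, we multiply by 2.0 for each bit. If after multiplying the result is greater than or equal to 1.0, we set that bit to 1 and subtract 1.0. Otherwise, we set that bit to 0. The bits come out most significant first, so when the steps are written out one per line the result is read from top to bottom, not bottom to top as with the division method.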
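Here is a sketch of that procedure. The name mantissa_bits and the width parameter are my own. Note that the comparison is >= 1.0, so a product of exactly 1.0 sets the bit, and that this version simply stops after the requested number of bits rather than rounding the last one.

```python
def mantissa_bits(value, width=23):
    """Fraction-field bits for a value in the range [1.0, 2.0)."""
    if not 1.0 <= value < 2.0:
        raise ValueError("value must be at least 1.0 and less than 2.0")
    remainder = value - 1.0  # drop the implicit leading 1.0
    bits = []
    for _ in range(width):
        remainder *= 2.0
        if remainder >= 1.0:
            bits.append("1")
            remainder -= 1.0
        else:
            bits.append("0")
    # Bits are produced most significant first, so no reversal is needed.
    return "".join(bits)

print(mantissa_bits(1.75, width=8))  # 11000000
```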

Put these two things together and you understand how to encode/decode floating point binary.

A 64-bit floating point number (a double) works exactly the same way, except with a different number of bits in each field. There is 1 sign bit, 11 exponent bits (encoded in excess-1023), and 52 mantissa bits. Use the same methods to convert to and from decimal.
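For comparison, a sketch of the same field split for a double, again with Python's standard struct module. The function name float64_fields is hypothetical, and like the rest of this post it only looks at ordinary values.

```python
import struct

def float64_fields(x):
    """Return (sign, exponent, fraction) of x stored as a 64-bit double."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = raw >> 63                  # leftmost bit
    exponent = (raw >> 52) & 0x7FF    # next 11 bits, in excess-1023 form
    fraction = raw & ((1 << 52) - 1)  # remaining 52 bits
    return sign, exponent, fraction

sign, exponent, fraction = float64_fields(-1.75)
# Remove the excess of 1023 and add the implicit 1.0 to the fraction.
print(sign, exponent - 1023, 1.0 + fraction / 2 ** 52)  # 1 0 1.75
```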

Monday, January 30, 2017
