Understanding floating point numbers (part 0)

Hello, Habr readers. I have been fascinated by floating-point numbers for a long time. I was always curious about how they are printed to the screen, and so on. I remember that a long time ago, at university, I implemented my own class of 512-bit floating-point numbers. The one thing I never managed to implement was printing them to the screen.

As soon as I had some free time, I went back to the old project. I got myself a notebook and off I went. I wanted to work everything out myself, glancing at the IEEE 754 standard only occasionally.
Here is what came of it. If you are interested, read on.

To follow this article, you need to know what a bit is, how the binary system works, and arithmetic up to and including negative powers. The article does not cover the engineering details of the processor-level implementation, nor normalized and denormalized numbers. The emphasis is on converting a number to binary form and back, and on explaining how floating-point numbers are stored as bits.

Floating-point numbers are a very powerful tool that you need to know how to use correctly. They are not as trivial as integers, but they are not that complicated either if you study them carefully and without hurry.

In today's article, I will use 32-bit numbers as an example. Double-precision (64-bit) numbers follow exactly the same logic.

First, let's talk about how floating-point numbers are stored. The most significant bit, bit 31, is the sign bit. A one means the number is negative, and a zero means it is positive. Next come the 8 bits of the exponent; these 8 bits form an ordinary unsigned number. At the very end are the 23 bits of the mantissa. For convenience, we denote the sign as S, the exponent as E, and the mantissa, oddly enough, as M.

We obtain the general formula $ (-1)^S \times M \times 2^{E-127} $

The mantissa has one implicit leading bit. That is, the mantissa is really 24 bits long, but since its most significant bit (bit 23) is always one, it does not need to be stored. This "restriction" gives every number a unique representation.
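The layout described above can be sketched in C. This is only a minimal sketch, assuming that `float` is a 32-bit IEEE 754 single on your platform; the `FloatFields` struct and the `splitFloat` function are names invented here for illustration, and `memcpy` is used to copy the raw bits.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of a 32-bit float. */
struct FloatFields {
    uint32_t sign;      /* S: 1 bit, bit 31 */
    uint32_t exponent;  /* E: 8 bits, bits 30..23, unsigned */
    uint32_t mantissa;  /* 23 stored bits, bits 22..0, without the implicit one */
};

/* Split a float into S, E and the stored mantissa bits.
   Assumption: float is a 32-bit IEEE 754 single. */
struct FloatFields splitFloat(float value) {
    uint32_t bits;
    struct FloatFields fields;
    memcpy(&bits, &value, sizeof bits);      /* copy the raw bit pattern */
    fields.sign     = bits >> 31;
    fields.exponent = (bits >> 23) & 0xFFu;
    fields.mantissa = bits & 0x7FFFFFu;
    return fields;
}
```

For example, `splitFloat(1.0f)` yields S = 0, E = 127 and a stored mantissa of zero, because 1.0 is just the implicit bit multiplied by 2^0.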

The mantissa is an ordinary binary number, but unlike in an integer, its most significant bit has the weight 2^0, and the weights of the following bits decrease from there. This is where the exponent comes in handy. Depending on its value, the power of two assigned to the most significant bit increases or decreases. That is the whole genius of the idea.

Let's show this with a worked example:

Let's write the number 3.625 in binary form. First, we decompose this number into powers of two. $ 3.625 = 2 + 1 + 0.5 + 0.125 = 1 \times 2^1 + 1 \times 2^0 + 1 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3} $

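The power of the leading two is one, so E - 127 = 1 and E = 128. The binary digits are 1, 1, 1, 0, 1; the leading one is implicit, so the stored mantissa begins with 1101 and is padded with zeros.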

0 10000000 11010000000000000000000

That is our entire number.
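The same encoding can be assembled from its three fields in code. The sketch below assumes a 32-bit IEEE 754 `float`; `packFloat` is a hypothetical name chosen for this illustration, not part of any library.

```c
#include <stdint.h>
#include <string.h>

/* Pack S, E and the 23 stored mantissa bits into a float.
   Assumption: float is a 32-bit IEEE 754 single. */
float packFloat(uint32_t sign, uint32_t exponent, uint32_t mantissa) {
    uint32_t bits = ((sign & 1u) << 31)          /* bit 31: sign */
                  | ((exponent & 0xFFu) << 23)   /* bits 30..23: exponent */
                  | (mantissa & 0x7FFFFFu);      /* bits 22..0: stored mantissa */
    float value;
    memcpy(&value, &bits, sizeof value);         /* reinterpret the bits as a float */
    return value;
}
```

Calling `packFloat(0, 128, 0xDu << 19)` reproduces 3.625: the four bits 1101 sit at the top of the 23-bit mantissa field, and the remaining 19 bits are zero.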

Let's also try the opposite direction. Suppose we have 32 arbitrary bits.

0 10000100 (1) 11011100101000000000000

The implicit leading bit is shown in parentheses.

First, compute the exponent: E = 132, so the power of the leading two is 132 - 127 = 5. In total we get the following number:
$ 2^5 + 2^4 + 2^3 + 2^1 + 2^0 + 2^{-1} + 2^{-4} + 2^{-6} = $
$ = 32 + 16 + 8 + 2 + 1 + 0.5 + 0.0625 + 0.015625 = 59.578125 $
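The manual decoding above can be mirrored in code by adding one power of two for every set bit. This is a sketch under stated assumptions: `decodeBits` is a name made up for this example, it handles only normalized numbers (E from 1 to 254), and it accumulates in `double` so that every partial sum is exact.

```c
#include <stdint.h>

/* Rebuild the value of a normalized single from its raw 32 bits by
   summing the weights of the set mantissa bits.
   Not handled here: E = 0 (denormalized) and E = 255 (infinity, NaN). */
double decodeBits(uint32_t bits) {
    int exponent = (int)((bits >> 23) & 0xFFu) - 127;
    uint32_t mantissa = (bits & 0x7FFFFFu) | 0x800000u;  /* restore the implicit bit */
    double weight = 1.0;   /* will become 2^exponent, the weight of the leading bit */
    double sum = 0.0;
    int i;

    for (i = 0; i < exponent; i++) weight *= 2.0;
    for (i = 0; i > exponent; i--) weight /= 2.0;

    for (i = 23; i >= 0; i--) {
        if (mantissa & ((uint32_t)1 << i)) {
            sum += weight;     /* this bit contributes its power of two */
        }
        weight /= 2.0;         /* the next bit weighs half as much */
    }
    return (bits >> 31) ? -sum : sum;
}
```

With the bit pattern from the example written in hexadecimal, `decodeBits(0x426E5000u)` returns 59.578125.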

It is easy to see that the mantissa covers a range of only 24 powers of two. Accordingly, if the exponents of two numbers differ by more than 24, their sum is simply equal to the larger of the two.
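This loss of the smaller term is easy to observe. The sketch below uses a hypothetical helper, `isAbsorbed`, and assumes the default round-to-nearest mode; the `volatile` qualifier is there only to make sure the sum is really rounded to single precision before the comparison.

```c
/* Returns 1 if adding `small` to `big` in single precision
   leaves `big` unchanged, and 0 otherwise. */
int isAbsorbed(float big, float small) {
    volatile float sum = big + small;  /* volatile: force rounding to float */
    return sum == big;
}
```

For example, `isAbsorbed(33554432.0f, 1.0f)` reports that 1 is lost, because 33554432 is 2^25 and the exponents differ by 25, while `isAbsorbed(8388608.0f, 1.0f)` reports that it survives, because 8388608 is 2^23 and the sum still fits into 24 bits.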

For convenient conversion, I wrote a small program in C.

#include <stdio.h>
union IntFloat {
    unsigned int integerValue;
    float floatValue;
};
void printBits(unsigned int x) {
    int i;
    for (i = 31; i >= 0; i--) {
        if ((x & ((unsigned int)1 << i)) != 0) {
            printf("1");
        }
        else {
            printf("0");
        }
        if (i == 31) {
            printf(" ");
        }
        if (i == 23) {
            printf(" ");
        }
    }
    printf("\n");
}
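int main(void) {
    union IntFloat b0;
    /* Float to bits: print the sign, exponent and mantissa of 59.578125 */
    b0.floatValue = 59.578125F;
    printBits(b0.integerValue);
    /* Bits to float (0b binary literals are a compiler extension before C23) */
    b0.integerValue = 0b01000010011011100101000000000000;
    printf("%f\n", b0.floatValue);
    return 0;
}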

The grid step is the minimum difference between two adjacent floating-point numbers. If we interpret the bit sequence of such a number as an ordinary integer, then the bit pattern of the neighboring floating-point number, read as an integer, differs from it by one.

It can be put another way. Two adjacent floating-point numbers differ by $ 2^{E-127-23} $, that is, by the value of the least significant bit of the mantissa.

As a demonstration, you can replace the body of main in the code above with the following and compile it again.

    union IntFloat b0, b1, b2;
    b0.floatValue = 59.578125F;
    b1.integerValue = b0.integerValue + 1;
    b2.floatValue = b1.floatValue - b0.floatValue;
    printBits(b0.integerValue);
    printBits(b1.integerValue);
    printBits(b2.integerValue);
    printf("%f\n", b0.floatValue);
    printf("%f\n", b1.floatValue);
    printf("%f\n", b2.floatValue);
    short exp1 = 0b10000100; /* E of b0: 132 */
    short exp2 = 0b01101101; /* E of the difference: 109 = 132 - 23 */
    /* Edge case: the mantissa consists entirely of ones */
    b0.integerValue = 0b01000010011111111111111111111111;
    b1.integerValue = b0.integerValue + 1;
    b2.floatValue = b1.floatValue - b0.floatValue;
    printBits(b0.integerValue);
    printBits(b1.integerValue);
    printBits(b2.integerValue);
    printf("%f\n", b0.floatValue);
    printf("%f\n", b1.floatValue);
    printf("%f\n", b2.floatValue);
    /* Exponent values */
    printf("%d %d\n", exp1, exp2);
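The two descriptions of the grid step can also be compared directly. In the sketch below, both function names are invented for this example; it assumes a 32-bit IEEE 754 `float` and a positive normalized argument whose exponent field E is greater than 23, so that the step itself is a normalized number.

```c
#include <stdint.h>
#include <string.h>

/* Grid step measured directly: the distance to the float whose
   bit pattern, read as an integer, is larger by one. */
float stepFromNeighbour(float value) {
    uint32_t bits;
    float next;
    memcpy(&bits, &value, sizeof bits);
    bits += 1;                               /* next representable float */
    memcpy(&next, &bits, sizeof next);
    return next - value;
}

/* Grid step from the formula 2^(E - 127 - 23): a float with
   exponent field E - 23 and an all-zero mantissa. Requires E > 23. */
float stepFromExponent(float value) {
    uint32_t bits, stepBits;
    float step;
    memcpy(&bits, &value, sizeof bits);
    stepBits = (((bits >> 23) & 0xFFu) - 23u) << 23;
    memcpy(&step, &stepBits, sizeof step);
    return step;
}
```

For 59.578125 both functions give 2^-18, which matches the exponent values 132 and 109 printed by the listing above.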

I think that is enough for today, otherwise the article gets too long. Next time I will write about adding floating-point numbers and the loss of precision when rounding.

P.S. I realize that I have not touched on denormalized numbers and so on. I simply did not want to overload the article, and this information is easy to find near the very beginning of the IEEE 754 standard.
