That's it, the point has sailed! Learn how floating-point numbers work and build an alternative: decimal fractions with fixed precision.



Today we'll talk about real numbers, or more precisely, about how the processor represents them when calculating fractional quantities. Each of us has run into output like 3.4999990123 instead of 3.5, or worse, a huge gap between the theoretical result and what the program code actually produced. There is no terrible secret here: we will discuss the pros and cons of the floating-point approach, consider the alternative fixed-point path, and write a class for decimal fractions with fixed precision.

Where does the point go


It's no secret that processors did not always understand real numbers. At the dawn of the programming era, before the first coprocessors appeared, real numbers were not supported in hardware and were emulated algorithmically on top of integers, which the processor handled well. Thus the real type in good old Pascal, the progenitor of today's real types, was a layer over an integer whose bits were logically interpreted as the mantissa and exponent of a real number.

The mantissa is essentially the number written without a point. The exponent is the power to which a certain number N (usually N = 2) must be raised so that, multiplied by the mantissa, it yields the desired number (up to the bit width of the mantissa). It looks something like this:

x = m * N ^ e, where m and e are integers written in binary in the bits allocated to them.

To avoid ambiguity, it is assumed that 1 <= |m| < N, that is, the number is written as if it had a single digit before the point, but then the point was maliciously deleted and the number turned into an integer.


Until recently, developers tried to avoid floating-point operations, like all algorithms involving real numbers. A coprocessor for real-number arithmetic was not present on every machine, and where it was, it did not always work efficiently. But times have changed: floating-point operations are now built into the processor core, and video chips also crunch real numbers intensively, parallelizing operations of the same type.

Modern real numbers, supported in hardware at the processor level, are likewise divided into a mantissa and an exponent. All the operations familiar from integer arithmetic are, of course, also implemented as processor instructions for real numbers and execute as fast as possible.

All this complexity exists because such a format, first, makes multiplication and division quite efficient and, second, makes it easy to recover the original real number it represents. This representation is called a floating-point number.

Diving Standard


Floating-point real numbers supported at the processor level are described by a dedicated international standard, IEEE 754. The two main types for any calculation are single-precision and double-precision floating-point numbers. These names directly reflect the bit width of their binary representations: 32 bits are allocated for single precision and, strangely enough, twice as many, 64 bits, for double precision.

In addition to single and double precision, the newer edition of the standard, IEEE 754-2008, also provides extended-, quadruple-, and even half-precision types. However, in C/C++, besides float and double, there is only the long double type, which Microsoft stubbornly refuses to support: Visual C++ substitutes a regular double for it. Processor cores, as a rule, still do not support the half- and quadruple-precision types either. Therefore, when choosing a floating-point representation, you really only choose between float and double.
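
If you want to check what your own toolchain does, a minimal probe is enough; the output below is an assumption that holds for typical x86-64 compilers:

#include <cstdio>

int main()
{
    // sizes in bytes of the three built-in floating-point types
    std::printf("float: %zu, double: %zu, long double: %zu\n",
                sizeof(float), sizeof(double), sizeof(long double));
    // GCC/Clang on x86-64 typically print "float: 4, double: 8, long double: 16",
    // while Visual C++ prints "float: 4, double: 8, long double: 8"
    return 0;
}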

According to the standard, 2 is taken as the base of the exponent, so the formula above reduces to the following:

x = m * (2 ^ e), where 1 <= |m| < 2 and e is an integer

The layout in bits in single precision numbers looks like this:

| 1 sign bit | 8 exponent bits | 23 mantissa bits |

For double precision, we can use more bits:

| 1 sign bit | 11 exponent bits | 52 mantissa bits |

In both cases, a sign bit of 0 means the number is positive, and 1 is set for negative numbers. This is similar to integers, with one difference: unlike integers, to obtain the additive opposite of a number it is enough to invert the single sign bit.
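
To see that negation really is a single bit flip, here is a small sketch that inverts the sign bit of a float through its binary representation (memcpy is used to stay clear of strict-aliasing issues):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    float x = 3.5f;
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits); // reinterpret the float as raw bits
    bits ^= 0x80000000u;                 // invert only the sign bit
    std::memcpy(&x, &bits, sizeof x);
    std::printf("%g\n", x);              // prints -3.5
    return 0;
}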

Since the mantissa is written in binary and is normalized, its integer part is always equal to 1, so that leading bit is implied and not stored in the binary record. The mantissa bits store exactly the fractional part of the normalized number in binary notation.


You don't need a doctorate to work out the decimal precision of numbers representable in this standard: 2 ^ (23 + 1) = 16 777 216, which clearly tells us that the precision of single-precision real numbers reaches just over seven decimal digits. This means we cannot store, say, the number 123456.78 in this format: a small number, in general, but starting from the hundredths it already comes out wrong. The situation is worse for large numbers of the form 1 234 567 890, which fit perfectly even into a 32-bit integer but in single precision come out with an error of hundreds of units! Therefore in C++, for example, the double type is used for real numbers by default. The mantissa of a double-precision number holds more than 15 digits: 2 ^ 52 = 4 503 599 627 370 496, which quietly accommodates any 32-bit integer and fails only on really large 64-bit integers (19 decimal digits), where an error of hundreds of units is, as a rule, already insignificant. If you need more precision than that, this article will certainly help.
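
Both effects are easy to reproduce. A couple of lines show the stored value of 123456.78f drifting from the written one, and the error of hundreds of units on 1234567890 (the exact digits printed may vary slightly by implementation):

#include <iostream>
#include <iomanip>

int main()
{
    float small = 123456.78f;
    float big   = 1234567890.0f;
    std::cout << std::setprecision(12)
              << small << '\n'   // something like 123456.78125
              << big   << '\n';  // something like 1234567936
    return 0;
}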

Now for the exponent. It is an ordinary binary integer: the power to which 2 must be raised before multiplying by the normalized mantissa to get the original number. The standard additionally introduces an offset that must be subtracted from the stored binary value to obtain the actual power of two (the so-called biased exponent). The exponent is biased to simplify comparison operations: the offset is 127 for single precision and 1023 for double precision. All this sounds terribly complicated, and many people skip the chapter on floating-point types because of it. But in vain!

Approximate swimming


To make it a little clearer, consider an example. We encode the number 640 (= 512 + 128) in binary form as a single-precision real number:

  • the number is positive, so the sign bit will be 0;
  • to get the normalized mantissa, we divide the number by 512, the largest power of two not exceeding it: 640/512 = 512/512 + 128/512 = 1 + 1/4, which is 1.01 in binary, so the mantissa bits will be 01000000000000000000000;
  • to get 640 back from 1 + 1/4, the exponent must be 9, since 2 ^ 9 = 512 is exactly what we divided by when normalizing the mantissa; in binary it must be stored with the offset, and for single precision we add 127: 127 + 9 = 128 + 8, which in binary is written as 10001000.

For double precision almost everything is the same, but the mantissa will contain even more zeros on the right in its fractional part, and the exponent will be 1023 + 9 = 1024 + 8, that is, with a few more zeros between the most significant bit and the low bits of the exponent: 10000001000. In general, it is all not so scary if you take the time to work through it carefully.
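
Here is a sketch to verify the example: pull the encoding of 640.0f apart bit by bit. The expected field values follow from the layout above:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    float x = 640.0f;
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    std::printf("sign = %u\n", bits >> 31);               // 0
    std::printf("exponent = %u\n", (bits >> 23) & 0xFFu); // 136 = 127 + 9, i.e. 10001000
    std::printf("mantissa = 0x%06X\n", bits & 0x7FFFFFu); // 0x200000, i.e. 01 followed by zeros
    return 0;
}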

Homework: work out the binary notation of the following constants: plus and minus infinity (INF, infinity), zero, minus zero, and the number-that-is-not-a-number (NaN, not-a-number).

Don’t swim for buoys!


There is one important rule: each number representation format has limits beyond which you must not swim. Moreover, it is up to the programmer to ensure those limits are not crossed, because a C/C++ program keeps a perfectly calm face when it returns a small negative number as the sum of two large positive integers. But while for integers only the maximum and minimum values need to be watched, for floating-point real numbers you should pay attention not so much to the maximum values as to the precision of the representation. Thanks to the exponent, the maximum double-precision floating-point number exceeds 10^308, and even the single-precision exponent can encode numbers beyond 10^38. Compared with the "miserable" 10^19 maximum of 64-bit integers, you can conclude that the maximum and minimum values will hardly ever need to be taken into account, although you should not forget about them either.


Precision is another matter. A miserable 23 bits of mantissa produce an error as early as the 8th significant decimal digit. For double-precision numbers the situation is not as deplorable, but even 15 decimal digits quickly become a problem if, for example, data processing requires 6 fixed digits after the point: that leaves only 9 digits for the integer part. Any multi-billion amounts will then carry a noticeable error in their fractional part. Under intensive processing of such numbers, billions of euros can simply vanish because they "did not fit", and the fractional-part errors add up into a huge balance of unaccounted funds.

If only this were just theory! In practice not even a thousandth of a cent may disappear: the error of all operations must be strictly zero. That is why business logic is, as a rule, written not in C/C++ but in C# or Python, where a Decimal type that handles decimal fractions with zero error at a specified number of decimal places is already built into the language or standard library. What are we C++ programmers to do when faced with processing such numbers precisely, without resorting to high-level languages? The same as usual: fill the gap by writing a small data type for high-precision decimal fractions, similar to the Decimal types of high-level libraries.

Cementing the floating point


It's time to fix the point in place. Since we have decided to give up the floating-point type because of its precision problems, we are left with integer types, and since we need maximum width, that means 64-bit integers.

Today, for educational purposes, we will look at how to create a representation of real numbers with guaranteed precision up to 18 digits after the point. It is achieved by simply combining two 64-bit integers for the integer and fractional parts respectively. In principle, nothing prevents us from taking an array of values for each component instead of a single number and getting full "long" arithmetic, but for now it will be more than enough to solve the precision problem with 18 digits before and after the decimal point, fixing the point between the two values and pouring cement over it.

Pour decimal to me too!


First, a little theory. We denote our two components, the integer and fractional parts of the number, as n and f, and the number itself will be representable in the form

x = n + f * 10^-18, where n and f are integers, 0 <= f < 10^18.

For the integer part, the signed 64-bit integer type suits best, and for the fractional part, the unsigned one; this will simplify many operations later.

class decimal
{
	…
private:
	int64_t  m_integral;
	uint64_t m_fractional;
};

The integer part here is the largest integer not exceeding the represented number; the fractional part is the result of subtracting the integer part from the number, multiplied by 10^18 and reduced to an integer: f = (x - n) * 10^18.

For negative numbers the integer part will be larger in magnitude than the number itself, and the fractional part will not match the decimal notation of the number; for example, for the number -1.67 the components will be n = -2 and f = 0.33 * 10^18. But such a representation lets us simplify and speed up the addition and multiplication algorithms, since no branching is needed for negative numbers.
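
A quick check of this decomposition, as a hypothetical throwaway snippet mirroring the double-to-decimal constructor shown later (the low digits of f will carry double's usual noise):

#include <cmath>
#include <cstdint>
#include <cstdio>

int main()
{
    double x = -1.67;
    int64_t  n = static_cast<int64_t>(std::floor(x));         // -2
    uint64_t f = static_cast<uint64_t>((x - n) * 1e18 + 0.5); // ~0.33 * 10^18
    std::printf("n = %lld, f = %llu\n",
                static_cast<long long>(n),
                static_cast<unsigned long long>(f));
    return 0;
}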

Decimal Type Operations


Of course, a type for numbers with increased precision would be useless without arithmetic operations. Addition is relatively simple:

x = a + b * 1e-18,
y = c + d * 1e-18,
z = x + y = e + f * 1e-18,
a, c, e: int64_t;
b, d, f: uint64_t;
0 <= b, d, f < 1e+18,
z = (a + b * 1e-18) + (c + d * 1e-18)
e = a + c + [b * 1e-18 + d * 1e-18]
f = {b * 1e-18 + d * 1e-18} * 1e+18

NB: here and below, all entries of the form 1e+NN denote integer values.

Here [n] denotes taking the integer part of a number and {n} the fractional part. All would be fine, but remember the limits of integer types. The value 1e+18 is already close to the edge of the unsigned 64-bit type uint64_t (which is why we chose it), but nothing prevents us from simplifying the expression a little so as to stay within the type's bounds, guaranteed by the initial conditions:

e =  a + c + (b + d) div 1e+18,
f = (b + d) mod 1e+18.

When implementing operations on numbers, you should always keep two things in mind, since such code is used intensively. First, always optimize the algorithm to minimize multiplications and divisions, so it is worth simplifying the expression mathematically in advance; in our case everything reduces to integer division with remainder. Second, check every possible overflow situation where a value leaves the bounds of the underlying type, otherwise you will get very non-obvious errors when using your type.

Of course, the boundary values should be checked when adding a and c. Also, since b and d are each less than 1e+18, we know that (b + d) < 2 * 1e+18, which means at most one "carry" can occur; therefore the division can be dropped from the algorithm entirely to optimize execution speed:

e = a + c;
f = b + d;
if (f >= 1e+18) f -= 1e+18, ++e;

Here the overflow checks on the value of e are omitted for simplicity of presentation.
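
Put together as a member operator, addition might look like the sketch below. This is an assumption about how the class could expose it; the overflow checks on m_integral are again omitted, and subtraction, discussed next, mirrors this with a borrow instead of a carry:

decimal decimal::operator+(const decimal& other) const
{
    decimal result(*this);
    result.m_integral   += other.m_integral;   // e = a + c
    result.m_fractional += other.m_fractional; // f = b + d, at most one carry
    if (result.m_fractional >= 1000000000000000000ull) // 1e+18
    {
        result.m_fractional -= 1000000000000000000ull;
        ++result.m_integral; // carry into the integer part
    }
    return result;
}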

Subtraction is a little more complicated: now you have to watch for wrap-around below zero in the unsigned fractional part, which means comparing the two values before subtracting.

e = a - c;
if (b >= d) f = b - d;
else f = (1e+18 - d) + b, --e;

In general, everything is simple so far. Everything is always simple before multiplication and division.

Fixed Precision Multiplication


Consider multiplication first. Since the fractional parts are quite large and, as a rule, live in the immediate vicinity of 1e+18, we either have to drop into assembler and operate on wide registers, or make do with plain integers by splitting each component into two parts of at most 1e+9. Then each partial product stays below 1e+18: not the fastest approach, but a 64-bit integer suffices and we remain inside C++. Do you remember school and long multiplication in a column? If not, now is the time to remember:

a = sa * (a1 * 10^9 + a2); b = b1 * 10^9 + b2;
c = sc * (c1 * 10^9 + c2); d = d1 * 10^9 + d2;
0 <= a2, c2 < 10^9; 0 <= b1, b2, d1, d2 < 10^9;
sa = sign(a), sc = sign(c);
0 <= a1, c1 < MAX_INT64 / 10^9

We introduce a matrix to simplify the calculation of multiplication:

U = (a1, a2, b1, b2),
V = (c1, c2, d1, d2)^T,
A = V * U,
A =
| a1*c1  a2*c1  b1*c1  b2*c1 |
| a1*c2  a2*c2  b1*c2  b2*c2 |
| a1*d1  a2*d1  b1*d1  b2*d1 |
| a1*d2  a2*d2  b1*d2  b2*d2 |

The matrix is introduced not so much for ease of calculation as for ease of verification. Indeed, A11 = a1 * c1 must be strictly less than MAX_INT64 / 10^18, and the values one diagonal below, A12 = a2 * c1 and A21 = a1 * c2, must be strictly less than MAX_INT64 / 10^9, simply because these are the coefficients we multiply by when summing the components:

e = A11 * 10^18 + (A12 + A21) * 10^9 + (A13 + A22 + A31) + (A14 + A23 + A32 + A41) div 10^9,
f = ((A14 + A23 + A32 + A41) mod 10^9) * 10^9 + (A24 + A33 + A42) + (A34 + A43) div 10^9

Here we omit the term A44 div 10^18 simply because it is always zero. Of course, before each addition it is worth checking whether we run past MAX_INT64, and whether f has accumulated a carry of 1e+18 to pass into e. Fortunately, we can use the unsigned uint64_t type for all matrix components and for the intermediate result. All that remains at the end is to determine the sign of the result, se = sa xor sc, and, for a negative number, correct the integer and fractional parts: decrease the integer part by one and subtract the fractional part from 1e+18. That, in general, is all there is to multiplication; the main thing is to be very careful. In assembler everything is much simpler, but that material is beyond the scope of the C++ Academy.
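
As an illustration of the splitting trick, here is a hedged sketch (the helper name is just a suggestion) that multiplies only the two fractional parts b and d without any 128-bit support, returning the product as hi * 1e+18 + lo, where hi lands in the 10^-18 digits of the result and lo is the 10^-36 tail to be rounded or dropped:

#include <cstdint>
#include <utility>

// multiply two fractional parts (b, d < 1e+18) as b*d = hi * 1e+18 + lo,
// using only 64-bit arithmetic and the split into 1e+9 halves
std::pair<uint64_t, uint64_t> mul_fractional(uint64_t b, uint64_t d)
{
    const uint64_t base = 1000000000ull; // 1e+9
    uint64_t b1 = b / base, b2 = b % base;
    uint64_t d1 = d / base, d2 = d % base;
    uint64_t cross = b1 * d2 + b2 * d1;            // < 2e+18, still fits uint64_t
    uint64_t hi = b1 * d1 + cross / base;
    uint64_t lo = (cross % base) * base + b2 * d2; // may reach 2e+18, so...
    hi += lo / (base * base);                      // ...propagate the carry
    lo %= base * base;                             // base * base == 1e+18
    return {hi, lo};
}

In the full multiplication the same pattern is applied across all sixteen partial products of the matrix A.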

Division algorithm without registration and SMS


If you remember the long division algorithm from school, well done, but it will not be needed here. Thanks to mathematics and a little witchcraft with inequalities, it is easier to compute the reciprocal x^-1 and then perform a multiplication by x^-1. So we solve the problem:

y = x^-1 = 1 / (a + b * 10^-18) = e + f * 10^-18.

To simplify, consider finding the reciprocal of a positive x. If at least one of the components of x is zero (but not both at once), the calculation is greatly simplified. If a = 0, then:

y = 1 / (b * 1e-18) = 1e+18 / b,
e = 1e+18 div b,
f = 1e+18 mod b; // the remainder, expanded into decimal digits the same way as in the general case below
if b = 0 and a = 1, then y = e = 1, f = 0;
if b = 0 and a > 1, then:
y = 1 / a,
e = 0, f = 1e+18 div a.

In the more general case, when x has nonzero fractional and integer parts, the equation reduces to the following:

if a >= 1 and b != 0, then:
y = 1 / (a + b * 1e-18) < 1,
hence e = 0,
f = 1e+18 / (a + b * 1e-18).

Now we need to find the largest power of ten not exceeding a and iteratively perform the following:

k = max(k): 1e+k <= a,
u = 1e+18, v = a * 1e+(18-k) + b div 1e+k;
f = (u div v) * 1e+(18-k);
for (++k; k <= 18; ++k)
{
	u = (u % v) * 10;
	if (!u) break; // only zeros from here on
	f += u / v * 1e+(18-k);
}

Here we simply multiply and divide the fraction by the same factor, a power of ten, and then step by step compute the quotient and remainder for each next power of ten.

It is very useful to keep an array of the powers of ten from 0 to 18 inclusive: there is no need to compute them, we know them in advance, and we will need them often.
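
Something along these lines will do; a constexpr helper plus a static_assert keeps the table honest (the names are just a suggestion, and the constexpr loop assumes C++14 or later):

#include <cstdint>

// powers of ten 10^0 .. 10^18; 10^18 still fits in uint64_t
constexpr uint64_t pow10(unsigned k)
{
    uint64_t result = 1;
    for (unsigned i = 0; i < k; ++i)
        result *= 10;
    return result;
}

static_assert(pow10(18) == 1000000000000000000ull,
              "10^18 must fit in uint64_t");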

Type conversions


We now know and can do enough to turn the vague float and double into our brand-new decimal.

decimal::decimal(double value)
    : m_integral(static_cast<int64_t>(std::floor(value))),
      m_fractional(static_cast<uint64_t>(std::floor(
          (value - m_integral) * 1e+18 + 0.5)))
{
    normalize();
}

void decimal::normalize()
{
    const uint64_t unit = 1000; // 1e+3, the noise margin of double (see below)
    uint64_t tail = m_fractional % unit;
    if (tail)
    {
        if (tail > unit / 2)
            m_fractional += unit - tail;
        else
            m_fractional -= tail;
    }
}

Here 10^3 is, in fact, the error margin beyond which double stops being precise. The margin could be reduced further if desired; 10^(18-15) is taken here for clarity of presentation. Normalization after the conversion is needed in any case, since the precision of double is clearly lower than even that of the fractional part of decimal alone. In addition, you need to handle the case when the double goes beyond int64_t: under such conditions our decimal cannot correctly convert the integer part of the number.

For float everything looks similar, but the error margin is much larger: 10^(18-7) = 10^11.

All integers convert to decimal without any problems by simply initializing the m_integral field. Converting back for integers likewise just returns m_integral; you can add rounding based on m_fractional.

The conversion from decimal to double and float comes down to the above formula:

return m_integral + m_fractional * 1e-18;

Conversion to and from a string deserves separate consideration. The integer part converts to a string as is; after that it only remains to append the decimal separator and print the fractional part as a whole number, left-padded with zeros to its full 18 digits and with trailing zeros discarded. You can also introduce an m_precision "precision" field and output only the number of decimal places it specifies.
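
A possible to-string sketch under the same representation; negative numbers and the m_precision field are deliberately left out, since for a negative value you would first convert the floor-based pair back to sign-and-digits form:

#include <cstdint>
#include <string>

// format a non-negative decimal given as (integral, fractional)
std::string decimal_to_string(int64_t integral, uint64_t fractional)
{
    std::string result = std::to_string(integral);
    std::string frac = std::to_string(fractional);
    frac.insert(0, 18 - frac.size(), '0');        // left-pad to all 18 digits
    while (frac.size() > 1 && frac.back() == '0') // discard trailing zeros
        frac.pop_back();
    if (frac != "0")
        result += '.' + frac;
    return result;
}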

Reading from a string is the same thing in reverse. The only difficulty here is that the sign, the integer part, the separator between the integer and fractional parts, and the fractional part itself are all optional, and this must be taken into account.

In general, I leave you complete freedom in implementing this class.

GITHUB


The article is accompanied by several source files with one possible implementation of decimal, as well as a small test of real numbers for better assimilation of the material.

Don't float away, period!


In conclusion, I will only say that the need for such a type in C/C++ arises only in very specific tasks. As a rule, high-precision number problems are solved in languages such as Python or C#, but if you really need 15-18 digits both before and after the decimal point, feel free to use this type.

The resulting decimal type solves the precision problems of real numbers and has a large margin of possible values, covering all of int64_t. On the other hand, the double and float types cover a wider range of values and perform arithmetic at the level of processor instructions, that is, as fast as possible. Try to get by with the hardware-supported types and not reach for decimal unnecessarily. But do not be afraid to use this type when you need exact, lossless calculation.

The knowledge of the binary representation of floating-point numbers gained in this article will also serve you well. Knowing the pros and cons of the double and float formats, you will always make the right decision about which type to use. After all, perhaps what you really need is an integer storing mass not in kilograms but in grams. Be attentive to precision, and precision will certainly be attentive to you!


First published in Hacker Magazine #192.
Posted by Vladimir "Qualab" Kerimov, lead C++ developer at Parallels


