Fixed point calculations. Basic principles (part 1)

Introduction, or why this topic


Reading Habrahabr, I came across two topics "bringing to light" the subtleties of floating-point calculations.
One of them gives a reasonably detailed and high-quality digest of the IEEE 754 standard and the main problems of floating-point calculations; the other is a short note observing that not everything is rosy when computing on a PC. Both recommend that when the mathematical accuracy of the result matters, you should use integer calculations, "fix the point", or at least verify the results produced by the platform (compiler + processor).
Although the advice is practical, figuring out how to use integer calculations where floating point used to be is not easy, especially without a mathematical background. In this respect, the attempt of one Habr user to tackle fixed point experimentally is quite interesting.
This topic is a brief introduction that should give an idea of fixed-point calculations. The mathematics in this article should scare no one: everything here is very elementary. One apology up front: among my colleagues the established term is "fixed point" (from the English fixed-point), not "fixed comma", so that is the term I will stick to.

Once again about the mantissa and the exponent


In computational mathematics, fractional values are represented as a pair of integers (n, e): the mantissa and the exponent (strictly speaking, "order" would be the more accurate term here, but for brevity and out of habit I will use the word "exponent"). The pair represents the fractional number n * 2^(-e).
The exponent can be thought of as the position of the binary point separating the fractional part of the number, i.e. the number of fractional bits.
If the exponent is a variable that is stored in a register and unknown at compile time, (n, e) is called a floating-point number. If the exponent is known in advance, (n, e) is called a fixed-point number. Fixed-point numbers can be stored in ordinary integer variables (registers) by keeping only the mantissa. The exponent is usually denoted by the letter q. So if you encounter something like "q15 multiplier" in a comment on a variable, you should treat that variable as a fixed-point number with an exponent of 15. I will return to the question of notation, as it appears in various source codes and articles, later on.
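To make the representation concrete, here is a minimal sketch of converting between a double and a q15 number stored in a 32-bit integer (the helper names are mine, used purely for illustration):

#include <stdint.h>

static int32_t double_to_q15(double x)
{
    return (int32_t)(x * (1 << 15));   /* n = x * 2^15, truncated toward zero */
}

static double q15_to_double(int32_t n)
{
    return (double)n / (1 << 15);      /* x = n * 2^(-15) */
}

For example, double_to_q15(0.125) yields 0x1000, the same constant that appears in the addition example below.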

Calculations


So, we have established that when working with fixed point the exponent is not recorded anywhere and is kept "in mind".
How are the calculations performed? Computational arithmetic is a whole science with its own formulas, axioms and theorems, and the purpose of this article is not to give an introduction to that science. The approaches below are aimed primarily at programmers solving engineering and applied problems, i.e. problems where the ranges of permissible values and the required accuracy of the calculations are known and bounded.
Another limitation of the article: algorithms for trigonometric and other complex operations are not given here. A full review of them in one article would be unrealistic (and is hardly necessary). The article gives the foundation needed to understand such algorithms (and to develop your own): the rules for performing the basic operations (addition/subtraction, multiplication, division) and the general method of fixed-point computation.

Addition and Subtraction

Addition is simple if you imagine adding two decimal fractions "in a column" on a piece of paper. In that operation the numbers are written one under the other so that the points separating the fractional parts line up. In binary arithmetic this operation is called aligning the exponents.
Moving from "paper" to mathematical notation, we get the following:
Suppose there are two numbers a = n1 * 2^(-q1) and b = n2 * 2^(-q2).
Then:
a + b = n1 * 2^(-q1) + n2 * 2^(-q2) = (n1 + n2 * 2^(q1 - q2)) * 2^(-q1).
The factor 2^(q1 - q2) in the second term is essentially an arithmetic shift that brings the numbers to a common exponent.
It is worth noting that the result of the calculation can likewise be shifted to bring it to the desired exponent.
C code snippet:
int32_t a = 0x1000L;    // q15: a = 0.125
int32_t b = 0x20000L;   // q20: b = 0.125
int32_t c = 0;          // q25
c = (a << 5) + b;       // q20: (a * 2^(20 - 15) + b); c = 0x40000L (0.25 in q20)
c <<= 5;                // q25: c = 0x800000L (0.25 in q25)

From this example it should already be clear that in real computations even an operation as simple as addition leaves room for thought. It is always worth keeping the following questions in mind:
  • sacrifice accuracy or not? One can instead bring the terms to the smaller exponent by shifting right and dropping the low-order bits (see the sketch after this list).
  • are the variable values bounded? In that case a right shift, for example, need not lead to a loss of accuracy.
  • is it possible to increase the bit width?
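A minimal sketch of the first alternative, reusing the constants from the example above (my own illustration): the terms are brought to the smaller exponent, q15, so the five low-order bits of b are simply dropped.

int32_t a = 0x1000L;      // q15: a = 0.125
int32_t b = 0x20000L;     // q20: b = 0.125
int32_t c = (b >> 5) + a; // q15: b loses its 5 low bits; c = 0x2000L (0.25 in q15)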

This is not a complete list, but it already shows that things are not as simple as they may seem at first glance. In most practical applications the ranges of permissible values are known, or can be obtained, for each subject area, so working with fixed point requires some experience or research. Often the code is first written in floating point, after which the ranges of the values are examined and negligibly small values are dropped.

Multiplication

Fixed-point multiplication can be performed without any tricky alignment to a common exponent. Nevertheless, multiplication is a rather dangerous operation: it most often leads to a loss of accuracy and requires special care in handling.
Let's start with the mathematical description of multiplication:
Let there be two numbers a = n1 * 2^(-q1) and b = n2 * 2^(-q2).
Then:
a * b = n1 * 2^(-q1) * n2 * 2^(-q2) = n1 * n2 * 2^(-(q1 + q2)).
The expression shows that the exponents add up during multiplication: 2^(-(q1 + q2)). Bit widths are not considered in this article; for now it is enough to remember that for safe multiplication, without overflow or loss of accuracy, the width of the result must be no less than the combined width of the factors.
Because the exponents add up, the result of a multiplication usually has to be adjusted before further calculations. When the exponent is reduced, the least significant bits of the result are discarded, i.e. accuracy is lost. The loss of accuracy can be reduced (and sometimes has to be), but the ways of fighting these losses always involve overhead.
C code snippet:
int32_t a = 0x8000L;    // q16: a = 0.5
int32_t b = 0x100000L;  // q21: b = 0.5
int32_t c = 0xC0000L;   // q20: c = 0.75
int64_t d;              // Temporary variable with extra bits so the result fits.
d = (int64_t)a * (int64_t)b; // q37 = q16 * q21; d = 0x800000000L (0.25 in q37)
d >>= 17;               // q37 / 2^17 = q20
c += (int32_t)d;        // q20: c = 0x100000 (1 in q20)

Note that the 17 least significant bits of the multiplication result were discarded by the shift to bring the number to the format of the other term. It was possible, of course, to increase the bit width of the variable c, but, as I said above, in practice the ranges of values are usually bounded, and the low-order bits of a product are often neglected. In addition, the possibility of nonzero high-order bits in the original factors is not taken into account here.
Overflow handling, however, is not addressed in this article.
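One common, cheap remedy for the truncation above is to round to nearest instead of truncating, by adding half of the discarded range before the shift. A minimal sketch of the idea (my own illustration, not part of the original example):

int32_t a = 0x8000L;                 // q16: a = 0.5
int32_t b = 0x100000L;               // q21: b = 0.5
int64_t d = (int64_t)a * (int64_t)b; // q37 product, as before
d = (d + (1LL << 16)) >> 17;         // round to nearest: add 2^16 (half of the 17 discarded bits) before the shift to q20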

Division

Let's start with the mathematical expression for division:
Let there be two numbers a = n1 * 2^(-q1) and b = n2 * 2^(-q2).
Then:
a / b = (n1 * 2^(-q1)) / (n2 * 2^(-q2)) = (n1 / n2) * 2^(-(q1 - q2)).
The factor 2^(-(q1 - q2)) means that division automatically reduces the exponent. If nothing is done about it, some of the significant bits are discarded automatically.
The correction method is obvious: the dividend must be widened in advance so that the division yields the desired number of significant bits:
a / b = ((n1 * 2^(q3)) / n2) * 2^(-(q1 - q2 + q3)), i.e. the mantissa of the quotient is computed as (n1 * 2^(q3)) / n2.
Thus the exponent of the quotient is increased by q3 bits.
C code snippet:
int32_t a = 0x4000L;    // q15: a = 0.5
int32_t b = 0x80000L;   // q20: b = 0.5
int32_t c = 0;          // q25
int64_t d;              // Temporary variable with extra bits.
d = (int64_t)a << 30;   // q45: d = 0x100000000000 (0.5 in q45)
c = (int32_t)(d / (int64_t)b);  // q25: c = 0x2000000 (1 in q25)

Obviously, if the numbers do not fit into 32 bits, the problem can no longer be solved this simply. However, for simple engineering calculations 32-bit numbers are usually more than enough.
There is one simple way to significantly reduce the loss of accuracy in division: preliminary normalization of the dividend. Normalization is simply the maximum left shift of the mantissa at which no significant bits are lost. How far a number can be shifted is determined by counting its leading zeros, for which special algorithms (and even hardware processor instructions) exist.
After the division, this extra factor must be accounted for: the exponent of the quotient has grown by the same number of bits, so the quotient is either shifted back or, as in the code below, the normalization is folded into the shift that brings the result to the target exponent.
The above code fragment may look like this:
int32_t a = 0x4000L;    // q15: a = 0.5
int32_t b = 0x80000L;   // q20: b = 0.5
int32_t c = 0;          // q25
int norm_shift = norm(a);    // Compute the normalizing shift; norm_shift = 16
c = ((a << norm_shift) / b); // q(-5): c = 0x800 (1 * 2^norm_shift in q(-5))
c <<= (30 - norm_shift);     // q25: c = 0x2000000 (1 in q25)
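The function norm() is not spelled out in the article. A minimal sketch of it for non-negative 32-bit values, assuming a GCC/Clang-style compiler with the __builtin_clz intrinsic (the name and the zero check are mine):

/* Maximum left shift of a non-negative int32_t that keeps the sign bit clear. */
static int norm(int32_t x)
{
    if (x <= 0)
        return 0;                           /* nothing to normalize */
    return __builtin_clz((uint32_t)x) - 1;  /* leading zeros minus the sign bit */
}

For a = 0x4000 there are 17 leading zeros, so norm(a) = 16, as in the comment above.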

As you can see, in this case the loss of accuracy was avoided without widening the dividend.
However, this does not always work out, and if you need to stay within a certain bit width (for example, 32 bits), you sometimes have to implement the division algorithmically. In a survey article it is hardly worth diving into that jungle; to understand the division process and the difficulties associated with it, the description above is enough.

Notation accepted in the literature and various sources


In the last section of the article I would like to return once more to the generally accepted notation used when describing fixed-point algorithms.
This is a rather important point that you have to pay attention to when reading other people's sources.
The two most common ways of designating a fixed-point number are:
  1. QM, where M is the number of digits after the point. This is the notation used in the article so far.
  2. QN.M, where N is the number of digits before the point, excluding the sign bit, and M the number after it.

The drawback of the first notation is obvious: when working with a variable you have to go back to its declaration (to recall its bit width) and do some mental arithmetic to understand how to bring the exponent to the desired one. Moreover, recalling the (int32_t)d cast in the multiplication example, note that with comments in this notation it is hard to tell whether a shift or a cast will discard significant bits and lead to an error.
With comments in the second notation you can simply track the precision of the calculations, with no need to recall how the variables were declared.
Let me give you an example:
a = 0x1000; // Q15
b = 0x8000; // Q15
int32_t c = a + b; // ??? Without looking at the declarations, there is no telling whether the result fits in the variable.
a = 0x2000; // Q0.15
b = 0x8000; // Q16.15
c = a + b;  // Q0.15 + Q16.15 = Q16.15: 16 + 15 = 31 bits + 1 sign bit

Comments in the second notation are obviously more convenient.
I will not discuss the usefulness of comments as such here (they are far from universal); I will only say that I always write out the formats of the variables in my calculations, so as not to make mistakes or get confused when aligning exponents.
If there are no comments at all, reading and understanding the code is of course harder, but if you get lost you can always add such a decoding in Q-notation yourself, to understand where "this shift left by 4 and then shift right by 10" came from.
I note that in the recently published Google sources of the GIPS VoIP engine (webrtc), comments most often simply say Q, implying that all bits of the number are allocated to the fractional part. This personally confuses me, because I have to dig through the definitions to figure out how the code works.
I myself use yet another notation, which differs from those above and is close to the notation of the MATLAB toolbox for fixed-point arithmetic. It ties the mathematics to the bit width of the variables and simplifies life when you need to evaluate the result of an operation (its width and exponent).
In my comments I mark fixed-point numbers as QN.M, where N is the bit width of the number and M is the number of digits after the point.
Let me explain why I have found this scheme convenient:
  1. Knowing the bit width of a number, you can always predict the bit width of the result, i.e. choose a variable type sufficient to represent it.
  2. I personally find it inconvenient to read a record of the form Q(-N).M, which appears in the second notation after a right shift leaves too few bits for the fractional part. For example, a 16-bit number whose exponent is 18 (n * 2^(-18)) is written q16.18 in my notation, but q(-3).18 in the second one. A record in the first notation, as already mentioned, forces you to consult the declaration in any case to understand the precision of the calculations, and here it is also unclear without the declaration whether leading significant bits have already been discarded or not.
  3. When doing calculations in my notation, it is easier for me to see which variable width will fit the result and how to align the exponents (see the sketch after this list). For example, q32.15 * q16.4 = q48.19: it is immediately clear that a 48-bit variable is needed to represent the result in full. In the second notation the same entry looks like q16.15 * q11.4 = q27.19, and you have to work out that 27 + 19 = 46 bits, plus one sign bit from each factor, makes 48 bits. A trifle, but a pleasant one, especially when there is a lot of source code.
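A minimal sketch of how such comments might look in code (my own illustration of the convention, reusing the earlier multiplication example):

int32_t a = 0x8000L;            // q32.16: a = 0.5 (32-bit variable, 16 fractional bits)
int32_t b = 0x100000L;          // q32.21: b = 0.5
int64_t d = (int64_t)a * b;     // q32.16 * q32.21 = q64.37: a 64-bit variable is needed
int32_t e = (int32_t)(d >> 17); // q64.37 >> 17 = q64.20, then truncated to q32.20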

About the pros and cons of using a fixed point


Such a detailed description of even the basic operations may scare engineers and programmers away from using fixed point in their calculations, especially if the habit of using floating point without tracking the result has already taken hold. Nevertheless, fixed point has its advantages, some of which are not obvious.
To decide once and for all whether you need it, you can use the following summary of fixed-point calculations.
Pros:
  • The need to think.
  • Predictability of the result. With the right approach to coding, the result of the calculation will be identical, bit for bit, on any platform (processor + compiler). There is a special term for this property: bit-exactness (from the English bit-exact). A properly coded algorithm is always bit-exact and can therefore be studied on a non-target platform. This is especially useful when debugging on the target platform is difficult or impossible and only the input data can be captured.
  • Full control over the behavior of the code. Fixed point eliminates the "surprises" associated with the peculiarities of the floating-point implementation on the platform in use.
  • Automatic "filtering" of negligible values. In floating point, calculation errors can accumulate; in fixed point this either does not happen (small values are simply discarded) or the accumulation of errors can be controlled algorithmically.
  • An algorithmically controlled range of variable values. Floating point gives more freedom in calculations, but the result may go outside the permissible limits, which then has to be checked separately. With fixed point this problem is solved once, at the stage of developing and debugging the algorithm.
  • Portability of the algorithm. This point largely overlaps with predictability, but it is worth noting that integer calculations are supported far better than floating point by many non-x86 processors. Having developed an algorithm in fixed point once, you will find it much easier to port to various "weak" platforms. Sometimes integer computation is the only thing available on the target platform at all.
  • The ability to trade accuracy for computational cost while developing the algorithm.
  • It is sometimes simply interesting.

Minuses:
  • The need to think.
  • A reduced (in the simplest case) range of variable values compared to floating point.
  • The need to control the range of variable values algorithmically. A significant part of the development time is spent on correct scaling and range selection.
  • The need to monitor bit widths at every stage of the calculation.
  • The need to write your own framework of basic functions (trigonometric, logarithmic, etc.) or to adapt an existing one.
  • The need to dive into the application domain when developing an algorithm.
  • The need for a higher culture of writing and maintaining code: you cannot get by without it when working with fixed point. In floating point you can usually rewrite the math "head-on" using ready-made functions, without much thought.


In conclusion


I have not touched on such (basic!) problems of fixed-point calculations as overflow of the result and zero drift under rounding, or the methods of dealing with them. The text has turned out voluminous enough, and possibly boring enough, without those details.
I did, on the other hand, pay quite a lot of attention to the accepted notation for recording fixed-point operations, to make it easier to read the specialized literature and to write your own sources (and, possibly, to understand the second part of the article). Comments with calculations in Q-notation have more than once saved me from serious debugging sessions and source-code archaeology.
If the topic proves to be in demand, I will supplement the article with the next part, in which I will describe the points above and try to explain how, in the general case, an algorithm can be moved from floating point to fixed point.
NB: Mathematicians, please do not worry; I am sure you know the subject better than I do.

UPDATES:

  • The notation I use, as it turned out, came to me from MATLAB; I could not remember where I had dug it up. Thanks to nerudo for the reminder. The Fixed-Point Toolbox of that package creates objects using exactly "number of bits" + "number of fractional bits", with an explicit indication of whether the number is signed or unsigned.
  • Although the examples here revolve around 8-, 16-, 32- and 64-bit words, I was not trying to tie the description to x86 and other general-purpose processors. It is simply that, first, it is easier to give examples in C and, second, I am far from an expert in FPGAs, ASICs and the like (read: Verilog and so on), which allow arbitrary bit widths, so I cannot discuss them competently. Perhaps people more knowledgeable in the matter will supplement the article with topics of their own, with examples.
  • The descriptions of the notations do not say what to do with unsigned numbers. The original sources where I saw them say nothing about signs either; essentially it was implied that all numbers are signed. I cannot say offhand how the authors intended unsigned numbers to be written; I will only note that in my (MATLAB-like) notation I add the letter 'u' after the Q to mark an unsigned number.

