Fixed-point arithmetic in C ++
Today I will tell you what fixed-point is, why it is needed and how it can be used.
There is such a problem when the performance of the application can noticeably deteriorate due to the peculiarities of floating point calculations. As a rule, the CPU is sharpened for integer operations, and the FPU (floating point unit) coprocessor in it works at an order slower. There are such platforms where there is no FPU at all and emulating operations with numbers would take a lot of time. For example, in the presence of FPU, multiplication of floating-point numbers is performed by only one fmul command, and in the absence of FPU, multiplication is performed by the __mulsf3 emulating function. Compared to the fmul command, the __mulsf3 function emulates operations on floating-point numbers, while the calculations are performed in integer form, which leads to an increase in machine code and the time it takes to execute, while the fmul command performs this operation quickly.
There is a solution to this problem, which allows performing calculations with a fixed point on an integer type.
The principle of this type is a fixed shift of the number by N bits, as a result of which the fractional number can be represented as an integer and it will have an accuracy of 2 ^ N after the point. An example of converting a floating-point number to a fixed-point number of the order of 8 bits (2 ^ 8 = 1024).
Here is an example of converting a floating-point number to a fixed-point number:
This number, after the point, has an accuracy of 2 ^ 8 after the decimal point.
An example of the reverse translation of a fixed-point number to a floating-point number.
In this case, the number after the reverse translation has the form 12345.6787109375 and is accurate 3 digits after the period, the maximum accuracy is actually 2 ^ 8 = 1024.
Sum and difference operations are equivalent to ordinary integer operations.
Multiplication of such numbers is made in this form.
Division.
When performing operations of multiplication and division, a case of overflow is possible, which will lead to an incorrect result. This will happen if, for example, a 32-bit integer type is used and during the calculations an overflow of this type will occur and as a result of this overflow the number will lose the most significant bits. There are two ways to eliminate overflow:
There is such a problem when the performance of the application can noticeably deteriorate due to the peculiarities of floating point calculations. As a rule, the CPU is sharpened for integer operations, and the FPU (floating point unit) coprocessor in it works at an order slower. There are such platforms where there is no FPU at all and emulating operations with numbers would take a lot of time. For example, in the presence of FPU, multiplication of floating-point numbers is performed by only one fmul command, and in the absence of FPU, multiplication is performed by the __mulsf3 emulating function. Compared to the fmul command, the __mulsf3 function emulates operations on floating-point numbers, while the calculations are performed in integer form, which leads to an increase in machine code and the time it takes to execute, while the fmul command performs this operation quickly.
There is a solution to this problem, which allows performing calculations with a fixed point on an integer type.
The principle of this type is a fixed shift of the number by N bits, as a result of which the fractional number can be represented as an integer and it will have an accuracy of 2 ^ N after the point. An example of converting a floating-point number to a fixed-point number of the order of 8 bits (2 ^ 8 = 1024).
Here is an example of converting a floating-point number to a fixed-point number:
Fixed(12345,6789) = 1024 * 12345,6789 = 12641975,1936
This number, after the point, has an accuracy of 2 ^ 8 after the decimal point.
An example of the reverse translation of a fixed-point number to a floating-point number.
Float(12641975) = 12641975 / 1024 = 12345,6787109375
In this case, the number after the reverse translation has the form 12345.6787109375 and is accurate 3 digits after the period, the maximum accuracy is actually 2 ^ 8 = 1024.
How do calculations on a fixed-point type occur?
Sum and difference operations are equivalent to ordinary integer operations.
Fixed(x) + Fixed(y) и Fixed(x) - Fixed(y)
, with any order. (1024 * x) + (1024 * y) и (1024 * x) - (1024 * y)
Multiplication of such numbers is made in this form.
(Fixed(x) * Fixed(y)) / p
, this is equivalent, with an order of 8 bits ((1024 * x) * (1024 * y)) / 1024
Division.
(Fixed(x) * p) / Fixed(y)
, also with an order of 8 bits, this(1024 * 1024 * x)*(1024 * y)
Overflow
When performing operations of multiplication and division, a case of overflow is possible, which will lead to an incorrect result. This will happen if, for example, a 32-bit integer type is used and during the calculations an overflow of this type will occur and as a result of this overflow the number will lose the most significant bits. There are two ways to eliminate overflow:
- Perform calculations in a 64-bit integer type.
- Perform calculations in a “disassembled” form, for example, when multiplying, (xi + xf) * (yi + yf) = xi * yi + xf * yf + xi * yf + yi * xf, the prefixes i and f mean the whole part and the part after points.
Class for working with fixed-point in C ++
#define DIGITS 1024 // экспонента
#define EPS 20 // константа устанавливающая границы приближенности вычисления корня
using namespace std;
typedef signed int __int32_t;
class Fixed {
signed int x;
Fixed(signed int a){
x = a;
}
public:
Fixed(){
x = 0;
}
static Fixed fromInt(signed int val){
return Fixed(val*DIGITS);
}
static Fixed fromFloat(float val){
return Fixed((signed int)(val*DIGITS));
}
float fixed2float(){
return ((float)x)/DIGITS;
}
Fixed sum(Fixed a,Fixed b){
return Fixed(a.x+b.x);
}
Fixed diff(Fixed a,Fixed b){
return Fixed(a.x-b.x);
}
Fixed mul(Fixed a,Fixed b){
signed int c=a.x*b.x;
if(c/b.x != a.x){
// Overflow!
signed int i1 = a.x/DIGITS;
signed int i2 = b.x/DIGITS;
signed int f1 = (a.x&(DIGITS-1));
signed int f2 = (b.x&(DIGITS-1));
return Fixed((i1*i2)*DIGITS+(f1*f2)/DIGITS+i1*f2+i2*f1);
}else{
return Fixed(c/DIGITS);
}
}
Fixed div(Fixed a,Fixed b){
if(a.x>(1<<21)){
// Overflow!
signed int i = a.x/DIGITS;
signed int f = (a.x&(DIGITS-1));
return Fixed(((i*DIGITS)/b.x)*DIGITS+(f*DIGITS)/b.x);
}else{
return Fixed((a.x*DIGITS)/b.x);
}
}
Fixed sqrt(Fixed k){
Fixed tmp(0);
tmp.x = k.x/2;
signed int min = 0;
signed int max = k.x;
Fixed quick(0);
do{
tmp.x = (min+max)/2;
quick = Fixed::mul(tmp,tmp);
if(abs(quick.x-k.x)k.x){
max = tmp.x;
}else{
min = tmp.x;
}
}while(true);
}
};