On the issue of transformations and other operations

    Blue Caterpillar: Well, you can’t kill us. We sit, we know: they are waiting for transformation. Why? And nothing! We sit, we smoke, we wait ...
    Alice- doll: What?
    Blue Caterpillar: What, what! Transformations. The house is in smoke, the smoke is in the lady, and the lady is in the mother. That's it. Do not interfere, do not jump forward, otherwise you will turn into some kind of butterfly prematurely.

    Looking through the code on one of the forums devoted to Arduino, I discovered a fun way to work with a floating-point number (PT). The second common name for numbers in this format is floating point, but the abbreviation that arises in this case (PZ) personally causes quite different associations to me, so we will use this option. The first impression (from the seen code) - what kind of garbage is written here (I must say that the second is the same, although there are nuances, but more on that later), but the question arises - how is it really necessary - the answer to which is given in further text.

    Part one - posing the question

    We formulate the task - we need to output to the console (turn into a character representation) a floating-point number, not using the print options, specifically intended for this purpose. Why do we want to do this on our own -

    1. The use of the% f format entails connecting a floating-point library and an extended version of the prntf function (or rather, makes it impossible to use its truncated version), which leads to a significant increase in the size of the executable module,
    2. A standard solution requires considerable time (it always works with double precision), which may be unacceptable in this particular situation,
    3. well (last, but not least), it's just interesting.

    To begin, consider the option that was proposed in the above material, something like:

    for (float Power10=10000.0; Power10>0.1; Power10/=10.0; )
    {char c=(int)(Fdata/Power10); Fdata -=Power10*c; };

    and agree that the task he is completely solves. Moreover, this is not the worst option, since its speed can be quite acceptable. Now let's take a closer look at this point - we see the division of numbers of PTs, but if we delve deeper into the essence of the issue, it turns out that it is almost as fast as the division of integers of the corresponding capacity. Actually, before evaluating the performance of the algorithm, you should evaluate the performance of various elementary operations, which we will do.

    Part two - performance evaluation of elementary operations

    The first interesting operation is addition (subtraction, in the sense of time spent, they are equivalent) for integers, and we can assume that it takes a unit of time (tact) with the following proviso — this is true only for “native” data. For example, for the AVR MK series it is an 8-bit word, for MSP430 it is a 16-bit word (and, of course, smaller), for Cortex-M it is a 32-bit word and so on. Then the operation of adding numbers of length N times larger than the native can be estimated as H cycles. There are exceptions, for example, AddW in AVR controllers, but it does not cancel the rule.

    The next operation is the multiplication of integers (but not division, it differs in the sense of speed) and for it is not so simple. First of all, multiplication can be implemented in hardware and, for example, in AVR MEGA it requires 2 ticks, and in the improved 51 integers 6 (to multiply the native numbers).

    But consider the case when there is no hardware implementation and we will have to implement multiplication as a subroutine. Since multiplying H bit numbers yields a 2N bit product, we can find an estimate of the classical version with shifts as follows: we need H shifts of a factor with 1 tick per shift, H shifts of the second factor of 2H length with 2 ticks per shift, then N decisions and , on average, N / 2 addition of numbers of length 2H, at the end of the organization of a cycle of 2 cycles. Total H + 2H + H + 2H / 2 + 2H = 7N cycles, and the actual execution of arithmetic operations of them takes only H cycles (efficiency wow, although we managed to get around the engine).

    That is, to multiply two 8p numbers by 8p MK 56 clocks will be needed, and to multiply 16p numbers there are already 112 (slightly less, but neglect the exact value) clocks, which is slightly more than desired. Fortunately, the direction of the shifts can be modified and there is, moreover, the only way to carry out multiplication, which will require only H shifts of a number of 2H digits and H / 2 additions of native numbers, which improves the operation time of the multiplication algorithm to 0 + 2 + 1 + 1/2 + 2 = 5.5N - of course, it’s impossible to compare with the hardware implementation, but at least some gain without loss of functionality. There are improvements to this algorithm, for example, analysis of 2 bits per cycle, but they do not fundamentally assess the situation - the time of multiplication is orders of magnitude greater than the time of addition.

    But with the division, the situation is worse - even the hardware-implemented division loses the multiplication by almost two times, besides there are MCs with hardware multiplication, but without hardware division. Under certain conditions, it is possible to replace division by multiplication by the reciprocal, but these conditions are specific and give a similar result - two iterations of multiplication with the subsequent sum are required, so the loss is 2 times. If we implement division as a subroutine, then we need H divider shifts of 2H length, H subtractions from the dividend of 2H length, H result shifts, 2H cycle organization, but all this is preceded by alignment, which will take another 5H ticks, so the total figure will be 2 + 2 + 1 + 2 + 5 = 12H, which is about 2 times worse than multiplication.

    Now let's consider the operations of the PT, and here the situation is somewhat paradoxical - the multiplication operation takes almost as much time as for integers (the corresponding bit depth, as a rule, 24 bits), since we have to multiply the mantissas and just add up the orders, the normalization is not required. The division is also good, we divide the mantissas and subtract the orders, normalization is again not needed. Therefore, for these two operations, the loss compared with the integers is not very significant, although it does take place.

    But the operation of addition and subtraction requires, first of all, alignment of orders (and these are shifts and there may be many, although there are nuances), then the actual operation, and (when subtracting) normalization (if added, too, but no more than 1 shift is required ), which is wasteful in time, therefore, operations of this class are significantly slower for PT than for integers, especially in relative terms.

    Let us return to our sheep and agree that, based on earlier estimates, the proposed method may not be too long, especially since it immediately gives a result, but it has a significant limitation - it is applicable to a very limited range of input values ​​of PT. Therefore, it will look for a universal (more or less) solution.

    Immediately make a reservation that our solution should not use floating point operations at all (from the word at all) to emphasize the merits of our option. And the puzzling question is where from when there will be a number of this type, if operations are not available, we answer - it may well appear, for example, when reading information from the light sensor (as it was in the original example), which gives the data in PT format.

    How exactly the number of PTs are arranged, you can easily find on numerous sites, there was an article on Habré recently, this should not cause problems. Nevertheless, a number of questions are of interest to the PT format in the style of “if I were the director” - why this is so and not otherwise. I will give my answers to some of them, if anyone knows the more correct ones, I ask in the comments.

    The first question is why the mantissa is stored in the direct code, and not in the additional code? My answer is because it is easier to work with a normalized mantissa with a hidden (optional) bit.

    The second question is why the order is stored with offset, and not in any other way? My answer is because in this case it is easy to compare the modules of two PTs as integers, with other methods it is more difficult.

    The third question is why a negative sign is encoded by a unit, and not by zero, because then it would be possible to simply compare two PTs as integers? My answer is - I don’t know, just “it’s so accepted here.”

    Part Three - Necessary Explanations

    In the previous paragraph, I could give incomprehensible terms, so a little about the representation of numbers. Of course, they are different, otherwise there would be no need to discuss them. Immediately, we note that the memory of the MC (the same is true about computers, although I’m not so categorical about the most modern architectures - they are so complex that you can expect everything) there are no numbers, there are only elementary storage units - bits, grouped in bytes and further into words. When we talk about the representation of a number, we mean that we somehow interpret a set of bits of a specific length, that is, we define a law by which we can find a certain number corresponding to a given set of bits, and nothing more.

    There are countless such laws, but some of them will have a number of useful properties in terms of various operations, so they will be more often used in practice. One of these properties, which is implicitly implied, for example, is determinism, and the other - independence from the environment - properties, at first glance, are obvious, although there are nuances. Other properties of the type of one-to-one correspondence are already a subject of discussion and do not always have a place in a particular representation. In itself, the topic of the representation of numbers is extraordinarily fascinating, with Knut (in volume two) it is fully revealed, so that there is a depth behind it, and we will run over the surface.

    Assuming that a set of bits has a length n (we number them in a row from 0 to n-1) and weighted evenly in increments of 2 and the least significant bit (number 0) has a weight of 1 (which, generally speaking, is not necessary at all, we just accustomed to such things, and they seem obvious to us) we get a binary representation of a number, in which the reduction formula looks like this: a number displayed by a set of bits (Ч2) = В(0)*2^0 + В(1)*2^1 + ... + В(н-1)*2^(н-1)or in cascade form Ч2(н) = В(0)+2*(В(1)+2*(...+2*(В(н-1))..))), hereinafter B (k) denotes a bit with number k. Note that this view does not impose any restrictions on and the location of the bytes of the number in memory, but it would be more logical to place the low byte in the lower addresses (that's how easy and easy I resolved the “eternal dispute between Slavs among themselves” regarding which end it is more convenient to break an egg).

    With this interpretation of a set of bits of length n (= 8), we get a representation for numbers from 0 to (2 ^ n) -1 (= 255) (hereinafter, in brackets there will be a specific value for a set of 8 bits), which has a number of remarkable and useful properties, why it is widespread. Unfortunately, it also has a number of drawbacks, one of which is that we cannot, in principle, present negative numbers in such a record.

    You can offer a variety of solutions to this problem (representation of negative numbers), among which there are those that have practical significance, they are listed below.

    The representation with offset is described by the formula Ч = 22 (n) - the offset (C), where 22 is the number obtained in binary recording with n bits, and C is some preselected value. Then we represent numbers from 0-С to 2 ^ (n) -1-С and, if we choose C = 2 ^ (n-1) -1 (= 127) (this is not at all necessary, but very convenient), then get the range from 0- (2 ^ (n-1) -1) (= - 127) to 2 ^ (n-1) (= 128). The main advantage of this view is monotony (moreover, increase) throughout the interval, there are also drawbacks, among which we highlight asymmetry (there are others related to the complexity of performing operations on the number in this view), but the developers of the IEEE 457 standard (this is the standard for PT) turned this disadvantage into dignity (using the unnecessary meaning for encoding the situation nan), which once again underlines the correctness of the class statement: “If you are above the enemy, then this is your advantage. If the enemy is higher than you, then this is also your advantage. ”

    Note that since the total number of possible combinations of any number of bits is even (if you do not have combinations forbidden for religious reasons), the symmetry between positive and negative representable numbers is fundamentally unattainable (or rather, achievable, but under certain additional conditions, more on) .

    Representation in the form of a direct code when one of the bits (most significant) represents the coded sign of the number H = (-1) ^ B (n-1) * P2 (n-1) has a range from 0- (2 ^ (n-1) -1) (= -127) to 2 ^ (n-1) -1 (= 127). It is interesting to note that I have just stated that symmetry is impossible in principle, and here it clearly is - the maximum representable positive number is equal to the modulus of the minimum representable negative number. This result is achieved by having two representations for zero (00 ... 00 and 10 ... 00), which is usually considered the main disadvantage of this method. This is really a disadvantage, but not as scary as is commonly believed, since there are more significant ones that have limited its use.

    Representation in the form of a reverse code, when in the direct representation we invert all bits of the value for negative numbers H = (1-B (n-1)) * Ch2 (n-1) + B (n-1) * (2 ^ (n -1) -CH2 (n-1)) - this is from the definition, you can make a much more understandable formula H = P2 (n-1) -B (n-1) * (2 ^ (n-1) -1), which allows to represent numbers from 0-2 ^ (n-1) +1 (= - 127) to 2 ^ (n-1) -1 (= 127). It can be seen that this representation is with displacement, but the displacement changes in steps, which makes this representation not monotonous. Again we have two zeros, which is not very scary, the occurrence of a circular transfer when adding is much worse, which creates certain problems in the implementation of the ALU.

    It is extremely simple to eliminate the last drawback of the previous view, it is enough to change the offset by one, then we get H = P2 (n-1) -B (n-1) * 2 ^ (n-1) and we can represent numbers from 0-2 ^ ( n-1) (= - 128) to 2 ^ (n-1) -1 (= 127). It is not difficult to see that the representation is asymmetric, but zero is unique. The following property is much more interesting, “it is absolutely clear that” ring transfer for an operation of the type of addition does not arise, which was the reason (along with other pleasant features) of the general distribution of this particular method of coding negative numbers.

    Create a table of interesting values ​​for different ways of coding numbers, denoting by H the value 2 ^ (n-1) (128)
    H (n)0H-1 (127)H (128)2 * Н-1 (255)
    H (n-1)0H-1 (127)0H-1 (127)
    Offset H-N + 1 (-127)0oneH (128)
    Straight0H-1 (127)0-N + 1 (-127)
    Back0H-1 (127)-N + 1 (-127)0
    Additional0H-1 (127)-N (-128)-one

    Well, in the end of the topic, we present the graphs for the listed representations, from which their merits and demerits are immediately visible (of course, not all that bring to mind the interesting saying “The advantage of graphical presentation of information is in clarity, and it has no other advantages”).

    Part Four - the actual solution to the original problem (better late than never).

    Small retreat

    For a start, I wanted to print the PT in hexadecimal format (and ultimately I did it), but quite expectedly / completely unexpectedly (I needed to substitute) came across the following result. What, in your opinion, will be printed as a result of the execution of statements:

    printf("%f %x", 1.0,1.0); printf("%f %x",2.0,2.0);
    printf("%x %d",1.0,1.0); printf("%x %d",2.0,2.0);

    , also pay attention to the following construction and its result:

    printf("%x %x %f",1.0,1.0);

    I will not give explanations to this phenomenon, “smart enough”.

    However, how can we correctly print a hexadecimal representation of the PT? The first solution is obviously union, but the second is for fans of one-liners printf ("% x", * ((int *) (& f))); (I apologize if anyone was offended by the extra brackets, but I could never, and was not going to, remember the priorities of operations, especially if one considers that the brackets of the code do not generate, so I will continue this way further). And here it is, the solution of the task - we see a string of characters, 0x45678, which unambiguously determine the required number for us, but in such a form that we (I don’t know how you are, I’m for sure) cannot say anything intelligible about this number. I think that Academician Karnal, who could have indicated an error in the punched tape with the source code, would have coped with this task, but not everyone is so advanced, so we will continue.

    We will try to get information in a more understandable form.

    To do this, we return to the PT format (hereinafter, I consider only float), which is a set of bits from which three sets of bits can be extracted (according to certain rules) to represent three numbers — the sign (h), the mantissa (m) and the order (n), and the desired number encoded by these numbers will be determined by the following formula: CH * CHM * CHP. Here, the characters H denote the numbers represented by the corresponding set of bits, therefore, in order to find the desired number, we need to know the laws by which we extract these three sets from the original set of bits, as well as the type of coding for each of them.

    In solving this problem, we turn to the IEEE standard and find out that the sign is one (most significant) bit of the original set and the formula for encoding is Fz = (- 1) ^ B (0). The order occupies the next most significant 8 bits, written in a code with an offset of 127, and is a power of two, then PE = 2 ^ (22 (8) -127). The mantissa occupies the following order of 23 digits and is a number Fm = 1 + P2 (23) / 2 ^ 23.

    Now we have all the necessary data and we can easily solve the task - to create a string with characters, which, with a certain reading, will represent a number equal to that encoded in the PT. To do this, we should by simple operations extract the above numbers, and then print them out, providing them with the necessary attributes. We assume that we can convert an integer with no more than 32 bits into a character string, then it is not difficult.

    Unfortunately, we are just at the beginning, as very few of the readers of this post in the record "+ 1.625 * 2 ^ 3" find out the unlucky number that is encoded by the more familiar decimal "13", and only guess in the record "1.953125 * 2 ^ 9 "simple" 1E3 "or" 1 * 10 ^ 3 "or just the usual" 1000 ", only a few people are capable of anything, I definitely don’t belong to them. It was strange how it happened, because we completed the original task, which once again shows how to be attentive to the wording. And the point is not that decimal is better or worse than binary (in this case we mean two at the base of the degree), but the fact that we are used to decimal from childhood and it’s much harder to remake people than the program, so we’ll bring our write to more familiar.

    From the point of view of mathematics, we have a simple operation - there is a record PT = (- 1) ^ c * m * 2 ^ n, and we need to convert it to the form PT = (-1) s' * m '* 10 ^ n'. We equate, transform, and obtain (one of the possible variants) the solutions s '= s', m '= m, n' = n * lg (2). If we leave out the need to multiply by an obviously irrational number (this can be done if the number is rationalized, but we will talk about this later), then the problem looks solved until we see the answer, because if the record is "+1.953125 * 2 ^ 9 "seems to us incomprehensible, then the entry" + 1.953125 * 10 ^ 2.70927 "is even less acceptable, although it seemed that there was no place worse.

    We continue to improve the solution and find the following solution - the equations of reduction to the degree on the base 10 m '= m * 10 ^ {n * lg (2)}, n' = [n * lg (2)], where fractional brackets and square brackets and the integer part of a number, respectively. Then for the considered example we get (1.953125 * 10 ^ 0.7 0927) * 10 ^ 2 = "10 * 10 ^ 2", which is much more acceptable, although not ideal, but it can be implemented.

    Things are easy, we need to learn:

    1. multiply the integer (p) by the previously known irrational (lg (2)) (this is not difficult under certain restrictions on the accuracy of the result);
    2. take the integer and fractional part of a fixed-point number (it's easy);
    3. to build the known integer (10) to an irrational degree (mda ...);
    4. multiply the whole by an arbitrary irrational (“we will simplify the calculations, they said ...”).

    Nevertheless, we will try to move in this direction and consider what is easy to do, namely paragraph 1. We note right away that this task is fundamentally unsolvable and we cannot calculate n * lg (2), we cannot get from the word “absolutely”, with the exception of the trivial case, n = 0 (well, and the obvious case, n = k / log (10)). An interesting statement, especially after the statement "it is not difficult," but the apparent contradiction is removed with the phrase "with a certain accuracy." That is, we can still calculate the product of an arbitrary integer by a known irrational, and this is not difficult for the result with a certain accuracy. For example, if we are interested in the result with an accuracy of one percent, then by presenting the desired result n '= n * lg (2) as n * [lg (2) * 256 + 1/2] / 256 we get the value with the accuracy we need , since the possible relative error cannot exceed 1/2/77 = 1/144, which is clearly better than the required 1/100. One important consideration should be taken into account - the small value of the relative deviation says nothing about the behavior of the function when a nonlinear transformation is applied to it, and the operation of taking the whole part is obviously nonlinear. Let's give a simple example [4.501] = 5, and [4.499] = 4 and, despite the fact that the relative deviation in the source data will be 0.002 / 4.5 = 0.04%, the result deviation will be 1/4 = 25%. Unfortunately, in general, the problem is not solved at all, when applying any rounding algorithm. It is possible to solve only the special case where the input data are limited and, moreover, take a fixed set of values, then the selection of the initial displacement and the angle of inclination can be obtained absolutely accurate,

    For our case, such an ideal approximation will be the function n '= n * 77/256

    Before proceeding with the design of the algorithm, we must evaluate the accuracy we need. Since the mantissa is 24-bit, the number represented by it has a relative error of 2 ^ -24 = 2 ^ -4 * 2 ^ -20 = 16 ^ -1 * (2 ^ 10) ^ - 2 ~ (10) ^ - 1 * (10 ^ 3) ^ - 2 = 10 ^ -7, which means 7 exact decimal digits. Multiplying two 24 bit numbers will be enough to keep accuracy in this range (well, almost enough). Immediately, we note that the transition to 32-bit numbers (both factors) reduces the relative error by more than 100 (256) times, this fact will come in handy later.

    The most unpleasant in the sense of accuracy formula calculates a new mantissa and looks like this

    m '= m * 10 ^ {n * lg (2)}

    Why is it the most unpleasant - 1) it contains a chain of calculations regarding n and the error will accumulate, 2) there is an extremely bad operation from the point of view of accuracy, and this is not taking the fractional part, because if you do it simultaneously with taking the whole part, all not so bad, but an exhibitor. If the remaining operations are multiplications and the relative errors are simply added up, are predictable and depend solely on the length of the discharge grid of the operand representation, then for the exponent everything is extremely bad and it is clear that the relative error will be very large with large values ​​of the argument.

    “Well, yes, it is absolutely obvious that”

    q (10 ^ x) = Δ (10 ^ x) / 10 ^ x = (10 ^ (x + Δx) - 10 ^ x) / 10 ^ x = 10 ^ Δx -1 = 10 ^ (x * qx) -1,
    10 ^ (x * qx)> ~ 10 ^ (x * 0) + (10 ^ (x * 0)) '* qx = 1 + x * ln (10) * 10 ^ (0) * qx = 1 + x * ln (10) * qx,

    from here we get

    $ q (10 ^ x) = x * ln (10) * qx. $

    What this expression means is that at the edge of the range of values, with n = 127, the relative error will increase by 292 times and in order to keep the accuracy of the result in the required limit, we need to significantly increase the accuracy of the argument.

    Recall that the transition from 24 to 32 bits gives us the required increase in accuracy (not quite gives, but very close), hence we understand that the first multiplication (n * lg (2)) should be carried out with 32 bit operands, that is, with such precision should be expressed as a logarithm of two, then it will be equal to 1'292'914'005 / 2 ^ 32. Note that in the code the numerator of this constant should be written as (int) ((lg (2) * float (2 ^ 32)) + 0.5), but in no case as mysterious 0x4d104d42, even with a commentary on its computing, since well-written code is self-documented.

    Next, we need the integer part of the result, which is easy, since we know the exact position of the decimal point in both factors and, accordingly, as a result.
    But then we have to calculate 10 to the power from 0 to 1 and here we will use a little trick to get the required accuracy. Since, according to the formula for the error, the accuracy on the right edge of the range falls by more than two, then if we present the value of the argument as the sum of the logarithm of the decimal two and some remainder, n "= lg (2) * i + (n" - lg ( 2) * i), then the first term of the sum will lead to multiplication by 2 to the appropriate extent, which is easy to do with zero error (until it overflows), and the rest will be limited by the value of lg (2) and we will not lose accuracy in calculating 10 ^ n "(assuming the correct subtraction).

    However, the exponential function for the argument limited to the value of lg (2) will still have to be computed and the only way I see is expansion in the Taylor series. Unfortunately, it does not converge too quickly on our range, and, for example, to achieve accuracy in 10E-7, we need 9 sum members, which makes it necessary to carry out 1 + 9 * 2 = 19 multiplications of 32-bit integers, which is somewhat exceeds the desired performance. There are also still vague doubts about our ability to carry out the calculation of n '= n * lg (2) with such an accuracy that it would suffice at the maximum value of n.

    Nevertheless, the agitime turned out to be quite efficient and we only need 32-bit multiplication to get the result in the amount of 1 + 19 + 1 = 21 operations, which determines the computational complexity of the algorithm

    Is it possible to reduce the computational complexity of our transformation - it seems to be no, we all carefully counted - but suddenly it turns out that all the same is possible. A somewhat unexpected statement, the key to understanding this possibility lies in the nature of the order of the PT - it accepts a fixed (and relatively small) set of values, and we did not take this into account when deriving the transformation formulas, but implicitly worked with a continuous value.

    The simplest solution - we change the memory for a while - to calculate in advance for all possible (2 ^ 8 = 256) exponents n and the corresponding values ​​of [n '] (the most appropriate exponent 10) and {n'} (the correction factor for the mantissa), enter them in the table and then just use in the calculation process. The formula turns out quite simple - PT = m * 2 ^ n = m * 10 ^ n '* (2 ^ n / 10 ^ n') = (m * (2 ^ n / 10 ^ n '

    In the simplest case, we will need 256 * 3 (a correction factor of 24 bits, no longer needed) + 256 * 1 (order on base 10 is guaranteed to be less than order on base 2) = 1kbyte of constants. In this case, we need to do only one multiplication of 24 * 24 bits (most likely it will be 32 * 32), which significantly speeds up the work compared to the version of the present calculation.

    Let's see what can be done from the point of view of saving memory (while again we have to pay time, so we are looking for a reasonable compromise). First of all, if we take into account the order sign separately, then we can get by with only half of the required memory (out of 256 bytes for about 10) and invert the result if necessary. Unfortunately, with a correction factor, it’s not so easy, because

    2 ^ -n / 10 ^ -n '= 1 / (2 ^ n / 10 ^ n')! = 2 ^ n / 10 ^ n ',

    but sorry. We must either leave a long table or, for negative indicators, divide by a constant for positive indicators. Of course, division is not 18 multiplications, but all the same in terms of speed, it is exactly equivalent to two multiplications, so time will exactly double in order to save memory twice, up to 512 bytes. Is it worth it - the question is not easy, but, fortunately, we have a much more beautiful way, which allows you to get rid of the suffering of choice.

    This method is generally called piecewise linear approximation and consists in setting the constants not for each point of the original value, but only for some and calculating the missing values ​​(with the required accuracy) using the specified values ​​using a simple formula. With reference to our task, it turns out (disregarding the sign)

    PT = m * 2 ^ n = m * 2 ^ (n0 + n1) = m * 10 ^ n '* (2 ^ (n0 + n1) / 10 ^ n') = m * (2 ^ n0 / 10 ^ n ') * 2 ^ n1 * 10 ^ n',

    where n0 is some reference value, and n1 = n-n0. Then the new mantissa is calculated by multiplying two fixed-point numbers with a subsequent shift of the result, which does not worsen the accuracy.

    Then a legitimate question arises - why do we need a table at all, because you can take the minimum indicator as n0 and get by with just one value of the correction factor? Unfortunately, this approach is counterproductive due to two complementary circumstances - the need to obtain the most appropriate exponent of 10 and the appearance of very long shifts with a similar approach. The latter circumstance suggests the limits of applicability of this method - if we multiply 32 * 32, and the original mantissa has 24 digits, then a shift of 8 digits will not overflow and we will need one control point for 8 binary order values.

    You can still slightly reduce the amount of constants due to the symmetry of the exponent with respect to 0, which I mentioned earlier, but the savings will be 32/2 = 16 bytes, not sure that this will justify the complication (and increase in the code length) of the program itself.

    By the way, I recently looked at the code widely known in narrow circles of the adafruit library and was slightly surprised by the following code snippet

    const UINT8 Bits[] = {0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80};
    ... data = data | Bits[n];

    with the comment that operation 1 << n on the AVR is long. In my post I have already shown what miracles the compiler does with a constant parameter, but this is not the case.

    It seemed to me doubtful that taking a bit mask from an array was faster than directly performing shift operations and subsequent code analysis (using the godbolt website, although it is highly unlikely that its creator read Habr, nevertheless I once again bring him my sincere gratitude) showed that this is true.

    The code generated by the compiler for both options (here is the correct version with shifts, taking into account the specifics of the problem, because we only need 1 bit)

      ldi r18,lo8(4)
      sbrs r25,1
      ldi r18,lo8(1)
      sbrc r25,0
      lsl r18
      sbrs r25,2
      swap r18

    took exactly the same place in memory, and if you do everything neatly in assembler, then the index option breaks forward 8: 7 due to the extra 8 bytes of the program (of course, if we do not take seriously a truly delightful solution with separate storage of an inverted mask that will cost 16 bytes - and IT is used everywhere - "I knew it would be bad, but I did not know that so soon"). Well, the mentioned package is generally a separate song, which could not be better described by the following quotation from one remarkable book: “This castle asked for the cover of the fortification work with the signature“ How not to build locks or find 12 mistakes ”(“ The Last Ring Men ”, if those who have not read, I recommend).

    Let's go back to our sheep with a floating point and build the resulting formula

    PT = m * 2 ^ n = (m * nk [n / 8]) * 2 ^ (n% 8) * 10 ^ nn [n / 8],

    where square brackets mean taking an element of the arrays pc-correction indicator and nn index order. Immediately visible is the computational complexity of the algorithm, which is determined by multiplying 32 * 32 (24 * 24) and subsequent shifts. Then you can take into account the possibility of combining the exponent 10 and the correction factor in one 32-bit word, this will be left to the inquisitive (and patient, because he has read to the end) the reader of this post.

    The only remark finally - when we create a table of constants, in no case can this be done in the following style

    const uint32_t Data [32] PROGMEM = {0xF82345, ...}

    and the point, of course, is not in the attributes of the array description, but in the data itself in the form of magic numbers. As the authors rightly pointed out, it is certainly not more stupid than me, well-written code is self-documented and, if we write the above constant (and the rest) in the form

    #define POWROUD(pow) ((uint8_t)((pow & 0x07)*log(2)+0.5))
    #define MULT(pow) (2^pow / 10^POWROUND(pow))
    #define MULTRAW(pow) (uint32_t((MULT(pow) << 24) +0.5))
    #define BYTEMASK 0xFF
    #define POWDATA(pow) ((POWROUND(pow) & BYTEMASK)|
    (MULTRAW(pow) & (~BYTEMASK)))
    const uint32_t Data[(BYTEMASK/8)+1] = { POWDATA(0x00),POWDATA(0x08), ..POWDATA(0xF8)}

    then no one will send us puzzled questions, and if someone sends, we can not exactly answer them, it is still useless.

    It is possible to suggest a modification of this method, in which the appropriate degree of ten will be calculated not for the right edge of the segment, but for the left and then the result will shift not to the right to take into account the power of two, but to the left. From the point of view of mathematics, the methods are absolutely equivalent. Let's look at the result:

    1.953125 * 2 ^ 9 = 1.953125 * 2 ^ (8 + 1) = 1.953125 * 42949673/256/256/256 (2.56) * 2 * 10 ^ 2 = 10 * 10 ^ 2

    it’s already very easy to know 1000. Of course, we still need to convert the received mantissa and order into lines, accurately round off, adjust the result to the required format, add a sign, take into account special cases and so on, but this is not so interesting, the main part We have completed the transformations.

    Also popular now: