To the question of Bezier curves, speed Arduino and one interesting site, or how I spent the weekend
“Anyone can solve the Gray paradox with dolphins, and you try to do it without dolphins. "
Actually, I planned to spend the weekend a little differently, to go to the Copter Huck (not that I was a fan of the copters, just to see what young people think of, hang out like), but the older sister was totally against it. Of course, I insisted (that is, he chuckled a couple of times and said, “Well, maybe, all the same ... it will be cool”), but she was inexorable, and when the spouse took her side, there were no chances for a trip. Well, okay, “I didn’t really want to,” but then I sat down for a little while on the amusing programming problem that I thought of myself, which I report on.
(A necessary note - it meant the previous weekend, like this always - writing a program requires a couple of hours, writing a report on it and five days in public transport is not completed.)
In one recent post, the author considered the problem of acceleration (among other things) of the calculation of Bezier curves (CBs) on MCs with relatively weak parameters. Well, in fact, these parameters are at the level of the average mainframe of the 70s, but today they are considered to be clearly insufficient. As a result of certain actions, the author managed to speed up the calculations somewhat, in my opinion, is clearly not enough, so I decided to write how to do this in the first approximation. I know very well the universal recipe for solving problems with speed - to take the MK with a higher frequency or switch to another family, but I come from those times when we studied it costs what it is, simply because there was nothing else, from the word at all. For the time being, the approach is outdated, but it seemed to me that modern Habr's readers would not be without interest.
We formulate the task - we want as quickly as possible to calculate the coordinates of points on the Bezier curve given by the extreme points A and B and the imaginary focus C. The formula for calculating the point P on the curve is given by the expression
where T varies from 0 to 1 inclusive. (In Wiki they write that this formula was secret at one time, strange as that, but everything is possible). It is clear that we will not count in a complex form, instead we will look for the X and Y coordinates separately. We estimate the complexity of the calculation using this formula, simply by counting the number of characters of arithmetic operations in this expression - 7 multiplications and 5 additions (=> 7 * 5 + ). It is possible that a good compiler (and now all compilers are good and perfectly optimized, if you don’t explicitly prohibit it) will reduce costs to 7 * 3 +, although it would be better to help him by calculating (1-T) in advance. Generally speaking, a good compiler can do wonders if all values in a formula are represented by constants, but we assume all values are statically undefined.
Part One, Mathematical
We start the optimization process, for which we open the brackets and group the terms at T (maybe someday the compiler will be able to do it for us, but so far this part of the work rests on natural intelligence), getting
=> 5 * 5 +, which is clearly better than the original value of 7 * 5 +, but with respect to the improved 7 * 3 +, you should still think.
If we take the execution time of the addition operation as a unit, then the execution time of the multiplication will be exactly not less than one, as a rule, longer, but how much it depends on the implementation of the MC (first wrote from the architecture, but this is not quite true). When there is no hardware multiplier on the crystal, the multiplication time will be tens (30+) times greater than one, and when it is present, it will be several times (1-6). Therefore, it is safe to believe that the replacement of multiplication by addition gives a gain (and often significant) in execution time almost always. Well, we immediately note that the transition from fixed-point numbers to a floating point (let us leave aside the proof of this fact) leads to an increase in the execution time by 20+ times for addition (alignment is very strongly affected here) but only to a slight increase for multiplication. Therefore, for floating-point numbers, the times of addition and multiplication differ little, especially in relative terms (one can expect 2 times maximum), but they still differ and are not in favor of multiplication, so the gain is also here.
Returning to the previous paragraph, we find that for PT 5 * 5 + score should not have a significant advantage over 7 * 3 +, but we still have reserves. Let's pay attention to the fact that we have to calculate the set of values of points on the Bezier curve, when the parameter T changes, and all other parameters of the curve are fixed (but not constant, but sorry), then the rest of the formula can be calculated in advance and we get
=> 3 * 2 +, where and already not bad, but if you remember Horner’s scheme and write
=> 2 * 2 +, then compared with the decision "in the forehead" we have to win more than 2 times, almost 3, and these optimizations are completely obvious.
Let's test theory with practice (although this is completely unnecessary, we are confident in our estimates, but suddenly I underestimated the compiler), for which we need to measure the real time of execution of different variants on a real hardware. Well, it just so happens that I have many different debug boards for MK from various companies at home (including rare books like Luminary Micro or Intel Edisson debugs, try to buy this now), but there is not a single Arduino board (“Well we do not have pineapples "). It would seem a dead end, but there are options - a very interesting site tinkercad.com comes to our rescue, on which you can build your scheme on a breadboard using the Arduino module, write a sketch and execute it immediately. You can set breakpoints
We turn to this site and begin to measure. To begin with, we check our assumptions about the execution time of operations and, after clearing away the attendant circumstances, we obtain the following data for integers:
8 + 8 => 8 - 1 clock cycle, 16 + 16 => 16 - 2,
8 * 8 => 16 - 2, 16 * 16 => 16 - 14 (the only thing that was unexpected, I thought to get 4 * 2 + 4 * 2 = 16, there are interesting optimizations there),
8/8 => 8 - 230, 16/16 => 16 - 230.
Pay attention to the last two digits, of which it is clear that the division operation refers to the forbidden, if we really want to count quickly. Now (finally) we measure the time for performing operations on numbers of PTs with a mantissa of 24 bits
a + b - 126 (and strongly depends on operands), a * b - 140, a / b - 482.
The obtained data correlate well with our theoretical assumptions, it is clear that there is hardware implementation on board this MC: there is multiplication, no division, no PT operation.
Now we begin to measure the time of the full calculation. We are given the values A = 140, B = 120, C = 70, and we build 170 evenly distributed points in KB. Why exactly these values - they were given in the specified post when evaluating performance. Below are the algorithms and the corresponding test execution time.
Formula (1) => 20 msec or 1900 cycles per countdown
Formula (1) => 18 msec or 1660 cycles per counting (count 1-T separately)
Formula (2) => 16 msec or 1540 cycles per counting
Formula (3) => 10ms or 923 clocks per sample
Formula (4) => 8 ms or 762 clocks per count.
It can be seen that the resulting reduction in execution time (from 20 ms to 8 ms) correlates well with the expected one and we were able to speed up the calculations by more than 2 times. We note that in addition to completely obvious considerations and mathematics that does not go beyond the course of secondary school, we did not need to.
And now let's talk about what we should do if the result obtained is not enough, and we have already squeezed everything out of the calculation formulas. They wrote to me here (in the comments to another post) that in general any task can be reduced to calculations with FT and, despite the obvious controversy of the assumption (try to do this to solve the Navier-Stokes equations numerically), in this particular case this recommendation applies , although, as always, there are nuances.
Part Two, Computing
Once the modifications of the algorithm have been exhausted, only the data structures remain and we enter the ground of fixed-point numbers. Here we are waiting for a lot of pitfalls, which we did not think about for PT - the range and accuracy (in general, for PT, these questions should be thought about, but here everything is simpler, much has been done for us). It is necessary to conduct a small study of the problem to determine the necessary representation of FT (selected in the above post 9.7, judging by the results, is clearly insufficient), but I suggest to go a little different way. By the way, if we take not 170 steps in the interval, but 128 (I see no reason forbidding this step to us), then this presentation would be completely satisfactory. If we take into account the fact that the constants defining the KB are given by integers,
We use only the last formula and rewrite it in the new notation.
(=> 2 * 2 + 2 /), where A1 and B1 are calculated in the same way as for PT. Obviously, all integers and corresponding operations should be performed much faster. In order not to lose accuracy in the operation of dividing integers (2/3 = 1! = 1.5) and carry out the division at the very last moment, we slightly transform the formula to the form
(=> 4 * 2 + 2 /). All the numbers of FT, so we implement this algorithm and get ... so here you are, grandmother, and St. George's day ... 1869 cycles, but this is much worse than the PT, we started with this, some garbage, because whole numbers are much faster.
We begin the debriefing and it turns out that it is not enough just to change the type of variables. First, we must use numbers not 8 and not even 16, but 32 bits, otherwise overflow will occur, but long numbers, although faster than PT, but not enough to compensate for the flaws of the algorithm. Second, these flaws are we again had constants calculated on each clock cycle - we remove them by preliminary calculation B2 = B1 * I, A2 = A * I * I. Then we get
(=> 2 * 2 + 2 /) with the result of 1684 - better than the previous one, but still it was not for this that we left PT.
We exclude the calculation of another constant I2 = And * And we get
(=> 2 * 2 + 1 /), with the execution time of 956 cycles - and this is interesting, the exclusion of one operation led to a significant increase in productivity.
That's what slows us down - division, since this is a very time-consuming operation, but we have an interesting trick to deal with it. To calculate the expression 1 / And we can carry out elementary transformations 1 / И = 1 / И * (Н / Н) = 1 * (Н / И) / Н. If we choose the power of two as H, then division by H can be replaced by shifts, and if the exponent is a multiple of 8, then even shifts are not needed. And the value of H / I will have to be calculated honestly, but only once, after which only multiplication remains in the calculation cycle.
We draw attention to the fact that we did an incompletely correct transformation and replaced H / AND with its rounded value K in order to proceed to operations with integers only. The incorrectness lies in the loss of accuracy and additional research should be conducted to prove the applicability of this approach to our case. We write H / I as (K * I + d) / I = K + (d / I), where d is less than I. Then the absolute error in going to H / I k will be d / I, and the relative error - d / And / (K + d / I)> = d / I / (K + 1) ~ d / I / K, provided that K >> 1 (this is not a shift). It follows that the value of H should be chosen as large as possible, since the absolute error of the calculation is A * d / I / K> = A * 1 / H / I. If we want the error to be no more than one, we must withstand the condition A / K <= 1, then K> = A, convert K * And> = A * And, which means H> = A * And, then we do not lose exactly. For our case A <= 256 and AND <= 256, we get H> = 2 ** 16, which is quite acceptable. Obviously, in the above formulas should use the modules of the original numbers.
We note for the future that if we will not round down, but towards the nearest integer, then the requirements are somewhat reduced and H should be two times smaller, although there are nuances here.
In any case, we can ensure the required accuracy and obtain the following algorithm: H = 2 ** 16; K = [H / I] (And <256); 0 <= and <= AND;
(=> 4 * 2 + 2 >> 16) where all operations are performed on long integers. We implement this algorithm and get 583 clocks ... but this is already close to the ideal, but not yet an ideal.
Then there are minor settings for a specific MK - working with global variables is faster. than with local ones, but even faster with registered local ones, which leads to a decrease in time to 506 cycles.
Further, we note that the last multiplication before the shift can be carried out with 16-bit numbers, which will give 504 - a trifle, but nice.
In total, we have accelerated the calculations compared with the implementation of the "head on" in 1900/504 - more than 3 times, and not lost exactly from the word at all. I call this result a time optimization, and not 20% obtained in the original post.
Is it possible to achieve even better indicators - it is possible, but this is the topic of the next post.