On the question of multiplication, square root, import substitution, and Milander

    “Entropy, ergodic source, multidimensional message space, bits, multi-sense, Markov process — all these words sound quite impressive, in whatever order they are arranged. If we arrange them in the correct order, they acquire a certain theoretical content. And a real expert can sometimes with their help find a solution to everyday practical tasks. ”

    John PEARCE“ I see no evil ”


    This post is full of discussions about the fine optimization of performing mathematical operations on MCs with limited resources, as well as subjective evaluations of various aspects of embedded software development.

    Those whom this warning is not scared, please under the cat.

    Before we proceed to the description of the procedure for extracting the square root of an integer, the operation inverse to squaring and, accordingly, multiplication, let's talk about the latter.

    Suppose that we have the ability to multiply an 8-bit number by an 8-bit one, getting a 16-bit result (8 * 8 = 16), how to get an implementation of the operation 16 * 16 = 32 on the basis of this operation. The obvious way is to represent 16 as the sum of two 8, then we get

    А(16)*Б(16)=(а1(8)*256+а2(8))*б1(8)*256+б2(8)) =а1*б1*256*256+а1*б2*256+а2*б1*256+а2*б2

    If the resulting expression replaces the multiplication by 256 by a left shift of 8 bits, then we get a completely working algorithm. We estimate the time spent on implementation - we will need 4 multiplications of 8 * 8 = 16 and 4 additions of 4 byte numbers 32 + 32 = 32. For AVR type MCs, we get 4 * 2 + 4 * 4 = 24 clocks, but this is for a head-on solution. Let's try to improve the result. The fact that we need not 4, but 3 additions and one assignment simplifies the situation somewhat, since the initial zeroing of the result is not required, but we still did not take it into account, although it was necessary and the total time should be 24 + 4 = 28 cycles. But, if we take into account the presence of a shift in the first three terms (respectively, we have that the youngest (two lower bytes) is zero and there is no sense to add it to the result), then we will have to add not 4 bytes, but three and two, which reduces the execution time by 1 * 2 + 2 = 4 clock cycles and we get 20 clock cycles. Next, we can draw attention to the fact that the first and last addends do not intersect at all, which will replace the zeroing of the upper half of the result with the assignment of the first addend and reduce the execution time by 2 more cycles to 18. Next, using the architecture features, namely the presence of the register transfer command couples, save two more bars and the final result - 16 cycles instead of the original 28 - a trifle, but nice.

    Similar optimization methods work for operation 32 * 32 = 32, for which you can shorten the execution time from the expected 4 * 4 * (2 + 4) + 4 = 100 cycles to (3 + 5 + 4 + 3) + (5 + 3 3) + (4 + 3) + 3 = 36 cycles, which is quite good. Well, at the end of the consideration of various multiplication options, we note that 16 * 16 = 16 can be obtained in 3 + 3 + 3 = 9 cycles. Note that all these considerations are valid only on the assumption of the operation 8 * 8 = 16 for 2 cycles, and if it is not on the target MC, the execution time of all other versions of the operation will not become faster.

    Let us summarize the time required for the multiplication (8 * 8 = 8 2, 8 * 8 = 16 9, 16 * 16 = 16 16, 16 * 16 = 32 36) and consider now the original problem.

    We need to extract the square integer root of 32 bit number H, that is, to find the largest 16 bit number n such that n * n <= H. We all from the middle school course know the method of successive approximation to the square root (n = (N / N '+ n) / 2), but when using it we will have to divide integers, and this is a very time consuming operation.

    Therefore, other computational schemes have been developed, one of which is the bitwise approximation method, which in the pseudo-code looks like this:

    • initial values ​​-> n = 0; b = 0x8000;
    • perform 16 times -> if ((n + b) * (n + b)> = H) n = n + b; b = b >> 1;

    You can immediately estimate the time spent on this option 16 (number of bits of the result) * (2 (cycle organization) +2 (addition) + X (multiplication) +5 (comparison and solution) +2 (result modification) / 2 (on average half times) +2 (bit shift)) = 16 * (12 + X). You ask, why in the formula X instead of the number 16, and it turns out that an ambush awaited us, since we write in C, and not in assembler. The fact is that in the standard library there is no multiplication operation with changing the digit capacity and we cannot use 16 * 16 = 32, but are forced to use 32 * 32 = 32, which leads to X = 36 instead of X = 16 and the final figure is 16 * 48 = 768 cycles to extract the integer value of the square root of a 32-bit number.

    Of course, this is much better than Newton's method, but a bit too much, let's see what can be done.
    So, it is obvious that most of the time is spent on the calculation of the next result of the multiplication. Of course, you can rewrite in assembler and use a less expensive version of multiplication, getting 16 * (12 + 16) = 448 cycles, but we will leave this way as a last resort. Consider the process more carefully and see that we do not calculate the random number multiplying by itself, but the multiplication of the previous value with some increase, and the square of the previous value is known. Hence, we can resort to the difference scheme, based on the formula (n + b) * (n + b) = n * n + 2 * n * b + b * b. At first glance, it looks like a mockery: instead of one multiplication, we will need to perform four pieces, and even two additions of long (32-bit) numbers. But let's start to understand: n * n we already have, b * b, taking into account the fact that b = b '/ 2 is easy to get, like b' * b '/ 4, and similarly 2 * n * b = 2 * n * b '/ 2.

    The following calculation scheme appears:

    1. initial values ​​-> nn = 0; n = 0; b = 0x8000; bb = b * b;
    2. repeat 16 times -> if (nn + n + bb> = n) {n = n + b; nn = nn + bb + n}; bb >> 2; b> 1;

    We estimate the implementation costs 16 * (2 (cycle organization) +12 (assignment and two additions) +5 (comparison and solution) + (2 (addition) +8 (two additions)) / 2 (about half times) +8 (right shift by 2) +2 (right shift) = 16 * 34 = 544 cycles. Already better than with the wrong multiplication 32 * 32, but we still have reserves.

    What are they - pay attention to the most costly operation - adding and comparing a total of 17 clock cycles and remake the main loop of the algorithm:
    2. repeat 16 times -> T = Nbbn; if (T> = 0) {H = T; n = n + b);}; bb >> 2; b> 1;
    Then the cycle time will be 16 * (2 (organization of the cycle) +12 (calculation of the new difference) +1 (comparison and solution) + ((4 (assignment) +2 (addition)) / 2 (an average half times) +8 2) = 16 * 28 = 448 cycles, if we take into account the peculiarities of the architecture, you can save another 2 + 2 = 4 * 16 = 64 cycles and fit less than 400 cycles.

    We get even a slightly better result, as with the correct multiplication is 16 * 16 = 32, but without assembler, “on pure C." However, there is a significant minus - if in the variant with multiplication everything is intuitive, then the variant with difference scheme without commentaries s impression of a session of black magic, you choose. We also note that we have exchanged the number of measures on the additional memory for the intermediate variables, and usually happens.

    The necessary remark is that we did not receive a significant (at times) gain compared to multiplications, since we have a fast implementation of 8 * 8 = 16. If it is absent in MK (and this happens) or not so fast (and this also happens), then the difference scheme becomes multiple times faster, since it uses only standard addition and shift operations, which are guaranteed to be in any MK.

    It seemed that it would be better not to succeed, but it turns out that there are still reserves for improving the performance of the algorithm. Let's try to use another classic method of acceleration - divide and conquer. What if we first extract the square root from the upper half of the argument, and then refine it? First of all, we show that it is fundamentally possible. Indeed, we represent the argument in the form H = H '<< 16 + H' 'and the result in the form n = n' << 8 + n ''. Since n '' <256, then the square of it will certainly be less than the square of the number n = n '<< 8 + 256 = (n' + 1) << 8. It follows that the highest part of the result does not exceed the square root of the highest part of the argument.

    The implementation of this approach to leave to the inquisitive reader.
    That will give us a similar approach, because the total number of iterations will remain unchanged - we can spend the first half of iterations with numbers of shorter length, and this leads to a decrease in time costs. This approach can be applied to the multiplication option and the difference option, the total gain will be up to a quarter of the total execution time.

    The necessary note is that the applicability of this approach is not at all obvious; when implemented for AVR-type MCs, execution acceleration does take place, but for some architectures, for example, for x86, a slowdown appeared. Apparently, working with non-native data (16 bit) in this architecture is significantly more expensive in time than with native (32 bit) data. I did not conduct a deep research, but the fact took place and should report about it in order to avoid misunderstanding.

    But that's not all. Since we have already embarked on the path of separation and rule, then why not go further - extract the root from the bits, step by step, starting with the older ones (starting with the younger ones is counter-productive in our case). The algorithm of the algorithm is the same - we add the next bit of bits to the current result and try to add the next bit to the result, checking whether we have gone beyond the root value. The peculiarity is that we can only check the high bits of the argument, until we reach the younger bits.

    When implementing, we use another trick - instead of moving our subtracted numbers to the right, we will move our diminishing argument to the left, the meaning of this does not change, and the speed increases. Increasing due to two factors - 1) we only need to subtract only 16 bit numbers (there is one feature, and it must be taken into account, but we consider the training example, howl) and 2) we will not need to shift the square of the next bit, because it will always equal to one. But for everything in this world you have to pay and we will have a shift of the extended difference (6 byte) to the left, and by 2 bits per clock. We look at the pseudocode

    1. initial values ​​-> n = 0; H1 = 0;
    2. repeat 16 times -> (H1, H) << 2; T = H1-n-1; if (T> 0) {H1 = T; n = n + 2}; n << 1;

    and estimate the execution time by getting 16 * (12 (extended shift) +4 (calculation of the race) +1 (decision) +2 (assignment) +1 (increase) +2 (shift)) = 16 * 22 = 352 cycles, perhaps The result is close to perfect. When implementing this option there are small pitfalls, I leave it again to the inquisitive reader (well, he gets the work).

    Well, in conclusion of the section that I was inspired to write this post. There is an absolutely wonderful McuCpp library, authored by Anton Chizhov, in which, based on the loki class of authorship, Andriescu is unusually elegant (well, as far as elegance can be applied to C ++ templates), the work with the pins <a " github.com/KonstantinChizhov/ Mcucpp“I have great respect for the author mentioned (for both of them) and recently, due to circumstances I’ll talk about later, I watched the source code of this library and once again admired.

    However, among other files I saw template_utils.h, in which some auxiliary subroutines were implemented, and among them an integer root from a 32-bit number. The fact that it uses the simplest algorithm of sequential approximation with multiplication is not terrible, because this algorithm does not lose much in speed, and in comprehensibility it gives a lot of points ahead and still wins. But I didn’t like the fact that it was implemented a little sloppy (in terms of speed), because "children can see it." Inaccuracy consists in representing a selectable number of 32 bits, because we firmly know that the root of the 32-bit number does not exceed 16 bits, so why should we make zero-byte shifts. And this is exactly the case

    Obvious function conversion

    staticinline uint32_t sqrt(uint32_t value){
      uint16_t result = 0;
      uint16_t add = 0x8000;
      for (uint8_t i = 16; i !=0; ++i) {
    	uint32_t rootGuess = result | add;
    	uint32_t guess = rootGuess * rootGuess;
    	if (value >= guess) {
    		result = rootGuess;
    	}
    	add >>= 1;
      }
      return result;
    }

    allows us to save 2 clocks on a bit shift and 2 clocks on creating the next multiplier on each cycle, and organizing the cycle in the specified form is 4 more clocks (I know that the compiler can do this optimization for us, but why not help it explicitly ), which is quite good for purely cosmetic code changes that do not affect its clarity at all.

    Late note - one comment made me think that it would be more correct

    for (uint_fast8_t i= ...)

    Thanks, Oleg, for the hint.

    The cherry on the cake is located just below the function of extracting the whole square root of the sign number, which states that √-1 = 65635 = -1. On the other hand, why not, than this result is worse than any other, it’s not an exception call in MK, and the whole square root of a negative number does not exist.

    And the conclusion about why I turned to the library of Anton Chizhov. He pushed me to this recent post on the national RTOS for MK called MAKS (MultiAgent Coherent System) - see the epigraph to the post advertised by its creators and which is ported to Milandr produced by MK. Remark - this post is by no means a promotional material and it will soon become clear to readers. From the mcucpp mentioned above, the OS authors used the implementation of the ring buffer (without detracting from the advantages of the library of Anton, I must declare that this part of it is not a reference, and this is still a mild formulation, which I wrote in another post that I can’t post in any way). Since I work closely with MK production, I was interested in the material and I followed the link to the developers site.

    Here begins another cry of Yaroslavna.

    Last year, when the creation of a national RTOS was first announced, I downloaded a description of a software product from this site, but somehow my hands did not reach the point of studying. By the nature of my business, I have to deal with domestic components (I understand enough ...), so having the appropriate software would be quite good. Remembering how in the last year's release the director of the company told about the millions of rubles spent on development and a large team working on the creation of this software, I decided to see a trial version available for download for free, so I share the results.

    For a start, the description for half a year has almost halved in volume (from 115 to 55 pages), and if the disappearance of applications with screenshots describing the process of launching third products from the Program Description is welcome, then the appearance of these materials (to create it was spent, though not very significant, but still time and money) in a document like “Operator's Guide”, I personally am bewildered. Further, in the first sentence of the document, we see a frank deviation from the truth, since the RTOS itself is not intended to “create programs”, for some reason the authors did not allow themselves such statements in the previous version of the document, the influence of the marketing service is felt. It also delivers that if before the description was in the / docs folder of the root directory, and that was logical,

    I start to look at the description, looking at the source code (it is kindly included in the trial version) and I am surprised to find the absence of any peripheral device drivers adapted to work with this OS. At first I suggested that this is a feature of the trial, then on the forum in the information from the developers I find out that there really are no drivers, but they are working on it. More than half a year (half a year, Carl, generally close to a year) from the moment of OS release for MK, and they work on drivers. Naturally, or as they say, it goes without saying that no third-party products (file system, network stack, USB stack) are out of the question. A funny idea from the authors about the requirements for software development for MK, okay, drove again.

    That is, the declared OS, the underlined feature of which is the organization of interaction within the multicontroller system, has no native means of organizing this interaction. What we have in the bottom line - and we have task management, the actual scheduler itself, the minimum time service and task synchronization tools, and that’s all fun, if not more. Okay, we will look further, even in such a set of components interesting solutions are possible, especially if we consider that on one site (not the manufacturer’s company), following the link, I saw the “expertise” of the source code of this OS. In this document it was said that the software product does not use third-party (imported) components and is distinguished by originality, it is necessary to make sure.

    The first remark is that if the original ARM files included in the source package are used to port to a specific Cortex-M0 (1986 BE1T) architecture, this is very similar to the use of third-party (imported) text fragments - personally it seems to me that this is exactly the use, but I guess I do not know everything. Well, secondly, the source code of the sheduler and its associated task management components are really original and have no analogues (at least, they are not known to me), but this is the kind of originality that recalls the phrase of the old shaman from the movie “Evil Spirit of Yambuya” the great hunter: "Ears cut, cooked and eaten - would you have guessed?".

    I will try to explain myself - in the design of the OS in general and in the RTOS in particular, one of the most difficult is the question of ensuring access of all processes in the system to a shared resource - processor execution time. The fact is that an incorrectly designed system (and an unsuccessfully written task) can block the execution of all tasks with a lower priority, which will certainly puzzle the programmer. This is not about performing prohibited operations such as interrupt management (this is a topic for a separate conversation and does not have a simple MK solution, although the authors of the operating system claim that they solved this task using the MPU), but about continuous execution without waiting.

    This problem is widely known, it can be solved in different ways, there is an extensive literature on this subject, as a rule, a set of task queues with different priorities and a modified sheduler are used. This requires certain costs for queuing with access time O (1) and, for example, in FREE-RTOS, when trying to set more than 20 possible priority levels, the programmer gets a puzzled compiler question, and does he really need that much (and even if this question is there is an affirmative answer, without modifying the source code so many priorities do not get).

    Therefore, I was slightly surprised to find that the operating system in question allows us to have a number of priorities up to 60 (and even more). The surprise dissipated when I looked through the source. Instead of separate queues of tasks with equal priorities, the authors use one queue (there is also the second queue of blocked tasks) of tasks ready for execution, which results in saving memory (probably, this was the purpose of such a solution) due to the fact that

    1. the operation of putting the task into the queue becomes O (n) and
    2. this makes the use of a modified sheduler impossible - in my opinion, a little expensive for 20 * (3 * 4) = 240 bytes of RAM. The decision is unusually original, but, from my point of view, this is its only virtue.

    In general, I did not understand what the authors are going to take money for (but they still haven’t decided whether to do this, judging by the forum), and which solutions and features allow us to give the software such a resounding name. Especially considering how much software is provided for free by numerous suppliers of MC (of course, imported). When I looked at the company’s forum in an attempt to find an answer, I saw references to the previously mentioned mcucpp software product (the authors of MAKS were allegedly inspired by Chizhov’s ideas - three times ha), in which I discovered the minor flaw described above.

    The conclusion from this section - if other solutions for import substitution in the field of software are carried out in similar ways with similar results in quality, then the prospects for building domestic embedded systems seem to me to be very poorly defined.

    In conclusion, I would like to refer to the manual (no, not the OS developer, I don’t even want to mention it alongside the good developers) of the developer company and manufacturer of the mentioned MC - Milander. You are doing quite good microcontrollers (I will not lie, inferior in terms of parameters to foreign analogues, but not fatally), for example, at one time (in 2013) the BE1T was almost top among classmates, but in the yard in 2019 and during that time many people caught up with it and overtook.

    But, if there are no good MKs produced by the company:

    1. normal (better to be good, but I'm realistic) documentation (I know that you think it is, I assure you, you are mistaken),
    2. inexpensive versions in convenient cases (you have it),
    3. inexpensive and easy-to-use (small-sized) debugging boards based on clause 2,
    4. a set of libraries for working with peripherals such as HAL, CMSIS (there is something),
    5. a set of well-documented examples of the use of the IC component,
    6. a set of adapted and tested stacks for standard interfaces,
    7. adapted and tested external components (3rd part), including RTOS,
    8. a large set of ready-made examples of the implementation of standard devices
    9. verified adaptation packages for popular programming environments,
    10. my own development environment, integrated with the listed components (I understand that I was drowned, but maybe all the same ..) completely ready for use "out of the box",
    11. educational materials based on the above, including printed materials and the organization of seminars of different levels (you have MIET at your side, of course, MIT would be better, but “I don’t have other writers for you”), the scope of possible application of these MK data is somewhat narrowed, is not it (howl?).

    Of course, all this does not appear by itself and costs money, but it seems to me that the company of your level could afford to work 5 people for six months to create all of the above (except for the item, perhaps, clause 10, although now all significant and many small MK manufacturers have got their own IDE). If this is implemented, then the ground will disappear for the appearance of handicrafts of the type of OS described in this post.

    Although it may well be that I do not know something, and in fact everything is not so simple, it is a pity if this is true.

    I apologize in advance if my notes seem too harsh, it was you (Milander) who I did not want to offend.

    Also popular now: