To the question of division
We got the opportunity to conduct a small but extremely interesting tactical exercise
In the process of researching a new MK from a well-known company based on the Cortex-M4 architecture (I’ll definitely write about this later), the question arose of how quickly the integer division operation in hardware implementation can work. The full-scale experiment gave a somewhat unexpected result: dividing a 32-bit number into a 32-bit number is performed over 3 clock cycles of the processor frequency - well, never mind how fast. It turned out that this takes place only with certain operands, but further studies have shown that the time it takes to complete a division never exceeds 7 measures. The obtained results caused a slight rush ("and this is not a certain figure of speech, which does not know what it means, but a very specific verb" - Divov, as always, is incomparable).
Well, you can’t just take and quickly divide such long numbers, it’s strange like that, but the facts are a stubborn thing. I imagined the picture that the President of the Russian Federation calls me to him tomorrow and sets before me the task of making MK no worse than that of ARM (I agree that the picture is delusional, but it doesn’t happen in the world), but I look at him confused and understand that I won’t be able to make such a division of such numbers in such a time, and I will not live up to the expectations imposed on me (well, in fact, I can always quietly buy a license from ARM, and pretend that I invented everything myself, many do it, but GDP is expecting something completely different from me, and then - I can deceive it, but I’m unlikely to).
And I was sad that the guys in ARM are much smarter than me, and I went looking longingly at the Internet to see how they do it. I did not find any information on the execution time on the ARM website, in one of the materials on STM32 it was indicated that the division takes from 2 to 7 clock cycles, which corresponds to the observations, but there is no information on how this is done.
In general, the omnipotent Internet didn’t help much, there are tricks with division by constant, I wrote about them in one of the posts, but we have a different situation, there is Newton’s algorithm and its accelerated version, but this is clearly not the case, there is an algorithm based on Fourier transform, but this is for very large numbers and is unlikely to be completed in 7 cycles even on ARM architecture. I had to come up with it myself and a solution was found, and so simple and obvious that it becomes somewhat embarrassing from the fact that this was not done immediately after the task was formulated.
Before looking at my decision, I suggest you find your own, and then compare with mine, and if they differ, I’m waiting for you in the comments.
So, how do we quickly (in no more than 7 cycles) divide two 32-bit numbers to get a 32-bit result.
To begin with, we recall how division in binary arithmetic in
classical form is generally carried out . The algorithm is quite simple and straightforward - we subtract the divisor from the dividend. If the result is non-negative (we divide the unsigned numbers), then make the next digit of the result equal to one and consider the result as the next dividend, otherwise the next bit of the result is 0. Before the next measure, we halve the divisor (either shift it to the right, or shift the dividend to the left) and reduce the bit weight by 2 times (by similar shifts). Thus, we get one bit of the result in one clock cycle and the whole operation will last 32 clock cycles. There is still an initial shift in this process, but it does not affect the assessment of the situation as a whole. We will accelerate, but how?
We note that the resulting algorithm strongly resembles the operation of an ADC with a sequential approximation and recall that there are other conversion methods, much faster - parallel conversion. And what if ...
We subtract from the divider not only the dividend, but the dividend * 2 and the dividend * 3 (simultaneously, on three adders), then we get three bits (signs of the results) of information that take 4 different values, which means 2 bits of the result can be extracted right away. Next, we extrapolate a similar approach for 3.4.5 bits of the result.
To get 5 bits of information per cycle, we need 31 adders, on each of which the Dividend-Divisor * n operation (1-31) will be performed, the signs of the result are passed through the encoder and we will immediately receive 5 bits of the result. Then we shift the dividend by 5 digits to the left and repeat until ready. Then we need 32/5 = 6.4 => 7 measures to complete the operation.
For work, we need 31 + x adders, it seems to be a lot, but we already have them, because we have a 32 * 32 multiplication operation per cycle, and to implement it we can’t do without 32 adders (well, I think so ... ), so that we already have the necessary equipment, the only question is to build a control circuit and a heap of multiplexers for realizing a fast shift, but this is completely solvable.
So the task of dividing in 7 steps is solved, the question remains - how can this time be reduced, because in the studied MK it is less than 7. The obvious solution is to determine the number of the most significant digit of the dividend (H) and divisor (3) at the stage of preparation of the algorithm it will immediately become clear how many high bits of the quotient are equal to zero, so that we can skip the first or several phases of the algorithm. For example, if C <3, then the result is immediately zero and we complete the operation, for sure you can derive a formula for the number of measures, but I was already bored.
Interestingly, the udiv operation gives only the quotient, although the remainder obviously remains somewhere inside. In principle, it is not difficult to get it in two steps, which was done in the studied fragment of the machine code by executing the pseudo-code Divisible-Private * Divider, but this is for any 2 steps, why not give it out immediately in the register pair - I do not know the answer to this question.
In general, meet the GDP, tell him that we’ll definitely do the division in MK if it’s still interesting to him.
PS: By the way, when I was looking for KDPV (as you noticed, I didn’t find it), I noticed one with the frankly incorrect inscription "You must not divide by zero." I must say with certainty that it is possible to divide by zero, cannot be divided. But seriously, in different architectures they divide by zero differently, in x86 we get an exception (this is an unforgettable mistake 200), in some we get a dividend or zero, but I have never seen the maximum integer. In ARM n / 0 = 0/0 and it turns out 0.
Only registered users can participate in the survey. Please come in.
Traditional rating
- 3.7% this does not have a place on Habr, too simple 8
- 64.3% it was interesting, let it lie 139
- 28.7% Wow, this is what I always wanted to read about division 62
- 3.2% didn’t understand anything, this has no place on Habré 7