On the question of bitset

"Isn't it time, my friends, for us to take a swing at William, you see, our Shakespeare?"
I recently read a post about a custom keyboard and once again thought that it would be nice to write a reference keyboard implementation, that is, one I would not be ashamed to sign with my name and put on public display. This idea has come to me more than once, but it keeps stalling at the very first stage, reading the input data, because I want that stage to be easy to configure, convenient to use, universal and efficient, and I do not like the offer to "choose any two".
A necessary note: I see the following layers in a keyboard implementation, numbered from zero: 0, event detection; 1, data reading; 2, data cleaning and storage; 3, message generation; 4, transcoding; and so on. The most promising approach for layer 1 and the associated layer 0 seems to me to be Anton Chizhov's templates for working with MCU pins (based on the Loki class), and perhaps someday the result will be something I am not ashamed to share, but certainly not today. Right now I am thinking about layer 2.
Let us formulate the problem: we need to store information about a fixed set of switches, each of which takes one of two predefined values, "closed" and "not closed". The natural candidates are boolean variables, and the bitset library as a way to store the set. Since efficiency is a categorical imperative here, it is worth evaluating this implementation from that point of view.
My first thought was to look at the source code, after which everything would immediately become clear, but after a brief acquaintance with it I realized that studying other people's templates is not very interesting (and not very productive). Besides, the sources alone do not give an accurate assessment of efficiency, since that ultimately depends on the compiler. In fact, I did have to study the source text later anyway, since otherwise making changes to it becomes a very lengthy process (unless, of course, we do not care about achieving a particular result), but that is a topic for a separate post.
Therefore I adopted the "black box" technique: feed various code fragments to the input and examine the generated assembler. Unfortunately, my favorite site, godbolt, cannot be used with the familiar AVR architecture, since that toolchain does not include the library under study. One could, of course, paste the library in by hand, but there is no guarantee that it would be the correct source.
Therefore we will look at a different architecture. x51 is not available for the gcc compiler, I have never liked x86, ARM has an assembler that is not very convenient or readable (for a human), MIPS is quite specific and not too common, and the various SPARCs are even worse (well, I hope I have not offended anyone's favorite architecture; mine is no better). But there is an excellent candidate: MSP430, which is based on the crystal-clear and elegant PDP architecture, and TI could not spoil it much (although the guys tried). The bitset library is available for this architecture, so the study can begin.
Let's start, trivial as it sounds, from the beginning, that is, with the declaration of a set. We immediately see that memory for storage is allocated in four-byte words, even though the natural unit on this architecture is the two-byte word, and byte operations are also quite convenient and efficient; this leads to strange incidents. The author can be understood: a 32-bit integer is guaranteed to exist everywhere, and relying on it is quite natural, but in this case 8-bit storage would be preferable, and for AVR 8-bit storage is the only reasonable choice.
An interesting question: how can one determine the natural bit width of the architecture at compile time? It will probably take some experimenting with uint_fast8_t. We note a possible optimization and move on.
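As an aside, one compile-time approach (a sketch of my own, not anything from the library) is to derive the storage word from uint_fast8_t, which the standard library already selects per architecture, and size the set accordingly:

```cpp
#include <cstdint>
#include <cstddef>
#include <climits>

// Hypothetical sketch: pick the storage word for a bitset-like container
// from uint_fast8_t, so each target gets its natural width (1 byte on AVR,
// typically wider on desktop targets) without hand-written #ifdefs.
using storage_word = std::uint_fast8_t;

constexpr std::size_t word_bits = sizeof(storage_word) * CHAR_BIT;

// Number of storage words needed to hold n_bits bits (rounding up).
constexpr std::size_t words_for(std::size_t n_bits) {
    return (n_bits + word_bits - 1) / word_bits;
}

static_assert(words_for(1) == 1, "one bit still occupies a whole word");
```

Whether uint_fast8_t actually matches the register width on every embedded toolchain is an assumption worth checking per compiler, but it is at least a portable starting point.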
Besides memory allocation, initialization is of interest. For global sets it is done in the standard way, zeroing before main is called; for local sets it is also done in the standard way, that is, not at all, unless an initial value is explicitly specified. And, as always, if a set with an initial value can be declared statically outside the function, this should be used, so as not to generate unnecessary startup code and not spend execution time on it. No revelations here; we have merely confirmed the general rules.
Let's move on to modifying the set, for which we have square brackets and the set and reset methods. To set element n of the set M we would expect to see something like:
M[n / w] |= (1<<(n % w))
where w is the number of bits in the storage element. For this architecture, a statically known n (for example, 4) and optimization enabled, this leads to code of the form:
bis.w #0x0010, m
Indeed, we see exactly such code in the right half of the window, and it is unlikely that anyone would seriously claim that a more efficient solution is possible. But this holds only under the stated conditions; for an arbitrary n the picture changes completely. For the methods we observe validation of the argument with generation of the corresponding exception, and for the brackets we see the argument being masked to the permissible range, with entirely predictable undefined behavior; both cases are fully consistent with the documentation. Negative values are handled "correctly", since indexes are treated as unsigned numbers.
Note that the value assigned to an element of the set can be not only 0 or 1, as one might expect, but any integer, to which the rule "anything nonzero is one" applies. This is quite logical, but poorly reflected in the documentation. It is a little strange (a boolean value would be more natural), but we tick the box and move on.
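A minimal sketch of the two access paths described above (all names here are mine, not the library's): the checked method throws on a bad index, and any nonzero value counts as one:

```cpp
#include <cstdint>
#include <cstddef>
#include <stdexcept>

// Illustrative model, assuming byte-wide storage (w = 8) and a 32-bit set.
constexpr std::size_t W = 8;
std::uint8_t M[4] = {};

void set_checked(std::size_t n, int value) {
    // The method path: validate the index and throw, as the docs describe.
    if (n >= sizeof(M) * W)
        throw std::out_of_range("bit index");
    const std::uint8_t mask = std::uint8_t(1u << (n % W));
    if (value)                       // the "anything nonzero is one" rule
        M[n / W] |= mask;
    else
        M[n / W] &= std::uint8_t(~mask);
}
```

The bracket path would be the same arithmetic without the range check, which is exactly where the undefined behavior for out-of-range indexes comes from.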
Comparison of the code generated for a statically unknown element number shows that the efficiency in both cases (brackets and methods) is very close, and rather low, because computing (1 << (n % w)) with a variable shift count is expensive on this architecture.
Therefore a universal and fast method is to store the bit masks in an array and index into it; on this architecture this is also very efficient, and the code then looks like this:
M[n/w] |= BitArray[n %w]
and the generated assembler is just as compact.
Since we are dealing with templates and w is a multiple of the byte size, the division and remainder operations are implemented very efficiently here. Note the clear advantage of the smallest possible storage element: with byte organization an 8-byte mask array suffices, with 16-bit words we need 2 * 16 = 32 bytes, and with 32-bit double words 4 * 32 = 128 bytes to store all the required masks. Why pay more if the result does not change? Let's remember one more possible optimization and move on.
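The mask-table variant can be sketched as follows (names are illustrative, assuming byte-wide storage, so the whole table is only 8 bytes):

```cpp
#include <cstdint>
#include <cstddef>

constexpr std::size_t W = 8;                 // bits per storage element
constexpr std::uint8_t BitMask[W] = {        // precomputed masks, 8 bytes total
    0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80
};
std::uint8_t M[16] = {};                     // room for 128 switches

inline void set_bit(std::size_t n) {
    // Both n / W and n % W reduce to a shift and an AND because W is a
    // power of two; no variable-length shift is ever executed.
    M[n / W] |= BitMask[n % W];
}
```

The trade is a few bytes of constant data for the elimination of the variable shift, which is a good deal on any architecture without a barrel shifter.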
Note one more fact: much more efficient implementations of operations on a set element are possible if the target architecture has a bit-addressable memory region (here the previously rejected ARM comes to mind again), where setting an element generally turns into BitSetAddr[n] = 1, which translates into a single assembler instruction for constant n. But ARM already has efficient variable shifts, so such an optimization would be rather redundant, especially given its limitations. In principle, both x51 and AVR have a similar bit-addressable area, but they only have efficient instructions for constant element numbers, and the general case is not so good (frankly, bad).
Now let's take a close look at the resulting code and note artifacts associated with storing the set in double words. For an element-modifying operation the compiler generates a sequence of instructions that reads the corresponding double word from memory into two registers (recall that the registers here are 16-bit), modifies them, and writes the result back. If we change only one element, the operation mask contains exactly one set bit out of 32, the rest being zeros. With a statically known element number, the operations that cannot change the result should be eliminated at the optimization stage. As a rule this happens, but for some operands something goes wrong and artifacts of the following form leak into the final code:
which looks especially funny once you notice that the register in question is read from memory but never written back. Strictly speaking, the results of optimization are not regulated anywhere, so they can be anything, and there is nothing to be offended about; still, "it is not neat how it comes out". We cannot influence this process directly, short of seriously digging into the compiler's source code, so let's go around it: we will help the optimizer by simplifying its task.
By the way, I still cannot find an answer to this question: is it possible in C++, at the macro or template level, to select a different implementation for a parameter that is statically known at compile time versus one that is not? If anyone knows the way of the samurai, tell me in the comments; I tried constexpr, and it did not work.
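One workaround that is sometimes suggested (a sketch only, and not a full answer to the question above, since the caller must pick the overload explicitly rather than the compiler detecting constness) is to duplicate the operation as a function template whose index is a template parameter:

```cpp
#include <cstdint>
#include <cstddef>

std::uint8_t M[16] = {};                     // illustrative 128-bit set

// Runtime-index path: the general code, with the usual index arithmetic.
inline void set_bit(std::size_t n) {
    M[n / 8] |= std::uint8_t(1u << (n % 8));
}

// Compile-time path: the caller opts in by writing set_bit<4>(), and the
// address and mask fold into constants; the range check is free too.
template <std::size_t N>
inline void set_bit() {
    static_assert(N < sizeof(M) * 8, "bit index out of range");
    M[N / 8] |= std::uint8_t(1u << (N % 8));
}
```

The obvious drawback, and the reason it is not the samurai's way, is that nothing stops a caller from passing a compile-time constant to the runtime overload and losing the optimization silently.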
Continuing the research, we find that the compiler freely reorders accesses to the set (if optimization is enabled, of course): the order in which different elements are set and cleared is not guaranteed at all and has no connection with the order of the statements in the source code. I did not manage to create a volatile set; maybe I am doing something wrong? As with any local optimization, a call to an external function forces the compiler to commit all pending operations on a global array, but that is too blunt a tool and does not help with local ones. Well, there is probably nothing to be done; we just have to keep this feature in mind and not use sets to pass information between threads over serial interfaces (that is, their software counterparts).
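The volatile difficulty is easy to reproduce: std::bitset's member functions are not volatile-qualified, so they cannot be called on a volatile bitset object, while a plain volatile integer with manual masks works as expected. A minimal sketch:

```cpp
#include <cstdint>

// A volatile "set" the compiler may not reorder or elide accesses to.
volatile std::uint8_t flags = 0;

// By contrast, this does not compile, because std::bitset's members
// (set, reset, operator[]) are not volatile-qualified:
//   volatile std::bitset<8> vb;
//   vb.set(0);   // error: no matching member function

inline void mark_switch(unsigned n) {
    // Explicit read-modify-write; each volatile access really happens.
    flags = std::uint8_t(flags | std::uint8_t(1u << n));
}
```

This is the price of the manual approach: the ordering guarantees come back, but all the convenience of the bitset interface is lost.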
A general conclusion can be drawn: bitset in its current form cannot be recommended for architectures with limited resources, due to excessive costs in both memory and run time. A possible modification that takes into account everything noted in this post is on GitHub, and everyone is welcome to use it. The story of creating that mod will be posted on Habr soon; there were some funny moments.
In conclusion, a small remark: a "head-on" implementation of the data storage, even with the optimized version of the set, requires N / 8 bytes of data memory (128 switches need 16 bytes), and although operations take O(1) time, the constant factor amounts to many (even up to 10 or more) MCU cycles. Therefore, taking the requirements of the problem into account and introducing certain restrictions, we can offer an alternative storage scheme.
If we assume that at most one switch can be closed at any time (we ignore all the others until the currently pressed button opens again), then we can get by with a single byte (provided there are no more than 256 switches), and writing and reading take O(1) processor cycles with a quite modest constant.
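A sketch of this one-byte scheme (names are mine; note that a sentinel code for "nothing pressed" uses up one of the 256 values, so strictly speaking 255 switches fit this way):

```cpp
#include <cstdint>

constexpr std::uint8_t NONE = 0xFF;   // sentinel: no switch is closed
std::uint8_t closed = NONE;           // the entire "set" is this one byte

inline void on_close(std::uint8_t n) {
    if (closed == NONE)               // ignore everything while a key is held
        closed = n;
}

inline void on_open(std::uint8_t n) {
    if (closed == n)                  // only the held key can release the slot
        closed = NONE;
}
```

Both operations are a compare and a store, which is about as cheap as keyboard state handling can get.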
This approach is easy to extend to store information about n simultaneously closed switches, but n should not be made too large: the required memory grows, and the time for the scan operations grows linearly with the number of elements in the set, although it remains O(1) with respect to the total number of switches. This growth could be reduced to O(log2(n)) with a heap-like layout of a binary tree, but for small n this hardly matters, and it is doubtful that the cost of computing the next index during the search would be repaid by the reduction in the number of simple iterations. The disadvantages of this implementation include the possibility of failing to record an element of the set when more than n switches are closed at once.
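Extending the same idea to at most K simultaneously closed switches might look like this (a sketch under my own naming; the return value of the close handler exposes exactly the drawback noted above, a lost event when every slot is busy):

```cpp
#include <cstdint>
#include <cstddef>

constexpr std::size_t K = 4;          // max simultaneously closed switches
constexpr std::uint8_t NONE = 0xFF;   // sentinel: slot is free
std::uint8_t closed[K] = { NONE, NONE, NONE, NONE };

// Linear scan: O(K) regardless of the total number of switches.
bool on_close(std::uint8_t n) {
    for (auto &slot : closed)
        if (slot == NONE) { slot = n; return true; }
    return false;                     // all slots busy: the event is lost
}

void on_open(std::uint8_t n) {
    for (auto &slot : closed)
        if (slot == n) slot = NONE;
}
```

For the small K typical of keyboards (2-key or 4-key rollover), the linear scan is almost certainly cheaper than any cleverer index structure.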
An implementation of this approach will be given there as well.
Poll: optimization of standard libraries

- 28.5% (4 votes): a typical example of premature optimization
- 21.4% (3 votes): when someone talks about premature optimization, they almost certainly do not mean what Knuth had in mind
- 50% (7 votes): but how could you not do it