How to blink 4 LEDs on Cortex-M using C++17, a tuple and a bit of imagination

    Greetings, everyone!

    When I teach university students to develop embedded software for microcontrollers, I use C++, and sometimes I give the most interested students all sorts of extra tasks in order to spot the gifted ones who are especially obsessed.

    This time, such students were given the task of blinking 4 LEDs using C++17 and the standard C++ library, without any additional libraries such as CMSIS and its header files describing register structures, and so on. The winner is the one whose code takes the least ROM and uses the least RAM. Compiler optimization must be no higher than Medium. The compiler is IAR 8.40.1.
    The winner goes to the Canary Islands and gets a 5 for the exam.

    I had not solved this problem myself before, so I will describe both how the students solved it and what I came up with. I warn you right away that such code is unlikely to be usable in real applications, which is why I published it in the “Abnormal Programming” section, although who knows.

    Problem statement


    There are 4 LEDs on pins GPIOA.5, GPIOC.5, GPIOC.8 and GPIOC.9. They need to blink. To have something to compare against, we took this code written in C:

    void delay() {
      for (int i = 0; i < 1000000; ++i){
      }
    }
    int main() { 
       for(;;)   {
         GPIOA->ODR ^= (1 << 5);
         GPIOC->ODR ^= (1 << 5);
         GPIOC->ODR ^= (1 << 8);
         GPIOC->ODR ^= (1 << 9);     
         delay();
       }  
      return 0 ;
    }

    The delay() function here is purely nominal: an ordinary loop that must not be optimized away.
    It is assumed that the ports are already configured as outputs and that their clocks are enabled.
    I will also say right away that bit-banding was not used, to keep the code portable.

    At Medium optimization this code takes 8 bytes of stack and 256 bytes of ROM:
    255 bytes of readonly code memory
    1 byte of readonly data memory
    8 bytes of readwrite data memory

    The 255 bytes include the interrupt vector table, calls to IAR functions that initialize the floating-point unit, various debugging functions, and the __low_level_init function, where the ports themselves are configured.

    So, the full requirements are:

    • The main() function should contain as little code as possible
    • Macros cannot be used
    • The compiler is IAR 8.40.1 with C++17 support
    • CMSIS header files, such as #include "stm32f411xe.h", cannot be used
    • The __forceinline directive may be used for inline functions
    • Compiler optimization: Medium

    The students' solution


    There were several solutions; I will show only one. It is not optimal, but I liked it.

    Since header files cannot be used, the students first made a Gpio class that holds the port registers at their addresses. To do this they used a structure overlay; most likely they took the idea from here: Structure overlay:

    class Gpio {
    public:
      // Toggle the output bit that corresponds to the given pin number
      __forceinline inline void Toggle(const std::uint8_t bitNum) volatile {
        Odr ^= (1U << bitNum) ;
      }
    private:
      volatile std::uint32_t Moder;
      volatile std::uint32_t Otyper;
      volatile std::uint32_t Ospeedr;
      volatile std::uint32_t Pupdr;
      volatile std::uint32_t Idr;
      volatile std::uint32_t Odr;
    } ;
    // Check that the structure is laid out without padding
    // (placed after the class, because sizeof needs a complete type)
    static_assert(sizeof(Gpio) == sizeof(std::uint32_t) * 6);

    As you can see, they defined a Gpio class whose fields are located at the addresses of the corresponding registers, plus a method that toggles a pin by its number.
    Then they defined a GpioPin structure containing a pointer to Gpio and the pin number:

    struct GpioPin
    {
      volatile Gpio* port ;
      std::uint32_t pinNum ;
    } ;
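    The structure overlay idea can be tried on a PC. The following is only a sketch under stated assumptions: GpioRegs, SimulatedToggle and the RAM object fakePort are names made up for this illustration, and an ordinary object in RAM stands in for the real peripheral, so nothing here touches hardware. On the target the pointer would instead come from a reinterpret_cast of the port base address.

```cpp
#include <cstddef>
#include <cstdint>

// Same register layout as the students' Gpio class. The members are public
// here only so that offsetof can be applied in this sketch.
struct GpioRegs {
  volatile std::uint32_t Moder;
  volatile std::uint32_t Otyper;
  volatile std::uint32_t Ospeedr;
  volatile std::uint32_t Pupdr;
  volatile std::uint32_t Idr;
  volatile std::uint32_t Odr;

  // Flip the output bit that corresponds to pin number bitNum.
  void Toggle(std::uint8_t bitNum) volatile {
    Odr = Odr ^ (1U << bitNum);
  }
};

// The overlay only works if there is no padding and ODR sits at offset 20.
static_assert(sizeof(GpioRegs) == 6 * sizeof(std::uint32_t));
static_assert(offsetof(GpioRegs, Odr) == 20);

// Hypothetical stand-in for the peripheral: a zero-initialized object in RAM.
// Returns the simulated ODR value after toggling the requested bit.
inline std::uint32_t SimulatedToggle(std::uint8_t bitNum) {
  static GpioRegs fakePort{};
  volatile GpioRegs* port = &fakePort;
  port->Toggle(bitNum);
  return port->Odr;
}
```

    Toggling the same pin twice brings the simulated register back to its previous value, which is exactly the blinking behaviour of the XOR in the C version.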

    Then they made an array of the LEDs, each sitting on a specific port pin, and iterated over it, calling Toggle() for each LED:

    const GpioPin leds[] = {{reinterpret_cast<volatile Gpio*>(GpioaBaseAddr), 5},
                          {reinterpret_cast<volatile Gpio*>(GpiocBaseAddr), 5},
                          {reinterpret_cast<volatile Gpio*>(GpiocBaseAddr), 8},
                          {reinterpret_cast<volatile Gpio*>(GpiocBaseAddr), 9}
    } ;
    struct LedsDriver {
      __forceinline static inline void ToggleAll()  {
        for (auto& it: leds)    {
          it.port->Toggle(it.pinNum);
        }
      }
    } ;

    And here is the complete code:

    #include <cstdint>

    constexpr std::uint32_t GpioaBaseAddr = 0x4002'0000 ;
    constexpr std::uint32_t GpiocBaseAddr = 0x4002'0800 ;
    class Gpio {
    public:
      // Toggle the output bit that corresponds to the given pin number
      __forceinline inline void Toggle(const std::uint8_t bitNum) volatile {
        Odr ^= (1U << bitNum) ;
      }
    private:
      volatile std::uint32_t Moder;
      volatile std::uint32_t Otyper;
      volatile std::uint32_t Ospeedr;
      volatile std::uint32_t Pupdr;
      volatile std::uint32_t Idr;
      volatile std::uint32_t Odr;
    } ;
    // Check that the structure is laid out without padding
    static_assert(sizeof(Gpio) == sizeof(std::uint32_t) * 6);
    struct GpioPin {
      volatile Gpio* port ;
      std::uint32_t pinNum ;
    } ;
    const GpioPin leds[] = {{reinterpret_cast<volatile Gpio*>(GpioaBaseAddr), 5},
                          {reinterpret_cast<volatile Gpio*>(GpiocBaseAddr), 5},
                          {reinterpret_cast<volatile Gpio*>(GpiocBaseAddr), 8},
                          {reinterpret_cast<volatile Gpio*>(GpiocBaseAddr), 9}
    } ;
    struct LedsDriver {
      __forceinline static inline void ToggleAll()  {
        for (auto& it: leds)    {
          it.port->Toggle(it.pinNum);
        }
      }
    } ;
    // delay() is the same function as in the C version above
    int main() {
       for(;;)   {
         LedsDriver::ToggleAll() ;
         delay();
       }
      return 0 ;
    }


    Statistics of their code on Medium optimization:
    275 bytes of readonly code memory
    1 byte of readonly data memory
    8 bytes of readwrite data memory

    A good solution, but it takes up a lot of memory :)

    My solution


    Of course, I decided not to look for easy ways and to do things seriously :).
    The LEDs are on different ports and different pins. The first thing needed is a Port class, but to get rid of pointers and variables that occupy RAM, its methods have to be static. The port class might look like this:

    template <std::uint32_t addr>
    struct Port {
     // something will appear here soon
    };

    Its template parameter is the port address. In the header "stm32f411xe.h", for example, the address of port A is defined as GPIOA_BASE. But we are forbidden to use headers, so we simply define our own constants. As a result, the class can be used like this:

    constexpr std::uint32_t GpioaBaseAddr = 0x4002'0000 ;
    constexpr std::uint32_t GpiocBaseAddr = 0x4002'0800  ;
    using PortA = Port<GpioaBaseAddr> ;
    using PortC = Port<GpiocBaseAddr> ;
    

    To blink, we need a Toggle(const std::uint8_t bit) method that flips the required bit with an exclusive OR. The method must be static; let's add it to the class:

    template <std::uint32_t addr>
    struct Port {
     // apply the __forceinline directive right away so that the compiler treats this function as inline
      __forceinline inline static void Toggle(const std::uint8_t bitNum)  {
        *reinterpret_cast<volatile std::uint32_t*>(addr + 20) ^= (1 << bitNum) ; // addr + 20 is the address of the ODR register
      }
    };

    Great, we now have Port<>, and it can toggle pins. An LED sits on a specific pin, so it is logical to make a Pin class that takes Port<> and the pin number as template parameters. Since Port<> is a template, i.e. a different type for each port, we can only pass it as a generic type T.

    template <typename T, std::uint8_t pinNum>
    struct Pin {
      __forceinline inline static void Toggle()   {
        T::Toggle(pinNum) ;
      }
    } ;

    The bad part is that we can pass any nonsense type T that happens to have a Toggle() method and it will work, although only Port<> types are supposed to be passed. To protect against this, we make Port<> inherit from a base class PortBase and check in the template that the passed type really derives from PortBase. We get the following:

    constexpr std::uint32_t OdrAddrShift = 20U;
    struct PortBase {
    };
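    template <std::uint32_t addr>
    struct Port: PortBase {
      __forceinline inline static void Toggle(const std::uint8_t bit)  {
        *reinterpret_cast<volatile std::uint32_t*>(addr + OdrAddrShift) ^= (1 << bit) ;
      }
    };
    template <typename T, std::uint8_t pinNum,
              typename = std::enable_if_t<std::is_base_of<PortBase, T>::value>> // here is the guard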
    struct Pin {
      __forceinline inline static void Toggle()  {
        T::Toggle(pinNum) ;
      }
    } ;
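    The effect of such a guard can be checked entirely at compile time. Below is a minimal sketch, not the code from the contest: the Port<> and Pin<> here are reduced to constants so that they compile on a PC, and NotAPort and the IsValidPinPort detection helper are hypothetical names introduced only for this illustration.

```cpp
#include <cstdint>
#include <type_traits>

struct PortBase {};

template <std::uint32_t addr>
struct Port : PortBase {
  // Address of the ODR register (offset 20 from the port base address).
  static constexpr std::uint32_t OdrAddr = addr + 20U;
};

template <typename T, std::uint8_t pinNum,
          typename = std::enable_if_t<std::is_base_of<PortBase, T>::value>>
struct Pin {
  // Bit mask that an XOR toggle would apply to ODR.
  static constexpr std::uint32_t Mask = 1U << pinNum;
};

// Hypothetical type that does not derive from PortBase.
struct NotAPort {};

// Detection helper (an assumption of this sketch): true only when
// Pin<T, 5> names a valid type, i.e. when the guard lets T through.
template <typename T, typename = void>
struct IsValidPinPort : std::false_type {};

template <typename T>
struct IsValidPinPort<T, std::void_t<Pin<T, 5>>> : std::true_type {};

static_assert(IsValidPinPort<Port<0x4002'0800>>::value);
static_assert(!IsValidPinPort<NotAPort>::value);
static_assert(Port<0x4002'0800>::OdrAddr == 0x4002'0814);
```

    Writing Pin<NotAPort, 5> directly would simply fail to compile, which is the protection we are after.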

    Now the template can be instantiated only if the passed class derives from PortBase.
    In principle these classes can already be used; let's see what we get without optimization:

    using PortA = Port<GpioaBaseAddr> ;
    using PortC = Port<GpiocBaseAddr> ;
    using Led1 = Pin<PortA, 5> ;
    using Led2 = Pin<PortC, 5> ;
    using Led3 = Pin<PortC, 8> ;
    using Led4 = Pin<PortC, 9> ;
    int main() { 
       for(;;)   {
         Led1::Toggle();
         Led2::Toggle();
         Led3::Toggle();
         Led4::Toggle();
         delay();
       }  
      return 0 ;
    }

    271 bytes of readonly code memory
    1 byte of readonly data memory
    24 bytes of readwrite data memory

    Where did these extra 16 bytes of RAM and 16 bytes of ROM come from? They come from the fact that we pass the bit parameter to the Toggle(const std::uint8_t bit) function of the Port class: on entry to main the compiler saves on the stack 4 additional registers through which this parameter is passed, then uses these registers to hold the pin number for each Pin, and on exit from main restores them from the stack. In essence this is completely useless work, since the functions are inlined, but the compiler acts in full accordance with the standard.
    You can get rid of this by removing the Port class altogether, passing the port address as a template parameter of the Pin class, and computing the address of the ODR register inside Toggle():

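    constexpr std::uint32_t OdrAddrShift = 20U;
    template <std::uint32_t addr, std::uint8_t pinNum>
    struct Pin {
      __forceinline inline static void Toggle()  {
        *reinterpret_cast<volatile std::uint32_t*>(addr + OdrAddrShift) ^= (1 << pinNum) ;
      }
    } ;
    using Led1 = Pin<GpioaBaseAddr, 5> ;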

    But this does not look very nice or user-friendly. So let's hope that a little optimization makes the compiler remove this unnecessary register saving.

    We set optimization to Medium and look at the result:
    251 bytes of readonly code memory
    1 byte of readonly data memory
    8 bytes of readwrite data memory

    Wow ... that is 4 bytes less than the C code:
    255 bytes of readonly code memory
    1 byte of readonly data memory
    8 bytes of readwrite data memory


    How can this be? Let's look at the assembler in the debugger for the C++ code (left) and the C code (right):

    [image: disassembly of the C++ code (left) and the C code (right)]

    It is clear that, first, the compiler inlined all the functions, so there are no calls at all, and second, it optimized register usage. In the C code the compiler uses either R1 or R2 to hold the port address and performs an extra operation on every bit toggle (loading the address into R1 or R2). In the C++ code it uses only R1, and since the last 3 toggles all target port C, there is no need to load the same port C address into the register again. As a result, 2 instructions and 4 bytes are saved.

    Such are the wonders of modern compilers :) Well, okay. In principle one could stop here, but let's go further. I don't think anything else can be squeezed out of the size, although I may well be wrong; if you have ideas, write in the comments. But the amount of code in main() can still be reduced.

    Now I want all the LEDs to live in some container, with a single method that toggles them all. Something like this:

    int main() { 
       for(;;)   {
         LedsContainer::ToggleAll() ;
         delay();
       }  
      return 0 ;
    }

    We will not simply paste the toggling of 4 LEDs into LedsContainer::ToggleAll, because that is not interesting :). We want to put the LEDs into a container and then walk through them, calling Toggle() on each.

    The students used an array to store pointers to the LEDs. But my LEDs have different types, for example Pin<PortA, 5> and Pin<PortC, 5>, and pointers to different types cannot be stored in one array. I could make a virtual base class for all the Pin types, but then a virtual function table would appear and I would not manage to beat the students. Therefore we will use a tuple, which can store objects of different types. It looks like this:



    class LedsContainer {
     private:
       constexpr static auto records = std::make_tuple (
                                                       Pin<PortA, 5>{},
                                                       Pin<PortC, 5>{},
                                                       Pin<PortC, 8>{},
                                                       Pin<PortC, 9>{}
        ) ;
      using tRecordsTuple = decltype(records) ;
    } ;
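    Why a tuple and not an array can be shown with a few compile-time checks. This is a simplified sketch: the Port<> and Pin<> below are empty stand-ins without the Toggle() logic, and LedCount is a name introduced only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <type_traits>

// Simplified, empty stand-ins for the Port<> and Pin<> templates.
template <std::uint32_t addr>
struct Port {};

template <typename T, std::uint8_t pinNum>
struct Pin {};

using PortA = Port<0x4002'0000>;
using PortC = Port<0x4002'0800>;

// Every LED is a type of its own, so one array cannot hold them all.
static_assert(!std::is_same_v<Pin<PortA, 5>, Pin<PortC, 5>>);

// A tuple can: each element keeps its exact static type.
constexpr auto records = std::make_tuple(Pin<PortA, 5>{}, Pin<PortC, 5>{},
                                         Pin<PortC, 8>{}, Pin<PortC, 9>{});
using tRecordsTuple = decltype(records);

// Number of LEDs, known at compile time.
constexpr std::size_t LedCount = std::tuple_size_v<tRecordsTuple>;
static_assert(LedCount == 4);

// Element 2 is exactly the GPIOC.8 pin (const because records is constexpr).
static_assert(std::is_same_v<std::tuple_element_t<2, tRecordsTuple>,
                             const Pin<PortC, 8>>);

// The pins have no data members: all the information lives in the type.
static_assert(std::is_empty_v<Pin<PortC, 9>>);
```

    Since the elements are empty objects and only their types matter, there is nothing for the compiler to keep in RAM.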

    Great, we have a container that stores all the LEDs. Now let's add a ToggleAll() method to it:

    class LedsContainer {
     public:
       __forceinline static inline void ToggleAll()   {
            // in a moment we will figure out how to iterate over all tuple elements here
       }
     private:
       constexpr static auto records = std::make_tuple (
                                                       Pin<PortA, 5>{},
                                                       Pin<PortC, 5>{},
                                                       Pin<PortC, 8>{},
                                                       Pin<PortC, 9>{}
        ) ;
      using tRecordsTuple = decltype(records) ;
    } ;

    You cannot simply iterate over the elements of a tuple, because the index of a tuple element must be known at compile time. Tuple elements are accessed with the std::get template. That is, if we write std::get<0>(records).Toggle(), then Toggle() is called for the object of class Pin<PortA, 5>; if std::get<1>(records).Toggle(), then Toggle() is called for the object of class Pin<PortC, 5>, and so on. You could wipe the students' noses by simply writing this:



     __forceinline static inline void ToggleAll()   {
       std::get<0>(records).Toggle();
       std::get<1>(records).Toggle();
       std::get<2>(records).Toggle();
       std::get<3>(records).Toggle();
       }    

    But we do not want to burden the programmer who will maintain this code and make him do extra work at his company's expense when, say, another LED appears. The code would have to be changed in two places, in the tuple and in this method, which is not good, and the company owner would not be pleased. Therefore we traverse the tuple using helper methods:

    class LedsContainer {
      friend int main() ;
      public:
       __forceinline static inline void ToggleAll()     {
         // create the index sequence 0,1,2,3 and pass it to the helper method
          visit(std::make_index_sequence<std::tuple_size<tRecordsTuple>::value>());
        }
      private:
        __forceinline  template<std::size_t... index>
        static inline void visit(std::index_sequence<index...>)   {
          // expands into one get<N>(records).Toggle() call per index; the evaluation order of
          // function arguments is unspecified, and here the calls ran from get<3> down to get<0>
          Pass((std::get<index>(records).Toggle(), true)...);
        }
        __forceinline template<typename... Args>
        static void inline Pass(Args... )  { // helper method for expanding the parameter pack
        }
       constexpr static auto records = std::make_tuple (
                                                       Pin<PortA, 5>{},
                                                       Pin<PortC, 5>{},
                                                       Pin<PortC, 8>{},
                                                       Pin<PortC, 9>{}
        ) ;
      using tRecordsTuple = decltype(records) ;
    } ;

    It looks scary, but I warned at the beginning of the article that this crazy method is far from ordinary ...

    At compile time, all this magic does literally the following:

    // This call
    LedsContainer::ToggleAll() ;
    // is transformed into these 4 calls:
    Pin<PortC, 9>().Toggle() ;
    Pin<PortC, 8>().Toggle() ;
    Pin<PortC, 5>().Toggle() ;
    Pin<PortA, 5>().Toggle() ;
    // And since Toggle() is inline, into this:
     *reinterpret_cast<volatile std::uint32_t*>(0x40020814 ) ^= (1 << 9) ;
     *reinterpret_cast<volatile std::uint32_t*>(0x40020814 ) ^= (1 << 8) ;
     *reinterpret_cast<volatile std::uint32_t*>(0x40020814 ) ^= (1 << 5) ;
     *reinterpret_cast<volatile std::uint32_t*>(0x40020014 ) ^= (1 << 5) ;
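    The index-sequence expansion can also be tried on a PC. The sketch below is an assumption-laden variant, not the contest code: MockPin and toggleLog are hypothetical names, the mock pins append their number to a log instead of writing to a GPIO register, and the helper function is replaced by a C++17 fold expression over the comma operator, which (unlike function arguments) guarantees that the calls run left to right. The pin numbers are made distinct on purpose so that the order is visible.

```cpp
#include <cstddef>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical host-side log that replaces the hardware register write.
inline std::vector<int> toggleLog;

// Mock pin: records its number instead of touching a GPIO register.
template <int pinNum>
struct MockPin {
  static void Toggle() { toggleLog.push_back(pinNum); }
};

class LedsContainer {
 public:
  static void ToggleAll() {
    // Index sequence 0, 1, 2, 3: one index per tuple element.
    visit(std::make_index_sequence<std::tuple_size_v<tRecordsTuple>>{});
  }

 private:
  template <std::size_t... index>
  static void visit(std::index_sequence<index...>) {
    // Fold over the comma operator: Toggle() calls run strictly left to right.
    (std::get<index>(records).Toggle(), ...);
  }

  static constexpr auto records =
      std::make_tuple(MockPin<5>{}, MockPin<6>{}, MockPin<8>{}, MockPin<9>{});
  using tRecordsTuple = decltype(records);
};
```

    After one call to LedsContainer::ToggleAll() the log holds the pin numbers in tuple order. Adding another LED means adding one more element to the tuple and nothing else.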

    Let's compile it and check the code size without optimization:

    The code that was compiled
    #include <cstddef>
    #include <cstdint>
    #include <tuple>
    #include <type_traits>
    #include <utility>
    //#include "stm32f411xe.h"
    #define __forceinline  _Pragma("inline=forced")
    constexpr std::uint32_t GpioaBaseAddr = 0x4002'0000 ;
    constexpr std::uint32_t GpiocBaseAddr = 0x4002'0800 ;
    constexpr std::uint32_t OdrAddrShift = 20U;
    struct PortBase
    {
    };
    template <std::uint32_t addr>
    struct Port: PortBase
    {
      __forceinline inline static void Toggle(const std::uint8_t bit)
      {
        *reinterpret_cast<volatile std::uint32_t*>(addr + OdrAddrShift) ^= (1 << bit) ;
      }
    };
    template <typename T, std::uint8_t pinNum,
              typename = std::enable_if_t<std::is_base_of<PortBase, T>::value>>
    struct Pin
    {
      __forceinline inline static void Toggle()
      {
        T::Toggle(pinNum) ;
      }
    } ;
    using PortA = Port<GpioaBaseAddr> ;
    using PortC = Port<GpiocBaseAddr> ;
    //using Led1 = Pin<PortA, 5> ;
    //using Led2 = Pin<PortC, 5> ;
    //using Led3 = Pin<PortC, 8> ;
    //using Led4 = Pin<PortC, 9> ;
    class LedsContainer {
      friend int main() ;
      public:
          __forceinline static inline void ToggleAll()     {
         // create the index sequence 0,1,2,3 and pass it to the helper method
          visit(std::make_index_sequence<std::tuple_size<tRecordsTuple>::value>());
        }
      private:
        __forceinline  template<std::size_t... index>
        static inline void visit(std::index_sequence<index...>)         {
          Pass((std::get<index>(records).Toggle(), true)...);
        }
        __forceinline template<typename... Args>
        static void inline Pass(Args... )     {
        }
        constexpr static auto records = std::make_tuple (
                                                         Pin<PortA, 5>{},
                                                         Pin<PortC, 5>{},
                                                         Pin<PortC, 8>{},
                                                         Pin<PortC, 9>{}
          ) ;
        using tRecordsTuple = decltype(records) ;
    } ;
    void delay() {
      for (int i = 0; i < 1000000; ++i){
      }
    }
    int main() { 
       for(;;)   {
         LedsContainer::ToggleAll() ;
         //GPIOA->ODR ^= 1 << 5;
         //GPIOC->ODR ^= 1 << 5;
         //GPIOC->ODR ^= 1 << 8;
         //GPIOC->ODR ^= 1 << 9;     
         delay();
       }  
      return 0 ;
    }


    Assembler proof, unpacked as planned:
    [image: disassembly of the unoptimized code]

    We see that memory usage went up: 28 more bytes of ROM than the C version (283 versus 255). The problem is the same as before, plus another 12 bytes. I did not figure out where they came from ... maybe someone can explain.
    283 bytes of readonly code memory
    1 byte of readonly data memory
    24 bytes of readwrite data memory

    Now the same thing at Medium optimization, and lo and behold ... we get code identical to the straightforward C++ implementation and more compact than the C code.
    251 bytes of readonly code memory
    1 byte of readonly data memory
    8 bytes of readwrite data memory

    Assembler
    [image: disassembly of the code at Medium optimization]

    As you can see, I won, went to the Canary Islands and am now happily relaxing in Chelyabinsk :), but the students did great too and passed the exam successfully!

    For those interested, the code is here

    Where can this be used? Here is one example I came up with: we have parameters stored in EEPROM and a class describing such a parameter (read, write, reset to the initial value). The class is a template, something like Param<...>, and you need, for example, to reset all parameters to their default values. This is where you can put them all into a tuple, since their types differ, and call SetToDefault() on each parameter. True, with 100 such parameters this will eat a lot of ROM, but RAM will not suffer.
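    Roughly, the parameter idea could look like the sketch below. Everything in it is hypothetical: the shape of Param, its Id and T template parameters, the Set/Get methods and the params tuple are made up for illustration, and the values are kept in ordinary RAM members so that the sketch runs on a PC, whereas a real version would read and write EEPROM.

```cpp
#include <tuple>

// Hypothetical parameter class: Id makes every parameter a distinct type,
// T is the type of the stored value.
template <int Id, typename T>
class Param {
 public:
  constexpr explicit Param(T defaultValue)
      : value_(defaultValue), default_(defaultValue) {}

  void Set(T v) { value_ = v; }
  T Get() const { return value_; }

  // A real implementation would write the default value to EEPROM here.
  void SetToDefault() { value_ = default_; }

 private:
  T value_;
  T default_;
};

// Parameters of different types live together in one tuple.
inline auto params = std::make_tuple(Param<0, int>{10},
                                     Param<1, float>{2.5f},
                                     Param<2, bool>{true});

// Reset every parameter, whatever its type, with a single call.
inline void ResetAll() {
  std::apply([](auto&... p) { (p.SetToDefault(), ...); }, params);
}
```

    Adding a parameter then means adding one element to the tuple; ResetAll() does not change.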

    P.S. I must admit that at maximum optimization the code comes out the same size both in C and in my solution. All the programmer's efforts to improve the code come down to the same assembler code.

    P.S.1 Thanks to 0xd34df00d for the good advice. Unpacking the tuple can be simplified with std::apply(). The code of ToggleAll() then reduces to this:

     __forceinline static inline void ToggleAll() 
        {
          std::apply([](auto... args) { (args.Toggle(), ...); }, records);
        }   

    Unfortunately, std::apply is not yet implemented in the current version of IAR, but the approach works all the same; see the implementation with std::apply
