Application Optimization (iPhone, ARMv6)

        Recently it has been a year since our first application appeared on the App Store. At first it was pretty hard to figure things out, especially considering that I had not been involved in Mac OS development before. A lot has been written over this year. Unfortunately, I can't name the applications we wrote (I don't remember them all, and management does not approve of such things), but I can safely tell you about several ways to optimize applications for this platform.
        About half a year ago (or even more) I had to write an application whose main task was sound processing. A simple engine was written for this, and after the application shipped, the engine gradually came to be used in other applications of the same kind. Recently, development of version 2 of this program began. The requirements have grown, but the resources of old iPhones have not, so I had to look for ways to improve the already-written code.


    Compiler Settings (Thumb)



        The first thing that comes to mind is to squeeze everything possible out of the compiler. Perhaps the most important setting here is whether the application is compiled in Thumb mode. When this mode is enabled, a reduced instruction set is used: the instructions are encoded more compactly, but not all of the processor's resources are available. In particular, the VFP (floating-point unit) cannot be used directly. In places where we perform operations on floating-point numbers, you can find code like this:

    double prevTime = CFAbsoluteTimeGetCurrent();
    {
      ...
    }
    double nextTime = CFAbsoluteTimeGetCurrent();
    double dt = nextTime - prevTime;
    printf("dt=%f", dt);



        After compilation in Thumb mode, it will look something like this:

    blx  L_CFAbsoluteTimeGetCurrent$stub   @ returns a double in r0:r1
    mov  r5, r1                            @ save part of the first result
    blx  L_CFAbsoluteTimeGetCurrent$stub
    mov  r3, r5                            @ set up arguments for the
    mov  r2, r4                            @ subtraction helper
    blx  L___subdf3vfp$stub                @ subtraction via a library call
    ldr  r6, L7                            @ address of the format string
    mov  r2, r1                            @ shuffle the result into
    mov  r1, r0                            @ printf's argument registers
    mov  r0, r6
    blx  L_printf$stub



        With Thumb mode disabled, the code would be:

    bl  L_CFAbsoluteTimeGetCurrent$stub
    fmdrr  d8, r0, r1        @ move the returned double into a VFP register
    bl  L_CFAbsoluteTimeGetCurrent$stub
    fmdrr  d6, r0, r1
    ldr  r0, L7              @ format string for printf
    fsubd  d7, d6, d8        @ the subtraction happens right on the VFP
    fmrrd  r1, r2, d7        @ move the result back to core registers
    bl  L_printf$stub



        As you can see, the difference is quite significant: there is no extra function call, and the floating-point operations happen on the spot instead of inside a library helper somewhere far away. You may ask: is this actually faster? Naturally, yes. Although if your program does not do heavy calculations, Thumb mode will do just fine, and as a plus its more compact code means the program should, in theory, load faster.
        By the way, Xcode lets you set build options per file, so Thumb mode can be turned off (or, conversely, turned on) for individual parts of the project only, which is quite convenient.
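        For example, with the GCC toolchain the relevant options are -mthumb and -mno-thumb; in Xcode they can be entered as additional compiler flags on an individual file (exactly where that field lives depends on the Xcode version):

    -mno-thumb   # compile this file as full ARM code (direct VFP access)
    -mthumb      # compile this file as compact Thumb code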

    Algorithm Optimization


        The next step in speeding up the calculations is to throw out as many floating-point operations as possible and switch to fixed-point arithmetic: convert the numbers to integers multiplied by a certain scale factor. Naturally, the factor should be a power of 2, so that getting the final data back is just a shift.
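        A minimal sketch of the idea (the helper names are mine; the 10-bit scale matches the PARAMS_SHL_VAL shift that appears in the code and disassembly later):

    #include <stdint.h>

    #define PARAMS_SHL_VAL 10   /* scale factor 2^10 = 1024 */

    /* float -> fixed point: multiply by the scale and round to nearest */
    static inline int32_t toFixed(float x)
    {
      return (int32_t)(x * (1 << PARAMS_SHL_VAL) + (x >= 0 ? 0.5f : -0.5f));
    }

    /* fixed point -> integer: dividing by the scale is just a shift */
    static inline int32_t fromFixed(int32_t x)
    {
      return x >> PARAMS_SHL_VAL;
    }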
        So, we have forced the compiler to use all the processor's resources where it matters and gotten rid of floating-point operations where possible. Now let's look at the ARMv6 specification (for example, in the ARM Architecture Reference Manual). If you read the instruction descriptions carefully, you will find a lot of interesting commands there (many of them are also unavailable in Thumb mode).
        For example, suppose you need a simple low-pass or high-pass filter. The algorithm ultimately comes down to calculating this formula:
        tmp = b0*in0 + b1*in1 + b2*in2 - a1*out1 - a2*out2;


    (b0, b1, b2, a1, a2 are constants for a given cutoff frequency)
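
        For reference, one step of such a filter in plain floating point might look like this (a sketch; the struct and function names are mine, not from the original engine):

    /* one step of the filter; the state holds the previous two
       input and output samples */
    typedef struct { float in1, in2, out1, out2; } FilterState;

    static float filterStep(FilterState *st, float in0,
                            float b0, float b1, float b2,
                            float a1, float a2)
    {
      float out0 = b0*in0 + b1*st->in1 + b2*st->in2
                 - a1*st->out1 - a2*st->out2;
      /* shift the history for the next sample */
      st->in2 = st->in1;   st->in1 = in0;
      st->out2 = st->out1; st->out1 = out0;
      return out0;
    }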

        Now look at the description of the smlad command. It multiplies two pairs of 16-bit numbers (the lower and upper halfwords of its two operands), then sums both products together with the register you specify. The formula looks like this (bit ranges are in square brackets):

        result[0:31] = a[0:15]*b[0:15] + a[16:31]*b[16:31] + c[0:31]



        That is, our formula itself can be computed in 3 operations. It only remains to figure out how to use this instruction. Fortunately, I have had plenty of experience with assembler since the DOS days, and GCC's inline assembly works wonderfully. So let's write a function that uses this command:

    static inline int32_t SignedMultiplyAccDual(int32_t x, int32_t y, int32_t addVal)
    {
      register int32_t result;
      /* result = x[0:15]*y[0:15] + x[16:31]*y[16:31] + addVal */
      asm volatile("smlad %0, %1, %2, %3"
             : "=r"(result)
             : "r"(x), "r"(y), "r"(addVal)
             );
      return result;
    }



        By the way, for convenience you can make a plain C version of the function for the simulator; otherwise testing there would be inconvenient. Mine came out like this:

    #if defined __arm__
    static inline int32_t SignedMultiplyAccDual(int32_t x, int32_t y, int32_t addVal)
    {
      register int32_t result;
      /* smlad: dual signed 16x16 multiply-accumulate */
      asm volatile("smlad %0, %1, %2, %3"
             : "=r"(result)
             : "r"(x), "r"(y), "r"(addVal)
             );
      return result;
    }

    static inline int32_t SignedMultiplyAcc(int32_t x, int32_t y, int32_t addVal)
    {
      register int32_t result;
      /* mla: 32x32 multiply-accumulate, result = x*y + addVal */
      asm volatile("mla %0, %1, %2, %3"
             : "=r"(result)
             : "r"(x), "r"(y), "r"(addVal)
             );
      return result;
    }

    #else

    static inline int32_t SignedMultiplyAcc(int32_t x, int32_t y, int32_t addVal)
    {
      return x*y + addVal;
    }

    static inline int32_t SignedMultiplyAccDual(int32_t x, int32_t y, int32_t addVal)
    {
      /* emulate smlad: multiply the low halfwords, then the high
         halfwords, and add the accumulator */
      int32_t result = (int16_t)(x & 0xFFFF) * (int16_t)(y & 0xFFFF);
      result += (int16_t)(x >> 16) * (int16_t)(y >> 16);
      result += addVal;
      return result;
    }
    #endif



        As a result, the calculation of our formula looks like this:
      tmp = fParamsHigh[0]*fValsHigh[0];        /* the first product */
      tmp = SignedMultiplyAccDual(*(int32_t *)&fParamsHigh[1], *(int32_t *)&fValsHigh[1], tmp); /* + two products */
      tmp = SignedMultiplyAccDual(*(int32_t *)&fParamsHigh[3], *(int32_t *)&fValsHigh[3], tmp); /* + two more */
      tmp = tmp >> PARAMS_SHL_VAL;              /* back from fixed point */

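        For this to work, the coefficients and the filter state must sit in adjacent 16-bit slots, so that a pair can be loaded as a single 32-bit word; something like the declarations below (my reconstruction, not the original code; note that the minus signs on a1 and a2 would have to be folded into the stored coefficients). ARMv6 permits the unaligned 32-bit loads this produces, as the ldr from offset 202 in the disassembly below confirms.

    /* 16-bit fixed-point coefficients: { b0, b1, b2, -a1, -a2 } */
    int16_t fParamsHigh[5];
    /* matching filter state: { in0, in1, in2, out1, out2 } */
    int16_t fValsHigh[5];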


        Let's look at the disassembly:
    ldrh  r3, [r4, #196]      @ one 16-bit value
    ldrh  r0, [r4, #206]      @ one 16-bit value
    ldr  r2, [r4, #208]       @ a pair of 16-bit values as one word
    smulbb  r3, r3, r0        @ the first product
    smlad r3, r1, r2, r3      @ two more products, accumulated
    ldr  r1, [r4, #202]       @ next pair
    ldr  r2, [r4, #212]       @ next pair
    smlad r3, r1, r2, r3
    mov  r3, r3, asr #10      @ the final shift (PARAMS_SHL_VAL = 10)



        Everything is beautiful and clear. As a friend of mine put it: "loaded it. crunched it. stored it. spat it out." What came before is better left unseen; it was just awful. In my program there were 2 channels, each with a delay effect, and each effect needed 2 filters (one low-pass, one high-pass), so 4 filters in total. After the optimization, the processor load shown in Instruments dropped from ~45% to ~35% of processor time. Not a bad result at all :)
        By the way, while reading the documentation I was surprised to find that there is no integer division instruction, so division goes through a library routine and is worth avoiding in inner loops. After slightly reworking the linear interpolation algorithm (used for resampling on all active channels) to avoid division, the load dropped to ~30% :)
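        The trick is the usual one: keep the resampling position in a fixed-point phase accumulator, so the per-sample division disappears entirely; the one remaining division computes the step once, outside the loop. A sketch under those assumptions, not the original code (the caller must supply at least (outLen*step >> 16) + 2 input samples):

    #include <stdint.h>

    /* linear-interpolation resampler with a 16.16 fixed-point phase;
       step = ((uint32_t)inRate << 16) / outRate, computed once */
    static void resampleLinear(const int16_t *in, int16_t *out,
                               int outLen, uint32_t step)
    {
      uint32_t phase = 0;                      /* 16.16 position in the input */
      for (int i = 0; i < outLen; i++) {
        uint32_t idx  = phase >> 16;           /* integer sample index */
        int32_t  frac = (phase & 0xFFFF) >> 1; /* 15-bit fraction, so the
                                                  product fits in 32 bits */
        /* out = in[idx] + (in[idx+1] - in[idx]) * frac / 32768,
           with the division replaced by a shift */
        out[i] = (int16_t)(in[idx] + (((in[idx + 1] - in[idx]) * frac) >> 15));
        phase += step;                         /* no division per sample */
      }
    }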
      This is how a couple of simple and fairly obvious optimizations reduced the processor load by about a third.
    P.S. Everything was tested on an iPhone 3G.
