# Optimization of applications for Android x86: proven methods

Any Android application, even one written entirely in scripting or managed languages (such as Java or HTML5), ultimately relies on the basic components of the runtime, which must be optimized. Applications that use multimedia and augmented reality technologies, described below, are good examples of both the need for optimization and the approaches to it. For the Android platform (smartphones and tablets), Intel ships various Atom processors with SSSE3-level vectorization and usually 2 cores with Hyper-Threading; consider this a hint :) For those who got the hint, below is the story of the optimization and parallelization of one specific application from the Israeli company iOnRoad.

#### Formulation of the problem

iOnRoad is an augmented reality smartphone app that helps you drive safely. Using the GPS, sensors and video camera of the smartphone, along with modern computer vision algorithms, the application warns the driver about lane departure and about possible collisions with other cars and obstacles. The application is extremely popular (more than a million downloads!) and has won various awards, and it had only two drawbacks:

1. It does not warn about drunk drivers in nearby cars, nor about beautiful girls and other interesting fellow travelers hitchhiking along your route.

2. It required optimization, including of power consumption: the original version could not be used for more than 30-40 minutes without connecting the smartphone to power, during which time the battery drained completely.

At the moment, only one drawback remains. The first.

So, working in real time, the application converts each source frame coming from the smartphone's camera in YUV420/NV21 format to the RGB format before further processing.

Initially, the function implementing this conversion consumed up to 40% of the processor resources, limiting the headroom for further image processing. The need for optimization was therefore obvious.

The only existing optimized function we found was YUV420ToRGB from IPP (the Intel Integrated Performance Primitives library), but it does not support the combination of input and output formats required by iOnRoad. In addition, it is not multithreaded.

Therefore, it was decided to write a new optimized code that implements the necessary transformation.

#### Transformation from YUV420 / NV21 to RGB

The YUV420/NV21 format contains three 8-bit components: luminance Y (black and white) and two chrominance components, U and V. To produce four pixels in the standard RGB format (with its three color components per pixel), each group of four Y components needs only one pair of corresponding U and V components.

In the picture above, the corresponding quadruples of Y values and the (U, V) pairs serving them are marked with the same color. This format (commonly called YUV) provides 2x compression compared to RGB.

#### Transformation of YUV to RGB - an integer approach using tables (look-up table, LUT)

The transformation of YUV to RGB is carried out according to a simple linear formula. To avoid floating-point arithmetic, iOnRoad used a well-known integer approximation of it. Intermediate results of calculations with this formula exceed 2^16, which is an important point for the discussion of vectorization below.
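The approximation itself appeared as an image in the original article. A commonly used variant of such an integer approximation (BT.601-style coefficients scaled by 2^8, shown here purely as an illustration, since iOnRoad's exact constants are not reproduced in the text) looks like this in C:

```c
#include <stdint.h>

// Clamp an intermediate result to the valid 8-bit range.
static uint8_t clamp8(int v) { return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v); }

// A widely used integer approximation of the YUV -> RGB formula
// (BT.601 coefficients scaled by 2^8). The constants iOnRoad actually
// used may differ; this illustrates the approach.
static void yuv_to_rgb(uint8_t y, uint8_t u, uint8_t v,
                       uint8_t *r, uint8_t *g, uint8_t *b)
{
    int c = 298 * ((int)y - 16);  // luma term
    int d = (int)u - 128;         // blue-difference chroma
    int e = (int)v - 128;         // red-difference chroma
    // The intermediate sums below can exceed 2^16 before the shift.
    *r = clamp8((c + 409 * e + 128) >> 8);
    *g = clamp8((c - 100 * d - 208 * e + 128) >> 8);
    *b = clamp8((c + 516 * d + 128) >> 8);
}
```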

For scalar computation, iOnRoad used look-up tables (LUT): since the Y, U and V components are all 8-bit, the multiplications in the above formulas can be computed in advance and the 32-bit results stored in five tables with 256 entries each.
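Assuming the common BT.601-style constants used above for illustration (iOnRoad's actual coefficients may differ), the five tables might be built as follows; each multiplication in the formula then becomes a single indexed load:

```c
#include <stdint.h>

// Five 256-entry tables, one per multiplication in the integer formula.
// Coefficients are the common BT.601-style constants, used here only
// for illustration.
static int32_t lut_y[256];   // 298 * (Y - 16)
static int32_t lut_rv[256];  // 409 * (V - 128)
static int32_t lut_gu[256];  // 100 * (U - 128)
static int32_t lut_gv[256];  // 208 * (V - 128)
static int32_t lut_bu[256];  // 516 * (U - 128)

static void init_luts(void)
{
    for (int i = 0; i < 256; ++i) {
        lut_y[i]  = 298 * (i - 16);
        lut_rv[i] = 409 * (i - 128);
        lut_gu[i] = 100 * (i - 128);
        lut_gv[i] = 208 * (i - 128);
        lut_bu[i] = 516 * (i - 128);
    }
}

// R for one pixel then costs two table reads, an add and a shift:
//   r = clamp((lut_y[y] + lut_rv[v] + 128) >> 8)
```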

#### Converting YUV to RGB - A General Idea for Using SSE Fixed-Point Computing

SSE has no vector gather instructions for working with LUTs, and vector multiplication of packed 16-bit numbers promised to be faster than a combination of scalar LUT reads and subsequent packing. However, plain 16-bit SSE multiplication (PMULLW) cannot be used, since the intermediate results can exceed 2^16. SSE does have the PMULHRSW instruction, which combines a full 16-bit multiplication with a rounding right shift of the 32-bit intermediate result down to the required 16 bits. To use this instruction, the operands must first be shifted left to provide the maximum number of significant bits in the final result (in our particular case, a 13-bit final result).

#### Built-in functions (intrinsics) as a means of writing manual SSE code

To avoid writing SSE code by hand in assembler, all the major C/C++ compilers (Microsoft, GNU, Intel) provide a set of special APIs called intrinsic functions. From the programmer's point of view, an intrinsic looks and behaves like a regular C/C++ function. In fact, it is a wrapper for a single SSE assembly instruction, and in most cases it compiles to just that instruction. Using intrinsics replaces writing assembler code, with all its complexity, at no cost in performance.

For example, to invoke the PMULHRSW instruction mentioned above, the C code uses the intrinsic _mm_mulhrs_epi16().

Each SSE instruction has a corresponding intrinsic, so the necessary SSE code can be written entirely in terms of intrinsics.
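To make the semantics concrete, here is a scalar model of what PMULHRSW (and thus _mm_mulhrs_epi16()) computes per 16-bit lane: a full 32-bit product, shifted right by 14, rounded, then shifted right once more, i.e. a rounded multiply by b/2^15:

```c
#include <stdint.h>

// Scalar model of one 16-bit lane of PMULHRSW (_mm_mulhrs_epi16):
//   tmp    = (a * b) >> 14;
//   result = (tmp + 1) >> 1;
// which is round(a * b / 2^15).
static int16_t mulhrs(int16_t a, int16_t b)
{
    int32_t tmp = ((int32_t)a * (int32_t)b) >> 14;
    return (int16_t)((tmp + 1) >> 1);
}
```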

#### YUV to RGB Transformation - SSE Fixed Point Computing Implementation

The process starts by loading two batches of 16 8-bit Y values and 8 8-bit (U, V) pairs. Eventually this data will be converted into 16 32-bit RGB pixels (in FRGB form, with the high byte set to 0xFF).

The number 16 is subtracted from the 16 8-bit Y values using 8-bit subtraction with saturation, so there is no need to check the result for negative values.
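The scalar effect of such a saturating subtract (in SSE this is the PSUBUSB instruction, exposed as the _mm_subs_epu8() intrinsic) can be modeled per byte like this; values that would go below zero stick at zero, which is exactly why no sign check is needed:

```c
#include <stdint.h>

// Scalar model of one byte lane of PSUBUSB (_mm_subs_epu8):
// unsigned subtraction that saturates at 0 instead of wrapping around.
static uint8_t subs_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}
```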

Each batch of 8 (U, V) pairs "serves" 2 rows of 16 Y values.

To unpack the input data, a shuffle operation is used, which produces two batches of:

- 2 sets of 8 16-bit Y values;
- 1 set of 4 16-bit doubled U values;
- 1 set of 4 16-bit doubled V values.

Below is a detailed diagram of how one batch is produced.

Before U and V are used, 128 is subtracted from them using the 16-bit _mm_sub_epi16() instruction.

After the subtraction, all 8 16-bit Y, U and V values are shifted left to fit optimally for _mm_mulhrs_epi16(); this instruction is used with correspondingly packed coefficients.

*Note: these preparatory steps (the subtractions and shifts above) take the place of the LUT operations in the scalar algorithm.*
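For a feel of the scaling involved, here is a scalar sketch of the luma path under one plausible choice of shift and coefficient (both are assumptions for illustration; the article does not give iOnRoad's actual constants):

```c
#include <stdint.h>

// Scalar model of one lane of _mm_mulhrs_epi16(): round(a * b / 2^15).
static int16_t mulhrs(int16_t a, int16_t b)
{
    return (int16_t)(((((int32_t)a * b) >> 14) + 1) >> 1);
}

// Illustrative luma path (assumed constants). The zero-based Y is
// shifted left by 6 bits (max 219 << 6 = 14016, still within int16),
// and the coefficient 1.164 is encoded as round(1.164 * 2^14) = 19071.
// mulhrs() then yields the luma term scaled by 32, i.e. a value with
// 13 significant bits, matching the 13-bit result mentioned above.
static int16_t luma_term(uint8_t y)
{
    int16_t y_shifted = (int16_t)(((int)y - 16) << 6);
    return mulhrs(y_shifted, 19071);
}
```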

The multiplication results are summed to obtain the final 16-bit values, clamped between 0 and 2^13 - 1 (8191) using _mm_min_epi16() and _mm_max_epi16().

After completing all the calculations, we get the result in the form of separately packed 16-bit values of R, G and B.

Repacking them into the FRGB format (where F is the alpha channel filled with ones, as iOnRoad requires) is done in two steps.

In the first step, the separate 16-bit R, G and B values are repacked into 16-bit FR and GB using an auxiliary register filled with 16-bit 0xFF00 values. This phase is performed with logical left and right shifts and bitwise OR/AND operations, as shown in the figure:

In the second step, the intermediate FR and GB results are finally packed into FRGB using the unpack instructions _mm_unpacklo_epi16() and _mm_unpackhi_epi16():
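Per pixel, the two packing steps amount to the following scalar model (the SSE code of course handles 8 values at a time with the shift/mask and unpack instructions named above):

```c
#include <stdint.h>

// Scalar model of the two-step FRGB packing. r, g, b are the final
// 8-bit component values; the result is one 32-bit FRGB pixel with the
// alpha byte F forced to 0xFF.
static uint32_t pack_frgb(uint8_t r, uint8_t g, uint8_t b)
{
    // Step 1: build the 16-bit intermediates FR = 0xFFrr and GB = 0xggbb
    // (the SSE code does this with shifts and OR/AND against 0xFF00).
    uint16_t fr = (uint16_t)(0xFF00 | r);
    uint16_t gb = (uint16_t)(((uint16_t)g << 8) | b);
    // Step 2: interleave FR and GB into one 32-bit pixel
    // (_mm_unpacklo_epi16 / _mm_unpackhi_epi16 in the vector code).
    return ((uint32_t)fr << 16) | gb;
}
```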

The code described above, which implements the YUV to RGB conversion using SSE vector intrinsics, runs 4 times faster than the original scalar code based on precomputed tables (LUT).

#### Using Cilk Plus for parallelization: trivial

All the Atom processors used in smartphones and tablets have at least two cores (at least logical ones, thanks to Hyper-Threading), and future ones will have even more, so parallelizing algorithms is very important for them. The simplest approach to parallelization is offered by Intel Cilk Plus, an extension of the C and C++ languages (the famous TBB works only with C++!). Its simplest parallel loop keyword, cilk_for, used for the outer loop of the YUV to RGB conversion instead of the standard C/C++ for, provides a twofold performance gain on the dual-core Clover Trail+ processor.

Using SSE intrinsics together with Cilk Plus parallelization gives an 8x overall speedup.
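A sketch of the cilk_for pattern (illustrative only: it requires a Cilk Plus-capable compiler such as the Intel compiler of that era, and the per-row routine here is a hypothetical stand-in for the SSE code described above):

```c
#include <cilk/cilk.h>
#include <stdint.h>

/* Hypothetical per-row conversion routine (name and signature are
   illustrative; the real body is the SSE intrinsics code above). */
static void convert_two_rows(const uint8_t *yuv, uint32_t *rgb,
                             int row, int width)
{
    /* ... process rows `row` and `row + 1` with SSE intrinsics ... */
    (void)yuv; (void)rgb; (void)row; (void)width;
}

/* Outer loop over the frame: replacing `for` with `cilk_for` lets the
   Cilk Plus runtime spread the iterations across the available cores. */
void convert_frame(const uint8_t *yuv, uint32_t *rgb,
                   int width, int height)
{
    cilk_for (int row = 0; row < height; row += 2) {
        /* One task per pair of rows: in YUV420 a single (U, V) row
           serves two Y rows, so rows are processed in pairs. */
        convert_two_rows(yuv, rgb, row, width);
    }
}
```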

#### Using Cilk Plus for vectorization: Array Notation, function mapping, and reduction

Cilk Plus contains a very important extension called Array Notation, which can significantly increase the efficiency of vectorization and at the same time improve code readability. Array Notation also provides scalability across platforms: the same code can be vectorized optimally for the 128-bit Atom, the 256-bit Haswell, and the 512-bit MIC/Skylake, unlike code based on SSE/AVX intrinsics, which has to be rewritten by hand for each specific platform. Array Notation allows so-called array sections to be used as function arguments (function mapping), as well as for reductions (summation, maximum/minimum search, etc.).

#### A Cilk Plus Array Notation example

Look at the two code fragments. The original code, with its cumbersome definitions and loops (taken from a real application):

And the one-line Cilk Plus combination of an array section, function mapping, and reduction:

The two versions are functionally identical, but the Cilk Plus version runs 6 times faster!
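The original fragments were shown as images. Purely to illustrate the flavor of such a rewrite (Intel compiler syntax, with a hypothetical element function my_func), the transformation looks like this:

```c
/* Illustrative only: Array Notation syntax requires a Cilk Plus-capable
   compiler. my_func is a hypothetical element-wise function standing in
   for the application's real computation. */
float my_func(float x);

/* Original: explicit loop with scalar accumulation. */
float sum_scalar(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += my_func(a[i]);
    return s;
}

/* Cilk Plus: the array section a[0:n] is mapped through my_func
   (function mapping) and summed with the __sec_reduce_add reduction,
   all in one auto-vectorizable line. */
float sum_cilk(const float a[], int n)
{
    return __sec_reduce_add(my_func(a[0:n]));
}
```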

#### Conclusions and a call for developers

SSE intrinsics (at the SSSE3 level) significantly improve application performance on Intel Atom devices. Cilk Plus Array Notation (built into the Intel compiler) provides powerful facilities for automatic vectorization.

Cilk Plus is also a great way to parallelize applications on Atom-based devices.

**Our recommendation to Atom/Android developers in the new "android" world: do not hesitate to optimize your multimedia applications and games using SSE and Cilk Plus - these proven tools will give you a huge leap in performance!**

The author of the text is Grigory Danovich, Senior Engineer for Applied Solutions, Intel Israel.