Using the Intel SSSE3 instruction set to accelerate the DNN algorithm in speech recognition tasks on mobile devices

Original author: Li Alven
  • Translation
Over the past thirty years, speech recognition technology has made serious headway, moving from research laboratories to a wide range of consumers. It is beginning to play an important role in our lives: at the workplace, at home, in the car, in medicine and in other fields. Speech recognition is among the top ten promising world-class technologies.

Original picture by Vladstudio




Overview


Research in recent years has changed the basic speech recognition algorithms. Where GMM (Gaussian Mixture Model) and HMM-GMM (Hidden Markov Model - Gaussian Mixture Model) used to dominate, there has been a transition to the DNN (Deep Neural Network) algorithm. The operation of this algorithm resembles the activity of the human brain: it uses complex calculations and a huge amount of data.

Thanks to the Internet, any smartphone owner can use modern speech recognition technologies, with countless servers at their service. But without the Internet, speech recognition services on mobile devices are almost useless: they are rarely able to correctly understand those who try to "talk" to them.

Is it possible to move the implementation of the DNN algorithm from the server to a smartphone or tablet? The answer is yes. Thanks to Intel processor support for the SSSE3 instruction set, DNN-based speech recognition applications can be used on mobile devices, and an Internet connection is not required. In our tests, the speech recognition accuracy of such an application exceeded 80%, which is very close to what server systems achieve. In this article, we will talk about the DNN algorithm and about how the Intel SSSE3 instruction set can help speed up the calculations needed to implement it.

Preliminary information


DNN is an abbreviation for Deep Neural Network: a feed-forward network containing many hidden layers. DNN is at the forefront of modern machine learning technology, and the algorithm has found many practical applications.

Deep neural networks have a large number of hidden layers. Training them requires modifying tens of millions of parameters, which makes it time-consuming.

Speech recognition is a typical DNN application. Simplified, a speech recognition application can be seen as consisting of an acoustic model, a language model, and a decoding subsystem. The acoustic model models the probability distribution over pronunciation variants. The language model models the relationships between words. At the decoding stage, both models are used to convert speech to text. A neural network can model any verbal construction. While a shallow network can also extract data attributes, a deep neural network has a stronger ability to extract the essential ones: it models the structure of the human brain and is thus able to "understand" the characteristics of things more accurately. As a result, recognition based on deep networks is more accurate.


Areas of application of the DNN algorithm

Typical deep neural network diagram


A typical deep neural network contains many linear and non-linear layers stacked on top of each other.


Four hidden layers in an acoustic model based on DNN.

The network whose diagram is shown here consists of a set of linear layers, where each neuron of the previous layer is connected to each neuron of the next. The relation between a layer's input and its output can be described by the following formula:

Yᵀ = XᵀWᵀ + B

Here Xᵀ is a row vector, the input of the neural network. In speech recognition we usually batch four frames of data to process them simultaneously, which turns the input into a 4×M matrix. Wᵀ and B are, respectively, the linear transformation matrix of the neural network and the bias vector. The dimension of such a network is usually very large, and all layers contain the same number of neurons, that is, the weight matrices are square.
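To make the formula concrete, here is a minimal plain-C sketch of one linear layer. The batch of 4 rows and the square M×M weight matrix follow the text; the row-major layout and the function name are our assumptions, not the original project code.

#include <stddef.h>

#define BATCH 4  /* four frames are processed together, as described above */

/* Computes y = x*W + b for a BATCH x m input x, an m x m weight matrix w
   and a bias vector b of length m; all arrays are row-major. */
static void linear_layer(const float *x, const float *w, const float *b,
                         float *y, size_t m)
{
    for (size_t i = 0; i < BATCH; i++) {
        for (size_t j = 0; j < m; j++) {
            float acc = b[j];                       /* bias term B */
            for (size_t k = 0; k < m; k++)
                acc += x[i * m + k] * w[k * m + j]; /* weighted sum */
            y[i * m + j] = acc;
        }
    }
}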

Intel SSSE3 instruction set


Supplemental Streaming SIMD Extensions 3, or simply SSSE3 for brevity, is Intel's name for the extension of the SSE3 instruction set. It is part of the SIMD technology integrated into Intel microprocessors, which is designed to improve multimedia processing: it speeds up encoding and decoding and accelerates various computations. Using the SSSE3 instruction set, we can process several data elements with one instruction in one clock cycle, which can significantly improve application performance. In particular, SSSE3 instructions are well suited to matrix computations.

To use the SSSE3 instruction set, the corresponding SIMD header files must be included:

#include <mmintrin.h>  //MMX
#include <xmmintrin.h> //SSE
#include <emmintrin.h> //SSE2
#include <pmmintrin.h> //SSE3
#include <tmmintrin.h> //SSSE3
#include <smmintrin.h> //SSE4.1
#include <nmmintrin.h> //SSE4.2
#include <wmmintrin.h> //AES
#include <immintrin.h> //AVX

The tmmintrin.h header file provides the SSSE3 intrinsics; the functions defined in it are described below.

/*Horizontal addition [with saturation] of packed words, doublewords,
{X,}MM2/m{128,64} (b) to {X,}MM1 (a).*/
//a=(a0, a1, a2, a3, a4, a5, a6, a7), b=(b0, b1, b2, b3, b4, b5, b6, b7)
//then r0=a0+a1,r1=a2+a3,r2=a4+a5,r3=a6+a7,r4=b0+b1,r5=b2+b3,r6=b4+b5, r7=b6+b7
extern __m128i _mm_hadd_epi16 (__m128i a, __m128i b);
//a=(a0, a1, a2, a3), b=(b0, b1, b2, b3)
//then r0=a0+a1,r1=a2+a3,r2=b0+b1,r3=b2+b3
extern __m128i _mm_hadd_epi32 (__m128i a, __m128i b);
//SATURATE_16(x) is ((x > 32767) ? 32767 : ((x < -32768) ? -32768 : x))
//a=(a0, a1, a2, a3, a4, a5, a6, a7), b=(b0, b1, b2, b3, b4, b5, b6, b7)
//then r0=SATURATE_16(a0+a1), ..., r3=SATURATE_16(a6+a7),
//r4=SATURATE_16(b0+b1), ..., r7=SATURATE_16(b6+b7)
extern __m128i _mm_hadds_epi16 (__m128i a, __m128i b);
//a=(a0, a1, a2, a3), b=(b0, b1, b2, b3)
//then r0=a0+a1, r1=a2+a3, r2=b0+b1, r3=b2+b3
extern __m64 _mm_hadd_pi16 (__m64 a, __m64 b);
//a=(a0, a1), b=(b0, b1), then r0=a0+a1, r1=b0+b1
extern __m64 _mm_hadd_pi32 (__m64 a, __m64 b);
//SATURATE_16(x) is ((x > 32767) ? 32767 : ((x < -32768) ? -32768 : x))
//a=(a0, a1, a2, a3), b=(b0, b1, b2, b3)
//then r0=SATURATE_16(a0+a1), r1=SATURATE_16(a2+a3),
//r2=SATURATE_16(b0+b1), r3=SATURATE_16(b2+b3)
extern __m64 _mm_hadds_pi16 (__m64 a, __m64 b);
/*Horizontal subtraction [with saturation] of packed words, doublewords,
{X,}MM2/m{128,64} (b) from {X,}MM1 (a).*/
//a=(a0, a1, a2, a3, a4, a5, a6, a7), b=(b0, b1, b2, b3, b4, b5, b6, b7)
//then r0=a0-a1, r1=a2-a3, r2=a4-a5, r3=a6-a7, r4=b0-b1, r5=b2-b3, r6=b4-b5, r7=b6-b7
extern __m128i _mm_hsub_epi16 (__m128i a, __m128i b);
//a=(a0, a1, a2, a3), b=(b0, b1, b2, b3)
//then r0=a0-a1, r1=a2-a3, r2=b0-b1, r3=b2-b3
extern __m128i _mm_hsub_epi32 (__m128i a, __m128i b);
//SATURATE_16(x) is ((x > 32767) ? 32767 : ((x < -32768) ? -32768 : x))
//a=(a0, a1, a2, a3, a4, a5, a6, a7), b=(b0, b1, b2, b3, b4, b5, b6, b7)
//then r0=SATURATE_16(a0-a1), ..., r3=SATURATE_16(a6-a7),
//r4=SATURATE_16(b0-b1), ..., r7=SATURATE_16(b6-b7)
extern __m128i _mm_hsubs_epi16 (__m128i a, __m128i b);
//a=(a0, a1, a2, a3), b=(b0, b1, b2, b3)
//then r0=a0-a1, r1=a2-a3, r2=b0-b1, r3=b2-b3
extern __m64 _mm_hsub_pi16 (__m64 a, __m64 b);
//a=(a0, a1), b=(b0, b1), then r0=a0-a1, r1=b0-b1
extern __m64 _mm_hsub_pi32 (__m64 a, __m64 b);
//SATURATE_16(x) is ((x > 32767) ? 32767 : ((x < -32768) ? -32768 : x))
//a=(a0, a1, a2, a3), b=(b0, b1, b2, b3)
//then r0=SATURATE_16(a0-a1), r1=SATURATE_16(a2-a3),
//r2=SATURATE_16(b0-b1), r3=SATURATE_16(b2-b3)
extern __m64 _mm_hsubs_pi16 (__m64 a, __m64 b);
/*Multiply and add packed words,
{X,}MM2/m{128,64} (b) to {X,}MM1 (a).*/
//SATURATE_16(x) is ((x > 32767) ? 32767 : ((x < -32768) ? -32768 : x))
//a=(a0, a1, a2, ..., a13, a14, a15), b=(b0, b1, b2, ..., b13, b14, b15)
//then r0=SATURATE_16((a0*b0)+(a1*b1)), ..., r7=SATURATE_16((a14*b14)+(a15*b15))
//Parameter a contains unsigned bytes. Parameter b contains signed bytes.
extern __m128i _mm_maddubs_epi16 (__m128i a, __m128i b);
//SATURATE_16(x) is ((x > 32767) ? 32767 : ((x < -32768) ? -32768 : x))
//a=(a0, a1, a2, a3, a4, a5, a6, a7), b=(b0, b1, b2, b3, b4, b5, b6, b7)
//then r0=SATURATE_16((a0*b0)+(a1*b1)), ..., r3=SATURATE_16((a6*b6)+(a7*b7))
//Parameter a contains unsigned bytes. Parameter b contains signed bytes.
extern __m64 _mm_maddubs_pi16 (__m64 a, __m64 b);
/*Packed multiply of the high halves of integers with rounding and scaling,
{X,}MM2/m{128,64} (b) to {X,}MM1 (a).*/
//a=(a0, a1, a2, a3, a4, a5, a6, a7), b=(b0, b1, b2, b3, b4, b5, b6, b7)
//then r0=INT16(((a0*b0)+0x4000) >> 15), ..., r7=INT16(((a7*b7)+0x4000) >> 15)
extern __m128i _mm_mulhrs_epi16 (__m128i a, __m128i b);
//a=(a0, a1, a2, a3), b=(b0, b1, b2, b3)
//then r0=INT16(((a0*b0)+0x4000) >> 15), ..., r3=INT16(((a3*b3)+0x4000) >> 15)
extern __m64 _mm_mulhrs_pi16 (__m64 a, __m64 b);
/*Packed shuffle of bytes,
{X,}MM2/m{128,64} (b) by {X,}MM1 (a).*/
//SELECT(a, n) extracts the nth 8-bit parameter from a. The 0th 8-bit parameter
//is the least significant 8-bits, b=(b0, b1, b2, ..., b13, b14, b15), b is mask
//then r0 = (b0 & 0x80) ? 0 : SELECT(a, b0 & 0x0f), ...,
//r15 = (b15 & 0x80) ? 0 : SELECT(a, b15 & 0x0f)
extern __m128i _mm_shuffle_epi8 (__m128i a, __m128i b);
//SELECT(a, n) extracts the nth 8-bit parameter from a. The 0th 8-bit parameter
//is the least significant 8-bits, b=(b0, b1, ..., b7), b is mask
//then r0= (b0 & 0x80) ? 0 : SELECT(a, b0 & 0x07),...,
//r7=(b7 & 0x80) ? 0 : SELECT(a, b7 & 0x07)
extern __m64 _mm_shuffle_pi8 (__m64 a, __m64 b);
/*Sign of packed bytes, words, doublewords, {X,}MM2/m{128,64} (b) to {X,}MM1 (a).*/
//a=(a0, a1, a2, ..., a13, a14, a15), b=(b0, b1, b2, ..., b13, b14, b15)
//then r0=(b0 < 0) ? -a0 : ((b0 == 0) ? 0 : a0), ...,
//r15= (b15 < 0) ? -a15 : ((b15 == 0) ? 0 : a15)
extern __m128i _mm_sign_epi8 (__m128i a, __m128i b);
//a=(a0, a1, a2, a3, a4, a5, a6, a7), b=(b0, b1, b2, b3, b4, b5, b6, b7)
//r0=(b0 < 0) ? -a0 : ((b0 == 0) ? 0 : a0), ...,
//r7= (b7 < 0) ? -a7 : ((b7 == 0) ? 0 : a7)
extern __m128i _mm_sign_epi16 (__m128i a, __m128i b);
//a=(a0, a1, a2, a3), b=(b0, b1, b2, b3)
//then r0=(b0 < 0) ? -a0 : ((b0 == 0) ? 0 : a0), ...,
//r3= (b3 < 0) ? -a3 : ((b3 == 0) ? 0 : a3)
extern __m128i _mm_sign_epi32 (__m128i a, __m128i b);
//a=(a0, a1, a2, a3, a4, a5, a6, a7), b=(b0, b1, b2, b3, b4, b5, b6, b7)
//then r0=(b0 < 0) ? -a0 : ((b0 == 0) ? 0 : a0), ...,
//r7= (b7 < 0) ? -a7 : ((b7 == 0) ? 0 : a7)  
extern __m64 _mm_sign_pi8 (__m64 a, __m64 b);  
//a=(a0, a1, a2, a3), b=(b0, b1, b2, b3)  
//r0=(b0 < 0) ? -a0 : ((b0 == 0) ? 0 : a0), ...,  
//r3= (b3 < 0) ? -a3 : ((b3 == 0) ? 0 : a3)  
extern __m64 _mm_sign_pi16 (__m64 a, __m64 b);  
//a=(a0, a1), b=(b0, b1), then r0=(b0 < 0) ? -a0 : ((b0 == 0) ? 0 : a0),
//r1= (b1 < 0) ? -a1 : ((b1 == 0) ? 0 : a1)  
extern __m64 _mm_sign_pi32 (__m64 a, __m64 b);  
/*Packed alignment and right shift by n*8 bits,
{X,}MM2/m{128,64} (b) to {X,}MM1 (a).*/
//n: a constant that specifies by how many bytes
//the intermediate result is shifted to the right;
//if n > 32, the final value is zero.
//CONCAT(a, b) is a 256-bit unsigned intermediate value that is
//the concatenation of parameters a and b.
//The result is that intermediate value shifted right by n bytes.
//then r= (CONCAT(a, b) >> (n * 8)) & 0xffffffffffffffff
extern __m128i _mm_alignr_epi8 (__m128i a, __m128i b, int n);
//n: an integer constant that specifies by how many bytes
//the intermediate result is shifted to the right.
//If n > 16, the result is zero.
//CONCAT(a, b) is a 128-bit unsigned intermediate value that is
//the concatenation of parameters a and b.
//The resulting values are the low 64 bits obtained
//after shifting this intermediate result right by n bytes.
//then r = (CONCAT(a, b) >> (n * 8)) & 0xffffffff
extern __m64 _mm_alignr_pi8 (__m64 a, __m64 b, int n);
/*Absolute value of packed bytes, words, doublewords,
{X,}MM2/m{128,64} (b) to {X,}MM1 (a).*/
//a=(a0, a1, a2, ..., a13, a14, a15)
//then r0 = (a0 < 0) ? -a0 : a0, ..., r15 = (a15 < 0) ? -a15 : a15
extern __m128i _mm_abs_epi8 (__m128i a);
//a=(a0, a1, a2, a3, a4, a5, a6, a7)
//then r0 = (a0 < 0) ? -a0 : a0, ..., r7 = (a7 < 0) ? -a7 : a7
extern __m128i _mm_abs_epi16 (__m128i a);
//a=(a0, a1, a2, a3)
//then r0 = (a0 < 0) ? -a0 : a0, ..., r3 = (a3 < 0) ? -a3 : a3
extern __m128i _mm_abs_epi32 (__m128i a);
//a=(a0, a1, a2, a3, a4, a5, a6, a7)
//then r0 = (a0 < 0) ? -a0 : a0, ..., r7 = (a7 < 0) ? -a7 : a7
extern __m64 _mm_abs_pi8 (__m64 a);
//a=(a0, a1, a2, a3)
//then r0 = (a0 < 0) ? -a0 : a0, ..., r3 = (a3 < 0) ? -a3 : a3
extern __m64 _mm_abs_pi16 (__m64 a);
//a=(a0, a1), then r0 = (a0 < 0) ? -a0 : a0, r1 = (a1 < 0) ? -a1 : a1
extern __m64 _mm_abs_pi32 (__m64 a);

The definitions of the __m64 and __m128 data structures can be found in the MMX (mmintrin.h) and SSE (xmmintrin.h) header files.

__m64:

typedef union __declspec(intrin_type) _CRT_ALIGN(8) __m64  
{  
	unsigned __int64    m64_u64;  
	float               m64_f32[2];  
	__int8              m64_i8[8];  
	__int16             m64_i16[4];  
	__int32             m64_i32[2];      
	__int64             m64_i64;  
	unsigned __int8     m64_u8[8];  
	unsigned __int16    m64_u16[4];  
	unsigned __int32    m64_u32[2];  
} __m64; 

__m128:

typedef union __declspec(intrin_type) _CRT_ALIGN(16) __m128 {  
	float               m128_f32[4];  
	unsigned __int64    m128_u64[2];  
	__int8              m128_i8[16];  
	__int16             m128_i16[8];  
	__int32             m128_i32[4];  
	__int64             m128_i64[2];  
	unsigned __int8     m128_u8[16];  
	unsigned __int16    m128_u16[8];  
	unsigned __int32    m128_u32[4];  
} __m128; 
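These unions are mostly useful for inspecting values in a debugger; in real code, data usually moves between ordinary arrays and SIMD registers through load/store intrinsics. A small self-contained illustration (our own example, not from the original article):

#include <stdio.h>
#include <tmmintrin.h>  /* SSSE3 */

int main(void)
{
    short in[8] = {1, -2, 3, -4, 5, -6, 7, -8};
    short out[8];

    __m128i v = _mm_loadu_si128((const __m128i *)in); /* unaligned load */
    v = _mm_abs_epi16(v);          /* SSSE3: absolute value per element */
    _mm_storeu_si128((__m128i *)out, v);

    for (int i = 0; i < 8; i++)
        printf("%d ", out[i]);     /* prints: 1 2 3 4 5 6 7 8 */
    printf("\n");
    return 0;
}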

Example: using SSSE3 functions to speed up DNN computations


Here we look at two functions and use them as examples to show how SSSE3 speeds up calculations when implementing the DNN algorithm.

__m128i _mm_maddubs_epi16 (__m128i a, __m128i b): multiply and add with saturation

This function is very important for the matrix calculations in the DNN algorithm. Parameter a is a 128-bit register used to store 16 unsigned 8-bit integers; parameter b holds 16 signed 8-bit integers. The result is 8 signed 16-bit integers. This function is a great fit for matrix calculations:

r0 := SATURATE_16((a0*b0) + (a1*b1))
r1 := SATURATE_16((a2*b2) + (a3*b3))
…
r7 := SATURATE_16((a14*b14) + (a15*b15))
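A quick self-contained check of this behavior (the input values are our own illustration):

#include <stdio.h>
#include <tmmintrin.h>

int main(void)
{
    unsigned char a[16] = {1, 2, 3, 4, 5, 6, 7, 8,
                           9, 10, 11, 12, 13, 14, 15, 16};
    signed char   b[16] = {1, 1, 1, 1, 1, 1, 1, 1,
                           -1, -1, -1, -1, -1, -1, -1, -1};
    short r[8];

    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)r, _mm_maddubs_epi16(va, vb));

    for (int i = 0; i < 8; i++)
        printf("%d ", r[i]);  /* prints: 3 7 11 15 -19 -23 -27 -31 */
    printf("\n");
    return 0;
}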

__m128i _mm_hadd_epi32 (__m128i a, __m128i b): addition of adjacent elements

This function performs pairwise (horizontal) addition. Parameters a and b are 128-bit registers that each store four signed 32-bit integers. Unlike the usual addition of corresponding elements of two vectors, this instruction adds adjacent elements within each input vector:

r0 := a0 + a1
r1 := a2 + a3
r2 := b0 + b1
r3 := b2 + b3
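Again, a small illustration with assumed input values:

#include <stdio.h>
#include <tmmintrin.h>

int main(void)
{
    int a[4] = {1, 2, 3, 4};
    int b[4] = {10, 20, 30, 40};
    int r[4];

    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)r, _mm_hadd_epi32(va, vb));

    printf("%d %d %d %d\n", r[0], r[1], r[2], r[3]); /* prints: 3 7 30 70 */
    return 0;
}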

Suppose we have a vector computation task typical of a DNN implementation.

There are five vectors: a1, b1, b2, b3, b4. Vector a1 is a one-dimensional array of 16 integers of type signed char. Vectors b1, b2, b3, b4 are 16-element arrays of integers of type unsigned char. We need to compute the dot products a1*b1, a1*b2, a1*b3, a1*b4, and each result must be stored as a 32-bit signed integer.

If we use the usual approach for programming in C, then the code for solving this problem will look like this:

unsigned char b1[16], b2[16], b3[16], b4[16];
signed char a1[16];
int c[4], i;
//
// Initialize b1, b2, b3, b4 and a1; c is initialized with zeros
//
for (i = 0; i < 16; i++) {
    c[0] += (short)a1[i] * (short)b1[i];
    c[1] += (short)a1[i] * (short)b2[i];
    c[2] += (short)a1[i] * (short)b3[i];
    c[3] += (short)a1[i] * (short)b4[i];
}

Suppose that one multiplication and one addition can be performed per clock cycle. Then we get 64 clock cycles to complete the calculation.

Now let's use the SSSE3 instruction set to solve the same problem.

register __m128i a1,b1,b2,b3,b4,c,d1,d2,d3,d4;
//Initialize a1, b1, b2, b3 and b4; c is initialized with zeros //
d1 = _mm_maddubs_epi16(a1,b1);
d1 = _mm_add_epi32(_mm_srai_epi32(_mm_unpacklo_epi16(d1, d1), 16), _mm_srai_epi32(_mm_unpackhi_epi16(d1, d1), 16));
d2 = _mm_maddubs_epi16(a1,b2);
d2 = _mm_add_epi32(_mm_srai_epi32(_mm_unpacklo_epi16(d2, d2), 16), _mm_srai_epi32(_mm_unpackhi_epi16(d2, d2), 16));
d3 = _mm_hadd_epi32(d1, d2);
d1 = _mm_maddubs_epi16(a1,b3);
d1 = _mm_add_epi32(_mm_srai_epi32(_mm_unpacklo_epi16(d1, d1), 16), _mm_srai_epi32(_mm_unpackhi_epi16(d1, d1), 16));
d2 = _mm_maddubs_epi16(a1,b4);
d2 = _mm_add_epi32(_mm_srai_epi32(_mm_unpacklo_epi16(d2, d2), 16), _mm_srai_epi32(_mm_unpackhi_epi16(d2, d2), 16));
d4 = _mm_hadd_epi32(d1, d2);
c = _mm_hadd_epi32(d3, d4);

We store the result in the 128-bit register c, which holds four integers. Given pipelined data processing, the computation takes 12 or 13 clock cycles. Comparing the two approaches gives the following:
Implementation option    CPU cycles    Gain
Plain C                  64            -
Using SSSE3              13            ~500%
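
The snippet above assumes the __m128i variables have already been loaded with data. For reference, a self-contained version might look as follows. Note that _mm_maddubs_epi16 treats its first operand as unsigned bytes, so the unsigned vectors are passed first here; the placeholder data and the widen_sum helper are our assumptions, not the original project code.

#include <stdio.h>
#include <tmmintrin.h>

/* Widens eight signed 16-bit sums into four 32-bit partial sums. */
static __m128i widen_sum(__m128i v)
{
    return _mm_add_epi32(
        _mm_srai_epi32(_mm_unpacklo_epi16(v, v), 16),
        _mm_srai_epi32(_mm_unpackhi_epi16(v, v), 16));
}

int main(void)
{
    unsigned char b1[16], b2[16], b3[16], b4[16];
    signed char a1[16];
    int c[4];

    for (int i = 0; i < 16; i++) {   /* placeholder data */
        a1[i] = (signed char)(i - 8);
        b1[i] = 1; b2[i] = 2; b3[i] = 3; b4[i] = 4;
    }

    __m128i va = _mm_loadu_si128((const __m128i *)a1);
    __m128i d1 = widen_sum(_mm_maddubs_epi16(
                     _mm_loadu_si128((const __m128i *)b1), va));
    __m128i d2 = widen_sum(_mm_maddubs_epi16(
                     _mm_loadu_si128((const __m128i *)b2), va));
    __m128i d3 = _mm_hadd_epi32(d1, d2);
    d1 = widen_sum(_mm_maddubs_epi16(
             _mm_loadu_si128((const __m128i *)b3), va));
    d2 = widen_sum(_mm_maddubs_epi16(
             _mm_loadu_si128((const __m128i *)b4), va));
    __m128i d4 = _mm_hadd_epi32(d1, d2);
    _mm_storeu_si128((__m128i *)c, _mm_hadd_epi32(d3, d4));

    printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]); /* -8 -16 -24 -32 */
    return 0;
}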

Benchmarking


Let's conduct an experiment based on the code above. We create two functions that perform the same calculations in different ways: one returns the sum of the elements of the integer array c, the other returns the sum of the four 32-bit integer elements of the 128-bit register c. Variables are initialized on every function call. Each function is called 10,000,000 times in total; the test runs in a background thread.
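The article does not show the timing harness itself, so here is a minimal sketch of how such a loop could be measured. The dot_products_* names are hypothetical placeholders for the two implementations, and POSIX clock_gettime is assumed:

#include <stdio.h>
#include <time.h>

#define CALLS 10000000

int dot_products_plain(void);  /* hypothetical: the plain C version */
int dot_products_ssse3(void);  /* hypothetical: the SSSE3 version */

static long long run_ms(int (*fn)(void))
{
    struct timespec t0, t1;
    volatile int sink = 0;  /* keeps the calls from being optimized away */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < CALLS; i++)
        sink += fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    return (t1.tv_sec - t0.tv_sec) * 1000LL +
           (t1.tv_nsec - t0.tv_nsec) / 1000000LL;
}

int main(void)
{
    printf("SSSE3:   %lld ms\n", run_ms(dot_products_ssse3));
    printf("plain C: %lld ms\n", run_ms(dot_products_plain));
    return 0;
}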


Application interface for performance testing

Here are the results of testing the release version of the application on an Asus Fonepad 8 tablet with an Intel Atom Z3530 CPU running Android 5.0.

Comparison of the execution speed of code written with and without SSSE3

No.        SSSE3, ms    Plain C, ms
1          547          3781
2          507          3723
3          528          3762
4          517          3731
5          531          3755
6          517          3769
7          502          3752
8          529          3750
9          514          3745
10         510          3721
Average    520.2        3748.9
As a result, the code that implements the calculations with SSSE3 instructions runs, on average, 7.2 times faster than the plain C version.

The source code of the project, which can be imported into Android Studio, can be found here.

Summary


As you know, speech recognition with a deep neural network involves a great deal of matrix computation. Optimizing these computations makes it possible to achieve better performance than ever on the IA (Intel Architecture) platform. We work together with the ISV Unisound, a company that provides speech recognition services in China; Unisound was able to achieve a 10% performance increase when using DNN-based software on ARM devices.

Today DNN is becoming the main algorithm for speech recognition. It is used, in particular, by services such as Google Now, Baidu Voice, Tencent WeChat, iFlytek Speech Service, Unisound Speech Service and many others. At the same time, the SSSE3 instruction set can help optimize the calculations on which the speech recognition process is based. If such optimization is implemented everywhere DNN is used, it will improve the quality of speech recognition and fully reveal the capabilities of the IA platform.
