
Hardware support for the AES algorithm in modern processors

In 2008 Intel proposed new instructions for the x86 architecture that add hardware support for the symmetric encryption algorithm AES (Advanced Encryption Standard). AES is currently one of the most popular block ciphers, so a hardware implementation should improve the performance of programs that use it (OpenSSL, The Bat, TrueCrypt, ...). The new instruction set extension was named AES-NI. It contains the following instructions:
- AESENC - perform one round of AES encryption,
- AESENCLAST - perform the last round of AES encryption,
- AESDEC - perform one round of AES decryption,
- AESDECLAST - perform the last round of AES decryption,
- AESKEYGENASSIST - assist in generating the AES round keys,
- AESIMC - Inverse Mix Columns.
Since much has already been said about the AES encryption algorithm itself, in this post we will look at how to use these instructions.
First, let us recall how AES works. This is needed to understand which mechanisms these instructions implement.
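The AES algorithm uses 4 functions:
- AddRoundKey - XOR (exclusive or) of the state with the round key,
- SubBytes - byte substitution function (S-box),
- ShiftRows - a cyclic shift of the rows of the block according to a given rule,
- MixColumns - column mixing procedure.
The encryption algorithm itself looks like this: an initial AddRoundKey, then Nr - 1 identical rounds of SubBytes, ShiftRows, MixColumns and AddRoundKey, and a final round that omits MixColumns.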

Getting started
To get started, we need to make sure that the processor supports the AES-NI extension. The CPUID instruction serves this purpose: when it is executed with eax = 0x00000001, it sets bits in the registers for the extensions that are present. For the AES extension, this is bit 25 of the ECX register.
AES-NI verification code:
mov eax, 0x00000001   ; CPUID leaf 1: feature information
cpuid
test ecx, 0x2000000   ; bit 25 of ECX = AES-NI
je L_no_AES           ; bit is clear: no AES-NI support
If the bit is set to 1, then we can move on to encryption.
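The same check can be sketched in C++. This is a minimal sketch, assuming GCC or Clang on x86, where the compiler's `<cpuid.h>` header provides `__get_cpuid`; the helper names `ecx_has_aesni` and `cpu_has_aesni` are made up for this illustration and are not part of any library.

```cpp
#include <cstdint>
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>   // GCC/Clang wrapper around the CPUID instruction
#define HAVE_CPUID_H 1
#endif

// Bit 25 of ECX (CPUID leaf 1) signals AES-NI; same mask as in the asm check.
constexpr std::uint32_t kAesNiMask = 1u << 25;   // 0x2000000

// Hypothetical helper: interprets an ECX value returned by CPUID leaf 1.
constexpr bool ecx_has_aesni(std::uint32_t ecx) {
    return (ecx & kAesNiMask) != 0;
}

// Hypothetical helper: queries the running CPU. Returns false when leaf 1
// is unavailable or when the build target is not x86 with GCC/Clang.
bool cpu_has_aesni() {
#ifdef HAVE_CPUID_H
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) == 0) return false;
    return ecx_has_aesni(ecx);
#else
    return false;
#endif
}
```

A program would call `cpu_has_aesni()` once at startup and fall back to a software AES implementation when it returns false.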
Key Expansion / ExpandKey
The key expansion algorithm in pseudo-code looks like this:
KeyExpansion(byte key[4*Nk], word w[Nb*(Nr+1)], Nk)
begin
    word temp
    i = 0
    while (i < Nk)
        w[i] = word(key[4*i], key[4*i+1], key[4*i+2], key[4*i+3])
        i = i + 1
    end while
    i = Nk
    while (i < Nb * (Nr+1))
        temp = w[i-1]
        if (i mod Nk = 0)
            temp = SubWord(RotWord(temp)) xor Rcon[i/Nk]
        else if (Nk > 6 and i mod Nk = 4)
            temp = SubWord(temp)
        end if
        w[i] = w[i-Nk] xor temp
        i = i + 1
    end while
end
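For reference, the pseudo-code can be turned into plain C++ without any AES-NI instructions. This is only a software sketch under a few assumptions: words are packed big-endian (first key byte in the top 8 bits, as in the FIPS-197 notation), the S-box is computed from its mathematical definition instead of being stored as a table, and the names `make_sbox` and `key_expansion` are chosen for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using Word = std::uint32_t;   // first key byte sits in bits 31..24

// S-box built from its definition: inverse in GF(2^8), then the affine map.
std::array<std::uint8_t, 256> make_sbox() {
    std::array<std::uint8_t, 256> sbox{};
    auto rotl8 = [](std::uint8_t x, int s) {
        return static_cast<std::uint8_t>((x << s) | (x >> (8 - s)));
    };
    std::uint8_t p = 1, q = 1;
    do {
        // p := p * 3 in GF(2^8)
        p = static_cast<std::uint8_t>(p ^ (p << 1) ^ ((p & 0x80) ? 0x1B : 0));
        // q := q / 3, so q stays the multiplicative inverse of p
        q ^= static_cast<std::uint8_t>(q << 1);
        q ^= static_cast<std::uint8_t>(q << 2);
        q ^= static_cast<std::uint8_t>(q << 4);
        if (q & 0x80) q ^= 0x09;
        sbox[p] = static_cast<std::uint8_t>(
            q ^ rotl8(q, 1) ^ rotl8(q, 2) ^ rotl8(q, 3) ^ rotl8(q, 4) ^ 0x63);
    } while (p != 1);
    sbox[0] = 0x63;   // 0 has no inverse and is handled separately
    return sbox;
}

// Expands a 16-, 24- or 32-byte key into Nb*(Nr+1) round-key words.
std::vector<Word> key_expansion(const std::vector<std::uint8_t>& key) {
    assert(key.size() == 16 || key.size() == 24 || key.size() == 32);
    static const std::array<std::uint8_t, 256> sbox = make_sbox();
    const int Nb = 4;
    const int Nk = static_cast<int>(key.size() / 4);   // 4, 6 or 8
    const int Nr = Nk + 6;                             // 10, 12 or 14

    auto sub_word = [](Word w) {
        return (Word(sbox[w >> 24]) << 24) | (Word(sbox[(w >> 16) & 0xFF]) << 16) |
               (Word(sbox[(w >> 8) & 0xFF]) << 8) | Word(sbox[w & 0xFF]);
    };
    auto rot_word = [](Word w) { return (w << 8) | (w >> 24); };

    std::vector<Word> w(Nb * (Nr + 1));
    for (int i = 0; i < Nk; ++i)
        w[i] = (Word(key[4 * i]) << 24) | (Word(key[4 * i + 1]) << 16) |
               (Word(key[4 * i + 2]) << 8) | Word(key[4 * i + 3]);

    std::uint8_t rc = 0x01;   // Rcon[1]; the following values are rc * x in GF(2^8)
    for (int i = Nk; i < Nb * (Nr + 1); ++i) {
        Word temp = w[i - 1];
        if (i % Nk == 0) {
            temp = sub_word(rot_word(temp)) ^ (Word(rc) << 24);
            rc = static_cast<std::uint8_t>((rc << 1) ^ ((rc & 0x80) ? 0x1B : 0));
        } else if (Nk > 6 && i % Nk == 4) {
            temp = sub_word(temp);
        }
        w[i] = w[i - Nk] ^ temp;
    }
    return w;
}
```

For a 128-bit key the function returns 44 words, that is 11 round keys of four words each.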
Hardware support for this step comes from the AESKEYGENASSIST instruction, which performs the following:
AESKEYGENASSIST xmm1, xmm2/m128, imm8
Tmp := xmm2/LOAD(m128)
X3[31-0] = Tmp[127-96];
X2[31-0] = Tmp[95-64];
X1[31-0] = Tmp[63-32];
X0[31-0] = Tmp[31-0];
RCON[7-0] := imm8;
RCON[31-8] := 0;
xmm1 := [RotWord(SubWord(X3)) XOR RCON, SubWord(X3), RotWord(SubWord(X1)) XOR RCON, SubWord(X1)]
As you can easily see, the instruction does not perform the final step:
w[i] = w[i-Nk] xor temp
You have to perform this operation yourself using SSE2 instructions (pshufd, pslldq and pxor in the example below).
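To make the division of labour concrete, here is a software model of AESKEYGENASSIST together with the XOR chain that is left to the programmer. It is a sketch, not the hardware: a register is modelled as four 32-bit lanes (lane 0 = bits 31-0), lanes are little-endian as on x86, RotWord is a rotation right by 8 bits in that representation, and the names `Xmm`, `aeskeygenassist` and `next_round_key_128` are invented for this example.

```cpp
#include <array>
#include <cstdint>

using Xmm = std::array<std::uint32_t, 4>;   // lane 0 = bits 31..0, lane 3 = bits 127..96

// S-box built from its definition: inverse in GF(2^8), then the affine map.
std::array<std::uint8_t, 256> make_sbox() {
    std::array<std::uint8_t, 256> sbox{};
    auto rotl8 = [](std::uint8_t x, int s) {
        return static_cast<std::uint8_t>((x << s) | (x >> (8 - s)));
    };
    std::uint8_t p = 1, q = 1;
    do {
        p = static_cast<std::uint8_t>(p ^ (p << 1) ^ ((p & 0x80) ? 0x1B : 0));
        q ^= static_cast<std::uint8_t>(q << 1);
        q ^= static_cast<std::uint8_t>(q << 2);
        q ^= static_cast<std::uint8_t>(q << 4);
        if (q & 0x80) q ^= 0x09;
        sbox[p] = static_cast<std::uint8_t>(
            q ^ rotl8(q, 1) ^ rotl8(q, 2) ^ rotl8(q, 3) ^ rotl8(q, 4) ^ 0x63);
    } while (p != 1);
    sbox[0] = 0x63;
    return sbox;
}
const std::array<std::uint8_t, 256> kSbox = make_sbox();

// Applies the S-box to each of the four bytes of a lane.
std::uint32_t sub_word(std::uint32_t w) {
    std::uint32_t r = 0;
    for (int i = 0; i < 4; ++i)
        r |= static_cast<std::uint32_t>(kSbox[(w >> (8 * i)) & 0xFF]) << (8 * i);
    return r;
}

// RotWord on a little-endian lane: the second byte in memory becomes the first.
std::uint32_t rot_word(std::uint32_t w) { return (w >> 8) | (w << 24); }

// Model of AESKEYGENASSIST xmm1, xmm2, imm8 (returns the value of xmm1).
Xmm aeskeygenassist(const Xmm& src, std::uint8_t imm8) {
    const std::uint32_t rcon = imm8;
    const std::uint32_t s1 = sub_word(src[1]);
    const std::uint32_t s3 = sub_word(src[3]);
    return {s1, rot_word(s1) ^ rcon, s3, rot_word(s3) ^ rcon};
}

// The part the instruction leaves out: w[i] = w[i-Nk] xor temp, for Nk = 4.
Xmm next_round_key_128(const Xmm& key, std::uint8_t rcon) {
    const std::uint32_t temp = aeskeygenassist(key, rcon)[3];   // only lane 3 is needed
    Xmm next{};
    next[0] = key[0] ^ temp;
    next[1] = key[1] ^ next[0];
    next[2] = key[2] ^ next[1];
    next[3] = key[3] ^ next[2];
    return next;
}
```

For a 128-bit key only lane 3 of the result is used; the remaining lanes matter for 192- and 256-bit keys, where SubWord without rotation is also needed.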
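Key expansion example for a 128-bit key
aeskeygenassist xmm2, xmm1, 0x1 ; round 1, Rcon = 0x01, xmm1 = current round key
pshufd xmm2, xmm2, 0xff         ; broadcast dword 3: RotWord(SubWord(X3)) xor Rcon
movups xmm3, xmm4               ; xmm4 is assumed to hold a copy of the current round key
pxor xmm2, xmm3                 ; dword 0 = first word of the new round key
pshufd xmm2, xmm2, 0x00         ; broadcast the first new word
pshufd xmm3, xmm3, 0x39
pslldq xmm3, 0x4                ; xmm3 = [k3, k2, k1, 0]
pxor xmm2, xmm3                 ; dword 1 = second word of the new round key
pshufd xmm2, xmm2, 0x14         ; keep the new words 0 and 1
pshufd xmm3, xmm3, 0x38
pslldq xmm3, 0x4                ; xmm3 = [k3, k2, 0, 0]
pxor xmm2, xmm3                 ; dword 2 = third word of the new round key
pshufd xmm2, xmm2, 0xA4         ; keep the new words 0, 1 and 2
pshufd xmm3, xmm3, 0x34
pslldq xmm3, 0x4                ; xmm3 = [k3, 0, 0, 0]
pxor xmm2, xmm3                 ; dword 3 = fourth word, xmm2 = new round key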
Encryption
To implement one round of encryption, the AESENC instruction is used, which performs the following actions:

AESENC xmm1, xmm2/m128
Tmp = xmm1
Round Key := xmm2/m128
Tmp = ShiftRows (Tmp)
Tmp = SubBytes (Tmp)
Tmp = MixColumns (Tmp)
xmm1 = Tmp xor Round Key
The last round of encryption is implemented using the AESENCLAST instruction:
AESENCLAST xmm1, xmm2/m128
Tmp = xmm1
Round Key := xmm2/m128
Tmp = ShiftRows (Tmp)
Tmp = SubBytes (Tmp)
xmm1 = Tmp xor Round Key
The difference between this instruction and AESENC is that the MixColumns operation is not performed in the last round.
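Encryption Example
aesenc xmm1, xmm2      ; one regular round: state in xmm1, round key in xmm2
aesenclast xmm1, xmm3  ; final round (no MixColumns), round key in xmm3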
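The two instructions above show only the round primitives. For a complete AES-128 block the usual sequence is an XOR with round key 0, nine AESENC rounds and one AESENCLAST round. The following software model sketches that sequence so it can be checked against the FIPS-197 test vectors without AES-NI hardware; it assumes the state is 16 bytes in column-major order (byte i is row i % 4, column i / 4), and all function names are chosen for this illustration.

```cpp
#include <array>
#include <cstdint>

using Block = std::array<std::uint8_t, 16>;   // byte i: row i % 4, column i / 4

// Multiplication by x (0x02) in GF(2^8).
std::uint8_t xtime(std::uint8_t b) {
    return static_cast<std::uint8_t>((b << 1) ^ ((b & 0x80) ? 0x1B : 0));
}

// S-box built from its definition: inverse in GF(2^8), then the affine map.
std::array<std::uint8_t, 256> make_sbox() {
    std::array<std::uint8_t, 256> sbox{};
    auto rotl8 = [](std::uint8_t x, int s) {
        return static_cast<std::uint8_t>((x << s) | (x >> (8 - s)));
    };
    std::uint8_t p = 1, q = 1;
    do {
        p = static_cast<std::uint8_t>(p ^ xtime(p));   // p := p * 3
        q ^= static_cast<std::uint8_t>(q << 1);        // q := q / 3
        q ^= static_cast<std::uint8_t>(q << 2);
        q ^= static_cast<std::uint8_t>(q << 4);
        if (q & 0x80) q ^= 0x09;
        sbox[p] = static_cast<std::uint8_t>(
            q ^ rotl8(q, 1) ^ rotl8(q, 2) ^ rotl8(q, 3) ^ rotl8(q, 4) ^ 0x63);
    } while (p != 1);
    sbox[0] = 0x63;
    return sbox;
}
const std::array<std::uint8_t, 256> kSbox = make_sbox();

Block shift_rows(const Block& s) {
    Block t{};
    for (int c = 0; c < 4; ++c)
        for (int r = 0; r < 4; ++r)
            t[r + 4 * c] = s[r + 4 * ((c + r) % 4)];   // row r rotates left by r
    return t;
}

Block sub_bytes(Block s) {
    for (auto& b : s) b = kSbox[b];
    return s;
}

Block mix_columns(const Block& s) {
    Block t{};
    for (int c = 0; c < 4; ++c) {
        const std::uint8_t* a = &s[4 * c];
        const std::uint8_t all = static_cast<std::uint8_t>(a[0] ^ a[1] ^ a[2] ^ a[3]);
        for (int r = 0; r < 4; ++r)   // 2*a[r] ^ 3*a[r+1] ^ a[r+2] ^ a[r+3]
            t[4 * c + r] = static_cast<std::uint8_t>(
                a[r] ^ all ^ xtime(static_cast<std::uint8_t>(a[r] ^ a[(r + 1) % 4])));
    }
    return t;
}

Block xor_blocks(Block a, const Block& b) {
    for (int i = 0; i < 16; ++i) a[i] ^= b[i];
    return a;
}

// Software models of the two instructions, in the order given in the text.
Block aesenc(const Block& state, const Block& round_key) {
    return xor_blocks(mix_columns(sub_bytes(shift_rows(state))), round_key);
}
Block aesenclast(const Block& state, const Block& round_key) {
    return xor_blocks(sub_bytes(shift_rows(state)), round_key);
}

// AES-128 key schedule on bytes: 11 round keys of 16 bytes each.
std::array<Block, 11> expand_key_128(const Block& key) {
    std::array<Block, 11> rk{};
    rk[0] = key;
    std::uint8_t rcon = 0x01;
    for (int n = 1; n <= 10; ++n) {
        const Block& p = rk[n - 1];
        Block& k = rk[n];
        // first word: previous first word xor SubWord(RotWord(last word)) xor Rcon
        k[0] = static_cast<std::uint8_t>(p[0] ^ kSbox[p[13]] ^ rcon);
        k[1] = static_cast<std::uint8_t>(p[1] ^ kSbox[p[14]]);
        k[2] = static_cast<std::uint8_t>(p[2] ^ kSbox[p[15]]);
        k[3] = static_cast<std::uint8_t>(p[3] ^ kSbox[p[12]]);
        for (int i = 4; i < 16; ++i) k[i] = static_cast<std::uint8_t>(p[i] ^ k[i - 4]);
        rcon = xtime(rcon);
    }
    return rk;
}

Block encrypt_128(Block state, const std::array<Block, 11>& rk) {
    state = xor_blocks(state, rk[0]);                             // initial AddRoundKey (pxor)
    for (int n = 1; n <= 9; ++n) state = aesenc(state, rk[n]);    // 9 x AESENC
    return aesenclast(state, rk[10]);                             // 1 x AESENCLAST
}
```

With real AES-NI code the body of `encrypt_128` maps one-to-one onto a pxor followed by nine aesenc and one aesenclast instructions.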
Decryption
To implement the decryption procedure, the AESDEC instruction is used:

AESDEC xmm1, xmm2/m128
Tmp = xmm1
Round Key = xmm2/m128
Tmp = InvShiftRows (Tmp)
Tmp = InvSubBytes (Tmp)
Tmp = InvMixColumns (Tmp)
xmm1 = Tmp xor Round Key
The round keys passed to AESDEC must first be transformed with the InvMixColumns operation (this gives the InvKey). The instruction that does this is AESIMC xmm1, xmm2.
For the last round of decryption, the AESDECLAST instruction is used:
AESDECLAST xmm1, xmm2/m128
State = xmm1
Round Key = xmm2/m128
Tmp = InvShiftRows (State)
Tmp = InvSubBytes (Tmp)
xmm1 = Tmp xor Round Key
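Decryption Example
aesimc xmm2, xmm2      ; apply InvMixColumns to the round key for use with AESDEC
aesdec xmm1, xmm2      ; one regular decryption round
aesdeclast xmm1, xmm3  ; final decryption round (no InvMixColumns)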
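For a full AES-128 block, decryption with these instructions follows the "equivalent inverse cipher" of FIPS-197: the round keys are used in reverse order, the first and last of them are used unchanged, and only the nine middle keys go through AESIMC. The software model below sketches that order under the same assumptions as before (16-byte state in column-major order, names chosen for this illustration); it is meant for checking the logic, not as a replacement for the hardware instructions.

```cpp
#include <array>
#include <cstdint>

using Block = std::array<std::uint8_t, 16>;   // byte i: row i % 4, column i / 4
using Table = std::array<std::uint8_t, 256>;

std::uint8_t xtime(std::uint8_t b) {          // multiply by 0x02 in GF(2^8)
    return static_cast<std::uint8_t>((b << 1) ^ ((b & 0x80) ? 0x1B : 0));
}
std::uint8_t gmul(std::uint8_t a, std::uint8_t b) {   // general GF(2^8) product
    std::uint8_t p = 0;
    for (; b != 0; b >>= 1, a = xtime(a))
        if (b & 1) p ^= a;
    return p;
}

// S-box from its definition (inverse in GF(2^8) plus affine map) and its inverse.
Table make_sbox() {
    Table sbox{};
    auto rotl8 = [](std::uint8_t x, int s) {
        return static_cast<std::uint8_t>((x << s) | (x >> (8 - s)));
    };
    std::uint8_t p = 1, q = 1;
    do {
        p = static_cast<std::uint8_t>(p ^ xtime(p));   // p := p * 3
        q ^= static_cast<std::uint8_t>(q << 1);        // q := q / 3
        q ^= static_cast<std::uint8_t>(q << 2);
        q ^= static_cast<std::uint8_t>(q << 4);
        if (q & 0x80) q ^= 0x09;
        sbox[p] = static_cast<std::uint8_t>(
            q ^ rotl8(q, 1) ^ rotl8(q, 2) ^ rotl8(q, 3) ^ rotl8(q, 4) ^ 0x63);
    } while (p != 1);
    sbox[0] = 0x63;
    return sbox;
}
Table invert(const Table& s) {
    Table inv{};
    for (int i = 0; i < 256; ++i) inv[s[i]] = static_cast<std::uint8_t>(i);
    return inv;
}
const Table kSbox = make_sbox();
const Table kInvSbox = invert(kSbox);

Block inv_shift_rows(const Block& s) {
    Block t{};
    for (int c = 0; c < 4; ++c)
        for (int r = 0; r < 4; ++r)
            t[r + 4 * ((c + r) % 4)] = s[r + 4 * c];   // row r rotates right by r
    return t;
}
Block inv_sub_bytes(Block s) {
    for (auto& b : s) b = kInvSbox[b];
    return s;
}
Block inv_mix_columns(const Block& s) {
    Block t{};
    for (int c = 0; c < 4; ++c)
        for (int r = 0; r < 4; ++r)
            t[4 * c + r] = static_cast<std::uint8_t>(
                gmul(s[4 * c + r], 0x0E) ^ gmul(s[4 * c + (r + 1) % 4], 0x0B) ^
                gmul(s[4 * c + (r + 2) % 4], 0x0D) ^ gmul(s[4 * c + (r + 3) % 4], 0x09));
    return t;
}
Block xor_blocks(Block a, const Block& b) {
    for (int i = 0; i < 16; ++i) a[i] ^= b[i];
    return a;
}

// Software models of the three instructions.
Block aesimc(const Block& key) { return inv_mix_columns(key); }
Block aesdec(const Block& st, const Block& rk) {
    return xor_blocks(inv_mix_columns(inv_sub_bytes(inv_shift_rows(st))), rk);
}
Block aesdeclast(const Block& st, const Block& rk) {
    return xor_blocks(inv_sub_bytes(inv_shift_rows(st)), rk);
}

// AES-128 key schedule on bytes (same schedule as for encryption).
std::array<Block, 11> expand_key_128(const Block& key) {
    std::array<Block, 11> rk{};
    rk[0] = key;
    std::uint8_t rcon = 0x01;
    for (int n = 1; n <= 10; ++n, rcon = xtime(rcon)) {
        const Block& p = rk[n - 1];
        Block& k = rk[n];
        k[0] = static_cast<std::uint8_t>(p[0] ^ kSbox[p[13]] ^ rcon);
        k[1] = static_cast<std::uint8_t>(p[1] ^ kSbox[p[14]]);
        k[2] = static_cast<std::uint8_t>(p[2] ^ kSbox[p[15]]);
        k[3] = static_cast<std::uint8_t>(p[3] ^ kSbox[p[12]]);
        for (int i = 4; i < 16; ++i) k[i] = static_cast<std::uint8_t>(p[i] ^ k[i - 4]);
    }
    return rk;
}

Block decrypt_128(Block state, const std::array<Block, 11>& rk) {
    state = xor_blocks(state, rk[10]);          // last encryption key, unmodified
    for (int n = 9; n >= 1; --n)
        state = aesdec(state, aesimc(rk[n]));   // middle keys go through AESIMC
    return aesdeclast(state, rk[0]);            // first encryption key, unmodified
}
```

In practice the AESIMC step is done once per key, and the transformed keys are stored as a separate decryption key schedule instead of being recomputed for every block.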
So, hardware support should give us a decent increase in encryption speed. To conclude the post, I will give a C++ class that implements encryption and decryption in ECB mode. In a test run, it reached an encryption speed of 320 MB/s on a single core of an i5-3740 (3.2 GHz).