Malafeev March 26, 2019 at 13:15

How to speed up encryption according to GOST 28147-89 on the Baikal-T1 processor due to the SIMD block

Tutorial

Using an example of a description of the encryption algorithm implementation in accordance with GOST 28147–89 built on the Feistel network, the article shows the capabilities of the BE-T1000 dual-core processor (aka Baikal-T1) and comparative tests of the algorithm implementation are carried out using vector calculations with and without SIMD coprocessor. Here we will demonstrate the use of the SIMD block for the encryption task in accordance with GOST 28147-89, ECB mode.

This article was written last year for the journal Electronic Components by Alexei Kolotnikov , a leading programmer at Baikal Electronics. And since there are not many readers in this magazine, and the article is useful for those who are professionally involved in cryptography, I, with the permission of the author and his small additions, publish it here.

The Baikal-T1 processor and the SIMD block inside it

The Baikal-T1 processor includes a multi-processor dual-core system of the MIPS32 P5600 family, as well as a set of high-speed interfaces for exchanging data and low-speed for controlling peripheral devices. Here is the structural diagram of this processor:

Each of the two cores includes a SIMD block that allows you to work with 128-bit vectors. When using the SIMD method, not one sample is processed,
but the entire vector, which can contain several samples, that is, one command is applied immediately to the entire vector (array) of data. Thus, in one command cycle, several samples are processed. The processor uses the MIPS32 SIMD block, which allows you to work with 32 pieces of 128-bit registers.

Each register can contain vectors of the following dimensions: 8 × 16; 16 × 8; 4 × 32 or
2 × 64 bits. It is possible to use more than 150 instructions for processing data: integer, floating-point and fixed-point, including bitwise operations, comparison operations, and conversion operations.

Very detailed MIPS SIMD technology was described by SparF in the article "MIPS SIMD Technology and the Baikal-T1 Processor" .

Evaluation of the effectiveness of vector computing

The arithmetic coprocessor of the Baikal-T1 processor core combines a traditional floating-point processor and a parallel SIMD processor, focused on the efficient processing of vectors and digitized signals. An independent performance assessment of the vector SIMD coprocessor was carried out in 2017 by SparF while writing the article “MIPS SIMD Technology and the Baikal-T1 Processor” . If desired, such measurements can be repeated, performed independently by connecting to a remote stand with a processor .

Then the test task was to decode the video encoded using the free x264 libraries (H.264 demo clip) and x265 (HEVC video file), with screenshots taken at regular intervals. As expected, a noticeable increase in the processor core performance on FFmpeg tasks with the use of SIMD hardware capabilities was noted:

Brief description of the algorithm GOST 28147-89

Here we will only note the main characteristics for understanding the code and optimization:

The algorithm is built on the Feistel network.
The algorithm consists of 32 rounds.
A round consists of mixing a key and replacing 8 parts of 4 bits in a table with a shift of 11 bits.

A detailed description of the information conversion algorithm in accordance with GOST 28147-89 is given in the State Standard of the USSR .

Implementation of the GOST 28147-89 algorithm without using a SIMD block

For comparison purposes, the algorithms were initially implemented using standard, “non-vectorized” commands.

The MIPS assembler code proposed here allows you to encrypt one block of 64 bits in 450 ns (or ~ 150 Mb / s) at a nominal processor frequency of 1200 MHz. Only one core is involved in the calculations.

Preparing a replacement table implies expanding it into a 32-bit representation to speed up the work of the round: instead of replacing four bits with masking and shifting, a previously
prepared table performs eight-bit replacement.

uint8_t sbox_test[8][16] = {
        {0x9, 0x6, 0x3, 0x2, 0x8, 0xb, 0x1, 0x7, 0xa, 0x4, 0xe, 0xf, 0xc, 0x0, 0xd, 0x5},
        {0x3, 0x7, 0xe, 0x9, 0x8, 0xa, 0xf, 0x0, 0x5, 0x2, 0x6, 0xc, 0xb, 0x4, 0xd, 0x1},
        {0xe, 0x4, 0x6, 0x2, 0xb, 0x3, 0xd, 0x8, 0xc, 0xf, 0x5, 0xa, 0x0, 0x7, 0x1, 0x9},
        {0xe, 0x7, 0xa, 0xc, 0xd, 0x1, 0x3, 0x9, 0x0, 0x2, 0xb, 0x4, 0xf, 0x8, 0x5, 0x6},
        {0xb, 0x5, 0x1, 0x9, 0x8, 0xd, 0xf, 0x0, 0xe, 0x4, 0x2, 0x3, 0xc, 0x7, 0xa, 0x6},
        {0x3, 0xa, 0xd, 0xc, 0x1, 0x2, 0x0, 0xb, 0x7, 0x5, 0x9, 0x4, 0x8, 0xf, 0xe, 0x6},
        {0x1, 0xd, 0x2, 0x9, 0x7, 0xa, 0x6, 0x0, 0x8, 0xc, 0x4, 0x5, 0xf, 0x3, 0xb, 0xe},
        {0xb, 0xa, 0xf, 0x5, 0x0, 0xc, 0xe, 0x8, 0x6, 0x2, 0x3, 0x9, 0x1, 0x7, 0xd, 0x4}
        };
uint32_t sbox_cvt[4*256];
        for (i = 0; i < 256; ++i) {
                uint32_t p;
                p = sbox[7][i >> 4] << 4 | sbox[6][i & 15];
                p = p << 24; p = p << 11 | p >> 21;
                sbox_cvt[i] = p; // S87
                p = sbox[5][i >> 4] << 4 | sbox[4][i & 15];
                p = p << 16; p = p << 11 | p >> 21;
                sbox_cvt[256 + i] = p; // S65
                p = sbox[3][i >> 4] << 4 | sbox[2][i & 15];
                p = p << 8; p = p << 11 | p >> 21;
                sbox_cvt[256 * 2 + i] = p; // S43
                p = sbox[1][i >> 4] << 4 | sbox[0][i & 15];
                p = p << 11 | p >> 21;
                sbox_cvt[256 * 3 + i] = p; // S21
        }

Block encryption is 32 times a repeat of the round using the key.

// input: a0 - &in, a1 - &out, a2 - &key, a3 - &sbox_cvt
LEAF(gost_encrypt_block_asm)
        .set reorder
        lw      n1, (a0)        // n1, IN
        lw      n2,  4(a0)      // n2, IN + 4
        lw      t0, (a2)        // key[0] -> t0
        gostround n1, n2, 1
        gostround n2, n1, 2
        gostround n1, n2, 3
        gostround n2, n1, 4
        gostround n1, n2, 5
        gostround n2, n1, 6
        gostround n1, n2, 7
        gostround n2, n1, 0
        gostround n1, n2, 1
        gostround n2, n1, 2
        gostround n1, n2, 3
        gostround n2, n1, 4
        gostround n1, n2, 5
        gostround n2, n1, 6
        gostround n1, n2, 7
        gostround n2, n1, 0
        gostround n1, n2, 1
        gostround n2, n1, 2
        gostround n1, n2, 3
        gostround n2, n1, 4
        gostround n1, n2, 5
        gostround n2, n1, 6
        gostround n1, n2, 7
        gostround n2, n1, 7
        gostround n1, n2, 6
        gostround n2, n1, 5
        gostround n1, n2, 4
        gostround n2, n1, 3
        gostround n1, n2, 2
        gostround n2, n1, 1
        gostround n1, n2, 0
        gostround n2, n1, 0
1:      sw      n2, (a1)
        sw      n1, 4(a1)
        jr      ra
        nop
        END(gost_encrypt_block_asm)

A spread table round is just a table swap.

        .macro gostround x_in, x_out, rkey
        addu    t8, t0, \x_in   /* key[0] + n1 = x      */
        lw      t0, \rkey * 4 (a2)      /* next key to t0 */
        ext     t2, t8, 24, 8   /* get bits [24,31]     */
        ext     t3, t8, 16, 8   /* get bits [16,23]     */
        ext     t4, t8, 8, 8    /* get bits [8,15] */
        ext     t5, t8, 0, 8    /* get bits [0, 7] */
        sll     t2, t2, 2       /* convert to dw offset */
        sll     t3, t3, 2       /* convert to dw offset */
        sll     t4, t4, 2       /* convert to dw offset */
        sll     t5, t5, 2       /* convert to dw offset */
        addu    t2, t2, a3      /* add sbox_cvt         */
        addu    t3, t3, a3      /* add sbox_cvt         */
        addu    t4, t4, a3      /* add sbox_cvt         */
        addu    t5, t5, a3      /* add sbox_cvt         */
        lw      t6, (t2)        /* sbox[x[3]] -> t2     */
        lw      t7, 256*4(t3)   /* sbox[256 + x[2]] -> t3 */
        lw      t9, 256*2*4(t4) /* sbox[256 *2 + x[1]] -> t4 */
        lw      t1, 256*3*4(t5) /* sbox[256 *3 + x[0]] -> t5 */
        or      t2, t7, t6      /* | */
        or      t3, t1, t9      /* | */
        or      t2, t2, t3      /* | */
        xor     \x_out, \x_out, t2      /* n2 ^= f(...) */
        .endm

The implementation of the algorithm of GOST 28147-89 with the use of SIMD-unit

The code proposed below allows you to simultaneously encrypt four blocks of 64 bits per 720 ns (or ~ 350 Mbit / s) under the same conditions: processor frequency 1200 MHz, one core.

Replacement is carried out in two quadruples and immediately in 4 blocks with instructions shuffleand subsequent masking of significant fours.

Expanding a substitution table

for(i = 0; i < 16; ++i) {
                sbox_cvt_simd[i] =      (sbox[7][i] << 4) | sbox[0][i]; // $w8 [7 0]
                sbox_cvt_simd[16 + i] = (sbox[1][i] << 4) | sbox[6][i]; // $w9 [6 1]
                sbox_cvt_simd[32 + i] = (sbox[5][i] << 4) | sbox[2][i]; // $w10[5 2]
                sbox_cvt_simd[48 + i] = (sbox[3][i] << 4) | sbox[4][i]; // $w11[3 4]
        }

Initializing masks.

l0123:  .int 0x0f0f0f0f
        .int 0xf000000f         /* substitute from $w8  [7 0]  mask in w12*/
        .int 0x0f0000f0         /* substitute from $w9  [6 1]  mask in w13*/
        .int 0x00f00f00         /* substitute from $w10 [5 2]  mask in w14*/
        .int 0x000ff000         /* substitute from $w11 [4 3]  mask in w15*/
        .int 0x000f000f         /* mask $w16 x N x N */
        .int 0x0f000f00         /* mask $w17 N x N x */
LEAF(gost_prepare_asm)
        .set reorder
        .set    msa
        la      t1, l0123               /* masks */
        lw      t2, (t1)
        lw      t3, 4(t1)
        lw      t4, 8(t1)
        lw      t5, 12(t1)
        lw      t6, 16(t1)
        lw      t7, 20(t1)
        lw      t8, 24(t1)
        fill.w  $w2, t2                 /* 0f0f0f0f -> w2 */
        fill.w  $w12, t3                /* f000000f -> w12*/
        fill.w  $w13, t4                /* 0f0000f0 -> w13*/
        fill.w  $w14, t5                /* 00f00f00 -> w14*/
        fill.w  $w15, t6                /* 000ff000 -> w15*/
        fill.w  $w16, t7                /* 000f000f -> w16*/
        fill.w  $w17, t8                /* 0f000f00 -> w17*/
        ld.b    $w8, (a0)               /* load sbox */
        ld.b    $w9, 16(a0)
        ld.b    $w10, 32(a0)
        ld.b    $w11, 48(a0)
        jr      ra
        nop
        END(gost_prepare_asm)

4 block encryption

LEAF(gost_encrypt_4block_asm)
        .set reorder
        .set    msa
        lw      t1, (a0)        // n1, IN
        lw      t2,  4(a0)      // n2, IN + 4
        lw      t3,  8(a0)      // n1, IN + 8
        lw      t4,  12(a0)     // n2, IN + 12
        lw      t5,  16(a0)     // n1, IN + 16
        lw      t6,  20(a0)     // n2, IN + 20
        lw      t7,  24(a0)     // n1, IN + 24
        lw      t8,  28(a0)     // n2, IN + 28
        lw      t0, (a2)        // key[0] -> t0
        insert.w ns1[0],t1
        insert.w ns2[0],t2
        insert.w ns1[1],t3
        insert.w ns2[1],t4
        insert.w ns1[2],t5
        insert.w ns2[2],t6
        insert.w ns1[3],t7
        insert.w ns2[3],t8
        gostround4 ns1, ns2, 1
        gostround4 ns2, ns1, 2
        gostround4 ns1, ns2, 3
        gostround4 ns2, ns1, 4
        gostround4 ns1, ns2, 5
        gostround4 ns2, ns1, 6
        gostround4 ns1, ns2, 7
        gostround4 ns2, ns1, 0
        gostround4 ns1, ns2, 1
        gostround4 ns2, ns1, 2
        gostround4 ns1, ns2, 3
        gostround4 ns2, ns1, 4
        gostround4 ns1, ns2, 5
        gostround4 ns2, ns1, 6
        gostround4 ns1, ns2, 7
        gostround4 ns2, ns1, 0
        gostround4 ns1, ns2, 1
        gostround4 ns2, ns1, 2
        gostround4 ns1, ns2, 3
        gostround4 ns2, ns1, 4
        gostround4 ns1, ns2, 5
        gostround4 ns2, ns1, 6
        gostround4 ns1, ns2, 7
        gostround4 ns2, ns1, 7
        gostround4 ns1, ns2, 6
        gostround4 ns2, ns1, 5
        gostround4 ns1, ns2, 4
        gostround4 ns2, ns1, 3
        gostround4 ns1, ns2, 2
        gostround4 ns2, ns1, 1
        gostround4 ns1, ns2, 0
        gostround4 ns2, ns1, 0
1:
        copy_s.w t1,ns2[0]
        copy_s.w t2,ns1[0]
        copy_s.w t3,ns2[1]
        copy_s.w t4,ns1[1]
        copy_s.w t5,ns2[2]
        copy_s.w t6,ns1[2]
        copy_s.w t7,ns2[3]
        copy_s.w t8,ns1[3]
        sw      t1, (a1)        // n2, OUT
        sw      t2,  4(a1)      // n1, OUT + 4
        sw      t3,  8(a1)      // n2, OUT + 8
        sw      t4,  12(a1)     // n1, OUT + 12
        sw      t5,  16(a1)     // n2, OUT + 16
        sw      t6,  20(a1)     // n1, OUT + 20
        sw      t7,  24(a1)     // n2, OUT + 24
        sw      t8,  28(a1)     // n1, OUT + 28
        jr      ra
        nop
        END(gost_encrypt_4block_asm)

Round using SIMD block commands with calculation of 4 blocks simultaneously

.macro gostround4 x_in, x_out, rkey
        fill.w  $w0, t0                 /* key -> Vi */
        addv.w  $w1, $w0, \x_in         /* key[0] + n1 = x */
        srli.b  $w3, $w1, 4             /* $w1 >> 4 */
        and.v   $w4, $w1, $w16          /* x 4 x 0*/
        and.v   $w5, $w1, $w17          /* 6 x 2 x*/
        and.v   $w6, $w3, $w17          /* 7 x 3 x */
        and.v   $w7, $w3, $w16          /* x 5 x 1 */
        lw      t0, \rkey * 4(a2)       /* next key -> t0 */
        or.v    $w4, $w6, $w4           /* 7 4 3 0 */
        or.v    $w5, $w5, $w7           /* 6 5 2 1 */
        move.v  $w6, $w5                /* 6 5 2 1 */
        move.v  $w7, $w4                /* 7 4 3 0 */
/* 7 x x 0 */
/* 6 x x 1 */
/* x 5 2 x */
/* x 4 3 x */
        vshf.b  $w4, $w8, $w8           /* substitute $w8  [7 0] */
        and.v   $w4, $w4, $w12
        vshf.b  $w5, $w9, $w9           /* substitute $w9  [6 1] */
        and.v   $w5, $w5, $w13
        vshf.b  $w6, $w10, $w10         /* substitute $w10 [5 2] */
        and.v   $w6, $w6, $w14
        vshf.b  $w7, $w11, $w11         /* substitute $w11 [4 3] */
        and.v   $w7, $w7, $w15
        or.v    $w4, $w4, $w5
        or.v    $w6, $w6, $w7
        or.v    $w4, $w4, $w6
        srli.w  $w1, $w4, 21            /* $w4 >> 21 */
        slli.w  $w3, $w4, 11            /* $w4 << 11 */
        or.v    $w1, $w1, $w3
        xor.v   \x_out, \x_out, $w1
        .endm

Brief conclusions

Using the SIMD-block of the Baikal-T1 processor allows achieving a data conversion rate of about 350 Mbit / s according to GOST 28147-89 algorithms, which is two and a half (!) Times higher than the implementation on standard commands. Since this processor has two cores, it is theoretically possible to encrypt a stream at a speed of ~ 700 Mbit / s. At least the test implementation of IPsec, with encryption according to GOST 28147-89, showed the channel throughput of ~ 400 Mbps on a duplex.

Tags: