MichaelBorisov November 18, 2014 at 14:16

Compact RSA implementation for embedded applications

Tutorial

RSA is a well-known public key encryption algorithm. On its basis, in addition to asymmetric encryption, it is also possible to implement an electronic signature ( EDS ). These features are attractive for embedded systems, microcontrollers. The encryption method itself is extremely simple:

C = (M ^e ) mod n

(1)

where C, M, e, n are integers, M is plaintext, numbers e and n are the public key, C is ciphertext. mod - remainder of division.

Stitching looks just as simple:

M = (C ^d ) mod n

(2)

where C, M, n play the same role as in encryption, d is the private key.

Moreover, n = p * q, where p and q are primes (secret), e is usually equal to 65537, d is calculated on the basis of e, p and q. Cryptographic strength is based on the fact that for sufficiently large p and q, the problem of factoring n or reversing the encryption formula without knowing p and q cannot be solved in an acceptable time.

But this apparent simplicity is deceiving. Behind it is a huge amount of details and implementation difficulties. Especially if the goal is to get an efficient performance and memory implementation suitable for use in microcontrollers. I did not find suitable libraries on the Internet, but attempts to study the sources of libgcryptlead into such wilds from which you can’t get out. Therefore, I wrote my compact library, which I share with dear readers.

1. Long arithmetic (Multi-precision integer arithmetics)

The first difficulties begin when you try to test the RSA for small numbers. For example, if we take p = 7, q = 11, then we get n = 77. The chosen e should be mutually simple with (p-1) * (q-1), so 2, 3, 4 and 5 do not fit, the minimum fits 7. Leaving aside the method of calculating d, I will simply give the result: d = 43. And now, although encrypting a small message is not a problem, decryption requires raising C to the power of 43, and this leads to overflow even of 64-bit integers already at C = 3.

However, even without this example it was clear that one could not do without using long arithmetic. Indeed, to ensure cryptographic strength, p and q must be large. In modern practice of using RSA, these numbers are of the order of 2 ¹⁰²⁴ each, and the order of n is 2 ²⁰⁴⁸when using a key length of 2048 bits. It is acceptable to lower the order of numbers by 2 times (512 bits for p and q, 1024 for n), but in recent years you can often find the opinion that 1024-bit RSA keys no longer provide the proper level of cryptographic strength.

Long arithmetic is a loose concept. There is a desire to use the C ++ class to represent long numbers and implement the addition, subtraction, multiplication, etc. operators. My practice shows that this approach is only suitable for an initial introduction to RSA. For implementation in microcontrollers, we need to implement not the whole set of operations, but only the most necessary ones. And implement efficiently. Not in C ++, but in C or even assembler.

Historically, I first implemented the necessary long arithmetic algorithms “head-on” in C ++, then translated them into the dsPIC33F assembler and optimized them, and then translated them back to C. Therefore, the C code turned out to be much more efficient than the original C ++ code and only slightly inferior to assembler.

1.1. Exponentiation

Obviously, first of all we need an exponentiation algorithm. For this, the algorithm studied at school [1] is suitable , which reduces the operation to a series of multiplications and has complexity O (log e), where e is an exponent.

Even when using long arithmetic algorithms, raising a 2048-bit number M to the power e = 65537 leads to a bit number of 2048 * 65537 bits, which will require more than 16 megabytes to store it in memory. The execution time of multiplication for such numbers completely exceeds all conceivable limits. Fortunately, we do not need the result M ^e, but only the remainder of it divided by n. Since exponentiation is reduced to a series of multiplications, after each multiplication it is necessary to reduce the capacity of the intermediate result, finding the remainder of dividing it by n. There are theorems guaranteeing that we obtain the correct result. Thus, the algorithm of raising to a power and subsequent calculation of the remainder of dividing the result by n turns into a single module of raising to a power modulo , which calculates the results of both operations at the same time and thereby fully implements RSA encryption according to formula (1).

To implement this algorithm, we need the operations of long multiplication and finding the remainder of division. Private does not interest us. At the same time, the capacity of the operands for multiplication will be 2048 bits each, 4096-bit numbers must be divided by 2048-bit, the remainder of the division will be 2048 bits long.

Despite the good performance, the above algorithm is suitable only for encryption, where a “good” exponent e = 65537 is used. Its value is large enough for cryptographic strength; at the same time, it is quite small in comparison with the bit depth of the key; it is a prime number and it also has an advantageous binary representation that increases the efficiency of exponentiation. At the same time, the number d cannot be arbitrarily selected, it depends on p, q and e, and in order of magnitude it is comparable to n, 2048 bits. If 65537 can be raised to a power in 17 long multiplications and divisions, then raising to the power of d will require an average of about 3072 such operations, which will reduce the decryption performance compared to encryption by more than 180 times. However, in the case of decryption, you can significantly improve performance, using knowledge of p and q (for the encryption side, these numbers are usually unknown). As a result, the decryption algorithm differs significantly from encryption for all the mathematical similarities of these operations.

My embeddable implementation uses only encryption (or the operation of verifying electronic signatures), so raising to a power of e modulo n can be formalized as a specialized procedure, tightly bound to the value e = 65537. In the source code, this procedure is called mpi_powm65537 , it takes the numbers M and n as input. In her work, she uses the procedures of long multiplication and finding the remainder of the long division.

1.2. Long multiplication

There are a number of long multiplication algorithms. The most widely known is the multiplication "in the column." This method reduces the multiplication of long numbers to a series of multiplications and additions of individual digits. If there is a hardware multiplier in the processor, long multiplication can be implemented in the base number system 256, 65536 or more, depending on the bit depth of the multiplier. He will multiply the individual "numbers" of long numbers.

Column multiplication is not the most efficient method. Its complexity is O (k ² ), where k is the length of the operands in the selected number system. More efficient algorithms exist: Karatsuba multiplication, multiplication by the Fourier transform [2, 7], etc. However, these advanced methods are more difficult to implement. Given the lack of time, I limited myself to multiplying by a column. This gave acceptable performance on the selected microcontroller.

So, multiplication “in a column” reduces a long multiplication to a series of short multiplications and additions. It is wise to use the bit depth of the existing hardware multiplier fully. For example, in PIC24 and dsPIC microcontrollers there is a 16 * 16 => 32 multiplier, that is, a multiplier of 16-bit numbers and a 32-bit result. Thus, the maximum possible base of the number system for these microcontrollers is 2 ¹⁶ = 65536.

Moreover, it is important that the result is 32-bit, because all these 32 bits will be needed in the course of further calculations. Say, in x86 processors, the bit depth of the result is equal to the bit width of the operands, i.e. 32 bits. To prevent overflow, it is necessary to limit the bit length of the operands to 16 bits. The same is true for some other 32-bit processors. They also have to use the base number system 65536, as in a 16-bit microcontroller.

When multiplying “by column”, intermediate results can be calculated in a different order. Say, as we are used to at school, you can first multiply the entire first factor by one digit in the second, then by the second, etc., writing the results in rows. The result is a matrix, which then needs to be added line by line. But for dsPIC microcontrollers, a different approach is more efficient [6]. Namely, the calculation of the matrix is not in rows, but in columns. First the first column, then the second, etc. When the column is calculated - its values can be immediately added together, you get a figure of the final result and transfer. Of course, it makes no sense to store all the numbers from the column - you can instead immediately add them to the sum. With this approach, the operations of multiplication alternate with the operations of adding the result to the battery. And therefore, it becomes possible to use powerfulDSP- tools of this microcontroller, REPEAT and MAC instructions, which extract two operands from memory in one clock cycle of the processor, multiply them, add the result to the battery and increase the values of the pointers. Although the algorithm still remains inefficient - O (k ² ), such an optimized implementation may compete with more efficient algorithms that spend more processor cycles on each pass of the internal cycle.

The unsigned long multiplication 2048 * 2048 => 4096 is handled by the mpi_muluu subroutine in my library .

1.3. Representation of numbers - Little Endian or Big Endian?

Any developer of a long arithmetic library must decide in what form he represents the numbers in memory. Whether the first and the most significant digits of the number are located in the memory (Little Endian) or vice versa (Big Endian). The Big Endian format is more natural for a person, because all the numbers we work with are presented in this format. However, for the internal representation of numbers in a computer, it may turn out differently.

In our case, the conditions are dictated by the long multiplication algorithm. During its operation, numbers are represented in memory in the form of 16-bit “digits”. It would be advisable to read these “numbers” from the memory as a whole, and not one byte at a time. In this case, it is necessary to use instructions of 16-bit processor reading, and they are in the case of dsPIC, and x86 - Little Endian. There are other considerations for which Little Endian is preferable, but this is the most important thing. To improve performance, we want to read the processed information from memory in as large chunks as possible, while avoiding unnecessary conversions. Therefore, I chose Little Endian, despite the fact that many textbooks use the Big Endian format, and I do not regret it. This choice led to a beautiful and optimal code. Of course, for Big-endian processors you should choose the Big-endian format.

1.4. Long division

Effective long division methods reduce it to multiplying by the inverse number divisor [7]. In turn, this inverse number is also calculated using multiplication, for example, by an iterative method based on the Newton method of solving equations [3,7]. Theoretically, this could be very beneficial for RSA, because you have to divide it many times by the same number, so the reciprocal of the number needs to be calculated only once. However, I did not bother with this, limiting myself to the implementation of the “column” division with respect to computers, which is described in [4].

If you recall the method of division “in a column” studied at school and try to implement it “forehead” on a computer, you will encounter the fact that this method, with the exception of working in a binary number system, does not reduce long division to short or other simple operations. In fact, having carried out a certain number of digits of the dividend, it is necessary to divide them into the entire long divisor in order to get the next digit of the quotient. How to be here? A theorem is presented in [4], which makes it possible to “guess” the number of a quotient with a small error. It is valid for cases when:

the digit capacity of the divider is one digit less than the resolution of the partial remainder;
the first digit of the divisor is greater than or equal to half the base of the number system.

The first condition is met due to the fact that we put the numbers of the dividend into a partial remainder in the correct way (similar to the school procedure). To fulfill the second condition, you can multiply the dividend and the divisor by the same number. However, for RSA keys, it is guaranteed that the most significant bit of n (the divisor) is 1. Therefore, the 16-bit “digit” formed from the high 16 bits of the key will be greater than or equal to 32768. Therefore, the division operands are not required for RSA .

To “guess” the number of the quotient, you need to divide the first two digits of the partial remainder by the first digit of the divisor. The value obtained is either equal to the true digit of the quotient, or exceeds it by no more than 2. In fact, if we divide 499 by 50, then dividing 49 by 5, we get 9, which coincides with the true first digit of the quotient. If we divide 890 by 99, then dividing 89 by 9, we get 9, which is 1 more than necessary. In other cases, the estimate is overestimated by 2, but never again, this is guaranteed by the theorem.

The division of two digits of the partial remainder by one digit of the divider is a “short” division, the hardware support of which is very desirable. DsPIC33 has hardware support for dividing 32-bit numbers into 16-bit ones, with 16-bit quotients and the remainder. This is quite enough to implement long division in the number system on the base 2 ¹⁶ = 65536. For x86, the maximum bit depth of the dividend is 32 bits, which also gives the basis of the number system 65536.

After “guessing” the numbers of the quotient, it is necessary to multiply this digit by the divisor and subtract the product from the partial remainder. If the result is less than 0, add the divisor back and reduce the estimate of the quotient of the quotient by 1. If the balance is still less than 0, add the divisor again and reduce the quotient again. As a result, we get the correct value of the figure of the quotient and the correct value of the partial remainder to continue the division process.

RSA does not need to know the quotient, but only the remainder of the division is needed. In this regard, I wrote the mpi_moduu functioncalculating only the remainder. For a 2048-bit RSA, it is necessary to divide 4096-bit numbers into 2048-bit ones, with a 2048-bit remainder. My procedure performs the division "in place": the dividend is replaced by the remainder at the end of the procedure. This increases its effectiveness, since the "demolition" of the next digit from a dividend turns into a simple increment of the pointer. Also, no additional memory is required to store the partial remainder and the private (since it is not saved).

For the above division algorithm to work, the following components of long arithmetic are necessary:

multiplying a long number by a short one (by one digit);
subtraction of long numbers;
addition of long numbers;
comparing long numbers

which are described below.

The complexity of the long division procedure is O (k ² ) by short multiplications and O (k) by short divisions, where k is the number of bits of the divider in the base notation 65536.

1.5. Long addition and subtraction

Writing procedures for adding mpi_add and subtracting mpi_sub does not require special ingenuity and any complex methods. Each processor has an instruction like ADC, which adds two numbers and a carry bit from the previous addition. If you cascade ADC commands, then you can add operands of any capacity. The above is true for the subtraction and SBC instruction. Difficulty - O (k). My procedures return the carry flag when adding or subtracting, which is used in the division procedure.

In C, there is no access to processor flags, so you have to use numbers of greater capacity than necessary. For example, when adding two 16-bit numbers, you need to use at least 32-bit numbers to store the result and transfer to ensure that overflow is guaranteed. This again leads us to the need to work in the base number system 65536 on 32-bit processors. Well, in assembler, the digit capacity of the digits of long numbers is the same as the processor bit, 16 bits in the case of dsPIC.

1.6. Long comparison

Comparison can be implemented more efficiently than subtraction if you start processing with high digits. If the leading digit of one number is greater than the leading digit of another, then the whole number is larger, the remaining digits can not be compared. Thus, the complexity of the comparison procedure is the same as for subtracting O (k), however, statistically comparison is faster. The procedure in the source is called mpi_cmp .

1.7. Multiplying a long number by a short one

Multiplying a long number by a short one (i.e. by one digit), which is used to divide long numbers, is no different in principle from long multiplication. However, it lacks one level of nesting cycles. Multiplying the bit number k by a digit in the system, based on 65536, leads to the bit number k + 1. Since in the division procedure, such a multiplication immediately follows the subtraction of the product from the partial remainder, I implemented both operations simultaneously in the mpi_mulsubuuk procedure , which multiplies a 2048-bit number by a 16-bit one and subtracts the product from a 2064-bit number, which leads to some saving in memory and time execution.

That's all the long arithmetic needed for RSA encryption.

2. Key generation

Key generation in RSA begins by choosing two large primes p and q. When they say: “RSA-key with a length of 2048 bits”, they mean that the product p * q is a 2048-bit number, the highest bit of which is 1 (otherwise, in some interpretations, it would be a 2047-bit number). The bits of the numbers p and q are not individually set, but I noticed that p and q are generated in GnuPG so that each of them is a 1024-bit number with the highest bit equal to 1.

Books can be written about generating large primes, not like articles. But in many situations, including mine, it is not necessary to generate keys on the microcontroller. Therefore, I tried, and successfully, generate keys using GnuPG.

First, you need to generate a key pair (public + private) in GnuPG in the usual way. At the same time, make sure that the RSA key of the desired length is obtained (2048 bits are now by default there). Despite persistent calls by GnuPG, you should refuse to protect this key with a password and store it on disk in clear text. What difference does it make, we are still going to “gut” this key into its constituent numbers p, q, n, d and save them to disk in open form, at least temporarily.
The command that is not documented in new versions of gpg can export a key from the GnuPG key database

gpg --export-secret-keys --armor [key_id] >filename.asc

Where key_id is the key identifier that can be obtained by calling gpg --list-secret-keys.

You will get a text file with contents like:

-----BEGIN PGP PRIVATE KEY BLOCK-----
Version: GnuPG v2.0.17 (MingW32)
lQOXBE9IdyABCADGT3+Dj0dsVPSkzW+zPlfXc4AsuKpkE9GJNAYD3mrcF70nwk1L
...

How to get the numbers that make up the key from here? It turned out that in the gpg program itself and the included utilities there are NO such tools. There is only a third-party utility " pgpdump ", which interprets the contents of PGP Private Key Block and displays in human-readable form. By calling it as follows:

pgpdump -i filename.asc >filename.txt

We will get the parameters we need:

Old: Secret Key Packet(tag 5)(919 bytes)
	Ver 4 - new
	Public key creation time - Sat Feb 25 07:52:32 FLE Standard Time 2012
	Pub alg - RSA Encrypt or Sign(pub 1)
	RSA n(2048 bits) - c6 4f 7f ...
	RSA e(17 bits) - 01 00 01 
	RSA d(2040 bits) - a5 c5 ce ...
	RSA p(1024 bits) - d0 28 ad ...
	RSA q(1024 bits) - f3 e3 61 ...
	RSA u(1024 bits) - 90 8b 22 ...
	Checksum - 3e 22
...

Here I don’t give real numbers at all, but you can easily get them for your key. It remains only to copy them where necessary, not forgetting that in this file they are presented in Big Endian format . So to use them in conjunction with my procedures, you need to convert them to Little Endian, that is, rearrange all bytes in the reverse order.

Also do not forget that the public key is only the numbers e and n. The remaining numbers are a secret (private) key, and knowing any of them (d, p, q, u) will make it easy to calculate the rest and decrypt your message, or forge an electronic signature.

3. Decryption

Since most RSA implementations hide somewhere inside the numbers that make up the keys and wrap encrypted messages in all kinds of packages, at least to check the subroutines listed in Section 1, it is necessary to implement decryption, in addition to the encryption operation.

As mentioned above, the decryption formula (2) can be implemented “on the forehead”. And although it will work hundreds of times slower than encryption, due to the large value of the exponent d, but this will be enough for many applications. If there is a desire to expedite the operation, then the trick is applied using the Chinese remainder theorem described in [5]. I do not have beautiful source codes that implement all this, so I give only a verbal description of the algorithm:

First, the number d "decomposes" into two parts, d _p = d mod (p-1) and d _q = d mod (q-1) . Moreover, the bit depth of each of the parts is 2 times less than the “key length” (bit width n). Next, q _inv is calculated such that (q * q _inv ) mod p = 1. That is, q _inv is the inverse of q in the ring modulo p. In English terminology, this is called " Modular multiplicative inverse ." To calculate the inverse element, the extended Euclidean algorithm is usually used .

When the numbers d _p , d _q and q _inv found, message decryption by RSA method is performed as follows:

m ₁ : = (C ^{d _p} ) mod p
m ₂ : = (C ^{d _q} ) mod q
h: = ((m ₁ -m ₂ ) * q _inv ) mod p
M: = m ₂ + h * q

Please note that the calculations in steps 1-3 are performed in a reduced capacity: 1024 bits instead of 2048 for formula (2). Since modulo exponentiation algorithms have cubic complexity (quadratic from multiplication and division, and even linear from “fast” exponentiation), we get a gain in speed when working with half the digit capacity of numbers by 8 times. But since two expensive exponential modulo operations are now called in, the total gain is obtained 4 times in comparison with the direct application of formula (2). Although this is a lot, it is still not so significant that in all cases it is necessary to abandon the direct application of formula (2), especially given the difficulties in calculating q _inv .

Mention should be made of an interesting vulnerability related to “fast” exponentiation. This is mentioned in the libgcrypt source file mpi-pow.c. The attack is called “Yarom / Falkner flush + reload cache side-channel attack” and allows using the manipulations with the second-level cache of the processor (which is common for all cores) for unauthorized applications to fully determine the value of d or its parts d _p and d _q. Both mean full disclosure of the secret key, allowing an attacker to fully read all messages encrypted with this key or generate an electronic signature. That is, let's say your program decrypts something with a secret key, and the malicious program at this moment works on the same system, but on behalf of another user, so you do not know about it. And at that moment a key leak occurs.

Conclusion

I introduced a C language module suitable for implementing RSA encryption or verifying electronic signatures on 32-bit embedded processors with clock frequencies of the order of 40 MHz and a hardware multiplier. When translating the C-source into assembler and using materials [6], you can get an implementation for dsPIC33F that encrypts one RSA-2048 data block for a time of about 10 ms at a processor clock frequency of 80 MHz.

Sources can be downloaded from GitHub .

RSA decryption (or electronic signature generation) can also be implemented on microcontrollers of this class, but this operation will be approximately 45-180 times slower.

It is possible to speed up the library by attracting effective multiplication and division algorithms mentioned in the article by links.

I recommend using my routines only for embedded applications, and for "large" computers in most cases it is better to use standard tools like Windows CryptoAPI or libgcrypt.

List of references

1. A. G. Kushnirenko, G.V. Lebedev, R. A. Svoren. Fundamentals of computer science and computer engineering. M.: 1990, p. 133.
2. Wikipedia. Schoenhage-Strassen Multiplication Method
3. Wikipedia. " Long arithmetic "
4. Donald Knuth. “The Art of Programming”, volume 2. Algorithms obtained
5. Wikipedia, article “ RSA (cryptosystem) ”, section “A worked example”
6. Erich Wegener, Mario Werner. "Evaluating 16-bit Processors for Elliptic Curve Cryptography." Institute for Applied Information Processing And Communications (IAIK), Graz University of Technology
7. William H. Press et al. Numerical Recipes in C ++, Second edition, Cambridge University Press 2002, page 922

Tags: