**[»] Programming**.

I recently had the occasion to implement several cryptographic systems, including AES, SHA-256 and RSA, into the [»] SPECTRE file format to perform Digital Signatures operation on spectra. Whereas SHA-256 and AES were relatively straightforward to implement, RSA required much more efforts. I had to look at a lot of literature to be able to implement it efficiently even if the underlying equations of RSA look simple at first sight. To avoid forgetting all the subtleties of the code, and to help those who are struggling with their own RSA implementation, I decided to write this post.

Please note that, generally speaking, it is not a good idea to implement your own version of cryptographic systems. Not only are available solution most probably faster, they may also be much safer against attacks. Sensitive applications should always rely on proven standards and custom implementations should be reserved to very specific scenarios. Also, some people will tell you that RSA is a thing of the past and that modern applications should rely on one of the elliptic curves cryptography systems. I have read this many time but I cannot tell you to what extend this is true or not.

Although I initially looked for existing solutions like the Crypto API of Windows, the OpenSSL libraries and other open-source solutions, I was not satisfied with what I found. I was either being block by licenses issues which may not necessarily be 100% compatible with the license I’m currently using on my software, or the API did not give enough low-level access for what I intended to do. Since my application did not involve sensitive data and that I already had background in cryptographic systems, I decided to go for a custom implementation that would fit my needs and not more than that. Obviously, I would make no concession on safety and apply myself like always to get the best implementation possible.

Before we dig into the RSA process, I’d like to give a few words about the application I designed the system for.

Cryptographic systems are known to be useful to cipher information so that no one else than you and your friends can read the message. A less known application to cryptography is the ability to sign a file, proving that you generated that particular file and that it was not modified by a third party. RSA was the first cryptosystem designed to accomplish both of these operations. The two mode of operations are briefly represented in Figure 1. On the left hand-side, users send a secret message to you and on the right hand-side, you send a message to people that contains a proof that you did write that message and it was not crafted by a third party.

In RSA cryptography, you generate two keys: a secret private key that you keep for yourself and a public key that you broadcast to the world. Using your public key, people can send you encoded message that only you can read using your private key. And using your private key, you can encode message that only your public key can decode. The public key cannot be used to decipher a text that was encrypted using the public key which makes the system safe. One big advantage of this private/public key mode of operation is that you don’t have to worry anymore on the exchange process of the key because the public key cannot decode messages that other people will send you.

In practice, since encoding using the private key takes time and that in our system everybody has access to the public key, it is not necessary to encode the full data using the private key. Instead, a hash digest of the data is generated and that hash is encoded using the private key. That encoded message digest is sent along with the data such that anybody receiving the message can decode the digest, recompute a hash and verify that the two corresponds. If the two hash differs, it means that the data has been modified by a third party in-between and should not be trusted. The signature of a file is illustrated in Figure 2.

In theory any checksum function can be used to produce the digest but best practice is to use what is known as a *secure hash function*. These message digest functions were specially crafted to avoid attacks that would aim at generating a fraudulent message that would have the same hash as an original, safe, message and substitute the safe data with the crafted packet, leading the recipient of the message believe that the data is authentic. Here, I implemented the SHA-256 algorithm from the step-by-step description available on [∞] Wikipedia.

Now that you understand the concept of *digital signature*, we can move on to what is called a *certificate*. We have previously said that the public key of users is made available to everyone using broadcast. In practice, the public key is combined with info about the user such as its name, contact and address. That way, you can check that user XYZ is indeed the author of the file that you are looking at by comparing the public key with the one that is publically made available. Note that this is by no mean a copyright DRM system because it does not prevent a user from removing a signature to a file and appending is own signature on it. The only thing signature proves is that the file you are looking at was sent by user XYZ and not modified by a third-party.

In the SPECTRE file system, I chose to attach the public key to the signed file. While it is still possible to compare the public key to a database to check that the public key is valid, this was not my intend when I created the system.

So far, any user can create a key and associate whatever name he wants. It is not because the public key says that the user name is Barack Obama that the file was truly made by the ex-president of the US. All you know is that you have some file sent by a user who *pretends* he is Barack Obama.

This is where the concept of certificate comes into play.

Imagine you have a few users (at least one) who the software trusts. Their public key is made available to the public by their Internet website and are hardcoded into the software. These are known are *certificate authorities* and their key are called *master keys*. The role of the certificate authority is to sign user’s public key with their own private master key. To do so, users must register to the certificate authority by presenting an ID or any other document that proves they are who they pretend to be.

Once you, the final user, receive a signed file you can then check if the accompanying public key was signed by one of the known certificate authorities using their master public key. If it does, then you know that the user XYZ is who he pretends to be. If not, you just know that you are talking to some dude who pretends to be user XYZ. If the file is not signed at all, you have no information about its author.

The overall process of certificate authorities is represented in Figure 3. It is actually nothing more than signing a key file using a special private key owned by a third party (the certificate authority).

So far I discussed about both RSA and SHA-256 but you may wonder what is AES. AES is a standard secret key cryptographic system developed by the NSA to cipher top-secret documents. It is a de facto standard for secret key cryptography today and its working principle is detailed in the NIST FIPS-197 document.

If you followed carefully this introduction, you understand how important it can be to protect your private key when you have registered to a certificate authority. If a malicious person manages to get your private key file, he can sign files and pretend he is you. With all the certificate authority system, you will have a hard time explaining that you did not sent the file because all evidences will point against you. To prevent this from happening, all private key files must be encrypted using a password. In the SPECTRE file system, I used AES-256 to encode the private key files.

Let us now go back to RSA after this very long introduction.

The RSA algorithm is relatively straightforward to quote although why it works go well beyond the purpose of this post. Even my reference book “Cryptography and Network Security” by W. Stallings only briefly discuss the mathematical foundations of RSA. So we will focus here on how to make it works rather than on how it works (notice the subtle difference).

Let us start with two prime numbers *p* and *q*. A number is prime if it is only divisible by 1 and by itself. The first prime numbers are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97 etc. You should understand the “and” as a strong “and” so 1 is not a prime number because it is not divisible by 1 __and__ itself. An interesting point, totally unrelated to RSA, is that the square root of any prime number is an irrational number (which is not the case of 1 because √1 = 1). Another interesting point, totally related to RSA this time, is that any positive integer can be expressed as the power of primes numbers (*e.g.* 20=2×2×5).

From these two prime numbers, we compute the following quantities

where *lcm(a,b)* is the lowest common divisor of the numbers *a* and *b*.

Interestingly, some authors (*e.g.* William Stallings in “Cryptography and Network Security”) define λ(n) as (p-1)×(q-1). I do not know the implications of such a change but I decided to use the former version which is the one published on [∞] Wikipedia.

You then choose a number *e* such that

where *gcd(a,b)* is the greatest common divisor of *a* and *b*.

A common practice nowadays, according to the literature, is to choose *e* first and to restrict prime numbers selection such that

It is convenient to choose a low value for *e* with only a few bits on to speed up the encryption and signature validation process. Historically, the value *e*=3 was used as a standard but a poor RSA implementation yielded insecure applications. I will cover this when we talk about padding. Today, according to [∞] Wikipedia (and many other websites and forums that discuss RSA), the standard is to use *e*=65537 (10000000000000001 in binary).

In my implementation, I did not restrict the prime number selection and rather chose a random value for *e* in the range 3-65537 that respect the gcd condition. This yields a slightly slower signature validation but it was fast enough with a typical 10 ms for 2,048 bits key (see section 4).

Once you have selected the value *e*, you find its inverse such that

where *a mod b* is the remainder of the integer division of *a* by *b*.

Encryption of a secret word *w* is then given by the formula

And decryption is given by the formula

Similarly, the signature of word *w* is given by

And the signature check can be done using

The private key consists of the pair (*d*,*n*) and the public key consists of the pair (*e*,*n*). The security of the algorithm is based on the fact that it should be impossible to deduce *d* from the pair (*e*,*n*). Failure to respect this assertion would compromise the private key and the security of the whole system.

This is the regular textbook description of the RSA algorithm. You may wonder what is so difficult in implementing these equations? The answer is precisely in the condition that it should not be possible to deduce *d* from *e* and *n*.

The only way to make RSA safe is to use very, very, large numbers. The numbers *d* and *n* are typically over hundreds of digits long in RSA cryptography. These numbers require coding on N bits for their correct representation and the keys are often described by their number of bits. For instance, we will talk about 2,048 bits keys when N=2048 or 4,096 bits keys when N=4096.

To give you an idea about what this represent, here are two primes numbers coded on 1,024 bits and expressed in decimal (conventional) format:

p = 32 762 203 560 972 941 344 320 963 890 846 642 265 552 663 665 276 448 054 179 427 422 781 221 086 507 756 198 443 795 713 311 212 941 792 193 731 242 324 799 283 561 647 112 731 267 018 074 023 744 338 886 072 824 886 034 479 740 375 235 298 805 718 497 037 983 870 394 644 678 463 176 252 155 995 209 404 351 603 702 761 439 742 865 078 220 334 170 274 197 138 675 539 734 078 678 994 278 900 998 232 622 191

q = 21 686 117 942 078 049 274 507 483 768 669 248 053 041 318 840 745 598 019 086 429 902 787 323 974 192 549 774 615 899 647 605 041 493 494 518 974 535 428 562 953 516 353 971 103 485 012 409 318 701 353 604 730 173 311 045 180 548 108 687 319 781 225 353 210 298 466 343 418 465 937 312 793 189 121 738 104 581 895 244 981 913 202 595 501 952 512 651 907 728 819 877 127 757 178 571 177 187 466 575 267 264 771

I let you imagine how complex manipulating such number efficiently can be, in particular for the exponentiation part of the algorithm. Just as an order of magnitude, the total amount of atoms contained on earth (~10^{50} according to this [∞] page) is not even close to the product of *p*×*q* (~10^{600}).

Before we move on to the implementation of large integer manipulation, there are a few things we still need to cover about the RSA algorithm.

The selection of the two prime numbers is of particular importance because most attacks on RSA consists of factorizing *n* into *p* and *q* so we should make the job of attackers as difficult as possible.

The problem with computer algorithms is that they are deterministic. We like to think about *p* and *q* as being randomly independent but poor implementations will typically violate this rule. This is problematic because the selection process for *p* and *q* is at the heart of RSA security.

Real randomness does not exist in computer (unless you work with specific add-on hardware modules but this goes beyond our topic). When you call the rand() function in C/C++, all you do is calling a function that attempt to mimic such kind of random behaviour in a more or less satisfactory way. A typical random number generation function will look like this

with typical values a=16807 and m=65535 (historical values used in the IBM 360 family).

With the proper seed value, *x _{0}*, you can then track all the subsequent values. If the algorithm that selects the prime numbers

*p*and

*q*is purely deterministic (which is often the case), running the algorithm a second time with the same seed value will generate exactly the same numbers

*p*and

*q*. Since we know

*n=p×q*and that the seeds are typically coded in 16 bits format, it is relatively straightforward to test all the possible values until we find

*p*and

*q*that matches the public key

*n*. From

*p*and

*q*we can recompute

*λ(n)*and inverse

*e*to obtain the private key component

*d*. The key is now compromised.

In the software, I used two tricks to make the algorithm non-deterministic to prevent finding *p* and *q* simultaneously:

1/ The code that select a random prime number (see below) is made asynchronous over twice as much cores the computer possesses. The operating system will switch tasks depending on the current system idle state and the pending I/O requests which makes the system much less deterministic.

2/ The program uses two different seeds for the prime number selection *p* and *q*. The user is asked to enter its passphrase for the private key and during that time the program silently generates new seeds. This time that depend on human interaction makes the two seeds used for *p* and *q* independent.

Let us now look at algorithms to find prime numbers randomly.

The good news is that when dealing with very large numbers, we have a high probability that any randomly chosen number is prime. The classical scheme is to generate a random number over N bits and to check if that number is prime. According to William Stallings references, a typical number of trials before finding a prime number purely from random is 67 with the typical key lengths involved in RSA.

Checking for primality is however extremely difficult because we cannot exhaustively check if the number divides by all the other smaller numbers. Instead, there are algorithms designed to check if the number is NOT prime relative to another number. By repeating the process a number of time, we can get som confidence that the number is probably prime.

A number is NOT prime if:

1/ It is an even number because even numbers are divisible by two.

2/ According to [∞] this page, 88% of non-prime numbers are dividable by prime numbers smaller than 100 (nb: remember that any integer can be expressed as a product of prime numbers).

3/ The Miller-Rabin test can check if the number is non-prime to any random number *a*. If no random numbers are found to match after M iterations, there is a 1-K^{-4} probably that the number is prime. So if the algorithms still success after 10 trials, there is a 99.99% chance that the number tested is prime.

The following function return false if the number is non-prime and return true if the number is probably prime:

*// ***return** **false** **if** number is NOT prime ; might be prime **if** **true** with somewhat confidence given by the number of tests according to 1 - 4^-n
**template**<size_t N> **bool** is_prime(**const** bigint<N>& rValue, size_t nNumTests = DEFAULT_MILLERRABIN_TESTS)
{
*// even numbers are never primes
* **if** (rValue.bitval(0) == 0)
**return** **false**;
*// heard somewhere about 88% non-primes numbers have <100 prime as divisor... so test them all to 127
* int8_t primeslist[] = { 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127 };
**for** (size_t i = 0; i < **sizeof**(primeslist) / **sizeof**(int8_t); i++)
{
**auto** curr = from_int8<N>(primeslist[i]);
*// prime ***for** sure
**if** (rValue == curr)
**return** **true**;
*// divisible by prime... so NOT prime
* **if** (rValue % curr == bigint<N>::zero())
**return** **false**;
}
*// apply Miller-Rabin test n times
* **auto** witness = [](**const** bigint<N>& n, **const** bigint<N>& a, **const** bigint<N>& d, size_t s)
{
**auto** n_1 = n - bigint<N>::one();
**auto** x = montgomery_exp(a, d, n);
**if** (x == bigint<N>::one() || x == n_1)
**return** **false**;
**for** (size_t r = 1; r < s; r++)
{
x = mod_mult(x, x, n);
**if** (x == n_1)
**return** **false**;
}
**return** **true**;
};
size_t s = 0;
**auto** d = rValue - bigint<N>::one();
**while** (d.bitval(0) == 0)
{
d >>= 1;
s++;
}
**for** (size_t i = 0; i < nNumTests; i++)
{
*// generate random number between 2 and n - 2... but I have some issues with ***signed** modulos so I reduced to N-1
**auto** a = from_int8<N>(2) + randomize<N - 1>().cast<N>() % (rValue - from_int8<N>(3));
*// ***return** **false** **if** we witness a non-prime
**if** (witness(rValue, a, d, s))
**return** **false**;
}
*// some probability that it is prime
* **return** **true**;
}

Finding the greatest common divisor can be done using the [∞] Euclidean algorithm:

Once the GCD has been found, it is easy to derive the LCM using the [∞] following formula:

The following implementation computes the GCD and LCM of two numbers:

*// ***return** greatest common divisor (euclid algorithm)
**template**<size_t N> **auto** gcd(**const** bigint<N>& rA, **const** bigint<N>& rB)
{
**auto** x = abs(rA), y = abs(rB);
**while** (y != bigint<N>::zero())
{
**auto** temp = x % y;
x = y;
y = temp;
}
**return** x;
}
*// ***return** least common multiple
**template**<size_t N> bigint<N> lcm(**const** bigint<N>& rA, **const** bigint<N>& rB)
{
*// LCM(a,b) = |a * b| / gcd(a,b)
* **auto** prod = abs(safe_mult(rA, rB));
**auto** gd = gcd(rA, rB).cast<2 * N>();
**return** (prod / gd).cast<N>();
}

Computing the inverse of number *e* mod *n* is possible using the Extended Euclid algorithm if gcd(e,n)=1:

for which you have the following implementation:

*// ***return** inverse modulo **using** extended euclid algorithm
**template**<size_t N> **auto** mod_inv(**const** bigint<N>& rValue, **const** bigint<N>& rMod)
{
**auto** t = bigint<N>::zero();
**auto** nt = bigint<N>::one();
**auto** r = abs(rMod);
**auto** nr = abs(rValue);
**while** (nr != bigint<N>::zero())
{
**auto** q = r / nr;
**auto** temp1 = t - safe_mult(q, nt).cast<N>();
**auto** temp2 = r - safe_mult(q, nr).cast<N>();
t = nt;
r = nr;
nt = temp1;
nr = temp2;
}
**if** (r > bigint<N>::one())
throwException(NoInverseException);
**if** (t < bigint<N>::zero())
**return** t + rMod;
**return** t;
}

When using the private key we have used the same exponential as for the public key:

However, this time, *d* is much larger than *e* and therefore consumes more computational time.

A trick is to replace this formula by those:

And precomputes the following values inside the private key:

This is known as the Chinese Remainder Theorem (CRT) trick and is well documented [∞] here.

Note that it is safe to use this only in the private key because it requires storing *p* and *q* directly which pose potential high security treats as they allow to build the values *λ(n)* which relates *e* to *d*.

To get a noticeable speed-up, it is mandatory to apply the operation to N bits size of *p* and *q*. I obtained almost a 10× speed-up compared to a mere 10% speed-up depending if I casted the number to a smaller bit size or not.

Although any number 3≤*e*<*n* are equally valid for the public key, encrypting text using low-value *e* can be problematic when used without precaution on small texts. This led to a vulnerability that affected many systems (*e.g.* see [∞] this page about Infineon chips security breach).

Taking back the equation

It is relatively clear that

And small key exponents *e* therefore becomes insecure to encode small words *w*. Note that this requires both low *e* and low *w* values.

To avoid such kind of issue, our software uses bitmask to ensure that the exponentiation always overflows.

We now use modified equations for encoding:

where rand bits are numbers of N bits composed or purely random bits and *mask* is a bit mask defined such that its bit ‘i’ respect the following formula

where K is the block size (see below) and *nlz(n)* is the number of leading zeros in the binary representation of *n*.

The block size is the number of bits used to describe the word *w*. For instance, we can choose to encrypt blocks by 1,024 bits group and K will then be 1,024. When encoding a SHA-256 hash, since the output is always 256 bits, we choose K=256.

The modified encoding function therefore fills all unused high-order bits of the word to encode but still ensuring that the overall results will be lower than *n*.

Decoding is done using the following modified rules:

where *umask* is a bitmask that will remove the padded bits:

Note that it is possible to turn the masks off by defining

When using RSA cryptosystem, it is important to be sure that the key cannot be factorized too easily.

The [∞] Wikipedia page on RSA gives very good recommendations:

In general, 1,024 bits keys are not recommended anymore since 2010. Although no such keys have been cracked until this day of writing, the RSA consortium recommends using 2,048 bits keys at the minimum for the near future.

It is also recommended to check the following conditions:

Keys that do not respect these rules have been shown to be easy to crack.

The software written for SPECTRE includes these three checks with a default key size of 2,048 bits.

But, most of all, strong primes selection algorithm is mandatory for the safety of the algorithm, as explained in section 2.1.

On a typical computer, the CPU can handle all typical arithmetic operations and inequalities test on 64 bits integers. Some systems implement 128 bits integers but not all of them do.

Here, we need to manipulate integers that can be coded on 4,096 bits or even more. We therefore need a way to both represent these numbers and redefine all the standard arithmetic operations for them. We will also cover special function designed for modular arithmetic that are not implemented in common CPU.

Although it would be possible to represent the numbers as boolean vectors of N bits and apply bitwise arithmetic operation to them, the result would be inefficient in term of computational power. Since CPU can handle words of up to 64 bits, it would be much better to rely on the standard arithmetic of machine words. In practice, and for reasons that will appear clearer later, we are restricted to words of 32 bits in the default implementation but if your OS handles 128 bits operations, you can switch to 64 bits words.

To give an order of magnitude, just by updating the bitshift operation from bitwise to machine words, I was able to speed up the signature process by almost a factor 10. Do not expect to gain factors 10 everytime you change an operation though, the bitshift is actually a very crucial operation that bottleneck almost all of the algorithms (see Section 4 for some profiling results).

To represent a number of N bits with machine words of M bits, we need to use at least (N+M-1)/M machine words. It is recommended to keep all the words in contiguous memory to avoid cache synchronization issues that may slow down the process. I chose to use a std::array container to achieve that.

Here, I will use small letters to talk about large numbers (assumed to be coded on N bits) and I will use capitals with subscript *i* to denote the *i*-th word of the large integer. When talking about the bit value at bit position *i*, I will use the notation *(i)*. So *a* will be a big integer, A_{5} will be the 5^{th} word of *a* and *a(10)* will be the 10^{th} bit of *a*.

The access order is as-follow:

So the least-significant word is A_{0} and the bit ordering is least-significant bit first in the word A.

The AND, OR, XOR and NOT operations are straightforward to implement and applies directly on each words:

The bitshift operations are more interesting and needs to be implement efficiently because they are used in many other arithmetic functions. I will cover here the left shift operation, the right shift is implemented by analogy.

The naïve implementation of a left shift by *k* bits would be

But this does not take advantage of machine words operation and it will dramatically slow down our program as we have already mentioned. Note that we can restrict the loop to N-*k* since all the bits above that will be thrown away.

Although it is not required in our program, you can implement the circular shift operators which takes the high bits of *a* and put them in the low bits as

Let’s go back to a more efficient way of implementing the left shift operator using machine words.

The first important thing to notice is that a shift of *r*M bits can be implemented as a swap of the words:

We can also exclude the case of zero shift:

And the case of a shift larger than N:

Otherelse we need to apply the generic algorithm:

Where the bar above T means that we use a word twice as large to store the result of the operation. The subscript *lo* and *hi* refers then to the low and high word of that double word. On a 32 bits implementation, T will be 64 bits and the operations will be performed using 64 bits operations.

In the implementation, I call the standard words “types” and the double words “overflow types” because they are able to handle the overflow of the arithmetic on the simple words.

Basically, we shift each word A_{i} by the requested amount of bits and combine the results in the two words of B (high and low parts) using the OR operation.

The code for the left shift is given below:

` `*// shift left ***operator**
**static** bigint bit_shiftleft(**const** bigint<N>& rValue, **unsigned** **int** bits)
{
*// ***return** value
bigint<N> ret;
*// skip null shifts
* **if**(bits == 0)
**return** rValue;
*// ***return** zero **if** bits **if** larger than N
**if**(bits >= (**unsigned** **int**)N)
**return** bigint<N>::zero();
*// special ***case** when number of bits to shift equals the size of the limbs
**if** ((bits % (**sizeof**(type) * 8)) == 0)
{
**auto** ofs = bits / (**sizeof**(type) * 8);
**for** (size_t i = ofs; i < ret.m_data.size(); i++)
ret.m_data[i] = rValue.m_data[i - ofs];
**return** ret;
}
**auto** bits_m = (size_t)(bits % (**sizeof**(type) * 8));
**auto** bits_o = (size_t)(bits / (**sizeof**(type) * 8));
**for** (size_t i = 0; i < ret.m_data.size() - bits_o; i++)
{
overflow_type temp = ((overflow_type)rValue.m_data[i]) << bits_m;
ret.m_data[i + bits_o] |= (type)temp;
**if** ((i + bits_o + 1) < ret.m_data.size())
ret.m_data[i + bits_o + 1] |= (type)(temp >> (**sizeof**(type) * 8));
}
*// ***return** object
**return** ret;
}

Standard arithmetic operations are the +, -, * and / operations. They must all be implemented because they all appear in the RSA algorithms somewhere. Note that we discuss here unsigned operations, signed operations will be discussed in section 3.4.

Addition and subtraction are relatively straightforward once you use the overflow type. Both of these functions performs the operations in double words, assign the low word part to the current word and propagate the high word to the next word as a carry, just like you would when adding numbers on a sheet of paper.

The code is given below without more details:

` `*// addition with carry
* **static** **auto** add(**const** bigint<N>& rA, **const** bigint<N>& rB)
{
bigint<N> result;
overflow_type carry = 0;
*// ***loop** through all
**for**(size_t i = 0; i < rA.m_data.size(); i++)
{
*// compute addition with carry
* overflow_type res = (carry + (overflow_type) rA.m_data[i] + (overflow_type) rB.m_data[i]);
*// update current
* result.m_data[i] = (type) (res & (overflow_type) (~(type) 0));
*// update carry
* carry = res >> (8 * **sizeof**(type));
}
**return** std::tuple(result, carry);
}
*// subtraction with carry
* **static** **auto** subtract(**const** bigint<N>& rA, **const** bigint<N>& rB)
{
bigint<N> result;
overflow_type carry = 0;
*// ***loop** through all
**for**(size_t i = 0; i < rA.m_data.size(); i++)
{
*// current value
* overflow_type val_a = (overflow_type) rA.m_data[i];
overflow_type val_b = carry + (overflow_type) rB.m_data[i];
*// compute carry
* **if**(val_a < val_b)
carry = ((overflow_type)1) << (8 * **sizeof**(type));
**else**
carry = 0;
*// compute ***new** value
result.m_data[i] = (type) (carry + val_a - val_b);
*// update carry
* carry = carry >> (8 * **sizeof**(type));
}
**return** std::tuple(result, carry);
}

Note that both functions will return the carry and it is up to the user to decide what to do with it. Safe operation should trigger an exception if the carry is non-zero but most of the time it is ok to live with overflows. In C/C++, almost nobody looks at the carry of additions and subtraction although this may sometimes lead to catastrophic security failures. For safe applications like pointer address arithmetic, our libraries implement the SAFE_ADD and SAFE_SUBTRACT operations that will throw an exception if the result overflow. But this goes beyond the topic of this post. You can have a look in the source code if you are interested in safe arithmetics.

Let us now focus on the multiplication and division algorithms which are more complex.

Imagine you want to multiply 64 by 123.

You can decompose the product as

which is the same as

or

which means that we can decompose the product of two numbers *a* and *b* as

where W=2^{M} is the size of the base (*e.g.* 10 in our previous example since we work in decimal there).

This formula allows computing the product of two large integers as the shifted sum of products of large integers by words.

The product of a large integer by a word can be handled using overflow types:

The code can be implemented as follows:

` `*// scale number
* **static** **auto** scale(**const** bigint<N>& rValue, type k)
{
*// to hold ***return**
bigint<N> ret;
*// carry
* overflow_type carry = 0;
*// ***loop** through all
**for** (size_t i = 0; i < rValue.m_data.size(); i++)
{
*// ***do** product
overflow_type res = carry + (overflow_type)k * (overflow_type)rValue.m_data[i];
*// set ***return** and next carry
ret.m_data[i] = (type)res;
carry = res >> (8 * **sizeof**(type));
}
*// ***return** result
**return** std::tuple(ret, carry);
}
*// ***unsigned** multiply of two numbers
**static** **auto** umult(**const** bigint<N>& rA, **const** bigint<N>& rB)
{
*// prepare ***return**
bigint<N> ret;
*// ***loop** through all
**for** (size_t i = 0; i < rA.m_data.size(); i++)
{
*// scale with current limb
* **auto** [temp, carry] = scale(rA, rB.m_data[i]);
*// add to result
* ret += temp << (**unsigned** **int**)(i * (**sizeof**(type) * 8));
}
*// ***return** value
**return** ret;
}
*// safe ***unsigned** multiplication
**template**<size_t N> **auto** safe_umult(**const** bigint<N>& rA, **const** bigint<N>& rB)
{
**return** bigint<N * 2>::umult(rA.cast<N * 2>(), rB.cast<N * 2>());
}

Note that when multiplying two large integers of N bits, the output can be as large as 2×N bits to store all the information. A safe multiply function first casts the number to 2×N bits before applying the multiplication to be sure that no data overflows.

This leaves us with the integer division of two numbers. This is a very important operation in RSA because the modulo operation relies on this. The operation is however much more difficult to implement using machine words so I had to use an inferior version that applies to bits directly.

The function actually applies a modular regression given two numbers *a* and *b* such that:

where *q* is called the quotient, *r* the remainder, *b* the divisor and *a* the dividend.

From this regression we find

The algorithm proceeds as follow:

Let’s say you would like to divide 101 (1100101 in binary) by 6 (110 in binary).

You start with r=0 and q=0. Since *a* has 7 bits, there will be 7 iterations.

You shift *r* (r=0) and add 1 (r=1), *q* is shifted but remains 0. *r* is smaller than *b* (1<110) so you skip the rest.

You shift *r* (r=10) and add 1 (r=11), *q* is shifted but remains 0. *r* is smaller than *b* (11<110) so you skip the rest.

You shift *r* (r=110) and add 0 (r=110), *q* is shifted but remains 0. *r* is larger than or equal to *b* (110=110) so you set *r* to 110-100=0 and you add 1 to *q* (q=1).

You shift *r* (r=0) and add 0 (r=0), *q* is shifted (q=10). *r* is smaller than *b* (1<110) so you skip the rest.

You shift *r* (r=0) and add 1 (r=1), *q* is shifted (q=100). *r* is smaller than *b* (1<110) so you skip the rest.

You shift *r* (r=10) and add 0 (r=10), *q* is shifted(q=1000). *r* is smaller than *b* (10<110) so you skip the rest.

You shift *r* (r=100) and add 1 (r=101), *q* is shifted (q=10000). *r* is smaller than *b* (101<110) so you skip the rest.

There are no more bits to process, we end with q=10000 (16) and r=101 (5). We can verify that the algorithm is correct because 101=16×6+5.

The implementation is given here below:

*// ***unsigned** division
**template**<size_t N> **auto** udiv(**const** bigint<N>& rDividend, **const** bigint<N>& rDivisor)
{
*// ***throw** exception when dividing by zero
**if** (rDivisor.iszero())
throwException(DivisionByZeroException);
*// handle easy ***case**
**if** (rDividend < rDivisor)
**return** std::tuple(bigint<N>::zero(), rDividend);
*// ***generic** **case**
bigint<N> remainder, quotient;
**for** (**unsigned** **int** i = N - nlz(rDividend); i > 0; i--)
{
*// shift remainder and quotient 1 bit left
* remainder <<= 1;
quotient <<= 1;
*// add current bit to remainder
* remainder.bitset(0, rDividend.bitval(i - 1));
*// skip rest ***if** remainder is lower than divisor
**if** (remainder < rDivisor)
**continue**;
*// subtract divisor from remainder
* remainder -= rDivisor;
*// add one to quotient
* quotient.bitset(0, 1);
}
**return** std::tuple(quotient, remainder);
}

Note that the multiplication and division (and hence the modulo operation) relies extensively on bit shift operations which explains why a proper implementation of these function is critical to avoid any bottleneck.

When we take the result of a standard arithmetic operation and compute its remainder relative to a given divisor, we talk about *modular arithmetic*. As presented in Section 2, modular arithmetic plays an important role in RSA cryptography.

The straightforward way to implement modular arithmetic is to first compute the result of standard arithmetic operation, with enough bits to avoid any overflow, and to apply the modular regression algorithm presented in Section 3.2. Although this solution works, it is not necessarily efficient because the extra bits and regression algorithm requires a lot of supplementary computational effort.

The [∞] BearSSL website presents a way to compute modular addition and subtraction by computing the results and carry of the following operations

where *c _{1}* and

*c*are the carry of the addition and subtraction respectively.

_{2}Then,

A similar rule is given for modular subtraction.

During testing, I have found that this formula will fail if either *a* or *b* are larger than *m* so I patched these functions using the modulo operator:

*// modular addition
***template**<size_t N> bigint<N> mod_add(**const** bigint<N>& rA, **const** bigint<N>& rB, **const** bigint<N>& rMod)
{
**auto** [d, c1] = bigint<N>::add(rA % rMod, rB % rMod);
**auto** [e, c2] = bigint<N>::subtract(d, rMod);
**if** (c1 == c2)
**return** e;
**return** d;
}
*// modular subtraction
***template**<size_t N> bigint<N> mod_subtract(**const** bigint<N>& rA, **const** bigint<N>& rB, **const** bigint<N>& rMod)
{
**auto** [d, c1] = bigint<N>::subtract(rA % rMod, rB % rMod);
**auto** [e, c2] = bigint<N>::add(d, rMod);
**if** (rA.sign() == -1 && rB.sign() == +1)
**return** e;
**if** (rA.sign() == +1 && rB.sign() == -1)
**return** d;
**if** (!((c1 == 0) && d.bitval(N - 1) == 0))
**return** e;
**return** d;
*/*
***if** (lower_than(rA, rB))
**return** e;
**else**
**return** d;
*/
}

The two functions are slightly slower than applying the formula on larger integers and then using the modulo operator. On the other hand, it manages signed numbers correctly which is trickier when expanding the number of bits.

One specific operation was found to be the bottleneck of RSA encryption and decryption. It is the modular exponentiation.

In Section 3.2 I did not discuss exponentiation because it produces unbounded results in terms of bit length. For instance, x^{3} will take up to 3×N bits. When using exponents that has several hundreds of digits, it is clearly not manageable in terms of memory space and it therefore does not make sense to implement such function (unbounded). In modular arithmetic things are different because the results will always be N bits long due to the modulo function (considering that all numbers are coded on N bits).

A naïve implementation of the modular exponentiation would look like this:

*// modular multiplication
***template**<size_t N> bigint<N> mod_mult(**const** bigint<N>& rA, **const** bigint<N>& rB, **const** bigint<N>& rMod)
{
*// peform safe multiplication
* **auto** prod = safe_mult(rA, rB);
*// get modulo
* **auto** mod = prod % rMod.cast<2 * N>();
*// cast back
* **return** mod.cast<N>();
}
*// modular exponent ***operator** (**signed** version)
**template**<size_t N> bigint<N> mod_exp(bigint<N> v, bigint<N> n, **const** bigint<N>& m)
{
bigint<N> ret = bigint<N>::one();
v = v % m;
**auto** stopN = N - nlz(n);
**for** (**unsigned** **int** i = 0; i < (**unsigned** **int**)stopN; i++)
{
**if** (n.bitval(i) == 1)
ret = mod_mult(ret, v, m);
v = mod_mult(v, v, m);
}
**return** ret;
}

It is however not efficient in terms of computation because it relies heavily on the usage of the regression function to compute the remainder. There however exists a trick known as the *Montgomery multiplication* that can be used here to make the function much faster. I recorded a 10× speed gain on my laptop using 2,048 bits key with that trick enabled.

In Montgomery multiplication, the numbers are first projected into a Montgomery space by multiplying them by a number *r* modulo *n* (*r*>*n*). The product of two numbers in the Montgomery space follows a specific rule that is c=a×b×r^{-1} mod n. Once the operation is performed, the result is projected back to the normal space by multiplying it by r^{-1} mod n. By choosing r=2^{N}, it is possible to implement all these operations without having to actually compute any regression. The only limitation to the algorithm is that *n* must be odd (actually the condition is gcd(n,r)=1).

Projecting any number x<n in the Montgomery space modulo *n* with r=2^{N} can be done by N shifting operation (one bit shift is equivalent to a multiply by 2 operation) and subtraction. Indeed, if some number *z* is smaller than *n* then 0<*z*+*z*<2*n* and only one subtraction can put it back in the [0,*n*[ range.

Projecting back to the regular space consist of multiplying the number in Montgomery space with r^{-1} mod n. It is then possible to reuse the multiplication algorithm using b=1 to perform that operation.

The standard Montgomery multiplication is detailed in the [∞] Wikipedia page (see REDC). However, I use a more efficient method here that I was able to find in an online paper. It took me some time to find good reference material to grasp the algorithm details as most of the source are rather vague. I was able to find a nice [∞] PDF by Cetin K. Koc and Coling D. Walter that explained it with words I could understand. I strongly recommend that you read it. This [∞] page has some nice information too.

Note that the projection to and from the Montgomery space requires some amount of computations which make the algorithm not necessarily interesting when only one multiplication need to be performed. But in the modular exponentiation algorithm, we need to project the number only when the algorithm starts and all the subsequent multiplications can then occur in Montgomery space, which is much more efficient.

The implementation that I used is given here-below:

*// Montgomery exponentiation
***template**<size_t N> bigint<N> montgomery_exp(**const** bigint<N>& v, **const** bigint<N>& n, **const** bigint<N>& m)
{
*// Montgomery product
* **auto** _prod = [](**const** bigint<N> &a, **const** bigint<N> &b, **const** bigint<N>& m)
{
bigint<N> u;
**for** (**unsigned** **int** i = 0; i < (**unsigned** **int**)N; i++)
{
**if** (a.bitval(i) == 1)
u += b;
**if** (u.bitval(0) == 1)
u += m;
u >>= 1;
}
**if** (u >= m)
**return** u - m;
**return** u;
};
*// Montgomery projection
* **auto** _project = [](bigint<N> x, **const** bigint<N>& m)
{
**for** (size_t i = 0; i < N; i++)
{
x <<= 1;
**if** (x >= m)
x -= m;
}
**return** x;
};
*// project in Montgomery space
* **auto** _v = _project(v % m, m);
**auto** _ret = _project(bigint<N>::one(), m);
*// perform exponentiation
* **auto** stopN = N - nlz(n);
**for** (**unsigned** **int** i = 0; i < (**unsigned** **int**)stopN; i++)
{
**if** (n.bitval(i) == 1)
_ret = _prod(_ret, _v, m);
_v = _prod(_v, _v, m);
}
*// unproject result
* **return** _prod(_ret, bigint<N>::one(), m);
}

Up to now I have considered only unsigned operations. However, using the [∞] two-complement writing, it is possible to deal with negative number as well without having to modify the addition and subtraction algorithms.

In the two-complement notation, the negative of a number is given by

We can then define the absolute value function as

The product of two numbers *a* and *b* is given by

with

Similarly, the quotient of two numbers is given by

And finally

In case of doubts, you can check the following equalities (a=±101, b=±6):

Finally, we can also implement inequalities testing such as *a*<*b* using the subtraction operator:

` `*// smaller-than ***operator**
**static** **bool** lower_than(**const** bigint<N>& rA, **const** bigint<N>& rB)
{
**if** (rA.sign() == -1 && rB.sign() == +1)
**return** **true**;
**if** (rA.sign() == +1 && rB.sign() == -1)
**return** **false**;
**auto** [result, carry] = subtract(rA, rB);
**return** !((carry == 0) && result.bitval(N - 1) == 0);
}

The equality operator consists of simply checking that all words in *a* and *b* are the same. All other operators (≠, ≤, >, ≥) can be built from these two operators and the boolean NOT operator:

A demo software to create keys and sign SPECTRE (.spc) files can be downloaded [∞] here with its source code. The program can be used with the following arguments:

```
spc keygen "filename" [-type=<rsa2048>/rsa3072/rsa4096]
spc sign "filename" "
```**private** key" "**public** key" [-type=<rsa2048>/rsa3072/rsa4096]
spc encode "filename"
spc auth "filename"
spc sample "filename"
spc dump "filename"

When used with the **keygen** command, the program will generate a private .key file and a public .pub.key file. It is possible to choose between three different key length (rsa2048, rsa3072 and rsa4096) with 2,048 bits key length set as default.

If you don’t have .spc file on hand (you can create them with any of the [∞] Spectrum Analyzer software suite), you can use the **sample** command to create a sample file.

You may then sign any .spc file using the **sign** command. Be sure to match the type argument with the key type you are importing when not using default values.

You can then check the signature of any .spc file using the **auth** command.

The **encode** command will let you change (or add or remove) an AES-256 password to any .spc file.

Finally, the **dump** command will show you the content of any .spc file.

Please note that signatures are only implemented in version 1 (SPC1) of SPECTRE files. Earlier version 0 (SPC0) files will be converted automatically to the latest version (SPC1 at time of this writing). Programs that rely on earlier version of SPECTRE files will not be able to load files using newer SPECTRE versions.

Figure 4 shows the time required to generate a key of given size. I repeated the test 10 times on my laptop which has an i7-8700 processor running at 3.2 Ghz.

The key generation time follows a logarithmic trend with the key size. For a 2,048 bits key (default key size), the median key generation time was 24.2 seconds.

For a specific 2,048 bits key generated, the median time required to encode the string “The quick brown fox jumps over the lazy dog” as a 1,024 bits packet over 100 runs was 9.3 ms using the public key and 115.69 ms using the private key. The median time required to decode this message was 8.24 ms using the public key and 116.52 ms using the private key.

The median encoding and decoding time are roughly the same and the slight variation are due to the random padding in the encoding. Most of the time is actually spent computing the modular exponentiation. Since the exponent is much smaller in the public key, it is faster to execute.

99.73% of the execution time of the program was spent in the Montgomery product for the private key encoding/decoding among which 62.31% was spent doing >>= 1 operations and 32.83% doing additions. Moreover, looking into the addition, most of the job was spent doing machine words shifts for the carry computations. This clearly emphasize the statement about binary shifts being the bottleneck of the application.

Finally, the influence on the word size was checked for private key decoding operations. The results are presented in Figure 5 and clearly follows a logarithmic trend with the machine word size (8/16 bits, 16/32 bits and 32/64 bits words – my computer does not support 128 bits for overflow types).

Concerning the compiler optimization, enabling the automatic parallel code generation in VisualStudio 2019 did not help the processing time and enabling AVX2 SIMD instructions sets decreasing the processing time by only about 10%.

As an order of magnitude, I tested the speed required by OpenSSL to encode/decode using 2,048 bits key and the result were 14 µs and 0.48 ms respectively! As you can see, there is still room for improvement :P

This post has detailed the concept of RSA cryptography, Digital Signature and Certification. Furthermore, I have presented all the implementations tricks that I was able to find online to make the computation as efficient as possible.

I have then presented some typical results and gave a demo software with its source code, which I hope will help those of you who might be struggling in implementing their own RSA cryptosystem :)

This implementation is certainly not as fast as dedicated package available online, but it is enough for the purpose of Digital Signature of the SPECTRE files. I also enjoyed very much working on this problem!

I would like to give a big thanks to **James**, **Daniel**, **Mikhail** and **Naif** who have supported this post through [∞] Patreon. I also take the occasion to invite you to donate through Patreon, even as little as $1. I cannot stress it more, you can really help me to post more content and make more experiments!

**You may also like:**

[»] Data Communication With a PIC Using RS232

[»] Implementing Object Persistence in C++