Encoding binary data as a string with an alphabet of arbitrary length (BaseN)

    Everyone knows the algorithm for converting a byte array to a base64 string. There are many variants of this algorithm with different alphabets, with and without padding characters. There are modifications in which the alphabet length is a different power of two, for example 32 or 16. There are also more interesting modifications in which the alphabet length is not a power of two, such as base85 and base91. However, I have not come across an algorithm in which the alphabet could be arbitrary, including alphabets longer than 256 characters. The problem seemed interesting, and I decided to implement it.

    Here are the links to the source code and the JS demo right away. Although the developed algorithm is mostly of theoretical value, I thought it worth describing the details of its implementation. In practice it can be useful, for example, in cases where a shorter string length matters more than its size in bytes (for example, in quines).

    Custom alphabet

    To understand how such an algorithm can be constructed, I decided to study a particular case, namely base85. The idea of this algorithm is as follows: the input data stream is split into blocks of 4 bytes, and each block is treated as a 32-bit number with the most significant byte first. By successively dividing each block by 85, five digits of the base-85 number system are obtained. Each digit is then encoded with a printable character from an 85-character alphabet and written to the output stream, preserving the order from the most significant digit to the least significant.
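
    For illustration, here is a minimal sketch of encoding one such 4-byte block in C# (the method name and the alphabet parameter are made up for this example; real base85 variants also differ in their alphabets and in how they treat short blocks and all-zero blocks):

    static char[] EncodeBlock85(byte[] block, string alphabet85)
    {
        // Treat the 4 bytes as a big-endian 32-bit number.
        uint value = ((uint)block[0] << 24) | ((uint)block[1] << 16)
                   | ((uint)block[2] << 8) | (uint)block[3];
        var chars = new char[5];
        // Divide by 85 repeatedly; fill from the least significant digit
        // so that the output keeps the most significant digit first.
        for (int i = 4; i >= 0; i--)
        {
            chars[i] = alphabet85[(int)(value % 85)];
            value /= 85;
        }
        return chars;
    }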

    But why a block size of 4 bytes, i.e. 32 bits? Because this achieves nearly optimal compression: the 2^32 = 4294967296 possible block values are covered by the 85^5 = 4437053125 combinations of five characters, with very little waste. This technique can be extended to any other alphabet, so a small mathematical framework was worked out for finding the block size (in bits) at which compression is maximal, using the following notation:

    a - the size of the alphabet A.
    k - the number of encoded characters.
    b - the base of the number system.
    n - the number of digits in radix b needed to represent k characters of the alphabet A.
    r - the compression ratio (the higher, the better).
    mbc - the maximum block size in bits.

    ⌊x⌋ is the largest integer not greater than x (floor).
    ⌈x⌉ is the smallest integer not less than x (ceiling).
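
    Using these definitions, the first stage boils down to a brute-force search: for every candidate block size n up to mbc, take the smallest k such that a^k ≥ 2^n (assuming the source radix b = 2) and keep the pair with the best ratio. Below is a rough sketch of this search in C# (method and variable names are illustrative, the ratio is taken as n / k, and System.Numerics is required for BigInteger; this is not the library's exact code):

    static void FindOptimalBlockSize(int a, int mbc, out int bestN, out int bestK)
    {
        double log2A = Math.Log(a, 2);
        bestN = 0;
        bestK = 1;
        for (int n = 1; n <= mbc; n++)
        {
            // Smallest k with a^k >= 2^n: estimate via logarithms,
            // then adjust with exact big-integer arithmetic.
            int k = Math.Max(1, (int)(n / log2A));
            while (BigInteger.Pow(a, k) < BigInteger.Pow(2, n))
                k++;
            // Keep the pair with the highest bits-per-character ratio n / k
            // (compared exactly by cross-multiplication).
            if (bestN == 0 || n * bestK > bestN * k)
            {
                bestN = n;
                bestK = k;
            }
        }
    }

    Under these assumptions, a = 85 with mbc = 64 gives n = 32 and k = 5, and mbc = 256 gives n = 141 and k = 22, matching the values discussed below.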

    Using this technique, the optimal combinations of bit counts and the numbers of characters encoding them were found and plotted in the figure below. The maximum block size was taken as 64 bits (a larger value would require big-number arithmetic). As the figure shows, the compression ratio does not always grow as the alphabet gets larger (this is visible around 60 and 91 characters). The red bars mark the well-known encodings (85 and 91). The diagram suggests that these alphabet sizes were not chosen by accident: they use a small number of bits while giving a good compression ratio. It is worth noting that the diagram can change if the maximum block size is increased. For example, for base85 with a maximum block size of 64 bits the block size is 32 bits and 5 characters, with a redundancy of 1.25; if the maximum block size is increased to 256 bits, the block size becomes 141 bits and 22 characters, with a redundancy of 1.2482.



    Coding steps

    So, step by step, the encoding process is as follows:
    1. Calculation of the optimal block size (the number of bits n) and the corresponding number of characters (k).
    2. Representation of the source string as a sequence of bytes (using UTF-8).
    3. Splitting the byte sequence into groups of n bits.
    4. Converting each group of bits into a number in the numeral system with base a.
    5. Processing of the tail bits.

    The implementation of the first stage was discussed above. For the second stage, a method for representing a string as an array of bytes was used (in C# the built-in Encoding.UTF8.GetBytes(), and for JS strToUtf8Bytes and bytesToUtf8Str were written by hand). The next three steps are covered in more detail below. The inverse conversion of a sequence of characters into an array of bytes looks similar.
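
    As a tiny illustration of the second stage in C#, using the built-in API mentioned above (requires using System.Text):

    // String -> UTF-8 byte sequence and back.
    byte[] data = Encoding.UTF8.GetBytes("BaseN");
    string text = Encoding.UTF8.GetString(data);   // inverse conversion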
    Bit Block Encoding

    private void EncodeBlock(byte[] src, char[] dst, int ind)
    {
    	int charInd = ind * BlockCharsCount;
    	int bitInd = ind * BlockBitsCount;
    	BigInteger bits = GetBitsN(src, bitInd, BlockBitsCount);
    	BitsToChars(dst, charInd, (int)BlockCharsCount, bits);
    }
    private void DecodeBlock(string src, byte[] dst, int ind)
    {
    	int charInd = ind * BlockCharsCount;
    	int bitInd = ind * BlockBitsCount;
    	BigInteger bits = CharsToBits(src, charInd, (int)BlockCharsCount);
    	AddBitsN(dst, bits, bitInd, BlockBitsCount);
    }
    


    GetBitsN returns a number BlockBitsCount bits in length, starting at bit number bitInd.
    AddBitsN writes a number BlockBitsCount bits in length into the dst byte array at bit position bitInd.
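
    The source of these two helpers is not shown in the article, so here is a simple bit-by-bit sketch of their intended semantics (the real implementation can process whole bytes or 64-bit words at a time for speed; System.Numerics is required for BigInteger):

    // Returns `count` bits of `src`, starting at bit index `bitInd`,
    // as a BigInteger with the most significant bit first.
    private static BigInteger GetBitsN(byte[] src, int bitInd, int count)
    {
        BigInteger result = 0;
        for (int i = 0; i < count; i++)
        {
            int bit = bitInd + i;
            int bitValue = (src[bit / 8] >> (7 - bit % 8)) & 1;
            result = (result << 1) | bitValue;
        }
        return result;
    }

    // Writes the lowest `count` bits of `bits` into `dst` (assumed zero-initialized),
    // starting at bit index `bitInd`, most significant bit first.
    private static void AddBitsN(byte[] dst, BigInteger bits, int bitInd, int count)
    {
        for (int i = 0; i < count; i++)
        {
            if (((bits >> (count - 1 - i)) & 1) != 0)
            {
                int bit = bitInd + i;
                dst[bit / 8] |= (byte)(1 << (7 - bit % 8));
            }
        }
    }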

    Converting a block of bits into a number system with an arbitrary base and vice versa

    Alphabet is an arbitrary alphabet. For the inverse transformation, the precomputed inverse alphabet InvAlphabet is used. It is worth noting that, for example, base64 uses the direct bit order, while base85 uses the reverse order (ReverseOrder); _powN is a table of powers of the alphabet length.

    private void BitsToChars(char[] chars, int ind, int count, BigInteger block)
    {
    	for (int i = 0; i < count; i++)
    	{
    		chars[ind + (!ReverseOrder ? i : count - 1 - i)] = Alphabet[(int)(block % CharsCount)];
    		block /= CharsCount;
    	}
    }
    private BigInteger CharsToBits(string data, int ind, int count)
    {
    	BigInteger result = 0;
    	for (int i = 0; i < count; i++)
    		result += InvAlphabet[data[ind + (!ReverseOrder ? i : count - 1 - i)]] * _powN[BlockCharsCount - 1 - i];
    	return result;
    }
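
    The construction of InvAlphabet and _powN is not shown above. A possible way to precompute them, consistent with the indexing used in CharsToBits (assuming CharsCount is the alphabet length), could look like this:

    // Inverse alphabet: maps a character back to its index in Alphabet.
    InvAlphabet = new int[char.MaxValue + 1];
    for (int i = 0; i < Alphabet.Length; i++)
        InvAlphabet[Alphabet[i]] = i;

    // Powers of the alphabet length: _powN[j] = CharsCount^(BlockCharsCount - 1 - j),
    // so the last element is 1 and the first is the highest power.
    _powN = new BigInteger[BlockCharsCount];
    BigInteger pow = 1;
    for (int i = BlockCharsCount - 1; i >= 0; i--)
    {
        _powN[i] = pow;
        pow *= CharsCount;
    }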
    


    Tail bit processing

    Most of the time was spent on solving this problem. In the end, however, it turned out to be possible to write code that computes the main and tail bits and characters without using floating-point numbers, which made it easy to parallelize the algorithm.

    • mainBitsLength - the number of bits in the main part.
    • tailBitsLength - the number of bits in the tail.
    • mainCharsCount - the number of characters in the main part.
    • tailCharsCount - the number of characters in the tail.
    • globalBitsLength - the total number of bits (mainBitsLength + tailBitsLength).
    • globalCharsCount - the total number of characters (mainCharsCount + tailCharsCount).


    Calculation of the main and tail numbers of bits and characters during encoding.

    int mainBitsLength = (data.Length * 8 / BlockBitsCount) * BlockBitsCount;
    int tailBitsLength = data.Length * 8 - mainBitsLength;
    int globalBitsLength = mainBitsLength + tailBitsLength;
    int mainCharsCount = mainBitsLength * BlockCharsCount / BlockBitsCount;
    int tailCharsCount = (tailBitsLength * BlockCharsCount + BlockBitsCount - 1) / BlockBitsCount;
    int globalCharsCount = mainCharsCount + tailCharsCount;
    int iterationCount = mainCharsCount / BlockCharsCount;
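
    For context, here is a sketch of how these values might drive the rest of the encoding (illustrative only; in the C# version the main loop can be parallelized because the blocks are independent):

    char[] result = new char[globalCharsCount];

    // Main part: each block of BlockBitsCount bits becomes BlockCharsCount characters.
    for (int i = 0; i < iterationCount; i++)
        EncodeBlock(data, result, i);

    // Tail: the remaining tailBitsLength bits are packed into tailCharsCount characters.
    if (tailBitsLength != 0)
    {
        BigInteger tailBits = GetBitsN(data, mainBitsLength, tailBitsLength);
        BitsToChars(result, mainCharsCount, tailCharsCount, tailBits);
    }

    return new string(result);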
    


    Calculation of the main and tail numbers of bits and characters during decoding.

    int globalBitsLength = ((data.Length - 1) * BlockBitsCount / BlockCharsCount + 8) / 8 * 8;
    int mainBitsLength = globalBitsLength / BlockBitsCount * BlockBitsCount;
    int tailBitsLength = globalBitsLength - mainBitsLength;
    int mainCharsCount = mainBitsLength * BlockCharsCount / BlockBitsCount;
    int tailCharsCount = (tailBitsLength * BlockCharsCount + BlockBitsCount - 1) / BlockBitsCount;
    BigInteger tailBits = CharsToBits(data, mainCharsCount, tailCharsCount);
    if (tailBits >> tailBitsLength != 0)
    {
    	globalBitsLength += 8;
    	mainBitsLength = globalBitsLength / BlockBitsCount * BlockBitsCount;
    	tailBitsLength = globalBitsLength - mainBitsLength;
    	mainCharsCount = mainBitsLength * BlockCharsCount / BlockBitsCount;
    	tailCharsCount = (tailBitsLength * BlockCharsCount + BlockBitsCount - 1) / BlockBitsCount;
    }
    int iterationCount = mainCharsCount / BlockCharsCount;
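
    Similarly, a sketch of how decoding might continue from here (again illustrative rather than the exact library code):

    byte[] result = new byte[globalBitsLength / 8];

    // Main part: decode the independent blocks (parallelizable).
    for (int i = 0; i < iterationCount; i++)
        DecodeBlock(data, result, i);

    // Tail: write the remaining bits back into the byte array
    // (recomputed here in case the counts were adjusted above).
    if (tailCharsCount != 0)
    {
        BigInteger tail = CharsToBits(data, mainCharsCount, tailCharsCount);
        AddBitsN(result, tail, mainBitsLength, tailBitsLength);
    }

    return result;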
    


    JavaScript implementation

    I also ported this algorithm to JavaScript. The jsbn library was used for working with big numbers, and the interface was built with Bootstrap. The result can be seen here: kvanttt.github.io/BaseNcoding

    Conclusion

    This algorithm is written so that it is easy to port to other languages and to parallelize (this is done in the C# version), including on the GPU. By the way, modifying the base64 algorithm for the GPU in C# is described well here: Base64 Encoding on a GPU.

    The correctness of the developed algorithm was tested against base32, base64, and base85: the resulting character sequences matched, except for the tails.
