AlexLeonov March 19, 2010 at 17:26

SGVsbG8gd29ybGQh, or the history of base64

Brief Background

It all started a long time ago. So long ago that hardly any witnesses remain of the holy wars of those days, when it was being decided how many bits a byte should contain.

Today we take it for granted that 1 byte = 8 bits and that a byte can encode 256 different values. But it was not always so. History remembers seven-bit encodings, six-bit encodings, and even more exotic systems (for example, the Setun computer, which used ternary logic: its basic unit, the trit, could take three values rather than two, and for it the ratio 1 tryte = 6 trits held). Leaving the exotica aside, though, the mainstream encodings used 6, 7, or 8 bits per byte.

A six-bit encoding (for example, BCD) could represent 64 different values in one byte, which seemed quite enough for alphanumeric characters, and the "extra" seventh bit expanded the encoding to 128 characters.

However, the eight-bit byte soon became generally accepted.

The Eighth Bit Problem

The adoption of eight-bit encodings as a de facto standard brought many problems. By then a certain infrastructure already relied on seven-bit encodings, and the holy wars flared up with renewed vigor.

They have come down to us as the problem of the "stripped eighth bit" in email. An eight-bit byte gave 256 different values, which in turn made it possible to fit the common symbols (digits, punctuation, Latin letters) and the characters of, say, Cyrillic into a single code table. It looked like pure convenience: text could be typed in Russian letters or in English, and if necessary there was room for German umlauts too!

But, as always, the devil was in the details. Much of the hardware and software already deployed had been built for seven-bit encodings, which led to various problems.

For example, a mail server forwarding a message could quite easily clear the most significant bit of every byte, and often the information was simply lost for good.

Several temporary solutions to this problem were proposed. One of them was the KOI-8 encoding. The solution, I must admit, is very elegant: in this encoding the Russian letters were arranged in the order of their Latin counterparts and differed from them only in that very high bit. So when the bit was stripped, Russian "А" turned into Latin "A", "Б" into "B", and so on; the message was simply transliterated and could still be read. True, there was a skeleton in the closet: sorting in Russian alphabetical order became a nightmare in KOI-8...

But what were other languages, peoples, and encodings to do? And what about binary data? Transliterating encodings did not solve the fundamental problem anyway: losing the eighth bit means losing part of the information. This is how the Base64 encoding (or rather, algorithm) was born.
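The effect is easy to reproduce. The following is a minimal sketch, assuming Python's built-in `koi8_r` codec as a stand-in for the KOI-8 variant the text has in mind; the helper name `strip_high_bit` is made up for this illustration.

```python
def strip_high_bit(text: str) -> str:
    """Encode text as KOI8-R, clear bit 7 of every byte, read the rest as ASCII."""
    raw = text.encode("koi8_r")
    # Simulate a seven-bit mail relay: only the low 7 bits of each byte survive.
    return bytes(b & 0x7F for b in raw).decode("ascii")


print(strip_high_bit("Привет"))
```

In KOI8-R the letter case flips along the way (lowercase Cyrillic lands on uppercase Latin and vice versa), but the result remains a readable transliteration.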

Base64 Algorithm

The idea of base64 is simple: a reversible encoding that maps every value of an eight-bit code table onto characters that are guaranteed to survive transfer over any network and between any devices.

The algorithm regroups three groups of eight bits (24 bits) into four groups of six bits (also 24 bits) and represents each six-bit group as an ASCII character. The result is a reversible encoding whose only drawback is that the data grows in size, in a ratio of 4:3.
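The regrouping can be sketched with a few shifts and masks. This is only an illustration of the 3-to-4 step, not a full encoder; the function names `split_into_sextets` and `encoded_length` are invented here, and the length formula assumes the output is padded to whole groups of four characters.

```python
def split_into_sextets(b0: int, b1: int, b2: int) -> list[int]:
    """Pack three 8-bit values into 24 bits and cut them into four 6-bit values."""
    word = (b0 << 16) | (b1 << 8) | b2
    return [(word >> shift) & 0x3F for shift in (18, 12, 6, 0)]


def encoded_length(n_bytes: int) -> int:
    """Number of output characters for n_bytes of input, padding included."""
    return 4 * ((n_bytes + 2) // 3)


# The ASCII bytes of "Man": 0x4D, 0x61, 0x6E
print(split_into_sextets(0x4D, 0x61, 0x6E))
print(encoded_length(3), encoded_length(5))
```

Each value returned by `split_into_sextets` lies in the range 0..63, so it can serve directly as an index into a 64-character alphabet.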

Example:
Take the Russian text "АБВГД" ("ABVGD" in transliteration). Encoded in Windows-1251, it occupies 5 bytes, shown here in binary:
11000000
11000001
11000010

11000011
11000100
(00000000) - an extra zero byte is needed so that the total number of bits is divisible by 6.

Divide these bits into groups of 6:
110000
001100
000111
000010

110000
111100
010000
000000

We take the array of characters "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/" and, using the numbers obtained as array indices, we get "wMHCw8Q". The last six-bit group consists entirely of padding, so it produces no character. It only remains to add one "=" at the end, indicating the one extra zero byte that we added in the first step, and we get the final result:

"АБВГД": base64 = "wMHCw8Q="

The inverse conversion is no harder. Try, for example, to decode what is written in the title of this article.
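The whole round trip fits in a few lines. The sketch below is one possible way to write it, assuming the standard 64-character alphabet and input whose encoded length is a multiple of four; `b64_encode` and `b64_decode` are names chosen for this example, and the final check compares against Python's own `base64` module.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def b64_encode(data: bytes) -> str:
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        pad = 3 - len(chunk)  # zero bytes needed to complete the group
        word = int.from_bytes(chunk + b"\x00" * pad, "big")
        chars = [ALPHABET[(word >> s) & 0x3F] for s in (18, 12, 6, 0)]
        if pad:
            chars[4 - pad:] = "=" * pad  # one "=" per added zero byte
        out.extend(chars)
    return "".join(out)


def b64_decode(text: str) -> bytes:
    pad = len(text) - len(text.rstrip("="))
    out = bytearray()
    for i in range(0, len(text), 4):
        word = 0
        for ch in text[i:i + 4]:
            # "=" stands in for an all-zero six-bit group
            word = (word << 6) | (0 if ch == "=" else ALPHABET.index(ch))
        out += word.to_bytes(3, "big")
    # Drop the zero bytes that were added during encoding.
    return bytes(out[:len(out) - pad]) if pad else bytes(out)


source = "АБВГД".encode("cp1251")
encoded = b64_encode(source)
print(encoded)
assert encoded == base64.b64encode(source).decode("ascii")
assert b64_decode(encoded).decode("cp1251") == "АБВГД"
```

The decoder does no validation: a character outside the alphabet raises `ValueError` from `ALPHABET.index`, and whitespace or line breaks would have to be removed beforehand.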

Application

The base64 algorithm is still used today wherever careful handling of your data cannot be guaranteed, for example to encode email attachments. PGP also uses base64 to represent binary data (keys, signatures, encrypted messages) as text.
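As an illustration of the email case, here is a small sketch using the `email` package from Python's standard library, which applies base64 to binary attachment parts by default. The payload is arbitrary test data, not anything taken from the article.

```python
from email.mime.application import MIMEApplication

# Arbitrary binary payload containing every possible byte value.
payload = bytes(range(256))

# MIMEApplication base64-encodes its payload unless told otherwise.
part = MIMEApplication(payload)

print(part["Content-Transfer-Encoding"])
print(part.get_payload()[:40])

# Decoding the transfer encoding gives back the original bytes.
assert part.get_payload(decode=True) == payload
```

The text returned by `get_payload()` contains only characters from the base64 alphabet plus line breaks, so a seven-bit transport has nothing left to damage.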

Other uses of base64 are easy to imagine. For example, when saving to a database in an environment that is not known in advance (oh, those magic_quotes in PHP!), and when no indexing or text search is needed, you can store the data as base64.

base64 can also be applied before hashing, for example with the md5 algorithm, as a measure against lookups in precomputed hash tables: the data, such as a user's password, is converted to base64 first and only then hashed.
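What the paragraph describes amounts to changing the input of the hash function. A minimal sketch of that step, with a made-up password and using only `hashlib` and `base64` from the standard library:

```python
import base64
import hashlib

password = b"correct horse"  # hypothetical example value

# md5 of the raw password versus md5 of its base64 form.
plain_digest = hashlib.md5(password).hexdigest()
wrapped_digest = hashlib.md5(base64.b64encode(password)).hexdigest()

print(plain_digest)
print(wrapped_digest)
```

The two digests differ, so a table built from hashes of raw passwords will not contain the second one. The transformation itself is fixed and public, however, so this sketch shows only the mechanics, not how much protection it gives.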

And finally, there are Data URIs.
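A Data URI embeds a resource directly in the address as base64 text. The sketch below builds one by hand; the helper name `make_data_uri` and the default media type are assumptions made for this example.

```python
import base64


def make_data_uri(data: bytes, mime_type: str = "text/plain") -> str:
    """Build a URI of the form data:<media type>;base64,<encoded data>."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


print(make_data_uri(b"hi!"))
```

The same helper works for images or fonts if a matching media type such as `image/png` is passed in.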

References

en.wikipedia.org/wiki/Base64
base64.ru
