
Making sense of forward and reverse byte order
A translation of Khalid Azad's article "Understanding Big and Little Endian Byte Order". Byte ordering problems are very frustrating, and I want to save you the grief I experienced. Here are the key points:
- Problem: Computers, like people, speak different languages. Some write data "from left to right", others "from right to left". Each machine reads its own data perfectly well; problems begin when one computer saves data and another tries to read it.
- Solution: Agree on a common format (for example, all network traffic is sent in a single byte order), or always prepend a header describing the byte order of the data. If the header arrives reversed, the data was saved in the other format and must be converted.
Numbers and data
The most important concept is the difference between a number and the data that represents it. A number is an abstract concept, a count of something. You have ten fingers. The concept of "ten" does not change with the representation used: ten, 10, diez (Spanish), ju (Japanese), 1010 (binary), X (Roman numerals)... All of these representations point to the same concept of "ten".
Compare this with data. Data is a physical concept, just a sequence of bits and bytes stored on a computer. Data has no inherent meaning; it must be interpreted by whoever reads it.
Data is like human writing: just a set of marks on paper. The marks have no meaning by themselves. If we see a line and a circle (say, | O), we might interpret them as "ten". But that is only a guess that the characters represent a number. They could be the letters "IO", the name of a moon of Jupiter. Or the name of a Greek goddess. Or an abbreviation for input/output. Or someone's initials. Or the number 2 in binary ("10"). The list of guesses goes on. The point is that a single piece of data (| O) can be interpreted in many ways, and the meaning stays unclear until someone clarifies the writer's intent.
Computers face the same problem. They store data, not abstract concepts, using 1s and 0s. Later they read those 1s and 0s back and try to recreate the abstract concepts from them. Depending on the assumptions made, the same 1s and 0s can mean completely different things.
Why does this happen? Well, there is no rule that computers must all use the same language, just as there is no such rule for people. Each type of computer is internally consistent (it can read back its own data), but there is no guarantee of how a different type of computer will interpret that data.
Key concepts:
- Data (bits and bytes, or marks on paper) has no meaning by itself. It must be interpreted as some abstract concept, such as a number.
- Like people, computers can store the same abstract concept in different ways (just as we can write "ten" in different ways).
Storing numbers as data
Fortunately, most computers store data in only a few formats (although this was not always the case). This gives us a common starting point, which makes life a little easier:
- A bit has two states (on or off, 1 or 0).
- A byte is a sequence of 8 bits. The leftmost bit in a byte is the most significant. That is, the binary sequence 00001001 is the decimal number nine (00001001 = 2^3 + 2^0 = 8 + 1 = 9).
- Bits are numbered from right to left. Bit 0 is the rightmost and least significant; bit 7 is the leftmost and most significant.
We can use these agreements as a building block for data exchange. If we save and read data one byte at a time, the approach works on any computer: the concept of a byte is the same on all machines, and so is the concept of "byte 0". Computers also agree on the order in which you send them bytes; they know which byte was sent first, second, third, and so on. "Byte 35" is the same on all machines.
So what's the problem? Computers get along fine with single bytes, right? Well, everything is fine for single-byte data such as ASCII characters. However, a lot of data is stored in multiple bytes, for example integers or floating-point numbers, and there is no agreement on the order in which those byte sequences should be stored.
A byte example
Consider a sequence of 4 bytes. Let's call them W, X, Y and Z. I avoid the names A, B, C and D because those are also hexadecimal digits, which would be confusing. Each byte has a value and consists of 8 bits.
Byte name:   W    X    Y    Z
Position:    0    1    2    3
Value (hex): 0x12 0x34 0x56 0x78
For example, W is a single byte with the value 0x12 in hexadecimal, or 00010010 in binary. Interpreted as a number, it is 18 in decimal (by the way, nothing says we must interpret this byte as a number; it could be an ASCII character or something else entirely). Still with me? We have 4 bytes, W, X, Y and Z, each with a different value.
Understanding pointers
Pointers are a key part of programming, especially in C. A pointer is a number that is an address in memory, and it is entirely up to us (the programmers) how to interpret the data at that address.
In C, when you cast a pointer to a particular type (such as char * or int *), you tell the computer how to interpret the data at that address. For example, let's declare:
void *p = 0; // p is a pointer to an unknown data type
             // p is a NULL pointer - do not dereference
char *c;     // c is a pointer to a single byte
Note that we cannot read data through p, because we do not know its type. p could point to a number, a letter, the start of a string, your horoscope or an image; we simply do not know how many bytes to read or how to interpret them.
Now suppose we write:
c = (char *)p;
This statement tells the computer that c points to the same place as p, and that the data at that address should be interpreted as a single character (1 byte). In this case c points to memory address 0, that is, to byte W. If we print *c, we get the value stored in W, which is 0x12 in hex (remember that W is a whole byte). This example does not depend on the type of computer; again, all computers agree on what a single byte is (in the past this was not always the case).
This is useful because it works the same on all computers: if we have a pointer to a byte (char *, one byte), we can walk through memory reading one byte at a time. We can go to any location in memory, and the byte storage order will not matter; every computer returns the same information.
So what is the problem?
Problems begin when a computer tries to read more than one byte. Many data types span multiple bytes, such as long integers or floating-point numbers. A single byte has only 256 values and can store the numbers 0 to 255.
Now the problems begin: when you read multibyte data, where is the most significant byte?
- Big-endian machines (storage from most to least significant, "direct order") store the most significant byte first. Looking at a run of bytes, the first byte (at the lowest address) is the most significant.
- Little-endian machines (storage from least to most significant, "reverse order") store the least significant byte first. Looking at a run of bytes, the first byte is the least significant.
The naming makes sense, right? Big-endian storage means the recording starts at the big end and finishes at the little end. (Incidentally, the names big-endian and little-endian are taken from Gulliver's Travels, where the Lilliputians argued over whether to break an egg at the little end or at the big end. Sometimes the debates between computers are just as meaningful :))
I repeat: byte order does not matter as long as you work with a single byte. One byte is just the data you read, and there is only one way to interpret it (again, because computers agree on what one byte is).
Now suppose we have our 4 bytes (W X Y Z), stored the same way on machines of both byte orders: memory location 0 holds W, location 1 holds X, and so on.
We can set this up because the concept of a "byte" is machine-independent: walk through memory one byte at a time and set the values we need. This works on any machine:
c = 0;     // points to position 0 (would not work on a real machine!)
*c = 0x12; // set the value of W
c = 1;     // points to position 1
*c = 0x34; // set the value of X
...        // repeat the same for Y and Z
This code works on any machine and sets the bytes W, X, Y and Z at positions 0, 1, 2 and 3 respectively.
Data interpretation
Now let's look at an example with multibyte data (finally!). Quick recap: a short int is a 2-byte (16-bit) number with values from 0 to 65535 (if unsigned). Let's use one in an example:
short *s; // pointer to a short int (2 bytes)
s = 0;    // points to position 0; *s is the value there
So s is a pointer to a short int, and it points to position 0 (where W is stored). What happens when we read the value at s?
- Big-endian machine: "A short int is two bytes, so I read them: position s is address 0 (W, or 0x12) and position s + 1 is address 1 (X, or 0x34). Since the first byte is the most significant, the number is 256 * byte 0 + byte 1, that is, 256 * W + X, or 0x1234." (It multiplies the first byte by 256, which is 2^8, because that byte is shifted left by 8 bits.)
- Little-endian machine: "I don't know what Mr. Big-Endian is smoking. Sure, a short int is 2 bytes, and I read them exactly the same way: position s holds 0x12 and position s + 1 holds 0x34. But in my world the first byte is the least significant! So the number is byte 0 + 256 * byte 1, that is, W + 256 * X, or 0x3412."
Note that both machines started at position s and read memory sequentially. There is no confusion about what position 0 and position 1 mean, and no confusion about what a short int is.
Do you see the problem now? The big-endian machine thinks s = 0x1234, while the little-endian machine thinks s = 0x3412. Exactly the same data yields two completely different numbers.
And another example
Let's take another example, a 4-byte integer, just for "fun":
int *i; // pointer to an int (4 bytes on a 32-bit machine)
i = 0;  // points to position 0; *i is the value at that address
Again we ask: what value is stored at the address in i?
- Big-endian machine: "An int is 4 bytes, and the first is the most significant. I read 4 bytes (W X Y Z), of which W is the most significant. The number: 0x12345678."
- Little-endian machine: "Sure, an int is 4 bytes, but the last one is the most significant. I read the same 4 bytes (W X Y Z), but W belongs at the end, since it is the least significant. The number: 0x78563412."
Same data, different results: not a pleasant thing.
NUXI Problem
The byte order problem is sometimes called the NUXI problem: the word UNIX stored on a big-endian machine shows up as NUXI on a little-endian one.
Suppose we store the 4 bytes U, N, I and X as two short ints: UN and IX. Each letter occupies a whole byte, as with W X Y Z. To store the two short int values we write:
short *s; // pointer for storing short int values
s = 0;    // point to position 0
*s = UN;  // store the first value: U * 256 + N (pseudocode)
s = 2;    // point to the next position
*s = IX;  // store the second value: I * 256 + X
This code is not specific to any machine. If we store the value "UN" on any machine and read it back, we get "UN" back. Byte ordering does not bother us here: if we store a value on one machine, we get the same value back when reading on that machine.
However, if we walk through memory one byte at a time (using the char * trick), the byte order can differ. On a big-endian machine we see:
Byte: U N I X
Location: 0 1 2 3
That makes sense: "U" is the most significant byte of "UN" and is stored first. Likewise for "IX", where "I" is the most significant byte and is stored first.
On a little-endian machine we would expect to see:
Byte: N U X I
Location: 0 1 2 3
But that also makes sense: "N" is the least significant byte of "UN", so it is stored first. Again, even though the bytes sit "backwards" in memory, the little-endian machine knows they are in little-endian order and interprets them correctly when reading. Also note that we can write hexadecimal constants such as 0x1234 on any machine: a little-endian machine knows what you mean by 0x1234 and will not make you swap the values yourself (when the value is written, the machine arranges the bytes in memory behind the scenes).
The scenario just described is the "NUXI" problem: the sequence UNIX comes out as NUXI on machines of the other byte order. Again, the problem arises only when exchanging data; each machine is internally consistent.
Data exchange between machines with different byte storage order
Computers are now connected: gone are the days when a machine only had to worry about reading its own data. Machines with different byte orders have to exchange data and understand each other somehow. How do they do it?
Solution 1: Use a common format
The simplest approach is to agree on a common format for data transmitted over a network. The standard network byte order is big-endian, but so that little-endian fans don't get upset that their order lost, it is simply called "network order".
To convert data to network byte order, machines call the hton() (host-to-network) functions. On big-endian machines these functions do nothing, but we won't dwell on that here (it might annoy the little-endian machines :)).
It is important to call hton() before sending data even if you are on a big-endian machine. Your program may become very popular and be compiled on all kinds of machines, and you do want your code to be portable (don't you?).
Similarly, the ntoh() (network-to-host) functions are used when reading data from the network; you must use them to interpret network data correctly in the host's byte order. You also need to know the type of the data you receive in order to decode it correctly. The conversion functions are:
htons() - "Host to Network Short"
htonl() - "Host to Network Long"
ntohs() - "Network to Host Short"
ntohl() - "Network to Host Long"
Remember that one byte is one byte and the order does not matter.
These functions are critical when performing low-level network operations, such as verifying IP packet checksums. If you do not understand the byte order problem, your life will be full of pain; take my word for it. Use the conversion functions and know why they are needed.
Solution 2: Using a Byte Order Mark (BOM)
This approach uses a magic number, such as 0xFEFF, placed before each chunk of data. If you read the magic number as 0xFEFF, the data is in the same format as your machine and all is well. If you read it as 0xFFFE, the data was written in the other format and you will need to convert it.
A few things to note. First, the number is not really magical; programmers just use that term for more-or-less arbitrarily chosen values (the BOM could be any sequence of distinct bytes). The mark is called a byte order mark because it shows the order in which the data was saved.
Second, the BOM adds overhead to all transmitted data. Even to send just 2 bytes of information, you must add the 2-byte BOM. Scary, isn't it?
Unicode uses a BOM when storing multibyte data (some Unicode encodings use 2, 3, or even 4 bytes per character). XML avoids this mess by defaulting to UTF-8, which stores Unicode information one byte at a time. Why is that so great?
I repeat for the 56th time: because the byte order problem does not matter for single bytes.
A BOM also brings its own problems. What if you forget to add the BOM? Do you assume the data was sent in your format? Do you read the data and, if it looks "backwards" (whatever that means), try to convert it? What if valid data happens to look like the wrong BOM? These situations are not pleasant.
Why does this problem exist at all? Can’t you just agree?
Ah, what a philosophical question. Each byte order has its advantages. Little-endian machines let you read the lowest byte first without reading the rest, so you can cheaply check whether a number is odd or even (is the lowest bit 1 or 0?), which is handy if you need that check. Big-endian machines store data in memory the way a person reads it (left to right), which makes low-level debugging easier.
So why doesn't everyone simply agree on one system? Why do some computers insist on being different? Let me answer a question with a question: why don't all people speak the same language? Why do some scripts run left to right and others right to left?
Sometimes systems develop independently and only later need to interact.
Epilogue: Parting thoughts
Byte order issues are an instance of a general coding problem: data must represent abstract concepts, and the concept must later be recreated from the data. That topic deserves its own article (or series of articles), but by now you should have a better understanding of the byte ordering problem.