Inside MP3: how is it all arranged?

I once needed to solve what seemed like a simple task: find out the duration of an mp3 file in a PHP script. I had heard about ID3 tags and immediately assumed that the duration is stored either in the tags or in the headers of the mp3 file. A quick search on the Internet showed that this problem could not be solved in a couple of minutes. Since I am quite curious by nature, and time was running out, I decided not to use third-party tools but to figure out one of the most popular formats on my own.
If you are interested in what's inside, read on.
This article does not cover the extraction of ID3v2 tags; that deserves a separate article, since there are various nuances. It also skips the header fields that are practically unused today (for example, the Emphasis field of the mp3 frame header) and the structure of the audio data itself, the part we actually hear from the speakers.
ID3 tags
ID3 (from Identify an MP3) is the metadata format most commonly used in MP3 audio files. An ID3 tag contains data on the name of the track, album, artist name, etc., which are used by multimedia players and other programs, as well as hardware players, to display information about the file and automatically organize the audio collection.
Wikipedia
There are two completely different versions of ID3 data: ID3v1 and ID3v2.
ID3v1 has a fixed size of 128 bytes, which are appended to the end of the mp3 file. It can store the track name, artist, album, year, comment, track number (for version 1.1) and genre.
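To make the fixed 128-byte layout concrete, here is a minimal Python sketch that reads an ID3v1 tag from the tail of a file's bytes. The field widths (30/30/30/4/30/1 after the 3-byte 'TAG' marker) follow the usual ID3v1 layout; the function name `parse_id3v1`, the helper `_text` and the Latin-1 decoding are assumptions made for this illustration, not something taken from the article's script.

```python
import struct
from typing import Optional


def _text(raw: bytes) -> str:
    # Fields are fixed-width and padded with NUL bytes or spaces.
    # Latin-1 is an assumption; real files may use other encodings.
    return raw.split(b"\x00", 1)[0].decode("latin-1").strip()


def parse_id3v1(data: bytes) -> Optional[dict]:
    """Parse an ID3v1 / ID3v1.1 tag from the last 128 bytes of `data`."""
    if len(data) < 128:
        return None
    tag = data[-128:]
    if tag[:3] != b"TAG":  # an ID3v1 tag starts with this marker
        return None
    title, artist, album, year, comment, genre = struct.unpack(
        "<30s30s30s4s30sB", tag[3:]
    )
    track = None
    # ID3v1.1: a zero byte at comment[28] means comment[29] is the track number.
    if comment[28] == 0 and comment[29] != 0:
        track = comment[29]
        comment = comment[:28]
    return {
        "title": _text(title),
        "artist": _text(artist),
        "album": _text(album),
        "year": _text(year),
        "comment": _text(comment),
        "track": track,
        "genre": genre,  # numeric index into the ID3v1 genre list
    }
```

If the function returns a dictionary, the last 128 bytes of the file belong to the tag and should be excluded from the audio size.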

Pretty quickly everyone realized that 128 bytes is far too little space for such data. So, over time, a second version appeared and is now widely used: ID3v2.
Unlike the first version, v2 tags are of variable length and are placed at the beginning of the file, which allows streaming playback. (The ID3v2.4 format also allows data to be stored at the end of the file.)
ID3v2 data consists of a header followed by ID3v2 frames. For example, ID3v2.3 defines more than 70 types of frames.

- Marker. Always equal to 'ID3'.
- Version. There are currently three versions: ID3v2.2, ID3v2.3 and ID3v2.4.
Version v2.2 is considered obsolete.
v2.3 is the most popular version.
v2.4 is gaining popularity. One of its differences from v2.3 is that it allows UTF-8 encoding (and not just UTF-16).
- Flags. Currently, only three bits (5, 6, 7) are used:
bin: %abc00000
a 'Unsynchronization' - used only with MPEG-2 and MPEG-2.5 formats.
b 'Extended header' - indicates the presence of an extended header.
c 'Experimental indicator' - set when the tag is in an experimental stage.
- Length. The peculiarity of the ID3v2 data length is that the 7th (most significant) bit of each byte is not used and is always set to 0.
Consider an example:

In this case, along with the ID3v2 header (10 bytes), ID3v2 data takes up 1024 bytes.
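The length rule is easier to see in code. The sketch below assumes the usual 10-byte ID3v2 header layout (3-byte marker, 2 version bytes, 1 flags byte, 4 length bytes) and decodes the length by taking 7 bits from each byte. The function names are invented for this example, and the length bytes 00 00 07 76 are hypothetical values chosen so that the result matches the 1024 bytes mentioned above; they are not taken from the original illustration.

```python
def synchsafe_to_int(raw: bytes) -> int:
    """Decode an ID3v2 length: only the low 7 bits of each byte count."""
    value = 0
    for byte in raw:
        value = (value << 7) | (byte & 0x7F)
    return value


def id3v2_total_size(data: bytes) -> int:
    """Bytes occupied by an ID3v2 tag at the start of `data`, or 0 if absent.

    The result includes the 10-byte header. The optional footer that
    ID3v2.4 allows is ignored in this sketch.
    """
    if len(data) < 10 or data[:3] != b"ID3":
        return 0
    return 10 + synchsafe_to_int(data[6:10])


# Hypothetical ID3v2.3 header: marker, version 3.0, no flags, length 00 00 07 76
header = b"ID3" + bytes([3, 0]) + bytes([0]) + bytes([0x00, 0x00, 0x07, 0x76])
print(id3v2_total_size(header))  # 7 * 128 + 118 + 10 = 1024
```

The value returned here is also the file offset at which the search for the first mp3 frame can start.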
The tags themselves follow the ID3v2 header. As mentioned above, a detailed analysis of reading ID3v2 tags is left out of this article.
Now that we know whether ID3 tags are present and how long they are, we can move on to parsing the mp3 frame and find out where the duration is stored. And understand everything else along the way.
MP3 frame
The entire mp3 file consists of frames that can only be read sequentially. Each frame contains a header and audio data. Since we are not trying to write firmware for a tape recorder, we are only interested in the frame header.
More about it (a bunch of tables and dry information)
The header size is 4 bytes.

Description:
- [0-10] Marker (Frame sync) - 11 bits, all set to 1
- [11-12] MPEG version index (Audio version ID)

- [13-14] Layer index

By the way, MP3 is MPEG-1 Layer III.
- [15] Protection bit
1 - no protection
0 - the header is protected by a 16-bit CRC (which follows the header)
- [16-19] Bitrate index

The table stores the bitrate in kilobits per second. Note that in this format 1 kilobit = 1000 bits, not 1024. Thus, 96 kbps = 96000 bits/sec.
- [20-21] Sampling rate index

- [22] Padding bit
If set, the frame is padded with 1 extra byte. This is important for calculating the frame size.
- [23] Private bit (for information only)
- [24-25] Channel mode

- [26-27] Channel mode extension (Mode extension). Used only with Joint stereo
- [28] Copyright bit - for information only
- [29] Original bit - for information only
- [30-31] Emphasis - practically unused at present
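The bit layout above maps directly onto shifts and masks. The following Python sketch unpacks the 4-byte header into a dictionary. The lookup tables are the standard MPEG audio values as I know them, filled in here because the article's tables are not reproduced in the text; only Layer III bitrates are included to keep the example short, and the function and constant names are assumptions for illustration.

```python
import struct
from typing import Optional

MPEG_VERSIONS = {0b00: "2.5", 0b10: "2", 0b11: "1"}  # 0b01 is reserved
LAYERS = {0b01: 3, 0b10: 2, 0b11: 1}                 # 0b00 is reserved
# Layer III bitrates in kbps, indexed by the 4-bit bitrate index (1..14).
_V1_L3 = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320]
_V2_L3 = [0, 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160]
BITRATES_L3 = {"1": _V1_L3, "2": _V2_L3, "2.5": _V2_L3}
SAMPLE_RATES = {
    "1": [44100, 48000, 32000],
    "2": [22050, 24000, 16000],
    "2.5": [11025, 12000, 8000],
}
CHANNEL_MODES = ["Stereo", "Joint stereo", "Dual channel", "Mono"]


def parse_frame_header(raw: bytes) -> Optional[dict]:
    """Decode a Layer III frame header; return None if it does not look valid."""
    if len(raw) < 4:
        return None
    (word,) = struct.unpack(">I", raw[:4])
    if word >> 21 != 0x7FF:                   # bits [0-10]: frame sync
        return None
    version = MPEG_VERSIONS.get((word >> 19) & 0b11)   # bits [11-12]
    layer = LAYERS.get((word >> 17) & 0b11)            # bits [13-14]
    bitrate_index = (word >> 12) & 0b1111              # bits [16-19]
    rate_index = (word >> 10) & 0b11                   # bits [20-21]
    # Free-format (0) and reserved index values are rejected for simplicity.
    if version is None or layer != 3:
        return None
    if bitrate_index in (0, 15) or rate_index == 3:
        return None
    return {
        "version": version,
        "layer": layer,
        "crc": ((word >> 16) & 1) == 0,                # bit [15], 0 = CRC present
        "bitrate": BITRATES_L3[version][bitrate_index] * 1000,  # 1 kbit = 1000 bits
        "sample_rate": SAMPLE_RATES[version][rate_index],
        "padding": bool((word >> 9) & 1),              # bit [22]
        "channel_mode": CHANNEL_MODES[(word >> 6) & 0b11],  # bits [24-25]
    }
```

For instance, the bytes FF FB 70 04 decode to MPEG-1 Layer III, 96000 bits/sec, 44100 Hz, Stereo, no CRC and no padding.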
Data compression modes or what is the bitrate
There are 3 modes of data compression:
CBR (constant bitrate) - the bitrate does not change throughout the track.
VBR (variable bitrate) - the bitrate changes constantly throughout the track.
ABR (average bitrate) - this concept is used only when encoding a file. The output is a VBR file.
CBR
If the file is encoded with a constant bitrate, the duration can be calculated with the formula:
Duration = Audio size (in bytes) * 8 / Bitrate (in bits per second!)
For example, a file has a size of 350,670 bytes. It has ID3v1 tags (128 bytes) and ID3v2 tags (1024 bytes). Bitrate = 96 kbps. Therefore, the size of the audio data is 350670 - 128 - 1024 = 349518 bytes.
Duration = 349518 * 8 / 96000 = 29.1265, or about 29 seconds
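The same calculation as a small Python sketch, using the numbers from the example. The function name and parameter names are made up for this illustration.

```python
def cbr_duration_seconds(file_size: int, bitrate_bps: int,
                         id3v1_size: int = 0, id3v2_size: int = 0) -> float:
    """Duration of a constant-bitrate file.

    file_size and the tag sizes are in bytes; bitrate_bps is in bits per
    second (96 kbps is 96000, since 1 kilobit = 1000 bits here).
    """
    audio_bytes = file_size - id3v1_size - id3v2_size
    return audio_bytes * 8 / bitrate_bps


duration = cbr_duration_seconds(350670, 96000, id3v1_size=128, id3v2_size=1024)
print(round(duration, 4), int(duration))
```

The result is slightly optimistic for real files, because the frame headers are counted as audio, but for a duration in whole seconds the difference is usually negligible.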
VBR
First we need to explain how to determine the compression mode. It is simple: if the file is compressed with VBR, a VBR header is added, and its presence tells us that a variable bitrate is used.
There are two kinds of headers: Xing and VBRI.
The Xing header is placed inside the first mp3 frame, after the 4-byte frame header, at an offset that depends on the MPEG version and the channel mode, according to the table:

For example: suppose the ID3v2 tag takes 1024 bytes. If our mp3 file is MPEG-1 with the “Stereo” channel mode, the table offset is 32 bytes, so together with the 4-byte frame header the Xing header begins at 1024 + 4 + 32 = 1060 bytes from the start of the file.
The VBRI header is always placed 32 bytes after the frame header of the first mp3 frame, that is, at +36 bytes from the beginning of the frame.
In both headers the first four bytes contain a token: 'Xing' or 'Info' for Xing, and 'VBRI' for VBRI.
These VBR headers are variable in length and contain various information about how the file was encoded. More details on the structure of VBR headers (and not only that) can be found, for example, here.
I will only cover what interests us right now, namely the number of frames (Number of Frames). This number is 4 bytes long.
In the Xing header it is located at an offset of +8 bytes from the beginning of the header; in VBRI, at +14 bytes from the beginning of the header.
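Here is a rough Python sketch of that lookup. It makes two assumptions that should be stated loudly: the table offsets are counted from the end of the 4-byte frame header (so the token of an MPEG-1 stereo file sits at byte 36 of the frame), and the Xing frame count is only present when bit 0 of the 4-byte flags field that follows the token is set. The function names and the `xing_offset` parameter are invented for this example.

```python
import struct
from typing import Optional

FRAME_HEADER_SIZE = 4


def _read_u32(buf: bytes, pos: int) -> Optional[int]:
    # Big-endian unsigned 32-bit integer, or None if the buffer is too short.
    if pos < 0 or pos + 4 > len(buf):
        return None
    return struct.unpack(">I", buf[pos:pos + 4])[0]


def vbr_frame_count(frame: bytes, xing_offset: int) -> Optional[int]:
    """Return the frame count from a Xing/Info or VBRI header, if present.

    `frame` holds the first mp3 frame starting at its 4-byte header.
    `xing_offset` is the table value for this file (e.g. 32 for MPEG-1
    stereo), assumed to be counted from the end of the frame header.
    """
    xing_at = FRAME_HEADER_SIZE + xing_offset
    if frame[xing_at:xing_at + 4] in (b"Xing", b"Info"):
        flags = _read_u32(frame, xing_at + 4)
        if flags is None or not flags & 0x1:   # bit 0: frame count present
            return None
        return _read_u32(frame, xing_at + 8)   # +8 from the start of the header
    vbri_at = FRAME_HEADER_SIZE + 32
    if frame[vbri_at:vbri_at + 4] == b"VBRI":
        return _read_u32(frame, vbri_at + 14)  # +14 from the start of the header
    return None                                # no VBR header found
```

A `None` result means no VBR header with a frame count was found, in which case the constant-bitrate formula is the fallback.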
Using the Samples Per Frame table, we can get the duration of an mp3 file encoded with a variable bitrate.

Duration = Number of frames * Samples per frame / Sample rate
For example: the VBRI header gives the number of frames as 1118, samples per frame = 1152, sampling frequency = 44100.
Duration = 1118 * 1152 / 44100 = 29.205, or about 29 seconds.
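The formula can be sketched in Python as follows. The samples-per-frame values are the standard ones for MPEG audio as I know them (the article's own table is not reproduced in the text), and the names are assumptions for this illustration.

```python
# Samples per frame, keyed by (MPEG version, layer).
SAMPLES_PER_FRAME = {
    ("1", 1): 384, ("1", 2): 1152, ("1", 3): 1152,
    ("2", 1): 384, ("2", 2): 1152, ("2", 3): 576,
    ("2.5", 1): 384, ("2.5", 2): 1152, ("2.5", 3): 576,
}


def vbr_duration_seconds(frame_count: int, sample_rate: int,
                         version: str = "1", layer: int = 3) -> float:
    """Duration = number of frames * samples per frame / sample rate."""
    return frame_count * SAMPLES_PER_FRAME[(version, layer)] / sample_rate


duration = vbr_duration_seconds(1118, 44100)  # MPEG-1 Layer III by default
print(round(duration, 3), int(duration))
```

Note that an MPEG-2 or MPEG-2.5 Layer III file has 576 samples per frame, so the same frame count at the same sample rate gives half the duration.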
That's all for today. If it was useful to someone, thank you.
For those who want to dig inside mp3 right away: here is a PHP script that I wrote for myself alongside this article, plus four small mp3 files for testing.
References
id3.org - Read about ID3
id3.org - and something about mp3 frame
Pretty detailed about mp3 frame
getID3: a good library for getting information about mp3 (PHP)