zencd September 3, 2012 at 06:43

Computing an MD5 hash for an MP3 file

So, we need to calculate a hash sum for an MP3 file. Simply running the file through md5.exe will not do, because the file contains meta-information: tags, which tend to change over time. Just updating the tags in the file gives a different hash sum, which is no good at all.

By the way, for the FLAC and APE formats this problem is practically absent, since such files usually already contain a hash sum of the audio data, written by the encoder. For FLAC, the value can be obtained with the command metaflac --show-md5sum.
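To show where that FLAC value lives, here is a minimal Python sketch that reads it straight from the file. It assumes the stream starts with the "fLaC" marker (no ID3v2 tag prepended) and that the first metadata block is the 34-byte STREAMINFO block, whose last 16 bytes hold the MD5 of the decoded audio. The function name is made up for this example.

```python
def flac_audio_md5(data: bytes) -> str:
    """Return the hex MD5 of the audio as stored in the STREAMINFO block."""
    if data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream (no 'fLaC' marker at offset 0)")
    if len(data) < 4 + 4 + 34:
        raise ValueError("file too short to hold STREAMINFO")
    block_type = data[4] & 0x7F                   # top bit is the last-block flag
    block_len = int.from_bytes(data[5:8], "big")  # 24-bit block length
    if block_type != 0 or block_len != 34:
        raise ValueError("first metadata block is not STREAMINFO")
    streaminfo = data[8:8 + 34]
    # The first 18 bytes hold block sizes, frame sizes, sample rate, channels,
    # bit depth and sample count; the remaining 16 bytes are the MD5.
    return streaminfo[18:34].hex()
```

An all-zero value means the encoder did not store a hash.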

What follows is a fairly reliable way to calculate a (non-perceptual) hash from the binary data stored in an MP3 file.

1) Approach No. 1
2) Approach No. 2
3) Xing and Lame tags
4) Resync
5) Reliability of the calculation

Approach No. 1. Removing the unnecessary

The idea is that if you remove everything unnecessary from the file (the tags), only the necessary information remains: the audio data, from which the hash can be calculated.

The structure of an MP3 file:
- ID3v2 tag
- MPEG frames - the audio data itself
- Lyrics tag
- APE tag
- ID3v1 tag (at the very end)

(All tags are optional.)

ID3v2, unlike its predecessor, sits at the beginning of the file, which lets a client read the meta-information right away when, for example, the file is streamed over the network. It starts with the three ASCII characters “ID3”, followed by the version and flags bytes and then the encoded tag length, stored as four bytes carrying 7 bits each:

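# buf: the first bytes of the file, read in binary mode (Python 3 bytes)
audio_start = 0  # no ID3v2 tag: audio starts at offset 0
if buf[0:3] == b'ID3':
    # 10-byte header, plus a 10-byte footer if flag bit 0x10 is set
    id3_v2_len = 20 if buf[5] & 0x10 else 10
    # tag size: four bytes, 7 significant bits each
    id3_v2_len += ((buf[6] * 128 + buf[7]) * 128 + buf[8]) * 128 + buf[9]
    audio_start = id3_v2_len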

Next come the audio frames. Their start can be spotted by eye if you open the file in the Windows-1251 encoding and look for the characters "яы" (bytes 0xFF 0xFB, a typical start of an MPEG-1 Layer III frame header).

Now let's go from the end. ID3v1 is recognized as a 128-byte block at the very end of the file, starting with the ASCII string “TAG”. If you then search backwards from the end for “LYRICSBEGIN”, you can find the Lyrics3 tag, and “APETAGEX” marks the APEv2 tag.

If all of this is cut out, only the audio data should remain. This approach is followed by the mp3tag.de program and by a bunch of private scripts and tagging libraries, most of which handle only ID3, which is of course a limitation.
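A minimal Python sketch of this tag-stripping idea follows. It handles only a leading ID3v2 tag and a trailing ID3v1 tag (Lyrics3 and APE tags are left out for brevity) and assumes the tags are well-formed. The function name is hypothetical and this is not the code of any of the programs mentioned.

```python
import hashlib

def strip_id3_and_hash(data: bytes) -> str:
    """MD5 of what remains after cutting a leading ID3v2 and a trailing ID3v1 tag."""
    start, end = 0, len(data)
    # Leading ID3v2: 10-byte header, 4 x 7-bit size, optional 10-byte footer.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for b in data[6:10]:
            size = size * 128 + (b & 0x7F)
        start = min(end, 10 + size + (10 if data[5] & 0x10 else 0))
    # Trailing ID3v1: fixed 128-byte block that starts with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return hashlib.md5(data[start:end]).hexdigest()
```

With clean tags the result does not depend on the tag contents. Any garbage left between the tags and the audio still ends up in the hash.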

The bad news is that tags can be, and often are, broken, written on top of each other, and so on. With this approach, heaps of garbage are mistaken for audio data, so you get one hash sum now and a different one after the tags are changed, which is unacceptable.

As a result, I had to throw away the program I had written this way after it collided with reality.

Approach No. 2. Keeping what is needed

MP3 players work the other way around: they pick out what they are interested in, the MPEG frames, and skip everything that does not look like a frame. They do it very successfully, as you do not usually hear glitches on “bad” files. It is wise to do the same.
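The frame-picking idea can be sketched in Python roughly as follows. This is a simplified illustration, not how any particular player or library does it: it recognizes only MPEG-1 Layer III headers, does not check that the next frame follows, and can be fooled by tag bytes that happen to look like a header. All names are made up for the example.

```python
import hashlib

# MPEG-1 Layer III bitrates (kbit/s) by header index; index 0 is "free format".
BITRATES_KBPS = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320]
SAMPLE_RATES = [44100, 48000, 32000]

def frame_length(header: bytes) -> int:
    """Length in bytes of an MPEG-1 Layer III frame, or 0 if this is not one."""
    # 11 sync bits, version bits 11 (MPEG-1), layer bits 01 (Layer III).
    if len(header) < 4 or header[0] != 0xFF or (header[1] & 0xFE) != 0xFA:
        return 0
    bitrate_idx = header[2] >> 4
    rate_idx = (header[2] >> 2) & 0x03
    padding = (header[2] >> 1) & 0x01
    if bitrate_idx in (0, 15) or rate_idx == 3:
        return 0
    return 144 * BITRATES_KBPS[bitrate_idx] * 1000 // SAMPLE_RATES[rate_idx] + padding

def hash_frames(data: bytes) -> str:
    """MD5 over everything that parses as a frame; all other bytes are skipped."""
    md5 = hashlib.md5()
    pos = 0
    while pos + 4 <= len(data):
        n = frame_length(data[pos:pos + 4])
        if n and pos + n <= len(data):
            md5.update(data[pos:pos + n])
            pos += n
        else:
            pos += 1  # not a frame here: move one byte forward and try again
    return md5.hexdigest()
```

A real decoder validates candidates much more carefully, which is why relying on a well-tested library is the better option.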

It looks like the foobar2000 player does this, and in my estimation it works perfectly, but it is not obvious how to put it to use in this case.

MPlayer should do the same, but there are doubts, because in practice it sometimes stumbles on incorrect tags and leaves them in the output. Its command for cleaning a file is mplayer in.mp3 -dumpaudio -dumpfile out.mp3.

There are also media libraries with MP3 decoders: mad, gstreamer and libmpg123, which various tools use more or less interchangeably. I have not tried the first two, but libmpg123 worked great. Its code has been tested over the years by a lot of projects, and it proved to be of high quality in my own investigation and comparisons. Its doc/examples directory contains the source code of a tiny program with the self-explanatory name extract_frames.c. The program takes the original MP3 file as input and writes clean audio frames to the output.

libmpg123 compiles without problems under cygwin and mingw (although the mingw build is somewhat buggy with stdin / stdout, so I had to fix the source to open the file in binary mode myself). I changed the program slightly so that it prints the MD5 directly instead of the frames, and made a couple of changes described below. Source code for anyone interested:

dl.dropbox.com/u/1883230/my/habr/mp3hash.zip

Xing and Lame Tags

But the meta-information we want to get rid of can also be stored inside audio frames. These are the Xing and LAME tags, which encode information that is superfluous for us: data used to optimize seeking within a VBR stream, as well as the parameters used for encoding. In general, you could leave the Xing and LAME tags in, since few people can or will change them, but if you ever run the “utilities / fix vbr mp3 header” operation in foobar2000, the hash sum of the file will change. So it is better to discard this metadata. You can stop these tags from being included in the hash by passing the following parameter to libmpg123:

// "remove ignore" looks a bit confusing, but it works
ret = mpg123_param(m, MPG123_REMOVE_FLAGS, MPG123_IGNORE_INFOFRAME, 0.);
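For illustration only, here is roughly how such an info frame can be recognized. It is a Python sketch, not libmpg123's implementation. It assumes an MPEG-1 Layer III frame without CRC, where the side information after the 4-byte header takes 32 bytes (17 for mono) and is followed by the marker “Xing” or “Info”. The function name is hypothetical.

```python
def is_info_frame(frame: bytes) -> bool:
    """True if an MPEG-1 Layer III frame (no CRC) carries a Xing/Info header."""
    if len(frame) < 4:
        return False
    mono = (frame[3] >> 6) == 3        # channel mode bits 11 = single channel
    offset = 4 + (17 if mono else 32)  # frame header + side information
    return frame[offset:offset + 4] in (b"Xing", b"Info")
```

A hashing loop would simply skip any frame for which this returns True.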

Resync

It also proved useful to remove the resync limit. If this is not done, the program “stumbles” when it goes a long stretch (4 KB) without meeting an audio frame, which happens with files where, for example, the ID3v2 tag contains a large image. In my version of the program the hash sum comes out the same, but the error flag that gets raised spoils everything: you can no longer be sure that the result was obtained without errors. With this parameter, everything is fine:

mpg123_param(m, MPG123_RESYNC_LIMIT, -1, 0.0);
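The effect of such a limit can be pictured with a small Python sketch. This is a conceptual illustration under my own assumptions, not libmpg123's logic: the function name is made up, and the default of 4096 only mirrors the 4 KB figure above.

```python
def find_sync(data: bytes, start: int, resync_limit: int = 4096) -> int:
    """Offset of the next frame sync at or after start, or -1 if none is found.

    Gives up after skipping more than resync_limit bytes;
    a negative resync_limit means "never give up".
    """
    pos = start
    while pos + 1 < len(data):
        if resync_limit >= 0 and pos - start > resync_limit:
            return -1  # too much non-audio data in a row: report failure
        # Frame sync: 11 set bits (0xFF followed by a byte with its top 3 bits set).
        if data[pos] == 0xFF and (data[pos + 1] & 0xE0) == 0xE0:
            return pos
        pos += 1
    return -1
```

With a 5000-byte block of non-audio data in front of the first frame, the limited search gives up, while the unlimited one (limit -1) finds the frame.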

Reliability of the Calculation

In my limited view, foobar2000 does the job (getting rid of the meta-information) perfectly. The patched extract_frames.c fails on rare files, but after the “rebuild stream” operation in foobar2000, 95 out of 100 such cases are hashed correctly (that is, they match foobar2000). Next, doing a little worse, comes mplayer: it almost always agrees with extract_frames (in the mode that includes the LAME / Xing frames, of course), but, as I already wrote, it sometimes stumbles on garbage tags. Further behind still are the various taggers, which need fairly correct tags and are hardly suitable for hashing when more stable alternatives are available.

Overall, after one major failure and a struggle with a couple of details, I was personally satisfied with this algorithm, having checked it on a bunch of files.
