Rembish October 28, 2009 at 19:31

RAR: getting a list of files without PECL

Not long ago I wrote about extracting text from various file formats, such as DOC and PDF. Today we will look at an equally interesting format: the RAR archive format. I won't raise false hopes: today we will only read the list of files, without any additional PHP extensions. If that interests you, read on...

RAR is a good "bad" archiver

Let me remind you that RAR was developed by our compatriot Eugene Roshal, from whom it takes its name: Roshal ARchiver. The format is closed, which has done nothing to hinder its spread both in Russia and around the world. Almost every workstation I have seen had the RAR archiver installed ~~and sometimes cracked~~.

Over its lifetime the archiver has reached version 3 (and I assume version 4 is not far off), which broke most "home-made" unpackers: the third version introduced new compression algorithms, and those unpackers fell into paranoia and heresy. However, the developer's site contains plenty of source code for unpacking RAR files on various platforms and development environments.

As for PHP, the PECL extension has only just reached its "stable" first version and is rarely installed by hosting providers. The extension, by the way, uses the very same "unrar" whose source code is available on the program's website. Moreover, I honestly admit that I could not get the extension to work under PHP 5.3 (on Windows); php_rar.dll did work under 5.2.11, but most archives could not be read. I wouldn't be surprised if every compiled build of the library for Windows targeted "some other" version, but I didn't feel like compiling it myself... so one evening I sat down to see what unrar.dll, which can be built from the sources on the site, actually is.

RAR as it is

Because the format is closed, documentation on it is scarce, even though source code for unpacking the data is available. That is no surprise: few people would want to wade through about 600 KB of source code. Nevertheless, there are enthusiasts (heaven forbid you thought I meant myself :), and so the UniquE RAR File Library project appeared at one point. It cut the source code needed to unpack files created by version 2 of the archiver several-fold.

So I got hold of the sources of that library, along with minimal, but at least some, documentation for the older 2.02 version of the archiver. Well, let's dive into what RAR archives look like.

A RAR archive consists of variable-length blocks, each starting with a 7-byte header. Any archive contains at least two blocks: MARK_HEAD and MAIN_HEAD. The first tells us that we are dealing with RAR and looks like "52 61 72 21 1a 07 00" in hex. The third byte, 0x72, indicates that this is a Marker Header. The last two bytes, 07 00, form a little-endian word (0x0007) that holds the length of the block: those same 7 bytes.
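To make the header layout concrete, here is a minimal sketch that decodes the 7-byte header of the marker block. The function name rar_parseBlockHeader is hypothetical, and labelling the first two bytes as a header CRC is an assumption taken from commonly circulated format notes rather than from this article; only the type byte, the flags word and the size word are used in the rest of the text.

```php
<?php
// Hypothetical helper (not part of the article's code): decode the 7-byte
// header that starts every block.
function rar_parseBlockHeader($bytes) {
    if (strlen($bytes) < 7) {
        return false;
    }
    // v = unsigned 16-bit little-endian word, C = unsigned byte.
    // "crc" for bytes 0-1 is an assumed label; type is byte 2,
    // flags is the word at offset 3, size is the word at offset 5.
    return unpack("vcrc/Ctype/vflags/vsize", substr($bytes, 0, 7));
}

// The marker block from the text: 52 61 72 21 1a 07 00
$marker = pack("H*", "526172211a0700");
$header = rar_parseBlockHeader($marker);
printf("type=0x%02x size=%d\n", $header["type"], $header["size"]);
```

Using unpack() with the "v" code avoids reversing the bytes by hand, because it already interprets the word as little-endian.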

The second block, the Main Header, begins immediately after the first; it must be 13 bytes long and have a type byte equal to 0x73. After it the actual data begins: a compressed file (marked by 0x74 in the third byte of the block header), an archive comment, additional information or, for example, a recovery record.

The algorithm for obtaining the list of files is not complicated (leaving aside archives with an encrypted directory structure, which are beyond the scope of this article).

Read the first seven bytes of the block header, find the header length there, and read the header to the end;
Check whether the block is a "file" block;
If it is, the DWORD at offset 7 is the size of the packed file (and also the amount of data to skip before the next block), the next double word is the size of the original file, offset 28 holds the file attributes (DWORD), and offsets 26 and 32 hold the length of the file name (2 bytes) and the name itself. The header also contains the creation date, the code of the OS on which the file was created, and the CRC;
If the block is not a "file" block, read the word at offset 3 and check its 15th bit (mask 0x8000), which signals that additional data follows the block. If that bit is set, skip ADD_SIZE bytes (ADD_SIZE is the first double word after the 7-byte block header);
Repeat until the end of the file...
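The offsets from the steps above can be tried out on a hand-built block. This is only a sketch: rar_parseFileBlock and rar_bytesToSkip are hypothetical names, the sample block is synthetic, and the bytes the article does not describe are simply zero-filled and ignored.

```php
<?php
// Hypothetical helpers (not from the article's code).

// Decode the fields of a complete "file" block header (type byte 0x74).
function rar_parseFileBlock($block) {
    $sizes = unpack("VpackSize/VunpSize", substr($block, 7, 8));
    $crc   = unpack("Vcrc", substr($block, 16, 4));
    $tail  = unpack("vnameSize/Vattr", substr($block, 26, 6));

    return array(
        "packSize" => $sizes["packSize"],
        "unpSize"  => $sizes["unpSize"],
        "crc"      => $crc["crc"],
        "attr"     => $tail["attr"],
        "name"     => substr($block, 32, $tail["nameSize"]),
    );
}

// Number of data bytes that follow the block header and must be skipped.
function rar_bytesToSkip($block) {
    $type  = ord($block[2]);
    $flags = unpack("vflags", substr($block, 3, 2));
    if ($type == 0x74 || ($flags["flags"] & 0x8000)) {
        // For a file block this dword is the packed size, otherwise ADD_SIZE.
        $add = unpack("Vsize", substr($block, 7, 4));
        return $add["size"];
    }
    return 0;
}

// Synthetic file block for illustration only.
$name  = "readme.txt";
$block = pack("vCvv", 0, 0x74, 0x8000, 32 + strlen($name)) // 7-byte header
       . pack("VV", 1234, 5678)            // packed size, original size
       . "\0"                              // offset 15, not used here
       . pack("V", 0x12345678)             // CRC at offset 16
       . str_repeat("\0", 6)               // offsets 20-25, not used here
       . pack("vV", strlen($name), 0x20)   // name length, attributes
       . $name;

$file = rar_parseFileBlock($block);
printf("%s: %d -> %d bytes, skip %d\n",
    $file["name"], $file["packSize"], $file["unpSize"], rar_bytesToSkip($block));
```

On a 32-bit PHP build the "V" code can yield negative numbers for values above 0x7FFFFFFF, so very large sizes or CRCs would need extra care there.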

Difficult? Not really, at least compared with .doc files.

Source

// Reads the list of files from the RAR archive $filename without using
// the PECL rar extension.
function rar_getFileList($filename) {
    // Helper that reads $count bytes from a string as a little-endian number.
    // It is declared inside the parent function to keep it next to its only
    // caller; PHP still registers it globally, hence the function_exists() guard.
    if (!function_exists("temp_getBytes")) {
        function temp_getBytes($data, $from, $count) {
            $string = substr($data, $from, $count);
            $string = strrev($string);

            return hexdec(bin2hex($string));
        }
    }

    // Try to open the file
    $id = fopen($filename, "rb");
    if (!$id)
        return false;

    // Check that the file is a RAR archive
    $markHead = fread($id, 7);
    if (bin2hex($markHead) != "526172211a0700") {
        fclose($id);
        return false;
    }

    // Try to read the MAIN_HEAD block
    $mainHead = fread($id, 7);
    if (strlen($mainHead) < 7 || ord($mainHead[2]) != 0x73) {
        fclose($id);
        return false;
    }
    $headSize = temp_getBytes($mainHead, 5, 2);

    // Move to the first "significant" block in the file
    fseek($id, $headSize - 7, SEEK_CUR);

    $files = array();
    while (!feof($id)) {
        // Read the block header; stop on a truncated read
        $block = fread($id, 7);
        if (strlen($block) < 7)
            break;
        $headSize = temp_getBytes($block, 5, 2);
        if ($headSize <= 7)
            break;

        // Read the rest of the block header, as given by its length
        $block .= fread($id, $headSize - 7);
        // If this is a file block, process it
        if (ord($block[2]) == 0x74) {
            // See how much space the packed file takes in the archive and
            // move past it to the next block.
            $packSize = temp_getBytes($block, 7, 4);
            fseek($id, $packSize, SEEK_CUR);

            // Read file attributes: r - read only, h - hidden,
            // s - system, d - directory, a - archived
            $attr = temp_getBytes($block, 28, 4);
            $attributes = "";
            if ($attr & 0x01)
                $attributes .= "r";
            if ($attr & 0x02)
                $attributes .= "h";
            if ($attr & 0x04)
                $attributes .= "s";
            if (($attr & 0x10) || ($attr & 0x4000))
                $attributes .= "d";
            if ($attr & 0x20)
                $attributes .= "a";

            // Store the file name, sizes after and before packing, CRC and attributes
            $files[] = array(
                "filename" => substr($block, 32, temp_getBytes($block, 26, 2)),
                "size_compressed" => $packSize,
                "size_uncompressed" => temp_getBytes($block, 11, 4),
                "crc" => temp_getBytes($block, 16, 4),
                "attributes" => $attributes,
            );
        } else {
            // If this is not a file block, skip it, taking into account
            // the possible extra offset ADD_SIZE
            $flags = temp_getBytes($block, 3, 2);
            if ($flags & 0x8000) {
                $addSize = temp_getBytes($block, 7, 4);
                fseek($id, $addSize, SEEK_CUR);
            }
        }
    }
    fclose($id);

    // Return the list of files
    return $files;
}
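As a usage illustration, the array returned by rar_getFileList() can be turned into a simple listing. The helper rar_formatListing below is hypothetical, and the sample data merely mimics the shape of the function's result so that the sketch runs on its own.

```php
<?php
// Hypothetical helper: turn the array returned by rar_getFileList() into
// tab-separated lines (attributes, original size, packed size, CRC, name).
function rar_formatListing($files) {
    if ($files === false) {
        return "not a readable RAR archive";
    }
    $lines = array();
    foreach ($files as $file) {
        $lines[] = sprintf(
            "%s\t%d\t%d\t%08x\t%s",
            $file["attributes"],
            $file["size_uncompressed"],
            $file["size_compressed"],
            $file["crc"],
            $file["filename"]
        );
    }
    return implode("\n", $lines);
}

// Sample data in the same shape; with a real archive you would pass
// rar_getFileList("archive.rar") instead ("archive.rar" is a placeholder).
$sample = array(
    array(
        "filename" => "readme.txt",
        "size_compressed" => 1234,
        "size_uncompressed" => 5678,
        "crc" => 0x12345678,
        "attributes" => "a",
    ),
);
echo rar_formatListing($sample), "\n";
```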

The commented code is also available on GitHub.

Prospects

As for actually extracting files from archives: in theory this can be done in PHP by porting the UniquE library, but that only works for archives created by versions before 2.90. The library cannot read newer archives, and you can imagine what it takes to work through 500 kilobytes of source code.
