Methods for detecting "glued" files

Many people have heard of files such as rarjpeg: a JPEG image and a RAR archive glued together into a single file. Such a file is an excellent container for concealing the very fact that information is being transferred. You can create a rarjpeg with the following commands:
UNIX: cat image1.jpg archive.rar > image2.jpg
WINDOWS: copy /b image1.jpg + archive.rar image2.jpg
Alternatively, append the archive bytes to the image in a hex editor.
Of course, JPEG is not the only format that can hide a transfer; many others work as well. Each format has its own characteristics that make it more or less suitable as a container. Below I describe how to find glued files in the most popular formats, or at least how to spot signs of gluing.
Methods for detecting glued files can be divided into three groups:
- Checking the area after the EOF marker. Many popular file formats have an end-of-file marker that tells a reader where the meaningful data ends. Image viewers, for example, read bytes only up to this marker and ignore everything after it. This method suits JPEG, PNG, GIF, ZIP, RAR and PDF.
- Checking the file size. The structure of some formats (audio and video containers) lets you calculate the size the file should have and compare it with its actual size. Formats: AVI, WAV, MP4, MOV.
- Checking CFB files. CFB (Compound File Binary Format) is a document format developed by Microsoft; it is a container with its own file system. This method looks for anomalies in the file.
Is there life after the end of the file?
JPEG
To answer this question, we need to look at the specification of the format that started the glued-file family and understand its structure. Every JPEG begins with the signature 0xFF 0xD8.
After the signature come service information, an optional thumbnail and, finally, the compressed image itself. The end of the image is marked by the two-byte signature 0xFF 0xD9.
PNG
The first eight bytes of a PNG file are the signature 0x89 0x50 0x4E 0x47 0x0D 0x0A 0x1A 0x0A. The data stream ends with the signature 0x49 0x45 0x4E 0x44 0xAE 0x42 0x60 0x82 (the IEND chunk type followed by its CRC).
RAR
Common signature for all RAR archives: 0x52 0x61 0x72 0x21 (Rar!). It is followed by the archive version and other related data. Experimentally, archives in the RAR 4.x format were found to end with the signature 0xC4 0x3D 0x7B 0x00 0x40 0x07 0x00.
Table of formats and their signatures:
Format | Initial signature | End signature |
---|---|---|
JPEG | 0xFF 0xD8 | 0xFF 0xD9 |
PNG | 0x89 0x50 0x4E 0x47 0x0D 0x0A 0x1A 0x0A | 0x49 0x45 0x4E 0x44 0xAE 0x42 0x60 0x82 |
RAR | 0x52 0x61 0x72 0x21 | 0xC4 0x3D 0x7B 0x00 0x40 0x07 0x00 |
The check for these formats is straightforward:
- Find the initial signature;
- Find the end signature;
- If no data follows the end signature, the file is clean and contains no attachments. Otherwise, search the data after the end signature for signatures of other formats.
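The three steps above can be sketched in Python as follows. This is a minimal sketch rather than a complete detector: the names `SIGNATURES` and `trailing_data` are my own, only the JPEG and PNG signatures from the table are included, and the search stops at the first end signature, so a JPEG with an embedded thumbnail (which carries its own 0xFF 0xD9) may be flagged falsely.

```python
# Start and end signatures taken from the table above.
SIGNATURES = {
    "jpeg": (b"\xFF\xD8", b"\xFF\xD9"),
    "png": (b"\x89PNG\r\n\x1a\n", b"IEND\xAE\x42\x60\x82"),
}


def trailing_data(data: bytes, fmt: str) -> bytes:
    """Return the bytes that follow the first end signature (empty if none)."""
    start_sig, end_sig = SIGNATURES[fmt]
    # Step 1: the file must begin with the initial signature.
    if not data.startswith(start_sig):
        raise ValueError(f"not a {fmt} file: initial signature missing")
    # Step 2: look for the end signature after the initial one.
    end = data.find(end_sig, len(start_sig))
    if end == -1:
        raise ValueError("end signature not found")
    # Step 3: whatever follows the end signature is a candidate attachment.
    return data[end + len(end_sig):]
```

An empty result means the file is clean by this check; a non-empty result is the region that should be searched for signatures of other formats.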
GIF and PDF
Format | Initial Signature | End signature |
---|---|---|
GIF | 0x47 0x49 0x46 0x38 | 0x00 0x3B |
PDF | 0x25 0x50 0x44 0x46 | 0x0A 0x25 0x25 0x45 0x4F 0x46 |
In these formats the end signature can occur more than once in a valid file, so the algorithm is slightly different:
- Find the initial signature, as in the previous algorithm;
- Find the end signature, as in the previous algorithm;
- Each time you find an end signature, remember its position and keep searching;
- If the last end signature found in this way sits at the very end of the file, the file is clean;
- If the file does not end with an end signature, go back to the position of the last end signature found.
A large difference between the file size and the position just after the last end signature indicates a glued attachment. A threshold of ten bytes is a reasonable default, although other values can be used.
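A rough Python sketch of this check is shown below. The function names and the `threshold` parameter are illustrative; the default of ten bytes is the value mentioned above. Searching backwards with `rfind` gives the same position as walking forward and remembering the last match. Note the inherent limitation: if the glued data itself happens to contain the end signature, the last match moves into the attachment.

```python
GIF_END = b"\x00\x3B"
PDF_END = b"\x0A%%EOF"  # 0x0A 0x25 0x25 0x45 0x4F 0x46


def tail_length(data: bytes, end_sig: bytes) -> int:
    """Number of bytes between the last end signature and the end of the file."""
    last = data.rfind(end_sig)
    if last == -1:
        raise ValueError("end signature not found")
    return len(data) - (last + len(end_sig))


def looks_glued(data: bytes, end_sig: bytes, threshold: int = 10) -> bool:
    """True if more than `threshold` bytes follow the last end signature."""
    return tail_length(data, end_sig) > threshold
```

For example, a PDF that ends with `%%EOF` and a single newline has a tail of one byte and passes, while the same file with a hundred extra bytes appended is flagged.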
ZIP
A peculiarity of ZIP archives is that they use three different signatures:
Signature | Description |
---|---|
0x50 0x4B 0x03 0x04 | Normal archive signature |
0x50 0x4B 0x05 0x06 | Empty archive signature (end of central directory record) |
0x50 0x4B 0x07 0x08 | Spanned (multi-volume) archive signature |
The overall structure of a ZIP archive is as follows:
Archive structure |
---|
Local File Header 1 |
File data 1 |
Data Descriptor 1 |
Local File Header 2 |
File data 2 |
Data Descriptor 2 |
... |
Local File Header n |
File data n |
Data Descriptor n |
Archive decryption header |
Archive extra data record |
Central directory |
To check a ZIP archive, find the end signature of the central directory (0x50 0x4B 0x05 0x06), skip the 18 bytes that follow it, and look for signatures of known formats in the comment area. An unusually large comment is also a sign of gluing.
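The ZIP check can be sketched in Python like this. The function name `zip_comment_and_tail` is my own. The sketch relies on the layout of the end of central directory record, in which the last two of the 18 bytes after the signature hold the comment length (little-endian). It takes the last occurrence of the signature, so a second archive glued after the first one would hide the gap; treat it as an illustration, not a robust parser.

```python
import struct

EOCD_SIG = b"PK\x05\x06"  # 0x50 0x4B 0x05 0x06


def zip_comment_and_tail(data: bytes) -> tuple[bytes, bytes]:
    """Return (comment, tail): the archive comment and any bytes after it."""
    pos = data.rfind(EOCD_SIG)
    if pos == -1:
        raise ValueError("end of central directory record not found")
    # 4 signature bytes + 16 bytes of fields, then the 2-byte comment length.
    (comment_len,) = struct.unpack_from("<H", data, pos + 20)
    comment_start = pos + 4 + 18
    comment = data[comment_start:comment_start + comment_len]
    tail = data[comment_start + comment_len:]
    return comment, tail
```

Both returned values are worth scanning for known signatures: an attachment can sit inside an oversized comment or simply follow the declared comment.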
Size matters
AVI
An AVI file is structured as follows: every file begins with the RIFF signature (0x52 0x49 0x46 0x46). At offset 8 there is the AVI signature that specifies the format (0x41 0x56 0x49 0x20). The 4-byte block at offset 4 holds the size of the data that follows it (little-endian byte order). Adding the size of the header (8 bytes) to the value stored in bytes 4-8 gives the offset at which the data ends, that is, the full file size. For AVI the calculated size may be smaller than the actual file size, but only zero bytes, used to align the file to a 1 KB boundary, may follow the calculated size.
Size calculation example:

Offset | Size | Next offset |
---|---|---|
4 | 31442 | 8 + 31442 = 31450 |
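The same calculation as a small Python sketch (the function name `riff_declared_size` is illustrative). It reads the 4-byte little-endian value at offset 4 and adds the 8-byte header, exactly as in the table.

```python
import struct


def riff_declared_size(data: bytes) -> int:
    """Full file size implied by the RIFF header: 8 + value at offset 4."""
    if data[:4] != b"RIFF":
        raise ValueError("RIFF signature missing")
    # Little-endian unsigned 32-bit size stored in bytes 4-8.
    (chunk_size,) = struct.unpack_from("<I", data, 4)
    return 8 + chunk_size
```

With the value from the table (31442 at offset 4), the function returns 31450.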
WAV
Like AVI, a WAV file starts with the RIFF signature, but at offset 8 it has the signature WAVE (0x57 0x41 0x56 0x45). The file size is calculated in the same way as for AVI, and the actual size must match the calculated size exactly.
MP4
MP4 (MPEG-4) is a media container format for storing video and audio streams; it can also hold subtitles and images.
At offset 4 come two signatures: the file type ftyp (66 74 79 70) (QuickTime Container File Type) and the file subtype mmp4 (6D 6D 70 34). For detecting hidden files, what matters is that the file size can be calculated.

Consider an example. The size of the first block is stored at offset 0 and equals 28 (00 00 00 1C, big-endian byte order); it also gives the offset at which the size of the second block is stored. At offset 28 we find the next block size, 8 (00 00 00 08). The offset of each following block is the sum of the sizes of all the blocks before it. The file size is calculated in this way:
Offset | Value | Next offset |
---|---|---|
0 | 28 | 28 + 0 = 28 |
28 | 8 | 28 + 8 = 36 |
36 | 303739 | 36 + 303739 = 303775 |
303775 | 6202 | 303775 + 6202 = 309977 |
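The walk over block sizes can be sketched in Python as below. The function name is my own, and the sketch makes two simplifying assumptions: a block whose declared size is under 8 bytes, or runs past the end of the file, ends the walk. That covers the special size values 0 and 1 that the format allows, which this sketch does not interpret.

```python
import struct


def mp4_declared_size(data: bytes) -> int:
    """Offset just past the last well-formed top-level block."""
    offset = 0
    while offset + 8 <= len(data):
        # Each block starts with its own size, big-endian, 4 bytes.
        (size,) = struct.unpack_from(">I", data, offset)
        if size < 8 or offset + size > len(data):
            break  # not a plausible block: stop here
        offset += size
    return offset
```

If the returned value is smaller than the actual file length, the remaining bytes do not form valid blocks and deserve a closer look.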
MOV
This widely used format is Apple's QuickTime container, on which MP4 is based. It has a structure similar to MP4 and serves the same purpose: storing audio and video data and related materials.
Like MP4, every MOV file has the 4-byte signature ftyp at offset 4, but the signature that follows it is "qt  " (71 74 20 20, the letters qt and two spaces). The rule for calculating the file size is the same: starting from the beginning of the file, read the size of each block and add it to the running offset.
For this group of formats, the check consists of calculating the size by the rules above and comparing it with the actual size of the file. If the actual file size is much larger than the calculated one, the file has been glued. For AVI files the calculated size may be smaller than the file size because of zeros appended for alignment; in that case, check that only zeros follow the calculated size.
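The comparison itself can be sketched in Python as follows. The function name, the status strings and the `allow_zero_padding` flag are illustrative; the calculated size is assumed to come from one of the size rules above. For simplicity the sketch treats any extra non-padding byte as suspicious, with no tolerance.

```python
def classify_tail(data: bytes, calculated_size: int,
                  allow_zero_padding: bool = False) -> str:
    """Compare the calculated size with the real one and classify the result."""
    if len(data) < calculated_size:
        return "truncated"  # file is shorter than its header claims
    tail = data[calculated_size:]
    if not tail:
        return "clean"  # sizes match exactly (required for WAV)
    if allow_zero_padding and not any(tail):
        return "padding"  # only alignment zeros follow (allowed for AVI)
    return "suspicious"  # extra non-zero data after the calculated size
```

For AVI, call it with `allow_zero_padding=True`; for WAV, MP4 and MOV, leave the flag at its default.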
Checking Compound File Binary Format
This file format, developed by Microsoft, is also known as OLE (Object Linking and Embedding) or COM (Component Object Model) structured storage. DOC, XLS and PPT files belong to the CFB family.
A CFB file consists of a 512-byte header followed by sectors of equal length that store data streams or service information. Each sector has a non-negative number; two special values are reserved: -1 marks a free sector and -2 marks the sector that ends a chain. All sector chains are described in the FAT.

Suppose an attacker has modified a DOC file by appending another file to its end. There are several ways to detect this or to flag an anomaly in the document.
Abnormal file size
As mentioned above, any CFB file consists of a header and sectors of equal length. To find the sector size, read the two-byte number at offset 30 from the beginning of the file and raise 2 to that power. The number must be either 9 (0x0009) or 12 (0x000C), giving a sector size of 512 or 4096 bytes respectively. Once the sector size is known, check the following equality:
(FileSize - 512) mod SectorSize = 0
If the equality does not hold, the file has probably been glued. However, the method has a significant drawback: an attacker who knows the sector size only needs to append the file plus another n bytes of padding so that the size of the glued data is a multiple of the sector size.
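A hedged Python sketch of this check follows. The function name is my own. One assumption goes beyond the text: in files with 4096-byte sectors the header is taken to be padded to a full sector, so the sketch tests FileSize mod SectorSize. For 512-byte sectors this is the same condition as the equality above.

```python
import struct


def cfb_size_is_consistent(data: bytes) -> bool:
    """True if the file length is a whole number of sectors."""
    # Two-byte little-endian sector shift at offset 30.
    (sector_shift,) = struct.unpack_from("<H", data, 30)
    if sector_shift not in (9, 12):
        raise ValueError(f"unexpected sector shift: {sector_shift}")
    sector_size = 1 << sector_shift  # 512 or 4096
    # Assumption: the header occupies one full sector in both variants.
    return len(data) % sector_size == 0
```

As the text notes, a result of True proves little: padding the glued data to a sector boundary is enough to pass.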
Unknown sector type
If the attacker knows how to get around the previous check, this method can still detect the glued data by finding sectors of undefined type.
Define the equality:
FileSize = 512 + CountReal * SectorSize, where FileSize is the file size, SectorSize is the sector size, CountReal is the number of sectors.
We also define the following variables:
- CountFAT - the number of FAT sectors. It is stored at offset 44 from the beginning of the file (4 bytes);
- CountMiniFAT - the number of MiniFAT sectors. It is stored at offset 64 from the beginning of the file (4 bytes);
- CountDIFAT - the number of DIFAT sectors. It is stored at offset 72 from the beginning of the file (4 bytes);
- CountDE - the number of Directory Entry sectors. To find it, take the number of the first DE sector, stored at offset 48, then follow the DE chain through the FAT and count its sectors;
- CountStreams - the number of sectors holding data streams;
- CountFree - the number of free sectors;
- CountClassified - the number of sectors with a known type;
CountClassified = CountFAT + CountMiniFAT + CountDIFAT + CountDE + CountStreams + CountFree
If CountClassified is not equal to CountReal, we can conclude that the file may have been glued.
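The bookkeeping can be sketched in Python as below. This is only a partial sketch: the function name is my own, and the three counts that require walking the FAT (CountDE, CountStreams, CountFree) are passed in as parameters rather than computed. The header fields are read at the offsets listed above, and CountReal is obtained from the equality for FileSize.

```python
import struct


def count_unclassified(data: bytes, count_de: int, count_streams: int,
                       count_free: int) -> int:
    """Return CountReal - CountClassified for a CFB file held in memory."""
    (sector_shift,) = struct.unpack_from("<H", data, 30)
    sector_size = 1 << sector_shift

    # Header fields, 4 bytes each, little-endian.
    (count_fat,) = struct.unpack_from("<I", data, 44)
    (count_minifat,) = struct.unpack_from("<I", data, 64)
    (count_difat,) = struct.unpack_from("<I", data, 72)

    # FileSize = 512 + CountReal * SectorSize, solved for CountReal.
    count_real = (len(data) - 512) // sector_size

    count_classified = (count_fat + count_minifat + count_difat
                        + count_de + count_streams + count_free)
    return count_real - count_classified
```

A non-zero result means some sectors in the file are not accounted for by any known type, which is the anomaly this method looks for.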