vpanferov May 23, 2011 at 11:18

How MRC technology reduces the size of PDFs

PDF has long been established as a format for preserving documents that are not meant to be edited afterwards. All PDF files fall into two classes. The first is documents that were composed digitally and then converted to PDF. The user manual for a device is most likely such a file. Inside, it consists of text and graphics plus formatting commands that describe how to arrange the elements on a page.

The second class is documents obtained by scanning paper originals. You can run them through ABBYY FineReader, which turns them into the first type, or you can simply save each scan as a picture inside a PDF. The latter often makes sense when you want to keep the original appearance of the document. Although ABBYY FineReader recognizes documents quite well, recognition errors do occur and some important elements on the page may be missed, so the result differs somewhat from the original document.

Therefore, it often makes sense to save the original scanned image in the PDF and place the recognized text underneath it, so that the document can be found by keywords and its text can be copied and pasted. There is just one catch: such PDF files are rather large, half a megabyte per page or more. So if you scan an average-sized calculus textbook, you get a file of about 200 megabytes.

The size is explained by the fact that scanned bitmap images inside a PDF are compressed with conventional image codecs: JPEG, JPEG2000, LZW or ZIP. As a result, the file cannot be smaller than ordinary JPEG files of the same pages would be. To reduce the size, people usually resort to various tricks: they lower the resolution or compress the image much more aggressively, and the quality of the text in such PDFs suffers.

The alternative is to give up PDF and save everything in DjVu. The file comes out rather small, but in practice not every recipient can open it easily, since Adobe Acrobat is installed on far more computers than a DjVu viewer.

This is where PDF MRC technology (from "Mixed Raster Content") comes to the rescue, Adobe's answer to the DjVu format. It is still an ordinary PDF that borrows many ideas from DjVu and can be read by all popular PDF readers. With MRC, the page size is reduced by a factor of about 4 while the quality of the scanned image is maintained. This is achieved by dividing the image into layers and compressing each layer with the most suitable codec. The text is compressed with the JBIG2 codec, and everything else is compressed with JPEG / JPEG2000 / ZIP at different quality levels.

How is MRC arranged inside a PDF? Let's consider a simple example and then gradually complicate it.
Suppose we have a scan of a white page with black text, for example a page from a book.

Scan, JPEG, 1.2 Mb

The only useful information here is the letters; everything else can be ignored. First we find all the text on the page, and the logical way to do that is to run FineReader and recognize the page. Then we extract all the text found into a separate layer and compress it with the JBIG2 codec. We get 50 kilobytes per page versus 400 for JPEG and 200 for the black-and-white fax codec CCITT Group 4 (CCITT4).

JBIG2 is specifically designed to compress text. As it works, it groups letter images that look alike into clusters. An example of such a cluster is all the letters 'a' printed in the same font at the same size. Slightly different letters 'a', for example ones distorted by scanning or printed in a different font, fall into other clusters. The result is a dictionary in which frequently occurring identical letters are stored once. Then only the position of each letter on the page is recorded. The result is very compact.
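The clustering idea described above can be illustrated with a small sketch. This is not the JBIG2 codec itself, only a toy model of its symbol dictionary: glyphs are tiny bitmaps given as rows of '0'/'1' characters, and the names `pixel_diff`, `build_dictionary` and the `max_diff` tolerance are assumptions made for this illustration.

```python
from typing import List, Optional, Tuple

Glyph = Tuple[str, ...]           # bitmap rows, '1' = ink, '0' = paper
Placed = Tuple[Glyph, int, int]   # glyph bitmap plus its x, y position


def pixel_diff(a: Glyph, b: Glyph) -> Optional[int]:
    """Number of differing pixels, or None if the bitmaps differ in shape."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return None
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def build_dictionary(glyphs: List[Placed], max_diff: int = 1):
    """Group similar glyphs and record, for each one, only an index and a position."""
    dictionary: List[Glyph] = []                  # one representative per cluster
    placements: List[Tuple[int, int, int]] = []   # (dictionary index, x, y)
    for glyph, x, y in glyphs:
        for idx, representative in enumerate(dictionary):
            diff = pixel_diff(glyph, representative)
            if diff is not None and diff <= max_diff:
                break                             # close enough: reuse this cluster
        else:
            dictionary.append(glyph)              # nothing similar yet: new cluster
            idx = len(dictionary) - 1
        placements.append((idx, x, y))
    return dictionary, placements


# Toy page: two clean 'a'-like glyphs, one slightly distorted copy, one other letter.
LETTER_A: Glyph = ("010", "101", "111", "101")
LETTER_A_NOISY: Glyph = ("010", "101", "111", "100")   # one pixel off
LETTER_B: Glyph = ("110", "101", "110", "101")

page = [(LETTER_A, 0, 0), (LETTER_B, 4, 0), (LETTER_A_NOISY, 8, 0), (LETTER_A, 12, 0)]
dictionary, placements = build_dictionary(page)
print(len(dictionary), placements)
```

Four glyphs on the page are stored as two dictionary entries plus four short placement records, which is where the saving comes from. A real encoder uses more careful matching and entropy coding on top of this.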

JBIG2, 50 Kb. The PDF with additional information is 80 Kb

Now let's complicate the task. Suppose the page has an uneven background that we don't want to lose.

Tiff, 500 Kb

For this we need two layers. The first is still the text compressed with JBIG2. The second layer holds everything that remains of the original image after the letters have been cut out and the holes they leave have been filled in. We can compress the second layer quite heavily with JPEG, since it usually contains no particularly valuable information.
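A minimal sketch of this two-layer split, under simplifying assumptions: the page is a grayscale grid of 0-255 values, the text is found by a plain brightness threshold (the article uses recognition for this step, which is not reproduced here), and the holes are filled with the average background brightness. The function name `split_layers` and the `threshold` parameter are hypothetical.

```python
from typing import List, Tuple

Gray = List[List[int]]   # grayscale page, 0 = black, 255 = white


def split_layers(page: Gray, threshold: int = 128) -> Tuple[List[List[int]], Gray]:
    """Return (text mask, background with the letter holes filled in)."""
    # 1 marks a pixel that belongs to a letter, 0 marks everything else.
    mask = [[1 if pixel < threshold else 0 for pixel in row] for row in page]

    # Average brightness of the non-text pixels, used to fill the holes.
    rest = [p for row, mrow in zip(page, mask) for p, m in zip(row, mrow) if not m]
    fill = round(sum(rest) / len(rest)) if rest else 255

    background = [
        [fill if m else p for p, m in zip(row, mrow)]
        for row, mrow in zip(page, mask)
    ]
    return mask, background


# Toy page: a dark vertical stroke on a light, slightly uneven background.
page = [
    [200, 10, 210],
    [220, 20, 230],
]
mask, background = split_layers(page)
print(mask)
print(background)
```

The mask is a bilevel image suited to JBIG2, while the filled background has no sharp letter edges left, which is what lets JPEG compress it heavily without visible damage to the text.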

The resulting PDF is 35 Kb, versus the 190 Kb we would have obtained by simply compressing the whole picture as JPEG.

Text, JBIG2, 18 Kb
Background, JPEG, 11 Kb
Final PDF MRC, 35 Kb

The next complication. So far we have only dealt with black-and-white text. Now suppose the page contains colored text.

Tiff, 700 Kb

As before, we compress the text with the black-and-white JBIG2 codec, but under the colored letters we place a so-called color mask: another layer that shows through the "slots" cut by the letters. This layer contains few colors and compresses very well, for example with ZIP.

Text, JBIG2, 11 Kb
Color mask, ZIP, 3 Kb
Text + color mask looks like this:
Background, JPEG, 40 Kb

It is important not to overdo the compression of the background, because it may contain text that was not recognized as text. If we compress it too much, that text will be hard to read.

Final PDF MRC, 60 Kb

So there are now 3 layers: the text, a color mask that paints the text, and the background. It remains to deal with elements that are neither text nor background, such as pictures or photographs. Nothing special can be done with them, so we simply add them to the background and compress them with JPEG or JPEG 2000 at high quality.
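How a viewer might put the three layers back together can be sketched as follows. This is only an illustration of the layering idea, not the way a PDF renderer is actually implemented: each layer is a small grid, pixels are RGB tuples, and the function name `compose` is an assumption made for this example.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def compose(mask: List[List[int]],
            color_mask: List[List[RGB]],
            background: List[List[RGB]]) -> List[List[RGB]]:
    """Where the text mask is set, show the color mask; elsewhere, the background."""
    return [
        [color if m else bg for m, color, bg in zip(mrow, crow, brow)]
        for mrow, crow, brow in zip(mask, color_mask, background)
    ]


RED: RGB = (200, 0, 0)
PAPER: RGB = (240, 235, 220)
SHADE: RGB = (225, 220, 205)

mask = [[0, 1, 0],
        [0, 1, 1]]               # the letter "slots"
color_mask = [[RED, RED, RED],
              [RED, RED, RED]]   # few colors, so it packs well with ZIP
background = [[PAPER, PAPER, SHADE],
              [PAPER, SHADE, SHADE]]

page = compose(mask, color_mask, background)
print(page)
```

Because the color mask is only visible through the letter shapes, it can be a coarse, nearly flat image, while the sharp edges live entirely in the bilevel text layer.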

Tiff, 600 Kb
Text, JBIG2, 25 Kb
Color mask, ZIP, 5 Kb
Background, JPEG, 40 Kb

The PDF MRC is ready. It contains several layers, each of which holds different pieces of the picture and is compressed with the most suitable codec.

Final PDF MRC, 72 Kb

Of course, there are images that gain nothing in size from MRC. For example, there is no point in compressing a landscape photograph this way: the result will be no smaller than plain JPEG. The same applies to text printed on a background that contains many small details.

MRC does not help with an image like this.

However, for many documents that we meet in everyday life, MRC gives excellent results.

And finally, a few examples of PDF MRC files, which can be produced with ABBYY FineReader, ABBYY FineReader Engine or ABBYY Recognition Server:

PDF, JPEG	PDF, MRC
524 Kb	218 Kb
618 Kb	175 Kb
412 Kb	113 Kb

Overall, we get files 2-6 times smaller at the same quality, and this is not the limit. PDF MRC is still a very young technology and it continues to evolve. There will be improvements both in quality and in size reduction.
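The ratios behind that range can be checked with a few lines of arithmetic. The sizes are the ones quoted in this article (the three sample files and the uneven-background example); the helper name `ratio` is just for illustration.

```python
def ratio(before_kb: float, after_kb: float) -> float:
    """How many times smaller the MRC file is than the plain JPEG one."""
    return before_kb / after_kb


# (PDF with JPEG, PDF with MRC) sizes in Kb for the three sample files.
SAMPLES = [(524, 218), (618, 175), (412, 113)]
ratios = [round(ratio(before, after), 1) for before, after in SAMPLES]

# The uneven-background example: 190 Kb as plain JPEG versus 35 Kb as MRC.
background_example = round(ratio(190, 35), 1)

print(ratios, background_example)
```

The sample files come out between roughly 2.4 and 3.6 times smaller, and the uneven-background page about 5.4 times smaller.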

All PDF examples in this article were produced with ABBYY FineReader Engine 10 using the default settings.

Vasily Panfyorov,
Department of Products for Developers
