PavelMSTU July 28, 2016 at 09:13

Information hiding in PDF documents

There are many ways to hide data inside other data. The examples people usually remember are steganography in images, audio, and video.

However, containers are not limited to these. Together with ~~two slovenly~~ very talented students (namely, lancerx and PavelBatusov), we decided to develop a simple just4fun project on information hiding in electronic documents.

Here is a link to the result (don't judge too harshly): pdf.stego.su
(PDF examples can be found here)

The user-friendly interface is shown in the kawaii picture:

What is all this about?

Once, over a cup of coffee and a conversation about steganography, we asked ourselves: “Is it possible to embed additional third-party information in electronic text documents so that the documents themselves do not visually change?” That is how our little "steganography club" appeared.

It turns out you can.

Here is a far-from-complete list of suitable formats.

OpenDocument Format (ODT), aka ISO/IEC 26300-1:2015, which is, by the way, nothing less than a state standard (sic!): GOST R ISO/IEC 26300-2010. In simple terms, the format is a zip archive of XML files. Anyone who doubts this can install LibreOffice, create an arbitrary document “example.odt”, rename it to “example.zip”, and see for themselves. There is plenty of room for creativity in embedding extraneous information.
Office Open XML (OOXML, aka DOCX, aka ISO/IEC 29500:2008) is Microsoft's response to ODT. From the point of view of information hiding it is the same thing seen from a different angle: DOCX is also a zip archive of XML files, only organized differently.
DjVu (from the French "déjà vu") is a very interesting format for hiding. DjVu uses the JB2 algorithm, which looks for repeated characters and stores their image only once. This suggests several ideas:
- Take the set of all similar characters and choose one of them using hash steganography.
- Keep two images of a character instead of one. The first image is treated as transmitting 0 and the second as transmitting 1. By alternating them you can transmit hidden information.
- Use LSB to hide data inside the image that represents the character in DjVu.
FictionBook (fb2) is XML. However, it may contain a binary tag that holds a picture, and the data can then be hidden in the picture itself. You can also try inserting spaces and other characters outside the tags or inside the tags themselves.
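The statement above that ODT and DOCX are zip archives of XML files can be checked without renaming anything. The sketch below is only an illustration: the function names are made up for this example, and the toy container merely imitates the layout of an ODT file, it is not a valid document.

```python
import io
import zipfile


def list_container_parts(data: bytes) -> list[str]:
    """Return the member names of an ODT/DOCX file given as raw bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


def build_toy_container() -> bytes:
    """Build a tiny zip that imitates an ODT layout (NOT a valid document)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        archive.writestr("content.xml", "<office:document-content/>")
    return buffer.getvalue()
```

With a real file, `list_container_parts(open("example.odt", "rb").read())` should list entries such as `content.xml` and `styles.xml`; any of these parts is a potential place for extra data.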

The list could go on for a long time, because mankind has invented a lot of formats for storing text.
For our hiding experiments we chose PDF, because it has the following “advantages”:

this format is not meant to be edited, so the hidden data is unlikely to be destroyed by re-saving (well... we do edit it ourselves, but PDF is often used as an “uneditable” format)
this format is fairly simple (more on that below)
this format is quite popular

How does it work?

We called our knocked-together tool SHP, which stands for Simple Hide to Pdf. Simple because it is simple; Hide because it hides; and “to PDF” because it works only with PDF documents.

As a quick primer, here are a couple of paragraphs about ISO 32000-1:2008, the standard that defines PDF.
The document consists of objects (the so-called obj), and at the end of the document there is an xref table that lists all the necessary objects. Each object has a number and a revision (the standard calls it a generation number)... Yes, that's right, PDF supports revisions! PDF is actually a mini version control system! ;)) It just never really caught on in practice...

A PDF document is formed by objects of different types:

boolean variables
numbers (integer and fractional)
strings
arrays
dictionaries
streams
comments

Roughly speaking, the PDF structure is as follows:

header
objects (obj data)
xref table
trailer (points to the objects from which reading of the file should start)
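To make this layout tangible, here is a small sketch that walks a hand-written toy PDF and lists the number and revision of every object it finds. The names `TOY_PDF` and `list_objects` are invented for this example, and the regular expression is a simplification: it only sees objects stored as plain text and will miss those packed into compressed object streams (PDF 1.5 and later).

```python
import re

# "<number> <revision> obj" at the start of a line opens an indirect object.
OBJ_HEADER = re.compile(rb"^(\d+)\s+(\d+)\s+obj\b", re.MULTILINE)

# Hand-written toy file: header, two objects, xref table, trailer.
TOY_PDF = (
    b"%PDF-1.4\n"                                                # header
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"      # object 1, revision 0
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"   # object 2, revision 0
    b"xref\n0 3\n"                                               # xref table
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"                                     # byte offset of object 1
    b"0000000058 00000 n \n"                                     # byte offset of object 2
    b"trailer\n<< /Size 3 /Root 1 0 R >>\n"                      # trailer
    b"startxref\n110\n%%EOF\n"                                   # byte offset of "xref"
)


def list_objects(pdf_bytes: bytes) -> list[tuple[int, int]]:
    """Return (object number, revision) pairs in file order."""
    return [(int(m.group(1)), int(m.group(2))) for m in OBJ_HEADER.finditer(pdf_bytes)]
```

Calling `list_objects(TOY_PDF)` returns the two objects with revision 0. Note that the xref table stores byte offsets, which is why any tool that inserts data in the middle of a file has to keep those offsets consistent.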

After a quick look at the PDF standard, we can suggest the following hiding methods.

Reorder the objects in a certain way, thereby changing the structure of the document. The guys and I called this “structural steganography,” since you change the structure of the document without changing its content. If you have n objects, there are n! different orderings, so you can transmit at most log₂(n!) bits of data. The idea is interesting, but we put it off until better times.
Play around with revisions: add hidden information to old (unused) revisions. However, we looked at 1000 different PDFs and did not find a single file with a revision greater than 0...
Find ways of entering data that the format provides but that are not displayed to the user.
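The log₂(n!) bound from the first method can be made concrete. The sketch below computes how many whole bits an ordering of n objects can carry and maps an integer to an ordering and back using the factorial number system (Lehmer code). This is just one standard way to enumerate permutations; it is not something SHP implements, and all function names are invented for the example.

```python
import math


def capacity_bits(n: int) -> int:
    """Whole bits that an ordering of n distinct objects can carry: floor(log2(n!))."""
    return math.factorial(n).bit_length() - 1


def int_to_permutation(value: int, n: int) -> list[int]:
    """Map an integer in [0, n!) to a unique ordering of 0..n-1."""
    if not 0 <= value < math.factorial(n):
        raise ValueError("value does not fit into n! orderings")
    pool = list(range(n))
    order = []
    for i in range(n, 0, -1):
        # The next digit in the factorial number system picks an item from the pool.
        index, value = divmod(value, math.factorial(i - 1))
        order.append(pool.pop(index))
    return order


def permutation_to_int(order: list[int]) -> int:
    """Inverse of int_to_permutation: recover the hidden integer from an ordering."""
    pool = sorted(order)
    value = 0
    for item in order:
        index = pool.index(item)
        value += index * math.factorial(len(pool) - 1)
        pool.pop(index)
    return value
```

For example, a document with 10 objects has 10! = 3,628,800 orderings, which is about 21.8 bits, so `capacity_bits(10)` gives 21 usable bits.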

The simplest option for method 3 is... comments. I don't know who they were left for; maybe they are a legacy of PostScript, which is formally a programming language (like LaTeX) and therefore, like any language, has comment syntax. From the point of view of "refined" steganography this is, of course, not secure. However, the presumed adversary first has to know that something is hidden at all...
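A minimal sketch of the comment idea, under stated assumptions: in PDF a comment starts with `%` and runs to the end of the line, so a payload can be written as a hex-encoded comment line. The marker `%SHP-DEMO:` and both function names are hypothetical, and the placement (right before the last `startxref`) was chosen only because nothing before that point moves, so the byte offsets in the xref table stay valid. Where SHP actually puts its comments is not described in this post, and whether every viewer tolerates a comment in that position has not been checked here.

```python
from typing import Optional

MARKER = b"%SHP-DEMO:"  # hypothetical tag, not part of the PDF standard or of SHP


def embed_in_comment(pdf_bytes: bytes, payload: bytes) -> bytes:
    """Insert the payload as a hex-encoded comment line before the last 'startxref'."""
    position = pdf_bytes.rfind(b"startxref")
    if position == -1:
        raise ValueError("no startxref keyword found")
    # Hex encoding keeps the payload free of line breaks that would end the comment.
    comment = MARKER + payload.hex().encode("ascii") + b"\n"
    return pdf_bytes[:position] + comment + pdf_bytes[position:]


def extract_from_comment(pdf_bytes: bytes) -> Optional[bytes]:
    """Return the payload of the last marked comment, or None if there is none."""
    start = pdf_bytes.rfind(MARKER)
    if start == -1:
        return None
    start += len(MARKER)
    end = pdf_bytes.find(b"\n", start)
    if end == -1:
        end = len(pdf_bytes)
    return bytes.fromhex(pdf_bytes[start:end].decode("ascii"))
```

Anyone who opens the file in a text editor will see such a line, which is exactly why this counts as information hiding rather than strong steganography.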

Nevertheless, there are cases when during concealment it does not make sense to hide the fact of the presence of the message. This will be information hiding, but not steganography.

Embedding data:

the user submits a PDF document, a message to hide, and a password to the SHP system;
SHP derives the stego key and the crypto key from the password. The message is compressed and encrypted with the crypto key;
using the stego key, the information is embedded in the PDF document;
at the output of the SHP system, the user receives a PDF document with the embedded data.

Extracting data:

the user submits the PDF and the password to the system;
the system derives the stego key and the crypto key from the password in the same way;
the system extracts the data using the stego key;
decrypts the data with the crypto key and decompresses it;
returns the message to the user.
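The two procedures above can be sketched as a round trip. Everything specific here is an assumption made for illustration, because the post does not say which primitives SHP uses: the keys come from PBKDF2, compression is zlib, the "cipher" is a toy HMAC-SHA256 keystream standing in for a real cipher such as AES, and the stego key is used only to tag the blob so that the extractor recognises its own data. The names `derive_keys`, `pack` and `unpack` are invented. Placing the resulting blob into the PDF is left out.

```python
import hashlib
import hmac
import os
import zlib
from typing import Optional


def derive_keys(password: str) -> tuple[bytes, bytes]:
    """Derive (stego_key, crypto_key) from one password. Salt and rounds are assumptions."""
    material = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), b"shp-demo-salt", 100_000, dklen=64
    )
    return material[:32], material[32:]


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream (HMAC-SHA256 over nonce + counter). NOT a vetted cipher."""
    blocks = []
    while 32 * len(blocks) < length:
        counter = len(blocks).to_bytes(8, "big")
        blocks.append(hmac.new(key, nonce + counter, hashlib.sha256).digest())
    return b"".join(blocks)[:length]


def pack(password: str, message: bytes) -> bytes:
    """Compress, encrypt and tag a message; returns nonce + tag + ciphertext."""
    stego_key, crypto_key = derive_keys(password)
    nonce = os.urandom(16)
    compressed = zlib.compress(message)
    stream = _keystream(crypto_key, nonce, len(compressed))
    cipher = bytes(a ^ b for a, b in zip(compressed, stream))
    tag = hmac.new(stego_key, nonce + cipher, hashlib.sha256).digest()[:16]
    return nonce + tag + cipher


def unpack(password: str, blob: bytes) -> Optional[bytes]:
    """Reverse of pack; returns None when the password does not match."""
    stego_key, crypto_key = derive_keys(password)
    nonce, tag, cipher = blob[:16], blob[16:32], blob[32:]
    expected = hmac.new(stego_key, nonce + cipher, hashlib.sha256).digest()[:16]
    if not hmac.compare_digest(tag, expected):
        return None
    stream = _keystream(crypto_key, nonce, len(cipher))
    return zlib.decompress(bytes(a ^ b for a, b in zip(cipher, stream)))
```

In this sketch `unpack("wrong password", blob)` returns None because the wrong password yields different keys and the tag check fails.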

That, in fact, is all.

If the user enters an incorrect password, SHP will derive the wrong stego key and crypto key. Therefore, the user can be sure that nobody can retrieve the information from the PDF without knowing the password.

For those who did not notice it at the beginning of this longread, here once again is the link to our knocked-together web platform: pdf.stego.su
~~(As you can see, instead of Django's standard black we chose the color "toad in love". Yes, we are simply design geniuses!)~~

What is it for?

At first it was just4fun for me and a way for my padawan students to gain skills and experience. Later, however, we came up with a number of ideas. That is why we are publishing this post: we want to hear the opinion of the professional IT community, especially security people.

Perhaps everything we have written is nonsense. In that case, if the reader has not yet given up on this post, we ask them to spend another 5-10 minutes on criticism in the comments.

In one of my past posts, I talked about 15 practical goals of steganography (and information hiding) .
In fact, steganography in documents (and in PDF documents in particular) may be applicable, to one degree or another, to all of those goals.

However, only 4.5 of the tasks are really interesting.

0.5 Covert transmission & covert storage of information.

As already noted, this is not secure! However, it definitely works against "cyberblondes". For more serious steganography you first need to come up with a good embedding algorithm. That is why we count this task as 0.5 rather than 1.

Moreover, steganography in electronic documents cannot be considered robust, because the information is lost on conversion (for example, pdf -> odt).

The only place where covert transmission might be in demand is in closed protocols: a kind of "security through obscurity", only in steganography.

1.5 Protection of exclusive rights

Sales of electronic magazines, analytics of all kinds, and other paid subscriptions keep growing. The question arises: is it possible to somehow protect the content being sold, so that the characters who publish it online are discouraged from doing so?..

You can try to embed information about the recipient in the distributed document, for example: e-mail, payment card number, IP, login used when registering in the online store, mobile phone number, etc. For the sake of security and legal compliance, you can embed this as hashes (+ salt), or simply embed a number (an ID in your system), which will mean something only to the owner of the system.
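As a rough illustration of the "hashes (+ salt)" option: the sketch below turns a recipient's e-mail into an opaque label and later matches a label found in a leaked file against the customer list. The post does not say which construction the authors had in mind; here the salt is treated as a secret key and fed to HMAC-SHA256, and the names `recipient_label` and `identify_leaker` are invented for the example.

```python
import hashlib
import hmac
from typing import Optional


def recipient_label(recipient: str, secret_salt: bytes) -> str:
    """Opaque label for a recipient; meaningless without the secret salt."""
    normalized = recipient.strip().lower().encode("utf-8")
    return hmac.new(secret_salt, normalized, hashlib.sha256).hexdigest()[:32]


def identify_leaker(label: str, customers: list[str], secret_salt: bytes) -> Optional[str]:
    """Find which known customer a label extracted from a leaked file belongs to."""
    for customer in customers:
        if hmac.compare_digest(recipient_label(customer, secret_salt), label):
            return customer
    return None
```

Only the owner of the salt can link a label back to a person, so the document itself carries no readable personal data.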

If a protected document gets published, you can determine who leaked it.

Of course, a number of questions arise.

Can the label be extracted?
Is it possible to forge a label and "frame" another user?

If you use SHP, this task should also be counted as 0.5 rather than 1.0...

However, you can try to find better and more reliable data hiding algorithms.
For example, using several hiding algorithms that do not interfere with each other lets you build a single steganographic construction, so to speak "multifactor steganography" (also a made-up term).

The essence of "multifactor steganography" is this: if at least one label survives, ~~we can take the character by the balls,~~ we can determine who published the paid content. In Japan, this is relevant.

2.5 Protecting the authenticity of a document.

The idea is very simple. We sign the document, certifying our authorship. The difference from the huge zoo of similar systems is that our signature is inseparable from the file itself.

However, there is a built-in mechanism that does the same thing! (at least within the PDF format)
So we are late >__<
But can similar reasoning be applied to other formats?

3.5 Decentralized EDMS.

In principle, the “inseparability” of hidden data can be used for decentralized electronic document management systems (EDMS).

But is it necessary?
Clearly it is very convenient: peer-2-peer and, in general, fashionable!
The main selling point is that the signature is inseparable from the document.
In modern EDMS, a signed document counts as signed only while it is inside the EDMS.
If you extract it and mail it to a third-party organization that does not have your EDMS solution, then you are just transferring a file.

The modern EDMS market resembles the messenger market. If you are on Skype and Vasya is on Telegram, then either you need to install Telegram or Vasya needs to install Skype... But imagine a single protocol for embedding information (or a set of embedding protocols, one for each electronic document format).

One for all! Universal!

If this protocol for embedding and extracting signatures were uniform, the way SMTP and IMAP are for all mail clients, it would be much more convenient.

That said, I am not an EDMS specialist. If there are specialists here, please take a moment to write in the comments what you think about this.

Is this idea in demand?

4.5 Watermark in DLP systems.

Imagine that you have a restricted-access or "semi-restricted" facility (yes, there are such things). You have information that you would not want to let out, for example, the internal documentation of a product. You embed a specific label (or a label from a specific set). If the document goes "outside" the system, the DLP (Data Leak Prevention) system checks for the label. If there is no label, the document passes; if there is, the system raises an alert.
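The check described above can be sketched in a few lines. This is a deliberately naive illustration: it assumes the labels are stored in the file as plain byte strings (for example, inside a comment), and the names `dlp_gate` and `KNOWN_LABELS` as well as the label values are invented for the example.

```python
# Hypothetical set of labels that mark internal documents.
KNOWN_LABELS = {b"INTERNAL-7f3a", b"INTERNAL-91bc"}


def dlp_gate(document: bytes, labels: set[bytes] = KNOWN_LABELS) -> str:
    """Return "alert" if the outgoing document carries any known label, else "pass"."""
    for label in labels:
        if label in document:
            return "alert"
    return "pass"
```

A labelled file such as `b"%PDF-1.4\n%INTERNAL-7f3a\n"` triggers the alert, while an unlabelled one passes through.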

Of course, this is not a panacea. But if the benefit of information hiding far outweighs the cost of developing such a system, why not introduce it as an optional (that is, additional) measure?

In addition, it will definitely help against one type of "leak": the unintentional kind. There are cases when documents that should never have been sent are sent by accident (I hope this sad property is inherent only in “semi-restricted” facilities and not in restricted ones...)

To summarize

We have convinced ourselves that hiding data in documents is quite feasible.
~~... Well, we also learned a lot of new things, because we dug into a lot of places ...~~

Of course, a number of questions remain.
Is it possible to make this hiding steganographically robust? What happens if the user converts everything from PDF to, say, JPEG?.. Will the hidden information be lost? How critical is that? Will multifactor steganography solve this problem?

Is a statistical approach applicable to assessing the quality of the system? That is, if the system protects in 90% of cases but does not protect in 10%, is it reasonable (as in cryptography) to say that the system does not protect at all? Or maybe there are business cases where 90% is enough to get certain benefits?..

Your point of view, reader, is very welcome in the comments: that is what this longread was written for.

Once again, the link to the portal: pdf.stego.su
(+ PDF examples for experiments, for those too lazy to look for their own)
(we apologize in advance for a possible Habr effect)

Do you think the idea of information hiding in electronic documents will be in demand?

47% (48 votes): Yes, if the idea is brought to fruition! But, of course, there are many pitfalls.
34.3% (35 votes): No, this too is just4fun for crypto blondes.
18.6% (19 votes): I don't know... But there is something in it!