What's Hidden in a PDF

Original author: Nick Winder


PDF files contain a lot of information. Most of it exists so that the document renders identically on different platforms. But there is also a lot of metadata: the creation and modification dates, the application that produced the file, the subject of the document, its title, its author, and much more. That is the standard set; there are also ways to insert custom metadata into a PDF, such as hidden comments in the middle of the file. In this article, we present some forms of metadata and show where to look for them.

Information Metadata


Since PDF 1.0, there has been a standardized set of values that can optionally be added to a document. File managers use these values to improve document search. They include:

  • Author
  • Creation date (CreationDate)
  • Creator
  • Producer

In PDF 1.1, this set was expanded to include additional data that helps find documents:

  • Title
  • Subject
  • Keywords
  • Modification date (ModDate)

Strictly speaking, this information is not hidden, since many applications let you view it. It is simply not shown as part of the document itself. If you are concerned about security, be careful about relying on it, because it can be edited after the fact. Metadata can be updated independently of the displayed content, so a file manager may report a document as modified even though its visible contents have not changed.
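To make the entries above concrete, here is a minimal Python sketch that pulls them out of raw PDF bytes and parses a PDF date string. It is only an illustration under simplifying assumptions: the function names and the sample bytes are made up for this example, only literal strings in parentheses are handled, and real files may use hex strings, other encodings, compressed object streams, or encryption.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches "/Key (literal string)". Escaped parentheses are tolerated;
# nested unescaped parentheses and hex strings are not handled.
INFO_ENTRY = re.compile(
    rb"/(Title|Author|Subject|Keywords|Creator|Producer|CreationDate|ModDate)"
    rb"\s*\(((?:\\.|[^\\()])*)\)"
)

# PDF dates look like D:YYYYMMDDHHmmSSOHH'mm'. Everything after the year is optional.
PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(?:(\d{2})'?)?)?)?"
)


def read_info_entries(data: bytes) -> dict:
    """Collect Info-style entries found anywhere in the raw bytes."""
    entries = {}
    for key, value in INFO_ENTRY.findall(data):
        # Later matches overwrite earlier ones; escape sequences are left as-is.
        entries[key.decode("ascii")] = value.decode("latin-1")
    return entries


def parse_pdf_date(text: str):
    """Turn a PDF date string into a datetime, or return None if it does not match."""
    match = PDF_DATE.match(text)
    if match is None:
        return None
    year, month, day, hour, minute, second, sign, off_h, off_m = match.groups()
    tz = None
    if sign is not None:
        offset = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
        tz = timezone(-offset if sign == "-" else offset)
    return datetime(int(year), int(month or 1), int(day or 1),
                    int(hour or 0), int(minute or 0), int(second or 0), tzinfo=tz)


# Made-up sample object; which key each value belongs to is an assumption.
sample = (
    b"1 0 obj\n"
    b"<< /Title (My Amazing PDF) /Author (Nick Winder)\n"
    b"   /Producer (Ink and Paper) /CreationDate (D:18510818) >>\n"
    b"endobj\n"
)

info = read_info_entries(sample)
print(info["Title"], "|", info["Author"], "|", parse_pdf_date(info["CreationDate"]))
```

Because the scan is purely textual, it also finds entries in objects that are no longer referenced, which is useful for inspection but is not how a conforming reader resolves the Info dictionary.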



Additional Metadata


The PDF standard now supports much richer metadata. Instead of a small set of predefined values, you can store an entire stream of information in XMP format. As a result, any kind of data can be embedded there. Again, it is not displayed, but a file manager can parse it.

The XMP stream can be encoded, so it is not always human-readable, but many applications can read and edit this information. The values carried by a sample XMP packet, shown here without the surrounding XML markup, look like this:

  • 1851-08-18
  • Ink and Paper
  • Nick Winder
  • My Amazing PDF

This information can be invaluable when you are trying to reconstruct the history of a document or embed additional information of your own. PSPDFKit for iOS and Android supports reading and editing metadata.
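As a rough illustration of how such a packet can be read, the Python sketch below parses an XMP packet with the standard library. The packet itself is an assumption: it reuses the four values from the example above, but which XMP property each value belongs to (`xmp:CreateDate`, `xmp:CreatorTool`, `dc:creator`, `dc:title`) is a guess made for this sketch, and `read_xmp` is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
}

# Element paths for the properties this sketch looks at.
FIELDS = {
    "created": ".//xmp:CreateDate",
    "tool": ".//xmp:CreatorTool",
    "author": ".//dc:creator/rdf:Seq/rdf:li",
    "title": ".//dc:title/rdf:Alt/rdf:li",
}

# Assumed packet: the property-to-value mapping is illustrative only.
SAMPLE_PACKET = b"""<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:xmp="http://ns.adobe.com/xap/1.0/">
   <xmp:CreateDate>1851-08-18</xmp:CreateDate>
   <xmp:CreatorTool>Ink and Paper</xmp:CreatorTool>
   <dc:creator><rdf:Seq><rdf:li>Nick Winder</rdf:li></rdf:Seq></dc:creator>
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">My Amazing PDF</rdf:li></rdf:Alt></dc:title>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>"""


def read_xmp(data: bytes) -> dict:
    """Parse the first uncompressed XMP packet found in the bytes."""
    match = re.search(rb"<x:xmpmeta.*?</x:xmpmeta>", data, re.DOTALL)
    if match is None:
        return {}
    root = ET.fromstring(match.group(0))
    found = {}
    for label, path in FIELDS.items():
        node = root.find(path, NS)
        if node is not None and node.text:
            found[label] = node.text.strip()
    return found


print(read_xmp(SAMPLE_PACKET))
```

XMP may also serialize simple properties as attributes on `rdf:Description` instead of child elements; this sketch only handles the element form.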

Object Metadata


Metadata streams are not limited to the document as a whole; metadata can be attached to any object in a document, for example to the stream that holds an embedded image. To complicate matters, the stream's own data can carry auxiliary metadata as well. If we go further still, we can embed a PDF in the metadata of the image stream, which opens the door to unbounded recursion! So the next time you inspect metadata, remember that you may have to dig through several levels to find what you are looking for.
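One crude way to see that a file carries metadata at more than one level is to look for every XMP packet wrapper in the raw bytes. The Python sketch below does that; `find_xmp_packets` and the toy input are assumptions for illustration, and packets inside compressed or encrypted streams will not be found this way.

```python
import re

# An XMP packet is wrapped in <?xpacket begin=...?> ... <?xpacket end=...?>.
PACKET = re.compile(rb"<\?xpacket begin=.*?<\?xpacket end=.*?\?>", re.DOTALL)


def find_xmp_packets(data: bytes) -> list:
    """Return (offset, length) for every uncompressed XMP packet in the bytes."""
    return [(m.start(), len(m.group(0))) for m in PACKET.finditer(data)]


# Toy input, not a valid PDF: one document-level packet and one attached to an image.
document_packet = b'<?xpacket begin="" id="doc"?>document metadata<?xpacket end="w"?>'
image_packet = b'<?xpacket begin="" id="img"?>image metadata<?xpacket end="w"?>'
toy_file = b"%PDF-1.7\n" + document_packet + b"\nimage bytes\n" + image_packet + b"\n%%EOF\n"

for offset, length in find_xmp_packets(toy_file):
    print(f"packet at byte {offset}, {length} bytes long")
```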

Incremental Save / Update


The PDF standard has a concept of incremental saving (incremental updates), which many applications, including PSPDFKit, implement to speed up saving. In short, changes are appended to the end of the document, and old objects that are no longer referenced stay in the file. This is great when you change document elements on the fly and do not want to wait for a lengthy save, or for autosave, where the work runs on a background thread and should use as few resources as possible.

As you can imagine, this opens a Pandora's box: the document's history can expose confidential or erroneous information that was removed from view but is still present in the file. In such situations, it is recommended to perform a full save, which rewrites the document. This removes the old objects, and it can even be combined with flattening so that forms cannot be edited afterwards.
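Because each appended update normally ends with its own `%%EOF` marker, a rough way to spot earlier revisions is to cut the file at every marker. The Python sketch below illustrates the idea; `split_revisions` and the toy bytes are assumptions made for this example, the input is not a valid PDF, and the heuristic can be fooled (linearized files contain an extra marker, and the byte sequence can also occur inside stream data).

```python
EOF_MARKER = b"%%EOF"


def split_revisions(data: bytes) -> list:
    """Return one prefix of the file per %%EOF marker, each ending at that marker."""
    revisions = []
    position = 0
    while True:
        index = data.find(EOF_MARKER, position)
        if index == -1:
            break
        end = index + len(EOF_MARKER)
        revisions.append(data[:end])
        position = end
    return revisions


# Toy bytes: an original save followed by an appended update to the same object.
original = b"%PDF-1.7\n1 0 obj\n(Internal draft: do not share)\nendobj\n%%EOF\n"
update = b"1 0 obj\n(Public version)\nendobj\n%%EOF\n"
saved_file = original + update

revisions = split_revisions(saved_file)
print(len(revisions), "revision(s) found")
print(b"Internal draft" in revisions[-1])  # the old text is still inside the final file
```

More than one revision does not prove that anything sensitive was left behind, but it is a good reason to inspect the earlier prefixes or to rewrite the file with a full save.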

PDF Comments


Many programming languages provide comments, which the compiler or interpreter ignores, and PDF has the same feature. The % symbol has several uses in the format, and one of them is to start a comment. So if someone opens the document in a text editor, they may see messages left there by the PDF processor that wrote it. PDF renderers ignore these comment lines, so the file displays correctly and shows no trace of the comments once rendered.
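A simple scan for such comments might look like the Python sketch below. It is deliberately naive and rests on stated assumptions: `find_comments` is a hypothetical helper, only lines that begin with % are reported, and a % inside a string or inside binary stream data would be misreported because no real tokenizing is done. The `%PDF-` header and `%%EOF` marker are skipped; the binary marker line that often follows the header is itself a comment and will show up in the output.

```python
def find_comments(data: bytes) -> list:
    """Return (line number, text) for whole-line comments in the raw bytes."""
    comments = []
    for number, line in enumerate(data.splitlines(), start=1):
        stripped = line.lstrip()
        if not stripped.startswith(b"%"):
            continue
        # The header and end-of-file marker use % but are not free-form comments.
        if stripped.startswith(b"%PDF-") or stripped.startswith(b"%%EOF"):
            continue
        comments.append((number, stripped[1:].decode("latin-1").strip()))
    return comments


# Toy bytes, not a valid PDF.
toy_file = (
    b"%PDF-1.7\n"
    b"% generated for internal review only\n"
    b"1 0 obj\n"
    b"<< /Type /Catalog >>\n"
    b"endobj\n"
    b"%%EOF\n"
)

for number, text in find_comments(toy_file):
    print(f"line {number}: {text}")
```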

One big dictionary!


The last thing to note is that a PDF is really just one big dictionary! Technically, anyone can open a document and change something. Not every change is as easy as editing a single line, but it can be done. For this reason, always keep in mind what information may be hidden in a PDF. In addition, if you handle confidential information, you should use digital signatures to verify that the document has not been altered by anyone other than its author, and that the author is who you expect.
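A PDF signature records which bytes of the file it covers in a `/ByteRange` array of two offset and length pairs. As a minimal sketch, the Python below reports how many bytes follow the signed range, which hints that something was appended after signing. This is not signature verification: no cryptography is checked, `bytes_after_signatures` is a hypothetical helper, and trailing bytes can be legitimate (for example, a later signature added by another incremental update).

```python
import re

# /ByteRange [offset1 length1 offset2 length2]
BYTE_RANGE = re.compile(rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]")


def bytes_after_signatures(data: bytes) -> list:
    """For each /ByteRange found, return how many bytes lie beyond its second range."""
    trailing = []
    for match in BYTE_RANGE.finditer(data):
        _, _, second_offset, second_length = (int(group) for group in match.groups())
        trailing.append(len(data) - (second_offset + second_length))
    return trailing


# Toy bytes padded to 50 bytes so that the second range (20 + 30) ends at the file end.
signed = b"/ByteRange [0 10 20 30]".ljust(50, b" ")

print(bytes_after_signatures(signed))                # nothing after the signed range
print(bytes_after_signatures(signed + b"appended"))  # 8 bytes were added later
```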

Conclusion


This article has listed some of the ways metadata can end up in a document without your knowledge. There are other factors to consider, such as JavaScript support in PDF. With JavaScript, the possibilities are practically endless. Documents can also contain hidden objects, which are parsed but not displayed. That is a convenient way to slip information past the reader and into the parser. PDF is a very large standard, so you should always know which PDF reader software you are using and be sure you can trust it.
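As a first, rough check for scripting, one can count the PDF names that usually accompany JavaScript and automatic actions. The Python sketch below is an assumption-laden illustration: `count_active_names` is a hypothetical helper, `/OpenAction` and `/AA` mark automatic actions in general rather than JavaScript specifically, and a count of zero proves nothing, since names can be written with `#xx` escapes or hidden inside compressed object streams.

```python
import re

SUSPECT_NAMES = (b"/JavaScript", b"/JS", b"/OpenAction", b"/AA")


def count_active_names(data: bytes) -> dict:
    """Count occurrences of names commonly tied to scripts and automatic actions."""
    counts = {}
    for name in SUSPECT_NAMES:
        # A name ends at whitespace or a delimiter, so /JS does not match /JSFoo.
        pattern = re.escape(name) + rb"(?=[\s/<>\[\](){}%]|$)"
        counts[name.decode("ascii")] = len(re.findall(pattern, data))
    return counts


# Toy dictionary, not a complete PDF.
toy_object = b"<< /Type /Catalog /OpenAction << /S /JavaScript /JS (app.alert(1);) >> >>"

print(count_active_names(toy_object))
```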
