
Text at all costs: PPT
Some time ago we discussed extracting plain text from various data formats, such as PDF and DOC. In one of the discussions, someone suggested that parsing PowerPoint presentations would earn me hemorrhoids or some other terrible ailment of the soft parts. Well, as fate would have it, I did have to get the text out of this “sweet” format. Frankly, I failed to earn the hemorrhoids, but a class for parsing presentations did come out of it.
A little about the PPT format
Like DOC, PPT is built on top of the WCBFF format (a structured binary file format), which you can read about in the article on MS Word that I wrote earlier. I will only note that during testing I found several errors in the old WCBFF and DOC implementation (and, besides, ran into a bunch of “broken” files that I still had to get text from), so I advise those who use my code to update their sources.
But we digress. Back to PowerPoint. Unlike DOC, we will no longer work with the “files” WordDocument and 1Table, but with presentation-specific ones: Current User and PowerPoint Document. Both “files” must be present in the “file system” of a CBF file for it to be a presentation, so they also let you recognize a presentation whose extension is wrong.

To start getting data out of a PPT, read the small Current User “file”, which consists of a single CurrentUserAtom record. It holds technical information about who last edited the file, but that is not the important part. The same block stores the offset to the first UserEditAtom record, which is discussed below.

Now for how PPT records are read. Every record in a presentation begins with a special header.
It is called rh and holds technical information about the record. To get it, just read the first 8 bytes of any record. The first word usually holds nothing we need, but the next 6 bytes do. The WORD at offset 2 (rh.recType) identifies the record type, which tells you what to do with the record next. The long at offset 4 (recLen) is the length of the record, not counting the eight-byte header. This layout is quite convenient and helps avoid many errors when parsing a presentation file.
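As a rough illustration of the header layout just described, here is a minimal Python sketch. The names RecordHeader and read_record_header are hypothetical (they are not taken from the author's class), and the sketch assumes the fields are stored little-endian.

```python
import struct
from typing import NamedTuple

HEADER_SIZE = 8  # every record starts with an 8-byte rh header


class RecordHeader(NamedTuple):
    first_word: int  # first WORD of the header, usually not needed
    rec_type: int    # WORD at offset 2: the record type
    rec_len: int     # long at offset 4: body length, header excluded


def read_record_header(data: bytes, offset: int) -> RecordHeader:
    """Read the rh header of the record that starts at `offset`.

    Assumption: values are little-endian.
    """
    if offset < 0 or offset + HEADER_SIZE > len(data):
        raise ValueError("record header runs past the end of the data")
    first_word, rec_type, rec_len = struct.unpack_from("<HHI", data, offset)
    return RecordHeader(first_word, rec_type, rec_len)
```

The body of a record then occupies the rec_len bytes that follow the header, so the next sibling record starts at offset + 8 + rec_len.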
Next, return to the UserEditAtom record. It is already stored in PowerPoint Document, and from here on we work only with this “file”. By reading this record and the ones linked to it, we have to build a marvelous thing called PersistDirectory, an array of offsets that will lead us to the main structure of a PowerPoint document, DocumentContainer. To do that, read the current UserEditAtom and take two offsets from it: offsetPersistDirectory, which points to the current “live” version of PersistDirectory, and offsetLastEdit, which points to the next UserEditAtom. Keep collecting offsets this way until the DWORD offsetLastEdit turns out to be zero.

Once all the offsetPersistDirectory values are collected, we can build the PersistDirectory itself. Follow the offsets in reverse order and read the PersistDirectoryAtom records. Each of them contains an array of PersistDirectoryEntry records, and each entry holds persistId, the number of its first object, and cPersist, the number of objects in the entry. This information is followed by an array of offsets to the PersistDirectory objects. The result is the most important array: through it we find references to every object in the presentation.
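The walk over the UserEditAtom chain and the assembly of PersistDirectory can be sketched in Python as below. This is a hedged sketch, not the author's code: the function names are hypothetical, and the byte positions (offsetLastEdit at byte 16 and offsetPersistDirectory at byte 20 of the record, persistId in the low 20 bits and cPersist in the high 12 bits of each entry) come from my reading of the [MS-PPT] specification rather than from this article. The starting offset is assumed to be the one read from CurrentUserAtom.

```python
import struct


def read_persist_directory_atom(stream: bytes, offset: int, directory: dict) -> None:
    """Merge one PersistDirectoryAtom into `directory` (persist id -> offset)."""
    _, _, rec_len = struct.unpack_from("<HHI", stream, offset)
    pos = offset + 8
    end = pos + rec_len
    while pos + 4 <= end:
        (packed,) = struct.unpack_from("<I", stream, pos)
        persist_id = packed & 0xFFFFF  # assumption: low 20 bits
        c_persist = packed >> 20       # assumption: high 12 bits
        pos += 4
        for i in range(c_persist):
            (obj_offset,) = struct.unpack_from("<I", stream, pos)
            directory[persist_id + i] = obj_offset
            pos += 4


def build_persist_directory(stream: bytes, first_edit_offset: int) -> dict:
    """Follow offsetLastEdit until it is zero, then apply the directories
    in reverse order so that newer edits overwrite older entries."""
    dir_offsets = []
    seen = set()  # guards against loops in broken files
    pos = first_edit_offset
    while pos and pos not in seen:
        seen.add(pos)
        # assumption: both DWORDs sit at bytes 16 and 20 of UserEditAtom
        offset_last_edit, offset_persist_dir = struct.unpack_from("<II", stream, pos + 16)
        dir_offsets.append(offset_persist_dir)
        pos = offset_last_edit
    directory = {}
    for off in reversed(dir_offsets):
        read_persist_directory_atom(stream, off, directory)
    return directory
```

A dictionary is used instead of a flat array here so that gaps in the persist ids do not need special handling; that is a design choice of this sketch only.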
Return to the last UserEditAtom that was read and find its docPersistIdRef field. It is the number of the most important object, DocumentContainer, in PersistDirectory. Read that object. It stores a cartload of information about the current presentation: headers and footers, slide notes and, most importantly, the SlideListWithTextContainer record, which holds all sorts of things about the slides.

We are interested in three types of records stored in this container: TextCharsAtom, TextBytesAtom and SlidePersistAtom. The first two are simple: they hold the slide text as Unicode and as plain ANSI, respectively. Things are different when, instead of text, we get a reference to a slide, SlidePersistAtom. Following it, we have to read the Drawing object, which (sic!) is not a PPT object. Yes, in this case an MS Drawing object with a rather unpleasant structure of nested records is embedded inside the slide. To be honest, I was upset when I first learned this: another 600-odd pages of documentation. But ODRAW, as it turned out, is built on the same rh headers with the same recType values as PPT. That made it possible to simplify the task and cheat a little by searching the Drawing object for the same TextCharsAtom and TextBytesAtom atoms by their recType values.
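The “cheat” of looking for text atoms by record type could look roughly like this in Python. It is only a sketch under stated assumptions: extract_text is a hypothetical name, the recType values 0x0FA0 (TextCharsAtom) and 0x0FA8 (TextBytesAtom) and the rule that a record whose first header word has its low four bits set to 0xF is a container come from my reading of the [MS-PPT] specification, and the ANSI text is decoded as Latin-1.

```python
import struct

RT_TEXT_CHARS_ATOM = 0x0FA0  # assumption: Unicode text (UTF-16LE)
RT_TEXT_BYTES_ATOM = 0x0FA8  # assumption: one byte per character


def extract_text(data: bytes, start: int = 0, end: int = None) -> list:
    """Collect text from every text atom found between `start` and `end`,
    descending into container records along the way."""
    if end is None:
        end = len(data)
    chunks = []
    pos = start
    while pos + 8 <= end:
        first_word, rec_type, rec_len = struct.unpack_from("<HHI", data, pos)
        body = pos + 8
        body_end = min(body + rec_len, end)  # tolerate truncated files
        if rec_type == RT_TEXT_CHARS_ATOM:
            text = data[body:body_end].decode("utf-16-le", errors="replace")
            chunks.append(text.replace("\r", "\n"))
        elif rec_type == RT_TEXT_BYTES_ATOM:
            text = data[body:body_end].decode("latin-1")
            chunks.append(text.replace("\r", "\n"))
        elif first_word & 0x000F == 0x000F:
            # assumption: this marks a container, so its body is more records
            chunks.extend(extract_text(data, body, body_end))
        pos = body + rec_len
    return chunks
```

Because the Drawing object uses the same headers, the same function can be pointed at its byte range without knowing anything else about the nested structure.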
Implementation
You can get the commented code on GitHub. It is still a bit raw, but I think I will find all the pitfalls in the near future. Most of the errors come from my not quite correct (?) reading of PersistDirectory. If anyone has clarifications, I will be glad to hear them.