Rembish November 22, 2009 at 7:11 pm

Text at all costs: PPT

Some time ago we discussed getting clean text out of various data formats, be it PDF or DOC. In one of those discussions, it was suggested that parsing PowerPoint presentations would earn me hemorrhoids or some other terrible disease of the soft spot. Well, as fate would have it, I had to get text out of this “sweet” format. Frankly, I did not manage to earn hemorrhoids, but a class for parsing presentations did come out of it.

A little about the PPT format

Like DOC, PPT is built on top of the WCBFF format (a structured binary file format), which you can read about in the article on MS Word that I wrote earlier. I will only note that during testing I found several errors in the old WCBFF and DOC implementation (and, besides, came across a bunch of “broken” files from which I still had to get the text), so I advise those who use my code to update their sources.

But I digress. Let's get back to PowerPoint. Unlike DOC, we will not work with the “files” WordDocument and 1Table as before, but with presentation-specific ones: Current User and PowerPoint Document. Both “files” must be present in the “file system” of a CBF file for it to be a presentation. Their presence also lets you recognize a presentation when the file extension is wrong.

So, to start getting data out of a PPT, you should read the small “file” Current User, which consists of a single record, CurrentUserAtom. This record contains technical information about who last edited the file, but that is not the most important part. It also holds the offset to the first UserEditAtom record, which is discussed below.
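As a rough illustration, here is a minimal Python sketch of pulling that offset out of the Current User “file”. The field layout (an 8-byte record header, then the 4-byte fields size, headerToken and offsetToCurrentEdit), the record type 0x0FF6 and the function name are my assumptions based on the published [MS-PPT] description, not something taken from the code discussed in this article.

```python
import struct

RT_CURRENT_USER_ATOM = 0x0FF6  # assumed record type of CurrentUserAtom


def offset_to_current_edit(current_user: bytes) -> int:
    """Return the offset of the first UserEditAtom in 'PowerPoint Document'.

    Assumed layout: 8-byte header, then size, headerToken and
    offsetToCurrentEdit, each a 4-byte little-endian integer.
    """
    _, rec_type, _ = struct.unpack_from("<HHI", current_user, 0)
    if rec_type != RT_CURRENT_USER_ATOM:
        raise ValueError("not a CurrentUserAtom record")
    _size, _header_token, offset = struct.unpack_from("<III", current_user, 8)
    return offset
```

The returned value is an offset inside PowerPoint Document, not inside Current User.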

Now I'll explain how to read PPT records. Every record in a presentation starts with a special header, rh, which contains technical information about it. To get it, just read the first 8 bytes of the record. The first word usually does not contain anything we need, but the next 6 bytes do. The WORD at offset 2 (rh.recType) identifies the type of the record, which tells you what to do with it next. The long at offset 4 (rh.recLen) is the length of the record, not counting the eight-byte header. This layout is quite convenient and avoids many errors when parsing a presentation file.
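The header described above can be sketched in Python roughly as follows. The names RecordHeader and read_record_header are hypothetical, and splitting the first word into a 4-bit recVer and a 12-bit recInstance is my assumption from the [MS-PPT] description; the article itself only relies on recType and recLen.

```python
import struct
from typing import NamedTuple


class RecordHeader(NamedTuple):
    rec_ver: int       # low 4 bits of the first word (assumed split)
    rec_instance: int  # high 12 bits of the first word (assumed split)
    rec_type: int      # WORD at offset 2
    rec_len: int       # DWORD at offset 4, excludes the 8-byte header


def read_record_header(data: bytes, offset: int = 0) -> RecordHeader:
    """Decode the 8-byte little-endian header found at `offset`."""
    if len(data) - offset < 8:
        raise ValueError("truncated record header")
    ver_inst, rec_type, rec_len = struct.unpack_from("<HHI", data, offset)
    return RecordHeader(ver_inst & 0x000F, ver_inst >> 4, rec_type, rec_len)
```

The record body then occupies the rec_len bytes starting at offset + 8, and the next record starts right after it.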

What next? Back to the UserEditAtom record. This record already lives in PowerPoint Document, and from here on we work only with this “file”. By reading this and the related records, we must build a marvelous thing called PersistDirectory, an array of offsets with which we will look up the main structure of a PowerPoint document, DocumentContainer. To do this, we read the current UserEditAtom and find in it offsetPersistDirectory, the offset to the PersistDirectory of the current “live” version, and offsetLastEdit, the offset to the next UserEditAtom record in the chain. We keep collecting offsets this way until we come across zero in the DWORD offsetLastEdit.
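Walking this chain might look like the Python sketch below. I assume here that offsetLastEdit and offsetPersistDirectory sit at bytes 16 and 20 from the start of the record (that is, after the 8-byte header, lastSlideIdRef and the version fields), as I read the [MS-PPT] layout; the function name and the loop guard are mine.

```python
import struct
from typing import List


def collect_persist_directory_offsets(stream: bytes, first_edit_offset: int) -> List[int]:
    """Follow the UserEditAtom chain and return every offsetPersistDirectory,
    newest edit first."""
    offsets: List[int] = []
    seen = set()
    pos = first_edit_offset
    while True:
        if pos in seen:
            # Guard against broken files whose chain loops back on itself.
            raise ValueError("cyclic UserEditAtom chain")
        seen.add(pos)
        # Assumed field positions inside UserEditAtom (see lead-in).
        offset_last_edit, offset_persist_directory = struct.unpack_from(
            "<II", stream, pos + 16
        )
        offsets.append(offset_persist_directory)
        if offset_last_edit == 0:
            break
        pos = offset_last_edit
    return offsets
```

The cycle check is not part of the format; it is just cheap insurance against the “broken” files mentioned earlier.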

Having collected all the offsetPersistDirectory offsets, we must build the PersistDirectory itself. We walk the offsets in reverse order and read the PersistDirectoryAtom records. Each of them contains an array of PersistDirectoryEntry records, and each entry contains persistId, the number of its first object, and cPersist, the count of objects in the entry. This information is followed by an array of offsets to objects, which go into PersistDirectory. This is the most important array: through it we will find links to all objects of the presentation.
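Here is a hedged Python sketch of that step. It assumes that persistId and cPersist are packed into one DWORD (low 20 bits and high 12 bits respectively, per my reading of [MS-PPT]) followed by cPersist 4-byte offsets; the function names are hypothetical. Since PersistDirectory reading is exactly where I suspect my own mistakes, treat this as an illustration rather than a reference.

```python
import struct
from typing import Dict, Iterable


def read_persist_directory_atom(stream: bytes, offset: int) -> Dict[int, int]:
    """Map persist ids to object offsets for one PersistDirectoryAtom."""
    _, _rec_type, rec_len = struct.unpack_from("<HHI", stream, offset)
    pos = offset + 8
    end = pos + rec_len
    entries: Dict[int, int] = {}
    while pos < end:
        (packed,) = struct.unpack_from("<I", stream, pos)
        persist_id = packed & 0xFFFFF  # id of the first object in this entry
        c_persist = packed >> 20       # number of offsets that follow
        pos += 4
        for i in range(c_persist):
            (obj_offset,) = struct.unpack_from("<I", stream, pos)
            entries[persist_id + i] = obj_offset
            pos += 4
    return entries


def build_persist_directory(stream: bytes, offsets_newest_first: Iterable[int]) -> Dict[int, int]:
    """Apply the atoms oldest first so that newer edits override older ones."""
    directory: Dict[int, int] = {}
    for off in reversed(list(offsets_newest_first)):
        directory.update(read_persist_directory_atom(stream, off))
    return directory
```

Reading oldest first and letting later atoms overwrite earlier keys is the reason for walking the offsets in reverse order.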

Now go back to the last UserEditAtom we read and find the docPersistIdRef field there. This is the number of the most important object, DocumentContainer, in PersistDirectory. Read it. It stores a whole wagonload of information about the current presentation: headers and footers, notes for slides, and most importantly a SlideListWithTextContainer record, which contains all sorts of things about the slides.
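The lookup itself is short. In the Python sketch below I assume docPersistIdRef is the DWORD at byte 24 of UserEditAtom and that DocumentContainer has record type 0x03E8 (both from my reading of [MS-PPT]); the persist directory is taken to be a plain dict of persist id to offset, and the function name is made up.

```python
import struct
from typing import Dict, Tuple

RT_DOCUMENT = 0x03E8  # assumed record type of DocumentContainer


def find_document_container(
    stream: bytes, user_edit_offset: int, persist_directory: Dict[int, int]
) -> Tuple[int, int]:
    """Return (body_offset, body_length) of the DocumentContainer."""
    # Assumed position of docPersistIdRef inside UserEditAtom.
    (doc_persist_id_ref,) = struct.unpack_from("<I", stream, user_edit_offset + 24)
    doc_offset = persist_directory[doc_persist_id_ref]
    _, rec_type, rec_len = struct.unpack_from("<HHI", stream, doc_offset)
    if rec_type != RT_DOCUMENT:
        raise ValueError("persist directory does not point at a DocumentContainer")
    return doc_offset + 8, rec_len
```

Checking the record type at the target offset is a cheap way to notice a wrongly built PersistDirectory early.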

We are interested in only three types of records stored in this container: TextCharsAtom, TextBytesAtom and SlidePersistAtom. With the first two, everything is simple: they hold the Unicode text of a slide and regular ANSI text, respectively. It is another matter when, instead of text, we get a reference to a slide, SlidePersistAtom. Following it, we have to read the Drawing object, which (sic!) is not a PPT object. Yes, in this case an MS Drawing object is embedded inside the slide, with a rather unpleasant structure of nested records.
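Decoding the two text atoms could be sketched in Python like this. The record type values 0x0FA0 and 0x0FA8 are my assumption from [MS-PPT], and I decode TextBytesAtom as Latin-1 on the assumption that each byte is the low byte of a UTF-16 character; if your files really use another ANSI code page, swap the codec.

```python
RT_TEXT_CHARS_ATOM = 0x0FA0  # assumed: UTF-16LE text
RT_TEXT_BYTES_ATOM = 0x0FA8  # assumed: one byte per character


def decode_text_atom(rec_type: int, payload: bytes) -> str:
    """Turn the body of a TextCharsAtom or TextBytesAtom into a str."""
    if rec_type == RT_TEXT_CHARS_ATOM:
        return payload.decode("utf-16-le", errors="replace")
    if rec_type == RT_TEXT_BYTES_ATOM:
        # Latin-1 maps every byte 0x00..0xFF to the same code point.
        return payload.decode("latin-1")
    raise ValueError("not a text atom: 0x%04X" % rec_type)
```

The payload here is the record body only, without the eight-byte header.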

When I first learned about this, I was, to be honest, upset: another 600-odd pages of documentation. But ODRAW, as it turned out, is built on the same rh headers with the same recType values as PPT. This made it possible to simplify the task and cheat a little by searching the Drawing object for the same TextCharsAtom and TextBytesAtom atoms by their recType.
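The “cheat” can be sketched in Python as a recursive scan over record headers. This is a simplified illustration, not the code from the repository: it assumes that a record is a container whenever the low 4 bits of its first header word equal 0xF, and it uses the record type values 0x0FA0 and 0x0FA8 for the text atoms; the function name is hypothetical.

```python
import struct
from typing import Iterator, Optional

RT_TEXT_CHARS_ATOM = 0x0FA0  # assumed: UTF-16LE text
RT_TEXT_BYTES_ATOM = 0x0FA8  # assumed: one byte per character


def iter_text(data: bytes, start: int = 0, end: Optional[int] = None) -> Iterator[str]:
    """Yield the text of every text atom found in data[start:end],
    descending into anything that looks like a container."""
    if end is None:
        end = len(data)
    pos = start
    while pos + 8 <= end:
        ver_inst, rec_type, rec_len = struct.unpack_from("<HHI", data, pos)
        body = pos + 8
        body_end = min(body + rec_len, end)  # tolerate truncated records
        if rec_type == RT_TEXT_CHARS_ATOM:
            yield data[body:body_end].decode("utf-16-le", errors="replace")
        elif rec_type == RT_TEXT_BYTES_ATOM:
            yield data[body:body_end].decode("latin-1")
        elif ver_inst & 0x000F == 0x000F:
            # Assumed container marker: recurse into the nested records.
            yield from iter_text(data, body, body_end)
        pos = body + rec_len
```

Called on the body of a slide or Drawing object, this yields text fragments in file order and silently skips every record it does not recognize.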

Implementation

You can get the commented code on GitHub. It is still a little raw, but I think I will find all the pitfalls in the near future. The main errors pop up precisely because of my not quite correct (?) reading of PersistDirectory. If anyone has clarifications, I will gladly hear them.

Literature

Text at all costs
