Rembish December 1, 2010 at 00:14

Text at all costs: Miette

Yes, you are not mistaken, and this is not déjà vu. If you are a regular reader, you have probably seen this topic before. A lot of time has passed since then, yet I keep receiving letters with questions and requests for advice on reading text from binary data formats. That means the topic is still relevant and interesting to the programming community.

Over this year (and indeed more than a year has passed) I changed jobs and now do completely different things; for a long time I have not programmed in PHP (or, to be precise, I program in it very little). The new project obliged me to improve my Python (and feel its power), so one Sunday evening I decided to rewrite and, most importantly, improve some of my libraries for reading text. Today I present to the public a young open-source project, Miette ("yummy", if translated from French), which is designed (in the near future) to read Microsoft Office files.

The main objective of Miette is first of all to read plain text from Office formats, but this time I would like to go further and do the (im)possible: make the parser read formatting as well (at least minimally). The task is difficult but quite feasible, given free evenings and interest (and possibly some help in the form of testing and joint development) from those who need it. For now, though, these are just plans and, so to speak, a hobby.

Naturally, Python differs from PHP in many respects and, in my opinion, offers somewhat more functionality, so the libraries in this project are built on a somewhat different principle than the old PHP "craft". This time I decided to forbid myself, as both developer and customer, from loading any large blocks into memory in one go. Miette reads data gradually, on demand, as Word itself does. This makes it lightweight and undemanding of RAM. In the future I will try to run it through a profiler and find the bottlenecks that should be optimized further.

Shall we move on?

I advise you to review the old article and the source code for cfb and doc in PHP before reading further.

Project structure

The project consists (and will continue to consist) of directories, each containing a reader for one type of file. At the moment there is a reader for the Compound File Binary File Format (CFB), which is the container for the data of most Office files, and one for DOC (Microsoft Word). Support for XLS and PPT will be added later.

The CFB reader contains two main objects, Reader and DirectoryEntry, on which the rest of the readers are built. Reader provides an interface for working with the "directory entries" that make up CFB storage; with it you can access the required entry either by name or by number. The root entry ("Root Entry") is forwarded as an attribute, which, as can be seen in the DirectoryEntry class, largely organizes and standardizes the work with the mini FAT.
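To make "access by name or by number" concrete, here is a minimal sketch of decoding raw directory entries and indexing them. It follows the 128-byte directory entry layout of the MS-CFB specification as I understand it. The names ENTRY_STRUCT, Entry, parse_entry, EntryTable and get_entry_by_id are hypothetical and introduced only for illustration; this is not Miette's actual code.

```python
import struct
from collections import namedtuple

# One directory entry is 128 bytes, little-endian:
# name (64 bytes, UTF-16LE), name length in bytes (incl. null terminator),
# object type, color flag, left sibling ID, right sibling ID, child ID,
# CLSID, state bits, creation time, modified time, start sector, stream size.
ENTRY_STRUCT = struct.Struct('<64sHBBIII16sIQQIQ')

Entry = namedtuple('Entry', 'name type left right child start_sector size')


def parse_entry(raw):
    """Decode one raw 128-byte directory entry into an Entry tuple."""
    (name, name_len, obj_type, _color, left, right, child,
     _clsid, _state, _created, _modified, start, size) = ENTRY_STRUCT.unpack(raw)
    # Drop the 2-byte null terminator that name_len includes.
    text = name[:max(name_len - 2, 0)].decode('utf-16-le')
    return Entry(text, obj_type, left, right, child, start, size)


class EntryTable(object):
    """Hypothetical stand-in for Reader: lookup by number or by name."""

    def __init__(self, raw_entries):
        self.entries = [parse_entry(raw) for raw in raw_entries]
        self._by_name = dict((e.name, e) for e in self.entries)

    def get_entry_by_id(self, entry_id):
        # Entries are numbered by their position in the directory.
        return self.entries[entry_id]

    def get_entry_by_name(self, name):
        # Returns None when no entry has that name.
        return self._by_name.get(name)
```

The sketch assumes the raw 128-byte entries have already been collected from the directory sectors; following the FAT chain to gather them is left out.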

DirectoryEntry implements a minimal file interface: read([size]), seek(offset, [whence]) and tell(). This again simplifies working with entries and is, in general, in the spirit of Python. You can still read a whole entry by calling read() without a parameter, but when you read only a few bytes you get the real benefit: nothing extra is read. In addition, you can reach the left and right siblings and the child entry through the corresponding attributes, which makes walking the CFB tree convenient and unobtrusive.
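The idea of a minimal file interface that reads only what is asked for can be sketched roughly as follows. The class name EntryStream is hypothetical, and the sketch makes a big simplifying assumption: the entry occupies one contiguous byte range of the file, whereas a real CFB stream is chained through FAT sectors. It is an illustration of the interface, not Miette's implementation.

```python
import io


class EntryStream(object):
    """File-like window over bytes [start, start + size) of an open file.

    Assumption for this sketch: the entry is stored contiguously.
    """

    def __init__(self, fileobj, start, size):
        self._file = fileobj
        self._start = start
        self._size = size
        self._pos = 0  # position relative to the start of the entry

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError('invalid whence: %r' % (whence,))
        # Clamp so that the position never leaves the entry.
        self._pos = min(max(pos, 0), self._size)
        return self._pos

    def read(self, size=None):
        remaining = self._size - self._pos
        if size is None or size < 0 or size > remaining:
            size = remaining  # read() without an argument reads to the end
        # Only the requested bytes are pulled from the underlying file.
        self._file.seek(self._start + self._pos)
        data = self._file.read(size)
        self._pos += len(data)
        return data
```

Because the object keeps only a position and two offsets, opening an entry costs almost nothing until read() is actually called.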

DocTextReader serves as an example of working with CFB. As you can see, in contrast to the PHP implementation, we try to read less data into RAM and instead move around the doc file constantly. The additional DirectoryEntry methods get_byte, get_short and get_long help here: each reads the corresponding number of bytes from a given position. The main entries, 0Table/1Table and WordDocument, are forwarded as class attributes.
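Helpers of this kind can be sketched with the standard struct module. The version below is written as free functions over any seekable binary stream, and it assumes little-endian unsigned values of 1, 2 and 4 bytes, which is the usual convention in Microsoft binary formats. The function names mirror the methods mentioned above, but the bodies and the _read_at helper are my own illustration, not the library's code.

```python
import io
import struct


def _read_at(stream, offset, fmt):
    """Seek to offset and unpack a single value described by fmt."""
    size = struct.calcsize(fmt)
    stream.seek(offset)
    data = stream.read(size)
    if len(data) != size:
        raise EOFError('wanted %d bytes at offset %d, got %d'
                       % (size, offset, len(data)))
    return struct.unpack(fmt, data)[0]


def get_byte(stream, offset):
    # 1 byte, unsigned
    return _read_at(stream, offset, '<B')


def get_short(stream, offset):
    # 2 bytes, unsigned, little-endian (assumed)
    return _read_at(stream, offset, '<H')


def get_long(stream, offset):
    # 4 bytes, unsigned, little-endian (assumed)
    return _read_at(stream, offset, '<I')
```

Each call touches only the few bytes it needs, so a value such as an offset stored in the WordDocument stream can be fetched without loading the stream itself.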

This implementation is experimental; in the future DocTextReader will have standardized methods for reading a given number of bytes from a chosen position, and possibly some other functions of the file class.

Usage example

And finally, an example of using the library. P.S. I hope Miette will not disappoint you. Stay tuned on GitHub :)

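from doc.text import DocTextReader

doc = DocTextReader('parus.doc')

# Entries can be reached as an attribute, by name, or by walking the tree.
root_entry = doc.root_entry
word_document = doc.get_entry_by_name('WordDocument')
one_table = root_entry.child.left_sibling.left_sibling

# Read a 4-byte value from offset 0x01a2 of the WordDocument entry.
fc_clx = word_document.get_long(0x01a2)

# Entries behave like files: seek, read a single byte, check the position.
one_table.seek(fc_clx)
print(one_table.read(1))
print(one_table.tell())  # fc_clx + 1

# Read the text of the whole document.
print(doc.read())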
