Dissecting .XLSX Files: Introduction, Cell Styles

Not so long ago, as a matter of duty, it was necessary to edit MS Office files (primarily MS Word and MS Excel) using PL / SQL, that is, a language with .NET that is practically unconnected. In this regard, a problem arose that in the Microsoft manual on editing these files, as they say, “hands” did not say anything, and the only sane reference site on this case, apparently, was not updated for the year from 2010. Understanding the need to reduce putting together everything that I bit by bit from Stackoverflow and my own experiments came almost immediately.

Just want to make a few comments.

The first one . Basically I will talk about what I personally encountered. I have no claims to one hundred percent knowledge of the “wrong side” of the format.

Second. As many probably know, MS Office 2007 and higher files are an archive that can be opened using any archiver (WinRAR, 7zip, and so on).

The third . The “under the hood” of these files is mostly XML markup, proudly referred to as OOXML or simply OpenXML. Therefore, in principle, Notepad (or, more conveniently, Notepad ++) will be enough to understand the principles of editing files “by hand”.

So, let's start with the MS Excel format as the most used for generating all kinds of reports, uploads from the database and others like them.

When dealing with the .XLSX format, each question “why is it implemented ... so?” Should be answered: “optimization required it”. When Microsoft expanded the permissible sheet size to obscene from the point of view of the observer (about a million rows and 16 thousand with a ponytail of columns), they realized that there would be maniacs who would fill all this with data to the eyeballs. And through this, if you approach data storage “head-on”, then even the ever-increasing PC capacities will not be enough under any circumstances. Therefore, it is necessary to make the volume of data stored less than the amount of data in the book. How to do it? Looking ahead: for example, to exclude duplicate rows.

The first file we are interested in is% file% / xl / workbook.xml. Its main purpose is the manifest, that is, a list of sheets of which our Excel workbook, in fact, consists.
This list looks, for example, like this:


This means that there are 4 sheets in the book, and their names are indicated in the name attributes. Each tag must have a file in the% file% / xl / worksheets folder. Excel itself knows what these files should be called, and when you try to rename them, it will find the whole book damaged.

Even in the% file% / xl folder, we are interested in the styles.xml file.

As you might guess, information about cell design is stored here. Moreover, for the sake of optimization, it is stored in a rather interesting form. The file consists of the following sections:

1. Fonts:


As you can understand, only the unique text styles used in the book are listed here. Each tag is one style. Nested tags - style features such as bold writing (tag ), size () other.

2. Filling cells:


As you can see, the first option is no fill at all, and the second is a solid fill of the library color “gray125”.

3. Borders:


As you can see, one name here consists of five elements, 4 main borders and a diagonal, that is, everything that can be configured through the GUI of Excel itself.

4. Cell styles:


And here it is necessary to explain in more detail. When we specify the cell style in the sheet file, we will refer to this section. Each tag, which is one style, is a collection of links to previous sections, that is, it itself does not contain font declarations, borders and fills. Let's take a closer look at some interesting attributes:

  1. numFmtId - indicates the format of the text in the cell (date, currency, number, text, ...). A complete list of these types is here.
  2. fontId, fillId, borderId - link to the font / fill / border (see section 1, 2 and 3, respectively). Numbering starts at 0.
  3. applyFont, applyFill, applyBorder - an indication that a custom font / fill / border is generally used in the design of this cell. The default value is 0, so you can omit it at all, as seen in the example in element # 0.
  4. applyAlignment - an indication that the alignment of the text in the cell will differ from the standard. The default is 0, but if "1" is specified, then in the parent tag
    embeds tag as seen in the example starting at # 1.

Next, the conversation will go about text data and the layout of the sheets themselves.

Also popular now: