Uploading data to Excel. Civilized

  • Tutorial
There are tasks in the IT industry that, against the background of success in big data , machine learning , blockchain and other fashion trends, look completely unattractive, but for decades they have not ceased to be relevant for an entire army of developers. It will be about the old as the world task of forming and uploading Excel-documents, faced by everyone who has ever written business applications.

What possibilities of building Excel files exist in principle?

  1. VBA macros. Nowadays, for security reasons, the idea of ​​using macros is often not appropriate.
  2. Excel Excel external program through the API. Requires Excel on the same machine as the program that generates Excel reports. At a time when customers were thick and were written in the form of Windows desktop applications, this method was suitable (although it did not differ in speed and reliability), in today's realities this is a difficult case to be achieved.
  3. Generate an XML Excel file directly. As you know, Excel supports the XML format for saving a document, which can potentially be generated / modified using any XML tool. This file can be saved with the .xls extension, and although, strictly speaking, it is not an xls file, Excel opens it well. This approach is quite popular, but the disadvantages include the fact that any solution based on direct editing of the XML-Excel format is a one-time hack, lacking in common.
  4. Finally, it is possible to generate Excel files using open source libraries, of which Apache POI is especially known . Apache POI developers have done a titanic work on reverse engineering of binary formats of MS Office documents, and continue to maintain and develop this library for many years. The result of this reverse engineering, for example, is used in Open Office to implement saving documents in formats compatible with MS Office.

In my opinion, it is the latter method that is now preferable for generating MS Office-compatible documents. On the one hand, it does not require installation of any proprietary software on the server, but on the other hand, it provides a rich API that allows you to use all the functionality of MS Office.

But using Apache POI directly has its drawbacks. First, it is a Java library, and if your application is not written in one of the JVM languages, you can hardly use it. Secondly, it is a low-level library that works with such concepts as “cell”, “column”, “font”. Therefore, the “head-on” written procedure for generating a document quickly turns into an abundant “noodle” of hard-to-read code, where there is no separation between the data model and presentation, it is difficult to make changes, and in general - pain and shame. And an excellent reason to delegate the task to the most inexperienced programmer - let him pick.

But it could be completely different. Project Xylophoneunder the LGPL license, built on the basis of Apache POI, is based on an idea that has approximately 15 years of history. In the projects where I participated, it was used in combination with a variety of platforms and languages ​​- and probably due to the varieties of forms made with it in a wide variety of projects, there are already thousands going on. This is a Java project that can work both as a command line utility and as a library (if you have JVM code, you can connect it as a Maven dependency).

Xylophone implements the principle of separating the data model from their presentation. In the upload procedure, you need to generate data in XML format (without worrying about cells, fonts and dividing lines), and Xylophone, using an Excel template and a descriptor describing how to work around your XML file with data, will generate the result, as shown in the diagram:

The document template (xls / xlsx template) looks something like this:

As a rule, the procurement of such a template provides the customer. The involved customer is happy to take part in the creation of a template: starting with choosing the right shape from the “Consultant” or creating your own from scratch, and ending with font sizes and widths of dividing lines. The advantage of the template is that minor changes to it are easy to make when the report is fully developed.

When the “design” work is done, the developer remains

  1. Create a procedure for uploading the necessary data in XML format.
  2. Create a descriptor that describes how to crawl the elements of the XML file and copy fragments of the template into the resulting report
  3. Bind template cells to elements of an XML file using XPath expressions.

Everything is more or less clear with uploading to XML: it is enough to choose an adequate XML representation of the data necessary to fill out the form. What is a handle?

If the form we create did not have duplicate elements with different quantities (such as invoice lines, of which there are different numbers for different invoices), then the descriptor would look like this:


Here root is the name of the root element of our XML file with data, and the range A1: Z100 is the rectangular range of cells from the template that will be copied to the result. At the same time, as can be seen from the previous illustration, the wildcard fields whose values ​​are replaced with data from the XML file have the format ~{XPath-выражение}(tilde, brace, XPath expression relative to the current XML element, the closing brace).

What if we need duplicate elements in the report? Naturally, they can be represented as elements of an XML file with data, and a descriptor helps to iterate over them as needed. The repetition of elements in a report can have both a vertical direction (when we insert invoice lines, for example), and a horizontal one (when we insert columns of an analytical report). At the same time, we can use nesting of XML elements to reflect as deeply nested repetitive elements of the report as shown in the diagram:

The red squares mark the cells, which will be the upper left corner of the next rectangular fragment that docks the report generator.

There is also one more possible variant of repeating elements: sheets in the Excel book. The ability to organize such an iteration is also available.

Consider a slightly more complex example. Suppose we need to get a summary report like the following:

Let the user choose a range of years for uploading, therefore in this report both rows and columns are dynamically created. The XML data representation for such a report might look like this:

<?xml version="1.0" encoding="UTF-8"?><report><columnyear="2016"/><columnyear="2017"/><columnyear="2018"/><itemname="Товар 1"><yearamount="365"/><yearamount="286"/><yearamount="207"/></item><itemname="Товар 2"><yearamount="95"/><yearamount="606"/><yearamount="840"/></item><itemname="Товар 3"><yearamount="710"/><yearamount="437"/><yearamount="100"/></item><totals><yearamount="1170"/><yearamount="1329"/><yearamount="1147"/></totals></report>

We are free to choose the names of tags to your liking, the structure can also be arbitrary, but with an eye to the ease of conversion into the report. For example, I usually write the values ​​displayed on the sheet into attributes, because this simplifies XPath expressions (conveniently, when they look like @имяатрибута).

The template for such a report will look like this (compare XPath expressions with the attribute names of the corresponding tags):

Now comes the most interesting part: creating a descriptor. Since this is a report that is almost completely dynamically assembled, the descriptor is rather complicated, in practice (when we only have the header of the document, its lines and the footer) everything is usually much simpler. Here is the handle in this case:

<?xml version="1.0" encoding="UTF-8"?><elementname="report"><!-- Создаём лист --><outputworksheet="Отчет"sourcesheet="Лист1"/><!-- И за ним слева направо заголовки столбцов --><iterationmode="horizontal"><elementname="(before)"><!-- Выводим пустую ячейку в ЛВУ сводной таблицы --><outputrange="A1"/></element><elementname="column"><outputrange="B1"/></element></iteration><!-- Выводим строки: итерация с режимом вывода умолчанию, сверху вниз --><iterationmode="vertical"><elementname="item"><!-- И по строке - слева направо --><iterationmode="horizontal"><elementname="(before)"><!-- Заголовок строки --><outputrange="A2"/></element><!-- И за ним слева направо строку с данными --><elementname="year"><outputrange="B2"/></element></iteration></element></iteration><iteration><elementname="totals"><iterationmode="horizontal"><elementname="(before)"><!-- Заголовок строки --><outputrange="A3"/></element><!-- И за ним слева направо строку с данными --><elementname="year"><outputrange="B3"/></element></iteration></element></iteration></element>

Fully descriptor elements are described in the documentation . In short, the main descriptor elements mean the following:

  • element — Switch to the read mode for the element in the XML file. It can either be the root element of a descriptor, or be inside iteration. With the help of the attribute namevarious filters for elements can be set, for example
    • name="foo" - elements with tag name foo
    • name="*" - all items
    • name="tagname[@attribute='value']" - elements with a specific name and attribute value
    • name="(before)", name="(after)"Are “virtual” elements preceding the iteration and closing the iteration.
  • iteration - transition to the iteration mode. Can only be inside element. Various parameters can be set, for example
    • mode="horizontal" - horizontal display mode (by default - vertical)
    • index=0 - limit the iteration to only the very first element encountered.
  • output - switch to output mode. The main attributes are as follows:
    • sourcesheet—Leaf of the template book from which the output range is taken. If not specified, the current (last used) sheet is applied.
    • range- The range of the pattern to be copied to the resulting document, for example, “A1: M10”, or “5: 6”, or “C: C”. (Applying row ranges of the type “5: 6” in the horizontal output mode and column ranges of the type “C: C” in the vertical output mode will result in an error).
    • worksheet- if defined, a new sheet is created in the output file and the output position is shifted to cell A1 of this sheet. The value of this attribute, equal to the constant or XPath expression, is substituted into the name of the new sheet.

In fact, there are many more options in the descriptor, see the documentation.

Well, it's time to download Xylophone and start generating a report.
Take the archive from bintray or Maven Central (NB: at the time of reading this article, there may be more recent versions). A shell script is located in the / bin folder, and if you run it without parameters, you will see a prompt about the command line parameters. To get the result, we need to “feed” the xylophone all the previously prepared ingredients:

xylophone -data testdata.xml -template template.xlsx -descr descriptor.xml -out report.xlsx

Open the report.xlsx file and make sure that we get exactly what we need:

Since the library ru.curs: xylophone is available on Maven Central under the LGPL license, it can be used without problems in programs in any JVM language. Perhaps the most compact fully working example is in Groovy, the code does not need comments:

import ru.curs.xylophone.XML2Spreadsheet
baseDir = '.'
new File(baseDir, 'testdata.xml').withInputStream {
    input ->
        new File(baseDir, 'report.xlsx').withOutputStream {
            output ->
                        new File(baseDir, 'descriptor.xml'),
                        new File(baseDir, 'template.xlsx'),
                        false, output)
println 'Done.'

The class XML2Spreadsheethas several overloaded variants of the static method process, but they all boil down to the transfer of all the same "ingredients" needed to prepare the report.

An important option that I haven’t mentioned so far is the ability to choose between DOM and SAX parsers at the stage of parsing a file with XML data. As you know, the DOM parser loads the entire file into memory entirely, builds its object representation and makes it possible to bypass its contents in an arbitrary manner (including re-returning to the same element). The SAX parser never puts the entire data file in memory, instead it treats it as a “stream” of elements, making it impossible to return to the element again.

Using SAX mode in Xylophone (via the command line parameter -saxor by setting totrueuseSaxmethod parameter XML2Spreadsheet.process) is critically useful in cases when it is necessary to generate very large files. Due to the speed and efficiency of the resources of the SAX-parser, the speed of file generation increases many times. This comes at the cost of some small restrictions on the descriptor (described in the documentation), but in most cases the reports satisfy these restrictions, so I would recommend using SAX mode wherever possible.

I hope that the method of uploading to Excel via Xylophone you liked and will save a lot of time and nerves - as it saved us.

And finally, once again links:

Also popular now: