oberon87 May 14, 2015 at 12:28

Open Document Format (ODF) generator in Go

I want to share with the community my experience of building a library that generates documents through a programmer-friendly interface. For Go, this niche matters no less than yet another web toolkit: the availability of reports, and of tools to generate them, makes Go more attractive for the bloody enterprise.

Reporting is a multi-step process. Reporting tools can automate different stages of it: working with the database, managing filter criteria, calculating values, and writing the result to the final document. This article is about the last stage.

Introduction

At the moment, the results for the search query “golang odf” leave much to be desired. For “golang pdf”, of course, things look much rosier. But from my experience of developing business applications that have to produce reports as office documents, I can say with confidence what often happens next: a beautiful PDF arrives on the user's computer, the user checks the numbers, sees a discrepancy, and calls support asking to correct a figure in the received file, because the report was needed “yesterday”.

One solution is to generate the document in Word / RTF / ODF / etc. format; another is to edit the PDF (ready-made tools exist for this, so it is of more interest to admins than to programmers). Proprietary formats are welcome for discussion in the comments; here I will talk about ODF generation.

ODF format

The Open Document format is an open format for editable office documents. The best-known office suites that support it are OpenOffice and LibreOffice. The current version of the standard is 1.2; it differs little from version 1.0, and the differences are mostly organizational. Since the standard is already several years old, it can be considered stable and relied upon. The format covers various types of documents: text documents, spreadsheets, presentations, and so on. In my experience, the formats that matter most for storing reports are text documents (odt) and spreadsheets (ods).

The standard describes the rules for structuring a document model and for saving this model as XML. Technically, a document file is either a single large XML file or a zip archive with several required files and any number of additional files. The archive form is convenient for embedding images and other files, so I will only consider documents in zip archive form.

It is also worth noting that ODF is a state standard in the Russian Federation. And although Microsoft Office holds a strong position in organizations, OpenOffice is often installed alongside it as a fallback.

Why go?

Why not? A simple language, Google's backing, a vibrant community, unoccupied niches. And besides, it's just fun. The ability to compile Go code to JavaScript also makes it possible, in some cases, to move report generation to the client entirely, which makes your web service more flexible. Finally, it would be a sin not to use fast native code for something as heavy as complex multi-level reports.

Library

I have been working with the ODF format for quite some time, since 2008, when I had to implement a generator of ODF reports and forms. Back then I wrote a component in a different language for convenient (as it seems to me) programmatic document generation. On the whole the result was satisfactory, and that component still works.

After several years of use, as usually happens, a number of complaints had accumulated, and I decided to fix them radically by rewriting the entire library from scratch. Since these are interesting times, I chose the Go ecosystem to scratch this creative itch. On the whole, though, I tried to keep the interface decisions of the previous version, since they are time-tested.

Next I will go into more specific things about the format. To get acquainted with it, you can read the introduction to the ODF standard.

What is the main difficulty of working with ODF from a programmer's point of view? It is that a modification of the visible contents of the document implicitly leads to modifications in several areas of the document model. Changes affect the content area, the style description area, and the contents of the zip package. In addition, because of the way XML treats whitespace, ODF includes special means of handling whitespace characters, which puts additional responsibilities on the generation tool.

Another important point is the reuse of styles. The component has to keep track of the styles that the user creates and save them in the most economical way, without duplication.
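One way to avoid duplicated styles can be sketched as a registry that maps a canonical form of an attribute set to a style name. This is only an illustration under my own assumptions: the type styleRegistry, the naming scheme "auto1", "auto2" and the key format are hypothetical and are not taken from the library.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// styleRegistry (hypothetical) hands out one style name per distinct attribute set.
type styleRegistry struct {
	names map[string]string // canonical attribute key -> style name
}

func newStyleRegistry() *styleRegistry {
	return &styleRegistry{names: map[string]string{}}
}

// key builds an order-independent representation of an attribute set.
func key(attrs map[string]string) string {
	parts := make([]string, 0, len(attrs))
	for k, v := range attrs {
		parts = append(parts, k+"="+v)
	}
	sort.Strings(parts)
	return strings.Join(parts, ";")
}

// nameFor returns the existing style name for attrs or registers a new one.
func (r *styleRegistry) nameFor(attrs map[string]string) string {
	k := key(attrs)
	if name, ok := r.names[k]; ok {
		return name
	}
	name := fmt.Sprintf("auto%d", len(r.names)+1)
	r.names[k] = name
	return name
}

func main() {
	reg := newStyleRegistry()
	a := map[string]string{"fo:font-weight": "bold", "fo:color": "#ff0000"}
	b := map[string]string{"fo:color": "#ff0000", "fo:font-weight": "bold"}
	// Both attribute sets are equal, so they share a single style.
	fmt.Println(reg.nameFor(a), reg.nameFor(b), len(reg.names))
}
```

Sorting the attribute pairs before joining them makes the key independent of map iteration order, so equal attribute sets always resolve to the same style.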

Implementation

To implement this whole mechanism, the concept of a formatter was invented (or rather, borrowed). A formatter is an aggregate that holds information about the document model, the document model itself, and a number of auxiliary data structures that are hidden from the client code but make it possible to control and verify the client code's actions.

For the document model, the previous version used a pure DOM, which was redundant, so the new version uses a simplified DOM-like data structure, which is then translated into XML by semi-manual marshalling with the Go standard library. The Carrier-Rider-Mapper (CRM) pattern is used to work with the document model. The carrier is the storage medium, in this case a tree of nodes. Riders are used to traverse the tree and modify it; the peculiarity is that in a tree data structure a rider is positioned on a node and runs over the list of that node's children and its attributes. The mapper in this scheme is a high-level mechanism that works with the document model through riders; as you might guess, in our case it is the formatter.

The content of the document itself is written as nodes with special names from the required namespaces. Node attributes also belong to special namespaces. When a document is opened, the editor interprets these sets of nodes and attributes and displays them as nicely formatted text and even tables. Therefore, the main task of the formatter is to write each node and its attributes, whose names and meanings are described in the standard, correctly and in the right place. For instance:
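The three roles can be sketched roughly as follows. All type and method names here (Node, Rider, paraWriter and so on) are invented for the illustration and do not reproduce the library's actual API; only the element and attribute names text:p, office:text and text:style-name come from the ODF standard.

```go
package main

import "fmt"

// Node is the carrier: one node of a simplified DOM-like tree.
type Node struct {
	Name     string
	Attrs    map[string]string
	Text     string
	Children []*Node
}

// Rider sits on one node and runs over that node's children.
type Rider struct {
	base *Node
	pos  int
}

// NewRider positions a rider on the given node.
func NewRider(base *Node) *Rider { return &Rider{base: base} }

// End reports whether all children have been read.
func (r *Rider) End() bool { return r.pos >= len(r.base.Children) }

// Next returns the current child and moves one step forward.
func (r *Rider) Next() *Node {
	n := r.base.Children[r.pos]
	r.pos++
	return n
}

// Append adds a child to the node the rider sits on.
func (r *Rider) Append(n *Node) { r.base.Children = append(r.base.Children, n) }

// paraWriter plays the mapper role: it knows which nodes make up
// a paragraph and writes them through a rider.
type paraWriter struct{ rider *Rider }

// WritePara appends a text:p node with an optional style name.
func (w *paraWriter) WritePara(text, style string) {
	p := &Node{Name: "text:p", Text: text, Attrs: map[string]string{}}
	if style != "" {
		p.Attrs["text:style-name"] = style
	}
	w.rider.Append(p)
}

func main() {
	root := &Node{Name: "office:text"}
	pw := &paraWriter{rider: NewRider(root)}
	pw.WritePara("Title", "P1")
	pw.WritePara("Body text", "")

	// A second rider reads back what the mapper has written.
	for r := NewRider(root); !r.End(); {
		n := r.Next()
		fmt.Printf("%s %q style=%q\n", n.Name, n.Text, n.Attrs["text:style-name"])
	}
}
```

The client code only talks to the mapper, so the tree layout can change without affecting it.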

The <text:p> element represents a paragraph, which is the basic unit of text in an OpenDocument file.

This describes a paragraph element whose contents are displayed as a paragraph with a paragraph style applied to it: alignment on the page, line spacing, etc.
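As an illustration of what semi-manual marshalling of such a node could look like, here is a sketch built on the token API of encoding/xml. It uses a common workaround: the prefixed name is placed directly into Name.Local, because encoding/xml does not emit namespace prefixes on its own. The function renderPara and the style name "P1" are assumptions made for the example; in a real document the xmlns:text declaration would sit on the root element.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
)

// renderPara (hypothetical) writes one paragraph element token by token.
func renderPara(style, text string) (string, error) {
	var buf bytes.Buffer
	enc := xml.NewEncoder(&buf)

	// The prefixed names go into Local as-is; Space stays empty.
	start := xml.StartElement{
		Name: xml.Name{Local: "text:p"},
		Attr: []xml.Attr{
			{Name: xml.Name{Local: "text:style-name"}, Value: style},
		},
	}
	tokens := []xml.Token{start, xml.CharData(text), start.End()}
	for _, t := range tokens {
		if err := enc.EncodeToken(t); err != nil {
			return "", err
		}
	}
	// The encoder buffers its output, so flush before reading the result.
	if err := enc.Flush(); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	s, err := renderPara("P1", "Hello, ODF & Go")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The encoder escapes character data, so "&" becomes "&amp;".
	fmt.Println(s)
}
```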

For now I have not set myself the task of reading and modifying an existing document, but the data model allows it; it is only a matter of implementation.

To improve extensibility, the initially monolithic formatter was split into several narrower mappers, each of which writes certain parts of a document (tree nodes). ParaMapper writes the contents of paragraphs; TableMapper writes tables and their structure, while the text inside cells is again written by ParaMapper. This approach lets you implement the features you need from a huge standard with targeted effort, saving the project's time and resources.

Text attributes, character formatting, paragraph alignment on the page, and other necessary attributes are set through a generalized attribute generation mechanism.
For each family of attributes, a special builder is implemented that lets you set the desired style.
An important feature is that setting attributes does not mean that they are actually written to the document. In practice, this leads to the following sequence of work:

Prepare the data to be written
Set the attributes of the future content
Write the content, which receives the set attributes
The attributes are written to the document

Until we set new attributes or reset them, every subsequent piece of content receives the same attributes. There is also the option of setting default attributes for the document, which are applied to any content that has no user attributes assigned.

Since the ODF standard is quite voluminous, I implemented only the minimum of features that may be needed for generating reports. These include text attributes (color, font, size), paragraph attributes (alignment), table and cell attributes (borders, color, line thickness), and image embedding.

Most of the effort went into building an extensible framework that lets you implement the attributes you need, or even new elements, in a few steps and without modifying the library code (you still have to write new code, of course). Interestingly, to the programmer the spreadsheet format looks almost the same as the text document format. Only the mimetype changes, along with the fact that the root element of the document content is no longer the text element. The table writing engine works the same way in both cases. This can be seen in the examples in the odf_test.go module.
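The scheme of work described above can be sketched like this. The names formatter, TextAttrs, SetAttr and ResetAttr are hypothetical and chosen only for the illustration; the real library's builders and method names may differ.

```go
package main

import "fmt"

// TextAttrs is a hypothetical, much reduced set of text attributes.
type TextAttrs struct {
	Color string
	Bold  bool
}

// run is a piece of content together with the attributes it received.
type run struct {
	Text  string
	Attrs TextAttrs
}

// formatter remembers attributes until content is actually written.
type formatter struct {
	defaults TextAttrs
	pending  *TextAttrs // nil means "use the document defaults"
	runs     []run
}

// SetAttr only records the attributes; nothing is written yet.
func (f *formatter) SetAttr(a TextAttrs) { f.pending = &a }

// ResetAttr drops the recorded attributes.
func (f *formatter) ResetAttr() { f.pending = nil }

// WriteString writes content and attaches the attributes in effect.
func (f *formatter) WriteString(s string) {
	attrs := f.defaults
	if f.pending != nil {
		attrs = *f.pending
	}
	f.runs = append(f.runs, run{Text: s, Attrs: attrs})
}

func main() {
	f := &formatter{defaults: TextAttrs{Color: "#000000"}}
	f.WriteString("plain ")
	f.SetAttr(TextAttrs{Color: "#ff0000", Bold: true})
	f.WriteString("red ")
	f.WriteString("still red ")
	f.ResetAttr()
	f.WriteString("plain again")

	for _, r := range f.runs {
		fmt.Printf("%q %+v\n", r.Text, r.Attrs)
	}
}
```

Because attributes are attached only at write time, attributes that are set but never followed by content leave no trace in the document.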
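The difference between the two document types can be illustrated with a small sketch. The media types and the element names office:text and office:spreadsheet come from the ODF standard; the function bodyElement and the constant names are hypothetical and not part of the library.

```go
package main

import "fmt"

// Media types defined by the ODF standard for the two document kinds.
const (
	mimeText        = "application/vnd.oasis.opendocument.text"
	mimeSpreadsheet = "application/vnd.oasis.opendocument.spreadsheet"
)

// bodyElement (hypothetical) returns the name of the element that
// holds the document content inside office:body.
func bodyElement(mime string) (string, error) {
	switch mime {
	case mimeText:
		return "office:text", nil
	case mimeSpreadsheet:
		return "office:spreadsheet", nil
	}
	return "", fmt.Errorf("unsupported media type %q", mime)
}

func main() {
	for _, m := range []string{mimeText, mimeSpreadsheet} {
		name, err := bodyElement(m)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(m, "->", name)
	}
}
```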

Example

A simple example to show that the library works and how to use it.

package main

import (
    "log"
    "os"

    "odf/generators"
    "odf/mappers"
    "odf/model"
    _ "odf/model/stub" // don't forget to load the hidden implementation
    "odf/xmlns"
)

func main() {
    output, err := os.Create("demo2.odf")
    if err != nil {
        log.Fatal(err)
    }
    // close the file when main returns
    defer output.Close()
    // create an empty document model
    m := model.ModelFactory()
    // create a formatter
    fm := &mappers.Formatter{}
    // connect the formatter to the document
    fm.ConnectTo(m)
    // set the document type, in this case a text document
    fm.MimeType = xmlns.MimeText
    // initialize the internal structures
    fm.Init()
    // write a simple string
    fm.WriteString("Hello, Habrahabr!")
    // save the file
    generators.GeneratePackage(m, nil, output, fm.MimeType)
}

More complex examples can be found in odf_test.go and demo/report.go in the project repository.
As a full-fledged example, a demo report was generated (see the Demo link in the references).

Conclusion

In conclusion, I would like to note that producing an ODF document is not complicated in itself. The main goal of both the first and the second version of the library was to provide a convenient programming interface for further use in generating reports and forms. A basic interface of this kind also opens the way to building converters from other formats, for example from simple HTML, in which web applications often produce their reports.

One drawback of this approach is the huge amount of manual work needed to translate the standard into code and to come up with a convenient interface for client code. I fully admit that it may make sense to look at automating this work by processing the RelaxNG schema of the entire ODF standard, which describes the capabilities and limitations of the format in a form convenient for automation.

If you write in Go and you need reports, here you are. The spirit of open source and team development can carry the library into a vast niche. And since ODF can be converted to PDF in batch mode, this significantly expands the options for tweaking numbers and churning out reports.

References

ODF format description: docs.oasis-open.org/office/v1.2/OpenDocument-v1.2.html
Project repository: github.com/kpmy/odf
Demo: yadi.sk/i/RghkBDHIgcey2
