“NoERP”, or a fresh look at data normalization in corporate accounting systems

    Hello, Habr!

    This article is an attempt to look at the architecture of modern enterprise accounting and planning systems (let's call them ERP for short) from the perspective of new experience. The impetus for writing it was a small performance test of a web application that managed to aggregate 10 million facts in a minute and a half by simple brute force. Of course, for modern DBMSs this is a trivial volume, but I remembered that roughly the same number of transactions per month was generated by the ERP system of the commercial and industrial holding where I once worked. Several factories, continuous production, wholesale distribution, a relatively simple financial model: a completely typical company with a typical accounting system. Yet monthly reporting took more than 20 minutes to build, on an application cluster backed by a SQL cluster, with indexes, statistics, and hints. Annual reporting took up to 6 hours, and the need to recalculate it in “what-if” mode turned into sleepless nights for accountants and economists. Systems of this class are quite complex and expensive.

    To clarify the idea, I will briefly describe the architecture of a typical Western ERP (in the section below). For now, imagine that we set out to design such a system from scratch, taking into account the requirements of today rather than those of 30 years ago, when all these systems were created. The first step is to describe all the inputs and outputs of our “black box”, and then work out which internal abstractions are really necessary and which are redundant.

    Inputs and outputs

    Input. Business transactions, including forecasts, plans, and budgets, are recorded as primary documents of various types. A document has a tree structure up to 3 levels deep: header, lines, and sublines (for example, warehouse lots or storage cells). Each document type has its own set of attributes, some of which feed further calculation algorithms, while others are purely for reference. The input data is partially validated at entry; the rest of the checks happen at the posting stage, after additional attributes have been filled in (which is usually entrusted to specially trained people). Thus a document has a life cycle during which it “grows”, acquiring new data and links to other documents.
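To make the document structure concrete, here is a minimal sketch in Python. The document type, field names, and posting behavior are all invented for illustration; they do not come from any real ERP schema.

```python
# A 3-level document: header -> lines -> sublines (e.g. warehouse lots).
purchase = {
    "doc_type": "purchase_invoice",
    "header": {"number": "PI-001", "counterparty": "ACME", "status": "draft"},
    "lines": [
        {
            "item": "steel_sheet",
            "qty": 100,
            "price": 12.5,
            # sublines: the warehouse lots this line's quantity was split into
            "sublines": [
                {"lot": "L-1", "qty": 60},
                {"lot": "L-2", "qty": 40},
            ],
        }
    ],
}

def lot_qty(doc):
    """Sum subline quantities -- a simple cross-level consistency check."""
    return sum(s["qty"] for line in doc["lines"] for s in line["sublines"])

def post(doc, account):
    """Posting 'grows' the document with attributes added later in its
    life cycle, instead of storing them in separate tables."""
    doc["header"]["status"] = "posted"
    doc["header"]["account"] = account
    return doc
```

The point of the sketch is that the attributes added during the life cycle stay inside the document itself, which is the idea this article builds toward.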

    At the output, in theory, summary reports on the company's activities are expected (balances, turnovers, various kinds of financial results, costs, forecasts, plans, tax registers, etc.). This is how these systems are pitched to the customer: a dashboard for senior management aggregating all macro indicators, plus the theoretical ability to drill down from any figure to the original document in ten mouse clicks. However, today's output requirements are somewhat broader:

    1. In the current era of a total “presumption of financial guilt” (the Federal Tax Service and the Big Four auditors work the same way), consolidated statements are requested first, followed by all primary documents (most often in electronic form, though paper can be requested too). The documents are loaded into the auditor's own accounting system, the calculation is performed according to the audited company's accounting policies, and the results are compared. To prepare for such a check in advance, companies are forced to generate their reports in the context of primary documents, which essentially defeats the purpose of posting with hierarchical rollup. Roughly speaking, instead of a simple P&L you need to see the share of the shop cleaner's salary in the cost of finished goods, especially since modern OLAP systems are quite capable of processing such data volumes.
    2. Modern business-analysis and machine-learning systems want all the big data as it is: even a seemingly insignificant reference attribute of the source document (the manager's position, a geographical location) can matter in certain cases. Yet in our system there is a chain “source document - module postings - general ledger”, where the higher the level, the less specific the data; not a single table contains comprehensive information about a business transaction, and the data is “smeared” across a couple of dozen tables.
    3. The relationship between a general ledger (GL) transaction and a source document line is not always unambiguous; usually it is many-to-many, which makes it impossible to automatically pull the source document's attributes into a summary report with simple joins, while “distribution on the fly” is typically slow and yields the same tens of minutes or even hours for a seemingly simple report. And a couple of dozen outer joins rarely do performance any favors.
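The many-to-many problem in point 3 can be sketched in a few lines of Python. One GL entry covers several document lines, so a source-document attribute (here, the manager) can only be pulled into a report by distributing the GL amount “on the fly”. All names, weights, and amounts below are made up.

```python
gl_entry = {"id": "GL-7", "amount": 300.0}

# Link table: which document lines this GL entry was rolled up from.
links = [
    {"gl": "GL-7", "doc_line": "A", "weight": 2},
    {"gl": "GL-7", "doc_line": "B", "weight": 1},
]

doc_lines = {
    "A": {"manager": "Ivanov"},
    "B": {"manager": "Petrova"},
}

def distribute(entry, links, doc_lines):
    """Split the GL amount across linked document lines in proportion
    to the link weights -- the kind of join-plus-allocation that gets
    expensive when repeated for millions of entries at report time."""
    total = sum(l["weight"] for l in links if l["gl"] == entry["id"])
    out = []
    for l in links:
        if l["gl"] != entry["id"]:
            continue
        share = entry["amount"] * l["weight"] / total
        out.append({"manager": doc_lines[l["doc_line"]]["manager"],
                    "amount": share})
    return out
```

A single entry is cheap to distribute; the cost the article describes comes from doing this for every entry, inside a report query, at read time.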

    As a result, because every operational indicator must be traceable and documented, the value of consolidated P&L reports and the balance sheet is falling, while the need for instant, maximally detailed reports is growing. In retail, management expects margin figures by the evening of the same day, and in manufacturing everyone wants to monitor cost instantly, without waiting for month-end (I have seen projects for ten-day and weekly period closings, and that is not the limit).

    Summary: at the output, our system should produce every possible analytic in maximum detail (that is, down to the line of the primary document), with all significant and insignificant attributes. In effect, it should fully duplicate the input, supplemented with the results of calculations.

    Now, to understand the suffering of ERP programmers and their users, let's dive briefly into the current database structure of such systems.

    Inherited Architecture

    Back in the days when computers were underpowered, RDBMSs served as storage, and big-data technologies were in their infancy, an architecture was proposed: create a single accounting register that stores the fingerprints of all documents in the most unified and anonymized form possible, yet one suitable for quickly building OLAP reports. The General Ledger became that register (its analogue in 1C, the “operations journal”, was eventually cut out because of performance problems). During posting, each document ultimately generates transactions in the general ledger, from which, in theory, all management reports should be built. In practice the process has 2 steps, since each module has its own small ledgers: the document first generates module transactions, and only then are ledger transactions formed on their basis.

    A typical system is divided into modules, of which there are about 10 (Inventory, Purchasing, Sales, Advance Reports, Cash & Bank, Production, Maintenance, Projects, Planning, Fixed Assets). Let us very roughly describe the main stages of the document life cycle and estimate the number of DBMS tables:

    1. Initial entry of documents. As a rule, the data structure comprises 1..3 tables: the document header, document lines, and sublines (for example, a list of warehouse lots or storage cells). At this stage the basic directories are used (counterparties, items, units of measure, purposes, projects, etc.). In each module, documents and lines of each type are stored in separate tables: about 20 tables in total.
    2. Conversion of units and currencies. The calculated values are typically stored in separate fields of the tables from step 1.
    3. Manual and semi-automatic “binding” of a document to documents of other types (for example, purchased transport services are tied to movement documents, customs clearance costs to a procurement document, etc.). This stage may be absent if all bindings are made automatically during period closing, but in real life it happens otherwise. Count one table per module pair, about 5 in total.
    4. Account assignment, that is, using the directories from step 1 to determine the accounting registers in which the document should be posted. Usually this is a balance account with analytical dimensions and a corresponding account with dimensions, separately for the document header and for each line. The registers form the most detailed module entries. That makes 10 tables (one per module).
    5. Module register balances are updated (stock balances, debts, budget execution, etc.). Another 10 tables.
    6. Module postings are rolled up per document and recorded in the general ledger: 1..2 tables.
    7. The GL balance is updated: 1 table.
    8. Taxes are calculated per line and/or per header, depending on the tax system: 2 tables. In Russia you also need to generate electronic invoices and lines of the purchase/sales books: 4 more tables, 6 in total.
    9. The generated transactions are matched against the transactions of other modules (based on the bindings from step 3, or in batches). For example, consumed storage lots are matched many-to-many with receipts, or the postings forming receivables/payables are matched with advances issued/received. At this point we have already detached from the original document and are matching postings against postings, that is, one artificial entity against another. Count one table per module pair, about 5.
    10. Revaluation of write-off costs based on the matching graph from step 9. Revaluations are recorded as a separate category of transactions; count 2 more tables.
    11. The audit trail is formed (who, when, what, how much). One more table.

    In total, even without the directories, our operations are spread across some 60 tables, none of which contains complete information about a business fact.
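The scattering described above can be modeled with a toy posting pipeline in Python. The table names roughly follow the stages listed, but everything else (field names, the VAT rate, the amounts) is invented for illustration.

```python
# Eight "tables" standing in for the dozens a real system would touch.
tables = {name: [] for name in
          ("doc_headers", "doc_lines", "module_postings",
           "module_balances", "ledger_postings", "ledger_balance",
           "taxes", "audit_trail")}

def post_document(doc_id, line_amounts, tables):
    """Post one document: every stage writes to its own table, so no
    single table ends up holding the whole business fact."""
    tables["doc_headers"].append({"doc": doc_id})
    for i, amt in enumerate(line_amounts):
        tables["doc_lines"].append({"doc": doc_id, "line": i, "amount": amt})
        tables["module_postings"].append({"doc": doc_id, "line": i, "amount": amt})
        tables["taxes"].append({"doc": doc_id, "line": i, "vat": round(amt * 0.2, 2)})
    total = sum(line_amounts)
    tables["module_balances"].append({"doc": doc_id, "delta": total})
    # Module postings are rolled up per document into the ledger,
    # losing line-level detail on the way.
    tables["ledger_postings"].append({"doc": doc_id, "amount": total})
    tables["ledger_balance"].append({"delta": total})
    tables["audit_trail"].append({"doc": doc_id, "who": "user", "what": "post"})

post_document("D-1", [100.0, 50.0], tables)
touched = sum(1 for rows in tables.values() if rows)
```

Even this simplified pipeline touches every table for a single document, and the ledger row already knows nothing about the individual lines.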

    Such a data scheme has several justifications:

    1. In those days, the RDBMSs in use had transactional performance as their bottleneck, so the data was normalized as much as possible.
    2. Adding/extending/renaming a column in an RDBMS table (and, accordingly, in all its indexes) is a heavy operation, usually a blocking one requiring a system shutdown. Hence module postings are kept separate from the document, ledger postings separate from module postings, taxes separate from postings, and so on. This makes it easier to ship changes to one module (for example, taxes) without seriously affecting the table structure of, say, the general ledger.
    3. Access control is easier to set up, and fraud is harder (to make consistent changes across 15 tables, you have to be a professional).

    What this scheme leads to:

    1. The user is usually interested in complete information about the fact (who sold what to whom, by which transport and at what price it was delivered, whether the sale has been paid, current balances and balances at the time of the operation, the manager's name, etc.), so reports have to gather data from all these tables with SQL queries (materialized views, mirrored databases, OLAP, and so on).
    2. Programmers have at their disposal a rather powerful, dynamically optimized SQL language, which encourages nested queries (each JOIN is essentially a nested loop, at best over an index) instead of more predictable single-pass algorithms. As a result, system performance degrades with data growth not linearly but progressively, which makes capacity planning difficult (one new entity is added, and throughput drops tenfold).
    3. If we treat the ERP system as a black box, then at the input we have a document with a complete set of data, and at the output the user needs the same document with a set of additional attributes (accounts and analytical dimensions, the state of mutual settlements, data from related documents). The user does not care about artificial entities such as postings, matchings, and registers: any reconciliation (with the primary documents or with a counterparty) is done by documents, any audit or big-data task requires drilling down to the document line, and so on. But between the input document and the output lies a normalized schema of artificial entities, created by programmers for the convenience of programmers and system administrators. As a result, we first break the document down into molecules (during posting) and then reassemble it from molecules (in reports).

    Target architecture

    Nowadays, when the performance of servers and workstations has grown by orders of magnitude while the volume of accounting data has not changed much, is it not time to revise the architecture? Why all these conversions back and forth if the input and the output are so similar? Clearly, the registers of balances, matchings, and costs are not going anywhere, since they are calculation results, but do we really need intermediate, artificial entities in the form of module transactions and general ledger transactions? Since the document is a cross-cutting entity, in demand both at the input and at the outputs, the new ERP proposes to make the document the central data structure. Attributes that appear at different stages of the life cycle (analytical dimensions, matchings, recalculations) are written into the document itself. This allows reporting to be built in a uniform and predictable way.

    To unify the reporting algorithms, documents of all types are stored in a single database, and since the attribute set can differ between types and is, moreover, dynamic, this database should be NoSQL. Calculation and reporting algorithms scan the database, looking in each document for a required attribute or combination of attributes (NoSQL stores support indexes on attribute combinations, so a full scan is not always required). Finding the attribute(s) is the criterion for including the document in a particular calculation or report.
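The attribute-driven selection can be sketched with an in-memory Python collection; no real document store is involved, and the document types and attributes below are purely illustrative.

```python
# One collection for documents of all types, each with its own
# (possibly dynamic) attribute set.
collection = [
    {"type": "sale", "amount": 500, "vat": 100, "region": "North"},
    {"type": "payment", "amount": 500, "counterparty": "ACME"},
    {"type": "sale", "amount": 200, "vat": 40},  # no region attribute
]

def select(collection, required_attrs):
    """Return documents carrying every required attribute. In a real
    document store this predicate would be backed by an index on the
    attribute combination rather than a linear scan."""
    return [d for d in collection
            if all(a in d for a in required_attrs)]

# A VAT report includes any document that has the attributes it needs,
# regardless of the document's type.
vat_report = select(collection, ("amount", "vat"))
regional = select(collection, ("amount", "vat", "region"))
```

The design choice is that the report defines itself by the attributes it requires, so adding a new document type needs no change to the report code.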

    Documents can be linked to one another, which means joining the store with itself. This, in turn, requires, first, ordered storage of all directories and documents (a causality graph) and, second, a revision of report-generation approaches in favor of single-pass algorithms with short-term memory, instead of nested loops, however well optimized.
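A minimal example of such a single-pass algorithm with short-term memory: given document events in causal (time) order, running balances per item are kept in a small dict during one ordered scan, instead of being recomputed with nested joins. The data is invented.

```python
events = [
    {"t": 1, "item": "bolt", "delta": +100},
    {"t": 2, "item": "bolt", "delta": -30},
    {"t": 3, "item": "nut",  "delta": +50},
    {"t": 4, "item": "bolt", "delta": -20},
]

def running_balances(events):
    """One ordered pass over the events; the only state carried along
    is the current balance per item (the 'short-term memory')."""
    memory = {}
    out = []
    for e in sorted(events, key=lambda e: e["t"]):
        memory[e["item"]] = memory.get(e["item"], 0) + e["delta"]
        out.append({**e, "balance": memory[e["item"]]})
    return out, memory
```

The cost is one sort plus one scan, which stays predictable as the data grows, unlike the nested-loop joins criticized above.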

    A number of additional entities remain (balances, matchings, revaluations), but, first, these entities reflect real-world objects, and second, the basic principle must hold: complete data on a business transaction should be obtainable from the main document and its related documents, without synthetic entities such as postings, the general ledger, and so on.

    An additional bonus: the proposed system makes it possible to create additional reference attributes for specific document types “on the fly”, without stopping the database and with minimal changes to the program code.

    The drawback is the need to implement the storage schema programmatically (valid attribute sets, valid relationships, data validation), without relying on built-in DBMS facilities. To be fair, most commercial systems do not use foreign keys and triggers anyway, and all data integrity is ensured by algorithms on the application-server side.
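Application-side schema enforcement might look like the following sketch: allowed attribute sets per document type are declared in code rather than as rigid columns. The schema contents are a made-up example.

```python
# Declarative, code-level schema: per document type, which attributes
# are required and which are merely allowed.
SCHEMA = {
    "sale": {"required": {"counterparty", "amount"},
             "optional": {"manager", "region"}},
}

def validate(doc):
    """Return a list of error strings (empty list means the document
    conforms to its type's declared attribute set)."""
    rules = SCHEMA.get(doc["type"])
    if rules is None:
        return ["unknown document type: %s" % doc["type"]]
    errors = []
    attrs = set(doc) - {"type"}
    for missing in rules["required"] - attrs:
        errors.append("missing required attribute: %s" % missing)
    for extra in attrs - rules["required"] - rules["optional"]:
        errors.append("attribute not allowed for this type: %s" % extra)
    return errors
```

Extending a document type then means editing the `SCHEMA` dict, not migrating tables, which is exactly the “on the fly” flexibility described above, paid for with the duty to run `validate` everywhere the DBMS would otherwise have enforced the constraints.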

    The single-pass, short-term-memory algorithms that such a storage scheme requires for calculations and reports deserve an article of their own. Even without them, the amount of work for the first MVP (storage schema + input interface) considerably exceeds our current capacity, so the project will move slowly, and I will report on its status in future articles.

    To be continued.
