Why the E in EHD (the unified data warehouse) is about business processes
A data warehouse without the E
Today, in any medium-sized or large company, having a data warehouse is the de facto corporate standard. Whatever the industry, a company cannot maintain a competitive advantage without analyzing the data it holds about customers, suppliers, and finances. As automation and optimization reach every level of producing a product or service, an organization uses more and more IT systems that generate data: production, accounting, planning, personnel management, and others.
This article looks at how to organize the creation of a data warehouse so that it is most effective in terms of enterprise-wide resource optimization and of new and current business needs, and why maintaining metadata is important.
Accumulated data is most often used for the following classes of tasks:
- regulatory reporting
- financial accounting
- planning and control
- customer base analysis
- risk management
Often, for the most urgent purposes, a single source is enough: for example, when the regulator needs a detailed extract from one particular system, or when a customer is sent their entire order history from the CRM. Even when information systems are replaced, obtaining such reports usually poses no difficulty.
Methods and types of data warehouses
However, when an organization grows large enough, or when it needs a competitive advantage, simply creating a product and bringing it to market is no longer enough. The current trend is toward a comprehensive study of the consumer in order to increase loyalty. The business has to be analyzed from different angles, and costs have to be assessed more accurately. Typical must-have tasks include the following:
- How to allocate expenses across the revenue-generating business units
- How to forecast demand depending on internal or external factors
- How to manage risk in financial and insurance organizations
- How to increase the average order value per customer (targeting)
Each of the examples above requires more than one data source. It is also important that the methods for matching data between sources are consistent. Otherwise, a situation inevitably arises in which, say, the director of strategy and the director of sales bring the same information to the CEO, but with different numbers. They then spend a month working out who was "more right", tying up almost half of their staff in the process.
The most primitive way of organizing data storage is the so-called data lake, where data from different sources is simply piled together. This gives a single technical platform for working with data and isolates heavy analytical queries from the primary workload of the source information systems. Such a store may well be non-relational. However, complex analysis is then out of reach, and only simple queries are practical. In addition, the people working with the data must know not only the business domain but also the data models of the source systems.
Next in level of organization is the data warehouse built according to Kimball. Dimensions from different systems are unified, which produces something like a network of two types of tables: facts and dimensions. This is the first stage of enriching reference data: by using a common natural key present in the corresponding tables of different sources (for example, the TIN in a directory of organizations), we obtain a single reference table.
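The unification of a reference table by a natural key can be illustrated with a minimal sketch. The source names (`crm`, `erp`), the row fields (`id`, `tin`, `name`), and the rule that the first source seen supplies the name are all assumptions made for the example, not part of any particular methodology.

```python
from dataclasses import dataclass, field


@dataclass
class Organization:
    """One row of the unified organization dimension."""
    tin: str
    name: str
    source_ids: dict = field(default_factory=dict)  # source system -> local id


def unify_by_tin(sources):
    """Merge per-source organization directories into one dimension keyed by TIN.

    `sources` maps a source-system name to a list of rows; each row is a dict
    with the hypothetical keys "id", "tin" and "name".
    """
    unified = {}
    for system, rows in sources.items():
        for row in rows:
            tin = row["tin"].strip()  # normalize the natural key
            org = unified.get(tin)
            if org is None:
                # Assumption: the first source seen supplies the display name.
                org = Organization(tin=tin, name=row["name"])
                unified[tin] = org
            org.source_ids[system] = row["id"]
    return unified


# Hypothetical extracts from two source systems.
crm = [{"id": 17, "tin": "7701234567", "name": "Acme LLC"}]
erp = [
    {"id": "C-003", "tin": "7701234567 ", "name": "ACME"},
    {"id": "C-004", "tin": "7709876543", "name": "Globex"},
]

dim_org = unify_by_tin({"crm": crm, "erp": erp})
for org in dim_org.values():
    print(org.tin, org.name, org.source_ids)
```

Each unified row keeps the local identifiers of every source, so fact tables loaded from either system can be linked to the same dimension row.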
Next in complexity and reliability is a data warehouse with a single data model that reflects the most important objects describing the organization's activities. Its reliability comes from the fact that data held in a form close to third normal form, with a properly built model, is a universal means of describing the life of the whole business. The data model can therefore be adapted easily not only for analytical and regulatory reporting, but also for the operation of some enterprise systems.
E stands for "unified"
Returning to the thesis of this article, I will list the main problems faced by those responsible for building data warehouses:
"Horse in a vacuum". The warehouse is built, but nobody uses it.
"Black box". The warehouse is built, but what is in it and how it works is a mystery. This leads to constant errors, and if part of the development team has also left, we slide back into the "horse in a vacuum" case.
"Calculator". The warehouse is built, but it serves only primitive requests. The business changes much faster than requirements are implemented, and new business requests are not reflected in it. In addition, some data may be outdated or rarely updated.
"Crystal vase". The warehouse requires a lot of manual control, checks, and non-automated operations. If one of the support staff is away, there is a high risk of getting invalid data or no data at all.
Let us examine all four cases in more detail.
"Horse in a vacuum". If you ended up with this result, it happened for one of two reasons:
- Less likely. You did not collect requirements from the business units (or, which amounts to the same thing, they were poorly elaborated). This seemingly absurd situation arises when the idea of creating a warehouse comes not from the business but from an IT department that simply has "spare" budget, and the warehouse was conceived because everyone else has one. The assumption is that customers will be found later (or, better still, that they "will come running with outstretched hands") once everything has been loaded into it. The people responsible for allocating the budget consider it something necessary: they have read about it in books and heard about it, it looks like modernization, and so they nod in agreement.
- More likely. The customers of the data warehouse have been identified, say the sales department, and then a bright idea comes along: "let's make one more small effort, a delta, bring in finance, HR, and a few others, and the entire enterprise will use the warehouse". The warehouse is built, but only the sales department uses it, even though everything in it is beautiful, a land of milk and honey where you can take whatever you want. But no: the other departments have no time for milk and honey, because they are busy digging out their own bits of data from morning till night. After all, that data is a piece earned with sweat and blood (read: with working time spent).
In both cases, what is missing is a top manager who takes responsibility and passes it down the hierarchy. It is a matter of corporate culture. If the CEO has two deputies, only the CEO personally can make the warehouse be used at the enterprise level; otherwise the warehouse is built for just a part of the enterprise, the part supervised by the most senior manager who recognizes the need to implement the EHD.
To avoid such situations, the following is necessary:
- Formally designate the sponsor of the data warehouse project: the person who will be responsible for the result, both financially and morally
- Approve the project scope, possibly with phasing, and set approximate dates
- Agree the project with all departments, preferably by mapping the business processes as-is and to-be
Only after that can you begin implementing the project: collecting requirements, designing the architecture, and so on.
"Black box". So, you claim that you have built a warehouse and that all requirements are taken into account, yet no one understands how to use it. Moreover, if one of the key developers has left, it is unrealistic to work out what was done and how.
In this case, the documentation process for development was evidently never set up. The principle of "document first, then develop" should be, if not elevated to an absolute, then at least kept under fairly tight control, and not only by the team responsible for developing the data warehouse. Ideally, the developers of downstream reporting (analytical, regulatory), the owners of the company's internal information systems, and, of course, the consumers themselves should also be involved in a continuous process of keeping the documentation current.
In addition, the documentation process should comply with the following principles:
- Currency - the documentation fully describes the current state of the program code
- Versioning - the ability to analyze the documentation of past releases and plan modifications for future releases
- Shareability - several people can work on a document simultaneously
- Applicability - for each type of warehouse documentation, choose the structure that its target users absorb best: for example, table structures are best described in tabular form, business processes in a process notation, interactions between information systems as a diagram, the business glossary as a wiki, and so on
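The first principle in the list above, documentation that matches the current state of the code, can be checked automatically. The following is a minimal sketch, assuming the documented table structure is available as a plain dictionary and using an in-memory SQLite database as a stand-in for the warehouse; the table and column names are hypothetical.

```python
import sqlite3


def schema_drift(conn, documented):
    """Compare documented table columns with the actual schema.

    `documented` maps a table name to the set of column names taken from the
    documentation. Returns a dict: table -> (missing_in_db, undocumented).
    Table names are assumed to come from trusted documentation, not user input.
    """
    drift = {}
    for table, doc_cols in documented.items():
        # PRAGMA table_info returns one row per column; index 1 is the name.
        rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
        actual = {row[1] for row in rows}
        missing = doc_cols - actual
        undocumented = actual - doc_cols
        if missing or undocumented:
            drift[table] = (missing, undocumented)
    return drift


# Stand-in warehouse with one hypothetical dimension table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer (customer_id INTEGER, tin TEXT, region TEXT)"
)

# What the (hypothetical) documentation claims the table looks like.
documented = {"dim_customer": {"customer_id", "tin", "name"}}

report = schema_drift(conn, documented)
print(report)
```

Run as part of a release checklist, a check like this flags both columns that are documented but absent and columns that exist but were never documented.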
There are now software products that make life considerably easier by linking design and development, but so far there is no complete solution for data warehousing. The available categories are:
- ER diagram tools
- BPMN tools
- ETL solutions
Without up-to-date documentation, the effort needed to implement new requirements grows; with competent documentation, it shrinks.
"Calculator". Assuming we did not end up with a "horse in a vacuum", this is the situation where the requirements appear to be met, but only formally. You wanted daily balances? Here you are. You want them broken down by counterparty region? There was no such requirement, so you need to export to Excel, then take an export of counterparties from system X with field Y selected, and then match the two by hand.
This situation points to an inexperienced team working without an architectural view of how the warehouse will develop and without even a primitive data model. Such warehouses usually end up temporary or are quickly forgotten. Ideally, a warehouse should gather momentum like a snowball rolling down a mountain. At first, while the ball is still small and the snow ahead is loose, you have to gather and push it yourself, with effort. At some point, word of your product will spread, and users will turn to the warehouse more and more often.
So, to keep the warehouse from turning into a calculator, you need to provide:
- Qualified personnel: architects, analysts, ETL and SQL developers
- A project charter that states the objectives of the warehouse not only for the next budget period but also for subsequent years
- Quantitative and qualitative criteria for the data warehouse. If you do not have enough staff, it is recommended to involve consultants
- A clear understanding of what the data warehouse will help optimize in the future: staff costs, software costs, the speed of report development, and so on
"Crystal vase". The warehouse is built and seems to cope with its tasks, but supporting it takes a great deal of effort: maintaining manual reference tables, constantly reloading some sources, load failures, duplicate data, and so on.
This situation may occur for the following reasons:
- As already mentioned above, a lack of qualified personnel
- No architectural concept: when different parts of the warehouse are built by different people or teams without a common, approved concept, the result is multiple ways of extracting, transforming, and loading data
- A very common situation: development and support are outsourced, while acceptance of the work is done poorly
- At some stage of development, "the budget ran out", and the warehouse is then extended (and supported) not by the team that created it but by those who need the data
To prevent these situations, the following actions are recommended:
- The measures described above: qualified personnel, a project charter, a long-term plan and budget, and a committed sponsor among top management
- The process is led not by the outsourcer but by an internal employee (the chief analyst or architect) who directs the outsourcer
- Every failure should be brought to a meeting for review by the warehouse architect or, if there are several architects, by the architecture committee
- It is advisable to introduce a data warehouse quality metric; the team's KPIs can be tied to this metric
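As an illustration of what such a quality metric might look like, here is a minimal sketch that scores a batch of rows by two basic checks, completeness and key uniqueness. The choice of checks, the field names, and the sample rows are assumptions made for the example; a real metric would be agreed with the data consumers.

```python
def quality_score(rows, required, key):
    """Return the share of rows (0.0 to 1.0) that pass two basic checks:
    every field in `required` is filled, and `key` has not been seen before."""
    if not rows:
        return 1.0  # assumption: an empty batch counts as clean
    seen = set()
    good = 0
    for row in rows:
        complete = all(row.get(f) not in (None, "") for f in required)
        unique = row.get(key) not in seen
        seen.add(row.get(key))
        if complete and unique:
            good += 1
    return good / len(rows)


# Hypothetical batch of loaded order rows.
rows = [
    {"order_id": 1, "customer_tin": "7701234567", "amount": 100.0},
    {"order_id": 2, "customer_tin": "", "amount": 50.0},            # incomplete
    {"order_id": 2, "customer_tin": "7709876543", "amount": 70.0},  # duplicate key
    {"order_id": 3, "customer_tin": "7709876543", "amount": 20.0},
]

score = quality_score(rows, required=("customer_tin", "amount"), key="order_id")
print(f"quality score: {score:.2f}")
```

A single number like this is easy to track per load and per source, which is what makes it usable as an input to a KPI.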
As can be seen, in all of the cases above, although creating a data warehouse is a project activity, the creation processes themselves must be regulated in order to produce a high-quality result.
The transition from a data warehouse to a unified one
As mentioned above, the success of a data warehouse project depends on quite a few inputs (budget, sponsor, team, goals, customers). However, we have barely touched on the business processes aimed at developing and maintaining the warehouse itself. Below I will try to formulate the main business processes that are meant to make work with data across the enterprise truly unified:
- Keeping technical and user documentation up to date
- Keeping the business dictionary (glossary) of data up to date
- Data quality control
- Collecting and managing requirements for the data warehouse and the reporting system
- Managing the data storage and processing infrastructure
- Optimizing data storage and data retrieval
In the modern paradigm, this set of business processes forms the basis of the concept of Data Governance.
Very often, when these processes are introduced, the team that builds the warehouse and its reports will actively resist or ignore them. This is understandable, because from a local point of view they make development take longer.
Therefore, it will be useful to take the following actions:
- Introduce a horizontal structure of responsibility (each participant may be responsible for a small area)
- Give all employees a graphical representation of every possible workflow (process formalization)
- Build into the KPI system the share of responsibilities fulfilled and the quality of their fulfillment
Although from a local point of view the transition looks heavily "bureaucratic" and cumbersome, from a global point of view it brings significant advantages and time savings, because the main loss of time comes from reinventing existing solutions from scratch when it is impossible, or there is no desire, to understand the existing mechanism.
A little bit about the target architectural solution
Although EHD architecture deserves a separate long article, or even a book, I will outline the main technical requirements for a mature data warehouse:
- The data lake paradigm does not replace the corporate data warehouse but coexists with it
- The EHD should offer different data delivery interfaces: BI tools, the ability to run ad-hoc SQL queries, standard data delivery in JSON, XML, and so on
- A role-based data access model should be implemented
- Response time when accessing data: under 1 second for 90% of typical requests and under 10 seconds for 99% of requests. There should also be a fairly good reserve of resources
- A single, coherent central layer of the data warehouse (preferably following the Inmon methodology)
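The response-time targets in the list above (90% of requests under 1 second, 99% under 10 seconds) are easy to verify against measured query latencies. This is a minimal sketch; the function names and the sample latencies are made up for illustration, and in practice the samples would come from the query log of the warehouse.

```python
def share_within(latencies, limit_seconds):
    """Fraction of requests that finished strictly faster than the limit."""
    if not latencies:
        raise ValueError("no latency samples")
    return sum(1 for t in latencies if t < limit_seconds) / len(latencies)


def meets_targets(latencies):
    """Check both targets: at least 90% under 1 s and at least 99% under 10 s."""
    return (
        share_within(latencies, 1.0) >= 0.90
        and share_within(latencies, 10.0) >= 0.99
    )


# Hypothetical samples, in seconds: 92 fast queries, 7 medium, 1 very slow.
sample = [0.2] * 92 + [3.0] * 7 + [25.0]
ok = meets_targets(sample)

# A second hypothetical sample where too many queries exceed 1 second.
slow_sample = [0.2] * 85 + [3.0] * 15
slow_ok = meets_targets(slow_sample)

print(ok, slow_ok)
```

Checking the shares directly, rather than computing interpolated percentiles, keeps the result unambiguous for small samples.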
In the end, what makes a data warehouse unified is not just the presence of sources but the presence of data consumers. And achieving that is much harder than writing a universal ETL and provisioning petabytes of storage.