Database denormalization. What for? When? How?
Is denormalization evil, or do you just need to know how to cook it?
Denormalization is not the result of sloppy work, and it is not normalization left unfinished. It is a deliberate violation of normal forms done to increase performance.
The question of denormalization has come up more than once. Every time I had to make a deal with my conscience and violate the principles of normal forms, I was left with a feeling of dissatisfaction, a nagging sense of my own incompetence. Over time, working in a team, it turned out that this was not just my problem. So it is time to figure it out: is denormalization evil, or do you just need to know how to cook it?
What I wanted to understand
- When is denormalization needed? Signs and smells.
- How to determine when denormalization is justified?
- How to implement denormalization correctly?
When is denormalization needed? Signs and smells.
Consider some common situations in which denormalization may be useful.
A large number of table joins.
Queries against a fully normalized database often have to join a dozen tables or more, and each join is a resource-intensive operation. As a result, such queries eat up server resources and run slowly.
In this situation, the following can help:
- denormalization by reducing the number of tables. It is better to combine several small tables into one when they contain rarely changing (often called conditionally constant, or reference) information, and information that is closely related in meaning. In general, if a large number of queries require joining more than five or six tables, you should consider denormalizing the database.
- denormalization by introducing an additional field into one of the tables (see the sketch below). In this case data redundancy appears, and additional steps are required to maintain the integrity of the database.
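A minimal sketch of the second approach, with hypothetical table and column names: the customer's name is duplicated into the orders table so that a frequent report no longer needs a join.

```sql
-- Hypothetical schema: the customer's name is duplicated into orders,
-- so the frequent "orders with customer names" report needs no join.
CREATE TABLE customers (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100) NOT NULL
);

CREATE TABLE orders (
    order_id      INT PRIMARY KEY,
    customer_id   INT NOT NULL,
    customer_name VARCHAR(100) NOT NULL,  -- redundant copy (denormalization)
    order_date    DATE NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
);

-- Normalized version: every report pays for a join
SELECT o.order_id, o.order_date, c.customer_name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;

-- Denormalized version: the same report with no join at all
SELECT order_id, order_date, customer_name
FROM orders;
```

The price is the extra work of keeping orders.customer_name in sync whenever customers.customer_name changes.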
Calculated values.
Queries that perform complex calculations are often slow and resource-hungry, especially when they use grouping and aggregate functions (Sum, Max, etc.). Sometimes it makes sense to add one or two extra columns to a table that hold frequently used (and expensive to compute) calculated data.

Suppose you want to determine the total value of each order. To do this, you must first determine the cost of each line item (using the formula "number of product units" * "unit price" - discount) and then group those costs by order.

Executing this query is fairly expensive and, if the database contains information about a large number of orders, it can take a lot of time. Instead of running such a query every time, you can calculate the order total at the moment the order is placed and store it in a separate column of the orders table. In that case, to get the desired result it is enough to read the precomputed values from this column.

Creating a column with precomputed values can save a lot of query time, but it requires keeping the data in that column up to date.
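A minimal sketch of this idea, assuming hypothetical orders and order_items tables: the total is computed once when the order changes and stored in a dedicated column, instead of being aggregated on every read.

```sql
-- Expensive query: the order total is aggregated on every read
SELECT o.order_id,
       SUM(oi.quantity * oi.unit_price - oi.discount) AS order_total
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY o.order_id;

-- Denormalized alternative: a precomputed column in the orders table
ALTER TABLE orders ADD COLUMN total_amount DECIMAL(12, 2);

-- Filled in when the order is placed (and whenever its items change)
UPDATE orders o
SET o.total_amount = (SELECT SUM(oi.quantity * oi.unit_price - oi.discount)
                      FROM order_items oi
                      WHERE oi.order_id = o.order_id);

-- Reading the result is now trivial
SELECT order_id, total_amount FROM orders;
```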
Long fields.
If the database has large tables containing long fields (Blob, Long, etc.), we can seriously speed up queries against such a table by moving the long fields into a separate table. Say we want to create a catalog of photographs in the database, storing the photographs themselves (professional quality, high resolution, and sized accordingly) in Blob fields. From the point of view of normalization, the following table structure would be absolutely correct:
- photo ID
- author ID
- camera model ID
- the photo itself (Blob field)
Now imagine how long a query will run that counts the number of photos taken by a particular author...

The right decision in such a situation (even though it violates the principles of normalization) is to create another table consisting of only two fields: the photo ID and the Blob field with the photo itself. Then selects from the main table (which no longer carries a huge Blob field) will be instantaneous, and when we want to see the photo itself, well, we wait...
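A minimal sketch of such a split, with hypothetical table names: the metadata stays in a light table, and the heavy Blob moves to a companion table.

```sql
-- Light metadata table: no huge Blob column to drag through every scan
CREATE TABLE photos (
    photo_id  INT PRIMARY KEY,
    author_id INT NOT NULL,
    camera_id INT NOT NULL
);

-- Companion table: just the key and the heavy image data
CREATE TABLE photo_blobs (
    photo_id INT PRIMARY KEY,
    photo    LONGBLOB NOT NULL,
    FOREIGN KEY (photo_id) REFERENCES photos (photo_id)
);

-- Counting photos per author touches only the light table and runs fast
SELECT author_id, COUNT(*) AS photo_count
FROM photos
GROUP BY author_id;

-- Viewing a specific photo costs one extra lookup
SELECT photo FROM photo_blobs WHERE photo_id = 42;
```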
How to determine when denormalization is justified?
Costs and benefits.
One way to determine whether a particular step is justified is to analyze it in terms of costs and possible benefits. How much will a denormalized data model cost?
Define requirements (what we want to achieve) -> determine data requirements (what needs to be observed) -> find the minimum step that satisfies these requirements -> calculate implementation costs -> implement.
Costs include physical aspects such as disk space, the resources needed to manage the denormalized structures, and the opportunities lost to the time delays involved in maintaining them. Denormalization has to be paid for. A denormalized database increases data redundancy, which can improve performance but demands more effort to keep the related data consistent. Application development becomes more complicated, because the data is duplicated and harder to track. And enforcing referential integrity is no longer simple, since the related data lives in different tables.
The benefits include faster query execution and quicker responses. You may also gain other advantages, including higher throughput, improved customer satisfaction and productivity, and more effective use of tools by external developers.
Query frequency and performance stability.
For example, 70% of the 1000 queries the enterprise generates weekly are queries at the level of summary rather than detailed data. When a summary table is used, these queries complete in about 6 seconds instead of 4 minutes, which cuts processing time by 2730 minutes per week. Even after subtracting the 105 minutes a week needed to maintain the summary tables, the saving is 2625 minutes per week, which fully justifies creating the summary table. Over time it may turn out that most queries address the detailed data rather than the summaries. The fewer queries that use the summary table, the easier it is to drop it without affecting other processes.
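Spelled out, the arithmetic behind these figures (700 summary-level queries per week, each going from 4 minutes down to 6 seconds) is simply:

\[
700 \times (4\ \text{min} - 6\ \text{s}) = 700 \times 3.9\ \text{min} = 2730\ \text{min per week}
\]
\[
2730\ \text{min} - 105\ \text{min of maintenance} = 2625\ \text{min saved per week}
\]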
Other
The criteria above are not the only ones to consider when deciding whether to take the next optimization step. Other factors must be weighed, including business priorities and the needs of end users. Users have to understand how the requirement that every query complete within a few seconds affects the system architecture from a technical point of view. The easiest way to reach that understanding is to lay out the costs associated with creating and maintaining such tables.
How to correctly implement denormalization.
Keep the detailed tables
So as not to limit database capabilities that matter to the business, follow a strategy of coexistence rather than replacement: keep the detailed tables for in-depth analysis and add denormalized structures alongside them. Take a page hit counter, for example. The business needs to know the number of visits to a web page, but for analysis (by period, by country ...) we will very likely need the detailed data as well: a table with a record of each visit.
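A minimal sketch of this coexistence strategy, with hypothetical table names: the detailed visits table is kept for analysis, and a small denormalized counter table is maintained next to it.

```sql
-- Detailed table: one row per visit, kept for in-depth analysis
CREATE TABLE page_visits (
    visit_id   BIGINT AUTO_INCREMENT PRIMARY KEY,
    page_id    INT NOT NULL,
    country    CHAR(2),
    visited_at DATETIME NOT NULL
);

-- Denormalized summary table: just the counter the business asks about
CREATE TABLE page_visit_counters (
    page_id     INT PRIMARY KEY,
    visit_count BIGINT NOT NULL DEFAULT 0
);

-- Each hit is recorded in both structures
INSERT INTO page_visits (page_id, country, visited_at)
VALUES (7, 'DE', NOW());

INSERT INTO page_visit_counters (page_id, visit_count)
VALUES (7, 1)
ON DUPLICATE KEY UPDATE visit_count = visit_count + 1;
```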
Using triggers
You can denormalize the database structure and still enjoy the advantages of normalization if you use database triggers to preserve the integrity of the information and keep the duplicated data identical.
For example, when adding a calculated field, a trigger is attached to each of the columns that the calculated field depends on, and each trigger calls a single stored procedure (this is important!) which writes the required value into the calculated field. The only thing is not to miss any of the columns the calculated field depends on.
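A hedged sketch of this pattern in MySQL syntax (the table, column, procedure, and trigger names are hypothetical, and the restrictions on what a trigger may touch differ between DBMSs and versions): one stored procedure rebuilds the stored total, and every trigger simply calls it.

```sql
DELIMITER //

-- The single stored procedure that knows how to rebuild the stored total
CREATE PROCEDURE recalc_order_total(IN p_order_id INT)
BEGIN
    UPDATE orders
    SET total_amount = (SELECT SUM(quantity * unit_price - discount)
                        FROM order_items
                        WHERE order_id = p_order_id)
    WHERE order_id = p_order_id;
END//

-- One trigger per event that can change the columns the total depends on;
-- each trigger only calls the shared procedure.
CREATE TRIGGER order_items_ai AFTER INSERT ON order_items
FOR EACH ROW CALL recalc_order_total(NEW.order_id)//

CREATE TRIGGER order_items_au AFTER UPDATE ON order_items
FOR EACH ROW CALL recalc_order_total(NEW.order_id)//

CREATE TRIGGER order_items_ad AFTER DELETE ON order_items
FOR EACH ROW CALL recalc_order_total(OLD.order_id)//

DELIMITER ;
```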
Software support
For example, MySQL 4.1 has no triggers or stored procedures at all, so application developers have to take care of data consistency in the denormalized database themselves. By analogy with triggers, there should be a single function that updates all the fields that depend on the field being changed.
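Without triggers, the guarantee has to come from the application: every code path that changes the source columns goes through one routine that also refreshes the dependent field. A minimal sketch of what that single routine might issue, again with hypothetical names and assuming a transactional storage engine:

```sql
-- The application's single "update item" routine changes the source data
-- and refreshes the dependent column in one transaction,
-- so the stored total cannot drift out of sync.
START TRANSACTION;

UPDATE order_items
SET quantity = 3
WHERE order_id = 42 AND product_id = 7;

UPDATE orders
SET total_amount = (SELECT SUM(quantity * unit_price - discount)
                    FROM order_items
                    WHERE order_id = 42)
WHERE order_id = 42;

COMMIT;
```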
Summary
To sum up: when denormalizing, it is important to keep a balance between speeding up the database and increasing the risk of inconsistent data, between making life easier for the programmers who write the SELECTs and complicating the job of those who populate and update the database. So denormalize the database carefully and selectively, only where there is really no way around it.
If the pros and cons of denormalization cannot be weighed in advance, start with a normalized model, and denormalize later only to optimize the queries that turn out to be problematic.