Relational Database Design Guide (10–13 of 15) [translation]

Original author: Rubenski
  • Transfer
To be continued.
Previous Parts: 1-3 , 4-6 , 7-9

10. Normalization of databases


Guidance for the proper design of relational databases is provided in the relational data model. They are collected in 5 groups, which are called normal forms . The first normal form represents the lowest level of database normalization. The fifth level represents the highest level of normalization.

Normal forms are recommendations for designing databases. You are not required to adhere to all five normal forms when designing databases. However, it is recommended that you normalize the database to some extent because this process has a number of significant advantages in terms of the efficiency and ease of handling your database.

  • In a normalized database structure, you can make complex data samples with relatively simple SQL queries.
  • Data integrity . A normalized database allows you to reliably store data.
  • Normalization prevents redundancy of stored data . Data is always stored in only one place, which makes the process of inserting, updating and deleting data easy. There is an exception to this rule. The keys themselves are stored in several places because they are copied as foreign keys to other tables.
  • Scalability is the ability of a system to cope with future growth. For a database, this means that it must be able to work quickly when the number of users and the amount of data increase. Scalability is a very important characteristic of any database model for an RDBMS.


Here are some of the main points that are related to database normalization :

  • Organize data into logical groups or sets.
  • Finding relationships between datasets. You have already seen examples of one-to-many and many-to-many relationships.
  • Minimize data redundancy.


A very small number of databases follow all five normal forms provided in the relational data model. Databases are usually normalized to the second or third normal form. The fourth and fifth forms are rarely used. Therefore, I will limit myself to telling you only about the first three.

11. The first normal form (1NF)


The first normal form says that a database table is a representation of the essence of your system that you are creating. Examples of entities: orders, customers, ordering tickets, hotel, goods, etc. Each entry in the database represents one entity instance. For example, in the customer table, each record represents one customer.

Primary key

Rule: each table has a primary key consisting of the smallest possible number of fields.

As you know, a primary key can consist of several fields. For example, you can choose the first and last name as the primary key (and hope that this combination will be unique always). It would be a much better choice social room. Insurance as a primary key, as it is the only field that uniquely identifies a person.
Even better, when there is no obvious candidate for the primary key title, create a surrogate primary key in the form of a numerical auto-increment field.

Atomicity.

Rule: fields do not have duplicates in each record and each field contains only one value.

Take, for example, a car collectors site where every collector can register his cars. The table below stores information about registered cars.

image
Horizontal data duplication is bad practice.

With this design option, you can save only five cars and if you have less than 5, then you are wasting free space in the database for storing empty cells.
Another example of bad design practice is storing multiple values ​​in a cell.

image
Multiple values ​​in one cell.

The right decision in this case will be the allocation of cars in a separate table and the use of a foreign key that refers to this table.

The order of the entries should not matter.

Rule: the order of table entries should not matter.

You may be inclined to use the order of entries in the customer table to determine which customer has registered first. For these purposes, you better create date and time fields for registering customers. The order of the records will inevitably change when customers are deleted, changed or added. This is why you should never rely on the order of entries in a table.

In the next part, we consider the second normal form (2NF).

12. The second normal form.


In order for the database to be normalized according to the second normal form, it must be normalized according to the first normal form. The second normal form is associated with data redundancy.

Data redundancy.

Rule: fields with a non-primary key must not be dependent on the primary key.

It may sound a little abstruse. And this means that you should only store data in the table that is directly related to it and not related to another entity. Following the second normal form is the question of finding data that is often duplicated in table entries and which may belong to another entity.

image
Duplication of data among records in the store field.

The table above may belong to a company that sells cars and has several stores in the Netherlands.

If you look at this table, you will see multiple examples of data duplication among records. The brand field could be highlighted in a separate table. As well as the type field(model), which could also be separated into a separate table, which would have a many-to-one relationship with the brand table because the brand may have different models.

The store column contains the name of the store where the machine is currently located. Store is an obvious example of data redundancy and a good candidate for a separate entity, which should be connected to a car table by foreign key communication .
Below is an example of how you could model a database for cars, avoiding data redundancy.

image

In the example above, the car table has a foreign key - a link to the type and store tables. The brand column has disappeared because the brand is implicitly referenced through the type table . When there is a link to type, there is a link to brand, as type belongs to brand.

Data redundancy has been substantially eliminated from our database model. If you are picky enough, then you may not be satisfied with this decision. What about the country_of_origin field in the brand table ? There are no duplicates so far because there are only four brands from different countries. An attentive database designer should highlight country names in a separate country table .

And even now you should not be satisfied with the result because you could also highlight the color fieldin a separate table.

How strictly you approach the creation of your tables is up to you and depends on the specific situation. If you plan to store a huge number of car units in the system and you want to be able to search by color, it would be a wise decision to select the colors in a separate table so that they are not duplicated.

There is another case where you might want to highlight colors in a separate table. If you want to allow company employees to enter data on new cars you will want them to be able to choose the color of the car from a predefined list. In this case, you will want to store all possible colors in your database. Even if there are no cars with this color yet, you want these colors to be present in the database so that employees can select them. This is definitely the case when you need to highlight colors in a separate table.

13. The third normal form.


The third normal form is associated with transitive dependencies . Transitive dependencies between database fields exist when the values ​​of non-key fields depend on the values ​​of other non-key fields. For the database to be in the third normal form, it must be in the second normal form.

Transitive dependencies.

Rule: There can be no transitive dependencies between the fields in the table.
The customer table (my customers are players of the German and French football teams) below contains transitive dependencies.

image

In this table, not all fields depend solely on the primary key. There is a separate relationship between the postal_code field and the fields of the city and province. In the Netherlands, both meanings: city and province are determined by the postal code, postal code. Thus, there is no need to store the city and province in the client table. If you know the postal code, then you already know the city and the province.

Such a transitive dependency should be avoided if you want your database model to be in third normal form.

In this case, the elimination of the transitive dependence from the table can be achieved by removing the city and province fields from the table and storing them in a separate table containing the postal code (primary key), province name and city name. Getting a postal code-city-province combination for an entire country can be a very non-trivial task. That is why such tables are often sold.

Another example for applying the third normal form is the (too) simple example of an online store order table below.

image

VAT (value added tax) is the percentage that is added to the price of the product (19% in this table). This means that the value of total_ex_vat can be calculated from the values ​​of total_inc_vat and vice versa. You should store one of these values ​​in the table, but not both at once. You must assign the task of calculating total_inc_vat from total_ex_vat or vice versa to a program that uses the database.

The third normal form says that you should not store data in a table that can be obtained from other (non-key) fields in the table. Especially in the example with the customer table, following the third normal form requires either a large amount of work or the acquisition of a commercial version of the data for such a table.

The third normal form is not always used in database design. When designing a database, you should always compare the benefits of a higher normal form compared to the amount of work required to apply the third normal form and keep the data in that state. In the case of the client table, I personally would prefer not to normalize the table to the third normal form. In the last example with VAT, I would use the third normal form. Keeping data from existing data is usually a bad idea.

Also popular now: