1ex February 7, 2011 at 01:52

The second normal form (in SQL terminology)

Since the first post had already torn off the roof of several habrazhiteli in general and shook karma in particular, I decided to write a translation of the article in terms of SQL. It will be useful to me and, possibly, not only to me. In general, since childhood, I have been striving to land theory with practice using various means, among which was alcohol, and it seems to me useless to spend time studying something, which you cannot come up with an example from real life.

The only funny thing is that all this rubbish under the cut was born in Codd’s mind even before the emergence of SQL as a language, and now, in terms of SQL, give everything ...

What is the second normal form or 2NF? So that a three-year-old child really understands ...
To begin with, we will understand the goals that normalization pursues. There are no discrete terms under the cut ...

The goal of reducing to the first normal form (1NF) is to enable the use of WHERE conditions when fetching data with a SELECT query. Since all column values are of the same and predefined type, they can be compared with each other and with constants.

For example, if there is a column 'Kids' of type VARCHAR in the table 'Family', we can easily compare the two lines 'Vasya' and 'Anya' and determine their lexicographic order, for example, with the>

Family	Kids
Ivanovs	Vasya
Petrovs	Anya

If in some line in the field 'Children' is indicated 'Vanya, Sasha', we can no longer unambiguously determine the order of the children. It is pointless to compare the lines of 'Vasya' and 'Vanya, Sasha' in this situation. since the first is a string, and the second is already a list. Suppose we want to find all the children in the letter 'C'.

Family	Kids
Ivanovs	Vasya
Petrovs	Anya
Sidorovs	Vanya, Sasha

Type request

SELECT Kids FROM Family WHERE kids LIKE 'С%'

will not work as it should in this situation and will not find Sasha, since LIKE cannot parse lists, extract values and treat them as arguments for comparison with the template. 'Vanya, Sasha' in this case, a non-atomic value such as a list of strings. To teach SQL how to work with such data, you need to either expand the language or simplify the model to 1NF. Decomposition to 1NF is achieved by splitting the compound value into atomic:

Family	Kids
Ivanovs	Vasya
Petrovs	Anya
Sidorovs	Vania
Sidorovs	Sasha

That is, the first NF deals with the structure of column values.

The second (and the third, but not about it today) NF deals already with the keys and dependencies between the columns of the table. We list its goals with explanations.

The main goal of bringing to the second normal form is the desire to get rid of the redundancy of data storage and, as a result, to avoid anomalies in the modification of this data (anomalies of change, insertion and deletion)
The second, but not least, goal of normalization in 2NF is to split the data model into separate tables as much as possible so that they can be combined and used in queries in new ways that were not originally provided for.
Minimize efforts to modify tables if necessary. The fewer the dependencies between the columns of the table, the less changes in it will be required when changing the data model.
The clarity of the tables for the user. Than to keep all the data in one big table, it is easier to present the data as several connected and logically separated labels. It is easier to read, perceive, design and maintain. In the end, any data model begins on a blackboard or paper in the form of circles, blocks and lines that children and programmers love to draw.

For example, we have a table

ID	Cd_name	Artist
10	Six Degrees Of Inner Turbulence	Dream theater
20	Metropolis, pt. 2: Scenes From A Memory	Dream theater
thirty	Master of puppets	Dream theater

where the primary key is an ID . This scheme is in 2NF, because the Artist column , which is not included in the key, is determined only by the entire key.

The table is in 2NF if any non-key column is determined only by the whole key and cannot be determined by its part.

In general, the question of the 2NF mismatch can only be made if the table has composite keys. Tables with simple keys, as in the example, always have 2NF. The specified table is just an example of such a case, since both keys (and this is the ID and the natural key CD_name ) in it are simple, and they have no parts.

2NF mismatch will be considered in the table

Artist	Cd_name	Track	Lyrics
Dream theater	Six Degrees Of Inner Turbulence	Misunderstood	Petrucci
Dream theater	Metropolis, pt. 2: Scenes From A Memory	Overture 1928	(instrumental)
Dream theater	Master of puppets	Battery	Notfield
Metallica	Master of puppets	Battery	Notfield
Ensiferum	Tale of revenge	Battery	Notfield

One and the same song can be included in several discs, and albums of the same name with songs of the same name from different groups, such as tributes, are theoretically possible. Therefore, the key will be {Artist, CD_name, Track} . In this case, the value of the Lyrics column , denoting the author of the words, is uniquely determined from the {Artist, Track} columns , which are part of the key. This is a violation of 2NF.

The consequence of this is that the values in the Lyrics column are redundant for each disc that the song belongs to. In the field of music, these values do not change, but in other domain domains, a careless change in such redundant data can lead to an inconsistent state of the database when not all values are updated. This is an example of a modification anomaly.

Another consequence is that songs that have not yet been released on CDs, but simply broadcast on the radio or released on other media, do not fit the specified data scheme. Accordingly, we will not be able to add a new song to the database until it is released on CD. This is an example of an insertion anomaly.

Similarly, if we want to remove a disc from the database, we will be forced to lose information about the authors of all songs that are included only in this disc, since in this model it is not possible to present information about the author if the song is not included in any CD . For example, the desire to remove the Six Degrees Of Inner Turbulence disc will cause the songwriter Misunderstood to be lost, which is unforgivable. This is an example of a removal anomaly.

To avoid such anomalies and remove redundancy, we need to divide the table, that is, carry out its decomposition into two:

Artist	Cd_name	Track
Dream theater	Six Degrees Of Inner Turbulence	Misunderstood
Dream theater	Metropolis, pt. 2: Scenes From A Memory	Overture 1928
Dream theater	Master of puppets	Battery
Metallica	Master of puppets	Battery
Ensiferum	Tale of revenge	Battery

Artist	Track	Lyrics
Dream theater	Misunderstood	Petrucci
Dream theater	Overture 1928	(instrumental)
Metallica	Battery	Notfield

In a real database for constructing queries, you also need to introduce semantic relationships between tables, for example, to associate them using a foreign key, but for our example it is enough to understand that these tables are connected by meaning.

Both tables have 2NF, the first - because she has all the columns included in the key, and the second - because Lyrics is determined by the {Artist, Track} key and is not uniquely determined by any of the Artist or Track columns .

I’m probably not going to talk about the warehouse, I’m tired of typing tablets in HTML :)

That's all.
I hope that right now it was clear, I went to deal with 3NF!

Tags:

The second normal form (in SQL terminology)

Also popular now: