The second normal form (in SQL terminology)

    Since the first post had already torn off the roof of several habrazhiteli in general and shook karma in particular, I decided to write a translation of the article in terms of SQL. It will be useful to me and, possibly, not only to me. In general, since childhood, I have been striving to land theory with practice using various means, among which was alcohol, and it seems to me useless to spend time studying something, which you cannot come up with an example from real life.

    The only funny thing is that all this rubbish under the cut was born in Codd’s mind even before the emergence of SQL as a language, and now, in terms of SQL, give everything ...


    What is the second normal form or 2NF? So that a three-year-old child really understands ...
    To begin with, we will understand the goals that normalization pursues. There are no discrete terms under the cut ...

    The goal of reducing to the first normal form (1NF) is to enable the use of WHERE conditions when fetching data with a SELECT query. Since all column values ​​are of the same and predefined type, they can be compared with each other and with constants.

    For example, if there is a column 'Kids' of type VARCHAR in the table 'Family', we can easily compare the two lines 'Vasya' and 'Anya' and determine their lexicographic order, for example, with the>

    FamilyKids
    IvanovsVasya
    PetrovsAnya


    If in some line in the field 'Children' is indicated 'Vanya, Sasha', we can no longer unambiguously determine the order of the children. It is pointless to compare the lines of 'Vasya' and 'Vanya, Sasha' in this situation. since the first is a string, and the second is already a list. Suppose we want to find all the children in the letter 'C'.

    FamilyKids
    IvanovsVasya
    PetrovsAnya
    SidorovsVanya, Sasha


    Type request

    SELECT Kids FROM Family WHERE kids LIKE 'С%'
    

    will not work as it should in this situation and will not find Sasha, since LIKE cannot parse lists, extract values ​​and treat them as arguments for comparison with the template. 'Vanya, Sasha' in this case, a non-atomic value such as a list of strings. To teach SQL how to work with such data, you need to either expand the language or simplify the model to 1NF. Decomposition to 1NF is achieved by splitting the compound value into atomic:

    FamilyKids
    IvanovsVasya
    PetrovsAnya
    SidorovsVania
    SidorovsSasha


    That is, the first NF deals with the structure of column values.

    The second (and the third, but not about it today) NF deals already with the keys and dependencies between the columns of the table. We list its goals with explanations.

    1. The main goal of bringing to the second normal form is the desire to get rid of the redundancy of data storage and, as a result, to avoid anomalies in the modification of this data (anomalies of change, insertion and deletion)
    2. The second, but not least, goal of normalization in 2NF is to split the data model into separate tables as much as possible so that they can be combined and used in queries in new ways that were not originally provided for.
    3. Minimize efforts to modify tables if necessary. The fewer the dependencies between the columns of the table, the less changes in it will be required when changing the data model.
    4. The clarity of the tables for the user. Than to keep all the data in one big table, it is easier to present the data as several connected and logically separated labels. It is easier to read, perceive, design and maintain. In the end, any data model begins on a blackboard or paper in the form of circles, blocks and lines that children and programmers love to draw.


    For example, we have a table

    IDCd_nameArtist
    10Six Degrees Of Inner TurbulenceDream theater
    20Metropolis, pt. 2: Scenes From A MemoryDream theater
    thirtyMaster of puppetsDream theater


    where the primary key is an ID . This scheme is in 2NF, because the Artist column , which is not included in the key, is determined only by the entire key.

    The table is in 2NF if any non-key column is determined only by the whole key and cannot be determined by its part.

    In general, the question of the 2NF mismatch can only be made if the table has composite keys. Tables with simple keys, as in the example, always have 2NF. The specified table is just an example of such a case, since both keys (and this is the ID and the natural key CD_name ) in it are simple, and they have no parts.

    2NF mismatch will be considered in the table

    ArtistCd_nameTrackLyrics
    Dream theaterSix Degrees Of Inner TurbulenceMisunderstoodPetrucci
    Dream theaterMetropolis, pt. 2: Scenes From A MemoryOverture 1928(instrumental)
    Dream theaterMaster of puppetsBatteryNotfield
    MetallicaMaster of puppetsBatteryNotfield
    EnsiferumTale of revengeBatteryNotfield


    One and the same song can be included in several discs, and albums of the same name with songs of the same name from different groups, such as tributes, are theoretically possible. Therefore, the key will be {Artist, CD_name, Track} . In this case, the value of the Lyrics column , denoting the author of the words, is uniquely determined from the {Artist, Track} columns , which are part of the key. This is a violation of 2NF.

    The consequence of this is that the values ​​in the Lyrics column are redundant for each disc that the song belongs to. In the field of music, these values ​​do not change, but in other domain domains, a careless change in such redundant data can lead to an inconsistent state of the database when not all values ​​are updated. This is an example of a modification anomaly.

    Another consequence is that songs that have not yet been released on CDs, but simply broadcast on the radio or released on other media, do not fit the specified data scheme. Accordingly, we will not be able to add a new song to the database until it is released on CD. This is an example of an insertion anomaly.

    Similarly, if we want to remove a disc from the database, we will be forced to lose information about the authors of all songs that are included only in this disc, since in this model it is not possible to present information about the author if the song is not included in any CD . For example, the desire to remove the Six Degrees Of Inner Turbulence disc will cause the songwriter Misunderstood to be lost, which is unforgivable. This is an example of a removal anomaly.

    To avoid such anomalies and remove redundancy, we need to divide the table, that is, carry out its decomposition into two:

    ArtistCd_nameTrack
    Dream theaterSix Degrees Of Inner TurbulenceMisunderstood
    Dream theaterMetropolis, pt. 2: Scenes From A MemoryOverture 1928
    Dream theaterMaster of puppetsBattery
    MetallicaMaster of puppetsBattery
    EnsiferumTale of revengeBattery


    ArtistTrackLyrics
    Dream theaterMisunderstoodPetrucci
    Dream theaterOverture 1928(instrumental)
    MetallicaBatteryNotfield


    In a real database for constructing queries, you also need to introduce semantic relationships between tables, for example, to associate them using a foreign key, but for our example it is enough to understand that these tables are connected by meaning.

    Both tables have 2NF, the first - because she has all the columns included in the key, and the second - because Lyrics is determined by the {Artist, Track} key and is not uniquely determined by any of the Artist or Track columns .

    I’m probably not going to talk about the warehouse, I’m tired of typing tablets in HTML :)

    That's all.
    I hope that right now it was clear, I went to deal with 3NF!

    Also popular now: