
Auto-incremental primary keys (surrogate keys) = evil?
In this article, I will take a look (mostly negative) by Josh Berkus, CEO of PostgreSQL Experts Inc. to use surrogate keys for database tables, the same INT NOT NULL AUTO_INCREMENT PRIMARY KEY we are used to. In fact, it will be a free, greatly reduced translation of his article on ittoolbox .
The article will be followed by a review of my own mistakes on this topic made in one old project. I was young and stupid, but that does not excuse me.
Honestly, after reading this article and not noticing who the author is, I thought that he was still exaggerating, and in general, I would somehow figure out where and what keys to use without me. Then I thought a little more and got into a dump of the base structure of my old project. It was interesting.
If you are an experienced DBA, you should probably pass by so as not to be upset.
But first things first. First, a VERY abbreviated translation:
“Surrogate numeric keys fell into the SQL89 standard for saving with older applications that needed line numbers. Subsequently, in a conversation with Joe Selco , Codd said that he regrets that he allowed this.
Inexperienced developers, not realizing that using surrogate keys is a pragmatic compromise with performance considerations, use them everywhere. Even authors of books on databases are advised to create them in all tables in any case.
In the theory of relational databases, there is no concept of primary keys. All database keys are of equal importance. The concept of a primary key is based on the notion that one and only one key determines the order of tuples on a disk, and relational theory tells us that we should just ignore this in the logical model of our data. So primary keys in general are a violation of relational theory.
I am not saying that surrogate keys cannot be used at all, I am saying that their use cannot be abused.
What reasons might motivate us to use surrogate keys?
Multi-Column Key Compromise. Usually pretty convincing. The syntax of SQL queries using multi-column keys and the join mechanism currently leaves much to be desired, as is the performance of queries of this kind. Once these problems are resolved, this reason will disappear.
Data does not have a real key . A very bad reason. Its appearance illustrates both the poor design of the database as a whole, and the fact that the developer does not really understand the data with which he works.
External requirements . Usually convincing. Typically, development environments and database tools support only surrogate keys. And if you think that this tool is indispensable in the problem you are solving, well ...
Data consistency. Usually convincing. But only if you really scrupulously follow the plan and your entire design is carefully planned.
Adhering to the SQL standard and principles of good design . A very bad reason. It is completely based on ignorance. Usually, they follow her because somewhere they heard someone reading on the blog of someone who is studying at the UNIVERSITY that the use of surrogate keys is the standard in the industry. Keep in mind that neither modern SQL standards nor the relational theory itself contain even a mention of surrogate keys.
Easy changeability. Dont clear. Indeed, some DBMSs do not know how to perform ON UPDATE CASCADE or do it too inefficiently (by the way, think of it as a reason for changing the DBMS). And in this case, this reason can be significant. However, sometimes developers say that the [primary] keys for recording should not be changed and should remain the same throughout the entire recording life cycle. Keep in mind that this statement is not worth a damn and, of course, is completely absent in relational theory.
Performance. Usually a bad reason. Yes, indeed, situations may arise in which the use of natural keys greatly slows down the system compared to surrogate ones. But in 80% of cases, this statement is not based on real tests, and such a statement remains unfounded. Pre-optimization is the root of many ills in database design.
For mega-volume databases, the resulting table size may also be relevant. But for this the base must be very large.
The performance of connections or sorting also matters on a large amount of data, depending on the type of primary key and the number of its components. However, my experience shows that when they call this reason, it is very rarely behind real calculations or performance measurements. For instance,www.bricolage.cc has been using 14-byte numeric primary keys for its tables for many years. However, in this case, after the appearance of a user with a three millionth record in history, when the question arose about changing primary keys for the sake of performance, this problem was solved by rewriting the queries. An approximately 10-fold increase in productivity was achieved.
Please note that the problem is not the use of surrogate keys, but the abuse of them. "
End of my VERY ABBREVIATED translation. The original here (called Primary Keyvil): it.toolbox.com/home/search.aspx?r=%22Primary+kevill%22&community=1&contentType=5
If I missed something important in the translation, please tell me about it. I will add.
Yet the article seemed a little dramatic to the problem. It seems to me that surrogate keys are chosen more often precisely because of avoiding performance problems later and recently everyone got used to them so much that they are planted at the level of the DBMS themselves. For example, InnoDB, if you do not create the primary key, just create it yourself. By the way, in the case of InnoDB, the choice of the primary key has serious consequences from the point of view of performance, since clustering is performed on it (accordingly, the choice of a natural key can both improve and worsen the situation).
Despite the fact that the article sounds as if surrogate keys are embodied evil, the author several times emphasizes that the problem is not their use, but their abuse.
This article opened my eyes in the sense that I always thought it natural to not look for special candidates for primary keys, but simply create an INT NOT NULL AUTO_INCREMENT PRIMARY KEY field and sit still. Of course, I knew that you could choose any unique key as the primary key, but I never focused on that. I never really thought about what really makes this database row unique and why it is important. As it turned out, in vain.
As an example, I want to give you my small old project. There are only a few tables. At first, I wanted to choose something bigger, but I think that is superfluous. Only in vain will I take your time. Let everyone open their own old project and look at it from the point of view of the described position. I actually added one mistake right now for the sake of justice. I would have done it anyway. I was saved only by accident.
The project is a closed torrent tracker. I ask you not to pay attention now to problems with normalization and all sorts of others. If I wrote it now, maybe something I would do differently. Let's focus on surrogate keys.

pastebin.com/LstH8Xfx
The first table I would like to talk about is the log table. In general, it was this case that dumbfounded me a bit or something, because I suddenly saw a mistake. Very small, not worth much attention, but, nevertheless, this is a mistake that I have not noticed for many, many years. I didn’t notice at all. Take a break now from the text and return to the structure of this table. See? I did not see.
This table stores simple information. IP, user ID, date of occurrence of the event and its text. Yes, of course, the text could be replaced with code and much more could be done, but this is not about that. After reading the article, I looked at this table and thought that, so I created a surrogate key. But what is the real data key? What makes a particular table row unique?
The answer is very simple. A combination of the user ID and the time the event occurred. And here I suddenly saw the situation from the other side. In almost all of my old projects, the DATETIME field is used to store time in logs. Just because it's convenient. Yes, I knew that it was stored accurate to the second and it completely suited me. Now, when I began to look for natural keys, it suddenly occurred to me what consequences this bears. The torrent tracker in question is very heavily loaded and a lot can happen within a second. In fact, if in the log with this damn surrogate key there are several events with the same time and they happened one after another very quickly, I can tell which one happened first and which last only being guided by the auto-increment of the surrogate key. The date information field itself, which was created precisely to report such things, will not help me. And I won’t be able to find out exactly the interval between events at all.
In general, this, of course, does not matter. The probability that I will need to find out the interval between two events, which in any case is less than a second, is very small. But I always consider all my projects, both old and new, as educational. The design could be a little different, and it could become important.
I want to say that considering the problem from the point of view of finding the natural key is a somewhat different view. Try to look at the design of your project this way and see what comes out.
It seems that my explanation turned out to be chaotic. I hope, nevertheless, I managed to convey to you my thought.
Now the table is peer. She already has a unique key that just asks for the primary role. Many hundreds of inserts / deletions per second are made into the peer table, and keeping an extra index in the form of a primary key is simply unprofitable. So I eliminated it.
Session table. For some reason, I did not rely on PHP sessions completely, but partially implemented mine. The primary key of this table is a random value. Not only is it stupid to use 40-character random sequences, but it is not really needed here at all. What acts as a natural key for entries in this table? In this project, the user was not allowed to be logged in from multiple computers at the same time. Um. user_id? Everything else with respect to this value is secondary. I will not analyze now what follows from this simple statement. A lot of things up to deleting the session table and implementing another mechanism. There are many options.
Let's move on to the torrent table.
A small digression so that you imagine the subject of discussion. The torrent tracker that I developed was at the same time the first seed for the files that were distributed. The torrent table kept information about the files that were sitting. These files were in the server file system, corresponding .torrent files were created for them according to the scheme one file = one torrent, which were downloaded by users. Each torrent has a so-called info_hash, which uniquely identifies it .
This field in the peer table is called peer_info_hash. And in the torrent table, this is the torrent_info_hash field. torrent_id is superfluous there. Absolutely. Please note that peer torrent_id is also in the table. It is not clear why.
Well, the user table. It would seem that here I just could not make mistakes. Was wrong.
In the authorization system, torrent trackers, a GET parameter with a value unique to the user is used. In the table, this value is user_torrent_uid. So ask me, who prevented using this value as a natural key in one or another version? Yes, it can change. In a very rare case. So what? If 8 bytes is too long, you could take a regular random INT and convert it to text, as smart people do on Flickr. It was possible ... Yes, a lot of things were possible.
Like this. Everything is obvious, right? :)
The article will be followed by a review of my own mistakes on this topic made in one old project. I was young and stupid, but that does not excuse me.
Honestly, after reading this article and not noticing who the author is, I thought that he was still exaggerating, and in general, I would somehow figure out where and what keys to use without me. Then I thought a little more and got into a dump of the base structure of my old project. It was interesting.
If you are an experienced DBA, you should probably pass by so as not to be upset.
But first things first. First, a VERY abbreviated translation:
“Surrogate numeric keys fell into the SQL89 standard for saving with older applications that needed line numbers. Subsequently, in a conversation with Joe Selco , Codd said that he regrets that he allowed this.
Inexperienced developers, not realizing that using surrogate keys is a pragmatic compromise with performance considerations, use them everywhere. Even authors of books on databases are advised to create them in all tables in any case.
In the theory of relational databases, there is no concept of primary keys. All database keys are of equal importance. The concept of a primary key is based on the notion that one and only one key determines the order of tuples on a disk, and relational theory tells us that we should just ignore this in the logical model of our data. So primary keys in general are a violation of relational theory.
I am not saying that surrogate keys cannot be used at all, I am saying that their use cannot be abused.
What reasons might motivate us to use surrogate keys?
Multi-Column Key Compromise. Usually pretty convincing. The syntax of SQL queries using multi-column keys and the join mechanism currently leaves much to be desired, as is the performance of queries of this kind. Once these problems are resolved, this reason will disappear.
Data does not have a real key . A very bad reason. Its appearance illustrates both the poor design of the database as a whole, and the fact that the developer does not really understand the data with which he works.
External requirements . Usually convincing. Typically, development environments and database tools support only surrogate keys. And if you think that this tool is indispensable in the problem you are solving, well ...
Data consistency. Usually convincing. But only if you really scrupulously follow the plan and your entire design is carefully planned.
Adhering to the SQL standard and principles of good design . A very bad reason. It is completely based on ignorance. Usually, they follow her because somewhere they heard someone reading on the blog of someone who is studying at the UNIVERSITY that the use of surrogate keys is the standard in the industry. Keep in mind that neither modern SQL standards nor the relational theory itself contain even a mention of surrogate keys.
Easy changeability. Dont clear. Indeed, some DBMSs do not know how to perform ON UPDATE CASCADE or do it too inefficiently (by the way, think of it as a reason for changing the DBMS). And in this case, this reason can be significant. However, sometimes developers say that the [primary] keys for recording should not be changed and should remain the same throughout the entire recording life cycle. Keep in mind that this statement is not worth a damn and, of course, is completely absent in relational theory.
Performance. Usually a bad reason. Yes, indeed, situations may arise in which the use of natural keys greatly slows down the system compared to surrogate ones. But in 80% of cases, this statement is not based on real tests, and such a statement remains unfounded. Pre-optimization is the root of many ills in database design.
For mega-volume databases, the resulting table size may also be relevant. But for this the base must be very large.
The performance of connections or sorting also matters on a large amount of data, depending on the type of primary key and the number of its components. However, my experience shows that when they call this reason, it is very rarely behind real calculations or performance measurements. For instance,www.bricolage.cc has been using 14-byte numeric primary keys for its tables for many years. However, in this case, after the appearance of a user with a three millionth record in history, when the question arose about changing primary keys for the sake of performance, this problem was solved by rewriting the queries. An approximately 10-fold increase in productivity was achieved.
Please note that the problem is not the use of surrogate keys, but the abuse of them. "
End of my VERY ABBREVIATED translation. The original here (called Primary Keyvil): it.toolbox.com/home/search.aspx?r=%22Primary+kevill%22&community=1&contentType=5
If I missed something important in the translation, please tell me about it. I will add.
Now a little about what I myself think.
Yet the article seemed a little dramatic to the problem. It seems to me that surrogate keys are chosen more often precisely because of avoiding performance problems later and recently everyone got used to them so much that they are planted at the level of the DBMS themselves. For example, InnoDB, if you do not create the primary key, just create it yourself. By the way, in the case of InnoDB, the choice of the primary key has serious consequences from the point of view of performance, since clustering is performed on it (accordingly, the choice of a natural key can both improve and worsen the situation).
Despite the fact that the article sounds as if surrogate keys are embodied evil, the author several times emphasizes that the problem is not their use, but their abuse.
This article opened my eyes in the sense that I always thought it natural to not look for special candidates for primary keys, but simply create an INT NOT NULL AUTO_INCREMENT PRIMARY KEY field and sit still. Of course, I knew that you could choose any unique key as the primary key, but I never focused on that. I never really thought about what really makes this database row unique and why it is important. As it turned out, in vain.
As an example, I want to give you my small old project. There are only a few tables. At first, I wanted to choose something bigger, but I think that is superfluous. Only in vain will I take your time. Let everyone open their own old project and look at it from the point of view of the described position. I actually added one mistake right now for the sake of justice. I would have done it anyway. I was saved only by accident.
The project is a closed torrent tracker. I ask you not to pay attention now to problems with normalization and all sorts of others. If I wrote it now, maybe something I would do differently. Let's focus on surrogate keys.
Database structure

pastebin.com/LstH8Xfx
The first table I would like to talk about is the log table. In general, it was this case that dumbfounded me a bit or something, because I suddenly saw a mistake. Very small, not worth much attention, but, nevertheless, this is a mistake that I have not noticed for many, many years. I didn’t notice at all. Take a break now from the text and return to the structure of this table. See? I did not see.
This table stores simple information. IP, user ID, date of occurrence of the event and its text. Yes, of course, the text could be replaced with code and much more could be done, but this is not about that. After reading the article, I looked at this table and thought that, so I created a surrogate key. But what is the real data key? What makes a particular table row unique?
The answer is very simple. A combination of the user ID and the time the event occurred. And here I suddenly saw the situation from the other side. In almost all of my old projects, the DATETIME field is used to store time in logs. Just because it's convenient. Yes, I knew that it was stored accurate to the second and it completely suited me. Now, when I began to look for natural keys, it suddenly occurred to me what consequences this bears. The torrent tracker in question is very heavily loaded and a lot can happen within a second. In fact, if in the log with this damn surrogate key there are several events with the same time and they happened one after another very quickly, I can tell which one happened first and which last only being guided by the auto-increment of the surrogate key. The date information field itself, which was created precisely to report such things, will not help me. And I won’t be able to find out exactly the interval between events at all.
In general, this, of course, does not matter. The probability that I will need to find out the interval between two events, which in any case is less than a second, is very small. But I always consider all my projects, both old and new, as educational. The design could be a little different, and it could become important.
I want to say that considering the problem from the point of view of finding the natural key is a somewhat different view. Try to look at the design of your project this way and see what comes out.
It seems that my explanation turned out to be chaotic. I hope, nevertheless, I managed to convey to you my thought.
Now the table is peer. She already has a unique key that just asks for the primary role. Many hundreds of inserts / deletions per second are made into the peer table, and keeping an extra index in the form of a primary key is simply unprofitable. So I eliminated it.
Session table. For some reason, I did not rely on PHP sessions completely, but partially implemented mine. The primary key of this table is a random value. Not only is it stupid to use 40-character random sequences, but it is not really needed here at all. What acts as a natural key for entries in this table? In this project, the user was not allowed to be logged in from multiple computers at the same time. Um. user_id? Everything else with respect to this value is secondary. I will not analyze now what follows from this simple statement. A lot of things up to deleting the session table and implementing another mechanism. There are many options.
Let's move on to the torrent table.
A small digression so that you imagine the subject of discussion. The torrent tracker that I developed was at the same time the first seed for the files that were distributed. The torrent table kept information about the files that were sitting. These files were in the server file system, corresponding .torrent files were created for them according to the scheme one file = one torrent, which were downloaded by users. Each torrent has a so-called info_hash, which uniquely identifies it .
This field in the peer table is called peer_info_hash. And in the torrent table, this is the torrent_info_hash field. torrent_id is superfluous there. Absolutely. Please note that peer torrent_id is also in the table. It is not clear why.
Well, the user table. It would seem that here I just could not make mistakes. Was wrong.
In the authorization system, torrent trackers, a GET parameter with a value unique to the user is used. In the table, this value is user_torrent_uid. So ask me, who prevented using this value as a natural key in one or another version? Yes, it can change. In a very rare case. So what? If 8 bytes is too long, you could take a regular random INT and convert it to text, as smart people do on Flickr. It was possible ... Yes, a lot of things were possible.
Like this. Everything is obvious, right? :)