Published on April 13, 2015


Everyone who works with databases knows what AUTO_INCREMENT is. A lot has been written about him, including on the Habr . In this article I want to express my thoughts on this topic, because before I did not meet with arguments in exactly this way. But first, let's decide why we need a database at all.

Indeed, we will store everything in RAM. Although, no, it will not be enough. And with an unexpected shutdown, all data will be lost. We put ups. No, better 2. And even better imagine that our memory is endless and does not lose data when turned off.

Now you can write something like this:

Order order = new Order();
User user = new User();
order.creator = user;

And to create at least 10 such orders, at least 20, at least 100 ... Hmm, I created a lot of things, I would need some kind of collection. And to her search in the fields. And a query language so that everything is universal. Plus indexes on these fields to speed up the search.

What did we do? It turned out to be an almost ordinary database. With one exception, objects do not have a primary key. All objects are identified by address in the address space.

Now we can recall that RAM has its limitations, programs close, computers shut down, several programs may need data at once. So, you need to make some abstraction over the RAM so that the addresses of all are always the same. Something like virtual addressing in protected processor mode. And store it in files on disk. And make a management system.

This leads to the idea of ​​why integer keys are easy to use. The reasons are not only in the implementation of database management systems. A database is an address space for placing objects. An integer key (ID) is a reference to an object.

It follows that auto_increment within the same table is not entirely correct. Each entry in the database must have a unique address. It turns out, as it were, a two-dimensional address space - addresses grow in one direction, the objects themselves in another; the size of one object does not affect the addresses of neighboring objects. It doesn’t matter whether the distributed database or not, how many servers, databases and tables are in it. This is one address space, and addressing must be unambiguous.

It turns out that theoretically a record can be found by key without specifying a table; or vice versa, you can find the table itself, in which there is a record with such a key. You can even divide the address space into ranges: for example, up to 1,000,000 system tables, settings, directories; after 1,000,000 real data. When overflowing, you can add a predefined sufficiently large constant to the range, or pre-configure the range map.

However, the ID does not need to be attributed to the attributes of the object itself as an element of the data model. For example:

int x = 2;

The variable x has a value of 2 and the address is 0x123456. But you cannot say that an address is an attribute of integer values. It also cannot be said that ID is an attribute of objects of type User and Order. It serves merely as a connection between the abstract model and technical implementation.

All restrictions of the natural key must be done with additional technical means, at the level of the database itself or the application. In fact, I can’t even imagine a situation in which the natural primary key is “natural”. All options such as passport number, phone number, TIN are artificially entered numbering of objects, precisely because it is impossible to distinguish a unique set of features from them or their owners. And it can be used only for identifying precisely these objects - issued passports, telephone network nodes, an economic entity paying taxes. In reality, there is always the possibility of duplicates with the same properties, and a set of different properties at different times in the same entity.

PS: My reasoning is theoretical, I did not work with databases with a similar structure. Technically, this is not difficult to do - one sequence for all tables. If anyone has such an experience, please share the information in the comments - the pros, cons, pitfalls.