maximw August 14, 2014 at 18:58

MySQL Sharding ID Generation

The subject of sharding is quite extensive both from the point of view of the programmer and from the point of view of the database administrator. I now want to touch only on the issues of generating a unique entity ID and shard selection algorithms.

Shard Algorithms

First, let's talk about algorithms, so that later, when describing how to generate IDs, you can refer here.

The fundamental condition for shard selection algorithms is re-computability, i.e. on the same source data, the algorithm should return the same result. Two algorithms are most common: range partitioning and the remainder of modulo division.

Range Break

The algorithm consists in breaking the set of IDs into continuous subsets - ranges. Each of the ranges corresponds to one shard. For example, from 1 to 1000 - the first shard, from 1001 - 3000 to the second shard, etc. Having an ID, it’s not difficult enough to understand which range it belongs to and which shard to use.
The ID does not have to be an integer, but the set of IDs must be ordered, that is, if you take two IDs, you can say which one is larger, which is smaller or that they are equal. But almost always this condition is met.
Pros of this algorithm:

The gradual increase in the number of shards. That is, you do not need to have many shards at once, they can be connected as needed, which is convenient for beginner projects.
Flexibility in load balancing. If one shard is more productive, it can be given a larger range.
Solving problems with redistribution of load. For example, if a user stored on one of the shards becomes too active, so that creates a large load, then you can allocate a new range including this user ID on a separate shard. Those. actually divide one range into 2-3 new ranges, thus redistributing the load.

Minuses:

It is required to maintain the relevance of the ranges, monitor the allocation of new and connecting shards.

Modulo remainder

Of course, a prerequisite - with the ID should be possible division operation modulo. Initially, the required number of shards is determined, maybe with a margin. The shard number is calculated from the remainder of dividing the ID by the number of shards.
Even if the ID is a string, and the shard is selected by its substring, this is actually an analogue of modulo division. For example, ID is the hexadecimal record of the md5 hash, and the first 2 hexadecimal digits are used to select the shard. Just in this case, we can assume that other numbers and / or the order of precedence of the ranks in the number are used.
It is also worth noting that to use this variation of the shard selection algorithm for a substring, one must be sure of a uniform distribution of substring characters, otherwise the load on the shards will not be uniform. For example, it is safe to say that the md5 hexadecimal notation satisfies this condition, but the UUID does not guarantee uniformity in substrings.
Pros:

Even distribution of new users among shards. The load is growing gradually.
Simple and quick calculation of the shard number is one arithmetic operation.

Minuses:

You must immediately reserve the right amount of shards.
Adding a new shard is not a trivial task, it is necessary to redistribute data between shards.

Ways to generate ID

Uuid

The simplest solution "head on" is to generate a unique ID, with a probability of collision so small that it is more likely that MySQL auto-increment will glitch and return the value already used. This is what the UUID standard does .
MySQL itself can generate UUID () . There is a library for PHP .
The disadvantage of this approach is the excess field length, especially considering that it can be used for a foreign key. Although MySQL allows you to slightly reduce from 36 to 16 bytes, making it completely unreadable:

SELECT UNHEX(REPLACE(UUID(), '-', '')); #Надо аккуратно выбирать charset для такого поля и аккуратно сравнивать

Choosing a shard may be a variation of the "remainder of modulo division" algorithm. Since the UUID does not have a uniform distribution, it is worth somehow transforming it, for example, using hashing.

Auto increment master table

The master database is stored in the main database with an auto-increment field and, possibly, some additional most used entity fields. Accordingly, the auto-increment of this table takes care of generating the ID.
Any algorithm for choosing a shard will do, but most often it’s the same “range breakdown” or “modulo remainder”.
This is a very commonly used method, due to the simplicity and flexibility of shard distribution. The disadvantage of this approach is that the table is split not only horizontally, but also vertically.

auto_increment_increment and auto_increment_offset

These MySQL configuration options let you specify the increment increment of the auto increment and the initial offset.
This can be used so that each shard generates its unique ID in pass-through numbering. It is enough to set auto_increment_increment equal to the number of shards, and write the number of the shard in auto_increment_offset.
The algorithm for choosing a shard to add for does not depend on the ID, and therefore it can be any, for example, random or as the spirits of balancing prophesy. The inverse transformation can be done by dividing the ID modulo auto_increment_increment.
The disadvantage of this approach is that there is no flexibility with the distribution algorithm and with an increase in the number of shards, unless auto -increment_increment is pre-stocked. But in contrast to the previous method, if one of the shards “fell”, it simply won’t give out IDs from its sequence, there is no dependence on the main database, which increases fault tolerance.

Imitation Sequence

This method allows you to get rid of the vertical partition of the table, as in the method with the master table. MySQL has no mechanism for generating sequences, with the exception of auto-increment, which is not quite what you need.
But Sequence can be simulated like this:

CREATE TABLE IF NOT EXISTS `sequence` (
  `id` int(11) NOT NULL DEFAULT '0'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO `sequence` SET `id` = 0;

Next, you can get the following ID from the desired sequence:

UPDATE `sequence` SET `id` = last_insert_id(`id` + 1);
SELECT last_insert_id();

In the sequence table, you can keep several fields for several sequences. And the unit in the expression last_insert_id (`id` + 1) can be replaced with another number by implementing auto_increment_increment.
Shard selection algorithms are similar to those used in the master table method.

Dynamic transition between shards

It is assumed that the distribution of shards occurs over ranges of IDs. The main idea is to request a new ID from the shard and, if it is outside the range, create a table on the next shard with a predefined value for auto-increment at the initial border of the new range. To do this, somewhere in the shared memory (for example, memcache) the current shard used for writing should be stored. There are no special problems with the need for transactions and locks:

insert the record
we get last_insert_id (),
if outside the range, we switch to a new shard, which is written to the shared memory as current,
create a table on the new shard (and IF NOT EXISTS helps)
add the entry again.

As a side effect, extra entries in the previous shard are possible, which can then be erased.

Tags: