Unique keys in a distributed database environment

    If you spread data across multiple physical databases,
    generating globally unique identifiers is not a trivial task.
    I have tried to collect the possible options and weigh their pros and cons.


    What requirements (besides uniqueness) can be placed on keys?



    The key should increase monotonically.
    Why is this needed:
    • Automatic natural sorting (including chronological)
    • Faster insertion than with random keys

    The above is relevant only for tree-based indexes.

    The key must be formed on the client side.
    Why is this needed:
    • The application can immediately go to the right shard (cluster node), since the unique key is known even before the object is saved
    • An application can save several related objects at once (in one transaction or batch job)


    Key size limitation (for example, only 64 bits)
    Why is this needed:
    • In some cases, 128- or 160-bit identifiers can be a problem. For example, Sphinx search only supports integer document identifiers (64 bits in the 64-bit version).
    • An index with 32- or 64-bit keys should theoretically work faster than with keys of arbitrary length.
    • An index with 32- or 64-bit keys is more compact and, as a result, is entirely placed in memory (relevant for large amounts of data).


    What are the ways to generate unique keys?



    Generating IDs on the application side (UUID, etc.)

    With UUID, everything is relatively simple: take a library and generate keys on the application side. UUIDs are widely used in many systems. The value is formed in such a way that collisions are practically impossible.

    Cons:
        • A standard UUID is 128 bits long (and in practice it is often stored directly as a string of 32 hexadecimal digits, i.e. 32 bytes)
        • As a rule, the UUID does not allow natural sorting of keys
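    Generating such a key needs nothing beyond the standard library. For example, in Python (a random version-4 UUID; the variable name is arbitrary):

```python
import uuid

# A random (version 4) UUID: 128 bits, collision probability is negligible
key = uuid.uuid4()

print(str(key))  # canonical 36-character form with dashes
print(key.hex)   # the same value as 32 hexadecimal digits
```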

    A separate service for generating keys

    This approach is used, for example, by Flickr and Twitter.

    Cons:
        • complicates the system by introducing an additional component
        • potentially a single point of failure; extra effort is required to ensure high availability of the key generator
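    The core idea can be sketched as a trivial in-process key issuer (a hypothetical toy for illustration only; Flickr's ticket servers and Twitter's Snowflake are real networked services with much stronger guarantees):

```python
import itertools
import threading

class KeyService:
    """Toy key issuer: hands out monotonically increasing integers."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)
        self._lock = threading.Lock()  # serialize concurrent callers

    def next_key(self):
        with self._lock:
            return next(self._counter)

svc = KeyService()
print(svc.next_key())  # → 1
print(svc.next_key())  # → 2
```

    A real service would also need persistence across restarts and replication to avoid becoming a single point of failure.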

    Auto-increment on the database side by value ranges

    The entire range of values is divided into sub-ranges, and each shard (cluster node) is responsible for its own range, for example:
        1 - 00000000 .. 3FFFFFFF,
        2 - 40000000 .. 7FFFFFFF,
        3 - 80000000 .. BFFFFFFF,
        4 - C0000000 .. FFFFFFFF
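    With such a layout, a client can compute the owning shard directly from the key (a hypothetical helper for the 4-shard example above):

```python
# Each shard owns a contiguous 2**30-key sub-range of a 32-bit key space.
RANGE_SIZE = 0x40000000  # 2**30 keys per shard

def shard_for_key(key):
    """Return the shard number (1-based, as in the table above) that owns `key`."""
    return key // RANGE_SIZE + 1

print(shard_for_key(0x3FFFFFFF))  # → 1
print(shard_for_key(0x80000000))  # → 3
```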
    

    Cons:
        • The database must support auto-increment (not applicable to many NoSQL storages).
        • Natural sorting is impossible.
        • The client “does not know” the key before inserting the object; a separate request is needed to find it out.
        • It is almost impossible to increase the number of cluster nodes afterwards (the ranges are fixed in advance).

    Auto-increment from Instagram

    This solution was described in the engineering blog of the Instagram team (a photo-sharing application).

    They use a 64-bit key, which is generated on the database side (PostgreSQL) and consists of the following bit fields:

    63                          31                          0
    |======|======|======|======|======|======|======|======|
    |        CUSTOM TIMESTAMP          | SHARD ID  |  AUTO  |
    

    where
    - CUSTOM TIMESTAMP (41 bits) - time in milliseconds since 2011-01-01 00:00:00
    - SHARD ID (13 bits) - logical shard identifier (the number of logical shards is greater than the number of physical nodes)
    - AUTO (10 bits) - a sequence value, unique within a logical shard
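    Packing and unpacking such a key can be sketched in Python (Instagram generates the value inside PostgreSQL; the layout below follows the bit fields above, and the function names are my own):

```python
import time

# Custom epoch: 2011-01-01 00:00:00 UTC, in milliseconds (as in the scheme above)
EPOCH_MS = 1293840000000

def make_id(shard_id, seq, now_ms=None):
    """Pack a 64-bit key: 41-bit timestamp | 13-bit shard id | 10-bit sequence."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    ts = (now_ms - EPOCH_MS) & ((1 << 41) - 1)  # milliseconds since the custom epoch
    return (ts << 23) | ((shard_id & 0x1FFF) << 10) | (seq & 0x3FF)

def split_id(key):
    """Unpack a key back into (timestamp_ms, shard_id, sequence)."""
    return key >> 23, (key >> 10) & 0x1FFF, key & 0x3FF
```

    Keys produced this way sort chronologically, and the logical shard can be recovered from the key itself without an extra lookup.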

    Pros (compared to auto-incrementing by ranges):
        • Automatic chronological sorting

    And what options do you know and use?
