All You Need to Know About Partitioning (Part 1)

    Part 2

    Good evening / day / morning, dear Habr readers! We continue to develop and expand the blog about my favorite open-source RDBMS, PostgreSQL. Surprisingly, today's topic has somehow never been covered here before. I must say that partitioning in PostgreSQL is very well described in the documentation, but will that really stop me? :)

    Introduction


    Partitioning is generally understood not as a specific technology but as an approach to database design that appeared long before DBMSs began to support so-called partitioned tables. The idea is very simple: divide a table into several smaller parts. There are two varieties: horizontal and vertical partitioning.

    Horizontal partitioning

    Each part of the table contains a different set of its rows. Suppose we have a log table for some abstract application, LOGS. We can break it into parts: one for the logs from January 2009, another for February 2009, and so on.
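
    As a minimal sketch (the column names here are made up for illustration; the master-table mechanics are covered later in this article), the split could look like this:

    CREATE TABLE logs_200901
    (
      log_id   BIGINT,
      log_time TIMESTAMP NOT NULL CHECK (log_time >= '2009-01-01' AND log_time < '2009-02-01'),
      message  TEXT
    );

    CREATE TABLE logs_200902
    (
      log_id   BIGINT,
      log_time TIMESTAMP NOT NULL CHECK (log_time >= '2009-02-01' AND log_time < '2009-03-01'),
      message  TEXT
    );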

    Vertical partitioning

    Each part of the table contains a different subset of its columns. Finding a use case for vertical partitioning (one where it is really justified) is somewhat harder than for horizontal partitioning. As an idealized, textbook example, consider this: the NEWS table has the columns ID, SHORTTEXT and LONGTEXT, and suppose the LONGTEXT field is used much less often than the other two. In that case it makes sense to split NEWS by columns: create two tables, one for SHORTTEXT and one for LONGTEXT, linked by their primary keys, plus a NEWS view that joins both columns back together. Then, when we only need the short description of the news, the DBMS does not have to read the full news text from disk.
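
    A minimal sketch of such a layout (the column types here are assumptions, the original only names the columns):

    -- Frequently read part: id and the short description
    CREATE TABLE news_short
    (
      id        BIGINT PRIMARY KEY,
      shorttext VARCHAR(1024) NOT NULL
    );

    -- Rarely read part: the full text, one-to-one with news_short
    CREATE TABLE news_long
    (
      id       BIGINT PRIMARY KEY REFERENCES news_short (id),
      longtext TEXT NOT NULL
    );

    -- The NEWS view glues both halves back together
    CREATE VIEW news AS
      SELECT s.id, s.shorttext, l.longtext
      FROM news_short s
      LEFT JOIN news_long l ON l.id = s.id;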

    Partitioning support in modern DBMS

    Most modern DBMSs support table partitioning in one form or another.
    • Oracle has supported partitioning since version 8. On the one hand, working with partitions is very simple (most of the time you do not have to think about them at all and work with the table as usual *); on the other hand, everything is very flexible: partitions can be split into subpartitions, dropped, split and moved, and different indexing schemes for a partitioned table are supported (global and partitioned indexes). An extensive description is available in the Oracle documentation.
    • Microsoft SQL Server: partitioning support appeared relatively recently (in SQL Server 2005). The first impression is "Well, finally!! :)", the second is "It works, everything seems fine." Documentation is available on MSDN.
    • MySQL: supported since version 5.1. There is a very good write-up on Habr.
    • Etc…

    * I'm exaggerating, of course: there is the usual set of chores (create a new partition in time, drop the old one, and so on), but it is still simple and clear.

    Partitioning in Postgresql


    Table partitioning in PostgreSQL is implemented somewhat differently than in other DBMSs. Partitioning is built on table inheritance (a feature specific to PostgreSQL): there is a main table (the master table), and its partitions are tables that inherit from it. We will look at partitioning through the example of a task close to a real-world one.

    Formulation of the problem

    The database is used to collect and analyze data about visitors to a site or sites. The data volumes are large enough to make partitioning worth considering. In most cases the analysis uses data from the last day.
    1. Create the main table:
    CREATE TABLE analytics.events
    (
      event_id BIGINT DEFAULT nextval('analytics.seq_events') PRIMARY KEY,
      user_id UUID NOT NULL,
      event_type_id SMALLINT NOT NULL,
      event_time TIMESTAMP DEFAULT now() NOT NULL,
      url VARCHAR(1024) NOT NULL,
      referrer VARCHAR(1024),
      ip INET NOT NULL
    );
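
    The statement above assumes that the analytics schema and the analytics.seq_events sequence already exist; if they do not, something along these lines has to be run first (a sketch, not part of the original listing):

    CREATE SCHEMA analytics;
    CREATE SEQUENCE analytics.seq_events;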


    2. We will partition by day on the event_time field: one new partition for every day, named by the rule analytics.events_DDMMYYYY. Here is an example partition for January 1st, 2010:
    CREATE TABLE analytics.events_01012010
    (
      event_id BIGINT DEFAULT nextval('analytics.seq_events') PRIMARY KEY,
      CHECK ( event_time >= TIMESTAMP '2010-01-01 00:00:00' AND event_time < TIMESTAMP '2010-01-02 00:00:00' )
    ) INHERITS (analytics.events);


    When creating a partition, we explicitly redefine the event_id field (a PRIMARY KEY is not inherited) and add a CHECK constraint on the event_time field so that rows from other dates cannot get in.
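
    To see the constraint at work: a row with a date from another day inserted directly into this partition is rejected (the values below are made up for illustration):

    -- Fails with a CHECK constraint violation: 2010-02-01 is outside this partition's range
    INSERT INTO analytics.events_01012010 (user_id, event_type_id, event_time, url, ip)
    VALUES ('00000000-0000-0000-0000-000000000000', 1, TIMESTAMP '2010-02-01 10:00:00', '/index', '127.0.0.1');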

    3. Create an index on the event_time field. When splitting the table into partitions we assume that most queries against the events table will filter on event_time, so an index on this field will help a lot.
    CREATE INDEX events_01012010_event_time_idx ON analytics.events_01012010 USING btree(event_time);


    4. We want data inserted into the main table to end up in the partition intended for it. To do this, we pull a small trick: we create a trigger that will route the incoming rows.
    CREATE OR REPLACE FUNCTION analytics.events_insert_trigger()
    RETURNS TRIGGER AS $$
    BEGIN
      IF (NEW.event_time >= TIMESTAMP '2010-01-01 00:00:00' AND
        NEW.event_time < TIMESTAMP '2010-01-02 00:00:00') THEN
        INSERT INTO analytics.events_01012010 VALUES (NEW.*);
      ELSE
        RAISE EXCEPTION 'Date % is out of range. Fix analytics.events_insert_trigger', NEW.event_time;
      END IF;
      RETURN NULL;
    END;
    $$
    LANGUAGE plpgsql;


    CREATE TRIGGER events_before_insert
      BEFORE INSERT ON analytics.events
      FOR EACH ROW EXECUTE PROCEDURE analytics.events_insert_trigger();
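
    A quick sanity check that the trigger routes rows where they belong (the values are made up for illustration):

    -- Insert through the master table...
    INSERT INTO analytics.events (user_id, event_type_id, event_time, url, ip)
    VALUES ('00000000-0000-0000-0000-000000000000', 1, TIMESTAMP '2010-01-01 12:00:00', '/index', '127.0.0.1');

    -- ...and the row lands in the partition for January 1st
    SELECT count(*) FROM ONLY analytics.events;     -- 0: the master table itself stays empty
    SELECT count(*) FROM analytics.events_01012010; -- 1

    Because the trigger function returns NULL, the client will see "INSERT 0 0" even though the row has been stored in the partition.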



    5. Everything is ready: we now have a partitioned analytics.events table and can start analyzing its data in earnest. By the way, we created the CHECK constraints not only to protect the partitions from incorrect data. PostgreSQL can also use them when building a query plan (although with a live index on event_time the gain will be modest); all it takes is enabling the constraint_exclusion setting:
    SET constraint_exclusion = on;
    SELECT * FROM analytics.events WHERE event_time > CURRENT_DATE;
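
    To verify that partitions are actually being skipped, EXPLAIN can be used. Note that constraint exclusion relies on constants in the WHERE clause; a comparison against a non-immutable function such as CURRENT_DATE may not be pruned at planning time, which is why this sketch uses literal timestamps:

    EXPLAIN SELECT *
    FROM analytics.events
    WHERE event_time >= TIMESTAMP '2010-01-01 00:00:00'
      AND event_time <  TIMESTAMP '2010-01-02 00:00:00';
    -- Expected: besides the (empty) master table, only analytics.events_01012010 shows up in the plan;
    -- other partitions are excluded thanks to their CHECK constraints.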



    End of the first part

    So what do we have? Point by point:
    1. The events table is divided into partitions, so analyzing the most recent day's data becomes simpler and faster.
    2. The creeping realization that all of this has to be maintained somehow: partitions must be created on time, and the trigger must not be forgotten when they are.

    In the second part I will talk about how to work with partitioned tables simply and without worries.

    UPD1: Standardized the terminology: "sectioning" replaced with "partitioning" throughout the text.
    UPD2:
    Based on a remark from one of the readers, who unfortunately does not have a Habr account:
    With inheritance there are a few points to keep in mind at design time. Partitions do not inherit the primary key or the foreign keys on their columns, so when creating a partition you must explicitly create the PRIMARY KEY and FOREIGN KEYs on its columns. For my part, I will note that putting FOREIGN KEYs on the columns of a partitioned table is not the best idea: in most cases a partitioned table is a fact table, which itself references dimension tables.
