Understanding the partitions in PostgreSQL 9

From the sandbox

PostgreSQL 10 was released in early October 2017, almost a year ago.

One of the most interesting new features is unconditionally declarative partitioning. But what if you are not in a hurry to upgrade to 10k? Here, Amazon, for example, is not in a hurry, and introduced PostgreSQL 10 support only in the last days of February 2018.

Then the good old partitioning through inheritance comes to the rescue. I am a software architect of a finance department in a taxi company, so all the examples will somehow be related to travel (we will leave the problems with money for another time).

Since we began to rewrite our financial system in 2015, when I just joined the company, there was no talk of any declarative partitioning. So to this day, the technique described below is successfully used.

The original reason for writing this article was that most of the PostgreSQL partitioning examples I encountered were very basic. Here is a table, here is one column that we are looking at, and perhaps even know in advance which values lie in it. It would seem that everything is simple. But real life makes its own adjustments.

In our case, we partition the tables in two columns, one of which contains the travel dates. This is the case we consider.

Let's start with how our table looks like:

createtable rides (
 id bigserial notnull primary key,
 tenant_id varchar(20) notnull,
 ride_id varchar(36) notnull,
 created_at timestampwithtime zone notnull,
 metadata jsonb
 -- Probably more columns and indexes coming here
);

For each tenant, the table contains millions of rows per month. Fortunately, the data between the tenants never intersect, and the most difficult requests are made at a cut of one or two months.

For those who did not delve into how the partitions work in PostgreSQL (the lucky ones from Oracle, hello!) I will briefly describe the process.

PostgreSQL relies on its two “features” for this: the ability to inherit tables, table inheritance, and checked conditions.

Let's start with inheritance. Using the INHERITS keyword, we indicate that the table we create inherits all fields of the inherited table. It also creates a relationship between two tables: by making a request from parent, we also get all the data from the child.

Checked conditions complete the picture by ensuring that data is not intersected. Thus, the PostgreSQL optimizer can cut off part of the child tables, relying on data from the query.

The first pitfall of this approach would seem quite obvious: any request must contain a tenant_id. And nevertheless, if you do not remind yourself of this constantly, sooner or later you will write custom SQL itself, in which you will forget to specify this tenant_id. As a result - scanning all partitions and non-functioning database.

But back to what we want to achieve. At the application level, I would like transparency - we always write in the same table, and already the database chooses where to put this data.

To do this, we use the following stored procedure:

CREATEORREPLACEFUNCTION insert_row()
 RETURNSTRIGGERAS
$BODY$
DECLARE
 partition_env TEXT;
 partition_date TIMESTAMP;
 partition_name TEXT;
 sql TEXT;
BEGIN-- construct partition name
 partition_env := lower(NEW.tenant_id);
 partition_date := date_trunc('month', NEW.created_at AT TIME ZONE 'UTC');
 partition_name := format('%s_%s_%s', TG_TABLE_NAME, partition_env, to_char(partition_date, 'YYYY_MM'));
 -- create partition, if necessary
 IF NOT EXISTS(SELECT relname FROM pg_class WHERE relname = partition_name) THEN
   PERFORM create_new_partition(TG_TABLE_NAME, NEW.tenant_id, partition_date, partition_name);
 ENDIF;
 selectformat('INSERT INTO %s values ($1.*)', partition_name) intosql;
 -- Propagate insertEXECUTEsqlUSINGNEW;
 RETURN NEW; -- RETURN NULL; if no ORMEND;
$BODY$
LANGUAGE plpgsql;

The first thing to notice is the use of TG_TABLE_NAME. Since we are talking about the trigger, PostgreSQL fills in for us quite a few variables that we can access. Full list can be found here .

In our case, we want to get the parent name of the table on which the trigger worked. In our case it will be rides. We use a similar approach in several microservices, and this part can be transferred almost unchanged.

PERFORMuseful if we want to call a function that returns nothing. Usually in the examples all the logic is tried to be put into one function, but we try to be careful.

USING NEWindicates that in this query we are using the values from the string we tried to add.

$1.*expands all the new line values. In fact, it can be translated into NEW.*. What translates into the NEW.ID, NEW.TENANT_ID, …

Next procedure that we call with the help PERFORMwill create a new partition if it does not already exist. This will happen once a period for each tenant.

CREATEORREPLACEFUNCTION create_new_partition(parent_table_name text,
                                           env text,
                                           partition_date timestamp,
                                           partition_name text) RETURNSVOIDAS
$BODY$
DECLAREsqltext;
BEGIN-- NotifyingRAISENOTICE'A new % partition will be created: %', parent_table_name, partition_name;
 selectformat('CREATE TABLE IF NOT EXISTS %s (CHECK (
         tenant_id = ''%s'' AND
         created_at AT TIME ZONE ''UTC'' > ''%s'' AND
         created_at AT TIME ZONE ''UTC'' <= ''%s''))
         INHERITS (%I)', partition_name, env, partition_date,
               partition_date + interval'1 month', parent_table_name) intosql;
 -- New table, inherited from a master oneEXECUTEsql;
 PERFORM index_partition(partition_name);
END;
$BODY$
LANGUAGE plpgsql;

As described earlier, we use INHERITSto create a table similar to parent, and CHECKin order to determine which data should go there.

RAISE NOTICEjust prints the string to the console. If we run INSERTfrom psql now, we can see if the partition has been created.

We have a new problem. INHERITSdoes not inherit indexes. To do this, we have two solutions:

Create indexes using inheritance:
Use CREATE TABLE LIKE, and then ALTER TABLE INHERITS

Or create indexes procedurally:

CREATEORREPLACEFUNCTION index_partition(partition_name text) RETURNSVOIDAS
$BODY$
BEGIN-- Ensure we have all the necessary indices in this partition;EXECUTE'CREATE INDEX IF NOT EXISTS ' || partition_name || '_tenant_timezone_idx ON ' || partition_name || ' (tenant_id, timezone(''UTC''::text, created_at))';
 -- More indexes here...END;
$BODY$
LANGUAGE plpgsql;

It is very important not to forget about child indexing of tables, since even after partitioning there will be millions of rows in each of them. Indexes on parent are not needed in our case, since the parent will always remain empty.

Finally, we create a trigger that will be called when creating a new line:

CREATETRIGGER before_insert_row_trigger
BEFOREINSERTON rides
FOREACHROWEXECUTEPROCEDURE insert_row();

There is another subtlety that is rarely focused on. It is best to partition by columns, the data in which never change. In our case, it works: the trip never changes tenant_id and created_at. The problem that arises if it is not so - PostreSQL will not return part of the data to us. We then promised him a CHECK that all data is valid.

There are several solutions (except for the obvious - do not mutate data by which we partition):

Instead of UPDATE'and at the application level, we always do DELETE+INSERT
Add another trigger to UPDATE, which will transfer data to the correct partition

Another thing to consider is how to properly index columns containing dates. If we use AT TIME ZONE requests, we need to remember that this is actually a function call. So, our index should be function based. I forgot. As a result, the base is again dead from the load.

The last aspect to consider is how partitions interact with various ORM frameworks, be it ActiveRecord in Ruby or GORM in Go.

PostgreSQL partitions rely on the fact that the parent table is always empty. If you do not use ORM, then you can safely return to the first stored procedure, and change RETURN NEW; on RETURN NULL ;. Then the row in the parent table simply does not add what we actually want.

But the fact is that most ORMs use the INSERT RETURNING clause. If we return NULL from our trigger, ORM will panic, considering that the line is not added. It is added, but not where the ORM is looking.

There are several ways to get around this:

Do not use ORM for INSERTs
Patch ORM (which is sometimes advised in case of ActiveRecord)
Add another trigger that will remove the string from parent.

The last option is undesirable, since for each operation we will perform three. But nevertheless, sometimes inevitable, because we consider it separately:

CREATEORREPLACEFUNCTION delete_parent_row()
 RETURNSTRIGGERAS
$BODY$
DECLAREBEGINdeletefromonly rides whereid = NEW.ID;
 RETURN null;
END;
$BODY$
LANGUAGE plpgsql;

CREATETRIGGER after_insert_row_trigger
AFTERINSERTON rides
FOREACHROWEXECUTEPROCEDURE delete_parent_row();

The last thing left for us to do is test our solution. To do this, we generate a certain number of lines:

DO
$script$
DECLARE
 year_start_epoch bigint := extract(epoch from'20170101'::timestamptz attime zone 'UTC');
 delta bigint := extract(epoch from '20171231 23:59:59'::timestamptz at time zone 'UTC') - year_start_epoch;
 tenant varchar;
 tenants varchar[] := array['tenant_a', 'tenant_b', 'tenant_c', 'tenant_d'];
BEGIN
 FOREACH tenant INARRAY tenants LOOPFOR i IN1..100000LOOPinsertinto rides (tenant_id, created_at, ride_id)
     values (tenant, to_timestamp(random() * delta + year_start_epoch) attime zone 'UTC', i);
   ENDLOOP;
 ENDLOOP;
END
$script$;

And let's see how the database behaves:

explainselect *
from rides
where tenant_id = 'tenant_a'and created_at ATTIME ZONE 'UTC' > '20171102'and created_at ATTIME ZONE 'UTC' <= '20171103';

If everything went as it should, we should see the following result:

 Append  (cost=0.00..4803.76 rows=4 width=196)
   ->  Seq Scan on rides  (cost=0.00..4795.46 rows=3 width=196)
         Filter: (((created_at)::timestamp without time zone > '2017-11-02 00:00:00'::timestamp without time zone) AND ((created_at)::timestamp without time zone <= '2017-11-03 00:00:00'::timestamp without time zone) AND ((tenant_id)::text = 'tenant_a'::text))
   ->  Index Scan using rides_tenant_a_2017_11_tenant_timezone_idx on rides_tenant_a_2017_11  (cost=0.28..8.30 rows=1 width=196)
         Index Cond: (((tenant_id)::text = 'tenant_a'::text) AND ((created_at)::timestamp without time zone > '2017-11-02 00:00:00'::timestamp without time zone) AND ((created_at)::timestamp without time zone <= '2017-11-03 00:00:00'::timestamp without time zone))
(5 rows)

Despite the fact that each tenant has a hundred thousand rows, we only select from the required data slice. Success!

I hope that this article was interesting for those who were not yet familiar with what partitioning is and how it is implemented in PostgreSQL. And those for whom this topic is no longer new, still learned a couple of interesting tricks.

UPD:
As bigtrot correctly noted , all this street magic will not work if the CONSTRAINT_EXCLUSION setting is turned off .

You can check this with the command

show CONSTRAINT_EXCLUSION

The setting has three values: on, off and partition

Setting up a partition is more optimal if you suddenly like to use CHECK CONSTRAINTS not only for partitions, but also for normalizing data.

Tags:

Understanding the partitions in PostgreSQL 9

Also popular now: