Parallelization of long operations
I often deal with tasks that demand very high performance from a database when processing large amounts of data. Today I will describe a very simple but effective technique that can help you out when the database no longer keeps up with the data that accumulates and has to be processed. The method does not depend on the database, but out of habit I am publishing it in the PostgreSQL blog, and the example will use it. Let's move on to the example.
Spherical horse
Suppose we are dealing with the simplest possible billing system (the method is obviously not limited to billing, but billing makes the idea easy to see). The table that collects data about subscribers' calls will, in our case, have the following format:
CREATE TABLE billing.calls
(
    call_id        BIGINT,
    call_time      TIMESTAMP,
    subscriber_id  INTEGER,
    duration       INTERVAL
);
So far, everything is simple. Call records land in this table at a furious rate and have to be charged within a reasonable time.
For billing, we have a database function with the following signature:
FUNCTION calculate(IN subscriber_id INTEGER, IN duration INTERVAL, OUT status_code TEXT) RETURNS TEXT
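The body of this function is not important for the technique itself; for completeness, here is a minimal sketch of what it could look like. The billing.accounts table, the flat per-second rate, and the status codes are assumptions made purely for illustration:
CREATE OR REPLACE FUNCTION calculate(
    IN  subscriber_id INTEGER,
    IN  duration      INTERVAL,
    OUT status_code   TEXT
) RETURNS TEXT AS $$
DECLARE
    amount NUMERIC;
BEGIN
    -- Assumed tariff: a flat 0.01 per second of call time.
    amount := EXTRACT(EPOCH FROM duration) * 0.01;

    -- billing.accounts is a hypothetical table holding subscriber balances.
    UPDATE billing.accounts a
       SET balance = balance - amount
     WHERE a.subscriber_id = calculate.subscriber_id;

    IF FOUND THEN
        status_code := 'OK';
    ELSE
        status_code := 'NO_ACCOUNT';
    END IF;
END;
$$ LANGUAGE plpgsql;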
We run billing every 5 minutes with the following query:
SELECT calculate(subscriber_id, duration) FROM billing.calls;
At some point we realize that this query simply does not finish within 5 minutes. Meanwhile even more data arrives, then more, and now we sit and wait for the night, when the stream slows down a little and the backlog finally clears. Not a pleasant prospect. And we are not waiting alone: the remaining 3 (for example) cores of our server sit idle with us while a single core grinds through the query. PostgreSQL, unfortunately, cannot parallelize a single query, but in our case it does not have to. Much better results come from a very simple and obvious trick. We create an index on the expression "remainder of dividing subscriber_id by 4":
CREATE INDEX calls_subscriber_id_mod_idx ON billing.calls USING btree ((subscriber_id % 4));
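A quick sanity check that a worker's query can actually use this index (the exact plan will depend on table size and statistics; the essential point is that the four WHERE clauses split the table into disjoint quarters):
EXPLAIN
SELECT calculate(subscriber_id, duration)
  FROM billing.calls
 WHERE subscriber_id % 4 = 0;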
And now we run it in four threads (for example, as four different jobs):
SELECT calculate(subscriber_id, duration) FROM billing.calls WHERE subscriber_id % 4 = @mod;
where @mod is 0, 1, 2, or 3 (each worker gets its own value).
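In other words, each of the four jobs (scheduled via cron, pgAgent, or whatever scheduler you use) runs its own copy of the query with a hard-coded remainder, each in its own connection, so they actually execute concurrently:
-- Job 1
SELECT calculate(subscriber_id, duration) FROM billing.calls WHERE subscriber_id % 4 = 0;
-- Job 2
SELECT calculate(subscriber_id, duration) FROM billing.calls WHERE subscriber_id % 4 = 1;
-- Job 3
SELECT calculate(subscriber_id, duration) FROM billing.calls WHERE subscriber_id % 4 = 2;
-- Job 4
SELECT calculate(subscriber_id, duration) FROM billing.calls WHERE subscriber_id % 4 = 3;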
As a result
This technique also avoids the locking problems that would arise if two different workers picked up calls from the same subscriber: all calls of a given subscriber have the same remainder, so they always land in the same worker. In addition, these jobs running in parallel will finish faster than if we had relied on the database itself to parallelize the query (if we were running not PostgreSQL but, say, Oracle).
The method works in any database that supports expression (functional) indexes, such as Oracle or PostgreSQL. In MS SQL Server you can create a computed column and index it. MySQL has no functional indexes, but as a workaround you can add an ordinary column with an index on it and keep it up to date with a trigger.
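For illustration, the MySQL workaround could look roughly like this (a sketch only: the column, index, and trigger names are made up, and existing rows need a one-time backfill):
-- An ordinary column plus index stands in for the missing expression index.
ALTER TABLE billing.calls ADD COLUMN subscriber_id_mod TINYINT;
UPDATE billing.calls SET subscriber_id_mod = subscriber_id % 4;
CREATE INDEX calls_subscriber_id_mod_idx ON billing.calls (subscriber_id_mod);

-- Keep the column up to date for new rows.
CREATE TRIGGER billing.calls_set_mod
BEFORE INSERT ON billing.calls
FOR EACH ROW
SET NEW.subscriber_id_mod = NEW.subscriber_id % 4;

-- The worker queries then filter on the column instead of the expression:
-- SELECT ... FROM billing.calls WHERE subscriber_id_mod = 0;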