MySQL and partitioning
After reading the article I remembered about one very specific customer and system for collecting statistics on events. It’s the 21st century and I’m aware of the availability of ClickHouse , but the customer doesn’t want to change the database (the reason is not clear to me and religion is probably not allowed), and even so, I warned him several times about the consequences. When it becomes slow at all, he realizes the problem.
But this is not about that. In general, after reading the article, I remembered about this project and decided to try to integrate partitioning into a table with 7,000,000 records. There are already many more entries on the prod stand.
Also, sharding was used in the project, which, by and large, is superfluous there. It makes no sense in such a system to do sharding and even in time (each month has its own table).
In general, there were really few options for how to divide the data, and the most obvious was chosen: add the column dYm (date Year month) to the table, since the time is already written in the table, it was not difficult to do this. True, with a certain caveat, since there is not enough memory on the server, I had to recreate the table and import the data into a new table, after adding the necessary field.
Creating a table with partitions (removed some of the fields):
As described in the article that I cited initially, the advantages of this division are obvious:
Next, you need to optimize the queries, because with an illiterate query, MySQL will go through all partitions, which adds additional costs, which is not very good.
After reading the article , the optimization decision also suggests itself: we need to use the search through between using a unique key in the request. As a result, if in the application all requests are replaced with the following:
then we get a very good explain:
And we have achieved the following:
The essence of the problem
But this is not about that. In general, after reading the article, I remembered about this project and decided to try to integrate partitioning into a table with 7,000,000 records. There are already many more entries on the prod stand.
Also, sharding was used in the project, which, by and large, is superfluous there. It makes no sense in such a system to do sharding and even in time (each month has its own table).
In general, there were really few options for how to divide the data, and the most obvious was chosen: add the column dYm (date Year month) to the table, since the time is already written in the table, it was not difficult to do this. True, with a certain caveat, since there is not enough memory on the server, I had to recreate the table and import the data into a new table, after adding the necessary field.
Creating a table with partitions (removed some of the fields):
CREATE TABLE `event_list_test` (
`dYd` int(6) unsigned NOT NULL COMMENT 'год и месяц',
`hash` varchar(13) NOT NULL COMMENT 'hash',
`time` int(10) unsigned NOT NULL COMMENT 'timestamp',
PRIMARY KEY (`time`,`dYd`,`hash`),
) ENGINE=InnoDB DEFAULT CHARSET=utf8
PARTITION BY LIST (dYd)
(PARTITION p201703 VALUES IN (201703) ENGINE = InnoDB,
PARTITION p201704 VALUES IN (201704) ENGINE = InnoDB,
PARTITION p201705 VALUES IN (201705) ENGINE = InnoDB,
PARTITION p201706 VALUES IN (201706) ENGINE = InnoDB,
PARTITION p201707 VALUES IN (201707) ENGINE = InnoDB,
PARTITION p201708 VALUES IN (201708) ENGINE = InnoDB,
PARTITION p201709 VALUES IN (201709) ENGINE = InnoDB,
PARTITION p201710 VALUES IN (201710) ENGINE = InnoDB,
PARTITION p201711 VALUES IN (201711) ENGINE = InnoDB,
PARTITION p201712 VALUES IN (201712) ENGINE = InnoDB)
As described in the article that I cited initially, the advantages of this division are obvious:
- Ease of administration, as a column can no longer be added to a table with 7,000,000 rows with 1GB of memory, and the index is even more so
- Initially, these kinds of tables were sharded, but the obvious minus is writing sql queries. Often it was necessary to make inquiries for several months, and if aggregation is needed, then this is really a disaster.
- Adding a partition to the table is as simple as possible (especially if you put in the crowns the task of sending a letter to telegram or e-mail, then to whom it is more convenient)
Next, you need to optimize the queries, because with an illiterate query, MySQL will go through all partitions, which adds additional costs, which is not very good.
After reading the article , the optimization decision also suggests itself: we need to use the search through between using a unique key in the request. As a result, if in the application all requests are replaced with the following:
SELECT `time` FROM `event_list_test` WHERE (`time` BETWEEN 1505385901 AND 1506934784) AND (`dYd` BETWEEN 201709 AND 201710) LIMIT 10
then we get a very good explain:
SIMPLE event_list_test p201709,p201710 range PRIMARY,time PRIMARY 8 NULL 145875 11.11 Using where
What have we achieved?
And we have achieved the following:
- Unnecessary data sharding has disappeared
- It is very easy to build data requests (there were a lot of problems with sharding)
- It’s very easy to administer tables (insert partition, delete partition, insert data, select data, modify table, work with indexes)
- And as a result - simplification of the application at times.