Determine the order of columns in a composite index

Original author: Baron Schwartz
  • Transfer
I want to share a simple empirical method that I use to determine in which order the columns should go in a composite index. This method is not only suitable for MySQL, it is also applicable to any DBMS that uses b-tree indexes.

Let's start with a query that returns an empty result, but does a full table scan. EXPLAIN will show on it that there are no indexes available (i.e. possible_keys = NULL)

SELECT * FROM tbl
WHERE
  status='waiting' AND
  source='twitter' AND
  no_send_before <= '2009-05-28 03:17:50' AND
  tries <= 20
ORDER BY date ASC LIMIT 1;

Do not try to understand the meaning of this request, it is given only as an example. In the simplest case, we want to put the most selective column in the index first, so that the possible number of matching rows is minimal, so we will find the necessary rows as quickly as possible. Assuming that all columns have some distribution of values, we can simply calculate the number of matches for each condition.

SELECT
    sum(status='waiting'),
    sum(source='twitter'),
    sum(no_send_before <= '2009-05-28 03:17:50'),
    sum(tries <= 20),
    count(*)
FROM tbl\G

*************************** 1. row ***************************
                     sum(status ='waiting'): 550
                      sum(source='twitter'): 37271
sum(no_send_before <= '2009-05-28 03:17:50'): 36975
                            sum(tries <= 20): 36569
                                    count(*): 37271


It's simple - I wrapped each condition with the SUM () function, which for MySQL is equivalent to COUNT (number_time_when_tru). As you can see, the most selective condition is “status = waiting”. Let's put this column in the index first, after which we transfer the condition from SELECT to WHERE and run the query again to count the matches in the remaining set.

SELECT
    sum(source='twitter'),
    sum(no_send_before <= '2009-05-28 03:17:50'),
    sum(tries <= 20),
    count(*)
FROM tbl
WHERE
    status='waiting'\G
*************************** 1. row ***************************
                      sum(source='twitter'): 549
sum(no_send_before <= '2009-05-28 03:17:50'): 255
                            sum(tries <= 20): 294
                                    count(*): 549


Now we are down to an acceptable number of lines. It can be seen that the “source” does not have selectivity at all, that is, Using it, you won’t be able to filter anything, and adding it to the index will not bring any benefit. You can filter the remaining set using either 'no_send_before' or 'tries'. Running a query with any of them in where will reduce the number of matches for another condition to zero.

SELECT
    sum(source='twitter'),
    sum(no_send_before <= '2009-05-28 03:17:50'),
    sum(tries <= 20),
    count(*)
FROM tbl
WHERE
    status='waiting' AND
    no_send_before <= '2009-05-28 03:17:50'\G
*************************** 1. row ***************************
                      sum(source='twitter'): 255
sum(no_send_before <= '2009-05-28 03:17:50'): 255
                            sum(tries <= 20): 0
                                    count(*): 255

***************************************************************
                                         
SELECT
    sum(source='twitter'),
    sum(no_send_before <= '2009-05-28 03:17:50'),
    sum(tries <= 20),
    count(*)
FROM tbl
WHERE
    status='waiting' AND
    tries <= 20\G
*************************** 1. row ***************************
                      sum(source='twitter'): 294
sum(no_send_before <= '2009-05-28 03:17:50'): 0
                            sum(tries <= 20): 294
                                    count(*): 294
* This source code was highlighted with Source Code Highlighter.


This means that we can make an index with any of them - (status, tries) or (status, no_send_before) and we will find our zero rows very efficiently. Which one is better depends on what this table is really used for (as well as on the availability and structure of other queries to this table - approx. Per.) .

Also popular now: