Using all the features of indexes in PostgreSQL

Original author: pgDash blog
  • Translation
  • Tutorial

In the Postgres world, indexes are essential for navigating a table's main data store (called the heap) efficiently. Postgres does not maintain clustering for the heap, and its MVCC architecture means that many versions of the same tuple can accumulate there. Being able to create and maintain effective indexes is therefore critical for application performance.

Here are some tips for optimizing and improving the use of indexes.

Note: the queries shown below work on an unmodified pagila sample database.

Using Covering Indexes


Let's look at a query that retrieves the email addresses of inactive customers. The customer table has a column called active, and the query is simple:

pagila=# EXPLAIN SELECT email FROM customer WHERE active=0;
                        QUERY PLAN
-----------------------------------------------------------
 Seq Scan on customer  (cost=0.00..16.49 rows=15 width=32)
   Filter: (active = 0)
(2 rows)

The query calls for a full sequential scan of the customer table. Let's create an index on the active column:

pagila=# CREATE INDEX idx_cust1 ON customer(active);
CREATE INDEX
pagila=# EXPLAIN SELECT email FROM customer WHERE active=0;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Index Scan using idx_cust1 on customer  (cost=0.28..12.29 rows=15 width=32)
   Index Cond: (active = 0)
(2 rows)

That helped: the sequential scan turned into an "Index Scan". This means Postgres will scan the index idx_cust1 and then visit the table's heap to read the values of the other columns the query needs (in this case, the email column).

PostgreSQL 11 introduced covering indexes. They let you include one or more additional columns in the index itself, storing their values alongside the index data.

If we use this feature to include the value of email inside the index, Postgres will not need to look into the table heap for it. Let's see if this works:

pagila=# CREATE INDEX idx_cust2 ON customer(active) INCLUDE (email);
CREATE INDEX
pagila=# EXPLAIN SELECT email FROM customer WHERE active=0;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Index Only Scan using idx_cust2 on customer  (cost=0.28..12.29 rows=15 width=32)
   Index Cond: (active = 0)
(2 rows)

"Index Only Scan" tells us that the query is now satisfied by the index alone, avoiding all disk I/O for reading the table's heap.

Currently, covering indexes are available only for B-trees. Note that the maintenance cost of such an index will be higher, since the included values must be kept in the index too.
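Since idx_cust2 covers everything idx_cust1 did, the narrower index is now redundant and only adds maintenance overhead. A quick sketch of comparing the footprint and cleaning up (index names from the examples above; the size output will vary with your data):

```sql
-- compare the on-disk size of the two indexes
SELECT pg_size_pretty(pg_relation_size('idx_cust1')) AS plain_index,
       pg_size_pretty(pg_relation_size('idx_cust2')) AS covering_index;

-- the covering index makes the plain one redundant
DROP INDEX idx_cust1;
```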

Using partial indexes


Partial indexes index only a subset of the rows in a table. This keeps indexes smaller and makes scanning them faster.

Suppose we need to get a list of email addresses of our customers in California. The query would be:

SELECT c.email FROM customer c
JOIN address a ON c.address_id = a.address_id
WHERE a.district = 'California';
Its query plan involves sequential scans of both joined tables:
pagila=# EXPLAIN SELECT c.email FROM customer c
pagila-# JOIN address a ON c.address_id = a.address_id
pagila-# WHERE a.district = 'California';
                              QUERY PLAN
----------------------------------------------------------------------
 Hash Join  (cost=15.65..32.22 rows=9 width=32)
   Hash Cond: (c.address_id = a.address_id)
   ->  Seq Scan on customer c  (cost=0.00..14.99 rows=599 width=34)
   ->  Hash  (cost=15.54..15.54 rows=9 width=4)
         ->  Seq Scan on address a  (cost=0.00..15.54 rows=9 width=4)
               Filter: (district = 'California'::text)
(6 rows)

Here is what a regular index gives us:

pagila=# CREATE INDEX idx_address1 ON address(district);
CREATE INDEX
pagila=# EXPLAIN SELECT c.email FROM customer c
pagila-# JOIN address a ON c.address_id = a.address_id
pagila-# WHERE a.district = 'California';
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Hash Join  (cost=12.98..29.55 rows=9 width=32)
   Hash Cond: (c.address_id = a.address_id)
   ->  Seq Scan on customer c  (cost=0.00..14.99 rows=599 width=34)
   ->  Hash  (cost=12.87..12.87 rows=9 width=4)
         ->  Bitmap Heap Scan on address a  (cost=4.34..12.87 rows=9 width=4)
               Recheck Cond: (district = 'California'::text)
               ->  Bitmap Index Scan on idx_address1  (cost=0.00..4.34 rows=9 width=0)
                     Index Cond: (district = 'California'::text)
(8 rows)

The sequential scan of address was replaced by an index scan over idx_address1, followed by heap fetches from the address table.

Since this is a frequent query that needs optimizing, we can use a partial index, which indexes only those address rows whose district is 'California':

pagila=# CREATE INDEX idx_address2 ON address(address_id) WHERE district='California';
CREATE INDEX
pagila=# EXPLAIN SELECT c.email FROM customer c
pagila-# JOIN address a ON c.address_id = a.address_id
pagila-# WHERE a.district = 'California';
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Hash Join  (cost=12.38..28.96 rows=9 width=32)
   Hash Cond: (c.address_id = a.address_id)
   ->  Seq Scan on customer c  (cost=0.00..14.99 rows=599 width=34)
   ->  Hash  (cost=12.27..12.27 rows=9 width=4)
         ->  Index Only Scan using idx_address2 on address a  (cost=0.14..12.27 rows=9 width=4)
(5 rows)

Now the query reads only idx_address2 and never touches the address table.
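One caveat worth noting: the planner can use idx_address2 only when a query's WHERE clause implies the index's predicate. A quick way to confirm this (the exact plan will vary) is to ask for a different district:

```sql
-- 'Texas' is not implied by the predicate district = 'California',
-- so the planner cannot use the partial index idx_address2 here
EXPLAIN SELECT c.email FROM customer c
JOIN address a ON c.address_id = a.address_id
WHERE a.district = 'Texas';
```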

Using Multi-Value Indexes


Some columns that need to be indexed do not hold a scalar data type. Column types such as jsonb, arrays, and tsvector contain multiple component values, and indexing such columns usually means being able to search over those individual values.

Let's try to find the titles of all films that include behind-the-scenes outtakes. The film table has a text-array column called special_features; if a film has this special feature, the array contains the element Behind The Scenes. To find all such films, we need to select every row where "Behind The Scenes" appears among the values of the special_features array:

SELECT title FROM film WHERE special_features @> '{"Behind The Scenes"}';

The containment operator @> checks whether the right-hand side is a subset of the left-hand side.
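You can try the operator on array literals directly to get a feel for its semantics:

```sql
SELECT ARRAY['a','b','c'] @> ARRAY['b'];   -- true: {b} is contained in {a,b,c}
SELECT ARRAY['b'] @> ARRAY['a','b','c'];   -- false: containment is not symmetric
```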

The query plan:

pagila=# EXPLAIN SELECT title FROM film
pagila-# WHERE special_features @> '{"Behind The Scenes"}';
                           QUERY PLAN
-----------------------------------------------------------------
 Seq Scan on film  (cost=0.00..67.50 rows=5 width=15)
   Filter: (special_features @> '{"Behind The Scenes"}'::text[])
(2 rows)

It calls for a full scan of the heap, with a cost of 67.

Let's see if a regular B-tree index helps us:

pagila=# CREATE INDEX idx_film1 ON film(special_features);
CREATE INDEX
pagila=# EXPLAIN SELECT title FROM film
pagila-# WHERE special_features @> '{"Behind The Scenes"}';
                           QUERY PLAN
-----------------------------------------------------------------
 Seq Scan on film  (cost=0.00..67.50 rows=5 width=15)
   Filter: (special_features @> '{"Behind The Scenes"}'::text[])
(2 rows)

The index was not even considered: a B-tree index has no awareness of the individual elements inside the indexed values.

We need a GIN index.

pagila=# CREATE INDEX idx_film2 ON film USING GIN(special_features);
CREATE INDEX
pagila=# EXPLAIN SELECT title FROM film
pagila-# WHERE special_features @> '{"Behind The Scenes"}';
                                QUERY PLAN
---------------------------------------------------------------------------
 Bitmap Heap Scan on film  (cost=8.04..23.58 rows=5 width=15)
   Recheck Cond: (special_features @> '{"Behind The Scenes"}'::text[])
   ->  Bitmap Index Scan on idx_film2  (cost=0.00..8.04 rows=5 width=0)
         Index Cond: (special_features @> '{"Behind The Scenes"}'::text[])
(4 rows)

The GIN index supports matching individual values against the indexed composite values, and as a result the query plan's cost drops by more than half.

Get rid of duplicate indexes


Indexes accumulate over time, and sometimes a new index ends up with the same definition as an existing one. You can use the catalog view pg_indexes to get human-readable SQL definitions of indexes, and from there easily find identical definitions:

 SELECT array_agg(indexname) AS indexes, replace(indexdef, indexname, '') AS defn
    FROM pg_indexes
GROUP BY defn
  HAVING count(*) > 1;
And here’s the result when run on the stock pagila database:
pagila=#   SELECT array_agg(indexname) AS indexes, replace(indexdef, indexname, '') AS defn
pagila-#     FROM pg_indexes
pagila-# GROUP BY defn
pagila-#   HAVING count(*) > 1;
                                indexes                                 |                                defn
------------------------------------------------------------------------+------------------------------------------------------------------
 {payment_p2017_01_customer_id_idx,idx_fk_payment_p2017_01_customer_id} | CREATE INDEX  ON public.payment_p2017_01 USING btree (customer_id
 {payment_p2017_02_customer_id_idx,idx_fk_payment_p2017_02_customer_id} | CREATE INDEX  ON public.payment_p2017_02 USING btree (customer_id
 {payment_p2017_03_customer_id_idx,idx_fk_payment_p2017_03_customer_id} | CREATE INDEX  ON public.payment_p2017_03 USING btree (customer_id
 {idx_fk_payment_p2017_04_customer_id,payment_p2017_04_customer_id_idx} | CREATE INDEX  ON public.payment_p2017_04 USING btree (customer_id
 {payment_p2017_05_customer_id_idx,idx_fk_payment_p2017_05_customer_id} | CREATE INDEX  ON public.payment_p2017_05 USING btree (customer_id
 {idx_fk_payment_p2017_06_customer_id,payment_p2017_06_customer_id_idx} | CREATE INDEX  ON public.payment_p2017_06 USING btree (customer_id
(6 rows)
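Once you have confirmed that two indexes really are identical (same columns, same access method, same predicate) and that neither backs a constraint, one of them can simply be dropped, for example:

```sql
-- drop one index of a duplicate pair; CONCURRENTLY avoids blocking writes
DROP INDEX CONCURRENTLY idx_fk_payment_p2017_01_customer_id;
```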

Superset Indexes


You may also accumulate indexes where one index covers a superset of the columns that another indexes. This can be desirable or not: the superset index can enable index-only scans, which is good, but it may take up too much space, or the query it was meant to optimize may no longer be in use.

If you want to automate the detection of such indexes, the pg_index table in pg_catalog is a good starting point.
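As a rough sketch of such automation, you can compare the indexed column lists (pg_index.indkey) of every pair of indexes on the same table: an index whose column list is a strict prefix of another's is a candidate for removal. This sketch compares indkey via its text form, which is a simplification that ignores expressions, predicates, and included columns:

```sql
SELECT a.indexrelid::regclass AS subset_index,
       b.indexrelid::regclass AS superset_index
FROM pg_index a
JOIN pg_index b
  ON a.indrelid = b.indrelid           -- indexes on the same table
 AND a.indexrelid <> b.indexrelid
WHERE b.indkey::text LIKE a.indkey::text || ' %';  -- a's columns are a prefix of b's
```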

Unused Indexes


As applications that use databases evolve, so do their queries. Indexes added earlier may no longer be used by any query. Each time an index is scanned, the statistics collector notes it, and the system catalog view pg_stat_user_indexes exposes the value idx_scan, a cumulative counter per index. Tracking this value over a period of time (say, a month) gives a good idea of which indexes are unused and could be dropped.

Here is a query to get the current scan counts for all indexes in the 'public' schema:

SELECT relname, indexrelname, idx_scan
FROM   pg_catalog.pg_stat_user_indexes
WHERE  schemaname = 'public';
with output like this:
pagila=# SELECT relname, indexrelname, idx_scan
pagila-# FROM   pg_catalog.pg_stat_user_indexes
pagila-# WHERE  schemaname = 'public'
pagila-# LIMIT  10;
    relname    |    indexrelname    | idx_scan
---------------+--------------------+----------
 customer      | customer_pkey      |    32093
 actor         | actor_pkey         |     5462
 address       | address_pkey       |      660
 category      | category_pkey      |     1000
 city          | city_pkey          |      609
 country       | country_pkey       |      604
 film_actor    | film_actor_pkey    |        0
 film_category | film_category_pkey |        0
 film          | film_pkey          |    11043
 inventory     | inventory_pkey     |    16048
(10 rows)
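Building on this, here is a sketch of a query that lists candidate indexes for removal, i.e. those never scanned since statistics were last reset, excluding unique indexes (which enforce constraints even when never scanned):

```sql
SELECT s.schemaname, s.relname AS table_name, s.indexrelname AS index_name
FROM pg_catalog.pg_stat_user_indexes s
JOIN pg_catalog.pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique;
```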

Re-create indexes with fewer locks


Indexes often have to be rebuilt: they can become bloated, and rebuilding an index can speed up the scans that use it. Indexes can also become corrupted, and changing an index's parameters may require rebuilding it as well.
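On PostgreSQL 11, a plain REINDEX takes a lock that blocks writes, so the low-lock way to rebuild an index is to create a replacement concurrently and then swap it in. A sketch, reusing the idx_address1 index from the earlier examples (on PostgreSQL 12 and later, REINDEX CONCURRENTLY does this for you):

```sql
-- build a replacement without blocking writes
CREATE INDEX CONCURRENTLY idx_address1_new ON address(district);

-- drop the bloated original, again without blocking writes
DROP INDEX CONCURRENTLY idx_address1;

-- let the replacement take over the old name
ALTER INDEX idx_address1_new RENAME TO idx_address1;
```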

Enable parallel index creation


In PostgreSQL 11, B-tree index creation can run in parallel, using several parallel workers to speed up the build. Make sure these configuration parameters are set appropriately:

SET max_parallel_workers = 32;
SET max_parallel_maintenance_workers = 16;

The default values are too small; ideally, these numbers should grow with the number of processor cores. Read the documentation for more details.

Background Index Creation


You can create an index in the background using the CONCURRENTLY option of the CREATE INDEX command:

pagila=# CREATE INDEX CONCURRENTLY idx_address1 ON address(district);
CREATE INDEX


This differs from a regular index build in that it does not take a lock that blocks writes to the table. On the other hand, it takes longer and consumes more resources.
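One caveat: if a concurrent build fails or is cancelled, it can leave behind an "invalid" index, which occupies space and is maintained on writes but is never used by queries. You can find such leftovers via pg_index and drop them:

```sql
SELECT indexrelid::regclass AS invalid_index
FROM pg_catalog.pg_index
WHERE NOT indisvalid;
```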

Postgres offers many flexible ways to create indexes and handle special cases, along with ways to manage them as your application grows. We hope these tips help you keep your queries fast and your database ready to scale.
