erogov December 4, 2017 at 11:06

PostgreSQL Indexes - 8

We have already examined the PostgreSQL indexing mechanism , the access method interface and all the basic access methods, such as hash indexes , B-trees , GiST , SP-GiST and GIN . And in this part, let's look at turning gin into rum.

RUM

Although the authors claim that gin is a powerful spirit, the theme of drinks still won: the next generation GIN was called RUM.

This access method develops the idea embodied in the GIN and allows you to perform full-text searches even faster. This is the only method in this series of articles that is not part of the standard PostgreSQL package and is a third-party extension. There are several options for installing it:

Get the yum or apt package from the PGDG repository . For example, if you installed PostgreSQL from the postgresql-10 package, then postgresql-10-rum.
Self-assemble and install from the source code on github (instructions there).
Use as part of Postgres Pro Enterprise (or at least read the documentation from there ).

GIN Limitations

What are the limitations of the GIN index that RUM can overcome?

First, the tsvector data type, in addition to the tokens themselves, contains information about their positions within the document. In the GIN index, as we saw last time , this information is not stored. Because of this, the phrasal search operations that appeared in version 9.6 are served by the GIN index inefficiently and are forced to access the original data for double-checking.

Secondly, search engines usually return results in order of relevance (whatever that means). To do this, you can use the ranking functions ts_rank and ts_rank_cd, but they have to be calculated for each row of the result, which, of course, is slow.

As a first approximation, the RUM access method can be considered as a GIN, in which positional information is added, and which supports the output of the result in the desired order (similar to how GiST can issue the nearest neighbors). Let's go in order.

Phrase Search

A query in a full-text search may contain special constructions that take into account the distance between tokens. For example, you can find documents in which one more word separates the grandmother from the grandfather: Or indicate that the words should stand behind each other: The regular GIN index can produce documents that contain both tokens, but you can check the distance between them only by looking at tsvector: In the RUM index, each token does not just refer to table rows: along with each TID, there is also a list of positions in which the token appears in the document. Here's how you can imagine an index created on a table already familiar to us with a white birch (by default, the rum_tsvector_ops operator class is used for tsvector): The gray squares in the figure are the added positional information:

postgres=# select to_tsvector('Бабка за дедку, дедка за репку...') @@

                  to_tsquery('бабка <2> дедка');

 ?column?

----------

 t

(1 row)

postgres=# select to_tsvector('Бабка за дедку, дедка за репку...') @@

                  to_tsquery('дедка <-> дедка');

 ?column?

----------

 t

(1 row)

postgres=# select to_tsvector('Бабка за дедку, дедка за репку...');

         to_tsvector

------------------------------

 'бабк':1 'дедк':3,4 'репк':6

(1 row)

postgres=# create extension rum;

CREATE EXTENSION

postgres=# create index on ts using rum(doc_tsv);

CREATE INDEX

postgres=# select ctid, doc, doc_tsv from ts;

  ctid  |           doc           |            doc_tsv            

--------+-------------------------+--------------------------------

 (0,1)  | Во поле береза стояла   | 'берез':3 'пол':2 'стоя':4

 (0,2)  | Во поле кудрявая стояла | 'кудряв':3 'пол':2 'стоя':4

 (0,3)  | Люли, люли, стояла      | 'люл':1,2 'стоя':3

 (0,4)  | Люли, люли, стояла      | 'люл':1,2 'стоя':3

 (1,1)  | Некому березу заломати  | 'берез':2 'заломат':3 'нек':1

 (1,2)  | Некому кудряву заломати | 'заломат':3 'кудряв':2 'нек':1

 (1,3)  | Люли, люли, заломати    | 'заломат':3 'люл':1,2

 (1,4)  | Люли, люли, заломати    | 'заломат':3 'люл':1,2

 (2,1)  | Я пойду погуляю         | 'погуля':3 'пойд':2

 (2,2)  | Белую березу заломаю    | 'бел':1 'берез':2 'залома':3

 (2,3)  | Люли, люли, заломаю     | 'залома':3 'люл':1,2

 (2,4)  | Люли, люли, заломаю     | 'залома':3 'люл':1,2

(12 rows)

There is still a delayed insertion in the GIN when specifying the fastupdate parameter; RUM has removed this functionality.

To see how the index works on real data, we use the pgsql-hackers mailing list archive that we know about . Here's how a query using a phrase search is executed with the GIN index: As you can see from the plan, the GIN index is used, but it returns 1776 potential matches, of which 259 remain, and 1517 are discarded at the rechecking stage. Now delete the GIN index and build the RUM. Now the index has all the necessary information and the search is performed accurately:

fts=# alter table mail_messages add column tsv tsvector;

ALTER TABLE

fts=# set default_text_search_config = default;

SET

fts=# update mail_messages

set tsv = to_tsvector(body_plain);

...

UPDATE 356125

fts=# create index tsv_gin on mail_messages using gin(tsv);

CREATE INDEX

fts=# explain (costs off, analyze)

select * from mail_messages where tsv @@ to_tsquery('hello <-> hackers');

                                   QUERY PLAN                                    

---------------------------------------------------------------------------------

 Bitmap Heap Scan on mail_messages (actual time=2.490..18.088 rows=259 loops=1)

   Recheck Cond: (tsv @@ to_tsquery('hello <-> hackers'::text))

   Rows Removed by Index Recheck: 1517

   Heap Blocks: exact=1503

   ->  Bitmap Index Scan on tsv_gin (actual time=2.204..2.204 rows=1776 loops=1)

         Index Cond: (tsv @@ to_tsquery('hello <-> hackers'::text))

 Planning time: 0.266 ms

 Execution time: 18.151 ms

(8 rows)

fts=# drop index tsv_gin;

DROP INDEX

fts=# create index tsv_rum on mail_messages using rum(tsv);

CREATE INDEX

fts=# explain (costs off, analyze)

select * from mail_messages

where tsv @@ to_tsquery('hello <-> hackers');

                                   QUERY PLAN                                  

--------------------------------------------------------------------------------

 Bitmap Heap Scan on mail_messages (actual time=2.798..3.015 rows=259 loops=1)

   Recheck Cond: (tsv @@ to_tsquery('hello <-> hackers'::text))

   Heap Blocks: exact=250

   ->  Bitmap Index Scan on tsv_rum (actual time=2.768..2.768 rows=259 loops=1)

         Index Cond: (tsv @@ to_tsquery('hello <-> hackers'::text))

 Planning time: 0.245 ms

 Execution time: 3.053 ms

(7 rows)

Relevance Sort

In order to issue documents immediately in the right order, the RUM index supports the ordering operators, which we discussed in the part about GiST . The rum extension defines such an operator <=>that returns a certain distance between the document (tsvector) and the query (tsquery). For example: The document turned out to be more relevant to the first request than to the second: the more often the word appears in the document, the less valuable it is. Again, try to compare GIN and RUM on a relatively large amount of data: we select the ten most relevant documents containing “hello” and “hackers”. The GIN index returns 1776 matches, which are then separately sorted to select the ten most suitable ones.

fts=# select to_tsvector('Бабка за дедку, дедка за репку...') <=> to_tsquery('репка');

 ?column?

----------

  16.4493

(1 row)


fts=# select to_tsvector('Бабка за дедку, дедка за репку...') <=> to_tsquery('дедка');

 ?column?

----------

  13.1595

(1 row)

fts=# explain (costs off, analyze)

select * from mail_messages 

where tsv @@ to_tsquery('hello & hackers') 

order by ts_rank(tsv,to_tsquery('hello & hackers')) 

limit 10;

                                         QUERY PLAN

---------------------------------------------------------------------------------------------

 Limit (actual time=27.076..27.078 rows=10 loops=1)

   ->  Sort (actual time=27.075..27.076 rows=10 loops=1)

         Sort Key: (ts_rank(tsv, to_tsquery('hello & hackers'::text)))

         Sort Method: top-N heapsort  Memory: 29kB

         ->  Bitmap Heap Scan on mail_messages (actual ... rows=1776 loops=1)

               Recheck Cond: (tsv @@ to_tsquery('hello & hackers'::text))

               Heap Blocks: exact=1503

               ->  Bitmap Index Scan on tsv_gin (actual ... rows=1776 loops=1)

                     Index Cond: (tsv @@ to_tsquery('hello & hackers'::text))

 Planning time: 0.276 ms

 Execution time: 27.121 ms

(11 rows)

With the RUM index, the query is performed by simple index scanning: no extra documents are scanned, no separate sorting is required:

fts=# explain (costs off, analyze)

select * from mail_messages

where tsv @@ to_tsquery('hello & hackers')

order by tsv <=> to_tsquery('hello & hackers')

limit 10;

                                         QUERY PLAN

--------------------------------------------------------------------------------------------

 Limit (actual time=5.083..5.171 rows=10 loops=1)

   ->  Index Scan using tsv_rum on mail_messages (actual ... rows=10 loops=1)

         Index Cond: (tsv @@ to_tsquery('hello & hackers'::text))

         Order By: (tsv <=> to_tsquery('hello & hackers'::text))

 Planning time: 0.244 ms

 Execution time: 5.207 ms

(6 rows)

Additional Information

The RUM index, like the GIN, can be built on several fields. But if the GIN tokens of different columns are stored independently of each other, then RUM allows you to "connect" the main field (tsvector in our case) with an additional one. To do this, use a special class of operators rum_tsvector_addon_ops: You can use such an index to display the results in sort order by an additional field: Here we look for suitable lines located as close as possible to the specified date, it does not matter sooner or later. To get results strictly preceding the date (or following it), you need to use the operation (or ). The query, as we expect, is performed by a simple index scan:

fts=# create index on mail_messages using rum(tsv rum_tsvector_addon_ops, sent)

  with (attach='sent', to='tsv');

CREATE INDEX

fts=# select id, sent, sent <=> '2017-01-01 15:00:00'

from mail_messages

where tsv @@ to_tsquery('hello')

order by sent <=> '2017-01-01 15:00:00'

limit 10;

   id    |        sent         | ?column? 

---------+---------------------+----------

 2298548 | 2017-01-01 15:03:22 |      202

 2298547 | 2017-01-01 14:53:13 |      407

 2298545 | 2017-01-01 13:28:12 |     5508

 2298554 | 2017-01-01 18:30:45 |    12645

 2298530 | 2016-12-31 20:28:48 |    66672

 2298587 | 2017-01-02 12:39:26 |    77966

 2298588 | 2017-01-02 12:43:22 |    78202

 2298597 | 2017-01-02 13:48:02 |    82082

 2298606 | 2017-01-02 15:50:50 |    89450

 2298628 | 2017-01-02 18:55:49 |   100549

(10 rows)

<=||=>

ts=# explain (costs off)

select id, sent, sent <=> '2017-01-01 15:00:00' 

from mail_messages

where tsv @@ to_tsquery('hello')

order by sent <=> '2017-01-01 15:00:00'

limit 10;

                                   QUERY PLAN

---------------------------------------------------------------------------------

 Limit

   ->  Index Scan using mail_messages_tsv_sent_idx on mail_messages

         Index Cond: (tsv @@ to_tsquery('hello'::text))

         Order By: (sent <=> '2017-01-01 15:00:00'::timestamp without time zone)

(4 rows)

If we created an index without additional information about the relationship of the fields, then for a similar query we would have to sort all the results received from the index.

Of course, in addition to the date, you can add fields and other data types to the RUM index - almost all basic types are supported. For example, an online store can quickly display products by novelty (date), price (numeric), popularity or discount size (integer or floating point).

Other operator classes

For completeness, it’s worth mentioning about other available classes of operators.

Let's start with rum_tsvector_hash_ops and rum_tsvector_hash_addon_ops. They are in all respects similar to rum_tsvector_ops and rum_tsvector_addon_ops already discussed above, but not the token itself, but its hash code is stored in the index. This can reduce the size of the index, but, of course, makes the search less accurate and requires double-checking. In addition, the index no longer supports the search for partial matches.

The rum_tsquery_ops operator class is curious .It allows you to solve the "inverse" problem: find queries that match the document. Why might this be needed? For example, subscribe a user to new products by his filter. Or automatically classify new documents. Here is a simple example: The operator classes rum_anyarray_ops and rum_anyarray_addon_ops remain - they are designed to work not with tsvector, but with arrays. For GIN, this has already been considered the last time , so there is no reason to repeat it.

fts=# create table categories(query tsquery, category text);

CREATE TABLE

fts=# insert into categories values

  (to_tsquery('vacuum | autovacuum | freeze'), 'vacuum'),

  (to_tsquery('xmin | xmax | snapshot | isolation'), 'mvcc'),

  (to_tsquery('wal | (write & ahead & log) | durability'), 'wal');

INSERT 0 3

fts=# create index on categories using rum(query);

CREATE INDEX


fts=# select array_agg(category)

from categories

where to_tsvector(

  'Hello hackers, the attached patch greatly improves performance of tuple

   freezing and also reduces size of generated write-ahead logs.'

) @@ query;

  array_agg  

--------------

 {vacuum,wal}

(1 row)

Index and prerecord log size

It is clear that since RUM contains more information than GIN, then it will take up more space. Last time we compared the sizes of different indexes; add RUM to this table: As you can see, the volume has grown quite significantly - this is the fee for a quick search. Another non-obvious point that you should pay attention to is that RUM is an extension, that is, it can be installed without making any changes to the kernel of the system. This was made possible in version 9.6 thanks to the patch made by Alexander Korotkov

  rum   |  gin   |  gist  | btree

--------+--------+--------+--------

 457 MB | 179 MB | 125 MB | 546 MB

. One of the tasks that had to be solved was the generation of journal entries. The journaling mechanism must be absolutely reliable, so the extension should not be allowed into this kitchen. Instead of allowing the extension to create its own types of journal entries, this is done: the extension code informs you of the intention to change the page, makes any changes to it and signals completion, and the kernel of the system already compares the old and new versions of the page and generates the necessary unified journal ones records.

The current generation algorithm compares the pages byte-by-byte, finds the changed fragments and logs each such fragment along with the offset from the beginning of the page. This works well when changing just a few bytes, and when the page has changed completely. But if you add a fragment to the inside of the page by moving the rest of the content down (or, conversely, remove the fragment by moving the content up), significantly more bytes will formally change than was actually added or deleted.

Because of this, an actively changing RUM index can generate journal entries of significantly larger size than the GIN (which, being not an extension, but part of the kernel, manages the journal itself). The degree of this unpleasant effect depends heavily on the actual load, but in order to somehow feel the problem, let's try several times to delete and add a certain number of lines, alternating these actions with cleaning (vacuum). You can estimate the size of the log entries as follows: at the beginning and at the end, remember the position in the log with the pg_current_wal_location function (up to the tenth version - pg_current_xlog_location) and then look at their difference.

Here, of course, many factors must be kept in mind. You need to make sure that only one user is working in the system, otherwise “extra” entries will be taken into account. Even in this case, we take into account not only RUM, but also changes in the table itself and the index supporting the primary key. The values of configuration parameters also affect (here we used the replica log level, without compression). But still try. So, it turned out about 3 GB. And if you repeat the same experiment with the GIN index, there will be only about 700 MB. Therefore, I would like to have a different algorithm that finds the minimum number of insert and delete operations with which one page state can be brought to another - similar to how the diff utility works. Such an algorithm has already been implemented by Oleg Ivanov , his patch

fts=# select pg_current_wal_location() as start_lsn \gset


fts=# insert into mail_messages(parent_id, sent, subject, author, body_plain, tsv)

  select parent_id, sent, subject, author, body_plain, tsv

  from mail_messages where id % 100 = 0;

INSERT 0 3576

fts=# delete from mail_messages where id % 100 = 99;

DELETE 3590

fts=# vacuum mail_messages;

VACUUM


fts=# insert into mail_messages(parent_id, sent, subject, author, body_plain, tsv)

  select parent_id, sent, subject, author, body_plain, tsv

  from mail_messages where id % 100 = 1;

INSERT 0 3605

fts=# delete from mail_messages where id % 100 = 98;

DELETE 3637

fts=# vacuum mail_messages;

VACUUM


fts=# insert into mail_messages(parent_id, sent, subject, author, body_plain, tsv)

  select parent_id, sent, subject, author, body_plain, tsv from mail_messages

  where id % 100 = 2;

INSERT 0 3625

fts=# delete from mail_messages where id % 100 = 97;

DELETE 3668

fts=# vacuum mail_messages;

VACUUM


fts=# select pg_current_wal_location() as end_lsn \gset

fts=# select pg_size_pretty(:'end_lsn'::pg_lsn - :'start_lsn'::pg_lsn);

 pg_size_pretty

----------------

 3114 MB

(1 row)

is being discussed. In the above example, this patch, at the cost of a slight slowdown, can reduce the volume of journal entries by one and a half times, to 1900 MB.

The properties

Traditionally, we look at the properties of the rum access method (requests were given earlier ), paying attention to the differences from gin.

Method properties: Index properties: Note that RUM, unlike GIN, supports index scanning - otherwise it would be impossible to get exactly the required number of results in queries with the phrase limit. Accordingly, there is no need for an analogue of the gin_fuzzy_search_limit parameter. Well, and as a result, the index can be used to support exception constraints. Column Level Properties: The difference here is that RUM supports collating operators. Although not for all operator classes: for example, for tsquery_ops it will be false. To be continued .

 amname |     name      | pg_indexam_has_property

--------+---------------+-------------------------

 rum    | can_order     | f

 rum    | can_unique    | f

 rum    | can_multi_col | t

 rum    | can_exclude   | t -- f для gin

     name      | pg_index_has_property

---------------+-----------------------

 clusterable   | f

 index_scan    | t -- f для gin

 bitmap_scan   | t

 backward_scan | f

        name        | pg_index_column_has_property 

--------------------+------------------------------

 asc                | f

 desc               | f

 nulls_first        | f

 nulls_last         | f

 orderable          | f

 distance_orderable | t -- f для gin

 returnable         | f

 search_array       | f

 search_nulls       | f

Tags: