Preparing Full-Text Search in Postgres. Part 2
In the previous article we optimized search in PostgreSQL using standard tools. In this one we will continue optimizing with the RUM index and look at its pros and cons compared to GIN.
Introduction
RUM is a Postgres extension, a new index type for full-text search. It can return results already sorted by relevance while scanning the index. I will not dwell on its installation: it is described in the README in the repository.
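Once the extension's files are built and installed on the server, enabling it for a particular database is the standard one-liner (a sketch; the extension name comes from the RUM repository):

```sql
-- Enable the RUM extension in the current database.
-- Requires the extension files to be installed on the server first.
CREATE EXTENSION IF NOT EXISTS rum;
```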
Using the index
The index is created similarly to a GIN index, but with extra parameters. The full list of parameters can be found in the documentation.
CREATE INDEX idx_rum_document
ON documents_documentvector
USING rum ("text" rum_tsvector_ops);
A search query with RUM:
SELECT document_id, "text" <=> plainto_tsquery('запрос') AS rank
FROM documents_documentvector
WHERE "text" @@ plainto_tsquery('запрос')
ORDER BY rank;
For comparison, the same search with GIN and ts_rank:
SELECT document_id, ts_rank("text", plainto_tsquery('запрос')) AS rank
FROM documents_documentvector
WHERE "text" @@ plainto_tsquery('запрос')
ORDER BY rank DESC;
The difference from GIN is that relevance is obtained not with the ts_rank function but with the <=> operator: "text" <=> plainto_tsquery('запрос'). Such a query returns the distance between the search vector and the search query: the smaller it is, the better the query matches the vector.
Comparison with GIN
Here we will compare the two indexes on a test database of ~500 thousand documents to see the differences in search behavior.
Query speed
Let's see what EXPLAIN ANALYZE produces for GIN on this database:
Gather Merge (actual time=563.840..611.844 rows=119553 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  -> Sort (actual time=553.427..557.857 rows=39851 loops=3)
     Sort Key: (ts_rank(text, plainto_tsquery('запрос'::text)))
     Sort Method: external sort  Disk: 1248kB
     -> Parallel Bitmap Heap Scan on documents_documentvector (actual time=13.402..538.879 rows=39851 loops=3)
        Recheck Cond: (text @@ plainto_tsquery('запрос'::text))
        Heap Blocks: exact=5616
        -> Bitmap Index Scan on idx_gin_document (actual time=12.144..12.144 rows=119553 loops=1)
           Index Cond: (text @@ plainto_tsquery('запрос'::text))
Planning time: 4.573 ms
Execution time: 617.534 ms
And for RUM?
Sort (actual time=1668.573..1676.168 rows=119553 loops=1)
  Sort Key: ((text <=> plainto_tsquery('запрос'::text)))
  Sort Method: external merge  Disk: 3520kB
  -> Bitmap Heap Scan on documents_documentvector (actual time=16.706..1605.382 rows=119553 loops=1)
     Recheck Cond: (text @@ plainto_tsquery('запрос'::text))
     Heap Blocks: exact=15599
     -> Bitmap Index Scan on idx_rum_document (actual time=14.548..14.548 rows=119553 loops=1)
        Index Cond: (text @@ plainto_tsquery('запрос'::text))
Planning time: 0.650 ms
Execution time: 1679.315 ms
What's this? What's the use of this vaunted RUM, you ask, if it runs almost three times slower than GIN? And where is the famous sorting inside the index?
Don't panic: just try adding LIMIT 1000 to the query.
RUM:
Limit (actual time=115.568..137.313 rows=1000 loops=1)
  -> Index Scan using idx_rum_document on documents_documentvector (actual time=115.567..137.239 rows=1000 loops=1)
     Index Cond: (text @@ plainto_tsquery('запрос'::text))
     Order By: (text <=> plainto_tsquery('запрос'::text))
Planning time: 0.481 ms
Execution time: 137.678 ms
GIN:
Limit (actual time=579.905..585.650 rows=1000 loops=1)
  -> Gather Merge (actual time=579.904..585.604 rows=1000 loops=1)
     Workers Planned: 2
     Workers Launched: 2
     -> Sort (actual time=574.061..574.171 rows=992 loops=3)
        Sort Key: (ts_rank(text, plainto_tsquery('запрос'::text))) DESC
        Sort Method: external merge  Disk: 1224kB
        -> Parallel Bitmap Heap Scan on documents_documentvector (actual time=8.920..555.571 rows=39851 loops=3)
           Recheck Cond: (text @@ plainto_tsquery('запрос'::text))
           Heap Blocks: exact=5422
           -> Bitmap Index Scan on idx_gin_document (actual time=8.945..8.945 rows=119553 loops=1)
              Index Cond: (text @@ plainto_tsquery('запрос'::text))
Planning time: 0.223 ms
Execution time: 585.948 ms
~150 ms vs ~600 ms! No longer in GIN's favor, right? And the sorting has moved inside the index!
And what if we search with LIMIT 100?
RUM:
Limit (actual time=105.863..108.530 rows=100 loops=1)
  -> Index Scan using idx_rum_document on documents_documentvector (actual time=105.862..108.517 rows=100 loops=1)
     Index Cond: (text @@ plainto_tsquery('запрос'::text))
     Order By: (text <=> plainto_tsquery('запрос'::text))
Planning time: 0.199 ms
Execution time: 108.958 ms
GIN:
Limit (actual time=582.924..588.351 rows=100 loops=1)
  -> Gather Merge (actual time=582.923..588.344 rows=100 loops=1)
     Workers Planned: 2
     Workers Launched: 2
     -> Sort (actual time=573.809..573.889 rows=806 loops=3)
        Sort Key: (ts_rank(text, plainto_tsquery('запрос'::text))) DESC
        Sort Method: external merge  Disk: 1224kB
        -> Parallel Bitmap Heap Scan on documents_documentvector (actual time=18.038..552.827 rows=39851 loops=3)
           Recheck Cond: (text @@ plainto_tsquery('запрос'::text))
           Heap Blocks: exact=5275
           -> Bitmap Index Scan on idx_gin_document (actual time=16.541..16.541 rows=119553 loops=1)
              Index Cond: (text @@ plainto_tsquery('запрос'::text))
Planning time: 0.487 ms
Execution time: 588.583 ms
The difference is even more noticeable.
The thing is that GIN does not care how many rows you ultimately fetch: it has to walk through all the rows that match the query and rank every one of them. RUM does this only for the rows we actually need. If we need many rows, GIN wins, because ts_rank computes rankings more efficiently than the <=> operator. But on small result sets, RUM's advantage is undeniable.
Most often, the user does not need all 50 thousand matching documents from the database at once; they need only 10 results for the first, second, third page, and so on. It is exactly this case the index is designed for, and it gives a good boost to search performance on a large database.
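As a sketch of that page-by-page pattern (assuming the same table and index as above, and a page size of 10), page 3 could be fetched like this; since RUM streams rows out of the index already ordered by the <=> distance, only the first 30 matches need to be read and ranked:

```sql
-- Fetch page 3 (rows 21-30) of results, best matches first.
-- OFFSET = (page_number - 1) * page_size.
SELECT document_id,
       "text" <=> plainto_tsquery('запрос') AS rank
FROM documents_documentvector
WHERE "text" @@ plainto_tsquery('запрос')
ORDER BY rank
LIMIT 10 OFFSET 20;
```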
Join Tolerance
What if the search requires joining one or more other tables, for example to show the document's type and owner in the results, or, as in my case, to filter by the names of related entities?
Let's compare. A query with GIN:
SELECT document_id, ts_rank("text", plainto_tsquery('запрос')) AS rank, case_number
FROM documents_documentvector
RIGHT JOIN documents_document ON documents_documentvector.document_id = documents_document.id
LEFT JOIN documents_case ON documents_document.case_id = documents_case.id
WHERE "text" @@ plainto_tsquery('запрос')
ORDER BY rank DESC
LIMIT 10;
Result:
Limit (actual time=1637.902..1643.483 rows=10 loops=1)
  -> Gather Merge (actual time=1637.901..1643.479 rows=10 loops=1)
     Workers Planned: 2
     Workers Launched: 2
     -> Sort (actual time=1070.614..1070.687 rows=652 loops=3)
        Sort Key: (ts_rank(documents_documentvector.text, plainto_tsquery('запрос'::text))) DESC
        Sort Method: external merge  Disk: 2968kB
        -> Hash Left Join (actual time=323.386..1049.092 rows=39851 loops=3)
           Hash Cond: (documents_document.case_id = documents_case.id)
           -> Hash Join (actual time=239.312..324.797 rows=39851 loops=3)
              Hash Cond: (documents_documentvector.document_id = documents_document.id)
              -> Parallel Bitmap Heap Scan on documents_documentvector (actual time=11.022..37.073 rows=39851 loops=3)
                 Recheck Cond: (text @@ plainto_tsquery('запрос'::text))
                 Heap Blocks: exact=9362
                 -> Bitmap Index Scan on idx_gin_document (actual time=12.094..12.094 rows=119553 loops=1)
                    Index Cond: (text @@ plainto_tsquery('запрос'::text))
              -> Hash (actual time=227.856..227.856 rows=472089 loops=3)
                 Buckets: 65536  Batches: 16  Memory Usage: 2264kB
                 -> Seq Scan on documents_document (actual time=0.009..147.104 rows=472089 loops=3)
           -> Hash (actual time=83.338..83.338 rows=273695 loops=3)
              Buckets: 65536  Batches: 8  Memory Usage: 2602kB
              -> Seq Scan on documents_case (actual time=0.009..39.082 rows=273695 loops=3)
Planning time: 0.857 ms
Execution time: 1644.028 ms
With three or more joins, query time reaches 2-3 seconds and grows with the number of joins.
And what about RUM? Let's go straight to a query with five joins.
SELECT document_id, "text" <=> plainto_tsquery('запрос') AS rank, case_number,
classifier_procedure.title, classifier_division.title, classifier_category.title
FROM documents_documentvector
RIGHT JOIN documents_document ON documents_documentvector.document_id = documents_document.id
LEFT JOIN documents_case ON documents_document.case_id = documents_case.id
LEFT JOIN classifier_procedure ON documents_case.procedure_id = classifier_procedure.id
LEFT JOIN classifier_division ON documents_case.division_id = classifier_division.id
LEFT JOIN classifier_category ON documents_document.category_id = classifier_category.id
WHERE "text" @@ plainto_tsquery('запрос') AND documents_document.is_active IS TRUE
ORDER BY rank
LIMIT 10;
Result:
Limit (actual time=70.524..72.292 rows=10 loops=1)
  -> Nested Loop Left Join (actual time=70.521..72.279 rows=10 loops=1)
     -> Nested Loop Left Join (actual time=70.104..70.406 rows=10 loops=1)
        -> Nested Loop Left Join (actual time=70.089..70.351 rows=10 loops=1)
           -> Nested Loop Left Join (actual time=70.073..70.302 rows=10 loops=1)
              -> Nested Loop (actual time=70.052..70.201 rows=10 loops=1)
                 -> Index Scan using document_vector_rum_index on documents_documentvector (actual time=70.001..70.035 rows=10 loops=1)
                    Index Cond: (text @@ plainto_tsquery('запрос'::text))
                    Order By: (text <=> plainto_tsquery('запрос'::text))
                 -> Index Scan using documents_document_pkey on documents_document (actual time=0.013..0.013 rows=1 loops=10)
                    Index Cond: (id = documents_documentvector.document_id)
                    Filter: (is_active IS TRUE)
              -> Index Scan using documents_case_pkey on documents_case (actual time=0.009..0.009 rows=1 loops=10)
                 Index Cond: (documents_document.case_id = id)
           -> Index Scan using classifier_procedure_pkey on classifier_procedure (actual time=0.003..0.003 rows=1 loops=10)
              Index Cond: (documents_case.procedure_id = id)
        -> Index Scan using classifier_division_pkey on classifier_division (actual time=0.004..0.004 rows=1 loops=10)
           Index Cond: (documents_case.division_id = id)
     -> Index Scan using classifier_category_pkey on classifier_category (actual time=0.003..0.003 rows=1 loops=10)
        Index Cond: (documents_document.category_id = id)
Planning time: 2.861 ms
Execution time: 72.865 ms
If your search cannot do without joins, RUM is clearly the right choice.
Disk space
On the test database of ~500 thousand documents (3.6 GB), the indexes occupied very different amounts of space.
idx_rum_document | 1950 MB
idx_gin_document | 418 MB
Yes, disk space is cheap, but 2 GB instead of 400 MB is hard to like: half the size of the database is a bit much for an index. Here GIN wins unconditionally.
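To check such sizes yourself, the standard Postgres size functions work on any index; the index names here are the ones created earlier:

```sql
-- Report the on-disk size of both indexes in human-readable form.
SELECT pg_size_pretty(pg_relation_size('idx_rum_document')) AS rum_size,
       pg_size_pretty(pg_relation_size('idx_gin_document')) AS gin_size;
```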
Conclusions
You need RUM if:
- You have a lot of documents, but you give search results page by page
- You need sophisticated filtering of search results
- You do not mind the disk space
You will be completely satisfied with the GIN if:
- You have a small database
- You have a large database, but you need to return all results at once
- You do not need filtering with joins
- You want the minimum index size on disk
I hope this article removes a lot of the WTF?! moments that arise when setting up and tuning search in Postgres. I will be glad to hear advice from anyone who knows how to configure all this even better!
In the next parts I plan to say more about RUM in my project: using additional RUM options and working with the Django + PostgreSQL stack.