A little bit about query optimization
I want to tell you with a simple example how sometimes it is possible to optimize very simple at first glance queries. Let's take such code, for example, on PostgreSQL 9.3, but the principle applies to all subdos in which hash join is present.
The task is simple - to join two tables - one is very large, the other is small - but the join is not simple, butgolden with OR. (As a real case - join of the table of postings on accounts to the accounts themselves, given that there are two fields with an account in the posting - for debit and credit.)
We prepare test data:
And here is our request in its original form, with the condition of or.
A join with this condition will receive a guaranteed nested loop. The plan is as follows:
Pay attention to two million cycles of passage through a small table. This is the main minus of the nested loop. Even if there were two million searches by index, it’s still bad.
What to do?
Had join would help us - a hash of a small table that fits into memory + one pass through a large one - perfect. But with OR, the condition is not to get it. There is an exit!
Let's make two joins on different fields and put OR in the filter:
The plan has become much faster. A 5kb hash table is what you need: even taking into account that there are two joins, you win ten times!
Thanks for attention.
The task is simple - to join two tables - one is very large, the other is small - but the join is not simple, but
We prepare test data:
create table public.test1
( id bigint,
id2 bigint,
value varchar(100)
);
create table public.test2
( id bigint,
value varchar(100)
) ;
insert into public.test1
select generate_series(1,2000000), 1, 'abcdef';
insert into public.test2
select generate_series(1,100), 'abcdefssdf';
create index ix_test2_id on public.test2 (id);
/* наличие индекса на маленькой таблице принципиально ничего не меняет, но он тут в реальности наверняка будет, так что пусть и в тесте пусть будет. */
And here is our request in its original form, with the condition of or.
select *
from public.test1 t1
inner join public.test2 t2 on t1.id = t2.id or t1.id2 = t2.id;
A join with this condition will receive a guaranteed nested loop. The plan is as follows:
"Nested Loop (cost=0.00..3532741.25 rows=2000099 width=42) (actual time=0.043..61627.189 rows=2000099 loops=1)"
" Join Filter: ((t1.id = t2.id) OR (t1.id2 = t2.id))"
" Rows Removed by Join Filter: 197999901"
" -> Seq Scan on test1 t1 (cost=0.00..32739.00 rows=2000000 width=23) (actual time=0.010..385.658 rows=2000000 loops=1)"
" -> Materialize (cost=0.00..2.50 rows=100 width=19) (actual time=0.000..0.007 rows=100 loops=2000000)"
" -> Seq Scan on test2 t2 (cost=0.00..2.00 rows=100 width=19) (actual time=0.005..0.018 rows=100 loops=1)"
"Total runtime: 61717.751 ms"
Pay attention to two million cycles of passage through a small table. This is the main minus of the nested loop. Even if there were two million searches by index, it’s still bad.
What to do?
Had join would help us - a hash of a small table that fits into memory + one pass through a large one - perfect. But with OR, the condition is not to get it. There is an exit!
Let's make two joins on different fields and put OR in the filter:
select *
from public.test1 t1
left join public.test2 t2 on t1.id = t2.id
left join public.test2 t3 on t1.id2 = t3.id
where t2.id is not null or t3.id is not null;
The plan has become much faster. A 5kb hash table is what you need: even taking into account that there are two joins, you win ten times!
"Hash Left Join (cost=6.50..67746.50 rows=2000000 width=61) (actual time=0.124..2230.636 rows=2000000 loops=1)"
" Hash Cond: (t1.id2 = t3.id)"
" Filter: ((t2.id IS NOT NULL) OR (t3.id IS NOT NULL))"
" -> Hash Left Join (cost=3.25..40243.25 rows=2000000 width=42) (actual time=0.073..1065.822 rows=2000000 loops=1)"
" Hash Cond: (t1.id = t2.id)"
" -> Seq Scan on test1 t1 (cost=0.00..32739.00 rows=2000000 width=23) (actual time=0.012..338.759 rows=2000000 loops=1)"
" -> Hash (cost=2.00..2.00 rows=100 width=19) (actual time=0.041..0.041 rows=100 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 5kB"
" -> Seq Scan on test2 t2 (cost=0.00..2.00 rows=100 width=19) (actual time=0.004..0.015 rows=100 loops=1)"
" -> Hash (cost=2.00..2.00 rows=100 width=19) (actual time=0.039..0.039 rows=100 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 5kB"
" -> Seq Scan on test2 t3 (cost=0.00..2.00 rows=100 width=19) (actual time=0.004..0.016 rows=100 loops=1)"
"Total runtime: 2318.009 ms"
Thanks for attention.