About interval indices
Under the cut, we will figure out whether a special index is needed for indexing intervals, how to deal with multidimensional intervals, whether it is true that a 2-dimensional rectangle can be treated as a 4-dimensional point, and so on. All of this with PostgreSQL as the example.
In development of the topic (1, 2, 3, 4, 5, 6, 7).
Sometimes data is represented not by exact values but, say, by confidence intervals. Or the data really are ranges of values. Then again, a rectangle is also an interval, just not a one-dimensional one. Is it possible, in logarithmic time, to check whether the data contains intervals intersecting a given one?
Of course. Sometimes ingenious schemes are devised that compute and index blocked values (with further fine filtering) based, for example, on Gray codes. In such a scheme the interval is encoded, with loss of precision, into a certain (indexed) number; when searching, the query interval generates a set of values (sub-intervals) that have to be checked, and this set is bounded by a logarithm.
But GiST can index time intervals (tsrange) using an R-tree.
create table reservations(during tsrange);
insert into reservations(during) values
('[2016-12-30, 2017-01-09)'),
('[2017-02-23, 2017-02-27)'),
('[2017-04-29, 2017-05-02)');
create index on reservations using gist(during);
select * from reservations where during && '[2017-01-01, 2017-04-01)';
So how does the R-tree handle intervals? The same way it handles other spatial objects. An interval is a one-dimensional rectangle; when splitting a page, the R-tree tries to minimize the variance inside the descendant pages and maximize it between them.
Is it possible to turn an interval into a point, treating, for example, its beginning as X and its end as Y? Formally, yes; after all, who can forbid us to use two numbers as a two-dimensional point? But such data will carry an internal dependency, and the semantics in the data lead to semantics in the search queries and in the interpretation of the results.
The easiest way to understand the situation is to look at it with your own eyes.
Here are 100 random intervals with lengths from 0 to 10 000 and starting points from 0 to 100 000.
Figure 1
The search query is [41 000 ... 47 000]; we are interested in all intervals intersecting the given range.
The numbers mark the search zones that arise. Nothing exists below the diagonal, for the obvious reason that an interval cannot end before it starts.
So, by zones:
- zone 1: intervals that lie strictly inside the search query
- zone 2: everything that started and ended before the query interval
- zone 3: intervals that started before the query and ended during it
- zone 4: intervals that completely cover the query
- zone 5: intervals that started during the query but ended after it
- zone 6: everything that started and ended after the query
Thus, to find all intersections, the query must be expanded to [0 ... 47 000; 41 000 ... 100 000], i.e. it must include zones 1, 3, 4, 5.
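Just for illustration, here is a minimal sketch of such a search in PostgreSQL (my own example with a hypothetical table intervals_pt, not part of the original experiment): the intervals are stored as points with X = start and Y = end, indexed by the built-in GiST support for points, and the intersection search becomes a containment check against the expanded box.
-- a hypothetical table: each interval [s, e] is stored as the point (s, e)
create table intervals_pt(p point);
create index on intervals_pt using gist(p);
-- 100 random intervals: start in [0, 100 000], length in [0, 10 000]
insert into intervals_pt(p)
select point(s, s + 10000 * random())
from (select 100000 * random() as s from generate_series(1, 100)) t;
-- everything intersecting [41 000, 47 000]: start <= 47 000 and end >= 41 000,
-- i.e. the point lies in the box [0 ... 47 000] x [41 000 ... 110 000] (zones 1, 3, 4, 5);
-- the upper corner of the box only needs to be no less than the largest possible end
select p from intervals_pt
where p <@ box(point(0, 41000), point(47000, 110000));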
The fact that the query has grown to a quarter of the entire layer should not be scary, because only a narrow strip along the diagonal is actually populated. Yes, there may be large intervals spanning almost the whole layer, but there are relatively few of them.
Of course, large objects will affect performance, but nothing can be done about that: they really exist and really should appear in the output of almost any query. If you try to index the “spaghetti monster”, the index effectively stops working; it becomes cheaper to scan the table and filter out the excess.
However, any indexing method apparently suffers on such data.
Are there any other disadvantages to this approach? Since the data is “degenerate” in a way, the page-splitting heuristic of the R-tree may become less efficient. But the biggest inconvenience is that we must constantly keep in mind that we are dealing with intervals and formulate the search queries accordingly.
However, this applies not only to the R-tree but to any spatial index, including the Z-curve (a B-tree built over the Z-order) that we used earlier. For the Z-curve the degeneracy of the data does not matter at all: to it these are just numbers with their natural ordering.
Among other things, the Z-curve is also more compact than the R-tree. So let's try to assess its prospects in this area.
For example, we assumed that the X coordinate is the beginning of the interval and Y is its end. But what if we do the opposite? The R-tree should not care, and for the Z-curve little changes either: the traversal is organized differently, but the average number of pages read stays the same.
What about mirroring the data along one of the axes? After all, we have a Z-curve (mirrored in Y), so let the data lie along the oblique stroke of the letter Z. Well, let's run a numerical experiment.
Take 1 000 random intervals and split them into 10 pages by Z-value, i.e. sort them and cut into groups of 100, then look at what we get. We prepare two data sets: (X, Y) and (max(X)-X, Y).
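Roughly, such a bucketing could be done like this (a sketch only: random_intervals(s, e) is a hypothetical table, the coordinates are assumed to fit into 20 bits, and the bit interleaving is done in plain SQL; the original experiment certainly used its own tooling).
-- interleave the bits of two 20-bit coordinates into a Z-value
create function zvalue(x int, y int) returns bigint
language sql immutable as $$
  select sum( (((x >> i) & 1)::bigint << (2 * i))
            + (((y >> i) & 1)::bigint << (2 * i + 1)) )::bigint
  from generate_series(0, 19) as i
$$;
-- split 1 000 intervals into 10 "pages" of 100 by Z-value;
-- for the mirrored variant use zvalue(100000 - s, e) instead
select s, e, ntile(10) over (order by zvalue(s, e)) as page
from random_intervals;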
Figure 2 Normal Z-curve
Figure 3 Mirrored data.
It is clearly visible how much “worse” the mirrored data looks: the pages crawl on top of each other, their total perimeter is much larger, and this guarantees that, other things being equal, this variant will on average read more pages.
Why did this happen? In this case the reason lies on the surface: for a 2-dimensional rectangle, the Z-value of the lower-left corner is always <= the Z-value of the upper-right corner (the Z-value is monotone in each coordinate). Therefore, the ideal way to neatly cut nearly linear data into pages is to lay it along the line X = Y.
What if we index not the start and end of the range but its center and length? In that case the data stretches along the X axis, as if Figure 2 were rotated 45 degrees clockwise. The problem is that the search area also rotates 45 degrees: intersection with [q1, q2] becomes the condition |center - (q1+q2)/2| <= (length + q2 - q1)/2, whose boundary is no longer axis-aligned, so it can no longer be covered by a single query.
Is it possible to index multidimensional intervals in this way? Of course. A 2-dimensional rectangle turns into a 4-dimensional point; the main thing is not to forget which bound went into which coordinate.
Is it possible to mix intervals with non-interval values? Yes, because the Z-curve search algorithm knows nothing about the semantics of the values; it works with faceless numbers. As long as the search extent is built according to that semantics and the results are interpreted accordingly, no problems arise.
Numerical experiment
Let's try to check the performance of the Z-curve in conditions close to the “spaghetti monster”. The data model is as follows:
- an 8-dimensional index, whose operability and performance we have already tested
- as usual, 100 million points
- the objects are 3-dimensional parallelepipeds; they occupy 6 dimensions
- the remaining 2 dimensions go to the time interval
- there are 100 parallelepipeds; they form a 10x10 lattice in X and Y
- each parallelepiped lives one unit of time
- with each unit of time a parallelepiped grows by 10 units in Z, starting from a height of 0
Figure 4 A slice of the data at time T = 1000.
In this test we check how the indexing of extended objects affects search performance.
Create the 8-dimensional table, load the data, and build the index:
create table test_points_8d (p integer[8]);
COPY test_points_8d from '/home/.../data.csv';
create index zcurve_test_points_8d on test_points_8d (zcurve_num_from_8coords (p[1], p[2], p[3], p[4], p[5], p[6], p[7], p[8]));
Key order: xmax, xmin, ymax, ymin, zmax, zmin, tmax, tmin
data file (data.csv):
{210000,200000,210000,200000,0,0,0,0}
{210000,200000,210000,200000,10,0,1,1}
{210000,200000,210000,200000,20,0,2,2}
{210000,200000,210000,200000,30,0,3,3}
{210000,200000,210000,200000,40,0,4,4}
{210000,200000,210000,200000,50,0,5,5}
{210000,200000,210000,200000,60,0,6,6}
{210000,200000,210000,200000,70,0,7,7}
{210000,200000,210000,200000,80,0,8,8}
{210000,200000,210000,200000,90,0,9,9}
...
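For reference, here is one hypothetical way such a file could be generated; the exact grid geometry (cells 10 000 wide, spaced 50 000 apart, starting at 200 000) is an assumption inferred from the sample rows above, not the author's actual generator.
-- assumed layout: a 10x10 lattice of cells 10 000 wide, spaced 50 000 apart,
-- each column growing by 10 units in Z per time step, for 1 000 000 time steps
copy (
  select array[ x + 10000, x,        -- xmax, xmin
                y + 10000, y,        -- ymax, ymin
                10 * t,    0,        -- zmax, zmin
                t,         t ]       -- tmax, tmin
  from generate_series(0, 9)      as gx,
       generate_series(0, 9)      as gy,
       generate_series(0, 999999) as t,
       lateral (select 200000 + 50000 * gx as x,
                       200000 + 50000 * gy as y) as g
) to '/home/.../data.csv';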
Test query [200 000 ... 300 000; 200 000 ... 300 000; 100 ... 1000; 10 ... 11]:
select c, t_row from (select c, (select p from test_points_8d t where c = t.ctid) t_row
from zcurve_8d_lookup_tidonly('zcurve_test_points_8d',
200000,0,200000,0,100,0,10,0,
1000000,300000,1000000,300000,1000000,1000,1000000,11
) as c) x;
Answer:
c | t_row
-------------+-------------------------------------------
(0,11) | {210000,200000,210000,200000,100,0,10,10}
(0,12) | {210000,200000,210000,200000,110,0,11,11}
(103092,87) | {260000,250000,210000,200000,100,0,10,10}
(103092,88) | {260000,250000,210000,200000,110,0,11,11}
(10309,38) | {210000,200000,260000,250000,100,0,10,10}
(10309,39) | {210000,200000,260000,250000,110,0,11,11}
(113402,17) | {260000,250000,260000,250000,100,0,10,10}
(113402,18) | {260000,250000,260000,250000,110,0,11,11}
(206185,66) | {310000,300000,210000,200000,100,0,10,10}
(206185,67) | {310000,300000,210000,200000,110,0,11,11}
(216494,93) | {310000,300000,260000,250000,100,0,10,10}
(216494,94) | {310000,300000,260000,250000,110,0,11,11}
(20618,65) | {210000,200000,310000,300000,100,0,10,10}
(20618,66) | {210000,200000,310000,300000,110,0,11,11}
(123711,44) | {260000,250000,310000,300000,100,0,10,10}
(123711,45) | {260000,250000,310000,300000,110,0,11,11}
(226804,23) | {310000,300000,310000,300000,100,0,10,10}
(226804,24) | {310000,300000,310000,300000,110,0,11,11}
(18 rows)
This is what such a data model is good for: the correctness of the result is easy to verify by eye.
On a freshly started server, the same query run via EXPLAIN (ANALYZE, BUFFERS) shows that 501 pages were read.
OK, let's play with the index. What if we run the same query on an index with the key order xmin, ymin, zmin, xmax, ymax, zmax, tmin, tmax? The run shows 531 reads. Noticeably more.
Well, let's eliminate the geometric skew: now let the columns grow not by 10 but by 1 unit per time step. The query (key order: xmax, xmin, ymax, ymin, zmax, zmin, tmax, tmin)
EXPLAIN (ANALYZE,BUFFERS) select c, t_row from (select c, (select p from test_points_8d t where c = t.ctid) t_row
from zcurve_8d_lookup_tidonly('zcurve_test_points_8d',
200000,0,200000,0,0,0,10,0,
1000000,300000,1000000,300000,1000000,1000,1000000,11) as c) x;
reports that 156 pages were read!
Now for some statistics. Let's look at the average number of reads / cache hits for different key orders over a series of 10 000 random queries of size 100 000 x 100 000 x 100 000 x 10.
- key order (xmin, 1 000 000-xmax, ymin, 1 000 000-ymax, zmin, 1 000 000-zmax, tmin, 1 000 000-tmax): on average 21.8217 reads and 2437.51 cache hits, with an average of 17.81 objects in the output
- key order (xmax, xmin, ymax, ymin, zmax, zmin, tmax, tmin): on average 14.2434 reads and 1057.19 cache hits, again 17.81 objects on average
- key order (xmin, xmax, ymin, ymax, zmin, zmax, tmin, tmax): on average 14.0774 reads and 1053.22 cache hits, again 17.81 objects on average
As you can see, the page cache works very efficiently here, and this number of reads is already quite acceptable.
And finally, just out of curiosity, let's see with our own eyes how the Z-curve algorithm copes with 8-dimensional space.
Below are the projections of how the query is parsed for the extent [200 000 ... 300 000; 200 000 ... 300 000; 0 ... 700 001; 700 000 ... 700 001].
There are 18 results in total and 1 687 generated subqueries. The figures show the generated subqueries; green crosses are reads from the B-tree, blue crosses are the found results.
There may be several marks under one cross: nearby marks merge, for example the times 700 000 and 700 001.
Figure 5 The X axis
Figure 6 The Y axis
Figure 7 The Z axis
Figure 8 Time