We fasten the spatial index to the unsuspecting OpenSource DBMS

    image
    I always liked it when the headline clearly says what will happen next, for example, "The Texas Chainsaw Massacre." Therefore, under the cut, we will really add a spatial search to the DBMS, in which it was not originally there.

    Introductory
    Let’s try to do without general words about the importance and usefulness of spatial search.
    When asked why it wasn’t possible to take an open DBMS with already integrated spatial search, the answer would be: “if you want to do something well, do it yourself” (C). But seriously, we will try to show that most of the practical spatial problems can be solved on a completely ordinary desktop without overvoltage and special costs.

    As an experimental DBMS we will use the open edition of OpenLink Virtuoso. In its paid version, spatial indexing is available, but this is a completely different story. Why exactly this DBMS? The author likes her. Fast, easy, powerful. Copy-run-everything works. We will use only regular tools, no C-plugins or special assemblies, only the official assembly and text editor.

    Index Type
    We will index points using a pixel-code (block) index.
    What is a pixel-code index? This is a compromise version of the sweeping curve, allowing you to work with inaccurate objects in a 2-dimensional search space. Combine in this class all the methods that:
    • Divide the search space into blocks in some way
    • All blocks are numbered.
    • Each indexed object is assigned one or more block numbers in which it is located
    • When searching, the extent is split into blocks, and for each block or a continuous chain of blocks from a regular integer index (B-tree), all objects that belong to them are obtained.

    The most famous methods are:
    • Good old Igloo (or other sweeping curve ) of limited accuracy. The space is cut by a lattice, the step of which (usually) is the characteristic size of the indexed object. Cells are numbered as X ... xY ... yZ ... z ... i.e. by gluing cell numbers along the normals.
    • HEALPix . The name comes from the Hierarchical Equal Area iso-Latitude Pixelisation. This method is designed to split a spherical surface in order to avoid the main drawback of Igloo - different areas under the cells at different latitudes. The sphere is divided into 12 equal areas, each of which is further hierarchically divided into 4 sub - sections. Illustration for this method is in the header of the article.
    • Hierarchical Triangular Mesh ( HTM ) .This is a scheme for successively approaching a sphere using triangles, starting with 8 starting ones, each of which is hierarchically divided into 4 sub - trianglesimage
    • One of the best known block index schemes is GRID. Spatial data is considered as a special case of a multidimensional index. Wherein:
      1. space is cut into pieces using a grill
      2. the resulting cells are numbered in some way
      3. data in a specific cell is stored together
      4. attribute values ​​are translated into cell numbers along the axes using the “directory”, it is assumed that it is small and is stored in memory
      5. if the “directory” is large, it is divided into levels a la tree
      6. when changing, the lattice can adapt, cells or columns split or merge, on the contrary
      A 3-level GRID based on Z-order is used by DB2, a 4-level GRID based on a variant of the Hilbert curve is used in MS SQLServer 2008

    With all the minuses, the main one of which, perhaps, is non-adaptability, the need to customize the index for a specific task, block indexes have pluses: simplicity and omnivorous. In essence, this is an ordinary tree with additional semantic load. We somehow convert the geometry to numbers when indexing and in the same way we get the keys for searching the index.
    In this paper, we will use a horizontal scan (+1 ystein ) because of its simplicity: coordinates can be converted into numbers cheaply using PL / SQL. Using HTM or a Hilbert curve without involving C-plugins would be difficult.

    Data type
    Why dots? As already noted, block indices are the same as indexing. At the same time, the rasterizer for points is trivial, and for polygons it would have to be written, and this is absolutely not important for disclosing the topic.

    Actually data
    Without further ado,
    • we put points on the nodes of the lattice 4000X4000 with the beginning at (0,0) and step 1.
    • we get a square of 16 million points.
    • take the block index cell for 10X10,
    • so we have 400x400 cells in each of 100 points.
    For the index, it does not matter how the points are located inside the cell, but such regularity will allow us to control the correctness of the results.

    To generate data, we will use this kind of script on AWK (let it be 'data.awk'):
    BEGIN {
      cnt = 1; sz = 4000;
      # Set the checkpoint interval to 100 hours -- enough to complete most of experiments.print"checkpoint_interval (6000);"; 
      print"SET AUTOCOMMIT MANUAL;";
      for (i=0; i < sz; i++) {
       for (j=0; j < sz; j++) {
        print"insert into \"xxx\".\"YYY\".\"_POINTS\"";
        print"  (\"oid_\",\"x_\", \"y_\") values ("cnt","i","j");";
        cnt++;
       }
        print"commit work;"
      }
      print"checkpoint;"print"checkpoint_interval (60);"; 
      exit;
    }
    


    Server
    • Take the official build from the company's website.
      The author used the latest version 7.0.0 (x86_64), but this is not important, you can take any other.
    • Install, if not, Microsoft Visual C ++ 2010 Redistributable Package (x64)
    • We select the directory for the experiments and copy libeay32.dll, ssleay32.dll, isql.exe, virtuoso-t.exe from the archive 'bin' and virtuoso.ini from the archive 'database' there
    • Start the server using 'start virtuoso-t.exe + foreground', allow the firewall to open ports
    • We test server availability using the command 'isql localhost: 1111 dba dba'. After establishing a connection, we introduce, for example. 'select 1;' and make sure the server is alive.


    Data Scheme
    To start, create a table of points, nothing more:
    createuser"YYY";
    createtable"xxx"."YYY"."_POINTS" (
    	"oid_"integer,
    	"x_"doubleprecision,
    	"y_"doubleprecision,
    	primary key ("oid_") 
    );
    

    Now the index itself:
    createtable"xxx"."YYY"."_POINTS_spx" (
    	"node_"integer,  
    	"oid_"integerreferences"xxx"."YYY"."_POINTS", 
    	primary key ("node_", "oid_") 
    );
    
    where node_ is the identifier of the space cell. Note that although the index is created as a table, both fields are present in the primary key and, therefore, are packaged in a tree.

    Configure index options:
    registry_set ('__spx_startx', '0');
    registry_set ('__spx_starty', '0');
    registry_set ('__spx_nx', '4000');
    registry_set ('__spx_ny', '4000');
    registry_set ('__spx_stepx', '10');
    registry_set ('__spx_stepy', '10');
    

    The registry_set system function allows you to write string name / value pairs to the system registry - this is much faster than keeping them in an auxiliary table and is still covered by transactions.

    Recording trigger:
    createtrigger"xxx_YYY__POINTS_insert_trigger"afterinserton"xxx"."YYY"."_POINTS"referencingnewas n
    {
      declare nx, ny integer;
      nx := atoi (registry_get ('__spx_nx'));
      declare startx, starty doubleprecision;
      startx := atof (registry_get ('__spx_startx'));
      starty := atof (registry_get ('__spx_starty'));
      declare stepx, stepy doubleprecision;
      stepx := atof (registry_get ('__spx_stepx'));
      stepy := atof (registry_get ('__spx_stepy'));
      declare sx, sy integer;
      sx := floor ((n.x_ - startx)/stepx);
      sy := floor ((n.y_ - starty)/stepy);
      declare ixf integer;
      ixf := (nx / stepx) * sy + sx;
      insertinto"xxx"."YYY"."_POINTS_spx" ("oid_","node_") 
        values (n.oid_,ixf);
    };
    

    We put all this in, for example, 'sch.sql' and execute through
    isql.exe localhost:1111 dba dba <sch.sql
    

    Data loading
    After the circuit is ready, you can fill in the data.
    gawk -f data.awk | isql.exe localhost:1111 dba dba >log 2>errlog
    
    It took the author 45 minutes (~ 6000 per second) on Intel i7-3612QM 2.1GHz with 8Gb of memory.
    The size of the memory occupied by this server was about 1 Gb, the size of the database on the disk was 540Mb, i.e. ~ 34 bytes per point and its index.
    Now you need to make sure that the result is correct, enter in isql
    selectcount(*) from"xxx"."YYY"."_POINTS";
    selectcount(*) from"xxx"."YYY"."_POINTS_spx";
    
    Both requests must issue 16 000 000.

    As a reference point, you can use the data published here :
    image.
    They are received on PostgreSQL (Linux), 2.2GHz Xeon.
    On the one hand, the data was obtained 9 years ago. On the other hand, in this case, disk operations are a bottleneck, and random access to the disk over the past period has accelerated not so much.
    And here we can learn that the time of spatial indexing is commensurate with the time of filling the actual data.
    It should be noted that data filling with subsequent indexing occurs (for obvious reasons) much faster than indexing on the fly. But we are more interested in indexing on the fly because for practical purposes, a spatial DBMS is more interesting than a spatial search engine.

    And yet, it should be noted that when filling in spatial data, the order of their filling is very important.
    Both in this work and in the aforementioned point feeding order was ideal (or close to it) in terms of spatial index. If you mix data, performance will drop quite dramatically, regardless of which DBMS you choose.
    If the data comes randomly, we can recommend accumulating it in the sump, sorting and filling it in portions as it overflows.

    We set up a search
    For this, we will use such a wonderful feature of Virtuoso as procedural views.
    createprocedure"xxx"."YYY"."_POINTS_spx_enum_items_in_box" (
    	in minx doubleprecision, 
    	in miny doubleprecision, 
    	in maxx doubleprecision, 
    	in maxy doubleprecision)
    {
      declare nx, ny integer;
      nx := atoi (registry_get ('__spx_nx'));
      ny := atoi (registry_get ('__spx_ny'));
      declare startx, starty doubleprecision;
      startx := atof (registry_get ('__spx_startx'));
      starty := atof (registry_get ('__spx_starty'));
      declare stepx, stepy doubleprecision;
      stepx := atof (registry_get ('__spx_stepx'));
      stepy := atof (registry_get ('__spx_stepy'));
      declare sx, sy, fx, fy integer;
      sx := floor ((minx - startx)/stepx);
      fx := floor (1 + (maxx - startx - 1)/stepx);
      sy := floor ((miny - starty)/stepy);
      fy := floor (1 + (maxy - starty - 1)/stepy);
      declare mulx, muly integer;
      mulx := nx / stepx;
      muly := ny / stepy;
      declare res any;
      res := dict_new ();
      declare cnt integer;
      for (declare iy integer, iy := sy; iy < fy; iy := iy + 1)  
      {
        declare ixf, ixt integer;
        ixf := mulx * iy + sx;
        ixt := mulx * iy + fx;
        for select"node_", "oid_"from"xxx"."YYY"."_POINTS_spx"where"node_" >= ixf and"node_" < ixt do  
        {
          dict_put (res, "oid_", 0);
          cnt := cnt + 1;
        }
      }
      declare arr any;
      arr := dict_list_keys (res, 1);
      gvector_digit_sort (arr, 1, 0, 1);
      result_names(sx);
      foreach (integer oid in arr) do  
      {
        result (oid);
      }
    };
    createprocedureview"xxx"."YYY"."v_POINTS_spx_enum_items_in_box"as"xxx"."YYY"."_POINTS_spx_enum_items_in_box" (minx, miny, maxx, maxy) ("oid"integer); 
    
    In essence, this procedure is very simple, it finds out which cells of the block index the search extent falls into and for each row of these cells makes a sub-request, writes the identifiers of the entries to the hash map, eliminating duplicates, and gives the resulting set of identifiers as a result.
    How to use a spatial index? For example, like this:
    selectcount(*), avg(x.x_),  avg(x.y_)  from"xxx"."YYY"."_POINTS"as x, "xxx"."YYY"."v_POINTS_spx_enum_items_in_box"as a
      where a.minx = 91.and a.miny = 228.and a.maxx = 139.and a.maxy = 295.and x."oid_" = a.oid 
      and x.x_ >= 91.and x.x_ < 139.and x.y_ >= 228.and x.y_ < 295.;
    
    Here, a join is made between the table of points and the synthetic record from the procedural view by identifiers. Here, the procedural view acts as a coarse filter that works with precision to the cell. Therefore, in the last line we put a thin filter.
    Expected Result:
    • Number: (139 - 91) * (295 - 228) = 3216
    • X-average: (139 + 91 - 1) / 2 = 114.5 (-1 because the right border is excluded)
    • Y average: (295 + 228 - 1) / 2 = 261 (...)
    So, with one query within the same transaction, we performed a spatial search.

    Benchmark
    To prepare the request flow, we will use the following script on AWK:
    BEGIN {
      srand();
      for (i=0; i < 100; i++) {
       for (j=0; j < 100; j++)  {
    	startx = int(rand() * 3900);
    	finx = int(startx + rand() * 100);
    	starty = int(rand() * 3900);
    	finy = int(starty + rand() * 100);
    	print"select count(*), avg(x.x_),  avg(x.y_) from";
    	print"\"xxx\".\"YYY\".\"_POINTS\" as x, "print"\"xxx\".\"YYY\".\"v_POINTS_spx_enum_items_in_box\" as a"print"where a.minx = "startx". and a.miny = "starty"."print" and a.maxx = "finx". and a.maxy = "finy". and"print"x.\"oid_\" = a.oid and x.x_ >= "startx". and x.x_ < "finx"."print" and x.y_ >= "starty". and x.y_ < "finy".;"
       }
      }
      exit;
    }
    
    As you can see, we get 10,000 requests in random places of random size (0 ... 100). We will prepare 4 such sets and perform three series of tests - run one isql with tests, 2 in parallel and 4 in parallel. On an eight-core processor, it makes no sense to parallelize more. Measure the performance.
    • 1 thread: 151 sec = ~ 66 requests per second per thread
    • 2 threads: 159 sec = ~ 63 ...
    • 4 threads: 182 sec = ~ 55 ...

    You can try to compare these times with those obtained by the cerebrospinal tract using two ordinary indices - x_ & y_. We index the table by these fields and run a similar series of tests with queries of the type
    selectcount(*), avg(x.x_),  avg(x.y_)  from"xxx"."YYY"."_POINTS"as x
      where x.x_ >= 91.and x.x_ < 139.and x.y_ >= 228.and x.y_ < 295.;
    
    The result is this: a series of 10,000 queries is completed in 417 seconds, we have 24 queries per second.

    Conclusion The
    presented method has a number of drawbacks, the main of which, as already noted, is the need to configure it for a specific task. Its performance is highly dependent on settings that cannot be changed after the index is created. However, this is true for most spatial indices in general.

    But even such an index written on the knee gives a decent performance. In addition, since we are in full control of the process, on this basis we can make whiter complex constructions, for example, multi-table indexes, which are difficult to construct using regular methods. But about it somehow next time.

    Also popular now: