Postgre (no) SQL or again about storing data with a flexible structure

    When it comes to storing flexible (previously unknown, frequently modified) data structures in a database, developers usually turn either to the "great and terrible" EAV pattern or to the now-fashionable NoSQL databases.
    Not long ago I faced exactly this task.
    - EAV. This pattern provokes a persistent hostility in me, and a great deal of negative has been said and written about it (Kyte, Fowler, Karwin, Gorman). Its main drawback is that in queries you have to operate not with real entities ("Employee", "House", "Client", which is exactly what SQL is for), but with objects organized at a lower level of abstraction. So this was the least desirable option.
    - NoSQL. At first this option (MongoDB in particular) appealed to me greatly. After prolonged use of relational systems, you initially feel a breathtaking sense of total freedom: documents of any structure, instant creation of new collections, queries against them, beauty! But after a short time the euphoria subsided and the problems showed up:
    - A poor query language (IMHO), plus the lack of joins;
    - The lack of schemas (a good article on this topic, and not only on it, appeared recently: habrahabr.ru/post/164361);
    - The lack of built-in support for referential integrity;
    - The lack of amenities such as stored procedures/functions, triggers, views, and much more.
    - Besides data with a flexible (variable) structure, my application also has to store ordinary static data: tables of users, visits, employees, and so on. Working with such data (again IMHO) is much simpler and, most importantly, more reliable in an ordinary relational database (the same referential integrity, and so on).



    I tried to solve the first problem (partially) with an ORM (Spring Data in my case): it let me write tolerable queries over objects, but to do so you have to create and compile all the classes (corresponding to the required collections) in advance and then operate on them. This did not suit me, because collections need to be created and changed frequently and quickly, "on the fly".
    The second I addressed by creating a separate collection that stores the structures of all other collections, in order to validate incoming data, and so on.
    I never got as far as solving the remaining problems; I gave up...
    Already at this stage my database had begun to resemble a very fragile construction, completely dependent on the application, and on top of that I had to implement by hand many things that most relational systems can do out of the box. Maybe that is normal, but somehow I am not used to it; it did not feel right.

    Next I thought about how good it would be to combine a relational and a NoSQL DBMS: on the one hand, all the power of the relational model with everything that comes with it; on the other, the lightness and elegance of a document-oriented solution. Indeed, what prevents us from storing objects with a flexible structure in one or more dedicated tables, for example in XML format, and accessing them via XPath? Especially since many modern DBMSs have well-developed tools for working with XML, including indexing.
    I decided to try a small example in PostgreSQL to see whether this works out and what the queries look like.
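    As a quick illustration of the mechanism (my own mini-example, not part of the original setup): PostgreSQL's xpath() function evaluates an XPath 1.0 expression against an xml value and returns an array of matches, so the first match has to be taken explicitly and cast through text.

```sql
-- xpath() returns xml[]; take the first element and cast it via text
SELECT (xpath('/Customer/Name/text()',
              '<Customer><Name>Ivan</Name></Customer>'::xml))[1]::text;
-- returns 'Ivan'
```

    This array-then-cast pattern is exactly what the queries below use.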

    For starters, two service tables are enough; I think comments here would be superfluous:

    CREATE TABLE classes
    (
      id integer NOT NULL,
      name text,
      is_closed boolean,
      objects_count integer,
      CONSTRAINT classes_pk PRIMARY KEY (id)
    );
    CREATE TABLE objects
    (
      id serial NOT NULL,  -- serial creates the objects_id_seq sequence used below
      body xml,
      id_classes integer,
      CONSTRAINT objects_pk PRIMARY KEY (id),
      CONSTRAINT classes_objects FOREIGN KEY (id_classes)
          REFERENCES classes (id) MATCH SIMPLE
          ON UPDATE NO ACTION ON DELETE NO ACTION
    );
    CREATE INDEX fki_classes_objects
      ON objects
      USING btree
      (id_classes);
    


    We create two classes for the experiment:

    INSERT INTO classes(
                id, name, is_closed, objects_count)
        VALUES (1, 'customers', FALSE, 0);
    INSERT INTO classes(
                id, name, is_closed, objects_count)
        VALUES (2, 'orders', FALSE, 0);
    


    We will also prepare two functions for generating random test data (found on the Internet):

    CREATE OR REPLACE FUNCTION random(numeric, numeric)
      RETURNS numeric AS
    $BODY$
       SELECT ($1 + ($2 - $1) * random())::numeric;
    $BODY$
      LANGUAGE sql VOLATILE
      COST 100;
    CREATE OR REPLACE FUNCTION random_string(length integer)
      RETURNS text AS
    $BODY$
    declare
      chars text[] := '{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}';
      result text := '';
      i integer := 0;
    begin
      if length < 0 then
        raise exception 'Given length cannot be less than 0';
      end if;
      for i in 1..length loop
        result := result || chars[1+random()*(array_length(chars, 1)-1)];
      end loop;
      return result;
    end;
    $BODY$
      LANGUAGE plpgsql VOLATILE
      COST 100;
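
    A quick sanity check of the two helpers (my own addition):

```sql
SELECT random(1, 1000);    -- a random numeric between 1 and 1000
SELECT random_string(10);  -- a random 10-character alphanumeric string
SELECT random_string(-1);  -- raises: Given length cannot be less than 0
```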
    


    We fill the table with random data: objects of the classes "Customer" and "Order" (a one-to-many relationship; each customer placed five orders):

    DO $$
    DECLARE
    	customer_pk integer;
    	order_pk integer;
    BEGIN
    	FOR i in 1..10000 LOOP
    		customer_pk := nextval('objects_id_seq');
    		-- element names ID, Name, Customer_id, Cost match the XPath queries below;
    		-- <Surname> is an arbitrary name for the second random text field
    		insert into objects (body, id_classes) values((
    		'<Customers><Customer><ID>' || customer_pk || '</ID><Name>' || random_string(10) || '</Name><Surname>' || random_string(10) || '</Surname></Customer></Customers>')::xml, 1);
    		for j in 1..5 LOOP
    			order_pk := nextval('objects_id_seq');  -- a fresh ID for each order
    			insert into objects (body, id_classes) values((
    			'<Orders><Order><ID>' || order_pk || '</ID><Customer_id>' || customer_pk || '</Customer_id><Cost>' || random(1, 1000) || '</Cost></Order></Orders>')::xml, 2);
    		end loop;
    	END LOOP;
    END$$;
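
    To sanity-check the generated data (my own addition, assuming the two tables above), count the objects per class:

```sql
SELECT c.name, count(o.id) AS objects
  FROM classes c
  LEFT JOIN objects o ON o.id_classes = c.id
 GROUP BY c.name;
-- customers: 10000, orders: 50000
```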
    


    With the first query we select the maximum order cost:

    explain select max(((xpath('/Orders/Order/Cost/text()', O.body))[1])::text::float) as cost_of_order
      from Objects O
     where O.id_classes = 2;
    

    /*
    Aggregate  (cost=2609.10..2609.11 rows=1 width=32)
      ->  Seq Scan on objects o  (cost=0.00..2104.50 rows=50460 width=32)
            Filter: (id_classes = 2)
    */
    

    The query turned out somewhat verbose, but still quite readable: it is immediately clear which entity is being queried and on which attribute. As expected, we got a full scan, but nothing prevents us from building an index on the Cost attribute:

    create index obj_orders_cost_idx on objects using btree (((xpath('/Orders/Order/Cost/text()', body))[1]::text::float));
    


    And now the query runs much faster and uses the index:
    /*
    Result  (cost=0.15..0.16 rows=1 width=0)
      InitPlan 1 (returns $0)
        ->  Limit  (cost=0.00..0.15 rows=1 width=32)
              ->  Index Scan Backward using obj_orders_cost_idx on objects o  (cost=0.00..7246.26 rows=50207 width=32)
                    Index Cond: ((((xpath('/Orders/Order/Cost/text()'::text, body, '{}'::text[]))[1])::text)::double precision IS NOT NULL)
                    Filter: (id_classes = 2)
    */
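
    A possible refinement (my own sketch, not from the original): since this index only ever helps queries on class 2, it can be declared as a partial index, so customer rows are not indexed at all:

```sql
-- hypothetical index name; indexes only rows with id_classes = 2
create index obj_orders_cost_part_idx on objects
  using btree (((xpath('/Orders/Order/Cost/text()', body))[1]::text::float))
  where id_classes = 2;
```

    The planner will consider it for queries whose WHERE clause implies id_classes = 2, as in the example above.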
    


    Now let's try to select information about the orders of a few specific customers, i.e. a join of the two tables:

    explain select (xpath('/Customers/Customer/Name/text()', C.body))[1] as customer
         , (xpath('/Orders/Order/Cost/text()', O.body))[1] as cost_of_order
      from objects C
         , objects O
     where C.id_classes = 1
       and O.id_classes = 2
       and (xpath('/Orders/Order/Customer_id/text()', O.body))[1]::text::int = (xpath('/Customers/Customer/ID/text()', C.body))[1]::text::int
       and ((xpath('/Customers/Customer/ID/text()', C.body))[1])::text::int between 1997585 and 1997595;
    


    /*
    Hash Join  (cost=1873.57..6504.85 rows=12867 width=64)
      Hash Cond: ((((xpath('/Orders/Order/Customer_id/text()'::text, o.body, '{}'::text[]))[1])::text)::integer = (((xpath('/Customers/Customer/ID/text()'::text, c.body, '{}'::text[]))[1])::text)::integer)
      ->  Seq Scan on objects o  (cost=0.00..2104.50 rows=50460 width=32)
            Filter: (id_classes = 2)
      ->  Hash  (cost=1872.93..1872.93 rows=51 width=32)
            ->  Bitmap Heap Scan on objects c  (cost=196.38..1872.93 rows=51 width=32)
                  Recheck Cond: (id_classes = 1)
                  Filter: (((((xpath('/Customers/Customer/ID/text()'::text, body, '{}'::text[]))[1])::text)::integer >= 1997585) AND ((((xpath('/Customers/Customer/ID/text()'::text, body, '{}'::text[]))[1])::text)::integer <= 1997595))
                  ->  Bitmap Index Scan on fki_classes_objects  (cost=0.00..196.37 rows=10140 width=0)
                        Index Cond: (id_classes = 1)
    */
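
    The same query can also be written with explicit ANSI join syntax (my own variant; the semantics are unchanged):

```sql
explain select (xpath('/Customers/Customer/Name/text()', C.body))[1] as customer
     , (xpath('/Orders/Order/Cost/text()', O.body))[1] as cost_of_order
  from objects C
  join objects O
    on (xpath('/Orders/Order/Customer_id/text()', O.body))[1]::text::int
     = (xpath('/Customers/Customer/ID/text()', C.body))[1]::text::int
   and O.id_classes = 2
 where C.id_classes = 1
   and ((xpath('/Customers/Customer/ID/text()', C.body))[1])::text::int between 1997585 and 1997595;
```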
    


    A full scan, as expected; now let's add a few indexes:

    create index obj_customers_id_idx on objects using btree (((xpath('/Customers/Customer/ID/text()', body))[1]::text::int));
    create index obj_orders_id_idx on objects using btree (((xpath('/Orders/Order/ID/text()', body))[1]::text::int));
    create index obj_orders_customerid_idx on objects using btree (((xpath('/Orders/Order/Customer_id/text()', body))[1]::text::int));
    


    Now things look better:

    /*
    Hash Join  (cost=380.52..5011.80 rows=12867 width=64)
      Hash Cond: ((((xpath('/Orders/Order/Customer_id/text()'::text, o.body, '{}'::text[]))[1])::text)::integer = (((xpath('/Customers/Customer/ID/text()'::text, c.body, '{}'::text[]))[1])::text)::integer)
      ->  Seq Scan on objects o  (cost=0.00..2104.50 rows=50460 width=32)
            Filter: (id_classes = 2)
      ->  Hash  (cost=379.88..379.88 rows=51 width=32)
            ->  Bitmap Heap Scan on objects c  (cost=204.00..379.88 rows=51 width=32)
                  Recheck Cond: (((((xpath('/Customers/Customer/ID/text()'::text, body, '{}'::text[]))[1])::text)::integer >= 1997585) AND ((((xpath('/Customers/Customer/ID/text()'::text, body, '{}'::text[]))[1])::text)::integer <= 1997595) AND (id_classes = 1))
                  ->  BitmapAnd  (cost=204.00..204.00 rows=51 width=0)
                        ->  Bitmap Index Scan on obj_customers_id_idx  (cost=0.00..7.35 rows=303 width=0)
                              Index Cond: (((((xpath('/Customers/Customer/ID/text()'::text, body, '{}'::text[]))[1])::text)::integer >= 1997585) AND ((((xpath('/Customers/Customer/ID/text()'::text, body, '{}'::text[]))[1])::text)::integer <= 1997595))
                        ->  Bitmap Index Scan on fki_classes_objects  (cost=0.00..196.37 rows=10140 width=0)
                              Index Cond: (id_classes = 1)
    */
    


    This query has not lost its clarity either, but it could be polished further: sort out the type casts, optimize the XML structure, and so on. There is still a lot of work to do; this is just a small example.

    What else can be done:

    1. Implement flexible search by the attributes of objects of any class;
    2. Partition the objects table (at least partially), for example to store the objects of large classes physically separately.
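
    The second point could be sketched with the table-inheritance approach available in the PostgreSQL versions of that time (my own sketch; all names are mine):

```sql
-- child table that physically holds only orders (id_classes = 2)
CREATE TABLE objects_orders (
  CHECK (id_classes = 2)
) INHERITS (objects);

-- with constraint exclusion enabled, the planner skips child
-- tables whose CHECK constraint contradicts the WHERE clause
SET constraint_exclusion = partition;
SELECT count(*) FROM objects WHERE id_classes = 2;
```

    Routing inserts into the right child table would still require a trigger or application logic; this only shows the storage-layout idea.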

    Storing data in a database in XML format is, of course, nothing new; still, while searching for a solution to my problem I found very little information about it, let alone concrete examples. I hope this will be useful to someone, and I would be glad to hear in the comments from people who have worked with a similar scheme.
