When to use unstructured data types in PostgreSQL? Hstore vs. Comparison JSON vs. Jsonb

Original author: Craig Kerstiens
  • Transfer

Since PostgreSQL started supporting NoSQL (through HStore, JSON and JSONB), the question of when to use PostgreSQL in relational mode, and in which in NoSQL mode, has been raised quite often. Will you be able to completely abandon the traditional table structures and work with document presentations in the future? Do you mix both approaches? The answer to this question is not surprising - it all depends on many factors. Each new storage model including Hstore, JSON and JSONB has its own ideal application. Here we dig deeper and learn about the features of each of them and see when what to use.



Hstore


If we exclude the XML , hstore was the first truly unstructured data type, added in PostgreSQL. Hstore was added a long time ago in Postgres 8.3 to upsert , to streaming replication , and to window functions . Hstore is essentially a key / value store directly in PostgreSQL. Using Hstore you are limited in the choice of the data type used. In fact, you only have strings. You don’t even have data nesting; In short, this is a single-level key / value data type.


The advantage of Hstore is that you do not need to define keys (as opposed to columns) in advance. You can simply insert a record and it will store all the necessary data. Let's say you have an example script to create a table:


CREATE TABLE products (
    id serial PRIMARY KEY,
  name varchar,
  attributes hstore
);

With Hstore, you can insert everything you want into the attribute column. In this case, the request to add these keys and values ​​will look like this:


INSERT INTO products (name, attributes) VALUES (
'Geek Love: A Novel',
'author    => "Katherine Dunn",
    pages     => 368,
    category  => fiction'
);

The query for the selection will look like:


SELECT name, attributes->'author' as author
FROM products
WHERE attributes->'category' = 'fiction'

The obvious advantage of this approach is flexibility, but here's where it really shows itself in full is the ability to use different types of indexes. In particular, the GIN or GiST index will index each key and value within the Hstore. That is, the added index will be used when filtering, if the PostgreSQL scheduler requires it.


Since Hstore is not the full equivalent of a document, it is important to understand whether it is beneficial to use it as such.
If you have relational data and also some data that may not always exist in a column, then this approach can be a great solution. For example, in most cases, product catalog attributes can be a great example for this type of data. Then, for some products, such as books (which you keep in a separate table “Products”), parameters such as genre, year of publication can be defined. In another case, for products such as clothes that you also store in the same table, other parameters can already be determined - size and color. Adding a column to the product table for each possible parameter is redundant and unjustified.


Json


Starting with version 9.2, Postgres has implemented JSON support. Now, PostgreSQL can compete with MongoDB. (Although the JSON functionality in PostgreSQL 9.2 is, of course, a little exaggerated . More on that below.)


Postgres JSON data type, if you look pretty much just a text field. All you get with the JSON data type is the value validation on insertion. Postgres enforces JSON format. One small potential advantage over JSONB (which we will consider next) is that JSON saves indentation in the data coming into the database. So if you are very demanding on formatting your data or you need to save a record in one or another structure, JSON can be useful.


In addition, over time, Postgres has acquired a number of fairly useful features . Should you use JSON? After all, the PostgreSQL type JSON simply provides text field validation. If you store some form of log data that you rarely request, the JSON data type in this case works well. Since JSON is pretty simple, it will have much higher throughput when pasting. For anything more complicated, I would recommend using JSONB, which will be described below.


Jsonb


Finally, in Postgres 9.4, we got real and proper JSON support as JSONB. B stands for Better. JSONB is a binary representation of data in JSON format. This means that the data is compressed and more efficient for storage than plain text. In addition, under the hood he has a mechanism similar to Hstore. Technically, once in development, there was an almost implemented type of Hstore2 and a separate type of JSON and subsequently they were combined into JSONB in ​​the form as it is now.


The JSONB type represents pretty much what you would expect from a JSON data type. It allows you to implement nested structures, use basic data types, and also has a number of built-in functions for working with it. The best part of this similarity with Hstore is indexing. Creating a GIN index on a JSONB column will create an index for each key and value within this JSON document. The ability to index and nest within a document means that JSONB is superior to Hstore in most cases.


Although there is still a small question about when to use JSONB exclusively. Let's say you create a database of documents and choose Postgres from all the options. With a package like MassiveJS, this can be quite convenient.


The most common use cases:


  1. Track data events by adding changing payload events.
  2. Storage of game data is quite common, especially where you have a single player game and a changing data scheme based on the state of the user.
  3. Tools that combine multiple data sources, the example here may be a tool that integrates multiple customer databases to Salesforce, Zendesk, or something else. The combination of schemes makes this a more painful procedure than it should be.

Let's look at another example of working with JSONB. The script creates a table and inserts some data for an example:


CREATE TABLE integrations (id UUID, data JSONB);
INSERT INTO integrations VALUES (
    uuid_generate_v4(),
    '{
        "service": "salesforce",
        "id": "AC347D212341XR",
        "email": "craig@citusdata.com",
        "occurred_at": "8/14/16 11:00:00",
        "added": {
            "lead_score": 50
        },
        "updated": {
            "updated_at": "8/14/16 11:00:00"
        }
   }');
INSERT INTO integrations (
  uuid_generate_v4 (),
  '{
    "service": "zendesk",
    "email": "craig@citusdata.com",
    "occurred_at": "8/14/16 10:50:00",
    "ticket_opened": {
        "ticket_id": 1234,
        "ticket_priority": "high"
     }
   }');

In the above case, you can easily find all the events that occurred with the user from email craig@citusdata.com, and then do some actions. For example, you can conduct some form of behavioral analytics, and calculate the users who made foo and then bar, or make a simple report.
By adding a Gin index, all data within my JSONB field is indexed automatically:


CREATE INDEX idx_integrations_data ON integrations USING gin(data);

Conclusion


In most cases, JSONB is probably just what you are looking for when you plan to use a non-relational data type. For Hstore and JSON, you can also find good use, albeit in more rare cases. JSONB does not always fit into the data model. If you can normalize the scheme, then you will have an advantage, but if in the scheme, a large number of optional columns (for example, with event data) or one scheme is very different from the other, then JSONB is much better.


Summarizing, the decision selection algorithm:


JSONB - In most cases,
JSON - If you process logs, you do not often have to request data or do not need to use it as something more than for logging tasks.
Hstore - works great with text data based on a key-value representation, but overall JSONB also does a great job of this.


Also popular now: