
Flexible data storage in MySQL (JSON)
Alexander Rubin works at Percona and has spoken at HighLoad++ more than once; attendees know him as a MySQL expert. It is logical to assume that this talk is about something related to MySQL. That is true, but only partly, because we will also talk about the Internet of Things. The story is half entertaining, especially its first part, in which we look at the device Alexander built to harvest apricots. Such is the nature of a true engineer: if you want fruit, you buy a circuit board.

Background
It all started with a simple desire to plant a fruit tree on my property. It would seem simple enough: you come to the store and buy a seedling. But in America, the first question sellers ask is how much sunlight the tree will receive. For Alexander this turned out to be a complete mystery: nobody knew how much sunlight the site gets.
To find out, one could go out into the yard every day, note how sunny it is, and write it down in a notebook. But that is no way for an engineer: everything should be instrumented and automated.
Many of the examples during the talk were run live. If you want a more complete picture than the text gives, switch to the video.
So, to avoid recording weather observations in a notebook, there is a huge number of Internet of Things devices: Raspberry Pi, Arduino, thousands of different platforms. For this project I chose a device called Particle Photon. It is very easy to use and costs $19 on the official website.
The good thing about Particle Photon is:
- it is a 100% cloud solution;
- any sensors work with it, for example Arduino sensors, which cost less than a dollar each.
I built such a device and put it in the grass in my yard. It connects to the Particle Device Cloud, which has a console. The device connects through a Wi-Fi hotspot and sends its data: light, temperature and humidity. The prototype lasted 24 hours on a small battery, which is pretty good.
Next, I needed not only to measure illumination and the rest and push it to my phone (which is genuinely nice: I can see the current light level in real time), but also to store the data. Naturally, as a MySQL veteran, I chose MySQL for that.
How do we write data in MySQL
I chose a rather complicated scheme:
- I get data from the Particle console;
- I use Node.js to write them to MySQL.
I use the Particle JS API, which can be downloaded from the Particle website. I open a connection to MySQL and write the data, that is, I simply do INSERT INTO ... VALUES. That is the whole pipeline.
So the device sits in the yard, connects over Wi-Fi to the home router, and uses the MQTT protocol to send data to Particle. Then the scheme above kicks in: a Node.js program runs on a virtual machine, receives the data from Particle and writes it to MySQL.
To start, I plotted the raw data in R. The graphs show that temperature and light rise during the day and fall at night, while humidity rises at night, which is natural. But there is also noise on the graphs, which is typical for Internet of Things devices. For example, when a bug crawls onto the device and covers the sensor, it can report completely irrelevant data. This will matter later.
Now let's talk about MySQL and JSON: what changed in JSON support from MySQL 5.7 to MySQL 8. Then I will show a demo that uses MySQL 8 (at the time of the talk this version was not yet production-ready; a stable release has since shipped).
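One way to tame such noise before it ever reaches the database is a small smoothing pass over the readings. This is my own sketch, not something from the talk; the slidingMedian function and the sample values are made up for illustration:

```javascript
// Smooth noisy sensor readings with a sliding median, so a single
// bogus value (a bug sitting on the light sensor) does not survive
// as a spike in the stored data.
function slidingMedian(values, window = 3) {
  return values.map((_, i) => {
    const start = Math.max(0, i - window + 1);
    // Sort a copy of the current window and take its middle element.
    const slice = values.slice(start, i + 1).sort((a, b) => a - b);
    return slice[Math.floor(slice.length / 2)];
  });
}

const light = [510, 512, 3, 514, 515]; // "3" = bug covering the sensor
console.log(slidingMedian(light));     // [510, 512, 510, 512, 514]
```

A sliding median drops a lone outlier while barely touching legitimate readings.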
MySQL data storage
When we try to store data received from sensors, our first thought is to create a table in MySQL :
CREATE TABLE `sensor_wide` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `light` int(11) DEFAULT NULL,
  `temp` double DEFAULT NULL,
  `humidity` double DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
Here there is a column for each sensor and each data type: light, temperature, humidity.
This is logical enough, but there is a problem: it is not flexible. Suppose we want to add another sensor and measure something else, for example, how much beer is left in a keg. What do we do then?
alter table sensor_wide
add water_level double ...;
What contortions does it take just to add something to the table? You need an ALTER TABLE, and if you have ever altered a large table in MySQL, you know what I am talking about: it is genuinely painful. ALTER TABLE in MySQL 8 and MariaDB is much easier now, but historically this has been a big problem. So if we need to add a column, say, with the name of the beer, it will not be so simple.
Also, sensors come and go. What do we do with old data? For example, we stop receiving light readings. Or we add a new column: how do we store what did not exist before? The standard answer is NULL, but NULLs are not very convenient for analysis.
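For what it is worth, InnoDB's online DDL can make such schema changes less painful. A hedged sketch, assuming MySQL 5.6+ syntax; the beer_name column is hypothetical:

```sql
-- Ask for an in-place change that does not block reads or writes;
-- the statement fails instead of silently copying the whole table.
ALTER TABLE sensor_wide
  ADD COLUMN beer_name VARCHAR(255) DEFAULT NULL,
  ALGORITHM=INPLACE, LOCK=NONE;
```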
Another option is a key / value store.
MySQL data storage: key / value
This is more flexible: the key/value table stores a name, for example 'temperature', and the corresponding data.
CREATE TABLE `cloud_data` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL,
  `data` text DEFAULT NULL,
  `updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
Here another problem appears: there are no types. We do not know what we are storing in the `data` field, so we have to declare it as a text field. When I build my own IoT device, I know what sensors it has and therefore what types to expect; but if someone else's data has to go into the same table, I will not know what is being collected.
You could create many tables, but making a whole new table for every sensor type is not great.
What can be done? - Use JSON.
MySQL data storage: JSON
The good news is that in MySQL 5.7 you can store JSON as a field.
CREATE TABLE `cloud_data_json` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL,
  `data` JSON,
  `updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
Before MySQL 5.7 people stored JSON too, but in a text field. The native JSON type lets MySQL store JSON far more efficiently. On top of a JSON column you can also create virtual columns, and indexes on those.
The only minor drawback is that the table grows larger on disk. In exchange we get far more flexibility.
The JSON field is better for storing JSON than the text field because:
- It provides automatic document validation: if we try to write something invalid, we get an error.
- It is an optimized storage format: JSON is stored in a binary form that allows skipping from one element of the document to another.
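For instance, an invalid document is rejected at INSERT time. A sketch against the cloud_data_json table above; the exact error text may vary by version:

```sql
INSERT INTO cloud_data_json (name, data)
VALUES ('particle', '{"broken": ');
-- Fails with something like:
-- ERROR 3140 (22032): Invalid JSON text: ...
```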
To store data as JSON we can simply use SQL: do an INSERT and put the document we received from the device into the `data` field.
…
stream.on('event', function(data) {
  var query = connection.query(
    'INSERT INTO cloud_data_json (name, data) VALUES (?, ?)',
    ['particle', JSON.stringify(data)]
  );
});
…
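The same write can also be done in plain SQL. The document below is a hypothetical row I made up, shaped like the events seen later in the demo:

```sql
INSERT INTO cloud_data_json (name, data)
VALUES ('particle',
        '{"name": "temperature",
          "data": "23.5",
          "published_at": "2017-09-28T18:21:16.517Z"}');
```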
Demo
For the demo (its beginning is at this point in the video) I use a virtual machine with MySQL on it.

Below is a fragment of the program.

I do
INSERT INTO cloud_data (name, data)
and get the data already in JSON format, so I can write it to MySQL as-is, without thinking at all about what is inside. As it turned out, this cloud gives access not only to my device's data, but to essentially all the data flowing through Particle. It seems to still work that way. People all over the world who use Particle Photon are sending data: the garage door is open, this much beer is left, or something else. It is unknown where these devices are located, but the data can be fetched. The only difference is that when I fetch my own data, I add something like
deviceId: 'mine'
to the request. When we run the code, we get a stream of data from other people's devices doing who knows what.

We have absolutely no idea what this data is: ttl, published_at, coreid, a door status (door open), a relay turned on.

This is a great example. Suppose I tried to put this into MySQL as a normal relational structure. I would have to know what door this is, why it is open, and what values the fields can take in general. With JSON, I simply write the whole thing to MySQL as a JSON field.

And there you go: everything has been recorded.

Document store
Document store is MySQL's attempt at building storage for JSON. I really love SQL, I know it well and can write any query, but many people dislike SQL for various reasons. For them the Document store can be a solution: it lets you abstract away from SQL, connect to MySQL, and write JSON directly.

It is another capability that appeared in MySQL 5.7: it uses a different protocol, a different port, and needs a different driver. With Node.js (and in fact any language: PHP, Java, etc.) we connect to MySQL over this protocol and pass data as JSON. Again, I do not know what is in this JSON, door statuses or something else; I just dump the data into MySQL and we will figure it out later.
const mysqlx = require('@mysql/xdevapi');
// MySQL connection
var mySession = mysqlx.getSession({
  host: 'localhost', port: 33060, dbUser: 'photon'
});
…
session.getSchema("particle").getCollection("cloud_data_docstore")
  .add( data )
  .execute(function (row) {
  }).catch(err => {
    console.log(err);
  })
  .then( function (notices) {
    console.log("Wrote to MySQL");
  });
…
https://dev.mysql.com/doc/dev/connector-nodejs/
If you want to experiment with this, you can configure MySQL 5.7 to listen on the appropriate port for the Document store / X DevAPI protocol. I used connector-nodejs.
Here is an example of what I write there: beer and so on. I have absolutely no idea what is in it. For now we just record it and will analyze it later.

The next item on our agenda: how do we see what is in there?
MySQL data storage: JSON + indexes
MySQL 5.7 has a great feature for pulling fields out of JSON: the `->>` operator, syntactic sugar over the JSON_EXTRACT function. I find it very convenient.
In our case `data` is the name of the column that stores the JSON, and `name` is the key inside the document. name, data, published_at: all of these can be pulled out this way.
select data->>'$.name' as data_name,
data->>'$.data' as data,
data->>'$.published_at' as published
from cloud_data_json
order by data->'$.published_at' desc
limit 10;
In this example I want to look at what I have written to the MySQL table: the last 10 records. I run this query and, unfortunately, it takes a very long time.
Logically enough, MySQL cannot use any index here. We pull the data out of JSON and then try to filter and sort on it, so we end up with Using filesort.
EXPLAIN select data->>'$.name' as data_name ...
order by data->>'$.published_at' desc limit 10
select_type: SIMPLE
table: cloud_data_json
possible_keys: NULL
key: NULL
…
rows: 101589
filtered: 100.00
Extra: Using filesort
Using filesort is very bad: it is an external sort.
The good news is that there are two steps you can take to speed this up.
Step 1. Create a virtual column
mysql> ALTER TABLE cloud_data_json
-> ADD published_at DATETIME(6)
-> GENERATED ALWAYS AS
(STR_TO_DATE(data->>'$.published_at',"%Y-%m-%dT%T.%fZ")) VIRTUAL;
Query OK, 0 rows affected (0.01 sec)
Records: 0 Duplicates: 0 Warnings: 0
I do an extract, that is, I pull a value out of the JSON and create a virtual column based on it. A virtual column is not stored on disk in either MySQL 5.7 or MySQL 8; it is just a column computed on the fly.
You may ask: did you not just say ALTER TABLE is a long operation? Here it is not so bad: creating a virtual column is fast. There is a lock, but then MySQL takes a lock on every DDL operation. This particular ALTER TABLE is a quick, metadata-only change that does not rebuild the whole table.
So we created a virtual column. I had to convert the date, because in the JSON it is stored in ISO 8601 format, while MySQL expects a completely different format. To create the column, I named it, gave it a type, and told it what expression to compute.
To optimize the original query we need to extract published_at and name. published_at is done; name is even simpler: just another virtual column.
mysql> ALTER TABLE cloud_data_json
-> ADD data_name VARCHAR(255)
-> GENERATED ALWAYS AS (data->>'$.name') VIRTUAL;
Query OK, 0 rows affected (0.01 sec)
Records: 0 Duplicates: 0 Warnings: 0
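As a sanity check, the date conversion used for the published_at column can be run on its own; the literal below is the ISO 8601 shape Particle sends:

```sql
SELECT STR_TO_DATE('2017-09-28T18:21:16.517Z',
                   "%Y-%m-%dT%T.%fZ") AS published_at;
-- 2017-09-28 18:21:16.517000
```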
Step 2. Create an index
In the code below, I create an index on published_at and execute the query:
mysql> alter table cloud_data_json add key (published_at);
Query OK, 0 rows affected (0.31 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> explain select data_name, published_at, data->>'$.data' as data from
cloud_data_json order by published_at desc limit 10\G
table: cloud_data_json
type: index
possible_keys: NULL
key: published_at
key_len: 9
rows: 10
filtered: 100.00
Extra: Backward index scan
You can see that MySQL actually uses the index now. This is an ORDER BY optimization: data and name are not indexed here, but the query orders by published_at, and since there is an index on published_at, MySQL uses it.
Moreover, instead of published_at I could use the same syntactic sugar right in the ORDER BY:
STR_TO_DATE(data->>'$.published_at',"%Y-%m-%dT%T.%fZ")
MySQL would still recognize that there is an index on this expression and use it. There is a small problem here, though. Suppose I want to sort not only by published_at, but also by name.
mysql> explain select data_name, published_at, data->>'$.data' as data from
cloud_data_json order by published_at desc, data_name asc limit 10\G
select_type: SIMPLE
table: cloud_data_json
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 101589
filtered: 100.00
Extra: Using filesort
If your devices generate tens of thousands of events per second, published_at alone does not give a deterministic order: there will be duplicates. So we add a second sort key, data_name. This is a typical query well beyond IoT: give me the last 10 events sorted by date descending, and then, say, by last name ascending. That is exactly what the example above does: two fields, two sort directions, descending and ascending.
And here MySQL will not use the index at all. In this particular case it decides that a full table scan is cheaper than using the index, and we are back to the very slow filesort.
New in MySQL 8.0
descending / ascending
In MySQL 5.7 there is no way to optimize such a query, short of restructuring it. MySQL 8 finally added the ability to specify the sort direction for each field in an index.
mysql> alter table cloud_data_json
add key published_at_data_name
(published_at desc, data_name asc);
Query OK, 0 rows affected (0.44 sec)
Records: 0 Duplicates: 0 Warnings: 0
The most interesting part is that the desc / asc keyword after a column name in an index definition has been in SQL for ages. Even in the very first MySQL 3.23 you could write published_at desc or published_at asc. MySQL accepted it, but did nothing: it always sorted the index in one direction.
In MySQL 8 this was finally implemented. You can now create an index with descending key parts, or with the default ascending order.
Let's go back for a second and look at the example from step 2 again.
Why does that query work while this one does not? It works because MySQL indexes are B-trees, and a B-tree can be scanned from the beginning or from the end. For a single descending sort, MySQL simply reads the index backwards and everything is fine. But with one key descending and the other ascending, no single scan direction matches both: the rows come out in the wrong order and have to be re-sorted.
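The scan-direction argument can be sketched in a few lines; this toy example is mine, not from the talk:

```javascript
// Why one index scan cannot return rows ordered by
// (published_at DESC, name ASC): reading an ascending
// (published_at, name) index backwards reverses BOTH keys.
const rows = [
  { published_at: '2017-09-27', name: 'c' },
  { published_at: '2017-09-28', name: 'a' },
  { published_at: '2017-09-28', name: 'b' },
];

const cmpAscAsc = (x, y) =>
  x.published_at.localeCompare(y.published_at) || x.name.localeCompare(y.name);

// What a backward scan of an (ASC, ASC) index yields: DESC, DESC.
const backwardScan = [...rows].sort(cmpAscAsc).reverse();

// What the query actually asks for: DESC, ASC.
const wanted = [...rows].sort((x, y) =>
  y.published_at.localeCompare(x.published_at) || x.name.localeCompare(y.name));

console.log(backwardScan.map(r => r.name).join('')); // "bac"
console.log(wanted.map(r => r.name).join(''));       // "abc"
```

The backward scan reverses both keys, so rows with equal dates come out in the wrong name order, and the server must re-sort.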
Since we are optimizing one very specific query, we can create an index exactly for it and specify the directions: published_at descending, data_name ascending. MySQL uses this index, and everything is fast again.
mysql> explain select data_name, published_at, data->>'$.data' as data from
cloud_data_json order by published_at desc limit 10\G
select_type: SIMPLE
table: cloud_data_json
partitions: NULL
type: index
possible_keys: NULL
key: published_at_data_name
key_len: 267
ref: NULL
rows: 10
filtered: 100.00
Extra: NULL
This is a MySQL 8 feature which, at the time of publication, is already generally available and ready for production use.
Displaying results
There are two more interesting things that I want to show:
1. Pretty print, that is, displaying the data nicely on screen. With a plain SELECT, the JSON comes back unformatted.
mysql> select json_pretty(data) from cloud_data_json
where data->>'$.data' like '%beer%' limit 1\G
…
json_pretty(data): {
"ttl": 60,
"data": "FvGav,tagkey=beer-store spFridge=7.00,pvFridge=7.44",
"name": "LOG_DATA_DEBUG",
"coreid": "3600....",
"published_at": "2017-09-28T18:21:16.517Z"
}
2. We can ask MySQL to return the result as a JSON array or JSON object: we list the fields, and the output will be formatted as JSON.
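For example, JSON_OBJECT() (available since MySQL 5.7) assembles the listed fields into a JSON object per row; the sketch below reuses the column names from the same table:

mysql> select json_object(
    ->   'name', data->>'$.name',
    ->   'published_at', published_at
    -> ) as doc
    -> from cloud_data_json limit 3;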
Full-text search inside JSON documents
If we use a flexible storage system and do not know what is inside our JSON, it would be logical to use full-text search.
Unfortunately, full-text search has its limitations. The first thing I tried was simply to create a full-text key on the JSON column:
mysql> alter table cloud_data_json_indexes add fulltext key (data);
ERROR 3152 (42000): JSON column 'data' supports indexing only via generated columns on a specified JSON path.
Unfortunately, this does not work. Even in MySQL 8 it is impossible to create a full-text index directly on a JSON column. Of course, I would like to have such a function: being able to search at least across JSON keys would be very useful.
But since this is not yet possible, let's create a virtual column. In our case there is a data field, and it would be interesting for us to see what is inside it.
mysql> ALTER TABLE cloud_data_json_indexes
-> ADD data_data VARCHAR(255)
-> GENERATED ALWAYS AS (data->>'$.data') VIRTUAL;
Query OK, 0 rows affected (0.01 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> alter table cloud_data_json_indexes add fulltext key ft_json(data_name, data_data);
ERROR 3106 (HY000): 'Fulltext index on virtual generated column' is not supported for generated columns.
Unfortunately, this does not work either: a full-text index cannot be created on a virtual generated column.
If so, let's create a stored column instead. MySQL 5.7 allows a generated column to be declared STORED.
mysql> ALTER TABLE cloud_data_json_indexes
-> ADD data_name VARCHAR(255) CHARACTER SET UTF8MB4
-> GENERATED ALWAYS AS (data->>'$.name') STORED;
Query OK, 123518 rows affected (1.75 sec)
Records: 123518 Duplicates: 0 Warnings: 0
mysql> alter table cloud_data_json_indexes add fulltext key ft_json(data_name);
Query OK, 0 rows affected, 1 warning (3.78 sec)
Records: 0 Duplicates: 0 Warnings: 1
mysql> show warnings;
+------------+--------+---------------------------------------------------+
| Level | Code | Message |
+------------+--------+---------------------------------------------------+
| Warning | 124 | InnoDB rebuilding table to add column FTS_DOC_ID |
+------------+--------+---------------------------------------------------+
In the previous examples we created virtual columns, which are not stored on disk (only their indexes are materialized). This time I had to tell MySQL that this is a STORED column, that is, it is physically created and the data is copied into it. After that MySQL built the full-text index, and to do so it had to rebuild the table. This limitation actually comes from InnoDB full-text search: the table has to be rebuilt to add the hidden FTS_DOC_ID column.
Interestingly, MySQL 8 made utf8mb4, popularly known as "the emoji encoding", the default character set. Of course, it is not only about emoji: the old utf8mb3 cannot store characters outside the Basic Multilingual Plane, which causes problems with some Chinese, Japanese and other texts and symbols.
mysql> ALTER TABLE cloud_data_json_indexes
-> ADD data_data TEXT CHARACTER SET UTF8MB4
-> GENERATED ALWAYS AS ( CONVERT(data->>'$.data' USING UTF8MB4) ) STORED;
Query OK, 123518 rows affected (3.14 sec)
Records: 123518 Duplicates: 0 Warnings: 0
Accordingly, MySQL 8 should store JSON data in utf8mb4. But, whether because Node.js connects to the Device Cloud and writes something incorrectly, or because of a beta-version bug, that did not happen. Therefore, I had to convert the data before writing it into the stored column.
mysql> ALTER TABLE cloud_data_json_indexes DROP KEY ft_json,
ADD FULLTEXT KEY ft_json(data_name, data_data);
Query OK, 0 rows affected (1.85 sec)
Records: 0 Duplicates: 0 Warnings: 0
After that, I was able to build a full-text index over two fields: the JSON name and the JSON data.
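A search over the new index then looks like this (a sketch; 'beer' is just the sample term that appeared in the data earlier, and the query runs in the default natural language mode):

mysql> select data_name, data_data
    -> from cloud_data_json_indexes
    -> where match(data_name, data_data) against ('beer')
    -> limit 5;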
Not only IoT
JSON is not only about the Internet of things. It can be used for other interesting cases:
- custom fields (e.g. in a CMS);
- complex nested structures, etc.
Some things are much more convenient to implement with a flexible storage schema. An excellent example was shown at Oracle OpenWorld: cinema seat reservations. Implementing this in the relational model is very hard: you end up with many dependent tables, joins, and so on. Instead, we can store the entire hall as a JSON structure, write it to MySQL alongside other tables, and use it in the usual way: build indexes on JSON fields, etc. Complex structures are convenient to store in JSON format.
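As a minimal sketch (the table name and JSON layout here are illustrative, not from the talk's schema), a hall could be stored like this:

mysql> CREATE TABLE screenings (
    ->   id INT PRIMARY KEY AUTO_INCREMENT,
    ->   movie VARCHAR(255),
    ->   hall JSON
    -> );
mysql> INSERT INTO screenings (movie, hall) VALUES
    -> ('Some Movie', '{"rows": [{"row": 1, "seats": ["free", "sold", "free"]}]}');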

This is the tree that was successfully planted. Unfortunately, a few years later deer ate it, but that is a completely different story.
This talk is an excellent example of how a single topic grows into an entire track at a large conference, and then into a separate event. In the case of the Internet of things, the result is InoThings++, a conference for professionals in the Internet of Things market, which will be held for the second time on April 4.
The central event of the conference will likely be the round table "Do we need national standards in the Internet of Things?", which will be organically complemented by in-depth applied talks. Come if your highly loaded systems are making the move to IIoT.