Anger, bargaining and depression when working with InfluxDB
If you use the time series database (timeseries db, wiki ) as the main repository for a site with statistics, then instead of solving the problem, you can get a lot of headache. I am working on a project where such a database is used, and sometimes InfluxDB, which will be discussed, presented unexpected surprises in general.
Disclaimer : These problems are for InfluxDB 1.7.4.
Why time series?
The project is to track transactions in various blockchains and display statistics. Specifically, we look at the emission and burning of stable coins ( wiki ). Based on these transactions, you need to build graphs and show pivot tables.
When analyzing transactions, the idea came up: to use the InfluxDB time series database as the main storage. Transactions are points in time and they fit well into the time series model.
Also, the aggregation functions looked very convenient - they are ideal for processing charts with a long period. The user needs a graph for the year, and the database contains a data set with a time frame of five minutes. It’s pointless to send him one hundred thousand points - except for long processing, they will not fit on the screen. You can write your own implementation of increasing the timeframe, or use the aggregation functions built into Influx. With their help, you can group data by day and send the desired 365 points.
It was a little embarrassing that usually such databases are used to collect metrics. Monitoring servers, iot-devices, all from which millions of points of the form “pour”: [<time> - <metric value>]. But if the database works well with a large flow of data, then why should small volume cause problems? With this in mind, they took InfluxDB to work.
What else is convenient in InfluxDB
In addition to the aggregation functions mentioned, there is another great thing - continuous queries ( doc ). This is a scheduler built into the database that can process data on a schedule. For example, you can group all records for a day every 24 hours, calculate the average and write one new point in another table without writing your own bikes.
There is also retention policies ( doc ) - setting up data deletion after a period. It is useful when, for example, you need to store the load on the CPU for a week with measurements once a second, but at a distance of a couple of months this accuracy is not needed. In this situation, you can do this:
- create a continuous query to aggregate data into another table;
- For the first table, define a policy for deleting metrics that are older than that week.
And Influx will independently reduce the size of the data and delete unnecessary.
About Stored Data
Not much data is stored: about 70 thousand transactions and another million points with market information. Adding new entries - no more than 3000 points per day. There are also metrics on the site, but there is little data there and by retention policy they are stored for no more than a month.
During the development and subsequent testing of the service, more and more critical problems arose during the operation of InfluxDB.
1. Deletion of data
There is a series of data with transactions:
SELECT time, amount, block, symbol FROM transactions WHERE symbol='USDT'
I send a command to delete data:
DELETE FROM transactions WHERE symbol=’USDT’
Next, I make a request to receive already deleted data. And Influx, instead of an empty response, returns part of the data that should be deleted.
I try to delete the whole table:
DROP MEASUREMENT transactions
I check table deletion:
I don’t watch the table in the list, but a new data request still returns the same set of transactions.
The problem occurred to me only once, since the deletion case is an isolated case. But this behavior of the database clearly does not fit into the framework of "correct" work. Later on github I found an open ticket almost a year ago on this topic.
As a result, the removal and subsequent restoration of the entire database helped.
2. Floating-point numbers
Mathematical calculations using the built-in functions in InfluxDB give accuracy errors. Not that it was anything unusual, but unpleasant.
In my case, the data has a financial component and I would like to process it with high accuracy. Because of this, the plans to abandon continuous queries.
3. Continuous queries cannot be adapted to different time zones
The service has a table with daily statistics on transactions. For each day, you need to group all transactions for this day. But the day for each user will start at a different time, therefore the set of transactions is different. UTC has 37 shift options for which you need to aggregate data.
When grouping by time in InfluxDB, you can additionally specify a shift, for example, for Moscow time (UTC + 3):
SELECT MEAN("supply") FROM transactions GROUP BY symbol, time(1d, 3h) fill(previous)
But the query result will be incorrect. For some reason, daily-grouped data will begin as early as 1677 (InfluxDB officially supports the time period from this year):
To work around this problem, the service was temporarily transferred to UTC + 0.
There are many benchmarks on the Internet with comparisons of InfluxDB and other databases. At the first acquaintance, they looked like marketing materials, but now I think that there is some truth in them.
I’ll tell you my case.
The service provides an API method that returns statistics for the last day. During calculations, the method queries the database three times with the following queries:
SELECT * FROM coins_info WHERE time <= NOW() GROUP BY symbol ORDER BY time DESC LIMIT 1
SELECT * FROM dominance_info ORDER BY time DESC LIMIT 1
SELECT * FROM transactions WHERE time >= NOW() - 24h ORDER BY time DESC
- In the first query, we get the last points for each coin with market data. Eight points for eight coins in my case.
- The second request receives one the newest point.
- The third one requests a list of transactions for the last day, there may be several hundred.
I’ll clarify that in InfluxDB, an index is automatically built by tags and by time, which speeds up queries. In the first query, symbol is the tag.
I did a stress test for this API method. For 25 RPS, the server showed a full load of six CPUs:
At the same time, the NodeJs process did not give any load at all.
The execution speed has already degraded by 7-10 RPS: if one client could receive a response in 200 ms, then 10 clients should wait a second. 25 RPS - the border with which stability suffered, 500 errors were returned to clients.
With such performance, it is impossible to use Influx in our project. Moreover: in a project where you need to demonstrate monitoring to many clients, similar problems may appear and the metric server will be overloaded.
The most important conclusion from the experience gained is that you cannot take an unknown technology into the project without sufficient analysis. A simple screening of open tickets on github could provide information in order not to take InfluxDB as the main data warehouse.
InfluxDB should have been well suited to the tasks of my project, but as practice has shown, this database does not meet the needs and messes up a lot.
You can already find version 2.0.0-beta in the project repository, it is hoped that in the second version there will be significant improvements. In the meantime, I’m going to study TimescaleDB documentation.