Redis Streams as a Clean Data Structure

Transfer

The new Redis 5 data structure, called streams, has sparked keen interest in the community. Somehow I’ll talk with those who use streams in production and write about it. But now I want to consider a slightly different topic. It begins to seem to me that many people think of streams as a kind of surreal tool for solving terribly difficult tasks. Indeed, this data structure * also * provides messaging, but it will be an incredible simplification to assume that Redis Streams functionality is limited only by this.

Streams are a terrific template and “mental model” that can be used with great success in system design, but in reality streams, like most Redis data structures, are a more general structure and can be used for a bunch of other tasks. In this article, we will present streams as a pure data structure, completely ignoring blocking operations, recipient groups, and all other messaging functionality.

Streams - This is CSV on Steroids

If you want to record a number of structured data elements and think that the database will be an excess here, you can simply open the file in mode append onlyand write each line as a CSV (Comma Separated Value):

(open data.csv in append only)
time=1553096724033,cpu_temp=23.4,load=2.3
time=1553096725029,cpu_temp=23.2,load=2.1

It looks simple. People have done this a long time ago and still do it: it is a reliable template, if you know what's what. But what will be the equivalent in memory? In memory, much more advanced data processing becomes possible, and many restrictions of CSV files are automatically removed, such as:

It is difficult (inefficient) to fulfill range requests.
Too much redundant information: each record has almost the same time, and the fields are duplicated. At the same time, deleting data will make the format less flexible if I want to switch to a different set of fields.
Element offsets are simply byte offsets in the file: if we change the structure of the file, the offset will become incorrect, so there is no real concept of a primary identifier. Records in essence cannot be presented unambiguously.
Without the ability to collect garbage and without rewriting the log, you cannot delete entries, but only mark them as invalid. Rewriting logs usually sucks for several reasons, it is advisable to avoid it.

At the same time, such a CSV log is good in its own way: there is no fixed structure, fields can change, it is trivial to generate it and it is quite compact. The idea with Redis streams was to preserve virtues, but to overcome limitations. The result is a hybrid data structure that is very similar to Redis sorted sets: they * look like * a fundamental data structure, but use several internal representations to get this effect.

Introduction to threads (you can skip if you are already familiar with the basics)

Redis streams are represented as delta-compressed macro nodes connected by a base tree. As a result, you can very quickly search for random records, get ranges, delete old elements, etc. At the same time, the interface for the programmer is very similar to a CSV file:

> XADD mystream * cpu-temp 23.4 load 2.3
"1553097561402-0"
> XADD mystream * cpu-temp 23.2 load 2.1
"1553097568315-0"

As you can see from the example, the XADD command automatically generates and returns the identifier of the record, which monotonically increases and consists of two parts: <time> - <counter>. Time in milliseconds, and the counter is incremented for records with the same time.

So, the first new abstraction for the idea of a CSV file in mode append onlyis to use the asterisk as the ID argument for XADD: this is how we get the record identifier from the server for free. This identifier is useful not only to indicate a specific element in the stream, it is also associated with the time the record was added to the stream. In fact, using XRANGE, you can execute range queries or retrieve individual elements:

> XRANGE mystream 1553097561402-0 1553097561402-0
1) 1) "1553097561402-0"
   2) 1) "cpu-temp"
      2) "23.4"
      3) "load"
      4) "2.3"

In this case, I used the same ID to start and end the range to identify one item. However, I can use any range and COUNT argument to limit the number of results. Likewise, there is no need to specify full identifiers for a range, I can simply use only unix time to get elements in a given time range:

> XRANGE mystream 1553097560000 1553097570000
1) 1) "1553097561402-0"
   2) 1) "cpu-temp"
      2) "23.4"
      3) "load"
      4) "2.3"
2) 1) "1553097568315-0"
   2) 1) "cpu-temp"
      2) "23.2"
      3) "load"
      4) "2.1"

At the moment, there is no need to show you other API features, there is documentation for this. For now, let's just focus on this usage pattern: XADD for adding, XRANGE (and also XREAD) for extracting ranges (depending on what you want to do), and let's see why streams are so powerful as to call them data structures.

If you want to know more about streams and APIs, be sure to read the tutorial .

Tennis players

A few days ago, a friend of mine who started studying Redis and I simulated an application to track local tennis courts, players, and matches. The way to model players is very obvious, a player is a small object, so we only need a hash with type keys player:<id>. Then you will immediately realize that you need a way to track games in specific tennis clubs. If player:1and player:2played with each other and player:1won, we can send to the stream the following entry:

> XADD club:1234.matches * player-a 1 player-b 2 winner 1
"1553254144387-0"

Such a simple operation gives us:

Unique match identifier: ID in the stream.
There is no need to create an object for match identification.
Free range requests for paging matches or watching matches for a specific date and time.

Before streams appeared, we would have to create a sorted set by time: the elements of the sorted set would be match identifiers, which are stored in a different key as a hash value. It is not only more work, but also more memory. Much, much more memory (see below).

Our goal now is to show that Redis streams are a sort of sorted collection in mode append only, with keys in time, where each element is a small hash. And in its simplicity, this is a real revolution in the context of modeling.

Memory

The above use case is not just a more cohesive programming pattern. The memory consumption in threads is so different from the old approach with a sorted set + hash for each object that some things now begin to work that were previously impossible to implement at all.

Here are statistics on the amount of memory for storing a million matches in the configuration presented earlier:

Сортированный набор + хеш = 220 МБ (242 RSS)
Потоки                    = 16,8 МБ (18.11 RSS)

The difference is more than an order of magnitude (namely, 13 times). This means being able to work with tasks that used to be too expensive to do in memory. Now they are quite viable. The magic is to introduce Redis streams: macro nodes can contain several elements that are very compactly encoded in a data structure called listpack. This structure will take care, for example, of encoding integers in binary form, even if they are semantically strings. In addition, we apply delta compression and compress the same fields. However, it remains possible to search by ID or time, because such macro nodes are linked in a base tree, which is also designed with memory optimization. Together, this explains the economical use of memory, but the interesting part is

Now let's count. If I can store 1 million records in approximately 18 MB of memory, then I can store 10 million in 180 MB and 100 million in 1.8 GB. With just 18 GB of memory, I can have 1 billion items.

Time series

It is important to note that the example above with tennis matches is semantically * very different * from using Redis streams for time series. Yes, logically we are still registering some kind of event, but there is a fundamental difference. In the first case, we log and create records for rendering objects. And in the time series, we simply measure something that happens outside that does not really represent the object. You can say that this distinction is trivial, but it is not. It is important to understand the idea that Redis streams can be used to create small objects with a common order and to assign identifiers to such objects.

But even the simplest way to use time series is obviously a huge breakthrough, because before the advent of streams, Redis was practically powerless to do anything here. Memory characteristics and flexibility of streams, as well as the ability to limit capped streams (see XADD parameters) are very important tools in the hands of the developer.

findings

Streams are flexible and offer many use cases, but I wanted to write a very short article to clearly show examples and memory consumption. Perhaps for many readers this use of streams was obvious. However, conversations with developers in recent months have left me with the impression that many have a strong association between streams and streaming data, as if the data structure is only good there. This is not true. :-)

Tags: