Theory and practice of using HBase

Good day! My name is Danil Lipova, our team at Sbertech began using HBase as a repository of operational data. In the course of his study, experience has accumulated that he wanted to systematize and describe (we hope that many will be useful). All the experiments below were performed with HBase versions 1.2.0-cdh5.14.2 and 2.0.0-cdh6.0.0-beta1.

Common architecture
Writing data to HBASE
Reading data from HBASE
Data caching
MultiGet / MultiPut batch processing
Spliting Splitting Strategy
Fault tolerance, compactification and data locality
Settings and Performance
Stress Testing
findings

1. General architecture

The backup Master listens to the heartbeat active on the ZooKeeper node and, if it disappears, takes over the master's functions.

2. Writing data to HBASE

First, we consider the simplest case - writing a key-value object to a certain table using put (rowkey). The client must first find out where the root region server (Root Region Server - RRS) is located, which stores the hbase: meta table. This information he receives from ZooKeeper. After that, he accesses the RRS and reads the hbase: meta table, from which it extracts information which RegionServer (RS) is responsible for storing data for a given rowkey key in the table of interest. For further use, the meta-table is cached by the client and therefore subsequent calls go faster, directly to RS.

Then RS, having received the request, first of all writes it to WriteAheadLog (WAL), which is necessary for recovery in the event of a fall. Then saves the data in the MemStore. This is a buffer in memory that contains a sorted set of keys for a given region. The table can be divided into regions (partitions), each of which contains a non-overlapping set of keys. This allows placing regions on different servers to get better performance. However, despite the obviousness of this statement, we will see later that this does not work in all cases.

After placing the record in MemStore, the client is returned a response that the record was saved successfully. At the same time, it is actually stored only in the buffer and will get to the disk only after a certain time interval has elapsed or when it is filled with new data.

When performing the “Delete” operation, no physical deletion of data occurs. They are simply marked as deleted, and the destruction itself takes place at the moment the major compact function is called, about which it is written in more detail in Section 7.

Files in the HFile format are copied into HDFS and from time to time the process of minor compact starts up, which simply sticks small files into larger ones without deleting anything. Over time, this turns into a problem that manifests itself only when reading the data (we will return to this later).

In addition to the loading process described above, there is a much more efficient procedure in which perhaps the greatest strength of this database is BulkLoad. It lies in the fact that we independently form HFiles and put them on a disk, which allows us to scale perfectly and achieve very decent speeds. In fact, the limitation here is not HBase, but the possibility of iron. Below are the results of the download on a cluster consisting of 16 RegionServers and 16 NodeManager YARN (CPU Xeon E5-2680 v4 @ 2.40GHz * 64 streams), version HBase 1.2.0-cdh5.14.2.

Here you can see that increasing the number of partitions (regions) in the table, as well as Spark executors, we get an increase in download speed. Also, the speed depends on the recording volume. Large blocks give an increase in the measurement of MB / s, small in the number of inserted records per unit of time, other things being equal.

You can also start loading in two tables at the same time and get a doubling of speed. Below it can be seen that the recording of 10 KB blocks in two tables at once goes at a speed of about 600 MB / s in each (total 1275 MB / s), which coincides with the write speed in one table of 623 MB / s (see No. 11 above)

But the second launch with entries of 50 KB shows that the download speed is growing slightly already, which indicates an approach to the limit values. It should be borne in mind that the HBASE itself doesn’t create a load here, all that is required of it is to first send the data from hbase: meta, and after the HFiles lining, reset the BlockCache data and save the MemStore buffer to disk if empty.

3. Read data from HBASE

If we assume that all the information from hbase: meta already has a client (see p.2), then the request goes immediately to that RS, where the necessary key is stored. At first search is carried out in MemCache. Regardless of whether there is data or not, the search is also carried out in the BlockCache buffer and, if necessary, in the HFiles. If the data was found in the file, then it is placed in BlockCache and the next request will be returned faster. HFile searches are relatively fast due to the use of the Blum filter, i.e. having read a small amount of data, it immediately determines whether this file contains the necessary key and, if not, then proceeds to the next.

Receiving data from these three sources, the RS generates a response. In particular, it can transfer several found versions of an object at once if the client has requested versioning.

4. Caching data

The MemStore and BlockCache buffers occupy up to 80% of the allocated on-heap RS memory (the rest is reserved for RS service tasks). If a typical use mode is such that processes write and immediately read the same data, then it makes sense to reduce BlockCache and increase MemStore, since When writing, the data in the cache is not readable, then the use of BlockCache will occur less frequently. The BlockCache buffer consists of two parts: LruBlockCache (always on-heap) and BucketCache (usually off-heap or on SSD). BucketCache should be used when there are a lot of read requests and they do not fit in LruBlockCache, which leads to the active work of the Garbage Collector. At the same time, there is no point in expecting a dramatic increase in productivity from using a cache for reading, however, we will return to this in Section 8.

BlockCache is one for the whole RS, and MemStore has its own table for each table (one for each Column Family).

As described in theory, when writing data do not get into the cache, and indeed, such parameters CACHE_DATA_ON_WRITE for the table and “Cache DATA on Write” for RS are set to false. However, in practice, if we write the data to MemStore, then reset it to disk (by clearing it like this), then deleting the resulting file, then by running a get request we will successfully receive the data. And even if you completely disable BlockCache and score a table with new data, then reset the MemStore to disk, delete it and request it from another session, they will still be retrieved from somewhere. So HBase keeps in itself not only data, but also mysterious riddles.

hbase(main):001:0> create 'ns:magic', 'cf'
Created table ns:magic
Took 1.1533 seconds
hbase(main):002:0> put 'ns:magic', 'key1', 'cf:c', 'try_to_delete_me'
Took 0.2610 seconds
hbase(main):003:0> flush 'ns:magic'
Took 0.6161 seconds
hdfs dfs -mv /data/hbase/data/ns/magic/* /tmp/trash
hbase(main):002:0> get 'ns:magic', 'key1'
 cf:c      timestamp=1534440690218, value=try_to_delete_me

The parameter “Cache DATA on Read” is set to false. If you have any ideas, welcome to discuss this in the comments.

5. Batch Processing MultiGet / MultiPut

Processing single requests (Get / Put / Delete) is quite an expensive operation, so you should combine them into a List or List, if possible, which allows you to get a significant performance boost. Especially it concerns the write operation, but when reading there is the next pitfall. The graph below shows the reading time of 50,000 entries from MemStore. The reading was performed in one stream and the number of keys in the query is shown along the horizontal axis. Here you can see that when you increase to a thousand keys in one request, the execution time drops; speed increases. However, when the MSLAB mode is turned on by default, a radical drop in performance begins after this threshold, and the larger the amount of data in the record, the longer the operation time.

Tests were performed on virtualke, 8 cores, version HBase 2.0.0-cdh6.0.0-beta1.

The MSLAB mode is designed to reduce heap fragmentation, which arises from the mixing of new and old generational data. As a solution to the problem when MSLAB is enabled, data is placed in relatively small cells (chunk) and processed in chunks. As a result, when the volume in the requested data packet exceeds the allocated size, the performance drops sharply. On the other hand, turning off this mode is also not desirable, since it will lead to stops due to GC in moments of intensive work with data. A good solution is to increase the volume of the cell, in the case of active write through put simultaneously with reading. It is worth noting that the problem does not occur if after writing the flush command is executed that flushes the MemStore to disk or if the load is performed using BulkLoad. The table below shows that requests from MemStore of larger data (and the same amount) lead to a slowdown. However, increasing chunksize returns processing time to normal.

In addition to increasing chunksize, data fragmentation by regions helps, i.e. splitting tables. This leads to the fact that fewer requests come to each region and if they are placed in a cell, the response remains good.

6. The strategy of partitioning tables into regions (spiliting)

Since HBase is a key-value repository and partitioning is done by key, it is extremely important to share data evenly across all regions. For example, partitioning such a table into three parts will result in the data being split into three regions:

It happens that this leads to a sharp slowdown if the data loaded in the future will look like long values for the most part starting with the same number, for example:

1000001
1000002
...
1100003

Since the keys are stored as an array of bytes, they will all be start the same way and belong to the same region # 1 that stores this range of keys. There are several partitioning strategies:

HexStringSplit - Turns the key into a string with hexadecimal encoding in the range “00000000” => “FFFFFFFF” and filling with a zero on the left.

UniformSplit - Turns the key into an array of bytes with hexadecimal encoding in the range "00" => "FF" and filling the right with zeros.

In addition, you can specify any range or set of keys for splitting and configure autoswitch. However, one of the most simple and effective approaches is UniformSplit and the use of hash concatenation, for example, the highest pair of bytes from the key run through the CRC32 (rowkey) function and the rowkey itself:

hash + rowkey

Then all data will be distributed evenly across regions. When reading, the first two bytes are simply discarded and the original key remains. RS also controls the amount of data and keys in the region and, when the limits are exceeded, automatically breaks it into pieces.

7. Fault tolerance and data locality

Since only one region is responsible for each key set, the solution to the problems associated with an RS drop or decommissioning is storing all the necessary data in HDFS. When RS drops, the wizard detects this through the absence of a heartbeat on the ZooKeeper node. Then he assigns the serviced region to another RS, and since the HFiles are stored in a distributed file system, the new owner reads them out and continues to serve the data. However, since some of the data may be in MemStore and did not have time to get into the HFiles, WAL is used to restore the history of operations, which are also stored in HDFS. After the change, RS is able to respond to requests, but the move leads to the fact that part of the data and the processes that serve them are on different nodes, i.e. reduced locality.

The solution to the problem is major compaction - this procedure moves the files to those nodes that are responsible for them (where their regions are located), as a result of which, during this procedure, the load on the network and disks increases dramatically. However, in the future, access to data is significantly accelerated. In addition, major_compaction merges all HFiles into one file within a region, and also clears data depending on the table settings. For example, you can specify the number of versions of an object that you want to save or its lifetime, after which the object is physically deleted.

This procedure can have a very positive effect on the operation of HBase. The picture below shows how the performance has degraded as a result of active data recording. Here you can see how 40 threads were written in one table and 40 threads simultaneously read data. Writing streams form more and more HFiles that are read by other threads. As a result, more data needs to be removed from memory and eventually GC starts working, which almost paralyzes all work. Running major compaction led to the clearing of the resulting blockages and the restoration of performance.

The test was performed on 3 DataNode and 4 RS (CPU Xeon E5-2680 v4 @ 2.40GHz * 64 streams). Version HBase 1.2.0-cdh5.14.2

It is worth noting that the launch of major compaction was performed on a “live” table, in which the data were actively written and read. On the network, there was a statement that this could lead to an incorrect answer when reading data. For verification, a process was launched that generated new data and wrote them to a table. After that, he immediately read and checked whether the obtained value coincides with what was written. During this process, major compaction was run about 200 times and no failures were fixed. It is possible that the problem rarely manifests itself and only during high load, so it’s safer to plan to stop the processes of writing and reading and to perform cleaning without preventing such GC drawdowns.

Also major compaction does not affect the MemStore state, flush (connection.getAdmin (). Flush (TableName.valueOf (tblName))) should be used to flush it to disk and compactify.

8. Settings and performance

As already mentioned, HBase shows the greatest success where it doesn’t need to do anything when performing BulkLoad. However, this applies to most systems and people. However, this tool is more suitable for mass data packing in large blocks, whereas if the process requires the execution of many competing read and write requests, the Get and Put commands described above are used. To determine the optimal parameters, the launches were made with various combinations of the parameters of the tables and settings:

10 threads were launched simultaneously 3 times in a row (let's call it a block of threads).
The operation time of all threads in the block was averaged and was the final result of the block operation.
All threads worked with the same table.
Before each launch of the thread block, major compaction was performed.
Each block performed only one of the following operations:

- Put
- Get
- Get + Put

Each unit performed 50,000 repetitions of its operation.
The record size in the block is 100 bytes, 1000 bytes or 10,000 bytes (random).
Blocks were run with a different number of requested keys (either one key or 10).
The blocks were started with different table settings. Parameters changed:

- BlockCache = turned on or off
- BlockSize = 65 Kb or 16 Kb
- Partitions = 1, 5 or 30
- MSLAB = turned on or off

Thus the block looks like this:

a. MSLAB mode was turned on / off.
b. A table was created for which the following parameters were set: BlockCache = true / none, BlockSize = 65/16 Kb, Partitions = 1/5/30.
c. Set compression GZ.
d. 10 threads simultaneously doing 1/10 of the put / get / get + put operations into this table were started with entries of 100/1000/10000 bytes, performing 50,000 queries in a row (random keys).
e. Point d was repeated three times.
f. The operation time of all threads was averaged.

All possible combinations have been tested. It is predicted that as the size of the record increases, the speed will drop or that disabling caching will slow down. However, the goal was to understand the degree and significance of the influence of each parameter, so the collected data were fed to the input of the linear regression function, which makes it possible to assess the reliability using t-statistics. Below are the results of the operation of the blocks performing Put operations. Full set of combinations 2 * 2 * 3 * 2 * 3 = 144 variants + 72 since some were performed twice. Therefore, a total of 216 launches:

Testing was performed on a mini cluster consisting of 3 DataNodes and 4 RSs (CPU Xeon E5-2680 v4 @ 2.40GHz * 64 streams). HBase version 1.2.0-cdh5.14.2.

The fastest insertion speed of 3.7 seconds was obtained when MSLAB was turned off, on a table with one partition, with BlockCache enabled, BlockSize = 16, records of 100 bytes each, 10 pieces per pack.
The lowest insertion speed of 82.8 seconds was obtained with MSLAB enabled, on a table with one partition, with BlockCache enabled, BlockSize = 16, records of 10,000 bytes, 1 piece each.

Now look at the model. We see a good quality model for R2, but it is clear that extrapolation is contraindicated here. The actual behavior of the system when the parameters change will not be linear, this model is needed not for predictions, but for understanding what happened within the specified parameters. For example, here we see by Student’s criterion that for the Put operation, the BlockSize and BlockCache parameters do not matter (which is generally predictable):

But the fact that an increase in the number of partitions leads to a decrease in performance is somewhat unexpected (we have already seen the positive effect of an increase in the number of partitions with BulkLoad), although it can be explained. First, for processing, it is necessary to form requests to 30 regions instead of one, and the amount of data is not such that it gives a win. Secondly, the total running time is determined by the slowest RS, and since the number of DataNode is less than the number of RS, some regions have zero locality. Well, let's look at the top five:

Now let's evaluate the results of the execution of Get blocks:

The number of partitions has lost its significance, which is probably due to the fact that the data is well cached and the read cache is the most significant (statistically) parameter. Naturally, an increase in the number of messages in the request is also very useful for performance. Top scores:

Finally, let's take a look at the model of the block that was first executed by get, and then put:

Here all the parameters are significant. And the results of the leaders:

9. Load Testing

Well, finally, we will launch a more or less decent load, but it is always more interesting when there is something to compare. The DataStax site, a key developer of Cassandra, contains the results of NTs of a number of NoSQL repositories, including HBase version 0.98.6-1. Loading was carried out by 40 streams, data size is 100 bytes, SSD disks. The result of testing Read-Modify-Write showed such results.

As I understand it, the reading was carried out in blocks of 100 records and for 16 HBase nodes, the DataStax test showed a performance of 10 thousand operations per second.

It is fortunate that in our cluster there are also 16 nodes, but not very “successfully”, that each has 64 cores (streams), whereas in the DataStax test it is only 4. On the other hand, they have SSD disks, and we have HDD and more The new version of HBase and the utilization of the CPU during the load almost did not increase significantly (visually by 5-10 percent). However, we still try to run on this configuration. The default table settings, the reading is performed in a range of keys from 0 to 50 million randomly (that is, in fact, each time a new one). The table has 50 million entries, divided into 64 partitions. The keys are hashed in crc32. Default table settings, MSLAB enabled. Running 40 threads, each thread reads a set of 100 random keys and immediately writes the generated 100 bytes of these keys back.

Stand: 16 DataNode and 16 RS (CPU Xeon E5-2680 v4 @ 2.40GHz * 64 streams). HBase version 1.2.0-cdh5.14.2.

The average result is closer to 40 thousand operations per second, which is significantly better than in the DataStax test. However, in order to experiment, you can slightly change the conditions. It is rather unlikely that all work will be carried out exclusively with one table, and also only with unique keys. Suppose there is a “hot” set of keys that generates the main load. Therefore, we will try to create a load with larger records (10 KB), also in batches of 100, in 4 different tables and limiting the range of requested keys to 50 thousand. The graph below shows the launch of 40 threads, each stream reads a set of 100 keys and immediately writes random 10 KB on these keys back.

Stand: 16 DataNode and 16 RS (CPU Xeon E5-2680 v4 @ 2.40GHz * 64 streams). HBase version 1.2.0-cdh5.14.2.

During loading, major compaction was run several times, as was shown above without this procedure, the performance will gradually degrade, however, additional load also occurs during the execution. Drawdowns are caused by different reasons. Sometimes the threads ended up and while they restarted there was a pause, sometimes third-party applications created a load on the cluster.

Reading and writing immediately is one of the most difficult work scenarios for HBase. If you make only put queries of small size, for example, 100 bytes each, combining them in packs of 10-50 thousand pieces, you can get hundreds of thousands of operations per second and similarly with read-only requests. It should be noted that the results are radically better than those obtained by DataStax most due to requests in blocks of 50 thousand each.

Stand: 16 DataNode and 16 RS (CPU Xeon E5-2680 v4 @ 2.40GHz * 64 streams). HBase version 1.2.0-cdh5.14.2.

10. Conclusions

This system is quite flexible, but the influence of a large number of parameters is still unknown. Some of them were tested, but were not included in the resulting test suite. For example, preliminary experiments showed insignificant significance of such a parameter as DATA_BLOCK_ENCODING, which encodes information using values from neighboring cells, which is quite understandable for randomly generated data. In the case of using a large number of repetitive objects, the gain can be significant. In general, we can say that HBase gives the impression of a rather serious and well-thought-out database, which can be quite productive in operations with large data blocks. Especially if it is possible to spread in time the processes of reading and writing.

If something in your opinion is not sufficiently disclosed, I am ready to tell you in more detail. We offer to share our experience or to discuss if you do not agree with something.

Tags: