Using ClickHouse at VK, or why we wrote KittenHouse

    At the beginning of the year, we decided to learn how to store and read VK debugging logs more efficiently than before. Debugging logs are, for example, video conversion logs (basically the output of the ffmpeg command and the list of steps for preprocessing files), which we sometimes need only 2-3 months after the problem file was processed.

    At that time, we had two ways to store and process logs, used in parallel: our own logs engine and rsyslog. We started looking at other options and realized that ClickHouse from Yandex suited us quite well, so we decided to adopt it.

    In this article, I'll talk about how we started using ClickHouse at VKontakte, what pitfalls we ran into, and what KittenHouse and LightHouse are. Both products are open source; links are at the end of the article.

    Log collection task


    System Requirements:

    1. Storage of hundreds of terabytes of logs.
    2. Storage for months or (rarely) years.
    3. High write speed.
    4. High read speed (reading is rare).
    5. Index support.
    6. Support for long strings (> 4 KB).
    7. Simplicity of operation.
    8. Compact storage.
    9. The ability to insert from tens of thousands of servers (UDP would be a plus).

    Possible solutions


    Let's briefly list the options that we considered, and their cons:

    Logs engine


    Our self-written microservice for logs.
    - Can only return the last N lines, which must fit in RAM.
    - Storage is not very compact (no transparent compression).

    Hadoop


    - Not all formats have indexes.
    - Read speed could be higher (depends on the format).
    - Complex to configure.
    - No way to insert from tens of thousands of servers (Kafka or an analogue is needed).

    Rsyslog + files


    - No indexes.
    - Low read speed (plain grep / zgrep).
    - Architecturally, strings > 4 KB are not supported, and over UDP even less (1.5 KB).
    ± Compact storage is achieved with logrotate via cron.

    We used rsyslog as a fallback for long-term storage, but long lines were truncated, so it can hardly be called ideal.

    LSD + files


    - No indexes.
    - Low read speed (plain grep / zgrep).
    - Not really designed for insertion from tens of thousands of servers.
    ± Compact storage is achieved with logrotate via cron.
    The difference from rsyslog in our case is that LSD supports long strings, but inserting from tens of thousands of servers would require significant changes to its internal protocol, although this could be done.

    Elasticsearch


    - Problems with operation.
    - Unstable writes.
    - No UDP.
    - Bad compression.
    The ELK stack is already almost the industry standard for log storage. In our experience, read speed is fine, but there are problems with writes, for example during index merges.

    Elasticsearch is primarily designed for full-text search and relatively frequent reads. For us, stable writes and the ability to read our data reasonably quickly by exact match matter more. The Elasticsearch index is tuned for full-text search, and disk usage is quite large compared to gzip of the original content.

    Clickhouse


    - No UDP.

    By and large, the only thing about ClickHouse that did not suit us was the lack of UDP transport. In fact, of the options above, only rsyslog had it, but rsyslog did not support long lines.

    On all other criteria ClickHouse suited us, so we decided to use it and solve the transport problem along the way.

    Why KittenHouse is needed


    As you probably know, VKontakte runs on PHP/KPHP, with "engines" (microservices) written in C/C++ and a little in Go. PHP has no concept of "state" between requests, except perhaps for shared memory and open connections.

    Since we have tens of thousands of servers from which we want to be able to send logs to ClickHouse, keeping an open connection from every PHP worker would be too expensive (each server can have 100+ workers). Therefore, we need some kind of proxy between ClickHouse and PHP. We called this proxy KittenHouse.

    KittenHouse, v1


    First, we decided to try the simplest possible scheme to understand whether our approach would work at all. If Kafka comes to mind for this problem, you are not alone. However, we did not want to use additional intermediate servers: in that case we could easily hit the performance limit of those servers rather than of ClickHouse itself. In addition, we were collecting logs, so we needed a predictable and small data insertion delay. The scheme is as follows:



    A local proxy (kittenhouse) is installed on each server, and each instance holds exactly one HTTP connection to the desired ClickHouse server. Inserts go into Buffer tables, since frequent inserts directly into MergeTree are not recommended.

    KittenHouse v1 features


    The first version of KittenHouse could do very little, but it was enough for testing:

    • Communication via our own RPC (TL schema).
    • It maintains one TCP/IP connection per server.
    • In-memory buffering by default, with a limited buffer size (anything beyond it is discarded).
    • Optional writing to disk, in which case there is a delivery guarantee (at least once).
    • Inserts happen once every 2 seconds.
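
    To make this scheme more concrete, here is a heavily simplified Go sketch of the v1 behavior described above. It is not KittenHouse's actual code: the table name, buffer limit, and ClickHouse address are made up. Rows are buffered in memory per table, anything past the limit is dropped, and everything is flushed to ClickHouse over its HTTP interface once every 2 seconds.

```go
package main

import (
	"bytes"
	"log"
	"net/http"
	"net/url"
	"sync"
	"time"
)

const (
	clickhouseURL = "http://clickhouse:8123/" // hypothetical address
	maxBufferSize = 16 << 20                  // 16 MB per table, as an example limit
	flushInterval = 2 * time.Second
)

var (
	mu      sync.Mutex
	buffers = map[string]*bytes.Buffer{} // table name -> pending TSV rows
	client  = &http.Client{Timeout: 10 * time.Second}
)

// Add queues one TSV-encoded row for the given table.
func Add(table, row string) {
	mu.Lock()
	defer mu.Unlock()
	buf, ok := buffers[table]
	if !ok {
		buf = &bytes.Buffer{}
		buffers[table] = buf
	}
	if buf.Len() >= maxBufferSize {
		return // buffer is full: the rest is discarded, as in v1
	}
	buf.WriteString(row)
	buf.WriteByte('\n')
}

// flush sends every pending buffer to ClickHouse as one INSERT per table.
func flush() {
	mu.Lock()
	pending := buffers
	buffers = map[string]*bytes.Buffer{}
	mu.Unlock()

	for table, buf := range pending {
		q := url.Values{"query": {"INSERT INTO " + table + " FORMAT TabSeparated"}}
		resp, err := client.Post(clickhouseURL+"?"+q.Encode(), "text/plain", buf)
		if err != nil {
			log.Printf("insert into %s failed: %v (data dropped)", table, err)
			continue
		}
		resp.Body.Close()
	}
}

func main() {
	// Hypothetical table and row, just to show the call.
	Add("logs.video_convert", "2018-09-17 00:00:00\tffmpeg exited with code 0")
	for range time.Tick(flushInterval) {
		flush()
	}
}
```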

    First problems


    We ran into the first problem when we shut the ClickHouse server down for several hours and then turned it back on. Below you can see the load average on the server after it came back up:



    The explanation is quite simple: ClickHouse's network model is thread-per-connection, so when thousands of nodes tried to INSERT at the same time, there was very strong contention for CPU and the server barely responded. Nevertheless, all the data was eventually inserted and nothing crashed.

    To solve this problem, we put nginx in front of ClickHouse, and by and large it helped.

    Further development


    During operation we ran into a number of problems, mostly related not to ClickHouse itself but to how we were operating it. Here are more of the pitfalls we hit:

    A large number of Buffer table "chunks" leads to frequent buffer flushes into MergeTree


    In our case there were 16 buffer chunks, a flush interval of 2 seconds, and about 20 tables, which gave up to 160 inserts per second (16 × 20 / 2). This periodically hurt insert performance badly: there were a lot of background merges and disk utilization reached 80% and higher.

    Solution: increase the default buffer flush interval and reduce the number of chunks to 2.

    Nginx returns 502 when it runs out of connections to the upstream


    This is not a problem in itself, but combined with the frequent buffer flushes it produced a fairly high background rate of 502 errors both when inserting into any of the tables and when running SELECTs.

    Solution: we wrote our own reverse proxy using the fasthttp library, which groups inserts by table and uses connections very sparingly. It also distinguishes SELECT from INSERT and has separate connection pools for inserts and reads.
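
    The real proxy is more involved (it also batches rows per table), but a minimal fasthttp sketch of just the SELECT/INSERT split with separate connection pools could look roughly like this; the addresses, ports, and pool sizes are made up:

```go
package main

import (
	"bytes"
	"log"

	"github.com/valyala/fasthttp"
)

// Separate pools so slow SELECTs cannot exhaust the connections needed for INSERTs.
var (
	insertClient = &fasthttp.HostClient{Addr: "clickhouse:8123", MaxConns: 16}
	selectClient = &fasthttp.HostClient{Addr: "clickhouse:8123", MaxConns: 4}
)

func handler(ctx *fasthttp.RequestCtx) {
	// Crude routing: anything whose query text starts with INSERT goes to the insert pool.
	query := ctx.QueryArgs().Peek("query")
	client := selectClient
	if bytes.HasPrefix(bytes.TrimSpace(ctx.PostBody()), []byte("INSERT")) ||
		bytes.HasPrefix(bytes.TrimSpace(query), []byte("INSERT")) {
		client = insertClient
	}

	req := fasthttp.AcquireRequest()
	resp := fasthttp.AcquireResponse()
	defer fasthttp.ReleaseRequest(req)
	defer fasthttp.ReleaseResponse(resp)

	// Forward the original request to ClickHouse and copy the response back.
	ctx.Request.CopyTo(req)
	req.SetHost("clickhouse:8123")

	if err := client.Do(req, resp); err != nil {
		ctx.Error("upstream error: "+err.Error(), fasthttp.StatusBadGateway)
		return
	}
	resp.CopyTo(&ctx.Response)
}

func main() {
	// The listen port is arbitrary for this sketch.
	log.Fatal(fasthttp.ListenAndServe(":8124", handler))
}
```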



    Running out of memory with intensive insertion


    The fasthttp library has its advantages and disadvantages. One of the drawbacks is that the request and the response are fully buffered in memory before control is handed to the request handler. For us this meant that if inserts into ClickHouse could not keep up, the buffers started to grow, eventually all the memory on the server ran out, and the reverse proxy was killed by the OOM killer. Colleagues drew a demotivator:



    Solution: patching fasthttp to support streaming the body of a POST request turned out to be a daunting task, so we decided to use Hijack() and upgrade the connection to our own protocol if a request arrives with the HTTP method KITTEN. Since the server must reply MEOW if it understands this protocol, the whole scheme is called the KITTEN/MEOW protocol.

    We read from only 50 random connections at a time, so thanks to TCP/IP the rest of the clients simply wait, and we do not spend memory on buffers until it is their turn. This reduced memory consumption by at least a factor of 20, and we have had no more problems of this kind.
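
    For illustration, here is a rough fasthttp-based sketch of that idea: the server answers MEOW to a KITTEN request, hijacks the connection, and reads from at most 50 hijacked connections at a time. The wire format after the upgrade is not described in this article, so the line-by-line reading below is purely an assumption, as is the listen port.

```go
package main

import (
	"bufio"
	"log"
	"net"

	"github.com/valyala/fasthttp"
)

// At most 50 hijacked connections are read concurrently; the rest wait thanks to
// TCP backpressure and cost us no buffer memory.
var readSlots = make(chan struct{}, 50)

func handler(ctx *fasthttp.RequestCtx) {
	if string(ctx.Method()) != "KITTEN" {
		ctx.Error("expected KITTEN", fasthttp.StatusBadRequest)
		return
	}
	// Reply MEOW to confirm we speak the protocol; the response is sent before
	// the hijack handler takes over the raw connection.
	ctx.SetStatusCode(fasthttp.StatusOK)
	ctx.SetBodyString("MEOW")
	ctx.Hijack(func(c net.Conn) {
		readSlots <- struct{}{} // block until one of the 50 read slots frees up
		defer func() { <-readSlots }()

		scanner := bufio.NewScanner(c)
		for scanner.Scan() {
			// A real proxy would buffer the row and later insert it into
			// ClickHouse; here we just log it.
			log.Printf("got row: %s", scanner.Text())
		}
	})
}

func main() {
	log.Fatal(fasthttp.ListenAndServe(":13337", handler))
}
```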

    ALTERs on tables can take a long time if there are long-running queries


    ClickHouse's ALTER is non-blocking in the sense that it does not interfere with either SELECT or INSERT queries. But the ALTER cannot start until queries to that table that were sent before the ALTER have finished executing.

    If your server has a background of "long" queries to some table, you may run into a situation where an ALTER on that table does not manage to execute within the default timeout of 60 seconds. But this does not mean the ALTER has failed: it will be executed as soon as those SELECT queries finish.

    This means that you do not know at what point in time the ALTER actually happened, and you have no way to automatically recreate Buffer tables so that their schema always matches. This can lead to insertion problems.





    Solution: as a result, we plan to abandon Buffer tables entirely. In general, Buffer tables have their use cases; for now we still use them and do not run into huge problems. But we have finally reached the point where it is easier to implement the functionality of Buffer tables on the reverse proxy side than to keep putting up with their shortcomings. An example scheme will look like this (the dashed line shows that the ACK on INSERT is asynchronous).



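    As a sketch of what that could mean in practice (invented details, not the final design): the proxy acknowledges an INSERT as soon as the rows are spooled locally, and ships them to ClickHouse asynchronously, retrying on failure. The spool path, port, and target table below are hypothetical.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"net/url"
	"os"
	"strings"
	"sync"
	"time"
)

const spoolPath = "/tmp/kittenhouse.spool" // hypothetical local spool file

var spoolMu sync.Mutex // protects the spool file between appends and shipping

// insertHandler persists the request body locally and ACKs immediately.
func insertHandler(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	spoolMu.Lock()
	defer spoolMu.Unlock()
	f, err := os.OpenFile(spoolPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	defer f.Close()
	if _, err := f.Write(body); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusOK) // ACK: data is persisted, delivery happens later
}

// deliver ships the spool to ClickHouse in the background (at-least-once).
func deliver() {
	for range time.Tick(2 * time.Second) {
		spoolMu.Lock()
		data, err := os.ReadFile(spoolPath)
		if err != nil || len(data) == 0 {
			spoolMu.Unlock()
			continue
		}
		// Hypothetical target table; a real proxy would group rows per table.
		q := url.QueryEscape("INSERT INTO logs FORMAT TabSeparated")
		resp, err := http.Post("http://clickhouse:8123/?query="+q, "text/plain",
			strings.NewReader(string(data)))
		if err == nil {
			resp.Body.Close()
			os.Truncate(spoolPath, 0) // delivered: clear the spool
		}
		// On error the spool is kept and retried on the next tick.
		spoolMu.Unlock()
	}
}

func main() {
	go deliver()
	log.Fatal(http.ListenAndServe(":8125", http.HandlerFunc(insertHandler)))
}
```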
    Reading data


    Suppose we have sorted out inserts. How do we read these logs from ClickHouse? Unfortunately, we did not find any convenient and easy-to-use tools for reading raw data (without charts and the like) from ClickHouse, so we wrote our own solution, LightHouse. Its capabilities are rather modest:

    • Quick view of table contents.
    • Filtering, sorting.
    • Editing an SQL query.
    • View table structure.
    • Shows the approximate number of rows and disk space used (see the sketch below).
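
    One way a tool like LightHouse can estimate row counts and disk usage is the system.parts table. The query below is a guess at how this can be done, not LightHouse's actual code; the database name is hypothetical, and column names such as bytes_on_disk may differ between ClickHouse versions.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
)

func main() {
	// Approximate rows and on-disk bytes per table, from active parts only.
	query := `
		SELECT table, sum(rows) AS rows, sum(bytes_on_disk) AS bytes
		FROM system.parts
		WHERE active AND database = 'logs'
		GROUP BY table
		FORMAT TabSeparated`

	resp, err := http.Get("http://clickhouse:8123/?query=" + url.QueryEscape(query))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Print(string(body))
}
```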

    However, LightHouse is fast and able to do what we need. Here are a couple of screenshots:

    View table structure



    Filtering content



    Results


    ClickHouse is practically the only open-source database that has taken root at VKontakte. We are pleased with its speed and are ready to put up with the shortcomings discussed below.

    Difficulties in operation


    All in all, ClickHouse is a very stable and very fast database. However, as with any product, especially such a young one, it has quirks that you need to take into account:

    • Not all versions are equally stable: do not upgrade straight to a new version in production; it is better to wait for a few bugfix releases.
    • For optimal performance, it is highly advisable to configure RAID and a few other things according to the documentation. There was a recent talk about this at HighLoad.
    • Replication has no built-in rate limiting and can significantly degrade server performance if you do not throttle it yourself (but a fix for this has been promised).
    • Linux has an unpleasant quirk in how virtual memory works: if you write to disk intensively and the data does not get flushed in time, at some point the server "goes into itself", starts aggressively flushing the page cache to disk, and almost completely blocks the ClickHouse process. This sometimes happens during large merges, and you need to watch out for it, for example by periodically flushing buffers yourself or calling sync (a tiny sketch of this follows the list).
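
    A minimal sketch of that last workaround: a side process (or goroutine) that calls sync every few seconds so dirty pages never pile up into one huge writeback stall. Whether you do this from Go, cron, or a shell loop is a matter of taste; the interval here is arbitrary.

```go
package main

import (
	"syscall"
	"time"
)

func main() {
	for range time.Tick(5 * time.Second) {
		syscall.Sync() // same effect as running the sync(1) command
	}
}
```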

    Open-source


    KittenHouse and LightHouse are now available as open source in our GitHub repository:


    Thanks!

    Yuri Nasretdinov, developer in the backend infrastructure department of VKontakte
