Query performance analysis in ClickHouse. Yandex Report

    What if your database query is not running fast enough? How do you know whether it uses computing resources optimally or whether it could be accelerated? At the last HighLoad++ conference in Moscow I talked about introspecting query performance: what the ClickHouse DBMS provides for it, and which OS features everyone should know about.



    Every time I run a query, I care not only about the result but also about what the query actually does. For example, it runs for one second. Is that a lot or a little? I always think: why not half a second? Then I optimize something, speed it up, and it runs in 10 ms. Usually I am satisfied. But still, in that case I try to put on a displeased face and ask: why not 5 ms? How can I find out where the time goes while the query is being processed? Can it be accelerated at all?

    In principle, query processing speed is simple arithmetic. We wrote the code - presumably optimally - and we have certain devices in the system. Devices have specifications: for example, the read speed from the L1 cache, or the number of random reads an SSD can do. We all know that. We just need to take these characteristics, add, subtract, multiply, divide and check the answer. But that is the ideal case, and it almost never happens. Almost. In ClickHouse it actually does happen sometimes.

    Let's go over some trivial facts about which devices and resources our servers contain.



    Processor, memory, disk, network. I deliberately arranged these resources starting from the simplest and most convenient to examine and optimize, and ending with the most inconvenient and complex. For example, I execute a query and see that my program seems to be CPU-bound. What does that mean? That I will find some inner loop, a function that is executed most often, rewrite the code, recompile - and suddenly my program runs faster.

    If you use too much RAM, things are a bit more complicated. You need to rethink the data structures, squeeze out some bits. In any case, I restart my program and it uses less RAM - although often at the expense of the processor.

    If everything is limited by disks, that is harder still, because I can change the data layout on disk, but then I have to convert the existing data. If I make a new release, people will have to run some kind of data migration. So disks are already much more complicated, and it is better to think about them in advance.

    And the network... I really do not like the network, because it is often completely unclear what is happening in it, especially if it is a network between continents, between data centers. Something is slow there, and it is not even your network, not your server, and you can do nothing about it. The only thing you can think about in advance is how the data will be transmitted and how to minimize interaction over the network.

    It also happens that not a single resource in the system is fully used and the program is just waiting for something. In fact, this is a very common case, because our systems are distributed, there can be many different processes and threads, one of them is waiting for another, and all of this has to be pieced together somehow in order to understand it properly.



    The simplest thing is to look at resource utilization, at some numeric value. For example, you launch top and it says: the processor is at 100%. Or you run iostat and it says: the disks are at 100%. Often this is not enough, though. One person sees that the program is disk-bound. What can be done? You can just note the fact and go rest, deciding that nothing can be optimized. But in reality each of these devices is quite complicated inside. The processor has a bunch of execution units for different types of operations. The disks may be in a RAID array. An SSD has its own processor and its own controller inside, doing who knows what. And a single value - 50% or 100% - is not enough. The basic rule: if you see that some resource is 100% utilized, do not give up. Often you can still improve something. But the opposite also happens: you see 50% utilization, yet nothing can be done.

    Let's take a closer look at this.



    The easiest and most convenient resource is the processor. You look at top, it says the processor is at 100%. But keep in mind that this is not really 100% of the processor. The top program does not know what the processor is doing there. It looks from the perspective of the OS scheduler: right now some thread of the program is scheduled on the processor. The processor is doing something, and 100% is what you see when that is averaged over time. But how effectively it is doing it is unclear. It can execute a different number of instructions per cycle; if there are few, the processor itself may be waiting for something, for example for data to be loaded from memory. And top will still show the same thing - 100%. We wait for the processor to execute our instructions; what it does inside is unknown.

    Finally, there is a classic pitfall: you think your program is CPU-bound - which is true - but for some reason the processor is running at a lower frequency. There can be many reasons: overheating, power capping. Maybe the data center imposes a power limit, or power saving is simply enabled. Then the processor constantly switches between a higher and a lower frequency, and if your load is bursty, this is not enough and on average the code executes more slowly. Check the current processor frequency with turbostat. Check for overheating in dmesg - if something like that happened, it will say, roughly: overheating, frequency lowered.
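    For example, a quick check might look roughly like this (the exact output differs between kernel and turbostat versions):

        # Show the actual (not nominal) core frequencies, refreshed every second.
        sudo turbostat --interval 1
        # Look for thermal-throttling messages from the kernel.
        dmesg | grep -i throttl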

    If you want to know how many cache misses there were and how many instructions are executed per cycle, use perf. Record a sample of the program with perf record and then look at it with perf report; perf stat will show you the counters directly.
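    A rough sketch of how that might look (the process name lookup is just an example):

        # Hardware counters (instructions per cycle, cache misses) for a running process.
        sudo perf stat -p $(pidof clickhouse-server) -- sleep 10
        # Record a profile for 10 seconds, then browse it interactively.
        sudo perf record -p $(pidof clickhouse-server) -- sleep 10
        sudo perf report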



    And the other way around. Say you look at top and the processor is less than 50% utilized. Suppose the system has 32 virtual processor cores but only 16 physical ones - on Intel processors, hyper-threading doubles the count. But that does not mean the extra cores are useless. It all depends on the load. Suppose you have some well-optimized linear algebra, or you are hashing for bitcoin mining. Then the code is predictable, many instructions are executed per cycle, there are no cache misses and no branch mispredictions - and hyper-threading does not help. It helps when one core is waiting for something while the other can execute instructions from another thread at the same time.

    ClickHouse has both situations. For example, when we aggregate data (GROUP BY) or filter by a set (IN subquery), we have a hash table. If the hash table does not fit into the processor cache, cache misses occur. This can hardly be avoided. In this case hyper-threading will help us.

    By default, ClickHouse uses only the physical processor cores, not counting hyper-threading. If you know that your query can benefit from hyper-threading, just double the number of threads: SET max_threads = 32 - and your query will run faster.
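    For instance (the thread count fits a machine with 16 physical / 32 logical cores; the table and column are hypothetical):

        SET max_threads = 32;
        SELECT count(DISTINCT UserID) FROM hits;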

    It also happens that the processor is used perfectly well, but you look at a graph and see, for example, 10%. And the graph resolution is, say, five minutes in the worst case; even at one second it is still an averaged value. In reality you constantly had queries, each executed quickly, in 100 ms, every second - and that is normal. ClickHouse tries to execute a query as fast as possible; it does not try to keep your processors fully loaded and overheated all the time.



    Let's look at a slightly more complicated case. There is a query with an IN-subquery expression. Inside the subquery we have 100 million random numbers, and we simply filter against that set.
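    A minimal sketch of that kind of query (not the exact one from the demo) could look like this:

        SELECT count()
        FROM numbers(100000000)
        WHERE number IN
        (
            -- Build a set of 100 million random numbers to filter against.
            SELECT rand64() % 100000000
            FROM numbers(100000000)
        );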

    We see this picture. By the way, who can tell me which tool lets you see this wonderful picture? Exactly right - perf. I am very glad you know it.

    I opened perf thinking that now I would understand everything. I open the assembler listing. It shows how often program execution was at a particular instruction, that is, how often the instruction pointer was there. The numbers are percentages, and it says that almost 90% of the time the instruction test %edx, %edx was being executed - that is, checking four bytes for zero.

    The question is: why can the processor spend so much time simply comparing four bytes with zero? (answers from the audience...) There is no division remainder here. There are bit shifts, and there is a crc32q instruction, but the instruction pointer practically never lands on it. And random number generation is not in this listing at all - it is a separate, well-optimized function, and it does not slow things down. Something else is slow here: execution stalls at this instruction and spends a lot of time there. An idle loop? No - why would I insert empty loops? Besides, if I had inserted an idle loop, it would also be visible in perf. And there is no division by zero, just a comparison with zero.

    The processor has a pipeline and can execute several instructions in parallel. When the instruction pointer is at some place, that does not at all mean it is executing that instruction - it may be waiting for some other instructions.

    We have a hash table to check whether a number occurs in a set. For that we do a lookup in memory. When we do a lookup in memory, we get a cache miss, because the hash table contains 100 million numbers and it certainly does not fit into any cache. So to execute the compare-with-zero instruction, this data must already be loaded from memory - and we wait until it is.



    Now the next resource, a bit more complicated - disks. SSDs are also sometimes called disks, although that is not entirely correct. SSDs are included in this example as well.

    We open, for example, iostat, it shows utilization of 100%.

    At conferences it often happens that a speaker walks on stage and declares with pathos: "Databases are always disk-bound. That is why we made an in-memory database. It will not be slow." If someone comes up to you and talks like that, feel free to send him away. If that causes any problems - say I gave you permission. :)

    Let's say the program is disk-bound and utilization is 100%. But this, of course, does not mean we are using the disks optimally.

    A typical example is simply having a lot of random access. But even when the access is sequential - you just read a file sequentially - it can still be done more or less optimally.

    For example, you have a RAID array of several devices - say, 8 disks. You read sequentially without read-ahead, with a buffer size of 1 MB, and the chunk size of the stripe in your RAID is also 1 MB. Then every read hits a single device, or, if it is not aligned, two devices: half a megabyte goes to one place, half a megabyte to another, and so on - the disks are used in turn: one, then another, then a third.

    You need read-ahead. Or, if you use O_DIRECT, increase the buffer size. The rule of thumb: 8 disks, chunk size 1 MB - set the buffer size to at least 8 MB. But this works optimally only if the reads are aligned; if they are not, extra pieces will be read first, so you need to go higher, multiply by a few more.
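    On Linux, read-ahead can be set per block device; a sketch (the device name and the value are just examples - 8192 sectors of 512 bytes is 4 MB):

        # Show and increase the read-ahead for a RAID device.
        sudo blockdev --getra /dev/md2
        sudo blockdev --setra 8192 /dev/md2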

    Or, for example, you have RAID 10. How fast can you read from RAID 10 built of, say, 8 disks? What is the speedup - fourfold, because half of them are mirrors, or eightfold? In fact it depends on how the RAID was created, on the layout of chunks within the stripes.

    If you use mdadm on Linux, you can specify a near layout or a far layout there; near is better for writes, far for reads.

    I always recommend the far layout, because in an analytical database writes are usually not time-critical - even if there is much more writing than reading, it is done by some background process. Reads, however, need to complete as fast as possible. So it is better to optimize the RAID for reading by choosing the far layout.

    As luck would have it, mdadm on Linux will give you the near layout by default, and you will get only half the read performance. There are lots of pitfalls like this.
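    A sketch of creating such an array with the far layout set explicitly (device names are examples):

        # RAID 10 over 8 disks with the "far" layout (2 copies), optimized for reads.
        sudo mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=8 \
            /dev/sd[b-i]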

    Another terrible pitfall is RAID 5 or RAID 6. They scale well for sequential reads and writes - in RAID 5 the factor is "number of devices minus one". They scale well even for random reads, but they do not scale for random writes: to write to any one place you have to read data from all the other disks, XOR it and write it back. For this a stripe cache is used, and it is a terrible pitfall: in Linux its default size is such that you create a RAID 5 and it is slow. And you will think that RAID 5 is always slow, because that is "well known". But in fact the reason is a wrong setting.
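    The stripe cache size can be changed via sysfs; a sketch (the device name and the value are examples):

        # Default is often 256 pages; raising it helps random writes on RAID 5/6.
        cat /sys/block/md0/md/stripe_cache_size
        echo 4096 | sudo tee /sys/block/md0/md/stripe_cache_size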

    One more example. You are reading from an SSD - you bought yourself a good one whose specification promises 300 thousand random reads per second - and for some reason you cannot reach that. And you think: they are all lying in their specifications, there is no such thing. But all those reads have to be issued in parallel, with the maximum degree of parallelism. The only way to do that optimally enough is asynchronous I/O, implemented with the io_submit, io_getevents, io_setup and related system calls.
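    You can check what the drive actually delivers with fio; a sketch (device path, queue depth and duration are examples; it is deliberately read-only, since writing to raw devices is destructive):

        sudo fio --name=randread --filename=/dev/nvme0n1 --readonly \
            --direct=1 --rw=randread --bs=4k --ioengine=libaio \
            --iodepth=256 --numjobs=4 --runtime=30 --group_reporting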

    By the way, if you store data on disk, it always needs to be compressed. Let me give an example from practice. A person contacted us in the ClickHouse support chat and said:

    - ClickHouse compresses the data, and I see that it is CPU-bound. I have very fast NVMe SSDs that read several gigabytes per second. Can compression be disabled in ClickHouse somehow?
    - No, no way, - I say. - You need to keep the data compressed.
    - Come on, let there just be one more compression codec that does nothing.
    - Easy. Type these characters on this line of code.
    - Indeed, it is all very simple, - he answered a day later. - Done.
    - How much did the performance change?
    - Could not test it, - he wrote a day after that. - There is too much data. It no longer fits on the SSDs.

    Now let's see what reading from disk can look like. We launch dstat; it shows the read speed.
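    The commands behind the screenshots below are roughly these:

        # Overall read/write throughput, refreshed every second.
        dstat
        # Per-device breakdown: utilization, request sizes, MB/s.
        iostat -xm 1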

    The first example of dstat and iostat


    Here is the read column: 300 MB/s. We are reading from the disks. Is that a lot or a little? I don't know yet.

    Now I run iostat to check. Here you can see the breakdown by device. I have a RAID, md2, made of eight hard drives. Each of them shows its utilization, and it does not even reach 100% (50-60%). But the most important thing is that I am reading from each disk at only 20-30 MB/s. And since childhood I have remembered the rule that a hard drive can be read at about 100 MB/s. For some reason this has still not changed much.

    Second example of dstat and iostat


    Here is another example, with more optimal reading. I run dstat, and I get a read speed of 1 GB/s from this RAID 5 of eight drives. What does iostat show? Yes, almost 1 GB/s.

    Now the disks are finally loaded at 100%. Well, for some reason two are at 100% and the rest at 95% - they are probably still slightly different. But from each of them I am reading 150 MB/s, even more than they are supposed to be able to do. What is the difference? In the first case I was reading with an insufficient buffer size, in pieces that were too small. It is simple - I am telling you common truths.

    By the way, if you still think that data does not need to be compressed for an analytical database, see the talk from the HighLoad++ Siberia conference (there is a Habr article based on that talk - ed.). The organizers decided to hold the most hardcore talks in Novosibirsk.



    The next example is memory. Continuing the common truths. First, on Linux never look at what free shows. For those who do, there is a specially created site, linuxatemyram.com - go there, there is an explanation. You don't need to look at the amount of virtual memory either, because what difference does it make how much address space the program has allocated? Look at how much physical memory is actually used.
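    If you do run it, a rough rule of thumb:

        # Look at the "available" column: the page cache counts as "used"
        # but can be reclaimed at any moment, so "free" is misleadingly low.
        free -h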

    And one more pitfall that it is not even clear how to fight. Remember: it is normal that allocators often do not like to give memory back to the system. They call mmap, but they never call munmap, so the memory does not return to the system. The program thinks: I know better how I am going to use this memory, I'll keep it for myself. That is because the mmap and munmap system calls are rather slow: changing the address space and flushing the processor's TLB caches is better avoided. However, the OS still lets you release memory properly with the madvise system call: the address space remains reserved, but the physical memory can be reclaimed.

    And never enable swap on production servers with databases. You think: there is not enough memory, I will enable swap. After that queries simply stop working - they will grind away forever.



    The network has its typical pitfalls too. If you create a TCP connection every time, it takes a while before the right window size is selected, because the TCP protocol does not know in advance how fast it will need to transmit data; it adapts to it.

    Or imagine you are transferring a file, and your network has high latency and noticeable packet loss. Then it is not at all obvious whether using TCP to transfer files is the right thing. I think it is wrong, because TCP guarantees ordering, while you could transfer one half of the file and the other half at the same time. Use at least several TCP connections, or do not use TCP for bulk data transfer at all - say, when you download data, movies and TV shows with torrents, TCP may not be used there. And the data needs to be compressed.

    If you have a 100-gigabit network within a rack, you may skip compression. But if you have 10 gigabits between data centers, especially between Europe and the USA, then who knows how your bytes will crawl under the ocean. Compress them - let fewer bytes crawl.
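    In ClickHouse specifically, compression of data sent over the native protocol is controlled by a setting; a sketch (whether zstd is the right trade-off depends on your link and your CPUs):

        -- Use a stronger codec for traffic between servers over slow links.
        SET network_compression_method = 'zstd';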



    Has everyone seen this picture? If anything in the system is slow, there is a tool for it. You start using these tools, start digging into the problem and, from experience, you find ten other problems. These tools are powerful enough to keep you busy for a very long time.



    When you are told "something is wrong with my database", you go to the server and run all these tools one after another. Of these, iotop is especially worth noting: it shows how much each process actually reads and writes to the disks and how many IOPS it does.

    But even such simple tools have little tricks that not everyone knows about. Let's take a look.

    Example: top by thread


    I run top on a production server, and it shows that clickhouse-server uses some processor resources, some processor cores. I don't know what is going on inside, but I can press Shift+H, and it will show a breakdown by threads. For convenience, the threads in ClickHouse are named. Here is ParalInputsProc - parallel query processing. Or BackgrProcPool - merges and downloading data for replication. Now it is much clearer what the individual processor cores are spent on and what the individual threads are doing.
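    The same view can be reached directly from the command line (the PID lookup is just an example):

        # Show the threads of the server process; thread names appear in the COMMAND column.
        top -H -p $(pidof clickhouse-server)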

    Why are the names so silly? If you have read the ClickHouse sources, you know I don't like that either - I would call it BackgroundProcessingPool. But only 15 bytes are allowed here: 16 minus 1, where 1 is the terminating zero byte. Why 16? Apparently the Linux kernel developers are strong bearded professionals who decided: "16 bytes. That's better." :)

    Here is another example, using the great clickhouse-benchmark program. It comes standard with clickhouse-client - it is not a separate binary, just the same clickhouse-client under a symlink. With it you can take a tape of queries, or a single query, and run a load test: the queries will be executed continuously over a fixed number of connections, and it will print statistics.
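    A sketch of how it is typically invoked (the query, the concurrency and the iteration count are just examples):

        # Run the same query repeatedly over 16 connections and print percentiles.
        echo "SELECT count(DISTINCT UserID) FROM hits" \
            | clickhouse-benchmark --concurrency 16 --iterations 1000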

    Example: clickhouse-benchmark + perf top


    I take a simple query that counts the number of unique visitors. I launch clickhouse-benchmark; it prints how many queries per second I get, the execution percentiles, how many rows are processed per second, and the read speed of compressed data in megabytes per second. I usually use it together with perf top. I go to perf top, and now I have a breakdown by function. Here you can see that at the top there is an insert into a certain hash set designed specifically for computing the uniq function - UniquesHashSet - and aggregation. I can go inside and look at the assembler listing. Usually nothing at all is clear there, but it is good entertainment.

    Here, for example, is a function call through a pointer. Actually I wanted to see something else - the insert into the hash table. There are integer multiplications, bit shifts, XORs and some long constants: this is the hash function being computed. That is, I could, in principle, put a simpler hash function there. But I can't, because in this particular piece of code it is already almost the simplest possible hash function.

    I could put, say, crc32q there. But it is already used elsewhere in the code, and if I also put it here, there would be an undesirable correlation between the hash functions, which would lead to monstrous collisions in the hash table, and the program would become slower.

    I have talked about various resources, but everything above applies not only to ClickHouse - you can use the same rules with any other database to see what is happening inside it. Now I want to talk specifically about ClickHouse.



    Let's start with the most basic thing. To see which queries are currently running - no surprises here - just type SHOW PROCESSLIST. It is the same as SELECT * FROM system.processes. There is a bunch of columns with all sorts of system metrics: memory consumption, how many rows and bytes have been read, and so on. You can even build a simple "top" for ClickHouse out of it.
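    A sketch of such a mini-top (trim the columns to taste):

        SELECT
            query_id,
            user,
            elapsed,
            read_rows,
            formatReadableSize(memory_usage) AS memory,
            query
        FROM system.processes
        ORDER BY elapsed DESC;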

    And what does ClickHouse do internally? Mostly it executes queries and background operations, and background operations are mostly merges. If we are interested in which merges are running, we simply look at SELECT * FROM system.merges.
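    For example:

        SELECT database, table, elapsed, progress, num_parts
        FROM system.merges;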

    Top example


    Let's see how that looks. Here is a production server. I run the magic command - and there it is, a ClickHouse top. This is a real user on a real analytics server. They have some queries that run for - oh horror - more than a minute. You might think they should be stopped and optimized. Some kind of traf_testing - what is it? In fact these queries are deliberately heavily restricted in resources, precisely so that analysts do not interfere with interactive queries. That is easy to arrange in ClickHouse.



    Let's move on. I would like to see not only what is happening at the current moment. Suppose you had problems, everything stalled, and you need to understand what was happening at that point. It is enough to enable query_log. It is turned off by default for one reason only: so that you do not have to worry that, even if you are only reading from a server, only running SELECT queries, some tables are still being written to. You can enable query_log in production everywhere by default - there are no problems with it; on our production servers it is always on. If you want, you can also enable it per session or per query. Each query is logged twice: when it starts running and when it finishes.
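    A sketch of enabling it for a session (the target table itself is configured in the query_log section of the server config, which is typically present in the default config):

        -- Log the queries of the current session into system.query_log.
        SET log_queries = 1;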

    Suppose you are interested in what happens to data parts - merges, inserts, fetches from replicas. For that there is the part_log system table. It can be enabled in the configuration file, and it is also turned off by default.



    query_log is convenient to use together with clickhouse-benchmark: you select the tape of queries that you had and feed it to clickhouse-benchmark on stdin.
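    Roughly like this (the format and filters are examples; queries containing newlines need a bit more care, since clickhouse-benchmark reads one query per line):

        clickhouse-client --query "
            SELECT query FROM system.query_log
            WHERE type = 'QueryFinish' AND query LIKE 'SELECT%'
        " --format TSVRaw | clickhouse-benchmark --concurrency 8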

    Or query_log is very convenient for finding the first heavy query after which the system stopped working well enough.
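    A sketch of hunting for such queries:

        SELECT event_time, query_duration_ms, memory_usage, query
        FROM system.query_log
        WHERE type = 'QueryFinish'
        ORDER BY query_duration_ms DESC
        LIMIT 10;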



    Often you need to understand what a query is doing right now, while it is running. For that, queries can be traced: just set send_logs_level = 'trace' in the client session, and you will get the execution logs from all the servers involved in processing the query.

    Example: query tracing


    Let's see how it looks. The query runs and quickly reaches 98%, and I want to understand what it is doing in the remaining time. It is very simple: I type SET send_logs_level = 'trace', start executing, and a lot of noise is printed. But at the end it becomes visible - merging aggregated data, I managed to notice. That is what it is doing during the remaining 1% of the time. Before, you would not even have thought about it; now everything is clear.

    This log also prints the query identifier, which can then be used to find the query in query_log.

    Let's take a look: SELECT * FROM system.query_log for this query - two entries. True, the trace output is now in the way, so I turn it off and see what is in query_log. Here it is. There are all sorts of metrics of query execution: how many times files were opened, the number of compressed blocks, the number of hits in the mark cache, and so on.
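    A sketch of that lookup (the query_id value is a placeholder - take it from the trace output):

        SELECT type, event_time, query_duration_ms, read_rows,
               formatReadableSize(read_bytes) AS read,
               formatReadableSize(memory_usage) AS memory
        FROM system.query_log
        WHERE query_id = '<query_id from the trace>'
        ORDER BY event_time;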



    There are also resource counters right inside ClickHouse. Most importantly, they are collected globally, across all queries, and are available in the tables system.events, system.metrics and system.asynchronous_metrics. Events are simply incrementing counters: how many times files have been opened - 100 times; how many queries have been executed since the server started - 10. And system.metrics shows what is happening right now: for example, right now 10 queries are executing simultaneously and together use 10 GB of memory.

    I named the system.asynchronous_metrics table that way out of sheer spite, because it is very hard to type - I am a man of principle. Now I struggle to type it every time myself, also out of spite. Anyway, system.asynchronous_metrics are metrics that are simply collected at some interval, say once a minute.
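    For example:

        -- Global incrementing counters since server start.
        SELECT event, value FROM system.events WHERE event LIKE 'FileOpen%';
        -- What is going on right now.
        SELECT metric, value FROM system.metrics WHERE metric IN ('Query', 'Merge');
        -- Periodically collected metrics.
        SELECT metric, value FROM system.asynchronous_metrics ORDER BY metric LIMIT 10;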

    All of the listed counters are available not only globally, but also per query. You can look at SHOW PROCESSLIST and find the counters of a specific running query there, or look at query_log, where the same counters are stored for completed queries.



    Let's see what kinds of counters there are. There are those that the program collects itself - for example, the number of file opens: we know when we open a file, and at that moment we increment the counter. And there are those obtained from the Linux kernel, which are much more advanced. This is what the Linux kernel thinks about our program - I don't even want to know what it thinks about us. It counts, for instance, how many bytes we read, and there the metrics are quite different.

    Take OSReadChars and OSReadBytes. What is the difference? The first shows how many bytes we read through the file system, where part of the bytes could come from the page cache rather than from a real disk. The second shows how many bytes were really read from block devices. The OS tries to make the cache as transparent as possible, so that we just read from files and do not know where the data actually comes from. Fortunately, these more detailed metrics can still be obtained from the kernel directly.
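    Globally they can be seen in system.events; for example:

        SELECT event, value
        FROM system.events
        WHERE event IN ('OSReadChars', 'OSReadBytes');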

    Page cache example


    Let's see how it looks. Here we execute some query. It seems fine: almost 40 million rows per second, 6.7 seconds. By the way, the data in the test tables are fake - I deliberately mangled them, which is why they look strange.

    We run the same thing a second time - now only 1.3 seconds, five times faster. Why? I think the answer is obvious to everyone: because the second time the page cache was used. But how can we see this from our counters?

    Example: a more complicated query


    Let's execute a less trivial query. Here are all our metrics for the first run. For example, it says that while executing the query, one query was executed - that is there for completeness. But there are useful metrics too: 3.2 GB were read from disk and 2.5 GB from the file system. It is interesting, by the way, that this time more was read from disk than we asked for. Why? First hypothesis: read-ahead - we wanted to read a little data, and the OS read more just in case it came in handy. Second: reads from disk happen in some minimal units - 4 KB or, say, 512 bytes, depending on the settings - and if a read is not aligned, a little extra tail gets read as well. But most likely the difference is due to read-ahead.



    That was our first run. For the second run the metrics are different, and from them it is clearly visible why the query became faster. For example, OSReadBytes - the number of bytes actually read from disk - differs: it was 3 GB and became only 3 MB. I don't know why it is not exactly zero, but it is almost zero.

    Another interesting metric is IOWait: 87 seconds, while the query ran for almost 7 seconds. Why? Right - we have many threads, and the metric is summed over every thread. Each thread waited for data to arrive from disk, and in total that came to 87 seconds. On the second run we hardly waited at all - some milliseconds.

    Another metric is CPUWait. This is not the time of using the processor, but the time during which our thread was ready to run and the OS had not yet scheduled it onto a processor - perhaps because there were too many other threads. That is how you see CPU starvation. In this example the threads do not starve for CPU: whenever they are ready to execute instructions, they get to execute them. In addition there are simpler metrics - for example, processor time spent in user space. Between the two runs they barely differ, though for some reason it is a bit higher the second time. Oh well.

    And there is CPU time spent in the Linux kernel. If you make system calls and something complicated happens inside them, it is accounted for here. When reading from disk, of course, a little processor time is also spent on auxiliary work.

    Example: query_thread_log


    And now the most advanced thing we have: query_thread_log. With it you can understand exactly what each thread of query execution spent its time on.

    I find my query, filter by query_id and select the metric "CPU time spent in user space". Here are our threads. 16 threads were allocated for parallel processing of the query, and each of them spent 800 ms. Then another 16 threads were allocated to merge the states of the aggregate functions, and 0.25 s was spent in each of them. Now I can see exactly where each query spent its time.
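    A sketch of that selection (assuming a recent ClickHouse where per-thread counters live in the ProfileEvents map column; the query_id value is a placeholder):

        SELECT
            thread_name,
            ProfileEvents['UserTimeMicroseconds'] AS user_time_us
        FROM system.query_thread_log
        WHERE query_id = '<query_id>'
        ORDER BY user_time_us DESC;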

    Video of the talk at HighLoad++:

