A Tale of How One Engineer Sped Up the HTTP/2 Client

    Using the example of "JEP 110: HTTP/2 Client" (which will appear in a future JDK), Sergey Kuksenko from Oracle shows how the team approached performance work on it, where they looked, and what they changed to make it faster.

    We offer you a transcript of his talk from JPoint 2017. We will not really discuss HTTP/2 itself here, although, of course, we will not manage entirely without some of its details.



    HTTP/2 (aka RFC 7540)


    HTTP/2 is a new standard designed to replace the legacy HTTP 1.1. How does an HTTP/2 implementation differ from the previous version in terms of performance?

    The key thing about HTTP 2 is that we have one single TCP connection. Data streams are cut into frames, and all these frames are sent through this connection.

    A separate header compression standard, RFC 7541 (HPACK), comes along with it. It works very well: it can compress an HTTP header on the order of a kilobyte down to roughly 20 bytes. For some of our optimizations this is important.

    In general, the new version has a lot of interesting things — request prioritization, server push (when the server itself sends data to the client) and so on. However, in the context of this story (in terms of performance) that is not important. Besides, many things have stayed the same, for example what HTTP looks like from above: the same GET and POST methods, the same HTTP header field values, status codes and the "request -> response -> final response" structure. In fact, if you look closely, HTTP/2 is just a low-level transport layer under HTTP 1.1 that removes its shortcomings.

    HTTP API (aka JEP 110, HttpClient)


    We have an HttpClient project known as JEP 110. It almost made it into JDK 9. Initially we wanted to make this client part of the JDK 9 standard, but there were some disputes at the level of the API design. And since we did not have time to finalize the HTTP API by the JDK 9 release, we decided to ship it in a form that could be shown to the community and discussed.

    JDK 9 introduces incubator modules (Incubator Modules, aka JEP 11). This is a sandbox where new, not yet standardized APIs are added in order to gather feedback from the community; by the definition of the incubator, they will either be standardized in the next release or removed altogether ("The incubation lifetime of an API is limited: It is expected that the API will either be standardized or otherwise made final in the next release, or else removed"). Everyone who is interested can get familiar with the API and send feedback. Perhaps by the next release, JDK 10, where it should become standard, everything will be polished.

    • module: jdk.incubator.httpclient
    • package: jdk.incubator.http

    HttpClient is the first module in the incubator. Later, other things will appear in the incubator as well.

    Let me show a couple of API examples (this is the client API that lets you make requests). The main classes:

    • HttpClient (its Builder);
    • HttpRequest (its Builder);
    • HttpResponse, which we do not build ourselves but simply get back.

    Here is a simple way to build a request:

    HttpRequest getRequest = HttpRequest
        .newBuilder(URI.create("https://jpoint.ru/"))
        .header("X-header", "value")
        .GET()
        .build();
    

    HttpRequest postRequest = HttpRequest
        .newBuilder(URI.create("https://jpoint.ru/"))
        .POST(fromFile(Paths.get("/abstract.txt")))
        .build();
    

    Here we specify the URL, set a header and so on — and we get a request.
    How do we send a request? The client has two kinds of API. The first is a synchronous request, where we block at the point of this call.

    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = ...;
    HttpResponse<String> response =
        // synchronous/blocking
        client.send(request, BodyHandler.asString());
    if (response.statusCode() == 200) {
        String body = response.body();
        ...
    }
    ...
    

    The request went out, we got a response, interpreted it as a string (the handler here can vary — string, bytes, or you can write your own) and processed it.

    The second is the asynchronous API, for when we do not want to block at this point: we send an asynchronous request, continue execution, and later do whatever we want with the returned CompletableFuture:

    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = ...;
    CompletableFuture<HttpResponse<String>> responseFuture =
        // asynchronous
        client.sendAsync(request, BodyHandler.asString());
    ...
    

    The client can be given a thousand and one configuration parameters and configured in all sorts of ways:

    HttpClient client = HttpClient.newBuilder()
    .authenticator(someAuthenticator)
    .sslContext(someSSLContext)
    .sslParameters(someSSLParameters)
    .proxy(someProxySelector)
    .executor(someExecutorService)
    .followRedirects(HttpClient.Redirect.ALWAYS)
    .cookieManager(someCookieManager)
    .version(HttpClient.Version.HTTP_2)
    .build();
    

    Another key feature is that the client API is universal: it works with both the old HTTP 1.1 and HTTP/2 without exposing the differences. You can tell the client to use HTTP/2 by default, and the same parameter can also be set for each individual request.
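
    For example, a sketch of that last point (assuming the request builder also exposes a version() setter, as the later standardized java.net.http API does — treat the exact method as an assumption here):

    HttpClient client = HttpClient.newBuilder()
            .version(HttpClient.Version.HTTP_2)      // HTTP/2 by default
            .build();

    HttpRequest legacyRequest = HttpRequest
            .newBuilder(URI.create("https://jpoint.ru/"))
            .version(HttpClient.Version.HTTP_1_1)    // override for this one request
            .GET()
            .build();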

    Problem Statement


    So, we have a Java library — a separate module built on standard JDK classes — and we need to optimize it (do some performance work). Formally, the performance task is this: get reasonable client performance for an acceptable amount of engineering time.

    Choosing an approach


    Where can we start this work?

    • We can sit down and read the HTTP 2 specification. This is useful.
    • We can begin to study the client itself and rewrite the shit that we find.
    • We can just look at this client and rewrite it in its entirety.
    • We can benchmark.

    Let's start with benchmarking. Who knows — maybe everything is already so good that we won't even have to read the specification.

    Benchmarks


    We wrote a benchmark. It is good to have a competitor to compare against, and I took the Jetty client as the competitor. I put a Jetty server on the other side — simply because I wanted the server to be in Java — and wrote GET and POST requests of different sizes.



    The natural question is what to measure: throughput or latency (minimum, average). During the discussion we decided that this is a client, not a server, which means that minimum latency, GC pauses and all the rest are not important in this context. So specifically for this work we decided to confine ourselves to measuring the overall throughput of the system. Our task is to increase it.

    The overall throughput of such a benchmark is the inverse of the average latency (with a fixed number of in-flight requests, throughput = concurrency / average latency). That is, we effectively worked on the average latency, but did not worry about every individual request — simply because a client does not have the same requirements as a server.

    Alteration 1. TCP Configuration


    We run GET for 1 byte of data. The hardware is listed on the slide. We get:



    I take the same benchmark for HttpClient and run it on other operating systems and hardware (these are more or less server machines). I get:



    In Win64, everything looks better. But even on MacOS, things are not as bad as on Linux.

    The problem is here:

    SocketChannel chan;
    ...
    try {
        chan = SocketChannel.open();
        int bufsize = client.getReceiveBufferSize();
        chan.setOption(StandardSocketOptions.SO_RCVBUF, bufsize);
    } catch (IOException e) {
        throw new InternalError(e);
    }
    

    This is where the SocketChannel for the connection to the server is opened. The problem is the absence of one line (I have highlighted it in the code below):

    SocketChannel chan;
    ...
    try {
        chan = SocketChannel.open();
        int bufsize = client.getReceiveBufferSize();
        chan.setOption(StandardSocketOptions.SO_RCVBUF, bufsize);
        chan.setOption(StandardSocketOptions.TCP_NODELAY, true); // <-- !!!
    } catch (IOException e) {
        throw new InternalError(e);
    }
    

    TCP_NODELAY is a "hello" from the last century. The TCP stack has various algorithms; two matter in this context: Nagle's algorithm and delayed ACK. Under certain conditions they interact badly, causing a sharp slowdown in data transfer. This is such a well-known issue with the TCP stack that people enable TCP_NODELAY, which turns off Nagle's algorithm, by default. But sometimes even an expert (and real TCP experts wrote this code) can simply forget about it and not add this line.

    There are plenty of explanations on the Internet of how these two algorithms conflict and why they cause this problem. Here is a link to one article that I liked: TCP Performance problems caused by interaction between Nagle's Algorithm and Delayed ACK.

    A detailed description of this problem is beyond the scope of our conversation.

    After that single line enabling TCP_NODELAY was added, we got roughly this performance gain:



    I will not even calculate the percentage.

    Moral: this is not a Java problem; it is a problem of the TCP stack and its configuration. Many areas have well-known pitfalls — so well known that people forget about them. It pays simply to know them. If the area is new to you, you can easily google its main pitfalls and check for them quickly and painlessly.

    You need to know (and not forget) the list of well-known pitfalls for your subject area.

    Alteration 2. Flow-control window


    We have the first change, and I did not even have to read the specification. We got 9600 requests per second — but remember that Jetty gives 11 thousand. Next, we profile with any profiler.

    Here's what I got:



    And this is a filtered option:



    My benchmark takes 93% of the CPU time.

    Sending a request to the server takes 37%. Then come all the internal details — working with frames — and at the end 19% is the write to our SocketChannel. We transfer the request data and headers, as HTTP requires. And then we read readBody().

    Next we must read the data that came to us from the server. So what is this?



    If the engineers named the methods correctly — and I trust them — then something is being sent to the server here, and it takes as much time as sending our requests. Why are we sending anything while reading the server's response?

    To answer this question, I had to read the specification.

    In general, many performance problems are solved without knowing the specification. Somewhere you need to replace ArrayList with LinkedList or vice versa, or Integer with int, and so on. In that sense it is very good to have a benchmark: measure, fix, it works — and you do not go into the details of how it all works according to the specification.

    But in our case the problem really did come out of the specification: the HTTP/2 standard has so-called flow control. It works as follows. We have two peers: one sends data, the other receives. The sender has a flow-control window of a certain number of bytes (suppose 16 KB).



    Suppose we sent 8 KB. The flow-control window is reduced by these 8 KB.



    After we sent another 8 KB, the flow-control window became 0 KB.



    By the standard, in this situation we have no right to send anything. If we try to send some data anyway, the recipient is required to interpret it as a protocol error and close the connection. This is a kind of protection against DDoS in some cases, so that nothing extra is sent to us and the sender adapts to the recipient's bandwidth.



    When the receiver has processed the received data, it sends a special dedicated signal called WindowUpdate indicating by how many bytes to increase the flow-control window.



    When the WindowUpdate reaches the sender, its flow-control window grows and we can send more data.
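
    A minimal sketch (illustrative, not the client's actual code) of the sender-side bookkeeping described above:

    // Tracks how many bytes we may still send; WindowUpdate frames refill it.
    final class FlowControlWindow {
        private int available;                       // bytes we may still send

        FlowControlWindow(int initialSize) { this.available = initialSize; }

        // Called before sending a data frame: wait until the window allows it.
        synchronized void consume(int bytes) throws InterruptedException {
            while (available < bytes) {
                wait();                              // window exhausted: sending is forbidden
            }
            available -= bytes;
        }

        // Called when a WindowUpdate frame arrives from the peer.
        synchronized void update(int increment) {
            available += increment;
            notifyAll();                             // sending may resume
        }
    }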

    What is going on in the client?
    We got data from the server — here is the actual piece of processing:

    // process incoming data frames
    ...
    DataFrame dataFrame;
    do {
        dataFrame = inputQueue.take();
        ...
        int len = dataFrame.getDataLength();
        sendWindowUpdate(0, len);        // update connection window
        sendWindowUpdate(streamid, len); // update stream window
    } while (!dataFrame.getFlag(END_STREAM));
    ...
    

    A dataFrame arrives — a data frame. We look at how much data is in it, process it, and send WindowUpdate back to increase the flow-control window by the corresponding amount.

    In fact, two flow-control windows operate at each such point: a flow-control window for this particular data stream (request), and a common flow-control window for the entire connection. Therefore we must send two WindowUpdate frames.

    How to optimize this situation?

    First. At the end of the while there is a flag that tells us the last data frame has been sent. By the standard, this means no more data will come. So we do this:

    // process incoming data frames
    ...
    DataFrame dataFrame;
    do {
        dataFrame = inputQueue.take();
        ...
        int len = dataFrame.getDataLength();
        connectionWindowUpdater.update(len);
        if (dataFrame.getFlag(END_STREAM)) {
            break;
        }
        streamWindowUpdater.update(len);
    } while (true);
    ...
    ... 
    

    This is a small optimization: if we caught the end-of-stream flag, we no longer need to send WindowUpdate for this stream — we are not waiting for any more data, the server will not send anything.

    Second. Who says we must send WindowUpdate every single time? Why can't we process the data from many received frames and only then send WindowUpdate for all of them in one batch?

    Here is the WindowUpdater that works on a specific flow-control window:

    final AtomicInteger received;
    final int threshold;
    ...
    void update(int delta) {
        if (received.addAndGet(delta) > threshold) {
            synchronized (this) {
                int tosend = received.get();
                if (tosend > threshold) {
                    received.getAndAdd(-tosend);
                    sendWindowUpdate(tosend);
                }
            }
        }
    }
    

    We have a certain threshold. We receive data and send nothing. As soon as the accumulated data reaches this threshold, we send a single WindowUpdate for all of it. There is a heuristic that works well when the threshold is close to half of the flow-control window. If the window was initially 64 KB and we receive 8 KB at a time, then as soon as we have received data frames totalling 32 KB, we immediately send a window update for 32 KB. Normal batch processing. For correct synchronization we also do a perfectly ordinary double-check.

    For a request of 1 byte we get:



    The effect is there even for megabyte requests, where there are many frames, but of course it is not as noticeable. In practice I had different benchmarks and requests of different sizes, but I did not draw a graph for every case here — I picked simple examples. A more detailed summary of the data comes a bit later.

    We got only +23%, but we have already overtaken Jetty.

    Moral: careful reading of the specification and logic are your friends.

    There is a nuance in the specification. On the one hand it says that, having received a data frame, we must send WindowUpdate. But reading carefully, we see there is no requirement to send a WindowUpdate for every byte received. So the specification allows this kind of batched update of the flow-control window.

    Alteration 3. Locks


    Let's see how well we scale.

    A laptop is not very suitable for scaling experiments — it has only two real cores plus hyper-threading. So we take a server machine with 48 hardware threads and run the benchmark.

    Here the horizontal axis is the number of threads, and the vertical axis shows the total throughput.



    Here you can see that up to four threads we scale very well. But further, scalability becomes very poor.

    Why would we even need this? We have one client; we could fetch the data we need from the server in a single thread and be done with it. But first, we have an asynchronous version of the API — we will come back to it — and there will certainly be multiple threads there. Second, everything in our world is multicore now, and being able to work well from many threads is simply useful for our library — if only because when someone starts complaining about the performance of the single-threaded version, we can advise them to go multi-threaded and get a benefit. So let's find the culprit of the poor scalability. I usually do it like this:

    #!/bin/bash
    (java -jar benchmarks.jar BenchHttpGet.get -t 4 -f 0 &> log.log) &
    JPID=$!
    sleep 5
    while kill -3 $JPID;
    do
        :
    done
    

    I just dump stack traces to a file. In reality this is enough for me in 90% of cases when I work with locks, without any profilers. Only in some complicated, tricky cases do I launch Mission Control or something else and look at the distribution of locks.

    In the log you can see what state the various threads are in:



    Here we are interested specifically in locking (BLOCKED), not in waiting, where we are expecting events. There are 30 thousand blocked samples, which is quite a lot against 200 thousand runnable.

    And a command line like this will show us the culprit right away (nothing else is needed — just the command line):



    The culprit is caught. It is a method inside our library that sends a frame to the server. Let's figure it out.

    void sendFrame(Http2Frame frame) {
        synchronized (sendlock) {
            try {
                if (frame instanceof OutgoingHeaders) {
                    OutgoingHeaders oh = (OutgoingHeaders) frame;
                    Stream stream = registerNewStream(oh);
                    List frames = encodeHeaders(oh, stream);
                    writeBuffers(encodeFrames(frames));
                } else {
                    writeBuffers(encodeFrame(frame));
                }
            } catch (IOException e) {
                ...
            }
        }
    }
    

    Here we have a global monitor:



    And this branch —

    — is the start of initiating a request: sending the very first headers to the server (some additional actions are required here; I will talk about them in a moment).

    This is sending to the server all the other frames:



    All of this is under a global lock!

    sendFrame takes, on average, 55% of the time.



    But this method takes 1%:



    Let's try to figure out what can be taken out of the global lock.

    Registering a new stream cannot be taken out of the lock. The HTTP/2 standard imposes a restriction on stream numbering: in registerNewStream the new stream gets its number. If, to transfer my data, I initiated streams with numbers 15, 17, 19, 21 and then sent 21 first and 15 after it, that would be a protocol error — I must send them in ascending order of number. If I move this out of the lock, they may not be sent in the order I expect.

    The second problem that cannot be removed from the lock:



    Here, the header is compressed.

    In its raw form our header is just an ordinary key-value map (string to string). In encodeHeaders the header gets compressed. And here is the second rake of the HTTP/2 standard: the HPACK algorithm is stateful compression, i.e. it has state (which is why it compresses so well). If I send two requests (two headers), compressing one first and then the other, the server must receive them in exactly that order; if it receives them in a different order, it cannot decode them. This is a serialization point. All header encoding for all HTTP requests must pass through a single serialization point; it cannot run in parallel, and even after that the encoded frames must be sent in the same order.

    The encodeFrame method takes 6% of the time, and in theory it can be taken out of the lock.



    encodeFrames writes the frame into byte buffers in the form defined by the specification (before that we prepared the internal frame structure). It takes 6% of the time.

    Nothing prevents us from taking encodeFrames out of the lock — except the method where the actual write to the socket happens:



    There are some nuances of implementation.

    It turned out that encodeFrames can encode a frame not into one but into several byte buffers. This is mainly for efficiency (to avoid extra copying).

    If we take writeBuffers out of the lock and the buffers of two frames get interleaved, the frames cannot be decoded. I.e. we must provide some kind of atomicity. Also, inside writeBuffers a socketWrite is performed, and that has its own global lock on writing to the socket.

    Let's do the first thing that comes to mind: a queue. We put the byte buffers into this queue in the correct order and let another thread read from it.

    In this case the writeBuffers method leaves this thread entirely. There is no need to keep it under this lock (it has its own global lock). The main thing for us is to preserve the order of the byte buffers that go into the queue.

    So we moved one of the heaviest operations out of the lock and launched an additional thread. The critical section is now 60% smaller.

    But the implementation also has downsides:

    • for some frames the HTTP/2 standard imposes ordering restrictions, but other frames may be sent earlier. The same WindowUpdate I could send ahead of the others — and I would like to, because the server is standing idle waiting (its flow-control window is 0). However, this implementation does not allow it;
    • the second problem is that when our queue becomes empty, the sending thread falls asleep and takes a long time to wake up.

    Let's solve the first problem with the order of frames.

    The obvious idea is a Deque. We have an indivisible group of byte buffers that must not be interleaved with anything; we put it into an array, and the array itself into the deque. The arrays can then be interleaved with each other, and where we need a fixed order we preserve it:



    • ByteBuffer[] — an atomic sequence of buffers;
    • WindowUpdateFrame — we can put it at the head of the deque and take it out of the lock entirely (it needs neither HPACK encoding nor stream numbering);
    • DataFrame — can also be taken out of the lock and put at the tail of the deque. As a result, the lock keeps getting smaller and smaller.

    Pros:

    • fewer locks;
    • sending Window Update early allows the server to send data earlier.

    But one more minus remained: as before, the sending thread often falls asleep and takes a long time to wake up.

    Let's do this:



    We keep our own small queue. We add the arrays of byte buffers to it. After that we arrange a race between all the threads that have come out from under the lock: whoever wins writes to the socket, and the rest go on with their own work.
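
    A minimal sketch of this "whoever wins the race writes to the socket" idea (names are illustrative, this is not the JDK code):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;
    import java.util.Deque;
    import java.util.concurrent.ConcurrentLinkedDeque;
    import java.util.concurrent.locks.ReentrantLock;

    final class FrameWriter {
        private final SocketChannel channel;
        private final Deque<ByteBuffer[]> queue = new ConcurrentLinkedDeque<>();
        private final ReentrantLock writeLock = new ReentrantLock();

        FrameWriter(SocketChannel channel) { this.channel = channel; }

        // Called by any thread that has already encoded a frame outside the lock.
        void enqueueAndFlush(ByteBuffer[] frameBuffers) throws IOException {
            queue.addLast(frameBuffers);        // the buffers of one frame stay together
            flush();
        }

        private void flush() throws IOException {
            // Race for the socket: the winner drains the queue, the losers keep working.
            while (!queue.isEmpty() && writeLock.tryLock()) {
                try {
                    ByteBuffer[] group;
                    while ((group = queue.pollFirst()) != null) {
                        channel.write(group);   // gathering write: the frame stays atomic
                    }
                } finally {
                    writeLock.unlock();
                }
                // Loop again: something may have been enqueued while we held the lock.
            }
        }
    }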

    It should be noted that there was one more optimization in the flush() method that pays off: if there is a lot of small data (say, ten arrays of three or four buffers each) and an encrypted SSL connection, flush() can take more than one array from the queue — larger chunks at a time — and hand them to SSLEngine. This sharply reduces the cost of encryption.

    Each of the three optimizations described helped solve the scaling problem quite well. Something like this (the overall effect is shown):



    Moral: locks are evil!

    Everyone knows you need to get rid of locks. Moreover, the java.util.concurrent library keeps getting more advanced and more interesting.

    Alteration 4. Pool or GC?


    In theory our HTTP Client was designed for 100% use of ByteBufferPool. But in practice... There were bugs: here something crashed, there a frame did not work... And if a ByteBuffer was not returned to the pool, functionality did not break... In short, the engineers never got around to finishing it, and we ended up with a half-finished version built around pools. We have (and weep):

    • only 20% of buffers return to the pool;
    • ByteBufferPool.getBuffer() takes 12% of the time.

    We get all the disadvantages of working with pools and, at the same time, all the disadvantages of working without pools. This version has no upsides. We need to move forward: either make a proper, full-fledged pool so that every ByteBuffer returns to it after use, or cut the pools out entirely — yet they are even exposed in the public API.

    What do people think about pools? Here is what you can hear:

    • the pool is not needed; pools are generally harmful! e.g. Dr. Cliff Click, Brian Goetz, Sergey Kuksenko, Aleksey Shipilëv, ...

    • others claim that pools are great and really pay off. You need a pool! e.g. Netty (blog.twitter.com/2013/netty-4-at-twitter-reduced-gc-overhead), ...

    DirectByteBuffer or HeapByteBuffer


    Before we return to the question of pools, we need to settle a sub-question: within our HttpClient task, what do we use — DirectByteBuffer or HeapByteBuffer?

    First, we study the question theoretically:

    • DirectByteBuffer is better for I/O:
      sun.nio.* copies a HeapByteBuffer into a DirectByteBuffer;
    • HeapByteBuffer is better for SSL:
      SSLEngine works directly with byte[] in the case of HeapByteBuffer.

    Indeed, DirectByteBuffer is better for transferring data to a socket: if we follow the write chain down into nio, we see code that copies everything from a HeapByteBuffer into an internal DirectByteBuffer, whereas if we arrive with a DirectByteBuffer, nothing is copied.

    But we have another thing — the SSL connection. The HTTP/2 standard allows both plain and SSL connections, but SSL is declared the de facto standard for the new web. If you follow the chain of how this is implemented in OpenJDK, it turns out that in theory SSLEngine works better with HeapByteBuffer, because it can reach into the byte[] array and encrypt right there. With a DirectByteBuffer it must first copy the data in and then back out.

    And measurements show that HeapByteBuffer is always faster:

    • PlainConnection — HeapByteBuffer is "faster" by 0%-1%. I put it in quotes because 0-1% is not really faster. But there is no gain from DirectByteBuffer either, and it brings more problems;
    • SSLConnection — HeapByteBuffer is 2%-3% faster.

    I.e. HeapByteBuffer is our choice!

    Oddly enough, reading and copying from a DirectByteBuffer is more expensive because of the remaining checks. That code does not vectorize very well, because it goes through Unsafe. Copying from a HeapByteBuffer, on the other hand, is an intrinsic (not even vectorization), and soon it will work even better.

    So even if HeapByteBuffer were 2-3% slower than DirectByteBuffer, it might still not be worth bothering with DirectByteBuffer. Let's consider this question closed.

    We will try several options.

    Option 1: Everything in the pool


    • We write a proper pool. We carefully track the life path of every buffer so that it returns to the pool.
    • We optimize the pool itself (based on ConcurrentLinkedQueue); a minimal sketch of such a single-size pool follows this list.
    • We split the pools by buffer size. The question is what size a buffer should be. I read that Jetty made a universal ByteBufferPool that works with byte buffers of different sizes at 1 KB granularity. We only need three different ByteBufferPools, each working with its own single size — and when a pool serves buffers of only one size, everything becomes much simpler:
      • SSL packets (SSLSession.getPacketBufferSize());
      • header encoding (MAX_FRAME_SIZE);
      • everything else.
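
    The sketch promised above — a trivial single-size pool backed by ConcurrentLinkedQueue (illustrative only, not the actual JDK class):

    import java.nio.ByteBuffer;
    import java.util.concurrent.ConcurrentLinkedQueue;

    final class SingleSizeByteBufferPool {
        private final ConcurrentLinkedQueue<ByteBuffer> queue = new ConcurrentLinkedQueue<>();
        private final int bufferSize;

        SingleSizeByteBufferPool(int bufferSize) { this.bufferSize = bufferSize; }

        ByteBuffer getBuffer() {
            ByteBuffer b = queue.poll();
            return (b != null) ? b : ByteBuffer.allocate(bufferSize);   // allocate on miss
        }

        void returnBuffer(ByteBuffer b) {
            b.clear();             // reset position/limit before reuse
            queue.offer(b);
        }
    }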

    Pluses of option 1:

    • less "allocation pressure"

    Minuses:

    • really complex code. Why didn't the engineers finish this solution the first time? Because figuring out how a ByteBuffer travels back and forth, and when it can safely be returned to the pool without breaking anything, is quite a problem. I saw the attempts of some people to bolt reference counting onto these buffers. I knocked them on the head: it made the code even more complex and did not solve the problem;
    • really poor data locality;
    • the cost of the pool itself (and it also hurts scalability);
    • frequent data copying, including:
    • the practical impossibility of using ByteBuffer.slice() and ByteBuffer.wrap(). If we have a ByteBuffer from which we need to cut out some middle piece, we can either copy it or call slice(). slice() does not copy data: we cut out a piece but reuse the same backing array (a tiny illustration follows this list). It reduces copying, but with pools it turns into a complete mess. In theory it could be made to work, but then reference counting is definitely unavoidable. Suppose I read a 128 KB chunk from the network containing five data frames of 128 bytes each, and I need to cut the data out and hand it to the user — and nobody knows when the user will release it. Yet all of this is one byte buffer: all five pieces must die before the buffer can return to the pool. None of the participants volunteered to implement that, so we honestly copied the data. I think the cost of fighting the copying is not worth the growing complexity of the code.
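
    The illustration mentioned above — what slice() gives us (the offsets here are made up):

    import java.nio.ByteBuffer;

    final class SliceDemo {
        // Cut a payload out of a larger read buffer without copying the backing array.
        static ByteBuffer payloadOf(ByteBuffer readBuf, int offset, int length) {
            readBuf.position(offset);
            readBuf.limit(offset + length);
            return readBuf.slice();   // shares the same backing array, no copy
        }
    }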

    Option 2: No pools — we have GC


    GC will do all the work, especially since we use HeapByteBuffer rather than DirectByteBuffer.

    • we remove all the pools, including from the public API, because they really carry no functionality beyond being an internal implementation detail;
    • and of course, since GC now collects everything for us, we do not need to copy data — we actively use ByteBuffer.slice()/wrap() to cut and wrap buffers.

    Pros:

    • the code really has become easier to understand;
    • no pools in the "public API";
    • we have a good “data locality”;
    • a significant reduction in copying costs — and it all works;
    • no pool costs.

    But two problems:

    • first, higher allocation pressure;
    • second, we often do not know which buffer we will need. We read from the network, from I/O, from the socket; we allocate a 32 KB buffer — fine, even 16 KB — and read 12 bytes from the network. What do we do with this buffer next? Just throw it away. We get inefficient memory usage (when the required buffer size is unknown): 16 KB allocated for the sake of 12 bytes.

    Option 3: Mix


    For the sake of experiment we also make a mixed version. I will tell a bit more about it. Here we choose the approach depending on the data.

    Outgoing data:

    • user data. We know its size (except for HPACK header encoding), so we always allocate buffers of exactly the right size — no memory is wasted. We can slice and wrap as much as we like without extra copying, and let the GC collect it all;
    • for HTTP header compression — a separate pool from which a byte buffer is taken and to which it is then returned;
    • everything else — buffers of the required size (GC will collect them).

    Incoming data:

    • reading from the socket — a buffer from a pool, of some reasonable size, 16 or 32 KB;
    • data arrived (DataFrame) — slice() (GC will collect it);
    • everything else — return to the pool.

    In total there are nine frame types in the HTTP/2 standard. If one of the other eight arrives (anything but data), we decode it right out of the byte buffer, nothing needs to be copied, and the byte buffer goes back to the pool. If data arrives, we do a slice so that nothing needs to be copied, and then simply drop the buffer — GC will collect it.

    Plus a separate pool for the buffers of an encrypted SSL connection, because they have their own size.
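
    As an illustration only (decodeFrame, deliverToUser and the pool here are assumed names, not the real jdk.incubator.http internals), the mixed strategy for incoming frames boils down to something like this:

    // Illustrative fragment, not the actual client code.
    ByteBuffer readBuf = readBufferPool.getBuffer();   // pooled, e.g. 16-32 KB
    channel.read(readBuf);
    readBuf.flip();
    Http2Frame frame = decodeFrame(readBuf);           // assumed decoding helper
    if (frame instanceof DataFrame) {
        // Data frame: slice the payload out without copying and hand it on;
        // the slice (and its backing array) will later be collected by GC.
        deliverToUser(readBuf.slice());
    } else {
        // The other eight frame types are fully decoded already,
        // so the read buffer can go straight back to the pool.
        readBufferPool.returnBuffer(readBuf);
    }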

    Pluses of the mixed option:

    • moderate code complexity (more complex in places, but overall simpler than the first option with pools, because far less has to be tracked);
    • no pools in the "public API";
    • good “data locality”;
    • no copying costs;
    • reasonable pool costs;
    • acceptable memory usage.

    There is one minus: higher allocation pressure.

    Comparing the options


    We made all three options, checked them, fixed the bugs, got them working correctly. Now we measure. First, allocation. I had 32 measurement scenarios, but I did not want to draw 32 graphs here, so I will show the range averaged over all measurements. Here baseline is the original unfinished code (taken as 100%). We measured the change in allocation rate relative to baseline for each of the three modifications.



    The option where everything goes through the pool predictably allocates the least. The option with no pools at all allocates eight times more memory than the option with pools. But does that allocation rate actually hurt us? Let's measure the GC pauses:



    With GC pauses like these, that allocation rate is not a problem.

    You can see that the first option (pools everywhere) gives a 25% speedup. No pools at all gives 27%, and the mixed version gives the most, 36%. Any properly finished option already improves performance.

    In a number of scenarios the mixed option gives about 10% more than either the all-pools option or the no-pools option, so we decided to settle on it.

    Moral: here I had to try several options, but there was no real need to go all-in on pools and drag them into the public API.

    • Do not focus on "urban legends"
    • Know the opinions of authorities
    • But often "the truth is somewhere nearby"

    Intermediate results


    Above are the four alterations I wanted to talk about in the context of blocking calls. Further on I will talk about something else, but first I want to take an intermediate snapshot.

    Here is a comparison of HttpClient and the Jetty client for different connection types and data sizes. Each column is a scenario; higher means faster.



    For GET requests we are comfortably ahead of Jetty. I check that box: we have acceptable performance at a reasonable cost. In principle we could squeeze out more, but you have to stop at some point — otherwise this HttpClient would never appear in Java 9 or Java 10.

    With POST requests things are not so rosy. When sending large data over a plain connection, Jetty still wins a little. But for small data and for SSL-encrypted connections we have no problems either.



    Why don't we scale on POSTs with a large data size? Here we run into two serialization points. The first is the lock on the socket: it is global for writes to this particular SocketChannel, and we cannot write to a socket in parallel. Although we are part of the JDK, the nio library is an external module for us, where we cannot change anything. So when we write a lot, we hit this bottleneck.

    With SSL (encryption) the situation is the same. SSLEngine does encryption and decryption, and those can run in parallel with each other, but encryption itself must run sequentially even if I feed it data from many threads. This is a property of the SSL protocol, and it is another serialization point. Nothing can be done about it, unless you switch to some native OpenSSL implementation.

    Alteration 5. Asynchronous API


    Let's look at asynchronous requests.

    Can we make such a completely simple version of the asynchronous API?

    public <T> HttpResponse<T>
        send(HttpRequest req, HttpResponse.BodyHandler<T> responseHandler) {
        ...
    }
    public <T> CompletableFuture<HttpResponse<T>>
        sendAsync(HttpRequest req, HttpResponse.BodyHandler<T> responseHandler) {
        return CompletableFuture.supplyAsync(() -> send(req, responseHandler), executor);
    }
    

    I handed it my executor — it is written out here (the executor is configured on the client; there is some default executor, but you, as a user of this client, can supply any executor you like).

    Alas, you cannot just go and write the asynchronous API like that:



    The problem is that we often wait for something in blocking requests.

    This picture is very simplified. In reality there is a whole request tree — wait here, wait there... the waits are scattered in different places.



    Step 1 - Switch to CompletableFuture


    When we wait, we sit on a wait or on a condition. If we are waiting inside the blocking API, and we have pushed it into sendAsync via an executor, then we have taken a thread away from the executor.

    On the one hand, that is simply inefficient. On the other hand, we wrote an API that lets the user hand us any external executor. By definition this has to work with a fixed thread pool (if the user can supply any executor, we must be able to work with at least one thread).

    In reality a standard situation arose where all the threads of my executor were blocked. They are waiting for a response from the server, while the server is waiting and will not send anything until I send something to it. I need to send something from the client, but there are no threads left in the executor. That's it — we're stuck.

    We have to cut the whole request chain so that every waiting point is wrapped in its own CompletableFuture. Something like this:



    On the left is the user thread. There we build the request chain. Here is the thenCompose method: one future comes in, another future comes out. On the other side is the SelectorManager thread. It existed in the serial version too; it just never needed optimizing. It reads from the socket, decodes frames and calls complete.

    When we reach thenCompose and see that the future we are waiting for has not completed yet, we do not block (this is the asynchronous processing of CompletableFuture) — we simply leave. The thread returns to the executor and keeps doing whatever else the executor needs, and the execution will be continued later. This is the key feature of CompletableFuture that lets you write such things efficiently. We do not steal a thread from its work; there is always someone available to do the work. And that is better for performance.

    We cut out all the waits on conditions and locks and switch to CompletableFuture. When a CompletableFuture completes, the continuation is submitted for execution. We get +40% on asynchronous request processing.
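
    A minimal sketch of the idea (illustrative names, not the JDK code): instead of parking an executor thread on a condition, each waiting point becomes a CompletableFuture that the SelectorManager thread completes.

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.CompletableFuture;

    final class ResponseHeadersReader {
        private final CompletableFuture<Map<String, List<String>>> headersCF =
                new CompletableFuture<>();

        // Called from the SelectorManager thread once the HEADERS frame is decoded.
        void onHeadersFrame(Map<String, List<String>> decoded) {
            headersCF.complete(decoded);
        }

        // Used when building the request chain: nobody blocks here,
        // the dependent stages run once complete() fires.
        CompletableFuture<Map<String, List<String>>> headers() {
            return headersCF;
        }
    }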

    Step 2 - Delayed Startup


    Here is a very popular kind of puzzler. I am not a big fan of puzzlers, but I want to ask one. Suppose we have two threads and a CompletableFuture. In one thread we attach a chain of actions — thenSomething; by "Something" I mean Compose, Combine, Apply — any CompletableFuture operation. And from the second thread we complete this CompletableFuture.

    In which thread will the foo method — the action that should run — be executed?
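
    A sketch of the puzzler (illustrative names):

    import java.util.concurrent.CompletableFuture;

    class Puzzler {
        static String foo(String v) { return v.toUpperCase(); }   // the action in question

        public static void main(String[] args) {
            CompletableFuture<String> cf = new CompletableFuture<>();

            // Thread 1: attaches the dependent action.
            new Thread(() -> cf.thenApply(Puzzler::foo)).start();

            // Thread 2: completes the future.
            new Thread(() -> cf.complete("done")).start();

            // Which thread runs foo()? See the answer below.
        }
    }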



    The correct answer is C.

    If we finish building the chain — i.e. call the thenSomething method — and the CompletableFuture is already completed by that moment, then foo is called in the first thread. If the CompletableFuture has not completed yet, foo will be called from complete along the chain, i.e. in the second thread. We will run into this key property twice more.

    So, we build the request chain in user code. The user calls sendAsync. I.e. in the user thread where sendAsync is called, I want to build the request chain and hand the resulting CompletableFuture to the user, and then my threads go off to work in my executor, sending data and waiting.

    I tinker with this Java code on localhost, and it turns out that sometimes I do not manage to finish building the request chain before the CompletableFuture is already completed:



    On this machine I have only four hardware threads (there could be several dozen), and even here it does not manage to finish building in time. I measured — this happens in 3% of cases. Continuing to build the request chain then means that some of the actions in the chain, such as sending and receiving data, get executed in the user thread, which I do not want. I want the whole chain to be hidden — the user should not see it; I want it to run in the executor.

    Of course, there are Async variants of these methods. If instead of thenCompose I call thenComposeAsync(), my actions will certainly not end up in the user thread.

    Advantages of implementation:

    • nothing runs in the user thread;

    Minuses:

    • switching from one executor thread to another thread of the same executor happens too often (and it is expensive). Nothing leaks into user code, but thenComposeAsync, thenApplyAsync and in general any method ending in Async switches execution of the CompletableFuture to another thread of the same executor, even if we arrived from one of our own executor threads — whether it is the default fork-join pool or an explicitly specified executor. But if the CompletableFuture is already completed, what is the point of switching away from this thread? Such a switch from one thread to another is a waste of resources.

    Here's a trick that was used:

    CompletableFuture<Void> start = new CompletableFuture<>();
    start.thenCompose(v -> sendHeader())
         .thenCompose(v -> sendBody())
         .thenCompose(v -> getResponseHeader())
         .thenCompose(v -> getResponseBody())
         ...;
    start.completeAsync(() -> null, executor); // !!! trigger execution
    

    We first take an empty, incomplete CompletableFuture and build on it the whole chain of actions we need to perform. Only after that do we complete the CompletableFuture — with completeAsync — going straight to our executor. This gives another 10% of performance for asynchronous requests.

    Step 3 - Tricks with complete()


    There is another problem associated with CompletableFuture:



    We have a CompletableFuture, and the dedicated SelectorManager thread completes it. We cannot simply write future.complete here. The problem is that the SelectorManager thread is internal: it handles all reads from the socket. And we hand the CompletableFuture to the user, who can attach a chain of his own actions to it. If user actions start executing via response.complete on the SelectorManager thread, the user can effectively kill our dedicated SelectorManager thread, which must keep working correctly and must not run too much extra work. We must somehow move the execution — take it off that thread and put it into our executor, where we have plenty of threads.

    This is just dangerous.



    We have completeAsync for that.



    But having done completeAsync, we get the same problem.

    We end up switching execution very often from one executor thread to another thread of the same executor along the chain. We do want to switch from SelectorManager to the executor, or from any user thread to the executor — but inside the executor we do not want our tasks to migrate. Performance suffers from this.

    We could do without completeAsync: we can always make the transition via the Async variants of the chaining methods instead.



    But it is the same problem. In both cases we have protected our thread — nothing will start running on it — but this migration costs a little.

    Pros:

    • nothing runs in the SelectorManager thread

    Minuses:

    • frequent switching from one executor thread to another thread of the same executor (expensive)

    Here is one more trick: let's check — maybe our CompletableFuture is already completed? If the CompletableFuture has not completed yet, we go through the Async variant. And if it has completed, then I know for certain that attaching a stage to an already completed CompletableFuture executes it in my current thread — and I am already in an executor thread, so I just run it right here.
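
    A minimal sketch of that check (a hypothetical helper, not the JDK code):

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.Executor;
    import java.util.function.Function;

    final class CfUtil {
        // If cf is already done, the dependent stage runs right here, in the current
        // (executor) thread, with no extra switch; otherwise force it onto the executor
        // so it does not run on the completing (SelectorManager) thread.
        static <T, U> CompletableFuture<U> composeSmart(CompletableFuture<T> cf,
                                                        Function<T, CompletableFuture<U>> fn,
                                                        Executor executor) {
            if (cf.isDone()) {
                return cf.thenCompose(fn);                // runs in the calling thread
            }
            return cf.thenComposeAsync(fn, executor);     // runs in the executor later
        }
    }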



    This is a pure optimization that removes unnecessary work.

    And this gives another 16% performance for asynchronous requests.

    Altogether these three CompletableFuture optimizations sped up asynchronous requests by about 80%.

    Moral: learn the new stuff.
    CompletableFuture (since 1.8)

    Alteration 6


    The last fix never made it into the HTTP Client code itself, simply because it touches the public API. But the problem can be worked around, and I will tell you about it.



    So, we have the client builder, and it can be given an executor. If we did not supply an executor when creating the HTTP Client, it says here that a CachedThreadPool is used by default.

    Let's see what a CachedThreadPool is. I have highlighted the interesting part on purpose:



    A CachedThreadPool has one plus and one minus, and by and large they are the same thing: when a CachedThreadPool runs out of threads, it creates new ones. On the one hand this is good — our task does not sit in a queue waiting and can be executed immediately. On the other hand it is bad, because a new thread gets created.

    Before I made the fixes from the previous section (Alteration 5), I measured, and it turned out that the CachedThreadPool created 20 threads per request — there was too much waiting. 100 simultaneous requests threw an out-of-memory error. That simply did not work, even on the servers available in our lab.

    Then I cut out all the waits and locks and made Alteration 5. My threads are no longer blocked or wasted — they work. Even so, the CachedThreadPool creates on average 12 threads per request. For 100 simultaneous requests, 800 threads were created. It creaked, but it worked.

    In fact, a CachedThreadPool executor cannot be used for things like this. If you have lots of very small tasks, a CachedThreadPool executor will do; but in the general case, no — it will create a great many threads for you, and then you will have to deal with them.

    In this case you need a fixed thread pool executor. You have to measure the options, but I will just show the performance results for the best candidate to replace CachedThreadPool — a fixed pool with two threads:
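
    A minimal sketch of that workaround on the user side (using only the builder methods shown earlier): supply your own small fixed pool instead of relying on the default cached pool.

    ExecutorService executor = Executors.newFixedThreadPool(2);   // two threads are enough here

    HttpClient client = HttpClient.newBuilder()
            .executor(executor)                    // overrides the default CachedThreadPool
            .version(HttpClient.Version.HTTP_2)
            .build();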



    Two threads are the best option, because writing to the socket is a bottleneck that cannot be parallelized, and SSLEngine cannot work in parallel either. The numbers speak for themselves.

    Moral: Not all ThreadPools are equally useful.

    That is all I have on reworking the HTTP/2 Client.

    To be honest, while reading the documentation I cursed the Java API a lot — especially around byte buffers, sockets and so on. But my rules of the game were that I must not change them: for me the JDK is the external library this API is built on.

    But Norman Maurer was not constrained by such rules the way I was, and he gave an interesting talk for those who want to dig deeper: Writing Highly Performant Network Frameworks on the JVM - A Love-Hate Relationship.

    He criticizes the core JDK API in exactly this area — sockets, buffers and so on — and describes what they wanted to change and what they lacked at the JDK level when they wrote Netty. These are the same problems I ran into but could not fix under my rules of the game.


