LAppS: Half a million 1KB-WebSocket messages per second with TLS on a single CPU

For those who do not know: LAppS is the Lua Application Server. It is almost like nginx or Apache, but for the WebSocket protocol instead of HTTP.

HTTP is supported only at the Upgrade request level.

LAppS was designed from the start for high load and vertical scalability, and today it has reached the peak of what my hardware allows (well, almost: it could be optimized further, but that would be long and hard work).

Most importantly, the WebSocket stack of LAppS now outperforms the uWebSockets library, which is positioned as the fastest WebSocket implementation.

If you are interested, read on.

A couple of months have passed since my last article about LAppS, and that article did not attract any interest. I hope this one will seem more interesting to Habr readers. In that time LAppS has come a rather hard way to version 0.7.0: it has gained functionality and, as promised earlier, improved in performance.

One of the new features is cws, a loadable module that implements the client side of the WebSocket protocol.

Thanks to this module, I was finally able to squeeze everything from my home computer, and load LAppS for real.

Previously, testing was performed with the echo client from the websocketpp library (more details can be found on the project's GitHub page), which is not only slow but also difficult to parallelize. The tests were simple: a bunch of clients were started, the results from each client were collected with awk, and simple arithmetic gave the overall performance figures. The results were as follows:

Server                 Number of clients   Server RPS   RPS per client   Payload (bytes)
LAppS 0.7.0            240                 84997        354.154          128
uWebSockets (latest)   240                 74172.7      309.053          128
LAppS 0.7.0            240                 83627.4      348.447          512
uWebSockets (latest)   240                 71024.4      295.935          512
LAppS 0.7.0            240                 79270.1      330.292          1024
uWebSockets (latest)   240                 66499.8      277.083          1024
LAppS 0.7.0            240                 51621        215.087          8192
uWebSockets (latest)   240                 45341.6      188.924          8192
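
The aggregation described above (collect a rate from every client, then apply simple arithmetic) can be sketched in a few lines of Lua. This is only an illustration: the per-client rates below are hypothetical sample values, not measurements from these tests, and `aggregate` is a name made up for this sketch.

```lua
-- Hypothetical per-client rates (requests per second), for illustration only.
local client_rps = { 350.0, 355.5, 357.0 }

-- Sum the per-client rates to get the server rate, and average them
-- to get the "RPS per client" figure.
local function aggregate(rates)
  local total = 0
  for _, rate in ipairs(rates) do
    total = total + rate
  end
  return total, total / #rates
end

local server_rps, rps_per_client = aggregate(client_rps)
print(string.format("server RPS: %.1f, RPS per client: %.3f",
  server_rps, rps_per_client))
```

The same relation can be read off the table: the "RPS per client" column is the server RPS divided by the number of clients.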

In this test, as in the subsequent ones, the number of packets actually handled is twice as high, because the measurement is taken in on_message, and the client's on_message method sends a new packet of the same size. That is, the client request and the server response are the same size. To estimate the traffic processed by the server, multiply RPS by the payload size and double the result; neglecting the headers, this gives the approximate traffic in bytes.
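
As a rough illustration of that estimate, here is a minimal Lua sketch. The function name `approx_traffic_bytes` is made up for this example; the input numbers are the LAppS result for the 128-byte payload from the table above.

```lua
-- Approximate traffic handled by the server, in bytes per second.
-- Every counted round trip carries one request and one response of the
-- same size, hence the factor of 2. Frame headers are neglected.
local function approx_traffic_bytes(rps, payload_bytes)
  return 2 * rps * payload_bytes
end

-- LAppS 0.7.0 with a 128-byte payload: 84997 RPS.
local bytes_per_second = approx_traffic_bytes(84997, 128)
print(bytes_per_second)  -- 21759232
print(string.format("%.2f MiB/s", bytes_per_second / (1024 * 1024)))
```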

Obviously, with 240 client processes running simultaneously, LAppS itself (like uWebSockets) is left with few CPU resources.

I looked at several WebSocket client implementations for Lua and, unfortunately, did not find a simple and sufficiently fast module with which I could load LAppS properly. So, as usual, I reinvented the wheel and wrote my own.

The module has a fairly simple interface and imitates the behavior of the browser WebSocket API.

A simple example of working with this module (a service for receiving trades from BitMEX):

-- NOTE: the beginning of this listing (the definition of the bitmex service
-- table and the opening line of its enclosing function) is missing.
  -- connect to BitMEX
  local websocket,errmsg=cws:new(
    -- (the argument(s) preceding the callback table are missing from this listing)
    {
      ["onopen"]=function(handler)
        -- once the WebSocket connection is established, send the request
        local result, errstr=cws:send(handler,[[{"op": "subscribe", "args": ["orderBookL2:XBTUSD"]}]],1);
        -- the message type is 1 (OpCode 1 - TEXT)
        if(not result) -- handle a failed send
        then
          print("Error on websocket send at handler "..handler..": "..errstr);
        end
      end,
      ["onmessage"]=function(handler,message,opcode)
        print(message) -- print the BitMEX messages for the requested topic
      end,
      ["onerror"]=function(handler, message)
        -- handle connection errors
        print(message..". Socket FD:  "..handler);
      end,
      ["onclose"]=function(handler)
        -- react to the socket being closed
        print("WebSocket "..handler.." is closed by peer.");
      end
    });
  if(websocket == nil) -- the connection could not be established
  then
    print(errmsg)
  else
    while not must_stop()
    do
      cws:eventLoop(); -- poll for events
    end
  end
end -- closes the enclosing function whose opening line is missing above

return bitmex;

A word of warning: the module appeared only today and is poorly tested.

For testing, I wrote a simple service for LAppS and named it just as simply: benchmark.

At startup, this service creates 100 connections to an echo WebSocket server (no matter which one) and, upon a successful connection, sends a 1 KB message. Whenever it receives a message from the server, it sends it back.
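
The ping-pong logic of such a benchmark can be sketched without any network at all. The Lua snippet below is not the benchmark service itself and does not use the cws API: `new_echo_stub` is a hypothetical in-memory stand-in for a connection to an echo server, and the 10 polling rounds are an arbitrary bound chosen so that the sketch terminates.

```lua
-- Hypothetical stand-in for a WebSocket connection to an echo server.
-- NOT the cws API: sent messages are queued and handed back to
-- onmessage on the next poll, which mimics an echo.
local function new_echo_stub(callbacks)
  local pending = {}
  local conn = {}
  function conn.send(message)
    pending[#pending + 1] = message
  end
  function conn.poll()
    local batch = pending
    pending = {}
    for _, message in ipairs(batch) do
      callbacks.onmessage(conn, message)
    end
  end
  callbacks.onopen(conn)
  return conn
end

local PAYLOAD = string.rep("x", 1024) -- a 1 KB message
local CONNECTIONS = 100
local received = 0

local connections = {}
for i = 1, CONNECTIONS do
  connections[i] = new_echo_stub({
    -- on a successful connection, send the first message
    onopen = function(conn) conn.send(PAYLOAD) end,
    -- on every echoed message, count it and send it straight back
    onmessage = function(conn, message)
      received = received + 1
      conn.send(message)
    end,
  })
end

-- Arbitrary bound for the sketch: 10 polling rounds over all connections.
for _ = 1, 10 do
  for _, conn in ipairs(connections) do
    conn.poll()
  end
end

print(received) -- 1000: one message per connection per round
```

Because every connection always has exactly one message in flight, the number of messages counted in on_message per unit of time is the figure reported as RPS.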

My home computer: Intel® Core (TM) i7-7700 CPU @ 3.60GHz, microcode 0x5e
Memory: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2400 MHz (0.4 ns), Kingston KHX2400C15 / 16G

All tests were run on this machine, over localhost.

Echo service configuration in LAppS:

"echo": {
  "auto_start": true,
  "instances": 2,
  "internal": false,
  "max_inbound_message_size": 16777216,
  "preload": null,
  "protocol": "raw",
  "request_target": "/echo"
}

The instances parameter tells LAppS to start two parallel instances of the echo service.

Benchmark service (client) configuration:

"benchmark": {
  "auto_start": true,
  "instances": 4,
  "internal": true,
  "preload": [ "cws", "time" ]
}

That is, 4 instances of the benchmark service are created at startup.

Result with TLS enabled

Server                                    Number of clients   Server RPS   RPS per client   Payload (bytes)
LAppS 0.7.0-Upstream                      400                 257828       644.57           1024
nginx & lua-resty-websocket (4 workers)   400                 33788        84.47            1024

I have not yet managed to test uWebSockets with TLS: the TLS handshake fails with a complaint about SSLv3 (my client uses TLSv1.2, and SSLv3 has been cut out of the LibreSSL I use).

Result without TLS

Server                 Number of clients   Server RPS   RPS per client   Payload (bytes)
LAppS 0.7.0-upstream   400                 439700       1099.25          1024

Why does the title say "half a million" messages when the test shows 257828? Because twice as many messages are actually handled (as explained above).
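
The arithmetic behind the title can be checked with a short Lua sketch. The RPS values and client counts are taken from the two LAppS result tables; the field names are made up for this example.

```lua
-- Each counted round trip is one message received and one message sent
-- by the server, so the message rate is twice the reported RPS.
local results = {
  { name = "LAppS with TLS",    rps = 257828, clients = 400 },
  { name = "LAppS without TLS", rps = 439700, clients = 400 },
}

for _, r in ipairs(results) do
  r.messages_per_second = 2 * r.rps
  r.rps_per_client = r.rps / r.clients
  print(string.format("%s: %d messages/s, %.2f RPS per client",
    r.name, r.messages_per_second, r.rps_per_client))
end
```

With TLS this gives 515656 messages per second, which is where "half a million" comes from.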

uWebSockets shows weaker results in this test only because it runs on a single core: the multi-threaded version of uWebSockets from the project repository does not actually work, and with TLS enabled it has a data race in the OpenSSL stack.

If we imagine that uWebSockets scales perfectly to 2 cores (like the 2 LAppS echo services), then it can be notionally credited with 495098 RPS (simply double its single-core result).

But keep in mind that the uWebSockets echo server does nothing with the received data and sends it straight back, whereas LAppS passes the data to the Lua stack of the corresponding service.

What else is new in LAppS

  • PAM authentication module: pam_auth
  • Message Queuing module: mqr - for messaging between services within one LAppS server (for multi-server exchange, you need to use something already existing, for example: RabbitMQ, mosquitto, etc)
  • Network connection ACL

All this can be found on the project wiki page.

And for dessert, for the connoisseurs: what exactly LAppS is doing during this test.

Without TLS

The obvious leader is iptables
     4.98%  lapps  [ip_tables]       [k] ipt_do_table
Return from system calls
     3.80%  lapps  [kernel.vmlinux]  [.] syscall_return_via_sysret
This is the data transfer between the server and the Lua services
WebSocket stream parsing by the server
     1.96%  lapps  lapps             [.] WSStreamProcessing::WSStreamServerParser::parse
System calls
     1.88%  lapps  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
     1.81%  lapps  [kernel.vmlinux]  [k] __fget
     1.61%  lapps  [kernel.vmlinux]  [k] tcp_ack
     1.49%  lapps  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
     1.48%  lapps  [kernel.vmlinux]  [k] sys_epoll_ctl
     1.45%  lapps  [xt_tcpudp]       [k] tcp_mt
LAppS workers
     1.35%  lapps  lapps             [.] LAppS::IOWorker<false, true>::execute
Benchmark client
     1.28%  lapps  lapps             [.] cws_eventloop
     1.27%  lapps  [nf_conntrack]    [k] __nf_conntrack_find_get.isra.11
     1.14%  lapps  [kernel.vmlinux]  [k] __inet_lookup_established
Echo servers, as seen from the C++ side
     1.01%  lapps  lapps             [.] LAppS::Application<false, true, (abstract::Application::Protocol)0>::execute
     0.98%  lapps  [kernel.vmlinux]  [k] ep_send_events_proc
     0.98%  lapps  [kernel.vmlinux]  [k] tcp_recvmsg
     0.96%  lapps                    [.] __memmove_avx_unaligned_erms
     0.92%  lapps  [kernel.vmlinux]  [k] tcp_transmit_skb
     0.88%  lapps  [kernel.vmlinux]  [k] sock_poll
     0.85%  lapps  [nf_conntrack]    [k] nf_conntrack_in
     0.83%  lapps  [nf_conntrack]    [k] tcp_packet
     0.79%  lapps  [kernel.vmlinux]  [k] do_syscall_64
     0.78%  lapps  [kernel.vmlinux]  [k] ___slab_alloc
     0.78%  lapps  [kernel.vmlinux]  [k] _raw_spin_lock_bh
     0.73%  lapps                    [.] _int_free
     0.69%  lapps  [kernel.vmlinux]  [k] __slab_free
     0.66%  lapps  [kernel.vmlinux]  [k] tcp_write_xmit
     0.65%  lapps  [kernel.vmlinux]  [k] sock_def_readable
     0.65%  lapps  [kernel.vmlinux]  [k] tcp_sendmsg_locked
The actual sending of messages by the client (the benchmark service)
     0.64%  lapps  lapps             [.] LAppS::ClientWebSocket::send
     0.64%  lapps  [kernel.vmlinux]  [k] tcp_v4_rcv
     0.63%  lapps  [kernel.vmlinux]  [k] __alloc_skb
     0.61%  lapps  lapps             [.] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release
     0.61%  lapps  [kernel.vmlinux]  [k] _raw_spin_lock
     0.60%  lapps                    [.] __memset_avx2_unaligned_erms
     0.60%  lapps  [kernel.vmlinux]  [k] kmem_cache_alloc_node
     0.59%  lapps  [kernel.vmlinux]  [k] __local_bh_enable_ip
     0.58%  lapps  [kernel.vmlinux]  [k] __dev_queue_xmit
     0.57%  lapps  [kernel.vmlinux]  [k] nf_hook_slow
     0.55%  lapps  [kernel.vmlinux]  [k] ep_poll_callback
     0.55%  lapps  [kernel.vmlinux]  [k] skb_release_data
     0.54%  lapps  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
     0.54%  lapps                    [.] cfree@GLIBC_2.2.5
     0.53%  lapps  [kernel.vmlinux]  [k] ip_finish_output2
     0.49%  lapps                    [.] lj_BC_RET
     0.49%  lapps                    [.] __strlen_avx2
     0.48%  lapps  [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore

With TLS (spot the 10 differences)

     3.73%  lapps  [kernel.vmlinux]  [k] syscall_return_via_sysret
     2.74%  lapps  [ip_tables]       [k] ipt_do_table
     1.41%  lapps                    [.] __pthread_mutex_lock
     1.32%  lapps  [kernel.vmlinux]  [k] __fget
     1.06%  lapps                    [.] __memmove_avx_unaligned_erms
     1.06%  lapps  lapps             [.] WSStreamProcessing::WSStreamServerParser::parse
     1.05%  lapps  [kernel.vmlinux]  [k] tcp_ack
     1.02%  lapps  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
     1.02%  lapps  [nf_conntrack]    [k] __nf_conntrack_find_get.isra.11
     0.98%  lapps  lapps             [.] cws_eventloop
     0.98%  lapps  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
     0.92%  lapps  lapps             [.] LAppS::IOWorker<true, true>::execute
     0.91%  lapps  [kernel.vmlinux]  [k] tcp_recvmsg
     0.89%  lapps  [kernel.vmlinux]  [k] sys_epoll_ctl
     0.84%  lapps  [kernel.vmlinux]  [k] do_syscall_64
     0.82%  lapps  [kernel.vmlinux]  [k] __inet_lookup_established
     0.82%  lapps  [kernel.vmlinux]  [k] tcp_transmit_skb
     0.79%  lapps                    [.] __pthread_mutex_unlock_usercnt
     0.77%  lapps  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
     0.76%  lapps  [xt_tcpudp]       [k] tcp_mt
     0.70%  lapps  [kernel.vmlinux]  [k] _raw_spin_lock
     0.67%  lapps  [kernel.vmlinux]  [k] ep_send_events_proc
     0.63%  lapps  [kernel.vmlinux]  [k] sock_def_readable
     0.62%  lapps  lapps             [.] LAppS::Application<true, true, (abstract::Application::Protocol)0>::execute
     0.61%  lapps  [nf_conntrack]    [k] nf_conntrack_in
     0.57%  lapps  [kernel.vmlinux]  [k] tcp_write_xmit
     0.55%  lapps  [kernel.vmlinux]  [k] __netif_receive_skb_core
     0.54%  lapps  [kernel.vmlinux]  [k] ___slab_alloc
     0.54%  lapps                    [.] __memset_avx2_unaligned_erms
     0.51%  lapps  [kernel.vmlinux]  [k] _raw_spin_lock_bh
     0.51%  lapps  [kernel.vmlinux]  [k] sock_poll
     0.48%  lapps  [nf_conntrack]    [k] tcp_packet
     0.48%  lapps                    [.] cfree@GLIBC_2.2.5
     0.48%  lapps                    [.] SSL_read
     0.46%  lapps  [kernel.vmlinux]  [k] copy_user_generic_unrolled
     0.45%  lapps  [kernel.vmlinux]  [k] tcp_sendmsg_locked
     0.45%  lapps  lapps             [.] LAppS::ClientWebSocket::send
     0.44%  lapps                    [.] _int_free
     0.44%  lapps                    [.] ssl3_read_internal
     0.43%  lapps  [kernel.vmlinux]  [k] futex_wake
     0.42%  lapps                    [.] lj_tab_get
     0.42%  lapps                    [.] vfprintf
     0.41%  lapps  [kernel.vmlinux]  [k] tcp_v4_rcv
