Experience implementing Tarantool at Calltouch



    In today's world of information technology, everyone, from large companies to small ones, runs a large number of different APIs. And despite many best practices, fault tolerance most often does not guarantee a 100% ability to process client requests correctly, let alone to recover from a failure and then process the requests that were lost because of it. This problem affects even the biggest players on the Internet, not to mention smaller companies.


    I work at Calltouch, and our main goal is to make our services fault tolerant and to keep control over the data and requests that customers send to the API service. We need the ability to restore a service quickly after a failure and to process the requests addressed to the troubled service, starting from the moment the failure occurred. All of this brings us close to a state where it is almost impossible to lose customer requests on our side.


    After analyzing the solutions available on the market, we found that Tarantool offers excellent performance and almost unlimited possibilities for data management and processing, with very modest demands on technical and financial resources.


    Background


    Calltouch has an API service that receives requests from customers with data used to build reports in the web interface. This data is very important: it is used in marketing, and losing it can lead to unplanned maintenance work. Like any service, after a deployment or the addition of new features it occasionally runs into problems and may malfunction for some time. Therefore, we need the ability to very quickly pick up and process the requests with data that were not delivered to the API service at the moment of failure. Balancing with a backup alone is not enough, for several reasons:


    1. The amount of memory the service needs may call for new hardware.
    2. Hardware is expensive right now.
    3. No one is safe from an error caused by a killer request.

    A fairly simple task, storing requests with quick access to them, turns out to be expensive. So we decided to study how all incoming requests can be saved today while keeping very fast access to them.


    Study


    There were several options for how to store incoming data.


    First option


    Save requests and their data via nginx logs and put them somewhere. If problems arise, the API service will read the data from that storage and then do the necessary processing.


    Second option


    Duplicate HTTP requests to several destinations, and write an additional service that stores the data somewhere.


    Configuring the web server to save data through logs for later processing has its drawbacks. The solution is not cheap, and access to the data would be extremely slow. Additional services for working with log files, aggregation, and data storage would be required. On top of that come significant financial costs: introducing new services, training operations staff, and probably buying new hardware. And most importantly, if no such solution existed before, you would have to find the time to implement it. For these reasons we abandoned the first option almost immediately and began exploring how to implement the second one.


    Implementation


    We chose between nginx, goreplay and lwan.


    Lwan was the first to drop out, since goreplay can already do everything we need out of the box. That left a choice between nginx with post_action and goreplay. Goreplay was the natural standard for this scheme, but we decided to pause and think about the requests themselves: where and how it would be best to store them.


    Up to a certain point we could not really come up with a storage solution. What we needed was feedback between data that had already been processed and data that had not. The API for which we duplicate requests did not receive any IDs in requests from the client side, so we needed the ability to inject additional data into an incoming request. That would give us the feedback between processed and unprocessed data, because every request ends up in the database, not only the unprocessed ones, and we need some way to deal with all of that incoming data.


    To solve the problem of missing request IDs, we decided to add a header with a UUID on the web server side and proxy such requests to the API, so that after processing a request the API service can modify or delete the copy we duplicated into the database. At this point we abandoned goreplay in favor of nginx, since nginx supports many modules, including ones that can write to various databases. This simplifies the data processing scheme and reduces the number of auxiliary services needed to solve this technical problem. And there is no need to spend time learning additional languages and modifying goreplay to fit our requirements.


    The simplest option is to take an nginx module that can write the entire contents of incoming requests to some database; we did not want extra code and programming inside the configs. The module for Tarantool turned out to be the most flexible and the best fit for us: it can proxy all the data into Tarantool without any additional steps.


    As an example, let's take a simple configuration and a small Lua script for Tarantool in which the bodies of all incoming requests are stored. The interaction of the services is shown in the diagram below.


    [Diagram: request flow between nginx, the API service, and Tarantool]


    For this, we need nginx with a set of modules and Tarantool.


    Additional modules for nginx:


    1. nginx_upstream_module for Tarantool (provides the tnt_* directives used in the configuration below).
    2. A module providing the uuid4 directive, used to generate a per-request UUID.

    Example upstream configuration in nginx for working with Tarantool:


    upstream tnt {
        # Tarantool instance listening on its binary-protocol port (see box.cfg below)
        server 127.0.0.1:3301 max_fails=1 fail_timeout=1s;
        keepalive 10;
    }

    Configuration for proxying data to Tarantool using post_action:


    location @send_to_tnt {
        # hand the request over to the Tarantool stored procedure http_handler,
        # passing the body, parsed query arguments and outgoing headers along with it
        tnt_method http_handler;
        tnt_http_rest_methods all;
        tnt_pass_http_request on pass_body parse_args pass_headers_out;
        tnt_pass tnt;
    }

    location / {
        # attach a UUID to every request so the API and Tarantool see the same ID
        uuid4 $req_uuid;
        proxy_set_header x-request-uuid $req_uuid;
        add_header x-request-uuid $req_uuid always;
        proxy_pass http://127.0.0.1:8080/;
        # once the main request has completed, duplicate it to Tarantool
        post_action @send_to_tnt;
    }

    An example procedure in Tarantool that accepts input from nginx:


    box.cfg {
        log_level = 5;
        listen = 3301;   -- binary-protocol port the nginx upstream connects to
    }

    log = require('log')

    box.once('grant', function()
        box.schema.user.grant('guest', 'read,write,execute', 'universe')
        box.schema.create_space('example')
        -- a primary index over the request UUID is required before any insert
        box.space.example:create_index('primary', {parts = {1, 'string'}})
    end)
    -- called by nginx_upstream_module for every request duplicated via post_action
    function http_handler(req)
        local headers = req.headers
        local body    = req.body
        if not body then
            log.error('no data')
            return false
        end
        if not headers['x-request-uuid'] then
            log.error('header x-request-uuid not found')
            return false
        end
        -- store the request keyed by the UUID that nginx attached to it
        local s, e = pcall(box.space.example.insert,
            box.space.example, {headers['x-request-uuid'], body})
        if not s then
            log.error('cannot insert: %s', e)
            return false
        end
        return true
    end

    The upsides of this solution are the small amount of Lua code and a fairly simple nginx configuration. The API part is not shown here, since it has to be implemented in any variant anyway. You can easily extend this scheme with master-master replication in Tarantool and balance the load across several nodes using nginx or twemproxy; a sketch of the replication side follows.
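

    As a rough sketch of what that extension could look like (the node addresses, user name, credentials and port below are illustrative assumptions, not our real topology), each Tarantool instance lists all peers in box.cfg, and on the nginx side both nodes are simply added as servers to the tnt upstream:


    -- minimal master-master sketch: run the same box.cfg on each of the two nodes;
    -- assumes a 'replicator' user with the replication role exists on both
    box.cfg {
        listen = 3301;
        read_only = false;                        -- both nodes accept writes
        replication = {
            'replicator:secret@tnt-node-1:3301';  -- peer A (placeholder URI)
            'replicator:secret@tnt-node-2:3301';  -- peer B (placeholder URI)
        };
    }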


    There is one caveat in this scheme: post_action sends the data to Tarantool a few milliseconds after the request has already reached the API and been processed. If the API is as fast as ours at Calltouch, the cleanup may reach Tarantool before the duplicate does, so you either have to repeat the delete request a few times or wait out a timeout before querying Tarantool. We chose repeated requests rather than timeouts, so that our services keep running without delays, as fast as before.
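

    A hedged sketch of what that cleanup call could look like on the Tarantool side (the name delete_processed and the way the API invokes it are assumptions; only the example space and its UUID key come from the script above):


    -- hypothetical cleanup procedure the API calls once it has processed a request
    function delete_processed(uuid)
        local ok, res = pcall(box.space.example.delete, box.space.example, {uuid})
        if not ok then
            log.error('cannot delete: %s', res)
            return false
        end
        -- nil means nothing matched: the duplicate from post_action has not arrived
        -- yet, so the API simply repeats this delete request a little later
        return res ~= nil
    end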


    Conclusion


    In conclusion, nothing more than nginx with the nginx_upstream_module together with Tarantool gives you remarkable flexibility and simplicity in working with HTTP requests, fast access to the data without disrupting the core services, and only minor changes when introduced into an existing infrastructure. It covers tasks ranging from building complex statistics to plain request storage. Not to mention that you can use this pair as a regular web service and implement an API on top of this nginx module and Tarantool.


    As for future development at Calltouch, I would note the possibility of building an interface that gives almost instant access to various data through filters; of using real requests in tests instead of synthetic load; and of debugging applications when problems arise, both to improve quality and to eliminate errors. With data availability and flexibility like this, you can grow the number of services and their quality at the modest cost of implementing Tarantool in various products.
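

    As an illustration of that idea (purely a sketch; the helper name below is an assumption), the simplest filter over the stored requests is "everything still sitting in the space", that is, everything the API never confirmed as processed:


    -- hypothetical helper for such an interface: inspect or replay what was lost;
    -- secondary indexes (for example over a stored timestamp) could support richer filters
    function unprocessed_requests(limit)
        return box.space.example:select({}, {limit = limit or 1000})
    end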

