Real-time data processing in the AWS Cloud. Part 2

    In the first part of this article, we described one of the problems we encountered while working on a public service for storing and analyzing the results of biological research, reviewed the requirements provided by the customer, and considered several possible implementation options based on existing products.


    Today we will focus on the solution that has been implemented.

    Proposed Architecture


    [Image: proposed architecture]

    Front-end

    User requests arrive at the front end, are validated for compliance with the expected format, and are sent to the back end. Each request eventually returns to the front end either with a rendered image or with a set of points, if the client prefers to build the image itself.

    In the future, an LRU cache could be added to store the results of duplicate requests, with a short element lifetime roughly proportional to the duration of a user session.
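
    To make the idea concrete, here is a minimal sketch of such a cache built on Java's LinkedHashMap with access ordering. The capacity and the key/value types are placeholders, and a per-entry lifetime tied to the session length would still have to be added on top.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Minimal LRU cache sketch: LinkedHashMap with accessOrder = true keeps
    // entries in least-recently-used order, and removeEldestEntry evicts the
    // oldest entry once the capacity is exceeded.
    public class ResultCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        public ResultCache(int capacity) {
            super(16, 0.75f, true);   // accessOrder = true -> LRU iteration order
            this.capacity = capacity; // hypothetical limit, e.g. a few thousand results
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity; // evict the least recently used entry
        }
    }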

    Back-end

    For each such request, the back end:
    • validates the incoming request and checks its legitimacy against the security policy,
    • determines what data should be read from S3 and forms a general task whose processing result is to be returned to the front end,
    • breaks the task down into subtasks, taking into account the data layout on S3 and other specifics, so that the same data is never read twice,
    • places the subtasks into a queue built on RabbitMQ,
    • processes the results received from the queue, assembling the point sets back together,
    • renders an image if the request calls for it,
    • returns the result to the front end.


    Subtasks are processed by queuing each of them in parallel, RPC-style (submit the task, wait, get the result). For this, a thread pool that is global to the back-end application is used. Each thread in this pool is responsible for interacting with the broker: sending a message, waiting, and receiving the result.

    In addition, using a pool of a known size makes it possible to control the number of subtask messages processed simultaneously. And because the pool only runs threads for known subtasks, we can track exactly which general tasks are currently being executed and predict when each of them will be ready.
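
    As an illustration of this scheme, below is a minimal sketch of RPC-style subtask dispatch over RabbitMQ from a bounded thread pool. The queue name "subtasks", the pool size and the byte[] payloads are assumptions made for the example, not the actual production code.

    import com.rabbitmq.client.AMQP;
    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.UUID;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Sketch of the back-end side: each subtask is published RPC-style
    // (publish, wait, collect the reply), and the global fixed-size pool
    // bounds how many subtasks are in flight at the same time.
    public class SubtaskDispatcher {
        private final ExecutorService pool = Executors.newFixedThreadPool(16); // pool size is an assumption
        private final Connection connection;

        public SubtaskDispatcher(Connection connection) {
            this.connection = connection;
        }

        // Publishes one subtask and blocks until its result arrives.
        private byte[] call(byte[] subtask) throws Exception {
            try (Channel channel = connection.createChannel()) {
                String replyQueue = channel.queueDeclare().getQueue(); // server-named reply queue
                String correlationId = UUID.randomUUID().toString();

                AMQP.BasicProperties props = new AMQP.BasicProperties.Builder()
                        .correlationId(correlationId)
                        .replyTo(replyQueue)
                        .build();
                channel.basicPublish("", "subtasks", props, subtask); // "subtasks" queue name is an assumption

                BlockingQueue<byte[]> response = new ArrayBlockingQueue<>(1);
                channel.basicConsume(replyQueue, true, (tag, delivery) -> {
                    if (correlationId.equals(delivery.getProperties().getCorrelationId())) {
                        response.offer(delivery.getBody());
                    }
                }, tag -> { });
                return response.take(); // wait for the worker's reply
            }
        }

        // Fans a general task out into subtasks and collects the point sets back.
        public List<byte[]> process(List<byte[]> subtasks) throws Exception {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (byte[] subtask : subtasks) {
                futures.add(pool.submit(() -> call(subtask)));
            }
            List<byte[]> results = new ArrayList<>();
            for (Future<byte[]> future : futures) {
                results.add(future.get());
            }
            return results;
        }
    }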


    For stability, three things need to be monitored:

    1. The processing time of a single subtask relative to the number of subtasks queued at a given moment: if this metric grows, the throughput of the queue must be increased.
    2. The prioritization of subtask processing, so that each general task is completed in as little time as possible.
    3. The number of general tasks in processing, to avoid JVM heap overflow on the back end caused by keeping intermediate results in memory.


    Items 2 and 3 are handled by adjusting the thread pool size and the order in which subtasks are queued. When the average subtask processing time changes (item 1), the number of worker nodes processing subtasks has to be increased or, correspondingly, reduced.

    Workers

    The subscribers to the RabbitMQ queue are standalone applications, which for definiteness we call Workers. Each of them fully occupies one EC2 instance, making the most efficient use of its CPU, RAM and network bandwidth.

    Subtasks formed on the back end are consumed by one of the workers. Processing such a subtask requires no global context, so each worker operates independently of the others.

    The important point is that Amazon S3 provides random access to any part of the data. This means that instead of downloading a 500 MB file, most of which is not needed to process the given request, we can read only what is actually required. In other words, by splitting the general task into subtasks in the right way, we can always avoid reading the same data twice.
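
    For illustration, such a ranged read might look roughly like the sketch below (AWS SDK for Java v1). The bucket, key and byte offsets are hypothetical; in our case they would come from the subtask definition.

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.GetObjectRequest;
    import com.amazonaws.services.s3.model.S3Object;
    import com.amazonaws.util.IOUtils;

    // Sketch of a ranged S3 GET: only the requested byte range is transferred,
    // not the whole (possibly multi-hundred-megabyte) object.
    public class RangedRead {
        public static byte[] readRange(String bucket, String key, long from, long to) throws Exception {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            GetObjectRequest request = new GetObjectRequest(bucket, key)
                    .withRange(from, to); // translates to an HTTP Range header
            try (S3Object object = s3.getObject(request)) {
                return IOUtils.toByteArray(object.getObjectContent());
            }
        }
    }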

    In the case of a runtime error (out of memory, crash, etc.), the task simply returns to the queue, from which it is automatically handed to another node. For system stability, each worker is periodically restarted by cron to avoid possible problems with memory leaks and JVM heap overflow.
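
    A sketch of what the worker-side consumer could look like is shown below, assuming manual acknowledgements: a message is acknowledged only after it has been processed successfully, so on any failure (including a killed JVM) RabbitMQ redelivers the subtask to another node. The broker host, the queue name and the process(...) call are placeholders.

    import com.rabbitmq.client.AMQP;
    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    // Sketch of a worker: consumes one subtask at a time, publishes the result
    // to the reply queue named by the back end, and requeues the subtask on error.
    public class Worker {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("rabbitmq.internal"); // broker host is an assumption
            Connection connection = factory.newConnection();
            Channel channel = connection.createChannel();

            channel.basicQos(1); // at most one unacknowledged subtask per worker
            channel.basicConsume("subtasks", false, (tag, delivery) -> {
                long deliveryTag = delivery.getEnvelope().getDeliveryTag();
                try {
                    byte[] result = process(delivery.getBody()); // the heavy computation
                    AMQP.BasicProperties replyProps = new AMQP.BasicProperties.Builder()
                            .correlationId(delivery.getProperties().getCorrelationId())
                            .build();
                    channel.basicPublish("", delivery.getProperties().getReplyTo(), replyProps, result);
                    channel.basicAck(deliveryTag, false);
                } catch (Exception e) {
                    channel.basicNack(deliveryTag, false, true); // requeue for another node
                }
            }, tag -> { });
        }

        private static byte[] process(byte[] subtask) { return subtask; } // placeholder
    }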

    Scaling

    Several reasons can make it necessary to change the number of application nodes:

    1. An increase in the average subtask processing time, which ultimately makes it difficult to deliver the final result to users within the required time frame.
    2. Insufficient workload for the worker nodes.
    3. Back-end overload in terms of CPU or memory consumption.


    To solve problems 1 and 2, we used the EC2 API and created a separate scaler module that operates on instances. Each new instance is created from a pre-configured operating system image (Amazon Machine Image, AMI) and is launched through spot requests, which cuts hosting costs roughly fivefold.
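
    The "add workers" path of such a scaler might look roughly like the sketch below (AWS SDK for Java v1). The AMI id, instance type and maximum bid are hypothetical values for the example.

    import com.amazonaws.services.ec2.AmazonEC2;
    import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
    import com.amazonaws.services.ec2.model.LaunchSpecification;
    import com.amazonaws.services.ec2.model.RequestSpotInstancesRequest;

    // Sketch of requesting additional worker instances as spot instances,
    // created from the pre-configured worker AMI.
    public class Scaler {
        public static void addWorkers(int count) {
            AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

            LaunchSpecification spec = new LaunchSpecification()
                    .withImageId("ami-0123456789abcdef0") // pre-configured worker AMI (placeholder)
                    .withInstanceType("c5.xlarge");        // instance type is an assumption

            RequestSpotInstancesRequest request = new RequestSpotInstancesRequest()
                    .withSpotPrice("0.10")                 // maximum hourly bid, assumption
                    .withInstanceCount(count)
                    .withLaunchSpecification(spec);

            ec2.requestSpotInstances(request);             // instances appear a few minutes later
        }
    }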

    The disadvantage of this approach is that it takes about 4-5 minutes from the moment a spot request is created until the instance is in service. By that time, the peak load may already have passed, and the need for additional nodes may have disappeared on its own.

    To get into such situations less often, we use statistics on the number of requests, the geographic location of users, and the time of day, and increase or decrease the number of worker nodes "in advance". Almost all users work with our service during their business day, so surges at the start of the working day in the United States (especially the US West) and in China are clearly visible. And if queue congestion still occurs, we manage to smooth it out within 4-5 minutes.

    Problem 3 has not yet been resolved and remains our most vulnerable spot. The current coupling of three things on the back end (data access control, knowledge of the data's specifics and location, and post-processing of the computed data, i.e. the Reduce step) is artificial and needs to be split into separate layers.

    In fairness, it should be said that the Reduce step comes down to System.arraycopy(...), and the total amount of data in memory (requests plus the results of completed subtasks) on a single back-end instance has never exceeded 1 GB, which easily fits into the JVM heap.
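
    For illustration, that Reduce step amounts to something like the sketch below; representing a set of points as a double[] is an assumption made for the example.

    import java.util.List;

    // Sketch of the Reduce step: concatenating the point arrays of completed
    // subtasks into one array with System.arraycopy.
    public class ReduceStep {
        public static double[] merge(List<double[]> parts) {
            int total = 0;
            for (double[] part : parts) {
                total += part.length;
            }
            double[] merged = new double[total];
            int offset = 0;
            for (double[] part : parts) {
                System.arraycopy(part, 0, merged, offset, part.length);
                offset += part.length;
            }
            return merged;
        }
    }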

    Deployment

    Any changes in the existing system go through several stages of testing:

    • Unit testing. This process is integrated into the build that runs on TeamCity after each commit.
    • Integration testing. Once a day (sometimes less often) TeamCity launches several builds that check the interaction of the modules. As test data we use pre-prepared files whose processing result is known. As the set of functional features grows, we add the corresponding cases to the test code.
    • If the changes affect the user interface, human intervention is sometimes required at the final stage.


    For the subsystem described here, changes mostly concern performance and support for new types of source data, so unit and integration testing is usually enough.

    After each successful build from the "production" branch, TeamCity publishes artifacts: ready-to-use JARs and scripts that control the set of parameters used to launch the application. When a new instance starts from the pre-prepared AMI (or an existing one restarts), the startup script downloads the latest production build from TeamCity and launches the application using the script supplied with the build.

    Thus, all it takes to deploy a new version to production is to wait for the tests to finish and press the "magic" button that restarts the instances. By controlling the set of running instances and splitting the task flow into different RabbitMQ queues, we can run A/B tests on user groups.

    Practical tips


    • Know how your data is used. Provide random access to any part of it in the minimum amount of time. Keywords: Amazon S3, random access.
    • Use spot requests to save money. Keywords: Amazon EC2, spot requests.
    • Be sure to build prototypes based on existing solutions. At a minimum, you gain experience; at best, you get an almost complete solution.


    In closing...


    In this overview article, we described our approach to solving a fairly typical problem. The project continues to evolve, gaining new functionality every day. We will be happy to share our experience and answer your questions.
