Experience in developing a service-oriented system

    Some time ago, together with a small team of programmers, we began developing an analytical project that is quite interesting from a technical point of view. Its main goal was to process data collected from various web pages: bring it into a convenient form and then analyze the resulting statistics.

    While the amount of data was small, we did not run into any unusual problems and all our solutions were fairly straightforward. But the project kept expanding, and the volume of collected information, slowly at first, kept growing. The code base grew as well. After some time we faced a very sad fact - because of all sorts of crutches and quick fixes, we had violated almost every possible design principle. And if at first the organization of the code did not matter much, over time it became clear that without a good refactoring we would not get far.


    After some discussion and reflection, we decided that for our purposes the architecture of this web scraper should first of all be service-oriented (SOA). Starting from this approach, we identified three main parts of the future system, responsible for the following tasks:
    1. Retrieving page content, data from various services through the API, data from structured files
    2. Structuring the information received
    3. Analysis of statistics and the creation of recommendations


    This separation was meant to produce three independent services: Fetch Service, Parse Service, and Analyze Service.

    * Hereinafter I will use English names for brevity and easier reading.

    Then the question arose of how these services would communicate with each other. For the general mechanism we decided to use the concept of pipeline processing: a simple and understandable approach when information has to be processed sequentially, passed from one node to the next. A queuing mechanism based on RabbitMQ was chosen as the communication bus.
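
    To make the idea concrete, here is a minimal sketch of the pipeline wiring in Python with the pika client, assuming RabbitMQ runs locally; the queue names are illustrative, not the real ones from the project.

        import pika

        # One request queue per pipeline stage: Fetch -> Parse -> Analyze.
        PIPELINE_QUEUES = ["fetch.requests", "parse.requests", "analyze.requests"]

        connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = connection.channel()

        # Declare the queues as durable so queued tasks survive a broker restart.
        for queue in PIPELINE_QUEUES:
            channel.queue_declare(queue=queue, durable=True)

        connection.close()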

    So, we settled on the main architectural model. It turned out to be simple, understandable, and easy to extend. Next, I will describe what each service consists of and what allows them to be scaled.

    Service Components and Technology

    Let's talk a little about the technologies used inside each individual service. In this article I will mainly describe how the Fetch Service works; the other services have a similar architecture. Below I will describe the general points, or rather the main components. There are three of them.

    The first is the processing module, which contains all the basic logic for working with data. It is a set of workers that perform tasks, and the clients that create those tasks. Gearman is used here as the task server, together with its API. The workers and clients themselves are separate processes controlled by Supervisord.
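
    As a rough sketch of this module, here is what a worker and a client might look like with the python-gearman library; the task name "fetch_url" and the payload format are assumptions made for illustration.

        import json
        import sys
        import gearman

        def fetch_url(worker, job):
            """Worker callback: receives a JSON task, returns a JSON result."""
            task = json.loads(job.data)
            # ...the actual fetching logic would live here...
            return json.dumps({"url": task.get("url"), "status": "done"})

        def run_worker():
            # Worker process: registers the task and waits for jobs from Gearman.
            worker = gearman.GearmanWorker(["localhost:4730"])
            worker.register_task("fetch_url", fetch_url)
            worker.work()

        def run_client():
            # Client process: submits a task without waiting for the result.
            client = gearman.GearmanClient(["localhost:4730"])
            client.submit_job("fetch_url", json.dumps({"url": "http://example.com"}),
                              background=True)

        if __name__ == "__main__":
            run_worker() if "worker" in sys.argv else run_client()

    Under Supervisord, the worker and client scripts each run as a separate managed program, and a stage can be scaled simply by letting Supervisord start several identical worker processes.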

    The next component is the result repository, a MongoDB database. Most data is retrieved from web pages or through various APIs that return JSON, and MongoDB is quite convenient for storing this kind of information. In addition, the structure of the results may change and new metrics may appear; in that case we can easily adjust the structure of the documents.
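
    For example, storing a fetch result with pymongo might look roughly like this; the database, collection, and field names are illustrative assumptions.

        import datetime
        from pymongo import MongoClient

        results = MongoClient("mongodb://localhost:27017/")["fetch_service"]["results"]

        # Documents are schema-free, so new fields and metrics can be added at any time.
        document = {
            "url": "http://example.com",
            "fetched_at": datetime.datetime.utcnow(),
            "http_status": 200,
            "headers": {"Content-Type": "text/html"},
            "body": "<html>...</html>",
        }
        result_id = results.insert_one(document).inserted_id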

    And finally, the third component of the system is the queues. There are two types. The first type carries requests to a service from other services or from external clients (not to be confused with Gearman clients); these are referred to as Request Queues. In the case of the Fetch Service mentioned earlier, a queue of this type receives a JSON string containing the URL of the desired page or the parameters of a request to a third-party API.

    The second type is the Notifications Queues. In a queue of this type, a service places information about requests that have been processed, so the result can be retrieved from the repository. This is how asynchronous execution of requests for fetching, processing, and analyzing data is implemented.
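
    To illustrate, the messages in the two kinds of queues might look something like the following, shown as Python dictionaries before serialization to JSON; the field names are assumptions, not the project's real schema.

        # Goes into the Fetch Service's Request Queue.
        fetch_request = {
            "request_id": "a1b2c3",
            "type": "page",                 # or "api" / "file"
            "url": "http://example.com/article",
        }

        # Placed by the Fetch Service into its Notifications Queue once done.
        fetch_notification = {
            "request_id": "a1b2c3",
            "status": "completed",
            "result_id": "5f2b9c0e",        # id of the document stored in MongoDB
        }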

    RabbitMQ was chosen as the message broker. It is a good solution and works fine, albeit with occasional hiccups. However, for a system like this it is too feature-rich, so it might be better to replace it with something simpler.

    Communication

    So, communication happens through the queues, which is an obvious and convenient way to connect the services. Next, I will describe the communication process in more detail.

    There are two types of communication: inside the system, between the services, and between an external client and the system as a whole.
    For example, the Parse Service needs new data. It sends a request to the Fetch Service's queue and then goes on with its own work - the requests are executed asynchronously.
    When the Fetch Service receives the request from the queue, it takes the necessary steps to extract the data from the desired source (web page, file, API) and places the result in the repository (MongoDB). It then sends a notification that the operation is complete, which the Parse Service in turn receives in order to process the new data.
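
    A hedged sketch of the Parse Service's side of this exchange, again with pika; the queue names and message fields are assumptions made for illustration.

        import json
        import pika

        connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = connection.channel()
        channel.queue_declare(queue="fetch.requests", durable=True)
        channel.queue_declare(queue="fetch.notifications", durable=True)

        # 1. Ask the Fetch Service for new data and carry on with other work.
        request = {"request_id": "a1b2c3", "type": "page", "url": "http://example.com"}
        channel.basic_publish(exchange="", routing_key="fetch.requests",
                              body=json.dumps(request))

        # 2. Later, react to the completion notification and read the result
        #    from the repository (MongoDB) by its id.
        def on_notification(ch, method, properties, body):
            note = json.loads(body)
            # ...load the document referenced by note["result_id"] here...
            ch.basic_ack(delivery_tag=method.delivery_tag)

        channel.basic_consume(queue="fetch.notifications",
                              on_message_callback=on_notification)
        channel.start_consuming()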

    Fetch Service

    And finally, I’ll tell you a little more about the service responsible for obtaining the raw data from external sources.
    This basic part of the system is the first stage of the data processing pipeline. The following tasks fall to it (a sketch of the fetch step follows the list):
    1. Retrieving data from an external source
    2. Handling exceptions and errors at this stage (e.g. handling HTTP responses)
    3. Providing basic information about the received data (headers, statistics on file changes, etc.)
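
    A minimal sketch of the fetch step covering these three tasks, using the requests library; the return format is an assumption, not the service's real schema.

        import requests

        def fetch(url):
            try:
                response = requests.get(url, timeout=30)
                response.raise_for_status()              # treat 4xx/5xx responses as errors
            except requests.RequestException as error:   # network errors, timeouts, bad status
                return {"url": url, "ok": False, "error": str(error)}
            return {
                "url": url,
                "ok": True,
                "http_status": response.status_code,
                "headers": dict(response.headers),       # basic information about the data
                "content": response.text,
            }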


    Extracting the source data is an important part of most systems that structure data, and the service-oriented approach is very convenient here.
    We simply say, “Give me this data,” and get what we want. In addition, this approach lets us create separate workers for specific sources without making clients think about where the material for processing comes from. Different APIs, formats, and protocols can be used; the whole logic of obtaining the target data is isolated at this level.
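
    For illustration, source-specific logic could be isolated by registering a separate Gearman task per source type; the task names and handlers below are hypothetical.

        import gearman

        def fetch_page(worker, job): ...   # HTML pages (hypothetical handlers)
        def fetch_api(worker, job): ...    # third-party JSON APIs
        def fetch_file(worker, job): ...   # structured files (CSV, XML, ...)

        worker = gearman.GearmanWorker(["localhost:4730"])
        worker.register_task("fetch.page", fetch_page)
        worker.register_task("fetch.api", fetch_api)
        worker.register_task("fetch.file", fetch_file)
        worker.work()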

    In turn, other services that implement more specific logic - crawling sites, parsing, aggregation, and so on - can be built on top of this one, without having to deal with network interaction and all of its edge cases every time.

    I will probably end here. Of course, there are many more aspects to developing such systems, but the main thing to remember is to think about the architecture first and to follow the single responsibility principle. Isolate the system's components and connect them in a simple and understandable way, and you get a result that is easy to scale, easy to control, and easy to build on in the future.
