Ignite Service Grid - Reboot

    On February 26, we held the Apache Ignite GreenSource meetup, where open source contributors to the Apache Ignite project gave talks. An important event in the life of this community was the redesign of the Ignite Service Grid component, which lets you deploy custom microservices directly in an Ignite cluster. Vyacheslav Daradur, senior Yandex developer and Apache Ignite contributor for more than two years, spoke about this difficult process at the meetup.

    First, what is Apache Ignite? It is a database: a distributed key-value store with support for SQL, transactions, and caching. In addition, Ignite lets you deploy user services directly in the Ignite cluster. The developer gets access to all the tools Ignite provides: distributed data structures, Messaging, Streaming, Compute, and Data Grid. For example, with Data Grid there is no separate storage infrastructure to administer, so the overhead that comes with it disappears.

    Using the Service Grid API, you can deploy a service simply by specifying the deployment scheme and the service itself in the configuration.

    Typically, a deployment scheme specifies how many instances should be deployed on the cluster nodes. There are two typical schemes. The first is Cluster Singleton: the cluster guarantees that exactly one instance of the user service is available at any time. The second is Node Singleton: one instance of the service is deployed on each node of the cluster.

    The user can also specify the number of service instances for the entire cluster and define a predicate that filters suitable nodes. In this scenario, Service Grid itself calculates the optimal distribution of service instances.
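
    The three schemes above can be modeled as two limits plus a node filter. The following is a minimal sketch in plain Java, not the Ignite API: the class `DeploymentSchemeSketch`, the method `distribute`, and the round-robin placement are assumptions made for illustration. It treats Cluster Singleton as "total 1, at most 1 per node" and Node Singleton as "no total limit, at most 1 per node"; in Ignite itself these schemes are requested through `IgniteServices.deployClusterSingleton`, `deployNodeSingleton`, and `deployMultiple`, which this sketch does not call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical model of a deployment scheme; this is not the Ignite API. */
final class DeploymentSchemeSketch {
    /**
     * Spreads service instances over the nodes accepted by the filter.
     * totalCount == 0 means no cluster-wide limit, maxPerNode == 0 means no per-node limit.
     */
    static Map<String, Integer> distribute(List<String> nodes, Predicate<String> nodeFilter,
                                           int totalCount, int maxPerNode) {
        if (totalCount < 0 || maxPerNode < 0 || (totalCount == 0 && maxPerNode == 0))
            throw new IllegalArgumentException("At least one limit must be positive");

        // Keep only suitable nodes and sort them so the result does not depend on input order.
        List<String> eligible = new ArrayList<>();
        for (String node : nodes)
            if (nodeFilter.test(node))
                eligible.add(node);
        Collections.sort(eligible);

        // Without a cluster-wide limit every eligible node is filled up to the per-node limit.
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String node : eligible)
            result.put(node, totalCount == 0 ? maxPerNode : 0);

        // Otherwise hand out instances one at a time, round-robin, respecting the per-node limit.
        int remaining = totalCount;
        boolean placed = true;
        while (remaining > 0 && placed) {
            placed = false;
            for (String node : eligible) {
                if (remaining == 0)
                    break;
                int current = result.get(node);
                if (maxPerNode == 0 || current < maxPerNode) {
                    result.put(node, current + 1);
                    remaining--;
                    placed = true;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node-a", "node-b", "node-c");
        System.out.println(distribute(nodes, n -> true, 1, 1));                 // Cluster Singleton
        System.out.println(distribute(nodes, n -> true, 0, 1));                 // Node Singleton
        System.out.println(distribute(nodes, n -> !n.equals("node-b"), 5, 0));  // 5 instances, filtered
    }
}
```

    The real distribution logic in Ignite is more involved; the point here is only that a scheme boils down to a total count, a per-node limit, and a node predicate.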

    In addition, there is a feature called Affinity Service. Affinity is a function that maps keys to partitions and partitions to nodes in the topology. Given a key, you can determine the primary node on which its data is stored. This lets you bind your own service to a key and a cache with an affinity function. If the affinity mapping changes, the service is redeployed automatically. The service is therefore always placed next to the data it manipulates, which reduces the overhead of accessing that data. Such a scheme can be called a kind of collocated computing.
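
    The key-to-partition-to-node chain can be sketched in a few lines. This is a simplified illustration, not Ignite's affinity function: Ignite's default is rendezvous hashing, while the sketch below assumes a plain modulo mapping, and the class `AffinitySketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Simplified affinity model: key -> partition -> primary node. Not the Ignite API. */
final class AffinitySketch {
    /** Maps a key to a partition; floorMod keeps the result non-negative. */
    static int partition(Object key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    /** Maps a partition to its primary node (modulo over the sorted node list, an assumption). */
    static String primaryNode(int partition, List<String> nodes) {
        List<String> sorted = new ArrayList<>(nodes);
        Collections.sort(sorted);
        return sorted.get(partition % sorted.size());
    }

    /** An affinity service lives on the primary node of its affinity key. */
    static String serviceNode(Object affinityKey, int partitions, List<String> nodes) {
        return primaryNode(partition(affinityKey, partitions), nodes);
    }

    public static void main(String[] args) {
        int partitions = 1024;
        // Before the topology change.
        System.out.println(serviceNode(43, partitions, Arrays.asList("node-a", "node-b", "node-c")));
        // After node-b leaves, the mapping changes and the service has to move with the data.
        System.out.println(serviceNode(43, partitions, Arrays.asList("node-a", "node-c")));
    }
}
```

    When the node list changes, `serviceNode` may return a different node for the same key, which is the moment an automatic redeployment is needed.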

    Now that we have seen what makes Service Grid attractive, let's look at its development history.

    How it was before

    The previous implementation of Service Grid was based on Ignite's transactional replicated system cache. In Ignite, the word "cache" means storage, not something temporary, as you might think. Although the cache is replicated and each node holds the entire data set, internally the data is stored in partitioned form. This is a storage optimization.

    What happened when a user wanted to deploy a service?

    • All nodes in the cluster subscribed to data updates in the storage using the built-in Continuous Query mechanism.
    • The initiating node, under a read-committed transaction, wrote a record to the database containing the service configuration, including the serialized instance.
    • Upon receiving notification of the new record, the coordinator calculated the distribution based on the configuration. The resulting object was written back to the database.
    • The nodes read the new distribution and deployed services if necessary.

    What did not suit us

    At some point, we came to the conclusion that we could not keep working with services this way. There were several reasons.

    If an error occurred during deployment, you could find out about it only from the logs of the node where it happened. Deployment was asynchronous only, so after the deployment method returned control to the user, the service still needed some extra time to start, and during that time the user had no control over anything. To develop Service Grid further, build new features, attract new users, and make life easier for everyone, something had to change.

    When designing the new Service Grid, we first of all wanted to provide a synchronous deployment guarantee: as soon as the API returns control to the user, the services can be used immediately. We also wanted to give the initiator the ability to handle deployment errors.

    In addition, we wanted to simplify the implementation, namely to get away from transactions and rebalancing. Although the cache is replicated and no rebalancing takes place, there were problems in large deployments with many nodes. When the topology changes, the nodes need to exchange information, and in a large deployment this data can be heavy.

    When the topology was unstable, the coordinator had to recalculate the distribution of services. And in general, working with transactions on an unstable topology can lead to hard-to-predict errors.


    What global change comes without accompanying problems? The first of these was topology changes. You need to understand that at any moment, even during service deployment, a node can join or leave the cluster. Moreover, if a node joins the cluster during a deployment, all the information about services must be transferred to the new node consistently. And this concerns not only what has already been deployed, but also current and future deployments.

    This is only one of the problems. They can be gathered into a separate list:

    • How do we deploy statically configured services when a node starts?
    • A node leaves the cluster: what if it hosted services?
    • What do we do if the coordinator changes?
    • What do we do if a client reconnects to the cluster?
    • Do activation and deactivation requests need to be processed, and how?
    • What if someone destroys a cache that affinity services are tied to?

    And that’s not all.


    As the target design, we chose an event-driven approach in which processes communicate through messages. Ignite already has two components that let nodes exchange messages: communication-spi and discovery-spi.

    Communication-spi lets nodes talk to each other and send messages directly. It is well suited for transferring large amounts of data. Discovery-spi lets you send a message to all nodes in the cluster. In the standard implementation this is done over a ring topology. There is also an integration with ZooKeeper, in which case a star topology is used. Another important point: discovery-spi guarantees that a message is delivered to all nodes in the same, correct order.

    Consider the deployment protocol. All user requests for deployment and undeployment are sent via discovery-spi. This gives the following guarantees:

    • The request is received by all nodes in the cluster. This makes it possible to continue processing the request if the coordinator changes. It also means that a single message gives each node all the necessary metadata, such as the service configuration and its serialized instance.
    • Strict message ordering makes it possible to resolve configuration conflicts and competing requests.
    • Since a node joining the topology is also handled via discovery-spi, the new node receives all the data needed to work with services.
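
    Why does a strict delivery order help with competing requests? If every node applies the same requests in the same order with the same rule, every node reaches the same decision without extra coordination. The sketch below is a minimal illustration under assumptions: the class `OrderedRequestsSketch` is hypothetical, configurations are reduced to strings, and the "first request for a name wins" rule is chosen for the example, not taken from the Ignite sources.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-node view of deployment requests delivered in a single global order. */
final class OrderedRequestsSketch {
    /** Service name -> accepted configuration. */
    final Map<String, String> accepted = new LinkedHashMap<>();

    /** Requests rejected because the name was already taken by a different configuration. */
    final List<String> rejected = new ArrayList<>();

    /** Processes one request in the order the discovery layer delivered it. */
    void onRequest(String serviceName, String config) {
        String existing = accepted.putIfAbsent(serviceName, config);

        // Same name with a different configuration is a conflict; the earlier request stays.
        if (existing != null && !existing.equals(config))
            rejected.add(serviceName + " -> " + config);
    }

    public static void main(String[] args) {
        String[][] delivered = {{"billing", "total=1"}, {"billing", "total=3"}, {"audit", "total=2"}};

        OrderedRequestsSketch node1 = new OrderedRequestsSketch();
        OrderedRequestsSketch node2 = new OrderedRequestsSketch();
        for (String[] req : delivered) {
            node1.onRequest(req[0], req[1]);
            node2.onRequest(req[0], req[1]);
        }

        // Both nodes saw the same order, so their decisions are identical.
        System.out.println(node1.accepted.equals(node2.accepted));
        System.out.println(node1.rejected);
    }
}
```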

    Upon receiving a request, the nodes in the cluster validate it and create tasks for processing. These tasks are queued and then processed in another thread by a separate worker. It is implemented this way because a deployment can take considerable time, and blocking the precious discovery thread is unacceptable.
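
    The hand-off from the discovery thread to a worker is a classic producer-consumer arrangement. Below is a minimal sketch using only the JDK; the class `DeploymentQueueSketch`, its methods, and the use of a service name as the whole "task" are assumptions for illustration and do not mirror the actual Ignite classes.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch: the discovery thread only enqueues, a worker thread deploys. */
final class DeploymentQueueSketch {
    /** Unique marker object that tells the worker to stop; compared by identity. */
    private static final String STOP = new String("stop");

    private final BlockingQueue<String> tasks = new LinkedBlockingQueue<>();
    private final Thread worker = new Thread(this::runWorker, "deployment-worker");

    /** Names of services the worker has "deployed", in processing order. */
    final List<String> deployed = new CopyOnWriteArrayList<>();

    void start() {
        worker.setDaemon(true);
        worker.start();
    }

    /** Called on the discovery thread: cheap validation and enqueue, nothing slow. */
    boolean onDiscoveryMessage(String serviceName) {
        if (serviceName == null || serviceName.isEmpty())
            return false;
        return tasks.offer(serviceName);
    }

    /** Worker loop: takes tasks one by one; the slow deployment would happen here. */
    private void runWorker() {
        try {
            while (true) {
                String name = tasks.take();
                if (name == STOP)
                    return;
                deployed.add(name);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Lets the worker drain the queue, then waits for it to finish. */
    void stopAndWait() {
        tasks.add(STOP);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        DeploymentQueueSketch manager = new DeploymentQueueSketch();
        manager.start();
        manager.onDiscoveryMessage("billing");
        manager.onDiscoveryMessage("audit");
        manager.stopAndWait();
        System.out.println(manager.deployed);
    }
}
```

    Because `onDiscoveryMessage` never waits for a deployment to finish, a slow service start delays only the worker, not message delivery.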

    All requests from the queue are processed by the deployment manager. It has a dedicated worker that takes a task from the queue and initializes it to begin deployment. After this, the following actions occur:

    1. Each node independently calculates the distribution thanks to a new deterministic assignment function.
    2. The nodes form a message with the results of the deployment and send it to the coordinator.
    3. The coordinator aggregates all messages and generates the result of the entire deployment process, which is sent via discovery-spi to all nodes in the cluster.
    4. Upon receipt of the result, the deployment process is completed, after which the task is removed from the queue.

    New event-driven design: org.apache.ignite.internal.processors.service.IgniteServiceProcessor.java

    If an error occurs during deployment, the node immediately includes it in the message it sends to the coordinator. After aggregating the messages, the coordinator has information about all errors that occurred during the deployment and sends it out via discovery-spi. The error information is then available on every node in the cluster.
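
    The aggregation step on the coordinator can be sketched as follows. This is an illustration under assumptions, not the code of `IgniteServiceProcessor`: the class `CoordinatorSketch`, its methods, and the representation of a node's report as a map from service name to error text are all hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical coordinator that collects per-node deployment results. */
final class CoordinatorSketch {
    /** Nodes that have not reported yet. */
    private final Set<String> pending;

    /** Service name -> (node id -> error text). Services without errors do not appear. */
    private final Map<String, Map<String, String>> errorsByService = new TreeMap<>();

    CoordinatorSketch(Collection<String> nodeIds) {
        pending = new HashSet<>(nodeIds);
    }

    /**
     * Records one node's report (an empty map means no errors).
     * Returns true once every expected node has reported.
     */
    boolean onNodeResult(String nodeId, Map<String, String> errors) {
        // Ignore unknown nodes and duplicate reports.
        if (!pending.remove(nodeId))
            return pending.isEmpty();

        for (Map.Entry<String, String> e : errors.entrySet())
            errorsByService.computeIfAbsent(e.getKey(), k -> new TreeMap<>()).put(nodeId, e.getValue());

        return pending.isEmpty();
    }

    /** The full result the coordinator would broadcast to all nodes. */
    Map<String, Map<String, String>> fullResult() {
        return Collections.unmodifiableMap(errorsByService);
    }

    public static void main(String[] args) {
        CoordinatorSketch coordinator = new CoordinatorSketch(Arrays.asList("node-a", "node-b"));
        coordinator.onNodeResult("node-a", Collections.<String, String>emptyMap());
        boolean done = coordinator.onNodeResult("node-b", Collections.singletonMap("billing", "init failed"));
        if (done)
            System.out.println(coordinator.fullResult());
    }
}
```

    Once the full result has been broadcast, every node, including the one that initiated the deployment, can see which services failed and where.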

    All important events in Service Grid are processed according to this algorithm. For example, a topology change is also a discovery-spi message. Compared with the previous implementation, the protocol turned out to be quite lightweight and reliable, enough to handle any situation that arises during deployment.

    What will happen next

    Now about the plans. Any major development in the Ignite project is carried out as an Ignite Enhancement Proposal, or IEP. The Service Grid redesign also has one: IEP-17, jokingly named “Oil Change in Service Grid”. In fact, though, we replaced not the oil in the engine but the entire engine.

    We divided the tasks in the IEP into two phases. The first is the major phase, which consists of reworking the deployment protocol. It has already been merged into master, so you can try the new Service Grid, which will appear in version 2.8. The second phase includes many other tasks:

    • Hot redeployment
    • Service Versioning
    • Increased Resiliency
    • Thin client
    • Tools for monitoring and counting various metrics

    Finally, we recommend Service Grid for building fault-tolerant, highly available systems. We also invite you to the dev-list and user-list to share your experience. Your experience really matters to the community: it helps us understand where to go next and how to develop the component in the future.
