ii000314 February 26, 2019 at 12:19

WG Contract API: zoo of services

As the number of components in a software system grows, the number of people developing it usually grows too. To keep up the pace of development and keep the system maintainable, the way the API is organized deserves special attention.

If you want a closer look at how the Wargaming Platform team copes with the complexity of a system of more than a hundred web services interacting with each other, read on.

Hello! My name is Valentine and I'm an engineer on the Platform team at Wargaming. For those who don’t know what the platform is and what it does, I’ll point to a recent publication by one of my colleagues, max_posedon.

I have been working at the company for more than five years and caught part of the period of active growth of World of Tanks. To explain the issues raised in this article, I need to start with a brief digression into the history of the Wargaming Platform.

A bit of history

The popularity of “tanks” grew like an avalanche and, as usually happens in such cases, the infrastructure around the game began to develop rapidly. The game quickly became surrounded by various web services, and by the time I joined the team there were already dozens of them (now, by the way, more than 100 platform components work for the benefit of the company).

As time passed, new games came out, and understanding the intricacies of integrations between web services was no longer easy. The situation only worsened when teams from other Wargaming offices joined the development of the platform. Development became distributed, with all the consequences: distance, time zones and the language barrier. And the number of services kept growing. Finding a person who understood how the platform worked as a whole was not easy. Information often had to be pieced together from different sources.

The interfaces of the various web services could differ greatly in style, which made integrating with the platform even more difficult. Direct inter-component dependencies reduced development flexibility by complicating the decomposition of functionality within the platform. To make matters worse, the games, as clients of the platform, knew our topology well, since they had to integrate directly with each platform service. Using these horizontal connections, they could lobby for improvements directly in the component they were integrated with. This led to duplicate functionality in various components of the platform and made it impossible to extend existing functionality to other games. It became obvious that continuing to build the platform around each specific game was a dead end. We needed technical and organizational changes that would let us take control of the growing complexity of a rapidly growing system and make all the platform functionality usable by any game.

With this I want to finish the historical excursion and, finally, talk about one of our technical solutions, which helps to keep control of the complexity caused by the ever-growing number of services. In addition, it reduces the cost of developing new functionality and greatly simplifies integration with the platform.

Meet the Contract API

Inside the platform, we call it the Contract API. At its core, it is an integration framework consisting of documentation and client libraries for each technology in our stack (Erlang / Elixir, Java / Scala, Python). It is being developed, first, to simplify the integration of platform components with each other and, second, to help us solve the following problems:

stylistic differences of program interfaces
the presence of direct inter-component dependencies
keeping documentation up to date
introspection and debugging end-to-end functionality

So, first things first.

Stylistic differences in software interfaces

In my opinion, this problem arose as a result of a combination of several factors:

Lack of a strict standard for what the API should look like. A set of recommendations often does not have the desired effect; the APIs still turn out different, especially when development is carried out by teams from different offices of the company. Each team has its own habits and practices. Taken together, such APIs often do not look like parts of a whole.
Lack of a single directory with the names and formats of business-specific entities. As a rule, you cannot take an entity from the result of one API and pass it to the API of another service; it has to be transformed first.
Lack of a mandatory centralized review process for the API. There are always deadlines, and there is no time to collect feedback and, moreover, to make changes to an API that in practice is often already half tested.

The first thing we did when designing the Contract API was to declare that from now on the API belongs to the platform, and not to a single component. As a result, the development of new functionality begins with a pull request to a centralized API repository. Currently, we use a Git repository as the storage. For convenience, we divided the entire API into separate business functions, formalized the structure of such a function and called it a contract.

Since then, each new business function in our Contract API must be described in a special format and go through a pull request with a mandatory review. There is no other way to publish a new API to the Contract API. In the same repository, we defined a directory of business-specific entities and suggested that contract developers reuse them instead of describing these entities themselves.

So we got a conceptually integrated platform API that looked like a single product, despite the fact that it was actually implemented on many platform components using various technological stacks.
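
The article does not show the contract description format itself. As a purely hypothetical sketch, a contract that reuses entities from a shared directory could be modeled roughly like this. The field names (`name`, `params`, `result`), the entity definitions and both helper functions are assumptions made for illustration, not the actual format used at Wargaming.

```python
# Hypothetical sketch: none of these field or entity names come from the real Contract API.

# Shared directory of business entities, defined once and reused by every contract.
ENTITIES = {
    "wgid": {"type": int, "doc": "Player account identifier"},
    "title": {"type": str, "doc": "Game title code, e.g. 'ru.wot'"},
    "reason": {"type": str, "doc": "Human-readable reason"},
}

# A contract only references entities by name instead of redefining them.
CREATE_BAN = {
    "name": "ban-management.create-ban.v1",
    "params": ["wgid", "title", "reason"],
    "result": ["wgid", "title"],
}


def unknown_entities(contract, entities):
    """Return entity names used by the contract that are missing from the directory."""
    used = contract["params"] + contract["result"]
    return sorted({name for name in used if name not in entities})


def validate(contract, entities, payload):
    """Return a list of problems found in payload for the contract's params."""
    problems = []
    for name in contract["params"]:
        if name not in payload:
            problems.append(f"missing: {name}")
        elif not isinstance(payload[name], entities[name]["type"]):
            problems.append(f"wrong type: {name}")
    return problems
```

A check like `unknown_entities` is the kind of thing a review pipeline could run on every pull request, so that a contract cannot silently introduce its own variant of an existing entity.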

The presence of direct inter-component dependencies

This problem manifested itself in the fact that each platform component had to know which component specifically serves the functionality it needs.

And the issue was not even the difficulty of keeping this directory up to date, but the fact that direct dependencies significantly complicated the migration of business functionality from one platform component to another. The problem was especially acute when we started decomposing our monoliths into smaller components. It turned out that convincing a client to replace a working integration with one that is the same from the business point of view but different from the technical point of view is not a trivial management task. The client simply does not see the point, since everything already works fine for them. As a result, bad-smelling backward-compatibility layers were written that only complicated the support of the platform and hurt the quality of service. And since we were already going to standardize the platform API, we decided to address this problem at the same time.

We faced a choice of several options. Of these, we especially carefully considered:

Implementation of service discovery protocols on each of the components.
The use of a mediator, which would redirect client requests to the correct platform components.
Using a message broker as a messaging bus.

After some thought and experimentation, we chose the message broker, even though we saw it as a potential single point of failure and it increased the overhead of operating the platform. An important factor in the choice was that the platform team already had expertise in working with RabbitMQ. The broker itself scaled well and had built-in mechanisms for ensuring fault tolerance. As a bonus, we got the opportunity to implement an event-driven architecture (EDA) “under the hood”, which later opened up wider possibilities for interservice interaction than point-to-point communication.

So, topologically, the platform began to turn from a graph with random connectivity into a star. And platform components inverted their dependencies and got the opportunity to interact with each other exclusively through contracts registered in a centralized repository, without the need to know who specifically implements a particular contract. In other words, all the components within the platform were able to interact with each other using a single integration point, which greatly simplified the life of developers.
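
The dependency inversion described above can be illustrated with a tiny in-memory stand-in for the broker. This is only a sketch: the `Broker` class and the two handler functions are invented for the example, and a real deployment would route messages through RabbitMQ rather than a Python dictionary.

```python
# Illustrative in-memory stand-in for a message broker; not RabbitMQ and not the real library.
class Broker:
    def __init__(self):
        self._handlers = {}  # contract name -> callable

    def register(self, contract, handler):
        """A component announces that it serves the given contract."""
        self._handlers[contract] = handler

    def call(self, contract, params):
        """A client addresses the contract, never a concrete component."""
        if contract not in self._handlers:
            raise LookupError(f"no handler for {contract}")
        return self._handlers[contract](params)


def monolith_rename(params):
    return {"wgid": params["wgid"], "served_by": "monolith"}


def account_service_rename(params):
    return {"wgid": params["wgid"], "served_by": "account-service"}


broker = Broker()
broker.register("account-management.rename-account.v1", monolith_rename)
first = broker.call("account-management.rename-account.v1", {"wgid": 1})

# The functionality migrates to another component; the client call is unchanged.
broker.register("account-management.rename-account.v1", account_service_rename)
second = broker.call("account-management.rename-account.v1", {"wgid": 1})
```

The client code is identical before and after the migration, which is exactly what makes moving functionality out of a monolith a purely internal matter.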

Keeping documentation up to date

Problems associated with missing or outdated documentation are encountered almost everywhere. And the higher the pace of development, the more often they appear. Collecting, after the fact, all the API specifications in a single place and format for more than a hundred services in a distributed, multinational team is a difficult task.

When developing the Contract API, we set ourselves the goal of solving this problem as well. And we did it. A strictly defined format for describing a contract allowed us to build a process in which an automatic documentation build starts immediately after a new contract appears. This gives us confidence that our API documentation is always up to date. The process is fully automated and requires no development or management effort.
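
To show why a strict contract format makes documentation builds trivial to automate, here is a minimal sketch that renders a reference page from one contract description. The dictionary layout and the `render_doc` function are assumptions; the article does not describe the real format or tooling.

```python
# Hypothetical contract description; the real format is not shown in the article.
CONTRACT = {
    "name": "ban-management.create-ban.v1",
    "summary": "Create a ban for a player account.",
    "params": {
        "wgid": "Player account identifier",
        "reason": "Why the ban is issued",
    },
}


def render_doc(contract):
    """Render a plain-text reference page from one contract description."""
    lines = [contract["name"], "", contract["summary"], "", "Parameters:"]
    for name, doc in sorted(contract["params"].items()):
        lines.append(f"  {name}: {doc}")
    return "\n".join(lines)


page = render_doc(CONTRACT)
```

Run from a repository hook or CI job on every merged pull request, a generator of this kind keeps the published reference in step with the contracts without anyone editing documentation by hand.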

Introspection and debugging end-to-end functionality

As we split our monoliths into smaller components, difficulties quite naturally began to arise in debugging end-to-end functionality. If serving a business function was distributed across several platform components, then to localize and debug a problem one often had to find representatives of each component, which was at times difficult given the 11-hour time difference with some of our colleagues.

With the advent of the Contract API, and in particular thanks to the message broker underlying it, we got the opportunity to receive copies of messages involved in the execution of a business function, without side effects on the interaction participants. To do this, it is not even necessary to know which of the components of the platform is responsible for processing a particular contract. And after localization of the problem, we can get the identifier of the broken component from the metadata of the problem message.
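
The idea of receiving copies of messages without affecting the participants can be sketched with an in-memory broker that hands deep copies of every message to registered observers. The `TappableBroker` class, the message fields (`kind`, `app_id`, `body`) and the component name `ban-service` are all hypothetical; only the use of an application identifier in message metadata is suggested by the article.

```python
import copy


class TappableBroker:
    """Toy broker that lets observers receive copies of all traffic."""

    def __init__(self):
        self._handlers = {}  # contract -> (app_id, handler)
        self._taps = []      # observers that receive copies of every message

    def register(self, contract, app_id, handler):
        self._handlers[contract] = (app_id, handler)

    def tap(self, observer):
        self._taps.append(observer)

    def _publish(self, message):
        # Observers get deep copies, so they cannot affect the real exchange.
        for observer in self._taps:
            observer(copy.deepcopy(message))

    def call(self, contract, params):
        self._publish({"contract": contract, "kind": "request", "body": params})
        app_id, handler = self._handlers[contract]
        try:
            body, kind = handler(params), "response"
        except Exception as exc:
            body, kind = {"error": str(exc)}, "error"
        self._publish({"contract": contract, "kind": kind, "app_id": app_id, "body": body})
        return body


def broken_handler(params):
    raise ValueError("storage unavailable")


seen = []
broker = TappableBroker()
broker.register("ban-management.create-ban.v1", "ban-service", broken_handler)
broker.tap(seen.append)
broker.call("ban-management.create-ban.v1", {"wgid": 1})

# The observer never needed to know who serves the contract; the metadata tells it.
culprits = [m["app_id"] for m in seen if m["kind"] == "error"]
```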

What else did we develop on top of the Contract API

In addition to its main purpose and solving the above problems, the Contract API allowed us to implement a number of useful services.

Gateway to access platform functionality

The standardization of the API in the form of contracts allowed us to develop a single access point to platform functionality via HTTP. Moreover, with the advent of new functionality (contracts), we do not need to modify this access point in any way. It is forward compatible with all future contracts. This allows you to work with the platform as a single product using the usual HTTP interface.
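
A gateway can stay forward compatible because it never needs to know individual contracts: it can derive the contract name from the request and pass the body through. The sketch below assumes a URL scheme in which the path is the contract name and the body is JSON; both the scheme and the `handle_http` function are assumptions, not the actual gateway.

```python
import json


def handle_http(call_contract, method, path, body):
    """Translate an HTTP request into a contract call and back.

    call_contract is any callable (contract_name, params) -> result.
    Returns a (status_code, json_text) pair.
    """
    if method != "POST":
        return 405, json.dumps({"error": "method not allowed"})
    contract = path.strip("/")
    try:
        params = json.loads(body)
    except ValueError:
        return 400, json.dumps({"error": "invalid JSON"})
    try:
        result = call_contract(contract, params)
    except LookupError:
        return 404, json.dumps({"error": "unknown contract"})
    return 200, json.dumps(result)


def fake_call(contract, params):
    """Stand-in for the platform client used only in this example."""
    if contract != "ban-management.create-ban.v1":
        raise LookupError(contract)
    return {"ban_id": 1, "wgid": params["wgid"]}


status, text = handle_http(fake_call, "POST", "/ban-management.create-ban.v1", '{"wgid": 7}')
```

Nothing in `handle_http` mentions a specific contract, so a contract added tomorrow would be reachable without touching the gateway.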

Mass Operations Service

Any contract can be launched as part of a mass operation, with the ability to track its status and then receive a report on the results of this operation. This service, just like the previous one, is compatible with all future contracts in advance.
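
As a rough illustration, a mass operation can be thought of as one contract applied to many parameter sets with a per-item report. The sketch below is synchronous and keeps the report in memory, whereas the real service presumably runs asynchronously and tracks status over time; `run_mass_operation` and the report layout are assumptions.

```python
def run_mass_operation(call_contract, contract, items):
    """Call one contract for every params dict and collect a per-item report."""
    report = {"contract": contract, "total": len(items), "succeeded": 0, "failed": []}
    for index, params in enumerate(items):
        try:
            call_contract(contract, params)
            report["succeeded"] += 1
        except Exception as exc:
            # One failed item must not stop the rest of the operation.
            report["failed"].append({"index": index, "error": str(exc)})
    return report


def fake_call(contract, params):
    """Stand-in for the platform client used only in this example."""
    if params["wgid"] < 0:
        raise ValueError("invalid wgid")
    return {"ok": True}


report = run_mass_operation(
    fake_call,
    "ban-management.create-ban.v1",
    [{"wgid": 1}, {"wgid": -5}, {"wgid": 3}],
)
```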

Unified platform error handling

The Contract API protocol standardizes errors as well. This allowed us to implement an error interceptor, which analyzes their severity and notifies the monitoring system of potential problems on platform components. In the future, it will be able to decide on its own that a bug has been discovered in a platform component. The error interceptor catches errors directly from the message broker and knows nothing about the purpose of a contract or error, acting only on the basis of meta-information. This allows it, like all the services described in this section, to be forward compatible with all future contracts.
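
A minimal sketch of such an interceptor, assuming error messages carry a machine-readable code and the identifier of the component that produced them. The error codes, severity levels and message fields below are invented for the example.

```python
# Hypothetical error codes and severities; the real protocol is not shown in the article.
SEVERITY_BY_CODE = {
    "validation_error": "info",
    "timeout": "warning",
    "internal_error": "critical",
}


def intercept(message, notify):
    """Inspect only message metadata and notify monitoring about serious errors.

    Returns the severity for error messages and None for everything else.
    """
    if message.get("kind") != "error":
        return None
    # Unknown codes are treated as critical so that new failure modes are not missed.
    severity = SEVERITY_BY_CODE.get(message.get("code"), "critical")
    if severity != "info":
        notify({
            "app_id": message.get("app_id"),
            "contract": message.get("contract"),
            "severity": severity,
        })
    return severity
```

The function never looks at the business payload, which is why the same interceptor keeps working for contracts that did not exist when it was written.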

Auto Generate User Interfaces

Strictly formalized contracts make it possible to build user interface components automatically. We have developed a service that generates an administrative interface from a collection of contracts, which can then be embedded in any of our platform tools. Thus, the admin panels that we previously wrote by hand can now be generated automatically (although only partially so far).
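
One way to picture this: if every contract parameter has a declared type, a generator can map types to form widgets. The schema layout, the widget names and `build_form` are hypothetical and only illustrate the principle.

```python
# Hypothetical mapping from parameter types to UI widgets.
WIDGETS = {
    "integer": "number-input",
    "string": "text-input",
    "datetime": "datetime-picker",
}

# Hypothetical contract description with typed parameters.
CONTRACT = {
    "name": "ban-management.create-ban.v1",
    "params": {
        "wgid": {"type": "integer", "required": True},
        "reason": {"type": "string", "required": True},
        "expires_at": {"type": "datetime"},
    },
}


def build_form(contract):
    """Turn a contract's parameter schema into a list of form field descriptors."""
    fields = []
    for name, spec in contract["params"].items():
        fields.append({
            "name": name,
            "widget": WIDGETS.get(spec["type"], "text-input"),
            "required": spec.get("required", False),
        })
    return {"title": contract["name"], "fields": fields}


form = build_form(CONTRACT)
```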

Platform Logging

This component is still under development. In the future, it will allow us to turn logging of any business function in the platform on and off “on the fly”, extracting this information directly from the message broker, without side effects for the interacting components.

The main purpose of the Contract API

But still, the main purpose of the Contract API is to reduce the cost of integrating platform components.

Developers are abstracted from the transport level by libraries that we developed for each of our technology stacks. This gives us some room for maneuver in case we have to change the message broker or even switch to point-to-point interaction. The external interface of the library will remain unchanged.

Under the hood, the library builds a message according to certain rules, sends it to the broker, waits for the response message and returns the result to the developer. From the outside, it looks like a regular synchronous (or asynchronous, depending on the implementation) request. As a demonstration, I will give a few examples.
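
How a blocking call can be built on top of message passing is easiest to see in a toy model. The sketch below uses standard-library queues and a thread in place of the broker, and matches responses to requests by a correlation identifier, which is the usual way to implement request/response over a broker. It is an assumption about the general technique, not a description of the actual client library.

```python
import queue
import threading
import uuid

requests = queue.Queue()       # stands in for the broker's request queue
pending = {}                   # correlation id -> queue awaiting the reply
pending_lock = threading.Lock()


def call(contract, params, timeout=2.0):
    """Publish a request and block until the matching response arrives."""
    correlation_id = str(uuid.uuid4())
    reply_box = queue.Queue(maxsize=1)
    with pending_lock:
        pending[correlation_id] = reply_box
    requests.put({"contract": contract, "correlation_id": correlation_id, "body": params})
    try:
        # Raises queue.Empty if no response arrives in time.
        return reply_box.get(timeout=timeout)
    finally:
        with pending_lock:
            pending.pop(correlation_id, None)


def worker():
    """A toy handler: echoes the params back together with the contract name."""
    while True:
        message = requests.get()
        if message is None:  # shutdown signal
            break
        response = {"contract": message["contract"], "echo": message["body"]}
        with pending_lock:
            reply_box = pending.get(message["correlation_id"])
        if reply_box is not None:
            reply_box.put(response)


thread = threading.Thread(target=worker, daemon=True)
thread.start()
result = call("ban-management.create-ban.v1", {"wgid": 1234567890})
requests.put(None)
thread.join()
```

To the caller, `call` looks like an ordinary function, even though the request and the response travel as two independent messages.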

Python contract call example

from platform_client import Client

# CONTRACTS_PATH and AMQP_URL are assumed to be defined elsewhere (e.g. in settings)
client = Client(contracts_path=CONTRACTS_PATH, url=AMQP_URL, app_id='client')
client.call("ban-management.create-ban.v1", {
  "wgid": 1234567890,
  "reason": "Fraudulent activity",
  "title": "ru.wot",
  "component": "game",
  "bantype": "access_denied",
  "author_id": "v_nikonovich",
  "expires_at": "2038-01-19 03:14:07Z"
})
# Result returned by the call:
{
  u'ban_id': 31415926,
  u'wgid': 1234567890,
  u'title': u'ru.wot',
  u'component': u'game',
  u'reason': u'Fraudulent activity',
  u'bantype': u'access_denied',
  u'status': u"active",
  u'started_at': u"2019-02-15T15:15:15Z",
  u'expires_at': u"2038-01-19 03:14:07Z"
}

The same contract call, but using Elixir

:platform_client.call("ban-management.create-ban.v1", %{
  "wgid" => 1234567890,
  "reason" => "Fraudulent activity",
  "title" => "ru.wot",
  "component" => "game",
  "bantype" => "access_denied",
  "author_id" => "v_nikonovich",
  "expires_at" => "2038-01-19 03:14:07Z"
})
# Result returned by the call:
{:ok, %{
  "ban_id" => 31415926,
  "wgid" => 1234567890,
  "title" => "ru.wot",
  "component" => "game",
  "reason" => "Fraudulent activity",
  "bantype" => "access_denied",
  "status" => "active",
  "started_at" => "2019-02-15T15:15:15Z",
  "expires_at" => "2038-01-19 03:14:07Z"
}}

In place of the contract “ban-management.create-ban.v1” there can be any other platform functionality, for example: “account-management.rename-account.v1” or “notification-center.create-sms-notification.v1”. And all of it will be available through this single point of integration with the platform.

The overview would be incomplete without showing the Contract API from the point of view of a server developer. Consider a situation in which a developer needs to implement a handler for the same ban-management.create-ban.v1 contract.

from platform_server import BlockingServer, handler


class CustomServer(BlockingServer):
    @handler('ban-management.create-ban.v1')
    def handle_create_ban(self, params, context):
        # do_some_useful_job stands for the business logic of the contract
        response = do_some_useful_job(params)
        return response


# AMQP_URL and CONTRACTS_PATH are assumed to be defined elsewhere (e.g. in settings)
d = CustomServer(app_id="server", amqp_url=AMQP_URL, contracts_path=CONTRACTS_PATH)
d.serve()

This code will be enough to start serving a given contract. The server library will unpack and check the request parameters for correctness, and then call the contract handler with the request parameters ready for processing. Thus, the server developer is protected by a library, which, in case of receiving incorrect request parameters, will itself send a validation error to the client and register the fact of a problem.
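
The protective behavior described above can be sketched as a decorator that validates parameters before the business code runs. This is a self-contained illustration, not the `platform_server` implementation: the `serves` decorator, the `required` argument, the `dispatch` function and the response envelope are all assumptions.

```python
class ValidationError(Exception):
    """Raised when request parameters do not satisfy the contract."""


HANDLERS = {}  # contract name -> validating wrapper around the handler


def serves(contract, required):
    """Register a function as the handler of a contract with required params."""
    def decorator(func):
        def wrapper(params):
            missing = sorted(name for name in required if name not in params)
            if missing:
                raise ValidationError("missing params: " + ", ".join(missing))
            return func(params)
        HANDLERS[contract] = wrapper
        return wrapper
    return decorator


@serves("ban-management.create-ban.v1", required={"wgid", "reason"})
def create_ban(params):
    # Business code can rely on the required params being present.
    return {"wgid": params["wgid"], "status": "active"}


def dispatch(contract, params):
    """Return a response envelope instead of letting bad input reach business code."""
    try:
        return {"ok": True, "result": HANDLERS[contract](params)}
    except ValidationError as exc:
        return {"ok": False, "error": str(exc)}
```

With this arrangement `create_ban` is never called with incomplete input: `dispatch` answers the client with a validation error on its own.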

Because the Contract API is implemented on the basis of events under the hood, we can go beyond the request/response scenario and implement a wider range of interservice interactions.

For instance:

make a request and forget (without waiting for an answer)
make requests to several contracts simultaneously (even without using an event loop)
make a request and receive answers from several handlers at once (if the integration scenario provides for it)
register a response handler (triggered if the contract handler reported completion, accepts the result of the work of the contract handler, i.e. its response)

And this is not a complete list of scenarios that can be expressed through an event model of interaction. This is a list of those that we are currently using.
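
Three of the scenarios listed above (fire and forget, several requests at once, and a response handler) can be imitated with the standard library alone. The sketch uses a thread pool and a placeholder `call` function instead of the real client, so it shows the shape of the interactions rather than how the library implements them.

```python
from concurrent.futures import ThreadPoolExecutor


def call(contract, params):
    """Placeholder for a blocking contract call; returns a canned answer."""
    return {"contract": contract, "params": params}


executor = ThreadPoolExecutor(max_workers=4)

# Fire and forget: submit the request and never look at the future.
executor.submit(call, "notification-center.create-sms-notification.v1", {"wgid": 1})

# Several contracts at once: submit everything first, then collect the results.
futures = [
    executor.submit(call, "account-management.rename-account.v1", {"wgid": 1}),
    executor.submit(call, "ban-management.create-ban.v1", {"wgid": 1}),
]
results = [f.result() for f in futures]

# Response handler: a callback invoked once the handler's response is available.
received = []
future = executor.submit(call, "ban-management.create-ban.v1", {"wgid": 2})
future.add_done_callback(lambda f: received.append(f.result()))

# Wait for all outstanding work, including the callback, before exiting.
executor.shutdown(wait=True)
```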

Instead of a conclusion

We have been using the Contract API for several years, so it is not possible to cover all the scenarios of its use in a single overview article. For the same reason, I did not overload the article with technical details; it turned out quite long as it is. Ask questions, and I will try to answer them in the comments. If a topic proves particularly interesting, I can cover it in more detail in a separate article.
