How to make an internal product external: the experience of the Yandex.Tracker team

    Recently we opened Yandex.Tracker, our task and process management system, to external users. Inside Yandex it is used not only to build services, but even to order cookies for the office kitchens.


    As you know, the smaller the company, the simpler the tools it can get away with. If in the morning you can greet every employee in person, even a Telegram chat is enough to coordinate work. Once separate teams appear, not only is it impossible to greet everyone personally, it also becomes easy to lose track of task statuses.



    A word cloud of ticket titles in the internal Yandex.Tracker


    At this stage it is important to keep processes transparent: everyone involved should be able to check the progress of a task at any time or, say, leave a comment that will not get lost in the stream of work chat. For small teams, the tracker becomes a kind of news feed about the life of their company.


    Today we will tell readers of Habrahabr why Yandex decided to build its own tracker, how it works inside, and what difficulties we faced when opening it up to the outside world.


    Yandex now employs more than six thousand people. Although many of its parts are organized as independent startups with teams of various sizes, there is always a need to understand what is happening on the next floor: their work can overlap with yours, their improvements can help you, and some of their processes can, on the contrary, hurt yours. In such a situation it is hard, for example, to just ping a colleague from another Slack workspace, especially when the transparency of a task matters to many people from different departments.


    At some point we started using the well-known Jira. It is a good tool whose functionality, in principle, suited everyone, but it was hard to integrate with our internal services. Besides, at the scale of thousands of people who need a single space everyone can navigate without a flashlight, Jira was no longer enough. It also sometimes went down under load, even though it ran on our own servers. Yandex kept growing, the number of tickets grew with it, and upgrades to new versions took longer and longer (the last one took six months). Something had to change.


    At the end of 2011, we had several options for solving the problem:


    • Improve the performance of the old tracker. We discarded this idea: reworking the architecture of someone else's product ourselves is a bad practice, and at the very least it would put an end to future upgrades.
    • Split the tracker into several independent copies (instances) to reduce the load on each. The idea is not new; large companies use it in similar cases. However, cross-instance reports, filtering, linking and moving tasks between copies would no longer work, and all of that was critical for the company.
    • Switch to another tool. We examined the available trackers. Most of them do not scale easily, meaning the cost of infrastructure improvements and subsequent adaptation to our needs would exceed the cost of developing our own.
    • Write our own tracker. A risky option: it gives the most freedom and opportunity if it succeeds, but if it fails we end up with problems in one of our key development and planning tools. As you may have guessed, we took the risk and chose this option: the expected pros outweighed the cons and risks.

    Development of our own tracker began in January 2012. The tracker team itself was the first to move its tasks to the new service, a few months after work on the project started. Then the migration of the remaining teams began: each team put forward its requirements, we implemented them, the tracker grew new features, and the team moved over. It took two years to migrate all the teams and shut down Jira.


    But let's step back a bit and look at the list of requirements we compiled for the new service:


    • Fault tolerance. As you may have heard, the company regularly holds exercises in which one of the data centers is shut down. The service must survive them unnoticed by both users and the team, with no manual actions required when an exercise begins.
    • Scalability. Tasks in the tracker have no statute of limitations: a developer or manager may need to look at today's task as well as one that was closed seven years ago, which means we cannot delete or archive old data.
    • Integration with internal company services. Most Yandex teams required tight integration with our many services: long-term planning tools, version control systems, the employee directory, and so on.

    While we were still gathering requirements, we settled on the technologies we would use to build the tracker:


    • Java 7 (by now Java 8) for the backend.
    • Node.js + BEMHTML + i-bem for the frontend.
    • MongoDB as the main data store: automatic failover, good performance, easy sharding, and a schemaless model (convenient for custom task fields).
    • Elasticsearch for fast search and aggregation on arbitrary fields. Flexible analyzer settings for suggestions, the percolator and other Elasticsearch goodies also played a role in the choice.
    • ZooKeeper for service discovery of the backends. The backends talk to each other to invalidate caches, distribute tasks and collect their own metrics, and with ZooKeeper and its client this discovery is very easy to organize.
    • File storage as a service, which takes the headache of replicating and backing up user attachments off our hands.
    • Hystrix for communication with external services, to prevent cascading failures and avoid loading neighboring services when they are having problems (see the sketch after this list).
    • Nginx for HTTPS termination and rate limiting. As practice has shown, terminating HTTPS inside Java is not a good idea performance-wise, so we offloaded this task to nginx. The rate limiter is also more reliable on its side.
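
    To illustrate the Hystrix point above, here is a minimal sketch of wrapping a call to a neighboring service in a command with a short timeout and a fallback. The command name, the 300 ms timeout and the pretend directory call are illustrative assumptions, not code from the Tracker.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;

// Hypothetical command wrapping a call to an internal employee-directory service.
// If the dependency is slow or failing, Hystrix times out, opens the circuit and
// serves the fallback instead of piling more load onto the struggling service.
public class FetchDisplayNameCommand extends HystrixCommand<String> {

    private final String login;

    public FetchDisplayNameCommand(String login) {
        super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("StaffDirectory"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withExecutionTimeoutInMilliseconds(300)       // fail fast; illustrative value
                        .withCircuitBreakerRequestVolumeThreshold(20)));
        this.login = login;
    }

    @Override
    protected String run() {
        // A real HTTP call to the directory service would go here;
        // for the sketch we just pretend it answered.
        return "Display name of " + login;
    }

    @Override
    protected String getFallback() {
        // Degraded but safe answer when the circuit is open or the call failed.
        return login;
    }
}

// Usage: String name = new FetchDisplayNameCommand("someuser").execute();
```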

    As with any other public or internal Yandex service, we also had to think about performance headroom and scalability. To give a sense of scale: when we started designing the system we had about 1 million tasks and 3 thousand users; today the service holds almost 9 million tasks and serves more than 6 thousand users.


    By the way, despite the fairly large number of users of the internal Tracker, most requests come through the API from the Yandex services integrated with it; they are the ones creating the main load:



    Below you can see response-time percentiles in the middle of a working day:



    We regularly try to estimate the future load on the service: we make a forecast for 1-2 years ahead and then use Lunapark to check that the service will withstand it:



    This graph shows that the task search API begins to return a noticeable number of errors only above 500-600 rps. That allowed us to estimate that, taking into account the growing load from internal consumers and the growth in data volume, we will still cope with the load in two years.


    Besides high load, other unpleasant things can happen to the service, and they must be handled so that users do not notice. Here are some of them.


    1. Data center failure.
      A very unpleasant situation, which nonetheless happens regularly thanks to the exercises. What happens then? The worst case is when the MongoDB primary was in the disabled DC, but even there no developer or admin intervention is needed thanks to automatic failover. With Elasticsearch the situation is slightly different: part of the data existed in a single copy, because we use a replication factor of 1, so Elasticsearch creates new replica shards on the surviving nodes until every shard has a backup again. Meanwhile, the balancer in front of the backends gets connection timeouts for the requests that were running on instances in the disabled DC, or errors from live backends whose own requests went to the missing DC and never came back. Depending on the circumstances, the balancer may retry the request or return an error to the user, but in the end it figures out that the backends in the disabled DC are unavailable and stops sending traffic there.
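
    For context, here is a minimal sketch of how a backend can be pointed at a MongoDB replica set spread across data centers so that the driver follows a newly elected primary on its own. The host names and replica set name are made up; this is not the Tracker's actual configuration.

```java
import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;
import com.mongodb.client.MongoDatabase;

public class TrackerMongoConnection {

    public static MongoDatabase connect() {
        // Hypothetical hosts, one per data center. The driver monitors the replica
        // set topology itself: if the DC holding the primary is switched off, it
        // reconnects to the newly elected primary without manual intervention.
        MongoClientURI uri = new MongoClientURI(
                "mongodb://dc1-mongo:27017,dc2-mongo:27017,dc3-mongo:27017/tracker"
                + "?replicaSet=tracker-rs"
                + "&w=majority");          // acknowledged writes survive the loss of one DC
        return new MongoClient(uri).getDatabase("tracker");
    }
}
```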


    2. Loss of connectivity between a backend and the database / index due to network issues.
      A slightly simpler situation at first glance. Since the balancer in front of the backends regularly checks their health, a backend that cannot reach the database is detected very quickly, and the balancer once again takes the load off it. The danger is that if all the backends lose contact with the database, they will all be taken out of rotation, which ultimately affects 100% of requests. A sketch of such a health check follows below.
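
    Here is a minimal sketch of the kind of health endpoint such a balancer can poll. The /ping path, port and the storage check are assumptions for illustration, not the Tracker's real code.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class HealthEndpoint {

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        // The balancer polls /ping; we answer 200 only while the backend can
        // actually reach its database and index, otherwise 503 so that the
        // balancer takes this instance out of rotation.
        server.createContext("/ping", exchange -> {
            boolean healthy = canReachStorage();
            byte[] body = (healthy ? "OK" : "storage unreachable").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(healthy ? 200 : 503, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
    }

    private static boolean canReachStorage() {
        // Placeholder: a real backend would run a cheap command against MongoDB
        // and Elasticsearch with a short timeout here.
        return true;
    }
}
```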


    3. A flood of queries to the tracker's search.
      Searching, filtering, sorting and aggregating tasks are very expensive operations, so this part of the API has the strictest load limits. We used to find the clients bombarding us with requests by hand and ask them to reduce the load; now this happens more often, so enabling rate limits lets us simply not notice an overly active API client (see the sketch below).
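
    The real limits live on the nginx side, as mentioned earlier, but the idea can be sketched in a few lines. The per-client keying and the 10 requests per second figure below are assumptions for illustration.

```java
import com.google.common.util.concurrent.RateLimiter;
import java.util.concurrent.ConcurrentHashMap;

public class PerClientRateLimiter {

    // One token-bucket limiter per API client, keyed for example by OAuth client id.
    private final ConcurrentHashMap<String, RateLimiter> limiters = new ConcurrentHashMap<>();

    public boolean allow(String clientId) {
        RateLimiter limiter = limiters.computeIfAbsent(clientId, id -> RateLimiter.create(10.0));
        // tryAcquire() never blocks: an overly active client is rejected immediately
        // (e.g. with HTTP 429) instead of slowing everyone else down.
        return limiter.tryAcquire();
    }
}
```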

    Yandex.Tracker - for everyone


    Other companies showed interest in our service more than once: they learned about the internal tool from people who had left Yandex but could not forget the Tracker. So last year we decided to prepare the Tracker for the outside world and turn it into a product for other companies.
    We immediately started working on the architecture. We faced the big task of scaling the service to hundreds of thousands of organizations. Until then, the service had been developed for years for a single company, our own, with only its needs and quirks in mind. It became clear that the current architecture would require major rework.


    In the end, we had two options.


    Option 1: a separate instance for each organization

    Pros:
    • A minimal number of changes to the already written code

    Cons:
    • Expensive in resources
    • Complicated monitoring
    • Complicated deployments and migrations

    Option 2: a single instance that can host thousands of organizations

    Pros:
    • Reasonable use of resources
    • Fast deployment and relatively quick migrations
    • Easy monitoring

    Cons:
    • More labor-intensive to implement


    Obviously, service stability matters no less for external users than for internal ones, so the databases, search, backend and frontend all need to be duplicated across several data centers. This made the first option much harder to maintain: it produced too many points of failure. So we settled on the second option.


    It took us two months to rewrite the main part of the project, a record time for a task like this. Still, in order not to wait, we spun up several copies of the tracker on dedicated hardware so that there was something to test the frontend and the integrations with neighboring services against.


    Separately, it is worth noting that already at the design stage we made the fundamental decision to keep a single code base for both Trackers, internal and external. This saves us from copying code between projects, keeps the release pace up, and lets us ship features externally almost as soon as they appear in our internal Tracker.


    But as it turned out, adding one more parameter to every application method was not enough; we also ran into the following problems:


    1. MongoDB and Elasticsearch do not stretch forever. It was impossible to fit all the data into one instance: Elasticsearch copes poorly with a very large number of indices, and MongoDB could not hold all the organizations. So the backend was split into several large instances, each serving the organizations assigned to it. Each instance is fault tolerant, and an organization can be moved between them.
    2. Cron jobs now have to run per organization. Here we had to tackle each job individually: in some places we replaced pulling data with pushing it, in others a single cron trigger now generates a separate task for each organization (see the sketch after this list).
    3. Each organization has its own set of task fields. Because of the optimizations built around working with these fields, we had to write a separate cache for them.
    4. Index mapping updates. A fairly common operation that happens when the tracker is upgraded to a new version; we added a mechanism for incrementally updating the mapping.
    5. Opening the API to external users. We had to add a rate limiter and close off access to the service's internal API endpoints.
    6. Supporting organizations without access to their data. Our support staff and developers do not have the right to look at organizations' user data, which means all such actions have to be performed by the companies' own administrators; we added a number of admin panels to the interface for them.
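
    As an illustration of problem 2, here is a sketch of a single scheduled trigger fanning out into an independent job per organization. OrganizationRegistry, TaskQueue and the job name are hypothetical stand-ins, not the Tracker's real interfaces.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PerOrganizationCron {

    // Hypothetical stand-ins for the real organization registry and task queue.
    interface OrganizationRegistry { List<String> allOrganizationIds(); }
    interface TaskQueue { void enqueue(String orgId, String jobName); }

    private final OrganizationRegistry registry;
    private final TaskQueue queue;
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    PerOrganizationCron(OrganizationRegistry registry, TaskQueue queue) {
        this.registry = registry;
        this.queue = queue;
    }

    void start() {
        // One cron-style trigger generates a separate task for each organization,
        // so a slow or broken tenant cannot delay all the others.
        scheduler.scheduleAtFixedRate(() -> {
            for (String orgId : registry.allOrganizationIds()) {
                queue.enqueue(orgId, "nightly-recalculation");
            }
        }, 0, 1, TimeUnit.HOURS);
    }
}
```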

    Performance evaluation deserves a separate mention. Because of all the changes, we had to measure how fast the service works, how many organizations fit into one instance, and how many rps it can sustain. So we ran regular load tests, after first populating our test tracker with a large number of organizations. As a result, we determined the load threshold past which new organizations have to be placed in a new instance.



    We also set up a separate dedicated instance of the Tracker to host a demo version. To get into it, having a Yandex account is enough. Some features are disabled there (for example, file uploads), but you can get acquainted with the real Tracker interface.

