mi5ha6in September 3, 2018 at 15:25

Scaling up development: from startup to hundreds of engineers

Like many large IT companies, Badoo began as a startup. In recent years the company has grown from a few dozen engineers to several hundred. Nikolay Krapivny was at the forefront for most of that path, deciding what to do and what not to do, and how to cope with the problems that arose. His talk at TeamLead Conf was devoted to this experience and to the picture of the world that emerged from it.

Of course, each company has its own path, but the problems of human communication are roughly the same for everyone. Someone else's experience will help you think ahead about the problems a growing company will face. Even if these lessons do not apply directly, they will suggest which direction to think in.

The story consists of three parts. The first is about communications and how they change as the company grows. The second is about how to try to maintain development speed as the number of engineers on the team increases. The third is about why Badoo lives in two offices and how to cope with the resulting communication problems.

Let's get started!

About the speaker: Nikolay Krapivny (@cyberklin) has been working at Badoo for the last eight years; for five of them he has been managing teams and building development processes.

Before diving into the first part, I want to say that this is a story about our path, and it does not claim to be absolute truth. Each company has its own path, but I am sure that our experience, the values we have formed for ourselves, and some of what we learned will help you grow and build the right process. Even though your specifics are different, I hope this will be useful to you.

Communications

To begin with, let's talk a little theory about what happens to communications when a company grows.

Communication is how departments and people interact with each other so that things get done in the company.

Consider a rather hackneyed but still relevant example: an abstract startup team. Several people have come together, some closer to the business, some more technical. Overall, this is a small team building something that may someday become the second Facebook. In this team all the work is built on communication. The team is small, and there is no point in introducing any processes. Everything just works: someone talked to someone, they agreed to do something quickly, and they did it.

A process built solely on communication, on conversations like “Let's do it,” “Let's do it faster,” “Let's do it like this,” has certain problems, but such a team undoubtedly has its advantages.

The work is fast. The time from an idea to the moment it reaches the user is very short. An idea came up, we discussed how to do it quickly, and it is already done.
It is flexible. In a small team nobody is tied to one specific thing and unable to join an important task when needed. In principle, everyone does everything, and if something matters to us, everyone pitches in.
Because no processes have been built yet, such work is quite efficient. We do not spend much time on overhead, processes, or elaborate formal schemes.

These are exactly the qualities every business wants to see: the most flexible allocation of resources, minimal time-to-market, and low operating costs.

The company grows, and communications “break”.

When a company grows, the advantages of a small team, where everything runs quickly on interaction and conversation, turn into a problem. The load on communications grows with the amount of information being passed around, and eventually communications “break”. We start to lose more on communication than we gain. We need to talk to too many people; somewhere a misunderstanding creeps in as information passes from person to person, somewhere something is simply lost or forgotten. And we slowly start to lose everything that was built before and that gave us speed.

If you extrapolate and look at the company's development over a long period, it looks like a cycle. The number of people increases, the load on the process increases, and communications begin to break. What used to work stops working, so we are forced to repair those places. Often this happens at the boundaries between departments. To fix it, we have to formalize the communication process. This cycle repeats many times: the number of people grows, something starts to work inefficiently, we introduce and formalize new processes, and we gain headroom for growth until it breaks somewhere else, again and again. It is like scaling a system for performance: if you increase the load, the weakest, slowest element will not cope. We repair and improve it, and a window opens in which the load can be increased again. The same is true of scaling a company.

That was a short theoretical introduction.

Now let's take a look at what cycles we went through, what problems we encountered, and how we solved them.

Terms of reference

As a first example, consider formalizing the relationship between the business and the engineering team. The terms of reference, or PRD as we call it, is a request describing what needs to change in the product's functionality. This is a fairly obvious formalization that all companies go through. I think most of you work in companies with some kind of formal process for handing a task over to development, whether from a product team, from the business, or from an external customer.

We went through several stages of making this process more complex. At first we simply talked things through. When the team outgrew the size at which business can be done just by talking to each other, we began to write everything down in tasks. Tasks were formulated as “what needs to be done”. Then the complexity of the product grew, the number of people in the company grew, and we concluded that it is useful to keep an up-to-date description of the working system in one place. We moved all of this to a wiki, and moved the discussion of changes into wiki comments so that everything is in one place. The next step was to formalize what a PRD must contain, plus a PRD review process. Now we have a template that fixes what information must be in the PRD, what should be described, and what data should be collected before work starts. For example:

The goal: why we are building this functionality.
The platforms, products, and countries in which we plan to launch.
A description of the functionality as use cases: the main cases plus a predefined list of “complicated cases” that everyone forgets about.
Text tokens (handled separately by the copywriter).
Communications: whether there will be email or push notifications for this functionality, and if so, which ones.
A release plan that takes into account marketing and other projects in the company.
Analytics: how we will evaluate the results, and which business metrics we need to add to assess the success of the change.

Thus, in its current form, the interaction between the product and engineering teams is formalized quite strictly, which helps us not to lose important points when a task is handed over to development.

Client-server

We grew further; mobile development appeared and became one of the key areas. The next point at which communication “broke” was the junction between client and server: how the client should interact with the server at the protocol level. This used to be decided in conversations between the client developers and the server developers. But the number of teams grew, and so did the number of people in them. The fact that information about client-server interaction was stored only in developers' heads began to cause problems.

Documentation

The problems we encountered were fairly simple and obvious. The client-server relationship is not only a protocol, but also a communication scheme according to this protocol. What commands to send and when, how the client should request something, how the application starts - everything must follow the protocol.

For example, client developers are solving a problem and assume that the API has a suitable command that can be called and everything will be fine. The client is released and causes a problem on the server, because the command turned out to be too heavy and requires too many resources. In addition, the iOS and Android teams understand the API a little differently and implement it in different ways, so we cannot change the API quickly. We therefore concluded that the protocol needs to be documented.

A release cannot be rolled back

Another peculiarity of mobile platforms is that a release cannot be rolled back. Once the application has been published in the store and a user has installed it, we will most likely have to live with that client version for a very long time. A mistake made while designing the protocol, while defining how the client and server interact, is expensive. At Badoo we have to support any released application for another year or two, until its number of users drops below a certain threshold.

To solve this problem, we decided to set up a separate MAPI (Mobile API) team that documents the protocol and serves as a point of knowledge exchange between client and server. The team includes people from both client and server development. This mixed team turns product requirements into protocol changes and documentation. Since a mistake at the protocol stage is quite expensive for us, the processes in this team are somewhat more complex and stricter than in the others. They use double code review, trying to eliminate the possibility of error as far as possible.

This team quickly became a hub for knowledge sharing. Client and server developers often cannot agree on how they should interact. For example, an approach may be the only option on iOS yet unsuitable for Android. The new team resolves these disputes and, if necessary, brings together the right people to make the right decision.

If you look at the diagram of our process, the Mobile API team is an intermediate link between the moment the requirements are ready and the moment development begins. That is:

a task with the terms of reference (PRD) arrives from the product team;
the protocol design team draws up the documentation;
development of the client and server parts begins according to the documentation.

With this process, server and client development can proceed independently, and we often take advantage of that.

Statistics problem

The company continued to grow and develop, with more people and more projects. Gradually a separate team emerged that deals with data and statistics and helps the product team analyze how users respond to changes. As I said, problems appear at the junctions between teams. We had a new team, and after a while this junction also began to work inefficiently.

The fact is that analysts need good data to identify patterns and answer tricky product questions. Good data means that all statistics follow a single language. When we talk about statistics and about our product, we need to speak the same language.

Before this, in each terms of reference the product manager described in words how statistics should be collected: measure the click rate for this button, the conversion for that screen. But then the developer decided which events to track, how to measure them (from the client or the server), which graphs to draw, and which breakdowns to add to the events. One developer might break graphs down by device type, add gender, or collect statistics by country. This disparate data reached the analytics department, but it was impossible to assess the quality of a product decision accurately on its basis. The result was a backflow of tasks: we make changes, the changes are implemented, the product manager requests an analysis, the statistics team requests additional data, the task goes back for rework, and the statistics are reworked.

The process of collecting and analyzing statistics needs to be formalized.

We decided that the requirements for statistics would be recorded in the terms of reference (PRD), and that analysts would own the knowledge about these requirements. When the terms of reference are handed over to development, the analyst specifies which statistics are needed, which events to track, and which breakdowns to apply to the data. If the analyst asks to extend existing statistics or add new ones, we add new functionality or modify what exists. To support this, we formalized how code works with data: we built a single API that simply does not allow sending incomplete or invalid data.
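As a rough illustration of a tracking API that refuses incomplete or invalid data, here is a minimal Python sketch. The event names, the required fields, and the `EventSink.track` method are all hypothetical and are not Badoo's actual API; the only point is that validation happens at the call site, before anything is stored.

```python
from dataclasses import dataclass, field

# Hypothetical registry: event name -> required fields and their types.
# In the process described above, analysts would own this registry.
EVENT_SCHEMAS = {
    "button_click": {"screen": str, "button": str, "platform": str},
    "screen_view": {"screen": str, "platform": str, "country": str},
}


class InvalidEventError(ValueError):
    """Raised when an event does not match its registered schema."""


@dataclass
class EventSink:
    """In-memory stand-in for the real statistics storage."""

    events: list = field(default_factory=list)

    def track(self, name: str, **fields) -> None:
        schema = EVENT_SCHEMAS.get(name)
        if schema is None:
            raise InvalidEventError(f"unknown event: {name}")
        # Reject events that lack a required breakdown.
        missing = sorted(set(schema) - set(fields))
        if missing:
            raise InvalidEventError(f"{name}: missing fields {missing}")
        # Reject ad hoc fields that analysts have not agreed on.
        extra = sorted(set(fields) - set(schema))
        if extra:
            raise InvalidEventError(f"{name}: unexpected fields {extra}")
        for key, expected_type in schema.items():
            if not isinstance(fields[key], expected_type):
                raise InvalidEventError(
                    f"{name}: field {key!r} must be {expected_type.__name__}"
                )
        self.events.append({"event": name, **fields})
```

With a design like this, a developer cannot quietly invent a new breakdown or omit one: the call fails immediately, and the fix is to agree on a schema change with the analysts.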

In parallel, on the tooling side, we have MicroStrategy, a fast data visualization tool, and our own tool for A/B testing. Analysts own all the knowledge about how to use these tools correctly.

Another stage is added to the process diagram. The PRD goes through an approval stage in the analytics department, and only then is it passed to the MAPI team and to development. This is how it works right now.

Load distribution

The next problem is associated with increased load and with interaction inside a single department. I lead the backend development team for our products, and I will use it to illustrate the problems that arise as the number of people in one team grows.

In a team of up to 15 people everything is quite simple. We assumed that everyone does everything and distributed tasks mainly according to who was free at the moment. Why up to 15?

It is generally believed that one team lead or technical lead should manage a team of up to 7–9 people. This is an empirically established reasonable number of direct reports.

We had a team lead and his deputy, so together we managed 14–15 people. With further growth, some additional division became necessary. The flow of development tasks needs to be balanced. We identified the main requirement for this process: specialization. Each piece of code should have experts, one or two and preferably three, who know it best and who maintain it. At the same time there is an orthogonal requirement: maintaining flexibility. For example, if five people maintain the messenger and too many urgent tasks arrive for it, those tasks should not sit waiting. If the wider team has free resources, they should be brought in to work on other groups' tasks. These requirements contradict each other, but we still want to try to satisfy both.

We divided the large team into development groups of 4–9 people. Each group is headed by a technical lead, who is the group's direct manager. We introduced the concept of a component: a piece of code that is complete in terms of product functionality. Each component is assigned to a specific group. Within the group, each component has one to three people who are experts on it and who develop and maintain it.

In terms of load sharing, every task has a component.
Technical debt and support tasks go to the “native” group, the one to which the component is assigned.
We try to give new functionality to the “native” group as well, but only if it has the capacity.
To maintain flexibility, we do not rule out one group helping another and doing something unrelated to its own components.
In that case, the “native” group carries out either a review of the technical design or a code review.
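The distribution rules above can be sketched as a small routing function. This is only an illustration in Python: the component names, group names, the `free_capacity` measure, and the choice of the least loaded group as helper are all assumptions made for the example, not a description of Badoo's tooling.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical ownership map: component -> "native" group.
COMPONENT_OWNERS = {
    "messenger": "group_a",
    "registration": "group_b",
    "payments": "group_c",
}


@dataclass(frozen=True)
class Task:
    title: str
    component: str
    kind: str  # "tech_debt", "support", or "feature"


@dataclass(frozen=True)
class Assignment:
    developer_group: str
    reviewer_group: str  # the native group reviews work done by others


def assign(task: Task, free_capacity: dict[str, int]) -> Assignment:
    """Pick the group that implements a task and the group that reviews it."""
    native = COMPONENT_OWNERS[task.component]
    # Technical debt and support always stay with the native group.
    if task.kind in ("tech_debt", "support"):
        return Assignment(native, native)
    # New functionality goes to the native group when it has capacity.
    if free_capacity.get(native, 0) > 0:
        return Assignment(native, native)
    # Otherwise another group with free capacity helps out,
    # and the native group reviews the result.
    helpers = {g: c for g, c in free_capacity.items() if g != native and c > 0}
    if not helpers:
        return Assignment(native, native)  # nobody is free: the task waits at home
    return Assignment(max(helpers, key=helpers.get), native)
```

For example, with `free_capacity = {"group_a": 0, "group_b": 2, "group_c": 1}`, a new messenger feature is developed by `group_b` and reviewed by `group_a`, while a messenger support task stays with `group_a` regardless of load.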

This is how we work now. The team has 30 people, 5 groups, and 22 components shared between them. So far we see no limit to further growth in this format, and we will stick with it up to a certain scale.

An interesting side effect appears when the number of projects, people, and changes in a team grows substantially. There is so much going on overall that it becomes difficult to understand the specific cause of a change.

Take, for example, a rise in new user registrations in Brazil. The cause could be a spam bot that registers new accounts and spoils our lives, a problem with sessions, a promotional campaign, or the launch of a new wave of marketing in Brazil. The change is visible on the chart, and we want to understand why it happened with minimal effort.

We built ourselves a tool called WTF. It is a single tool that collects, from all kinds of subsystems and parts of production, anything that could somehow affect metrics. The tool is integrated into our charting tool, so you can see the changes that fall within a given interval. As a bonus, we try to integrate not only technical events (crashes, configuration changes) but also business events (promos and advertising campaigns).

The interface is simple: a red line marks an event, such as a configuration change. A tool like this helps track changes in a growing project.
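The core idea of such a tool, one shared log of events that can be overlaid on any chart interval, can be sketched in a few lines of Python. The `ChangeEvent` type, its fields, and the source names are hypothetical; the talk does not describe how WTF is implemented.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChangeEvent:
    at: datetime
    source: str  # hypothetical sources: "config", "release", "marketing"
    description: str


def events_in_window(events, start, end):
    """Return events with start <= at < end, oldest first."""
    return sorted((e for e in events if start <= e.at < end), key=lambda e: e.at)


# Events gathered from different subsystems into one log.
log = [
    ChangeEvent(datetime(2018, 3, 5, 14, 0), "marketing", "campaign launched in Brazil"),
    ChangeEvent(datetime(2018, 3, 1, 9, 30), "release", "web release"),
    ChangeEvent(datetime(2018, 3, 5, 11, 15), "config", "anti-spam threshold changed"),
]

# Someone looking at a registration spike on March 5 asks for that day's markers.
markers = events_in_window(log, datetime(2018, 3, 5), datetime(2018, 3, 6))
```

Here `markers` holds the configuration change followed by the marketing launch, which are the two candidate explanations a chart would show as vertical lines for that day.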

To summarize the first part of my talk:

As the team grows, communications will break down. They become overloaded and ineffective.
Most often this happens between departments; in our case, between server and client development.
Where it breaks, we formalize the process.
New tools will be needed as the number of projects grows.

What worked for us:

Formal interaction between the product and engineering departments goes through the terms of reference (PRD).
Interaction with BI is based on the analysts' requirements.
The MAPI team handles the protocol between the client and server parts.
All interaction within the department follows the component principle, which is a way of formalizing the distribution of tasks.

About 200 people are involved in the development process. With further growth we may face new challenges. Then in a couple of years there will be a new talk about how we redid everything :)

Speed

We want to keep the speed of making changes to the system with the growth of the team. At the same time, faced with communication problems, we introduced a number of formal processes and got a multi-stage scheme.

With such a process, time-to-market keeps growing. This is what we look like now.

Our system is like a big ship. It sails very fast and is heavily armed, and everything is great until you need to make some very small change. To maneuver and react to the market, we have to push every change through our entire scheme.

Then we thought: maybe this is all wrong. Maybe we are growing incorrectly and need to redo everything. The option of cross-functional teams comes to mind. We had been scaling the system vertically: more work meant more people, and we lost delivery speed. Maybe it is worth switching to a scheme in which the company consists of a large number of startups. Each startup would do its part of the work itself and have effective communication inside. Then there would be no need to introduce formal processes.

The idea of converting functional teams into cross-functional ones to speed things up came up more than once in the course of our evolution. We rejected it because of several drawbacks.

Less resource flexibility. Redistributing people between cross-functional teams is harder, so the response to a change in load or process is slower.
The issue of technological control over the system. Suppose there are 10 teams, each with its own back-end, front-end, and analytics. The question arises whether each back-end will be written in its own language and whether the development stack will drift apart. It also threatens to reinvent the wheel for the same problems. All this puts an additional burden on the administration of the whole system.
This system only works at a certain scale. You need a bus factor greater than one, so you cannot build a team with a single back-end developer. There should be at least two specialists, and subjectively it seems that more people are needed to do the same number of tasks.

If you think of our system as a queuing system that serves requests (where the requests are product hypotheses and changes), you can find the answer to the question about speed. The graph is the saturation diagram from queuing theory. The X axis shows requests per second; for our process, this is the number of tasks completed. The Y axis shows the processing time of each request.

From the point of view of queuing theory, the system can be optimized either by the number of tasks to be solved, or by the time of processing the task.

A functional team is optimized for the number of tasks completed; a cross-functional team, for delivery time. In a cross-functional team everything happens faster and time-to-market is shorter, but fewer tasks get done. To do a task faster, some resources must either be completely free and waiting for the task to arrive, or be busy with something less important that can be postponed. With functional teams we are, in essence, optimizing the utilization of development resources. Thanks to this internal optimization, we complete a large number of tasks.
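The shape of such a saturation curve can be reproduced with the simplest textbook queue. The sketch below assumes an M/M/1 model (a single server with random arrivals and random service times), which the talk does not specify; the function and parameter names are illustrative. In that model the mean time a task spends in the system is 1 / (service_rate - arrival_rate), so it grows sharply as utilization approaches 100%.

```python
def utilization(arrival_rate: float, service_rate: float) -> float:
    """Fraction of time the server is busy."""
    if service_rate <= 0 or arrival_rate < 0:
        raise ValueError("rates must be non-negative, service_rate positive")
    return arrival_rate / service_rate


def mean_time_in_system(arrival_rate: float, service_rate: float) -> float:
    """Mean waiting plus service time for an M/M/1 queue."""
    if utilization(arrival_rate, service_rate) >= 1:
        raise ValueError("queue is unstable: arrivals must be slower than service")
    return 1.0 / (service_rate - arrival_rate)


# A team that can finish 10 tasks per week, under different loads.
relaxed = mean_time_in_system(arrival_rate=5, service_rate=10)  # 50% utilization
loaded = mean_time_in_system(arrival_rate=9, service_rate=10)   # 90% utilization
```

Under these assumptions, going from 50% to 90% utilization makes the team complete almost twice as many tasks, but each task takes five times longer from arrival to delivery (0.2 versus 1.0 weeks). That is the trade-off described above: high utilization maximizes throughput, while spare capacity buys delivery speed.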

Let's get back to the problem. We still lack flexibility and speed for fast product projects. We want delivery time to be minimal and do not want to waste time on processes. We want the advantages of both approaches. To achieve this, we split our workflow. For the business and for some specific tasks (such as marketing), speed is important, so for them we use cross-functional teams. For areas in which delivery speed is less important, we use the general scheme.

In effect, these are project teams. The product department says what is needed now, what matters to the company, what we want to add and improve. Most often these are experimental projects where nobody knows for sure whether the idea will take off. There is no need to invest a lot of resources in documentation or in building ideal solutions for them. Many marketing campaigns work on the principle that either we build it in 2 days and launch the campaign, or we miss the opportunity. For such areas we form project teams that give us delivery speed. For everything else, the process follows the scheme I described earlier. In cross-functional project teams some stages may be skipped, and for some tasks the work can move as in a startup, on communication alone.

This system needs constant monitoring. We constantly monitor which areas are critical for us, where speed is needed at the moment, which temporary project teams to create.

70–80 percent of tasks take the long formal path. All tasks related to technical debt and support also go the usual way. With this scheme we keep technological control, documentation, and the reliability of the decisions made, because project teams are temporary: they build something, and then their members return to their departments and go on maintaining their code.

Let's summarize the second part.

Functional teams optimize for the amount of work done; cross-functional teams optimize for the speed of delivery.

Each company needs to decide what balance it wants. If you are ready to spend a little more time and resources and to solve some problems with the technology stack, but want more speed, a cross-functional approach may suit you. If you are not ready for that and simply want to grow evolutionarily, as we do, choose the functional approach. We use both for different tasks.

Moscow — London

We have dealt with the relatively serious topics; now let's talk about how we work across two offices. How did we end up in different offices, what problems come with that, and how do we solve them?

It happened historically. Our business has always been located in London, and Moscow was the technology office. Mobile development, as a strategic direction, appeared in London from the start, close to the business. It developed actively, and at some point we found that client development sat in London while the backend, server-side and web, historically sat in Moscow. We realized that with our approach to cross-functional teams we were losing a lot. The distance between offices creates problems when we need a small team that brings together specialists in server-side, web, product, and analytics. So in 2014 the great migration began. We moved the entire web development team to London, along with half of the server-side team. Now all client teams work together with web in London, and the server side is split 50/50. This decision may seem strange, but overall, in terms of the process, we won: now in London, next to the client teams and the product team, there is a piece of the backend that can take part in quick projects, help, and work faster. From the point of view of our department we lost a little, because the team is split between two offices.

As a bonus to all this, we got other advantages.

Better recruitment opportunities: we can hire from both Europe and Russia. We can take on people who want to change jobs not because they dislike something where they are, but simply because they want to relocate; such people exist too.

Communication from the business to engineers has improved, because part of our team now sits in London and quickly learns what is happening, which projects are coming, and what results are already in. I called this “presence at the center.”

Fault tolerance. A small thing, but a nice one: previously, when the whole Moscow office went off to dig potatoes over the May holidays, work in London ground to a halt. With the 50/50 split we are resilient to the 10-day New Year holidays and similar situations.

One of the issues we struggle with is the time difference. More precisely, it is hard to fight, but it is there. It is not that big, just 3 hours. Some of you probably work in teams where the difference is 12 hours. But 3 hours is an unpleasant difference when you work together. A person arrives at the London office, spends the first hour getting into the work, reaches cruising speed, and is ready to do something with Moscow, but Moscow is leaving for lunch. Moscow comes back from lunch, and London has gone to lunch. This mismatch through half the day is very unpleasant. Therefore, the Moscow office starts at 11:00 and the London office at 9:00. With this shift we partially offset the difference and work in almost the same rhythm.

The upside of the time difference is that releases happen twice a day, one in the morning and one in the evening. The morning release is done by the Moscow team, because for them 1 p.m. is a time of serious work, whereas people in London have only just arrived, someone may still be missing, and there may be problems. The evening release is done by London: if something goes wrong, it will still be during working hours, and no one will have to stay late.

The next problem we struggle with is what I call a “lack of warmth.” Because the team is split, the relationship between a manager and a report is remote. The role of a manager toward a report can be roughly divided into two:

Sensei: the long-term role, which is to ensure growth, set strategic goals, and so on.
Mentor: the operational, short-term role, essentially “I will teach you to fight right now”: how to act in order to solve current tasks.

The strategic role can be performed remotely through regular visits and conversations, since it is strategic and does not require constant involvement in the work. The mentor role, operational management, is quite difficult to perform remotely. So we have made it a rule that technical mentoring for all young, new employees happens locally. Someone on the local team takes on the mentor's responsibilities and helps the person with day-to-day questions on the spot. The manager can still handle the work related to the person's strategic growth.

None of this changes the fact that you simply need to meet more often. Everything here is standard:

We go on business trips to visit each other and meet for project work.
A performance review takes place once a quarter. All managers who have reports in the other office are required to come and talk one on one. After all, a conversation over video conference is not at all the same as talking to a person face to face.
New employees always visit the other office: to get acquainted, to see how the other team works, and, for an engineer, to meet the product manager, or vice versa.
We hold group meetings. The department is divided into groups, and each group gets together once a quarter, alternating between cities. First the whole group gathers in Moscow; people do something together, interact, and go through a kind of team building.
Once a year we try to hold a general gathering of the department in an informal setting. Usually it is a weekend during which you can do something useful and discuss problems, but also just talk about life. It helps us feel that we are doing one common thing.

We also have a company-wide event called “All Hands”. Once a quarter all employees gather, and speakers talk about what we have achieved recently. To reduce the distance between offices, this meeting alternates between Moscow and London. One quarter, everyone who is due to speak comes to Moscow, and London joins by video conference. The next quarter it is the other way round: the talks are given in London, and Moscow joins by video conference.

That's how we live at Badoo.

For a fresh portion of other people's organizational experience, we invite you to Saint TeamLead Conf. The program includes talks from very well-known companies: Sberbank-Technology, Avito, JetBrains, Spotify... they are all great!

This talk is one of dozens. If you do not want to wait until we get around to transcribing and publishing them on Habr, watch the conference playlist on our YouTube channel.

