Scaling development: from a startup to hundreds of engineers

Many large IT companies began as startups, and Badoo is no exception. In recent years the company has grown from several dozen engineers to several hundred. Nikolai Krapivny was on the front line for most of that journey and made the decisions: what to do, what to avoid, and how to cope with problems. His talk at TeamLead Conf was devoted to this experience and to the picture of the world that formed as a result.

Of course, each company has its own path, but the problems of human communication are much the same everywhere. Someone else's experience helps you think ahead about the problems a growing company will face. Even if these ideas do not apply to you directly, they will suggest which direction to think in.

The story consists of three parts. The first is about communications and how they change as the company grows. The second is about how to keep the speed of development while increasing the number of engineers. The third is about why Badoo lives in two offices and how to cope with the communication problems that causes.

Let's get started!

About the speaker: Nikolai Krapivny (@cyberklin) has worked at Badoo for the last eight years, five of them managing teams and building development processes.

Before diving into the first part, I want to say that this is a story about our path, and it does not claim to be the absolute truth. Each company has its own way, but I am sure that our experience and the values we have formed for ourselves will help you as you grow and help you build the right process. Even though your specifics differ and everything works a little differently for you, I hope this will be useful.

Communications

To begin with, let's discuss in theory what happens to communications when a company grows.

Communications are how departments and people interact with each other, how information is exchanged so that something gets done in the company.

Let us consider a hackneyed but nonetheless realistic example: the team of an abstract startup. Several people have gathered; some are closer to the business, others are more technical. Overall, this is a small team building something that may one day become the second Facebook. In this team, all work is built on communications. The team is small, and there is no point in introducing any processes. Everything just works: someone talked to someone, they agreed to do something quickly, and they did it.

A process built only on conversations (“Let's do it”, “Let's make it faster”, “Let's do it like this”) has certain problems, but such a team certainly has its advantages.

Work happens fast. The time from an idea to the moment it becomes available to the user is very short. An idea came up, we discussed how to do it quickly, and it is already done.
It is flexible. In a small team, nobody is limited to one specific area and unable to join an important task when needed. In principle, everyone does everything, and if something is important, everyone pitches in.
It is efficient. Since no processes have been built yet, we do not spend extra time on overhead, processes or formal schemes.

These are the values every business wants to see: maximum flexibility in using resources, minimum time-to-market and low operating costs.

The company grows, and communications “break”.

When a company grows, the advantages of a small team, where everything runs quickly on interaction and conversations, turn into a problem. The load on communications, the amount of information transmitted, starts to grow, and eventually communications “break”. We start to lose more on communications than we gain. We have to talk to too many people; somewhere information is misunderstood as it passes from person to person, somewhere something is simply lost or forgotten. And everything that was built earlier and gave us speed, we quietly begin to lose.

If you extrapolate and look at the company's development over a long period, it looks like a cycle. The number of people increases, the load on the process increases, and communications begin to break. What used to work stops working, so we are forced to repair something in those places. Often this happens at the boundaries between departments. To fix it, you have to formalize the communication process. This cycle repeats many times: the number of people increases, something starts to work inefficiently, we introduce and formalize new processes, and we get new headroom for growth until it breaks in a different place, and so on. It is like scaling a system for performance: if you increase the load, the weakest element, the slowest part, gives way. We repair or improve it, and a window appears in which the load can be increased again. The same goes for scaling a company.

That was a short theoretical introduction.

Now let's take a practical look at what cycles we went through, what problems we encountered, and how we solved them.

Terms of reference (PRD)

As a first example, consider formalizing the relationship between the business and the engineering team. The terms of reference, or the PRD as we call it, is a request describing what needs to change in the product's functionality. This is a fairly obvious formalization that all companies go through. I think most of you work in companies that have some kind of formal process for handing a task over to development, whether it comes from the product team, from the business or from an external customer.

This process became more complicated in several stages. At first we kept it informal. When the team outgrew the size at which things get done just by talking to each other, we began to write everything down in tasks. Tasks were formulated as “what needs to be done”. Then the complexity of the product and the number of people in the company grew, and we concluded that it is useful to keep the up-to-date version of the requirements in one place. We moved everything to the wiki, and the discussion of changes to the wiki comments, so that everything was in one place. The next step was to formalize what a PRD must contain and to add a PRD review process. Now we have a template that specifies what information a PRD must contain, what should be described and what data should be collected before work starts. For example:

The goal: why we are building this functionality.
The platforms, products and countries where we plan to launch.
A description of the functionality in use-case format: the main cases plus a pre-written list of “difficult cases” that everyone tends to forget.
Tokens (processed separately by a copywriter).
Communications: whether there will be email or push notifications for this functionality and, if so, which ones.
The release plan, including dependencies on marketing and other projects in the company.
Analytics: how we will evaluate the results and which business metrics we need to add to assess the success of the change.

Thus, in its current form, the interaction between the product team and the technical team is formalized quite strictly, which helps us not to lose any important points when a task is handed over to development.
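A template like this can also be checked mechanically before a task is handed over. The snippet below is only a minimal sketch under assumptions: the `Prd` class, its field names and the `missing_sections` helper are hypothetical and simply mirror the checklist above; they are not Badoo's actual template or tooling.

```python
from __future__ import annotations

from dataclasses import dataclass, field, fields


@dataclass
class Prd:
    """Hypothetical PRD record; field names mirror the checklist above."""
    goal: str = ""
    platforms: list[str] = field(default_factory=list)
    use_cases: list[str] = field(default_factory=list)
    tokens: list[str] = field(default_factory=list)
    communications: str = ""
    release_plan: str = ""
    analytics: str = ""


def missing_sections(prd: Prd) -> list[str]:
    """Return the names of sections that are still empty, in template order."""
    return [f.name for f in fields(prd) if not getattr(prd, f.name)]


draft = Prd(goal="Let users pin a chat", platforms=["iOS", "Android"])
print(missing_sections(draft))
# ['use_cases', 'tokens', 'communications', 'release_plan', 'analytics']
```

A review step could refuse to pass the PRD on while `missing_sections` returns anything; a section that genuinely does not apply would then be filled in explicitly (for example "no notifications") rather than left empty.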

Client-server

We grew further; mobile development appeared and became one of the key areas. The next point where communication “broke” was the boundary between the client and the server: how the client should interact with the server at the protocol level. At first this was solved by conversations between client and server developers. But the number of teams grew, and so did the number of people in them. The fact that information about client-server interaction was stored only in developers' heads began to cause problems.

Documentation

The problems we encountered were fairly simple and obvious. The client-server relationship is not only a protocol, but also a scheme of interaction over that protocol: which commands to send and when, how the client should request something, how the application starts up. Everything should follow the protocol.

For example, client-side developers work on a task and assume that the API has a suitable command that they can call and everything will be fine. The client is released and creates a problem on the server, because the command turns out to be too heavy and requires too many resources. In addition, the iOS and Android teams understand the API a little differently and implement it differently, so we cannot change the API quickly. Thus, we came to the conclusion that the protocol needs to be documented.

A release cannot be rolled back

A peculiarity of mobile platforms is that a release cannot be rolled back. Once the application is published in the store and a user has installed it, we will most likely have to live with that version for a very long time. An error at the protocol design stage, when the interaction between the client and the server is defined, is expensive. At Badoo, we have to support every released application for another year or two, until the number of its users falls below a certain limit.

To solve this problem, we set up a separate MAPI (Mobile API) team that documents the protocol and serves as the knowledge-sharing point between the client and the server. This team includes people from both client and server development. This mixed team turns product requirements into protocol changes and documentation. Since an error in the protocol is rather expensive for us, the processes in this team are somewhat stricter than in all the others. They use double code review to try to eliminate the possibility of an error.

This team quickly became the center of knowledge sharing. Situations often arise where client and server developers cannot agree on how they should interact. For example, iOS can only do it one way, but that way does not suit Android. The new team resolves these disputes and, if necessary, gathers the right people to make the right decision.

If you look at the outline of our process, the Mobile API team is an intermediate link between the moment the requirements are ready and the moment development begins. That is:

the product team hands over the terms of reference (PRD);
the protocol design team writes the documentation;
development of the client and server parts begins according to the documentation.

With such a process, server and client development can proceed independently, and we often take advantage of that.

The statistics problem

The company continued to grow; there were more people and projects. Gradually a separate team emerged that deals with data and statistics and helps the product team analyze how users react to changes. As I said, problems appear at the boundaries between teams. We had a new team, and after a while this boundary also began to work inefficiently.

The fact is that analysts need good data to identify patterns and answer tricky product questions. Good data means that all statistics follow a single language. When we talk about statistics and our product, we all need to speak that one language.

Before that, in each PRD the product manager described the statistics to collect in words: measure the click rate for this button, the conversion rate for this screen. The developer then decided which events to track, how to measure them (on the client or on the server), which graphs to draw and, for example, which breakdowns to add to these events. A developer might break graphs down by device type, add gender, or collect statistics by country. This inconsistent data reached the analytics department, but it was impossible to assess the quality of a product decision accurately from it. The result was a backflow of tasks: we make changes, the changes are implemented, the product manager requests an analysis, the statistics team requests additional data, the task goes back for rework, the statistics are reworked, and so on.

The process of collecting and analyzing statistics needs to be formalized.

We decided that statistics requirements would be recorded in the PRD, and that analysts would own the knowledge of those requirements. When a PRD is handed over to development, the analyst says which statistics are needed, which events to track and which breakdowns the data needs. If the analyst asks to extend existing statistics or add new ones, we add new functionality or modify the existing one. To support this, we formalized work with data in code: we made a single API that simply does not allow sending incomplete or invalid data.
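The idea of an API that refuses incomplete or invalid data can be illustrated with a small sketch. Everything here is an assumption made for illustration: the event names, the required breakdowns (`platform`, `country`, `gender`) and the `track` function are hypothetical and are not Badoo's real statistics API.

```python
# Hypothetical registry: which events exist and which breakdowns are mandatory.
KNOWN_EVENTS = {"button_click", "screen_view"}
REQUIRED_DIMENSIONS = {"platform", "country", "gender"}


class InvalidEvent(ValueError):
    """Raised when an event does not satisfy the analysts' requirements."""


def track(event: str, dimensions: dict, sink: list) -> None:
    """Validate an event and append it to `sink` only if it is complete."""
    if event not in KNOWN_EVENTS:
        raise InvalidEvent(f"unknown event: {event}")
    missing = REQUIRED_DIMENSIONS - set(dimensions)
    if missing:
        raise InvalidEvent(f"missing dimensions: {sorted(missing)}")
    sink.append({"event": event, **dimensions})


events = []
track("button_click", {"platform": "ios", "country": "BR", "gender": "f"}, events)

try:
    track("button_click", {"platform": "ios"}, events)  # no country or gender
except InvalidEvent as error:
    print(error)
```

The point of the design is that the rules live in one place owned by the analysts, so a developer cannot silently ship an event with an ad hoc set of breakdowns.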

In parallel, in terms of tools, we have a fast Microstrategy tool for data visualization and our own A/B testing tool. The analysts own all the knowledge about how to use these tools properly.

Another stage is added to the process diagram: the PRD is first agreed with the analytics department, and only then is it passed on to MAPI and development. This is how it works right now.

Load distribution

The next problem is related to the growth of load and interaction within one department. I lead the backend development team for our products, and using it as an example, I will illustrate what problems arise as the number of people within one team increases.

In a team of up to 15 people, everything is quite simple. We assumed that everyone does everything and distributed tasks mainly on the principle that whoever is free takes the task. Why up to 15?

It is believed that one team lead or tech lead should manage a team of up to 7–9 people. This is an empirically established number of subordinates that one person can handle adequately.

We had a team lead and his deputy, so together we managed 14–15 people. With further growth, some additional division became necessary, and the flow of development tasks had to be balanced. We defined the main requirement for this process: we build specialization. Each piece of code gets its experts, one or two, or better three, who know this code best and maintain it. At the same time, there is an orthogonal requirement: to keep flexibility. For example, if five people maintain the messenger and there are not enough urgent tasks for them, they should not sit idle. If the team has free resources, they should be brought in to work on other people's tasks. These requirements contradict each other, but we still want to try to satisfy both.

We divided the large team into development groups of 4–9 people. Each group is headed by a lead, who is the direct manager of its members. We introduced the concept of a component: a piece of code that is complete in terms of product functionality. Each component is assigned to a specific group, and within the group each component has one to three people who are its experts and are responsible for its development and support.

In terms of load distribution, each task has a component.
Technical debt and support tasks go to the “native” group, the one to which the component is assigned.
We try to give new functionality to the “native” group too, but only if it has the capacity.
To keep flexibility, we allow one group to help another and do work unrelated to its own components.
In that case, the “native” group reviews either the technical design or the code.
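The distribution rules above can be sketched as a small routing function. This is an illustration under assumptions, not the actual process: the component-to-group mapping, the task kinds and the `free_capacity` numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping of components to their "native" groups.
COMPONENT_OWNERS = {"messenger": "group_a", "registration": "group_b"}


@dataclass
class Assignment:
    executor: str            # group that does the work
    reviewer: Optional[str]  # native group, if someone else does the work


def assign(component: str, kind: str, free_capacity: dict) -> Assignment:
    """Route a task following the rules listed above."""
    native = COMPONENT_OWNERS[component]
    # Technical debt and support always stay with the native group;
    # new functionality stays there too when the group has capacity.
    if kind in ("tech_debt", "support") or free_capacity.get(native, 0) > 0:
        return Assignment(executor=native, reviewer=None)
    # Otherwise another group with spare capacity helps out,
    # and the native group reviews the design or the code.
    for group, capacity in free_capacity.items():
        if group != native and capacity > 0:
            return Assignment(executor=group, reviewer=native)
    # Nobody is free: the task waits in the native group's queue.
    return Assignment(executor=native, reviewer=None)


print(assign("messenger", "feature", {"group_a": 0, "group_b": 2}))
```

In this sketch a feature for the messenger goes to `group_b` with `group_a` as reviewer, while a support task for the same component would stay with `group_a` regardless of load.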

This is how we work now. The team has 30 people, 5 groups and 22 components divided between them. We do not yet see a limit to further growth in this format, and up to a certain scale we will stick to it.

An interesting side effect appears when the number of projects, people and changes grows considerably: there is so much going on that it becomes difficult to understand the specific reason for a change in a metric.

Take, for example, a rise in new user registrations in Brazil. The reason may be a spam bot that registers new accounts and spoils our life, problems with sessions, a promo campaign, or the launch of a new wave of marketing in Brazil. The graph shows a change, and we want to understand what caused it with minimal effort.

We built ourselves a tool called WTF. It is a single tool that collects events from the various subsystems and parts of production that can somehow influence the metrics. It is integrated into the graphing tool, so you can see the changes for a given interval. As a bonus, we try to integrate not only technical events (incidents, configuration changes) but also business events (promo and advertising campaigns).

The interface is simple: a red line marks a change associated with some configuration change. This tool helps to keep track of changes as the project grows.
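The core of such a tool, a single timeline of events that can be queried for the interval shown on a graph, might look roughly like this. It is a minimal sketch under assumptions: the `ChangeEvent` structure, the source names and the sample events are invented for illustration and say nothing about how WTF is actually implemented.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChangeEvent:
    at: datetime
    source: str       # hypothetical sources: "config", "incident", "marketing"
    description: str


def events_between(events, start, end):
    """Return events that happened within [start, end], oldest first."""
    return sorted((e for e in events if start <= e.at <= end), key=lambda e: e.at)


timeline = [
    ChangeEvent(datetime(2018, 3, 2, 14, 0), "marketing", "Promo campaign in Brazil"),
    ChangeEvent(datetime(2018, 3, 1, 9, 30), "config", "Session timeout changed"),
    ChangeEvent(datetime(2018, 2, 20, 12, 0), "incident", "Push gateway outage"),
]

# Everything that could explain a jump on the graph during the first days of March.
for event in events_between(timeline, datetime(2018, 3, 1), datetime(2018, 3, 3)):
    print(event.at, event.source, event.description)
```

The graphing tool would then draw a vertical marker at each returned timestamp, so that a jump in registrations can be matched against a campaign launch or a configuration change at a glance.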

Let's sum up the first part of my talk:

As the team grows, communications will break down. They will become overloaded and ineffective.
Most often this happens between departments, in our case between server and client development.
Where it breaks, we formalize the process.
New tools will be needed as the number of projects grows.

What worked for us:

Formal interaction between the product and engineering departments is implemented through the PRD.
Interaction with BI is based on the analysts' requirements.
The MAPI team takes care of the protocol between the client and server parts.
All interaction within the department is organized around components, which is a way to formalize the distribution of tasks.

The development process involves 200 people. With further growth, we may face new challenges. Then in a couple of years there will be a new talk about how we redid everything :)

Speed

We want to keep the speed of making changes to the system as the team grows. At the same time, having run into communication problems, we introduced a number of formal processes and ended up with a multi-stage scheme.

With such a process, time-to-market keeps increasing. Now we look like this.

Our system is like a big ship. It sails very fast and is well armed, and everything is great until you need to make some very small change. In order to maneuver and react to the market, we have to push a change through our entire scheme.

Then we thought: maybe we are doing everything wrong. Maybe we are growing the wrong way altogether and need to redo everything. The option of cross-functional teams comes to mind. We have been scaling the system vertically: more work means more people. And we have lost delivery speed. Perhaps it is worth switching to a scheme where the company is a large number of startups. Each startup does its part of the work itself and has efficient communications inside. Then there is no need to introduce formal processes.

The idea of transforming functional teams into cross-functional ones in order to speed things up has come up many times during our evolution. We rejected it because of several drawbacks.

Less resource flexibility. Moving people between cross-functional teams is more difficult, so the response to a change in load or process is slower.
Control over processes and technology. Suppose there are 10 teams, each with backend developers, front-end developers and analysts. The question arises: will each backend developer start writing in their own language and pull the development stack in their own direction? It also threatens to produce reinvented wheels for the same problems. This places an additional burden on managing the whole system.
It works only at a certain scale. You need a bus factor greater than one, so you cannot have a team with only one backend developer. There should be at least two of every specialist, and subjectively it seems that more people are needed to do the same number of tasks.

If we think of our system as a queuing system (where the requests are product hypotheses and changes), we can find the answer to the question about speed. The graph shows the saturation curve from queuing theory, with requests per second on the X axis. For our process, this is the number of tasks completed. The Y axis shows the processing time of each request.

From the point of view of queuing theory, the system can be optimized either by the number of tasks solved, or by the time of task processing.

The functional team is optimized for the number of tasks completed, the cross-functional team for delivery time. In a cross-functional team everything happens faster and the time-to-market is shorter, but fewer tasks get done. To complete a task faster, some resources must either be completely free, waiting for the task to arrive, or be busy with something less important that can be postponed. With functional teams, we essentially optimize the utilization of development resources. Thanks to this internal optimization, we get a large number of completed tasks.
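The trade-off between utilization and delivery time can be made concrete with the simplest textbook queuing model. The talk does not name a specific model, so the choice of M/M/1 (one server, random arrivals and service times) is an assumption made here purely for illustration; in that model the mean time a request spends in the system is 1 / (service rate - arrival rate). The rates below are made-up numbers.

```python
def mean_time_in_system(arrival_rate: float, service_rate: float) -> float:
    """Mean time a task spends in an M/M/1 queue (waiting plus processing)."""
    if arrival_rate >= service_rate:
        raise ValueError("the queue is unstable: tasks arrive faster than they are done")
    return 1.0 / (service_rate - arrival_rate)


service_rate = 10.0  # hypothetical: the team can finish 10 tasks per week
for utilization in (0.5, 0.8, 0.95):
    arrival_rate = utilization * service_rate
    weeks = mean_time_in_system(arrival_rate, service_rate)
    print(f"utilization {utilization:.0%}: about {weeks:.2f} weeks per task")
```

Under these assumptions, pushing utilization from 50% to 95% increases the number of tasks flowing through the team but makes each task take about ten times longer, which is the same shape as the saturation curve described above.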

Let's return to the problem. We still lack flexibility and speed for fast projects. We want the delivery time to be minimal and do not want to waste time on processes. We want to take the advantages of both approaches. To achieve this, we split our workflow. For the business and for some specific areas (for example, marketing), speed is important. For them, we apply the approach with cross-functional teams. And for areas where delivery speed is not so important, we apply the general scheme.

In fact, these are project teams. The product department says what is needed now, what is important for the company, where we want to add and improve. Most often these are experimental projects where nobody knows for sure whether they will take off. There is no need to invest a lot of resources in documentation or in building ideal solutions for them. Many marketing campaigns work on the principle that either we build it in 2 days and launch the campaign, or we miss the opportunity. For such areas, we form project teams that give us delivery speed. For everything else, the processes work according to the scheme I described earlier. In the cross-functional project teams, some stages can be skipped, and for some tasks the work can proceed as in a startup, on communications alone.

This system needs constant monitoring. We constantly monitor which areas are critical for us, where speed is needed at the moment, which temporary project teams to create.

About 70–80% of tasks go through the long formalized path. All tasks related to technical debt and support also go the usual way. With this scheme, we keep control over technology, documentation and the reliability of the decisions made, because the project teams are temporary: they build something, and then their members return to their departments and continue to maintain their code.

Let's summarize the second part.

Functional teams optimize for the number of tasks completed; cross-functional teams optimize for delivery speed.

Each company needs to decide which balance it wants. If you are ready to spend a little more time and resources and to solve some problems related to the technology stack, but you want more speed, then perhaps a cross-functional approach is right for you. If you are not ready and simply want to grow evolutionarily, as we do, choose a functional approach. We use both for different tasks.

Moscow — London

We have dealt with the relatively serious topics, so now let's talk about how we work across two offices: why we sit in different offices, how it happened, what problems come with it and how we solve them.

This happened historically. Our business has always been located in London, and Moscow was the technology office. Mobile development, as a strategic direction, appeared in London straight away, close to the business. It developed actively, and at some point we found that client development sat in London while the backend and the web historically sat in Moscow. We realized that, with our approach of cross-functional teams, we were losing a great deal. The distance between offices creates problems when we need a small team that includes a server-side developer, a web developer, a product manager and an analyst all at once. Then, in 2014, the great migration began. We moved the entire web development team to London, along with half of the server-side team. Now all client teams, including the web, work in London, and the server side is split 50/50. This decision may seem strange, but overall, in terms of the process, we won: now in London, next to the client teams and the product managers, there is a part of the backend that can take part in fast projects, help out and work faster. From the point of view of our department we lost a little, because the team is split between two offices.

As a bonus, we got other benefits.

Better recruitment opportunities: we can hire in both Europe and Russia. We can take on people who want to change jobs not because they dislike something where they are, but simply because they want to relocate; there are such people too.

Improved communication from the business to the engineers. Part of our team now sits in London and quickly learns what is happening, what projects are coming and what results there already are. I call it “being in the center”.

Fault tolerance. It is a small thing, but a nice one: before, when the whole Moscow office went off to dig potatoes over the May holidays, work in London came to a halt. With the 50/50 split, we are resilient to the 10-day New Year holidays and similar situations.

One of the things we struggle with is the time difference. More precisely, it is hard to fight, but it exists. It is not that big, only 3 hours. Probably some of you work in teams where the difference is 12 hours. But 3 hours is an unpleasant difference when you work together. In London a person arrives, spends the first hour getting into the work, reaches cruising speed and is about to do something with Moscow, but Moscow goes to lunch. Moscow comes back from lunch, and London has left for lunch. A shift of half a working day is very unpleasant. Therefore, the Moscow office starts at 11:00 and the London office at 9:00. With this shift we partially offset the difference and work in almost the same mode.
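The effect of the shifted start times is easy to check with a little arithmetic. The sketch below assumes a 9-hour office day (including lunch), which the talk does not state; only the start times and the 3-hour difference come from the text, and `overlap_hours` is a hypothetical helper.

```python
def overlap_hours(moscow_start: int, london_start: int,
                  day_length: int = 9, offset: int = 3) -> int:
    """Hours per day when both offices are at work.

    Start times are local clock hours; `offset` is Moscow time minus London time.
    `day_length` is an assumed length of the office day, not a figure from the talk.
    """
    moscow_start_in_london = moscow_start - offset
    start = max(moscow_start_in_london, london_start)
    end = min(moscow_start_in_london + day_length, london_start + day_length)
    return max(0, end - start)


print(overlap_hours(9, 9))   # both offices start at 9:00 local time -> 6
print(overlap_hours(11, 9))  # Moscow starts at 11:00, London at 9:00 -> 8
```

Under these assumptions the later start in Moscow brings the shared working time from six hours to eight, leaving only one hour of the day without overlap at each end.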

An advantage that comes from the time difference is that releases happen twice a day, one in the morning and one in the evening. The morning release is done by the Moscow team, because by that time they are already in full swing, while in London people have only just arrived, some may not be in yet, and some may have other problems. The evening release is done by London: if something goes wrong, it will be during their working hours and no one will have to stay late.

The next problem we struggle with I call the “lack of warmth”. Because the team is split, the manager-subordinate relationship happens remotely. A manager's role toward a subordinate can be very roughly divided into two:

Sensei: a long-term role, to ensure growth, set strategic goals and so on.
Mentor: an operational, short-term role, in essence “I will teach you to fight right now”: how to act properly to solve current problems.

The strategic role can be performed remotely through regular visits and conversations; since it is strategic, there is no need for constant intervention in the work. The mentor role, operational management, is quite difficult to perform remotely. So we have adopted a rule that technical mentoring of all new employees happens locally. There is a person in the team who takes on the duties of a lead and mentors the newcomer on issues as they arise. At the same time, the manager can still do the work related to the person's strategic growth.

None of this cancels the fact that you simply need to meet more often. Here everything is standard for us:

We go on business trips to each other's offices and meet to work on projects.
Once a quarter there is a performance review. All managers who have subordinates in the other office must come and talk one-on-one. After all, a conversation over videoconference is not at all the same as talking to a person face to face.
New employees must visit the other office to get acquainted and see how the other team works; an engineer meets the product manager, or vice versa.
We hold group meetings. The department is divided into groups, and each group meets once a quarter, in different cities in turn. For example, the whole group first gathers in Moscow, and the employees do something together, interact and go through a kind of team building.
Once a year we try to hold a general gathering of the department in an informal setting. Usually this is a weekend during which you can do something useful and discuss problems, but also just talk about life. It helps to feel that we are working on a common cause.

We also have an event for the whole company, called “All Hands”. Once a quarter, all employees of the company gather, and someone talks about what we have achieved lately. To reduce the distance between offices, this meeting alternates between Moscow and London. In one quarter, everyone who is due to speak comes to Moscow, and London joins by videoconference. In the next quarter it is the other way round: the talks are in London, and Moscow joins by videoconference.

This is how we live in Badoo.

If you are looking for a fresh portion of other people's organizational experience, you are invited to Saint TeamLead Conf. The program includes talks from very well-known companies: Sberbank-Technologies, Avito, JetBrains, Spotify ... yes, they are all cool!

This talk is one of dozens; if you don't want to wait until we get around to transcribing and publishing them on Habr, watch the conference playlist on our YouTube channel.

To make sure you don't miss anything, subscribe to our dedicated mailing list. We try to save your time and publish only useful news: reviews of transcripts, fresh videos and selected talks from upcoming conferences.
