PatientZero April 12, 2018 at 19:10

Riot Games: Anatomy of Technical Debt

Original author: Bill “LtRandolph” Clark

Translation

Hi, my name is Bill "LtRandolph" Clark. I work as a technical manager for the LoL Champions team. Over the past few years I have worked in several different departments of League development, but the one thing I have been constantly obsessed with is technical debt. I want to find it, understand it and, where possible, eliminate it.

When developers discuss any existing technology, for example patch 8.4 of League of Legends, technical debt often comes up. I define technical debt as code or data that future developers will have to pay a price for. Countless posts, articles, and definitions have been devoted to this sad side of software development. In this post I want to discuss the types of technical debt I have encountered while working at Riot and describe the model we have started using in the company. If I were asked to name the most important lesson of this article, it would be the "infection" metric described below.

Metrics

To make the right decisions about which problems to fix now and which to postpone (or, let's be realistic, forget about entirely), we need a way to measure each specific piece of technical debt. I chose three main axes for evaluation: impact, cost of elimination, and infection.

Impact

The first axis is the most obvious: the impact of the debt. It takes the form of problems faced by players (bugs, missing features, unexpected behavior) and by developers (slower implementation, workflow problems, arbitrary useless trivia you have to remember). Note that "developer" here means any creator of the game working on any aspect of it. Some technical debt hits programmers writing new code, some prevents designers from creating new scripts, some keeps effects artists from creating new particle systems, and so on.

Elimination Costs

The second axis is the cost of getting rid of the technical debt. If we decide to fix a problem in the code or data, it will take some measurable amount of time. If it is a deep-seated assumption that affects every line of code in the game, the fix can take weeks or months of development time. If it is a silly mistake in a single function, it can be fixed in minutes. Regardless of how long the fix takes to implement, we also need to account for the risk of shipping it. Even a system that I consider "bad" may be used as a tool for creating a wonderful game. If I change how the scripting engine handles errors, or when particles compute their creation time, I could break the behavior of more than 500 spells on more than 140 champions.

Infection

The third axis measures what I am obsessed with: infection. If the technical debt is allowed to persist, how far will it spread? It can spread because other systems interact with the debt-ridden system, because data built on top of the system gets copy-pasted, or because it influences how other developers implement new features.

If a piece of technical debt is well contained, the cost of eliminating it later is almost the same as eliminating it now. When considering whether a fix is warranted, we can simply weigh its impact today. If, on the other hand, a piece of debt is highly contagious, it will gradually become harder and harder to eliminate. What is particularly nasty about contagious technical debt is that its impact tends to grow as more and more systems are infected by the technical compromises at its core.

Types of debt

Now that we have a system for measuring each specific piece of technical debt, let's discuss some general categories of technical debt that I have noticed in League of Legends.

Local debt

Local debt resembles the classic black-box model of programming. As far as the rest of the game is concerned, the local system (a spell, the network layer, the script engine) looks fairly reliable. Nobody needs to keep the debt in mind while developing things that do not touch the system. But anyone who opens the lid and looks inside will be appalled at what they see.

A couple of real-world examples of local debt can be found in our own eyes. Because of the way the eye is built, we see everything upside down. More importantly, the optic nerve creates a blind spot in the middle of each eye. This distorted data is passed to the visual centers of the brain, which must flip the image and fill in the blind spots so that the rest of the brain can work with a "correct" image. These quirks are confined to the eye and optic nerve system, and other systems easily avoid them, so they are "good enough".

One of the best-known examples of local debt in League of Legends is Jarvan's Cataclysm, which is still made of minions. When designers need to attach gameplay effects to a point (or several points), one of the tools available to them is creating an "invisible minion". RiotXypherous describes what I mean by "minion" here. These game objects are a stable, well-understood way to track and execute script logic. In cases like Jarvan's wall, a large number of minions (24, to be exact) must be created to make sure nobody can squeeze through the wall. An alternative would be a ring-shaped terrain construct: a single logical element that controls whether units can pass through the Cataclysm. That approach would clean up the logic and slightly reduce computational cost. Let's run the Cataclysm through our impact / cost / infection model to see why fixing it is not currently the best option.

Cataclysm Metrics

1. Impact: 1/5

Previously, when only 12 minions were created, people could sometimes squeeze through the wall, so Riot Exgeniar increased the number to 24. The fact that the wall is made of minions almost never affects other developers creating new content. (A small digression: the infamous "Jarvan ult hitch" was caused by a combination of this debt and a load-time bug triggered by an attempt to read missing auto-attack definitions.)
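A rough sketch of why doubling the minion count closes the gaps. Only the counts 12 and 24 come from the text above; the wall radius and the minion collision radius are made-up numbers chosen for illustration, and the even spacing on a circle is an assumption.

```python
import math


def gap_between_minions(count: int, wall_radius: float, minion_radius: float) -> float:
    """Free space between two neighbouring minions spaced evenly on a circle.

    A negative result means the collision circles overlap, so nothing fits through.
    """
    # Straight-line distance between neighbouring minion centers (a chord).
    chord = 2.0 * wall_radius * math.sin(math.pi / count)
    return chord - 2.0 * minion_radius


# Hypothetical values, NOT taken from the game.
WALL_RADIUS = 350.0
MINION_RADIUS = 50.0

gap_12 = gap_between_minions(12, WALL_RADIUS, MINION_RADIUS)
gap_24 = gap_between_minions(24, WALL_RADIUS, MINION_RADIUS)
```

With these assumed numbers, 12 minions leave a positive gap between neighbours, while 24 minions overlap and seal the ring.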

2. The cost of elimination: 2/5

So far, we have no way to draw shapes that create arbitrary geometry without writing new code. If we wanted a ring-shaped "area trigger" so that Jarvan's barrier worked more efficiently, we would have to write custom math for collisions with the ring. We use constructive solid geometry for other purposes, which could drastically reduce the cost of the fix.
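For a sense of what the "custom math" for a ring might involve, here is a minimal 2D sketch. It is not the engine's code: the Ring type, its fields, and the idea of testing a straight-line move against the ring are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Ring:
    """Hypothetical ring-shaped wall: everything between two radii blocks movement."""
    cx: float
    cy: float
    inner: float
    outer: float


def _distance_to_segment(px, py, ax, ay, bx, by) -> float:
    """Shortest distance from point P to the segment AB."""
    abx, aby = bx - ax, by - ay
    length_sq = abx * abx + aby * aby
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project P onto AB and clamp the projection to the segment.
    t = ((px - ax) * abx + (py - ay) * aby) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * abx), py - (ay + t * aby))


def move_hits_ring(ring: Ring, ax, ay, bx, by) -> bool:
    """True if a straight move from A to B touches the ring."""
    nearest = _distance_to_segment(ring.cx, ring.cy, ax, ay, bx, by)
    # The farthest point of a segment from the center is always an endpoint.
    farthest = max(math.hypot(ax - ring.cx, ay - ring.cy),
                   math.hypot(bx - ring.cx, by - ring.cy))
    # Distances along the move cover [nearest, farthest]; the ring covers [inner, outer].
    return nearest <= ring.outer and farthest >= ring.inner
```

The check works because the distance from the center varies continuously along the move, so the move touches the ring exactly when the two distance ranges overlap. A single test like this would stand in for 24 separate minion collisions.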

3. Infection: 1/5

Nobody has to take the implementation of Jarvan's wall into account when developing new features, so it is well contained. The only infection risk is that other designers may copy this implementation into their new champions (which happens from time to time). But however far the implementation spreads, the potential reach of the Cataclysm is low and well understood.

This is a fairly typical piece of local debt. Local debt is usually characterized by a low infection rate. If the impact is higher than the cost of elimination, a conscientious developer usually eliminates the debt before it is too late.

When deciding whether to eliminate local debt, first ask yourself: is it worth it? If the debt is truly not contagious, it is safe to leave it alone for as long as necessary. One of the biggest mistakes I see is an instinctive urge to stamp out local debt, driven by a developer's perfectionism, when the impact of the debt does not justify the effort. If you do decide to fix it, the locality of the change usually makes both the fix and the regression testing easy.

Recent examples of eliminated local debt include bugs that forced champions to path to the coordinate 0,0,0, Janna's Monsoon ignoring spell shields, and Tear of the Goddess stacking without costing mana.

MacGyver debt

MacGyver debt is named after the mid-80s television series. Angus MacGyver solved problems with his Swiss army knife, duct tape, and whatever was at hand.

His solutions often combined two unexpected parts; in the context of technical debt, this means two conflicting systems duct-taped together at the points where they interact in the code base.

Seattle (like many other cities) offers a sad example of MacGyver debt. There were two competing settlements, each with its own street grid. As these settlements grew into the modern Emerald City, the slightly different grids were merged, which led to awkwardly shaped blocks and buildings and a thoroughly inefficient use of space. I am particularly struck by the small clipped corner of the building in the lower left corner.

One of the best examples of MacGyver debt in the LoL code base is the use of C++'s std::string alongside our own AString class. Both are ways of storing, modifying, and passing character strings. In general, we found that std::string leads to many "hidden" memory allocations and computational costs, and that it is easy to write code that does bad things with it. AString was designed specifically with sound memory management in mind. Our strategy for replacing std::string with AString was to let both exist in the code base and to provide conversions between the two types (using .c_str() and .Get(), respectively). We gave AString many improvements that were easy to implement and made it more pleasant to work with, and we encouraged developers to replace std::string gradually as they changed code. Thus, we gradually replaced std::

std::string vs AString Metrics

1. Impact: 2/5

At this point, most of the high-impact std::string memory allocations have been rooted out through profiling, so the main cost now is the small mental effort of switching from one system to the other.

2. The cost of elimination: 3/5

Converting to AString is not just a find-and-replace task. AString comes in several flavors for different purposes (besides the basic AString with dynamic memory allocation, there is AStackString, which is initially placed on the stack, and ARefString, which references static strings). To get it right, a real, thinking person has to look at each replacement site. Phasing out the old system will be a long, slow process.

3. Infection: -2/5

By making AString easier to work with than std::string, we actually turned infection in our favor. Every time a developer changes the game code, there is a chance that AString will spread a little further, like a virus.

Usually, the greatest cost of MacGyver debt is the mental effort needed to switch modes of thinking when crossing the boundary between systems. If a bug or feature is stuck because it lives in the "wrong" system, the logical step is usually to move it to the "correct" one. The key metric to watch here is the contagion ratio of the new and old systems. If you can tip the balance in favor of the new system, the better system will inevitably win.
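One crude way to watch that ratio is to count mentions of each type in the source over time. The sketch below is an assumption about how such a count could be done, not a description of Riot's tooling; the two code snippets are invented, and a plain text search will also count comments and string literals.

```python
import re

# Type names come from the article; matching them by regex is an assumption.
OLD_TYPE = re.compile(r"\bstd::string\b")
NEW_TYPE = re.compile(r"\bA(?:Stack|Ref)?String\b")


def usage_counts(source: str) -> tuple[int, int]:
    """Return (old, new): mentions of std::string and of the AString family."""
    return len(OLD_TYPE.findall(source)), len(NEW_TYPE.findall(source))


def new_share(source: str) -> float:
    """Fraction of string-type mentions that already use the new system."""
    old, new = usage_counts(source)
    total = old + new
    return new / total if total else 0.0


# Invented snapshots of the same file before and after a few incremental changes.
SNAPSHOT_BEFORE = "std::string name; std::string title; AString tag;"
SNAPSHOT_AFTER = "AString name; AStackString title; AString tag; std::string legacy;"
```

If the share returned by new_share keeps rising from snapshot to snapshot, the new system is the more contagious one.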

When considering how to eliminate MacGyver debt, look for ways to make the better (global) system the preferred choice at the local level. If a time-pressed developer making greedy optimizations in day-to-day work chooses to move toward the desired end state, you are on the right track.

Another approach that can work is large-scale, brute-force refactoring. If the systems are closely related, some or all of the MacGyver debt can be eliminated with a clever regex.
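A minimal sketch of what such a regex pass could look like. The function names LegacyDistance and Distance are hypothetical and exist only for this example; a real migration would still need review of every changed site.

```python
import re

# Hypothetical rename: calls to LegacyDistance(...) become Distance(...).
# The leading \b keeps longer names such as MyLegacyDistance untouched.
LEGACY_CALL = re.compile(r"\bLegacyDistance\s*\(")


def migrate(source: str) -> str:
    """Rewrite every call to the old function so it uses the new name."""
    return LEGACY_CALL.sub("Distance(", source)


sample = "x = LegacyDistance(a, b); y = MyLegacyDistance(a);"
migrated = migrate(sample)
```

Here only the first call is rewritten; the second is left alone because the match must start on a word boundary.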

Fundamental debt

Fundamental debt is when an assumption lies deep in the heart of a system and is inextricably tied to everything it does. Experienced users of the system sometimes find fundamental debt hard to recognize, because it seems "natural".

A ridiculously silly real-world example of fundamental debt is the American system of measurement. I grew up in the USA, and my brain is full of useless conversions: I remember, for example, that there are 5,280 feet in a mile, 2 pints in a quart, and 4 quarts in a gallon. The US government has decided to switch to the metric system many times, but we remain one of seven countries that have not adopted the International System of Units as the official system of measurement. This debt lives in road signs, recipes, elementary schools, and people's brains.

We have covered some of the biggest pieces of fundamental debt that Riot struggles with in earlier articles on our tech blog, for example Determinism in League of Legends and Game Data Server.

Another example of fundamental debt that I think about a lot is the use of the Lua scripting language. League designers use a tool called BlockBuilder to create complex behaviors by connecting functional blocks: getting the distance between points, spawning minions, dealing damage, or controlling the flow of script execution. The set of operations designers can choose from is fairly large but finite, and the parameters of each operation are minimal. Yet many years ago, in the prehistoric era of League of Legends, the decision was made not to store the blocks and their parameters in a simple, restricted format that fits the data. Instead, they are stored as arrays and tables in Lua, a language that is powerful, beautiful, and far too complex for this purpose. A decade has passed since that decision, and today manipulating Lua objects is one of the most frequent operations in the engine.
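To make "a simple, restricted format that fits the data" concrete, here is one possible shape for it. This is purely a sketch: the Block type, its fields, the JSON encoding, and the block and parameter names in the example are assumptions, not the format BlockBuilder uses or plans to use.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Block:
    """Hypothetical plain-data block: an operation name, its parameters, nested blocks."""
    op: str
    params: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def to_json(block: Block) -> str:
    """Serialize a block tree to text that is easy to diff and search."""
    return json.dumps(asdict(block), sort_keys=True, indent=2)


def from_dict(raw: dict) -> Block:
    """Rebuild a block tree from parsed JSON."""
    return Block(op=raw["op"],
                 params=raw["params"],
                 children=[from_dict(child) for child in raw["children"]])


def from_json(text: str) -> Block:
    return from_dict(json.loads(text))


# Invented example script: a condition wrapping a single damage block.
spell = Block("If", {"Condition": "TargetIsEnemy"},
              [Block("ApplyDamage", {"Damage": 20})])
round_trip = from_json(to_json(spell))
```

A format like this holds only operation names and parameters, so reading a diff or searching for a block requires no knowledge of a general-purpose language.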

BlockBuilder Lua Metrics

1. Impact: 4/5

The mismatch between Lua and this problem space is costly. Every call stack is littered with an average of six marshalling stack frames for each frame of BlockBuilder logic. These marshalling operations are not cheap in terms of server CPU usage. Reading diffs of script changes is needlessly hard. Parsing or searching script files to determine what they do requires a fairly deep understanding of Lua.

2. The cost of elimination: 4/5

Because Lua is so deeply embedded in the engine, digging it out would be a daunting task. There is currently a proposal to create a wrapper class that behaves like Lua objects but has a much simpler internal structure, so that we can gradually turn the script internals into something more suitable. However we approach this task, we need to be careful and thoughtful.

3. Infection: 4/5

Every time a system touches scripting (which is the basic unit of LoL logic), it is shaped by the operations and requirements of the Lua backend. On average, we create a new building block every 3-4 days, and each of them directly manipulates Lua objects. The longer we go without replacing Lua, the harder it becomes to replace.

Fundamental debt is typically high on all three axes. The high cost pushes you to stick with the outdated system, which is often the right decision, but high impact and high infection mean that fixing egregious fundamental debt pays off many times over.

At Riot, the most common strategy for eliminating fundamental debt is to build a new system alongside the old one. Where possible, I recommend converting fundamental debt into MacGyver debt: port systems to the new system gradually, with conversion operations available between new and old. This lets you start reaping benefits in targeted areas while limiting risk. Sometimes, however, such a conversion is not possible. In that case, a switch at compile time (or, if possible, at load time) lets you gain confidence in the new system without betting everything on it. The compile-time approach was used for the GDS conversion, and the load-time approach worked for determinism.

Data debt

Data debt starts with a piece of technical debt from one of the other categories. It might be a bug in the script system, a less-than-ideal file format for objects, or two systems that do not interact well. But then a pile of content (art, scripts, sounds, and so on) is built on top of that flaw in the code. Before long, fixing the original technical debt becomes incredibly risky, and it is very hard to say what might break when you try.

My favorite real-world example for understanding data debt is DNA. An organism's genome has grown slowly over millions of years through lossy copying (mutations), transcription errors, and the pressure of evolution. Some copying errors are useless but harmless, some are harmful, and others confer huge advantages. Working out what each DNA fragment actually does is incredibly difficult. We fully understand what base pairs mean and how sets of base pairs are translated into amino acids to build proteins. We are even beginning to understand some of the roles DNA can play beyond coding. But among the three billion base pairs of the human genome there is still far too much that we do not even remotely understand. The Radiolab episode on CRISPR describes how one of these puzzles was solved.

Data debt in League of Legends has the greatest impact when it turns a trivial fix into an exhausting slog. I will describe only one small example, but take my word for it: data debt is one of the most important factors in making changes to the LoL engine. Our gameplay developers have deep knowledge of how the game systems are implemented, and the skill to predict which data may break when a piece of code changes.

A memorable example of data debt, fixed several years ago, involved block parameters in our BlockBuilder scripting language. The image above shows an example in which I increase Owner's armor by a variable plus a constant. I expect Owner to receive 25 bonus armor: 20 from the Delta variable passed to the block and 5 from the constant. However, because the variable name matches the parameter name, the block added 40. (Don't even ask why not 45; I have no idea what thought process led to this.)

When NoopMoney, a developer on the Champions team, set out to fix this ridiculous behavior, all he had to do was remove four lines of code. But with debt this contagious, even small changes require careful planning. The bug could double any numeric parameter in LoL's 400,000 lines of script. Worse, those scripts "behaved well" in the sense that the game was balanced and tuned around these possibly doubled values. NoopMoney had to make it possible to disable the fix at runtime (in case of unexpected bugs), run detailed regex searches, and enlist QA to determine which scripts worked correctly only because of the bug. In the end, the fallout from the fix was fairly minor: the scripts of a small group of champions had to change. But because of the data debt, that was hard to predict.

Parameter naming bug metrics

1. Impact: 2/5

This bug had little effect on the game itself. It doubled the passed-in value and could drop a constant. But it became yet another piece of useless tribal knowledge that designers and developers had to keep in mind (once they had learned about it). Developer attention is too valuable a resource to squander this way.

2. The cost of elimination: 2/5

As I said, the fix itself was simple. Building a way to roll the fix back at runtime increased our confidence in its safety. The most expensive part was the initial analysis to assess the extent of the problem for targeted testing.

3. Infection: 4/5

The unfortunate thing about this bug was that it was triggered by perfectly reasonable behavior. For example, if you want to damage a unit, it is entirely logical to store the value in a variable called Damage. Alas, the ApplyDamage block that received the value had a parameter with the same name, which triggered the bug. Then, when someone else wanted to create a similar spell, they simply copied those blocks, spreading the bug further.

The cost of fixing data debt is typically high because the effect of a change is hard to measure. More dangerous still, data debt is almost always extremely contagious because of a few properties of data (as opposed to code). First, it is generally considered acceptable to create a new data element by copying an existing one. If you are making a new skill-shot spell, you can save a lot of time by starting from Ezreal's Mystic Shot. Every problem in the existing data element is passed on to its descendants. Second, unlike code, data is rarely subject to technical review, so it is hard to notice and stop the spread of bad practices even when they are well known. Finally, fixing errors in data usually takes a person with eyes and a brain; the compiler and formal logic cannot handle them.

I have seen two main approaches to eliminating data debt. The first I call the "do it right" flag. It gives data creators a way to switch from the old "broken" behavior to the new "fixed" behavior. Ideally, once the old content that relies on the broken version has been identified, the fixed version should become the default. Then, as with MacGyver debt, you can begin a slow, gradual migration to the new version. The ongoing cost is that more and more clutter gets added to the editor's UI.

The second approach I call "just fix the bug". NoopMoney used it when fixing the parameter-name bug. It involves fixing the error and repairing all the data it affected. A few techniques make this task less daunting. First, run plenty of grep and regex searches to estimate the theoretical impact of the bug. Second, conduct targeted testing. Finally, prepare a switch that restores the old behavior after the fix ships, in case you have missed something worse than the bug being fixed. It is also worth noting that determinism helped us a great deal in testing these kinds of changes: it let us verify that the server produced the same results before and after the change.
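Finding the affected scripts amounts to looking for places where a variable is passed to a parameter of the same name. The sketch below assumes a made-up in-memory representation of script blocks (a dict with "op" and "args"); the real scripts were Lua tables, and IncStat is an invented block name.

```python
def find_name_collisions(blocks):
    """Return (index, op, parameter) for every argument whose variable
    name is identical to the parameter it is bound to."""
    hits = []
    for index, block in enumerate(blocks):
        for param, value in block["args"].items():
            # In this toy format a string value is a variable name;
            # numbers are literal constants and cannot collide.
            if isinstance(value, str) and value == param:
                hits.append((index, block["op"], param))
    return hits


# Invented script: two blocks pass a variable named like the parameter.
script = [
    {"op": "IncStat", "args": {"Delta": "Delta"}},
    {"op": "ApplyDamage", "args": {"Damage": "SpellDamage"}},
    {"op": "ApplyDamage", "args": {"Damage": "Damage", "Bonus": 5}},
]
collisions = find_name_collisions(script)
```

A list like collisions is what tells QA where to aim targeted testing before the fix goes in.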
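The switch and the before/after comparison can be sketched together. Everything here is illustrative: add_armor is a stand-in whose legacy branch only mimics the symptom described above (the passed value doubled, the constant dropped), and the replay is a plain list of inputs rather than a recorded game.

```python
def add_armor(variable: int, constant: int, fix_enabled: bool) -> int:
    """Stand-in for a script block with a runtime switch for the fix."""
    if fix_enabled:
        return variable + constant
    # Legacy behaviour (assumed model of the bug): double the variable, drop the constant.
    return variable * 2


def replay(cases, fix_enabled: bool):
    """Run the same recorded inputs with the fix switched on or off."""
    return [add_armor(variable, constant, fix_enabled) for variable, constant in cases]


def changed_cases(cases):
    """Inputs whose result differs between the old and the new behaviour."""
    cases = list(cases)
    before = replay(cases, fix_enabled=False)
    after = replay(cases, fix_enabled=True)
    return [case for case, old, new in zip(cases, before, after) if old != new]


recorded = [(20, 5), (0, 0), (5, 5)]
needs_review = changed_cases(recorded)
```

Because the runs are deterministic, any difference between the two replays is caused by the fix itself. Only (20, 5) is flagged here; (5, 5) gives the same result either way, so data that happens to line up like this needs no retuning.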

Summary

When evaluating a piece of technical debt, you can use the metrics of impact (on users and developers), cost of elimination (time and risk), and infection. I suspect most developers regularly weigh impact and cost of elimination, but I have rarely seen infection discussed. When a problem keeps getting deeper and harder to fix, infection can become a developer's most serious enemy. Sometimes, however, you can turn infection into your own weapon by making the fix more contagious than the problem.

Most of the technical debt I see while working on League falls into one of these four categories. Local debt is like a black box with ugly contents. MacGyver debt is two or more systems duct-taped together with conversion functions. With fundamental debt, the whole structure is built on some unfortunate assumptions. With data debt, huge amounts of data are piled on top of some other type of debt, which makes fixing it risky and time-consuming.

I hope this post gives you useful food for thought and for discussion of technical debt.
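The four scored examples from this article can be put side by side in a few lines. The scores are the ones given above; the DebtItem type and the choice to sort by infection first are assumptions made for illustration, not a rule from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebtItem:
    name: str
    kind: str
    impact: int
    elimination_cost: int
    infection: int


ITEMS = [
    DebtItem("Jarvan Cataclysm minions", "local", 1, 2, 1),
    DebtItem("std::string vs AString", "MacGyver", 2, 3, -2),
    DebtItem("BlockBuilder Lua", "fundamental", 4, 4, 4),
    DebtItem("Parameter naming bug", "data", 2, 2, 4),
]


def by_infection(items):
    """Most contagious first; ties broken by higher impact (an assumed ordering)."""
    return sorted(items, key=lambda item: (-item.infection, -item.impact))


ranked_names = [item.name for item in by_infection(ITEMS)]
```

Sorted this way, the two highly contagious items rise to the top, and the std::string migration, where infection works in our favor, falls to the bottom.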
