Dear client, here’s why that change took so long.

Original author: Al Tenhundfeld
  • Translation
Changes in complex software systems seem to take forever, right? Even engineers who understand a system’s complexity often feel that changes take longer than they should!

For clients, the situation is even harder to understand. The problem is compounded by accidental complexity that accumulates over time due to poor maintenance. It can feel like we are bailing water out of a ship with a thousand holes.

So, sooner or later, the client will write: “Why on earth does this take so long?” Let’s not forget that, as software engineers, we have a window into this world that they often lack. They trust us a great deal, but sometimes a seemingly trivial change really does take a lot of time, and questions naturally arise.

Don’t take offense at the question; treat it as an opportunity to show empathy and give the person a clearer picture of the system’s complexity. At the same time, you can suggest ways to improve the situation. When someone is frustrated is the best moment to offer a solution!

Below is a letter that, in one form or another, we have sent many times over the years. We hope it helps you answer these questions.

Letter


Dear Client,

I saw your comment on the “Notify before task due date” card and would be glad to discuss it at our next meeting. For reference, I’ll summarize my thoughts here; no reply is necessary.

To paraphrase your note:

Changing the email notification to go out one day before a task is due should be a one-line change. How can it take 4–8 hours? What am I missing?

On the one hand, I agree with you. Just change the query from tasks due <= today to tasks due <= tomorrow.
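To make that “one line” concrete, here is a minimal Python sketch of what such a query tweak might look like. The task structure, field names, and the tasks_due_soon helper are all hypothetical, invented for illustration; the real change would live in whatever query the notification process actually runs.

```python
from datetime import date, timedelta

def tasks_due_soon(tasks, today=None):
    """Hypothetical query behind the notification email.

    Each task is a dict with a "due" date. The change under discussion
    amounts to moving the cutoff from today to tomorrow.
    """
    today = today or date.today()
    cutoff = today + timedelta(days=1)  # previously: cutoff = today
    return [t for t in tasks if t["due"] <= cutoff]
```

The diff really is one line; everything else in this letter is about being sure that one line is the right one.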

On the other hand, by reducing it to such a simplified idea, we overlook inherent complexity and implicitly make a number of engineering decisions, some of which we should discuss.

Part 1. Why is this small change more than it seems?


It is a simple, small change, a single line of code. Spending a whole day on it, even half a day, seems excessive.

Of course, you can’t just roll a change out to production without at least running it locally or on a test server. You have to make sure the code executes correctly, and if a query changes, you need to compare the output and make sure it looks more or less right.

Here the output comparison can be minimal, just a quick spot check: confirm the results look sensible, and so on. This is a notification for internal employees; if the date math is slightly off, we will hear about it from the teams quickly. If it were, say, an email to your customers, deeper verification would be required. But for this light testing and review, 20–40 minutes are enough, depending on whether anything strange or unexpected turns up; digging into data can eat up time. Releasing a change without any review at all would simply be professional negligence.

Then we add time for the normal logistics of committing the code, merging changes, deploying, and so on: from the start of work to release in production, at least an hour passes even for a competent, professional engineer.

Of course, this assumes you know exactly which line of code to change. The task workflow mostly lives in the old system, but some parts of the logic live in the new system. Moving logic out of the old system is a good thing, but it means task functionality is currently split across two systems.

Since we have worked together for so long, our team knows which process sends the overdue-task email and can point to the line of code in the new system that initiates it. So we don’t have to spend time figuring that out.

But if we look at the task code in the old system, there are at least four different ways of determining whether a task is due. On top of that, looking at the email templates and behavior, there are at least two more places that appear to implement custom logic for these tasks.

And the notification logic itself is more complicated than you might think. It distinguishes between shared and individual tasks, public and private ones, recurring tasks, escalation to a manager when a task is overdue, and so on. But we can quickly establish that only 2 of the 6+ definitions of an overdue task are actually used for notifications, and only one of them needs to change to achieve the goal.
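To illustrate how a code base ends up with several competing definitions of “due,” here is a hypothetical Python sketch; the function names and the one-day grace period are invented for illustration, not taken from the actual system.

```python
from datetime import date, timedelta

# Three of the (hypothetically) six-plus definitions of "due" that can
# accumulate in a code base over time. Only the last one feeds the email.

def is_due_strict(task, today):
    """Due today or earlier."""
    return task["due"] <= today

def is_overdue_with_grace(task, today):
    """Counted as overdue only after a one-day grace period."""
    return task["due"] < today - timedelta(days=1)

def is_due_for_notification(task, today):
    """The definition the notification email actually uses."""
    return task["due"] <= today
```

Changing only is_due_for_notification achieves the goal; touching any of the others would silently alter unrelated behavior.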

Such a review can easily take another half hour or so, maybe less if we have recently worked in this part of the code base. Moreover, this hidden complexity means we may overrun our manual-testing estimate; let’s add 30 minutes for the extra effort.

So we are at 1.5 hours before we can feel confident the change will behave as expected.

Of course, we have not yet checked whether any other processes use the query we are changing. We do not want to accidentally break other features by redefining “due” as the day before the last day a task can be completed. We need to review the code base with this in mind. In this case there appear to be no major dependencies, probably because the bulk of the user interface still lives in the old system, so there is no need to modify or test other processes. Best case, that is another 15–30 minutes.

Oh, and since the main task user interface is still in the old system, we really should do a quick review of the task functionality there and make sure the behavior stays consistent. For example, if that user interface highlights tasks that are due, we might change its logic to match the notification, or at least go back and ask the client how they want it handled. I haven’t looked at the old system’s task functionality recently, and I don’t remember whether it has any notion of due or overdue. This review adds another 15–30 minutes, perhaps more if the old system also has several definitions of a “task,” etc.

So now we are in the 2–2.5 hour range to complete the task with confidence that everything will go smoothly, with no unintended side effects and no confusion for users.

Part 2. How can we reduce this time?


Unfortunately, the only result of all this effort is the completed task itself, which is far from optimal and rather disappointing. The knowledge the developer acquires along the way is personal and ephemeral. If another developer (or we ourselves, six months later) needs to change this part of the code again, the whole process will have to be repeated.

There are two main tactics to remedy the situation:

  1. Actively clean up the code base to reduce duplication and complexity.
  2. Write automated tests.

A note: we have discussed documentation before, but in this case it is not the best solution. Documentation is useful for high-level ideas, like explaining business logic, or for frequently repeated processes, like onboarding new partners. But when it comes to code, documentation quickly becomes unwieldy and goes stale as the code changes.

You’ll notice that neither of these tactics is included in our 2–2.5 hours.

For example, maintaining a clean code base means that instead of simply completing the task, we ask questions:

  • Why are there so many different ways to determine that a task is due or overdue?
  • Are they all needed, and do they all still work?
  • Can these approaches be reduced to one or two concepts/methods?
  • If a concept is split between the old and new systems, can it be consolidated?

Etc.

The answers to these questions can come quite quickly: for example, if we stumble on clearly dead code. Or they may take several hours: for example, if tasks are used in many complex processes. Once we have the answers, the refactoring itself takes even more time: reducing duplication and confusion to arrive at a single definition of the “due” concept, or renaming concepts in the code so that it is clear how they differ and why.

But in the end, this part of the code base becomes much simpler, easier to read, and easier to modify.
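As a sketch of what that consolidation might produce, the helpers below collapse the many notions of “due” into one shared definition. The names, the dict-based task shape, and the notice_days parameter are all hypothetical, chosen only to illustrate the idea.

```python
from datetime import date  # tasks carry datetime.date values in "due"

def days_until_due(task, today):
    """The single, shared definition of distance to the due date.

    Negative means overdue; every other due-date question is answered
    in terms of this one number.
    """
    return (task["due"] - today).days

def is_overdue(task, today):
    return days_until_due(task, today) < 0

def needs_notification(task, today, notice_days=1):
    """Notify when the task is due within notice_days, or already overdue."""
    return days_until_due(task, today) <= notice_days
```

With one definition at the core, the original request becomes a change to a single, obvious parameter rather than an archaeology project.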

The other tactic we usually use is automated testing. In a sense, automated tests are like documentation that cannot go stale and is easier to discover. Instead of manually running the code and inspecting the output, we write test code that runs the query and verifies the output programmatically. Any developer can run that test code to understand how the system should work and to confirm it still works that way.

If the system has decent test coverage, such changes take significantly less time. You change the logic, then run the full test suite and confirm that

  1. the change works correctly;
  2. the change did not break anything else (often even more valuable information than the first point).
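As a sketch of what such tests might look like, here is a hypothetical needs_notification helper behind the email, with plain pytest-style tests around it. The name, signature, and expected behavior are illustrative, not the system’s actual code.

```python
from datetime import date, timedelta

def needs_notification(task, today, notice_days=1):
    """Hypothetical function under test: email about tasks due within notice_days."""
    return task["due"] <= today + timedelta(days=notice_days)

# Each test function encodes one expectation, so any developer can run the
# suite and see how the system is supposed to behave.
def test_notifies_one_day_before_due():
    assert needs_notification({"due": date(2024, 1, 2)}, today=date(2024, 1, 1))

def test_stays_quiet_two_days_before_due():
    assert not needs_notification({"due": date(2024, 1, 3)}, today=date(2024, 1, 1))

def test_still_notifies_for_overdue_tasks():
    assert needs_notification({"due": date(2023, 12, 25)}, today=date(2024, 1, 1))
```

Running a suite like this after the one-line change answers both questions above in seconds, instead of the 1.5+ hours of manual checking described in Part 1.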

When we build systems from scratch at Simple Thread, we always include time for writing automated tests in our estimates. This can slow initial development, but it greatly improves maintenance efficiency. Only as a system grows do you really appreciate the importance of tests, and by that point it can be very hard to retrofit them. Tests also make onboarding new developers much easier, and they make changing the system’s behavior much faster and safer.

Part 3. Where are we coming from? Where are we going?


Today we rarely include time for cleaning up code or writing tests in your estimates. That is partly because writing tests from day one is a modest overhead, while adding tests to an existing code base after the fact is a huge amount of work, like replacing the foundation under a house people are living in.

It is also partly because, when we started working with you, we went straight into triage mode. We had near-daily problems syncing third-party data, weekly report-generation failures, constant support requests for small data changes, inadequate monitoring and logging, and so on. The code base was drowning under the weight of technical debt, and we were scrambling to keep the systems afloat, patching holes with duct tape.

Over time the systems have become more stable and reliable, and we have automated frequent support requests or provided self-service UI for them. We still carry a lot of technical debt, but we are out of emergency mode. Yet I don’t think we have ever fully moved from that triage mentality to a more proactive, mature “plan and execute” mentality.

We try to clean up code as we go, and we always test thoroughly. But being careful and diligent is not the same as proactively refactoring or building the infrastructure needed for good automated tests.

If we don’t start paying down some of this technical debt, we will never significantly improve the situation. Even highly skilled, competent developers will keep needing months to get oriented and make non-trivial changes.

In other words, the 4–8 hours for this task includes roughly a 2–4x margin, but spending it will significantly reduce the effort such changes take in the future. If this part of the code base were cleaner and had good automated test coverage, a competent, experienced developer could make the change in an hour or less. And, crucially, a new developer would take only slightly longer.

For estimates like this, we need your buy-in. This is a deliberate attempt to fundamentally improve your system’s health, not just how users perceive it. I understand it is hard to agree to such an investment precisely because the benefit isn’t visible, but we would be happy to sit down with you and put together some concrete figures showing how, from an engineering point of view, this investment will pay off in the long term.

Thanks
Al
