#NoDeployFriday: help or harm?
Is it necessary to prohibit deployments to production at certain times? Or has the #NoDeployFriday movement become a relic of an era before comprehensive integration tests and continuous deployment?
You may face the same dilemma on your own team. Who is right and who is wrong? Is refusing to deploy on Fridays a sensible strategy for reducing risk, or a harmful culture that keeps us from building better, more stable systems?
I'm sure that engineers who have had the misfortune of being on call have lost days off to Friday changes that broke. I've been in that situation too. A phone call while you're out with your family, or in the middle of the night, telling you the application has crashed. You get to a computer, check the rapidly growing logs, and it becomes obvious that a rare unhandled exception ruined everything. Disgusting.
The analysis reveals that no tests were written for the scenario that caused the failure, apparently because no one considered it likely. After a series of long phone calls with other engineers, searching for the best way to roll back the changes and patch everything up, the system starts working again. Phew.
A five-whys meeting is held on Monday.
"Let's just stop deploying on Fridays. Then everything will run stably over the weekend, and next week we'll be on alert after all the releases."
Everyone nods. If something isn't in production by noon on Thursday, it waits until Monday morning. Does this approach help or harm?
As you know, Twitter statements are often very subjective. Although a ban on Friday releases seems reasonable, someone will quickly point out that it is just a crutch for a fragile platform, a fragility caused by poor testing and deployment processes.
Some even suggest that you simply like quiet deployments more than the weekend itself:
Other users believe that feature flags may be a possible solution.
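The feature-flag idea is worth making concrete. Here is a minimal sketch of the technique: a risky new code path ships "dark" behind a flag and can be switched on or off at runtime without a redeploy. The flag store here is an in-memory dict, and the checkout functions are hypothetical placeholders; real systems would back the flags with a service such as LaunchDarkly or Unleash.

```python
# Minimal feature-flag sketch. The flag store is an in-memory dict for
# illustration; production systems would read flags from a config
# service so they can be flipped without redeploying.

FLAGS = {"new_checkout_flow": False}  # default: the risky path is off

def is_enabled(flag_name):
    """Unknown flags default to off, so new code ships dark."""
    return FLAGS.get(flag_name, False)

def legacy_checkout(cart):
    return {"total": sum(cart), "flow": "legacy"}

def new_checkout(cart):
    return {"total": sum(cart), "flow": "new"}

def checkout(cart):
    # The Friday-risky change lives behind the flag; the proven
    # path stays the default until someone flips the switch.
    if is_enabled("new_checkout_flow"):
        return new_checkout(cart)
    return legacy_checkout(cart)
```

With this in place, a bad Friday release becomes a flag flip rather than a rollback and redeploy.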
This user believes that, with the processes and tools available to us today, risky deployments should not be a problem at all.
Who makes the decisions?
All this exchange of opinions shows that we, as a community of engineers, can disagree strongly with one another. Who would have thought. It probably also shows that the full picture around #NoDeployFriday contains nuances that Twitter does not capture well. Is it true that we all must practice continuous deployment, or else we're "doing it wrong"?
There is a psychological aspect to this decision. Hostility to Friday releases comes from the fear that mistakes made during the week (through fatigue or haste) will cause harm while most employees are resting for two days. A Friday commit containing a latent problem can ruin the weekend for a lot of people: the on-call engineers, the other engineers who help resolve the problem remotely, and possibly the infrastructure specialists who have to recover damaged data. If the failure turns out to be serious, other employees get dragged in too, contacting customers and minimizing the damage.
Taking an idealist's position, we could assume that in a perfect world with perfect code, perfect test coverage, and perfect QA, no change could ever cause a problem. But we are human, and humans make mistakes. There will always be strange edge cases that development doesn't cover. That's life. So the #NoDeployFriday movement makes sense, at least in theory. But it is a blunt instrument. I believe changes should be evaluated case by case, with the default assumption that we deploy on any day, even Friday, while being able to isolate the changes that should wait until Monday.
There are several questions worth discussing here. I've grouped them into categories:
- Understanding the "blast radius" of a change.
- How solid the deployment process is.
- The ability to automatically detect errors.
- How long it takes to fix problems.
Let's take them in turn.
Understanding the "blast radius"
Whenever the Friday-release debate flares up online, an important thing always gets forgotten: the nature of the change itself. No two changes to a code base are alike. Some commits tweak the interface a bit and nothing more; others refactor hundreds of classes without affecting the program's functionality; others change database schemas and substantially alter real-time data ingestion; some restart a single instance, while others trigger a cascading restart of all kinds of services.
Looking at the code, engineers should have a good idea of the "blast radius" of the changes being made. Which parts of the code and the application will be affected? What could break if the new code fails? Is it just a button click that will throw an error, or will all new writes be lost? Does the change touch a single isolated service, or do many services and dependencies change at once?
I can't imagine anyone refusing to ship a change with a small blast radius and a simple deployment on any day of the week. Major changes, though, especially those touching storage infrastructure, should be handled more carefully, perhaps at a time when fewer users are online. Better still, run such large-scale changes in parallel with the old path, so their behavior can be tested and evaluated under real load without anyone even noticing.
Decisions here depend on the situation. Does every engineer understand the blast radius of a change in the production environment, not just in development? If not, why? Can documentation, training, and visibility into the production effects of code changes be improved?
Is the blast radius small? Ship on Friday.
Is the blast radius large? Wait until Monday.
The soundness of the deployment process
One way to reduce risk is to continuously improve the deployment process. If launching a fresh version of the application still requires a specialist who knows which script to run and which file to copy where, it's time to automate. Tooling in this area has come a long way in recent years. We often use Jenkins Pipeline and Concourse, which let you define build, test, and deployment pipelines directly in code.
Codifying the full deployment process is an interesting exercise in itself. It forces you to step back and abstract away what should happen from the moment a pull request is opened until the application is running in production. Describing every step in code, for example with the tools mentioned above, lets you generalize step definitions and reuse them across applications. You will also notice some strange or lazy decisions you once made and have been living with ever since.
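The pipeline-as-code idea can be sketched in a few lines. Each stage is a plain function, and the runner executes them in order, stopping at the first failure. Jenkins Pipeline and Concourse express the same idea in their own configuration formats; the stage names and bodies below are illustrative stand-ins.

```python
# Pipeline-as-code sketch: stages are ordinary functions returning
# True/False, and the runner halts at the first failure. Real CI tools
# add retries, artifacts, and fan-out, but the core loop is this simple.

def build():
    print("compiling...")
    return True

def run_tests():
    print("running test suite...")
    return True

def deploy():
    print("rolling out to production...")
    return True

def run_pipeline(stages):
    """Run stages in order; stop and report the first one that fails."""
    for stage in stages:
        if not stage():
            print(f"pipeline failed at stage: {stage.__name__}")
            return False
    return True

ok = run_pipeline([build, run_tests, deploy])
```

Because the steps are code, they can be reviewed, versioned, and reused across applications like any other code.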
To every engineer who read the previous two paragraphs and reacted with "Well, of course! We've been doing this for years!" I can guarantee that nine others pictured their own application infrastructure and grimaced at the amount of work needed to move their system onto a modern deployment pipeline. That means adopting modern tools that not only run continuous integration but also deliver code continuously to production, with engineers just pressing a button to ship (or even doing it automatically, if you're brave enough).
Improving the deployment pipeline requires commitment and dedicated people; it is definitely not a side project. A good approach is to dedicate a team to internal tooling. If they don't already know about the existing pain points (they probably do), gather information on the most painful parts of the release process, then prioritize and fix them together with the other teams. Slowly but surely the situation will improve: code will reach production faster and with fewer problems. More and more people will learn the better approaches and make improvements of their own. As things improve, the practices spread across teams, and the next new project gets set up properly, without copying old bad habits.
Everything from the moment a pull request is merged should be automated, so you don't even have to think about it. This not only helps isolate real problems in QA, because the only variable is the changed code; it also makes writing code far more enjoyable. Shipping becomes decentralized, which increases personal autonomy and responsibility. And that, in turn, leads to more deliberate decisions about when and how to roll out new code.
A reliable deployment pipeline? Roll out on Friday.
Copying scripts by hand? Wait until Monday.
Ability to detect errors
Shipping doesn't stop once the code is running. If something goes wrong, we need to know about it, and ideally be told about it rather than have to dig for the information ourselves. That requires automatically scanning application logs for errors, explicitly tracking key metrics (for example, messages processed per second, or the error rate), and an alerting system that notifies engineers about critical problems and negative trends in those metrics.
Production always differs from development, and engineers need to monitor how specific parts of the system behave. Every subsequent change raises questions: did it make the system faster or slower? Are there more timeouts or fewer? Are we CPU-bound or I/O-bound?
Metrics and error data should feed into the alerting system. Teams should be able to define which signals indicate trouble and have automatic notifications sent when they appear. For our teams and the most serious incidents, we use PagerDuty.
Measuring production metrics means engineers can see whether something changed after each deployment, for better or for worse. And in the worst cases, the system will automatically page someone about the problem.
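The alerting logic described above can be sketched simply: track a rolling error rate and raise an alert when it crosses a threshold. Real systems would wire this into Prometheus, Datadog, or PagerDuty; the 5% threshold and 100-request window here are illustrative assumptions, not recommendations.

```python
# Rolling error-rate monitor sketch. A fixed-size window of recent
# request outcomes is kept in a deque; should_alert() fires once the
# fraction of failures exceeds the threshold. Window size and threshold
# are illustrative and would be tuned per service.

from collections import deque

class ErrorRateMonitor:
    def __init__(self, window=100, threshold=0.05):
        self.outcomes = deque(maxlen=window)  # True = request failed
        self.threshold = threshold

    def record(self, failed):
        """Record one request outcome; old ones fall out of the window."""
        self.outcomes.append(failed)

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        return self.error_rate() > self.threshold
```

A deployment that pushes the error rate over the threshold is noticed by the system, not by an engineer reading logs on Saturday morning.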
Good monitoring, alerting, and on-call engineers? Deploy on Friday.
Grepping logs manually over ssh? Wait until Monday.
How long does it take to solve problems?
Finally, the key criterion: how long will it take to fix problems? That partly depends on the blast radius of the change. Even with a polished deployment pipeline, some changes are hard to fix quickly. Rolling back a change to the data ingestion system or the search index schema may require laborious reindexing on top of fixing a line of code. A CSS change might take minutes to deploy, validate, fix, and redeploy, while a major storage change may require days of work.
For all the work you put into the deployment pipeline, which raises the reliability of changes at the macro level, no two changes are alike, so each must be evaluated on its own. If something goes wrong, can we fix it quickly?
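The "can we fix it quickly?" question is what automated rollback answers. Here is a minimal sketch of a deploy-verify-rollback loop; the version bookkeeping and health check are hypothetical stubs standing in for a real orchestrator.

```python
# Deploy-with-rollback sketch: ship the new version, run a health check,
# and revert automatically if it fails. The state dict and stub functions
# below are placeholders for a real deployment system.

def deploy_with_rollback(deploy, health_check, rollback):
    """Run deploy(); keep it if health_check() passes, else roll back."""
    deploy()
    if health_check():
        return "deployed"
    rollback()
    return "rolled back"

# Example wiring with stubs:
state = {"version": "v1"}

def deploy_v2():
    state["version"] = "v2"

def rollback_to_v1():
    state["version"] = "v1"

def healthy():
    # Pretend v2 passes its post-deploy checks.
    return state["version"] == "v2"

result = deploy_with_rollback(deploy_v2, healthy, rollback_to_v1)
```

When a single revert commit (or an automated step like this) fully undoes the change, the cost of a bad Friday release collapses to minutes.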
Fully fixable with a single revert commit? Deploy on Friday.
Big trouble if something goes wrong? Wait until Monday.
Think for yourself, decide for yourself
So what is my position on #NoDeployFriday? I think it all depends on the release. Changes with a small blast radius that are easy to roll back can be deployed at any time, on any day. For large changes whose impact must be closely watched in production, I strongly recommend waiting until Monday.
Ultimately, deploying on Fridays is up to you. If you're working with a creaky, fragile system, it's best to avoid Fridays until you've done the work needed to improve the deployment process. Just make sure you actually do that work; don't brush it off. Refusing Friday releases is a normal way to compensate for temporary infrastructure shortcomings, a reasonable form of harm reduction for the good of the business. But it's bad when the rule papers over permanent shortcomings.
If you're not sure what effect a change will have, postpone it until Monday. But think about what you could do next time to understand that effect better, and improve the surrounding infrastructure accordingly. As always in life, every decision has its nuances. Decisions aren't divided into black and white, right and wrong: as long as we're doing everything we can for the business, the applications, and each other while improving our systems, we're doing fine.