Checklist: what has to be done before deploying microservices to production

This article is a brief digest of my own experience and that of my colleagues, with whom I have been fighting incidents day and night. Many of those incidents would never have occurred if all these microservices that we love so much had been written at least a little more carefully.

Unfortunately, some programmers seriously believe that a Dockerfile with any command at all inside is a microservice in itself and can be deployed right away. The containers are running, the money is coming in. This approach leads to problems that start with performance degradation, code that cannot be debugged, and service failures, and end in a nightmare called Data Inconsistency.

If you feel that the time has come to launch one more app in Kubernetes / ECS / whatever, then I have a few objections.

Disclaimer: this is a translation of my original post. English is not my native language, and I would like to improve. If you see an error, feel free to point it out; I will be thankful.

I have put together my own set of criteria for assessing whether an application is ready to launch in production. Some items on this checklist apply only to particular kinds of applications; others apply to everything. I am sure you can add your own items in the comments or dispute any of mine.

If your microservice fails even one of these criteria, I will not allow it into my ideal cluster, built in a bunker 2000 meters underground with heated floors (yeah, it will be hot) and a closed, self-sufficient Internet system.

Here we go!

Note: the order of the items doesn't matter. At least for me.

A short description in the Readme

The service has a short description of itself at the very beginning of Readme.md in its git repository.

God, it seems so simple. But how often have I come across a repository that does not contain a short explanation of why it is needed, what tasks it solves, and so on. And I am not even talking about anything more complicated, such as configuration options.

Integration with a monitoring system

The service sends metrics to DataDog, NewRelic, Prometheus, etc.

Analysis of resource consumption, memory leaks, stacktraces, service interdependencies, error rate and so on… It is extremely difficult to control what happens in a large distributed application without understanding all of this.

Alerts are configured

The service has alerts configured that cover all standard situations, plus known situations unique to it.

Metrics are good, but no one will watch them manually. Therefore, we automatically receive calls, pushes, or texts if:

CPU / memory consumption has increased dramatically.
Traffic has increased or dropped dramatically.
The number of transactions processed per second has changed sharply in either direction.
The size of a deploy artifact has noticeably changed (exe, app, jar, ...).
The percentage of errors or their frequency exceeded the permissible threshold.
The service stopped sending metrics (a situation often overlooked).
The regularity of certain expected events is broken (a cron job did not run, not all events are processed, etc.).
...

Runbooks created

A document has been created for the service describing known or expected abnormal situations.

how to make sure that the error is internal and does not depend on a third-party;
if it does depend on one: where, to whom, and what to write;
how to restart it safely;
how to restore from a backup and where the backups are;
what special dashboards / search queries were created to monitor this service;
does the service have its own admin panel and how to get there;
is there an API / CLI and how to use it to fix known problems;
and so on.

The list can be very different in different organizations, but at least the basic things must be there.

All logs are written to STDOUT / STDERR

The service does not produce any log files in production mode, does not send logs to any external services, and does not contain redundant abstractions such as log rotation.

When an application writes its logs to files, those logs are useless. You will not jump into 5 containers running in parallel, hoping to catch the necessary error with "tail -f" (yes, you will, crying all the while...). And restarting a container results in the complete loss of its logs.

If an application writes logs directly to a third-party system, for example Logstash, this creates useless redundancy. The neighboring service can't do the same because it is based on another framework? You'll get a zoo.

The application writes part of its logs to files and another part to STDOUT because the developer wants to see INFO in the console but DEBUG in the files? This is generally the worst option. No one needs this complexity, or wants to maintain extra code and configuration that one has to learn first.

Logs mean JSON

Each line of the log is written in JSON format and contains an agreed set of fields.

Almost everyone still writes logs in plain text. This is a real disaster. I would be happy never to have known about Grok patterns. Sometimes I dream about them and I freeze, trying not to move so as not to attract their attention. Just try parsing Java exceptions with Logstash once and feel that pain.

JSON is a boon, a fire given from heaven. Just add there:

timestamp with milliseconds according to RFC 3339;
level: info, warning, error, debug;
user_id;
app_name;
and other fields.

Ship them to any suitable system (a properly configured ElasticSearch, for example) and enjoy. Join the logs of many microservices together and feel once again what monolithic applications were good at.

(You can also add a Request-Id and get tracing...)
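
A minimal sketch of such a log line using only the Python standard library. The names JsonFormatter and build_logger, the "billing" app name, and the exact field set are assumptions made for illustration; the agreed set of fields in a real ecosystem will differ.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, app_name: str) -> None:
        super().__init__()
        self.app_name = app_name

    def format(self, record: logging.LogRecord) -> str:
        # RFC 3339 timestamp in UTC with millisecond precision.
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": ts.isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "app_name": self.app_name,
            "message": record.getMessage(),
            # Optional fields arrive via logger.info(..., extra={...}).
            "user_id": getattr(record, "user_id", None),
            "request_id": getattr(record, "request_id", None),
        }
        if record.exc_info:
            # Keep the whole stack trace inside a single JSON string.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, ensure_ascii=False)


def build_logger(app_name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to STDOUT by default."""
    logger = logging.getLogger(app_name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter(app_name))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Calling build_logger("billing").info("payment accepted", extra={"user_id": 42, "request_id": "req-1"}) would then emit a single JSON line to STDOUT, and a multi-line stack trace stays inside one "exception" string instead of spilling across lines.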

Logs with verbosity levels

The application must support an environment variable, for example LOG_LEVEL, with at least two options: ERRORS and DEBUG.

It is desirable that all services in the same ecosystem support the same environment variable. Not a config option, not a command-line option (although this could be wrapped, of course), but an environment variable by default. You should be able to get as many logs as possible when something goes wrong and as few as possible when everything is fine.
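
As a rough sketch, reading the level could look like this in Python. The mapping of ERRORS and DEBUG onto the standard library levels, the quiet fallback for unknown values, and the function names are assumptions, not a prescribed convention.

```python
import logging
import os
import sys

# Assumed mapping from the two options named above to stdlib levels.
_LEVELS = {"ERRORS": logging.ERROR, "DEBUG": logging.DEBUG}


def level_from_env(environ=os.environ, default: str = "ERRORS") -> int:
    """Translate LOG_LEVEL into a logging level, falling back to the quiet one."""
    raw = environ.get("LOG_LEVEL", default).strip().upper()
    return _LEVELS.get(raw, _LEVELS[default])


def configure_logging(environ=os.environ) -> None:
    """Send all logs to STDOUT at the level requested by the environment."""
    logging.basicConfig(stream=sys.stdout, level=level_from_env(environ), force=True)
```

With this in place, switching a misbehaving service to verbose output means setting LOG_LEVEL=DEBUG and restarting it, with no config file to edit.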

Locked versions of dependencies

Dependencies for package managers are pinned down to the exact version (for example, cool_framework = 2.5.3). Committed lockfiles are also a good way to do this.

This has already been said in many places, of course. Some people lock dependencies only to the major version, hoping that minor versions will bring nothing but bug fixes and security fixes. That is a mistake.

Dockerized

The repository contains a production-ready Dockerfile and docker-compose.yml.

Docker has long been a standard for many companies. There are exceptions, but even if you don't run Docker in production, any engineer should still be able to simply run "docker-compose up" and get a dev build for local verification without thinking about anything else. And the system administrator should receive an artifact already verified by developers, with the correct versions of libraries, utilities, and so on, in which the application at least somehow works, so that it can be adapted to production.

Configuration via environment

All important configuration options are read from the environment, and the environment takes priority over configuration files (but yields to command-line arguments).

Nobody will ever want to read your configuration files and study their format. Just accept it.

More details here: https://12factor.net/config
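
The precedence described above (config file, then environment, then command line) can be sketched as follows. The APP_ prefix, the two option names, and the JSON file format are hypothetical choices made only for this example.

```python
import argparse
import json
import os
from pathlib import Path

# Hypothetical options with their built-in defaults.
DEFAULTS = {"db_host": "localhost", "db_port": "5432"}


def load_config(argv=None, environ=os.environ, config_path=None) -> dict:
    """Merge settings: defaults < config file < environment < command line."""
    config = dict(DEFAULTS)

    # 1. Optional JSON config file overrides the built-in defaults.
    if config_path is not None and Path(config_path).is_file():
        config.update(json.loads(Path(config_path).read_text()))

    # 2. Environment variables such as APP_DB_HOST override the file.
    for key in DEFAULTS:
        value = environ.get(f"APP_{key.upper()}")
        if value is not None:
            config[key] = value

    # 3. Command-line arguments such as --db-host override everything else.
    #    When argv is None, no arguments are parsed at all.
    parser = argparse.ArgumentParser()
    for key in DEFAULTS:
        parser.add_argument(f"--{key.replace('_', '-')}", dest=key, default=None)
    args = parser.parse_args([] if argv is None else argv)
    for key, value in vars(args).items():
        if value is not None:
            config[key] = value
    return config
```

A real entry point would call load_config(sys.argv[1:]); passing the environment and the arguments in explicitly keeps the merge order easy to test.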

Readiness and Liveness probes

The service contains the appropriate endpoints or CLI commands to test its readiness to serve requests, both at startup and throughout its lifecycle.

If an application serves HTTP requests, it should by default have two endpoints (CLI checks are also possible):

To verify that the application is alive and not stuck, the liveness probe is used. If the application does not respond, it can be automatically stopped by an orchestrator such as Kubernetes. Honestly speaking, killing a hung application can cause a domino effect and take your entire service down for good. But this is not a developer problem: just make this endpoint and switch your phone to flight mode.
To verify that the application has not merely started but is ready to accept requests, a readiness probe is performed. Once the application has established its connections to the database, the queue system, and so on, it must respond with a status code of at least 200 and below 400 (for Kubernetes).
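
Here is one possible shape for the two endpoints, sketched with the Python standard library. The /healthz and /readyz paths, the port, and the shared ready flag are assumptions; the application is expected to set the flag itself once its dependencies are connected.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set by the application once the database, queue, etc. are connected.
ready = threading.Event()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/healthz":
            # Liveness: the process is still able to answer HTTP at all.
            self._reply(200, b"alive")
        elif self.path == "/readyz":
            # Readiness: succeed only after dependencies are connected.
            if ready.is_set():
                self._reply(200, b"ready")
            else:
                self._reply(503, b"not ready")
        else:
            self._reply(404, b"not found")

    def _reply(self, status: int, body: bytes) -> None:
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args) -> None:
        # Keep probe traffic out of the application logs.
        pass


def start_probe_server(host: str = "0.0.0.0", port: int = 8080) -> ThreadingHTTPServer:
    """Serve the probes on a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer((host, port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The readiness handler answers 503 until ready.set() is called, so an orchestrator would hold traffic back during startup, while the liveness endpoint answers as soon as the process can serve HTTP.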

Resource limits

The service has limits for memory, CPU, disk space, and any other available resources, declared in the agreed format.

The concrete implementation of this point will differ greatly between organizations and between orchestrators. However, the limits must be configured specifically for every available environment (prod, dev, test, ...) and be stored outside of the git repo with the application code.

Automated builds and delivery

The CI / CD system used in your organization or project is configured and able to deliver the application to the desired environment according to the accepted workflow.

Nothing is ever delivered to production manually.

No matter how difficult it is to automate the builds and delivery of your application, this must be done before the project gets into production. This item covers building and executing Ansible / Chef cookbooks / Salt / ..., building applications for mobile devices, forks of operating systems, images of virtual machines, whatever.
Hard to automate? Then you can't bring it into the world. No one will be able to build it manually again after you leave.

Graceful shutdown

The application understands SIGTERM and other signals and shuts itself down gracefully after finishing the current task.

This is an extremely important item. Docker processes become orphaned and keep working for months in the background, where no one sees them. Non-transactional operations are cut off in the middle of execution, creating data inconsistency between services and between data stores. This leads to errors that cannot be foreseen and that can be very, very expensive.

If you aren't able to control some of the third-party dependencies and cannot guarantee that your code will correctly handle SIGTERM, use something like dumb-init.
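
A sketch of the "finish the current task, then stop" behavior in Python. The GracefulShutdown class and the run_worker loop are hypothetical names; a real service would plug in its own task source and handler.

```python
import signal
import threading


class GracefulShutdown:
    """Turn SIGTERM/SIGINT into a flag that the work loop checks between tasks."""

    def __init__(self) -> None:
        self._stop = threading.Event()

    def install(self) -> None:
        # signal.signal() may only be called from the main thread.
        signal.signal(signal.SIGTERM, self.request_stop)
        signal.signal(signal.SIGINT, self.request_stop)

    def request_stop(self, signum=None, frame=None) -> None:
        # Only record the request; the loop decides when it is safe to exit.
        self._stop.set()

    @property
    def stopping(self) -> bool:
        return self._stop.is_set()


def run_worker(tasks, handle, shutdown: GracefulShutdown) -> int:
    """Process tasks one by one; once asked to stop, take no new task."""
    done = 0
    for task in tasks:
        if shutdown.stopping:
            break
        handle(task)  # the task in progress always runs to completion
        done += 1
    return done
```

The service would call install() once at startup. Because the handler only sets a flag, a signal that arrives in the middle of a task does not interrupt it; the loop simply declines to pick up the next one.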

More information here:

Database connection is checked regularly

The application constantly pings the database and automatically reacts to a "connection lost" exception raised by a ping or by any other query, either trying to restore the connection on its own or shutting down correctly.

I have seen a lot of cases (it's not just a figure of speech) where services that process queues or events and run as daemons lost their connection on a timeout and then started endlessly writing errors to the logs, returning messages to their original queue, sending them to the Dead Letter Queue, or simply not doing their job.
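
One way to sketch the "ping, reconnect, or give up cleanly" behavior in Python. ConnectionLost stands in for whatever exception the real driver raises, and the connect factory and ping() method are assumptions about the driver's interface.

```python
import time


class ConnectionLost(Exception):
    """Stand-in for the driver-specific 'connection lost' error."""


class ResilientConnection:
    """Ping before use and reconnect with bounded exponential backoff."""

    def __init__(self, connect, max_attempts: int = 5,
                 base_delay: float = 0.5, sleep=time.sleep) -> None:
        self._connect = connect  # hypothetical factory returning a connection
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._sleep = sleep
        self._conn = None

    def get(self):
        """Return a live connection, reconnecting if the ping fails."""
        if self._conn is not None:
            try:
                self._conn.ping()
                return self._conn
            except ConnectionLost:
                self._conn = None
        return self._reconnect()

    def _reconnect(self):
        for attempt in range(self._max_attempts):
            try:
                self._conn = self._connect()
                return self._conn
            except ConnectionLost:
                if attempt < self._max_attempts - 1:
                    # Wait base_delay, then 2x, 4x, ... between attempts.
                    self._sleep(self._base_delay * 2 ** attempt)
        # Give up so the process can exit and be restarted from outside.
        raise ConnectionLost(f"gave up after {self._max_attempts} attempts")
```

The bounded number of attempts is the important design choice here: a daemon that cannot reconnect raises and exits instead of spinning forever and flooding the logs.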

Scaled horizontally

As the load grows, it is enough to run more instances of the application to ensure that all requests or tasks are processed.

Not all applications can be scaled horizontally. A good example is Kafka consumers: within one consumer group, instances beyond the number of partitions just sit idle. This is not necessarily bad, but if a specific application cannot be launched twice, everyone concerned needs to know about it in advance. This information should be impossible to miss: set in bold in the Readme and repeated wherever possible. Some applications by their nature cannot run in parallel under any circumstances, which makes them seriously difficult to maintain.

It is much better if the application itself controls these situations, or if a wrapper is written for it that watches for "competitors" and simply prevents the process from starting, or from starting its work, until another process completes its own or until some external configuration allows N processes to work simultaneously.
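
On a single Unix host, such a guard against "competitors" can be sketched with an advisory file lock. This is only an illustration with hypothetical names; instances spread across a cluster would need a shared lock (in a database or a coordination service) instead of a local file.

```python
import fcntl
import os


class AlreadyRunning(Exception):
    """Raised when another instance already holds the lock."""


def acquire_single_instance_lock(path: str):
    """Take an exclusive, non-blocking lock; raise if another holder exists.

    The returned file object must stay open for as long as the process works:
    the lock is released when the file is closed or the process dies.
    """
    lock_file = open(path, "a+")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        raise AlreadyRunning(f"another instance holds {path}")
    return lock_file
```

Because the lock disappears together with the process, a crashed instance does not leave a stale lock behind that someone has to clean up by hand.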

Dead letter queues and resistance to "bad" messages

If the service listens to queues or responds to events, a change in the format or content of messages does not cause it to crash. Unsuccessful attempts to process a task are repeated N times, after which the message is sent to the Dead Letter Queue.

Many times I have seen endlessly restarting consumers and queues swollen to such a size that processing them afterwards took many days. Any queue listener must be ready for format changes and for random errors, whether in the message itself (data types in JSON, for example) or in the downstream code that processes it. I even faced a situation where the standard RabbitMQ library in one extremely popular framework did not support retries, attempt counters, and so on at all, and there was no easy way to extend its logic.

And it is much worse when the message is simply destroyed on failure.
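
The retry-then-park logic can be illustrated with an in-memory queue. This is a simplified sketch: the message shape with "body" and "attempts" keys is an assumption, and a real broker would carry the attempt counter in message headers or provide its own dead-letter mechanism.

```python
from collections import deque


def consume(queue: deque, dead_letters: list, handle, max_attempts: int = 3) -> None:
    """Drain an in-memory queue, retrying failures and parking poison messages."""
    while queue:
        message = queue.popleft()
        try:
            handle(message["body"])
        except Exception as exc:  # a bad message must never crash the consumer
            attempts = message.get("attempts", 0) + 1
            if attempts >= max_attempts:
                # Keep the message and the reason instead of destroying it.
                dead_letters.append(
                    {**message, "attempts": attempts, "error": repr(exc)}
                )
            else:
                # Requeue at the tail with an updated attempt counter.
                queue.append({**message, "attempts": attempts})
```

A malformed message is tried max_attempts times and then set aside together with its error, while the healthy messages behind it keep flowing.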

Limited number of processed messages and tasks by one instance

The service supports an environment variable that, if necessary, limits the maximum number of processed tasks, after which the service shuts down gracefully.

Constantly growing memory consumption that ends in "OOM Killed" is a fact of life for modern Kubernetes minds. A primitive check that spares you the need to investigate every one of these memory leaks makes life easier. I have often seen people spend a lot of time and effort (and money) on stopping these leaks, but there is no guarantee that your workmate's next commit will not make things worse. If the application can survive a week, that is an excellent result. Let it then simply stop itself and be restarted. This is better than SIGKILL (about SIGTERM, see above) or an "out of memory" exception. For the next couple of decades, this workaround will be enough for most cases.
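
A minimal sketch of such a limit in Python. The variable name MAX_TASKS and the rule that an unset or non-positive value means "no limit" are assumptions made for this example.

```python
import os


def max_tasks_from_env(environ=os.environ):
    """Read MAX_TASKS; unset, empty, or non-positive means 'no limit'."""
    raw = environ.get("MAX_TASKS", "").strip()
    if not raw:
        return None
    limit = int(raw)
    return limit if limit > 0 else None


def run(tasks, handle, max_tasks=None) -> int:
    """Process tasks until the source is exhausted or the limit is reached."""
    processed = 0
    for task in tasks:
        handle(task)
        processed += 1
        if max_tasks is not None and processed >= max_tasks:
            # Return normally so the orchestrator can start a fresh process.
            break
    return processed
```

The process ends on its own terms between two tasks, so nothing is cut off halfway, and the restart policy of the orchestrator brings up a fresh instance with a clean memory footprint.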

Not locked in by a third-party integration with IP address whitelisting

If an application makes requests to a third-party service that accepts requests only from a limited set of IP addresses, the service makes these requests indirectly (for example, through a reverse proxy).

This is a rare case, but an extremely unpleasant one. It is very inconvenient when one small application makes it impossible to change the cluster network or to move the entire infrastructure to another region. If you need to communicate with something that does not work with OAuth or a VPN, set up a reverse proxy in advance. Do not build into your programs the dynamic addition or removal of such external integrations with hard-wired third-party hostnames and the like, since doing so ties you to a single runtime environment. It is better to start by automating the management of, for instance, Nginx configs, and have your application refer to that proxy.

Obvious HTTP User-agent

The service replaces the User-Agent header with a customized one for all requests to any API, and this header contains enough information about the service itself and its version.

When you have 100 different applications communicating with each other, you can go crazy seeing something like "Go-http-client/1.1" and the dynamic IP address of a Kubernetes container in the logs. Always identify your application and its version explicitly.
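
For instance, with Python's urllib this could look as follows. The service name "billing-service" and version "1.4.2" are made up for the example; in practice they would come from the build metadata.

```python
import platform
import urllib.request

APP_NAME = "billing-service"  # hypothetical service name
APP_VERSION = "1.4.2"         # hypothetical version, normally injected at build time


def user_agent(name: str = APP_NAME, version: str = APP_VERSION) -> str:
    """Identify the calling service, its version, and its runtime."""
    return f"{name}/{version} (python {platform.python_version()})"


def build_request(url: str, **kwargs) -> urllib.request.Request:
    """Create a request that always carries the service's own User-Agent."""
    request = urllib.request.Request(url, **kwargs)
    request.add_header("User-Agent", user_agent())
    return request
```

Routing every outgoing call through one helper like build_request makes it hard to forget the header, and the receiving side sees the service name and version instead of a generic client string.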

Does not violate a license

The service does not contain dependencies whose licenses restrict the application, is not a copy of someone else's code, and so on.

This is a self-evident case, but I have seen things that would give hiccups even to the lawyer who wrote the NDA.

Does not use unsupported dependencies

When the service is first launched, it does not include dependencies that are already out of date.

If the library that you have taken into the project is no longer supported by anyone, look for another way to achieve the goal or start supporting the library yourself.

Conclusion

My own list has some more very specific checks for particular technologies or situations, and maybe I just forgot to add something. I am sure you also know what else must be included in it.
