Invisible deployment of a monolithic application in production on AWS. Personal experience
I am a Lead DevOps Engineer at Miro (formerly RealtimeBoard). I will share how our DevOps team solved the problem of daily server releases of a monolithic stateful application and made them automatic, invisible to users, and convenient for our developers.
Our infrastructure
Our development team consists of 60 people divided into Scrum teams, one of which is the DevOps team. Most Scrum teams support the product's existing functionality and build new features. The task of DevOps is to create and maintain infrastructure that keeps the application fast and reliable and lets teams deliver new functionality to users quickly.
Our application is an infinite online whiteboard. It consists of three layers: the website, the client, and a Java server, which is a monolithic stateful application. The server maintains a persistent WebSocket connection with clients, and each server keeps a cache of open boards in memory.
The entire infrastructure, more than 70 servers, runs on AWS: more than 30 servers with our Java application, plus web servers, database servers, brokers, and much more. As functionality grows, all of this has to be updated regularly without disrupting users' work.
Updating the site and the client is simple: we replace the old version with the new one, and on the next request the user gets the new site and the new client. But if we released the server the same way, we would get downtime. For us that is unacceptable, because the main value of our product is real-time collaboration between users.
What our CI/CD process looks like
Our CI/CD process looks like this: git commit, git push, then automatic build, automated testing, deployment, release, and monitoring.
For continuous integration we use Bamboo and Bitbucket; for automated testing, Java and Python, with Allure to display the test results; for continuous delivery, Packer, Ansible, and Python. All monitoring is done with the ELK Stack, Prometheus, and Sentry.
Developers write code and push it to the repository, which triggers an automatic build and automated tests. Meanwhile, the change collects approvals from other developers inside the team during code review. When all the required steps, including autotests, have passed, the team merges the change into the main branch, and the main-branch build starts and is sent for automated testing. The whole process is set up and run by the team on its own.
AMI image
In parallel with the application build and testing, the build of an AMI image for Amazon starts. For this we use Packer from HashiCorp, a great open-source tool for building virtual machine images. All parameters are passed in a JSON file as a set of configuration keys. The main parameter is builders, which specifies the provider we are creating the image for (in our case, Amazon).
"builders": [{
"type": "amazon-ebs",
"access_key": "{{user `aws_access_key`}}",
"secret_key": "{{user `aws_secret_key`}}",
"region": "{{user `aws_region`}}",
"vpc_id": "{{user `aws_vpc`}}",
"subnet_id": "{{user `aws_subnet`}}",
"tags": {
"releaseVersion": "{{user `release_version`}}"
},
"instance_type": "t2.micro",
"ssh_username": "ubuntu",
"ami_name": "packer-board-ami_{{isotime \"2006-01-02_15-04\"}}"
}],
Importantly, we not only create the virtual machine image but also configure it in advance with Ansible: we install the necessary packages and apply the configuration needed to run the Java application.
"provisioners": [{
"type": "ansible",
"playbook_file": "./playbook.yml",
"user": "ubuntu",
"host_alias": "default",
"extra_arguments": ["--extra_vars=vars"],
"ansible_env_vars": ["ANSIBLE_HOST_KEY_CHECKING=False", "ANSIBLE_NOCOLOR=True"]
}]
Ansible roles
We used to use plain Ansible playbooks, but that led to a lot of duplicated code that was hard to keep up to date: we would change something in one playbook, forget to do it in another, and run into problems as a result. So we switched to Ansible roles. We made them as universal as possible so that we could reuse them in different parts of the project without bloating the code with large repeated chunks. For example, we use the monitoring role for all types of servers.
- name: Install all board dependencies
  hosts: all
  user: ubuntu
  become: yes
  roles:
    - java
    - nginx
    - board-application
    - ssl-certificates
    - monitoring
From the Scrum teams' point of view, this process is as simple as possible: the team receives a Slack notification that the build and the AMI image have been assembled.
Pre-releases
We introduced pre-releases to deliver product changes to users as quickly as possible. Essentially, these are canary releases that let us safely test new functionality on a small percentage of users.
Why are they called canary releases? Miners used to take a canary down into the mine with them. If there was gas in the mine, the canary died and the miners quickly returned to the surface. It works the same way for us: if something goes wrong on the server, the release is not ready, we can quickly roll back, and most users will not notice anything.
How the canary release starts:
- The development team clicks a button in Bamboo -> a Python application is called that launches the pre-release.
- It creates a new instance in Amazon from a pre-built AMI image with the new version of the application.
- The instance is added to the required target groups and load balancers (see the sketch after this list).
- Ansible applies an individual configuration to each instance.
- Users work with the new version of the Java application.
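As an illustration only (the real Python application is more involved), launching an instance from the AMI and adding it to a target group with boto3 could look roughly like this; the AMI ID, subnet, and target group ARN below are placeholders:
import boto3

# Hypothetical placeholders; the real values come from the release pipeline.
AMI_ID = 'ami-0123456789abcdef0'
SUBNET_ID = 'subnet-0123456789abcdef0'
TARGET_GROUP_ARN = 'arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/board-prerelease/abc123'

ec2 = boto3.client('ec2')
elbv2 = boto3.client('elbv2')

# Launch a new instance from the pre-built AMI with the new application version.
instance = ec2.run_instances(
    ImageId=AMI_ID,
    InstanceType='t2.micro',
    SubnetId=SUBNET_ID,
    MinCount=1,
    MaxCount=1,
)['Instances'][0]
instance_id = instance['InstanceId']

# Wait until the instance is running before registering it.
ec2.get_waiter('instance_running').wait(InstanceIds=[instance_id])

# Register the instance in the target group behind the load balancer.
elbv2.register_targets(
    TargetGroupArn=TARGET_GROUP_ARN,
    Targets=[{'Id': instance_id}],
)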
On the side of the Scrum commands, the pre-release launch process again looks as simple as possible: the team receives notifications in Slack that the process has started, and after 7 minutes the new server is already in operation. Additionally, the application sends to Slack the entire changelog of changes in the release.
For this safety barrier to work, Scrum teams monitor new errors in Sentry, an open-source real-time error tracking application. Sentry integrates easily with Java and has connectors for logback and log4j. When the application starts, we pass Sentry the version it is running, so when an error occurs we can see which version of the application it happened in. This helps Scrum teams react to errors and fix them quickly.
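Our integration lives on the Java side through the logback connector, but the idea of tagging every event with the running release is easy to illustrate with Sentry's Python SDK; the DSN and version string below are placeholders, not our real configuration:
import sentry_sdk

# Placeholder DSN and release; the point is that every reported error
# carries the application version it came from.
sentry_sdk.init(
    dsn='https://examplePublicKey@o0.ingest.sentry.io/0',
    release='board-application@1.2.3',
)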
A pre-release must run for at least 4 hours. During this time the team monitors how it behaves and decides whether to roll the release out to all users.
Several teams can ship their releases at the same time. To do this, they agree among themselves on what goes into the pre-release and who is responsible for the final release. Then the teams either combine all the changes into one pre-release or launch several pre-releases simultaneously. If all pre-releases behave correctly, they go out as a single release the next day.
Releases
We do a daily release:
- We bring new servers into operation.
- We monitor user activity on the new servers using Prometheus.
- We close the old servers to new users.
- We transfer users from the old servers to the new ones.
- We shut down the old servers.
All of this is driven by Bamboo and a Python application. The application checks the number of running servers and prepares to launch the same number of new ones. If there are not enough servers, new ones are created from the AMI image, the new version is deployed on them, the Java application is started, and the servers are put into operation.
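As a rough sketch (not our actual code), counting the currently running application servers with boto3 could look like this; the role tag used in the filter is an assumption:
import boto3

ec2 = boto3.client('ec2')


def count_running_board_servers():
    """Count running application servers, identified here by a hypothetical role tag."""
    paginator = ec2.get_paginator('describe_instances')
    pages = paginator.paginate(
        Filters=[
            {'Name': 'tag:role', 'Values': ['board-application']},  # assumed tag
            {'Name': 'instance-state-name', 'Values': ['running']},
        ]
    )
    return sum(
        len(reservation['Instances'])
        for page in pages
        for reservation in page['Reservations']
    )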
During monitoring, the Python application uses the Prometheus API to check the number of open boards on the new servers. Once it sees that everything is working properly, it closes access to the old servers and transfers users to the new ones.
import requests

# Address of the internal Prometheus server; the instant-query API lives under /api/v1/query.
PROMETHEUS_URL = 'https://prometheus'


def get_spaces_count():
    """Return a mapping of application instance -> number of open boards."""
    boards = {}
    try:
        params = {
            'query': 'rtb_spaces_count{instance=~"board.*"}'
        }
        response = requests.get(PROMETHEUS_URL + '/api/v1/query', params=params)
        for metric in response.json()['data']['result']:
            boards[metric['metric']['instance']] = metric['value'][1]
    except requests.exceptions.RequestException as e:
        print('requests.exceptions.RequestException: {}'.format(e))
    return boards
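The handover itself happens at the application's cluster level, but the surrounding orchestration might look roughly like the sketch below: once every new server is serving boards, the old instances are deregistered from the target group so that no new connections reach them. The function arguments, the mapping of Prometheus instance labels, and the target group ARN are assumptions, not our real tooling.
import boto3

elbv2 = boto3.client('elbv2')
TARGET_GROUP_ARN = 'arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/board/abc123'  # placeholder


def drain_old_servers(new_instances, old_instance_ids):
    """Deregister old instances once every new server has picked up boards."""
    boards = get_spaces_count()  # Prometheus helper defined above
    if all(float(boards.get(name, 0)) > 0 for name in new_instances):
        elbv2.deregister_targets(
            TargetGroupArn=TARGET_GROUP_ARN,
            Targets=[{'Id': instance_id} for instance_id in old_instance_ids],
        )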
The process of transferring users between servers can be watched in Grafana. The left half of the graph shows servers running the old version, the right half the new one; the point where the curves cross is the moment of user transfer.
The team follows the release in Slack. After the release, the full changelog is published in a dedicated Slack channel, and all Jira tasks associated with the release are automatically closed.
How user migration works
We keep the state of the board users are working on in the application's memory and continuously save all changes to the database. To transfer a board, we load it into memory on the new server at the cluster level and send the client a command to reconnect. At that moment the client disconnects from the old server and connects to the new one. After a couple of seconds users see the message "Connection restored", but they keep working and notice no inconvenience.
What we learned while making deployment invisible
Here is where we ended up after a dozen iterations:
- The Scrum team tests its own code.
- The Scrum team decides when to launch a pre-release and deliver part of the changes to part of the users.
- The Scrum team decides whether its release is ready to go out to all users.
- Users continue to work and do not notice anything.
We did not get there right away: we stepped on the same rake more than once and picked up plenty of bruises along the way. Here are the lessons we learned.
First the manual process, and only then its automation. In the first steps there is no need to dive deep into automation, because you may end up automating something that turns out not to be useful.
Ansible is good, but Ansible roles are better. We made our roles as universal as possible: we got rid of repetitive code, so each role carries only the functionality it should. Reusing roles saves a lot of time; we already have more than 50 of them.
Reuse Python code and split it into separate libraries and modules. This makes it easier to navigate complex projects and to onboard new people quickly.
Next steps
Our work on invisible deployment is not finished yet. Here are some of the next steps:
- Let teams run not only pre-releases but full releases as well.
- Add automatic rollbacks in case of errors: for example, a pre-release should roll back automatically if critical errors are detected in Sentry.
- Fully automate the release when there are no errors: if a pre-release produced no errors, it can be rolled out further automatically.
- Add automatic code scanning for potential security issues.