Cluster of Puppets: Amazon ECS Experience at iFunny
Despite the title, this article has nothing to do with the Puppet configuration management system.
Along with the trend of "cutting" large monoliths into small microservices, container orchestration has become a fixture of web application operations. Right after the hype around Docker came the hype around tools for running services on top of Docker. Kubernetes gets talked about most often, but its many alternatives are alive and evolving as well.
At iFunny we weighed the benefits and value of orchestrators and eventually chose Amazon Elastic Container Service. In short: ECS is a container management platform on top of EC2 instances. For the details and our battle-tested experience, read on.
Why container orchestration
After the numerous articles about Docker on Habr and elsewhere, presumably everyone has an idea of what it is and is not intended for. Let's now clarify why we need a platform on top of Docker:
Automation of service deployments “out of the box”
Threw some containers onto a machine? Great! But how do you update them without degrading the service while switching traffic away from the static port the client is bound to? And how do you roll back quickly and painlessly if the new version of the application misbehaves? Docker alone does not solve these problems.
Yes, you can write something like that yourself, and at first that is how iFunny worked. Blue-green deployment was done with an Ansible playbook that drove good old iptables, switching the necessary rules over to the IP of the new container; connections to the old container were cleaned up by tracking them with the equally good old conntrack.
It seems transparent and understandable, but, like any home-grown solution, it led to the following problems:
- incorrect handling of unreachable hosts. Ansible does not understand that an EC2 instance can disappear at any moment, not only because of failures but also because of scale-in in an Autoscaling group. So in a perfectly routine situation Ansible may finish the playbook with an error. The issue about this seems to be closed, but is still not actually resolved;
- 500s from the Docker API. Yes, under heavy load Docker can sometimes return a server error, and Ansible cannot handle that either;
- you cannot safely stop a deployment. What happens if you kill the process running the playbook right at the moment it is replacing iptables rules? How big will the resulting chaos on the hosts be? Which machines will end up unavailable?
- no parallelization of tasks within a single host. In Ansible you cannot run the iterations that make up a task in parallel. Put simply: if you need to launch 50 containers from a common template but with different parameters, you have to wait for container 1 to start before container 2 can.
Summing up these problems, we can conclude that declarative SCM tools are not well suited to imperative tasks like deployment. You may ask, what do orchestrators have to do with it? Simply that any orchestrator lets you deploy a service with a single command, without having to describe the whole process yourself. All the well-known deployment patterns are at your disposal, along with graceful handling of the failures described above.
In my opinion, orchestration platforms are the only way to get fast, simple and reliable deployment with Docker. AWS fans like us might point to Elastic Beanstalk as a counterexample. We did use it for a while in production, but it caused enough problems that they would not fit into this article.
Simplification of “configuration management”
I once heard an interesting comparison of an orchestration platform to an operating system scheduling processes onto CPUs: you don't care which core a given program runs on, do you?
The same applies to orchestrators. By and large, you don't care which machines a service runs on or in how many copies, since the configuration on the balancer is updated dynamically. Host configuration in production becomes minimal: ideally, just install Docker. An "even bigger ideal" is to drop configuration management altogether by running CoreOS.
Your fleet is then no longer something you have to watch day and night, but simply a pool of resources whose parts can be replaced at any moment.
Service-centric infrastructure approach
In recent years, web application infrastructure has been shifting from host-centric to service-centric. This continues the previous point: instead of monitoring hosts, you monitor the externally visible health of the service. The philosophy of an orchestration platform fits this paradigm far more naturally than keeping a service pinned to a strictly fixed pool of hosts.
Microservices fit here as well. Beyond deployment automation, an orchestrator makes it quick and easy to create new services and wire them together (Service Discovery tooling most often comes "in the box" with orchestrators).
Infrastructure getting closer to developers
For the iFunny development team, DevOps is not an empty phrase, and certainly not the job title of a single engineer. We strive to give developers maximum freedom of action to accelerate that famous trio of Flow, Feedback and Experimentation.
Over the last year or two we have been actively carving microservices out of the API monolith, with new ones launching constantly. In practice, container orchestration helps a lot with standardizing and speeding up the launch of a service as a technical process. Done right, a developer can create a new service at any time, instead of waiting a couple of weeks (or even a month) for their ticket to work its way through the backlog to an admin.
There are plenty of other reasons to use orchestrators. We could go on about resource utilization, for instance, but then even the most attentive reader wouldn't make it to the end.
Choosing an orchestrator
This is where we could compare a dozen solutions on the market, run endless benchmarks of container start-up and cluster deployment, and blow the lid off the many bugs that block particular product features.
In reality it was much more boring. At iFunny we try to make the most of AWS services: the team is small, and as always there is not enough time, knowledge and experience to reinvent wheels or endlessly patch other people's. So we decided to take the beaten path and pick a simple, understandable tool. And ECS itself is free as a service: you pay only the standard rate for the EC2 instances your agents and containers run on.
Small spoiler: the approach worked, but ECS brought plenty of other issues. The confession about "what a pity it isn't Kubernetes" comes at the end of the article.
Terminology
Let's get acquainted with the basic concepts in ECS.
Cluster → Service → Task
This is the chain to memorize first. At the top of the ECS platform is the cluster, which you don't have to host on any instance yourself and which is managed through the AWS API.
Within a cluster you run tasks, each consisting of one or more containers that run as a single bundle; in other words, a Task is the analogue of a Pod in Kubernetes. For flexible container management (scaling, deployment and the like) there is the concept of a service.
A service consists of a certain number of tasks. Continuing the comparison with Kubernetes, a service is analogous to a Deployment.
Task definition
A JSON description of the task's launch parameters. It is a regular specification that can be thought of as a wrapper around the docker run command. All the knobs for tagging and logging are there, which, for example, the Docker + Elastic Beanstalk combination lacks.
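As a rough illustration (not taken from our actual configs), a minimal task definition for a single-container service, registered through the AWS CLI, might look like this; the family name, image, limits and log options are placeholders:

    # Hypothetical task definition for a single-container service;
    # every call registers a new revision of the "my-service" family.
    aws ecs register-task-definition --cli-input-json '{
      "family": "my-service",
      "containerDefinitions": [
        {
          "name": "my-service",
          "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-service:latest",
          "cpu": 256,
          "memory": 512,
          "portMappings": [{"containerPort": 8080, "hostPort": 0}],
          "logConfiguration": {
            "logDriver": "fluentd",
            "options": {"tag": "my-service"}
          }
        }
      ]
    }'

Here hostPort 0 asks for a dynamic host port, which matters for the ALB integration described below.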
ECS agent
A local agent that runs on each instance as a container. It monitors the state of the instance and its resource utilization and sends container launch commands to the local Docker daemon. The agent's source code is available on GitHub.
Application Load Balancer (ALB)
AWS's next-generation load balancer. It differs from ELB mainly in concept: where ELB balances traffic at the host level, ALB balances it at the application level. In the ECS ecosystem the balancer acts as the entry point for user traffic. You don't need to think about how to direct traffic to a new version of the application - you simply hide the containers behind the balancer.
ALB has the concept of a target group to which application instances attach, and it is the target group that an ECS service is bound to. With this setup the cluster keeps track of which ports the service's containers are listening on and passes that information to the target group so it can distribute traffic from the balancer. You don't have to worry about which port a container is exposed on or how to avoid port collisions between several services on the same machine: ECS resolves this automatically.
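A hedged sketch of that binding in CLI form (cluster, service name, ARNs and the IAM role are placeholders; the exact role wiring depends on your setup):

    # Create a service behind an ALB target group. With hostPort 0 in the
    # task definition, ECS registers the dynamically assigned host ports
    # in the target group on its own.
    TG_ARN="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-service/0123456789abcdef"  # placeholder

    aws ecs create-service \
        --cluster production \
        --service-name my-service \
        --task-definition my-service:1 \
        --desired-count 4 \
        --role ecsServiceRole \
        --load-balancers "targetGroupArn=$TG_ARN,containerName=my-service,containerPort=8080"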
Task placement strategy
A strategy for distributing tasks across the cluster's available resources. A strategy is made up of a type and a field (parameter). As in other orchestrators, there are three types: binpack (pack one machine to capacity before moving on to the next), spread (distribute tasks evenly across the cluster) and random (self-explanatory). The field can be CPU, memory, Availability Zone or instance ID.
In practice we settled on spreading tasks across Availability Zones (in other words, across data centers) as the rock-solid option. It reduces contention for machine resources between containers and softens the blow if one of the AWS Availability Zones fails unexpectedly.
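The strategy is passed when the service is created; a sketch of that AZ-spread variant (the field names are the ones ECS uses, everything else is illustrative):

    # Spread tasks across Availability Zones first, then across instances.
    aws ecs create-service \
        --cluster production \
        --service-name my-service \
        --task-definition my-service:1 \
        --desired-count 6 \
        --placement-strategy \
            type=spread,field=attribute:ecs.availability-zone \
            type=spread,field=instanceId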
Healthy percentage
The minimum and maximum share of the desired task count within which the service is considered healthy. These parameters are what you tune when configuring a service's deployment.
The application version update itself can occur in two ways:
- if max percentage > 100, new tasks are created according to the parameter, and once the new tasks are taking traffic the same number of old ones are killed; with max percentage = 200 everything happens in one iteration, with 150 in two, and so on;
- if min percentage < 100 and max percentage = 100, everything happens the other way around: old tasks are killed first to free up room for the new ones, and in the meantime all traffic is served by the remaining tasks.
The first option resembles a blue-green deployment and would be perfect if it didn't require keeping twice the resources in the cluster. The second option wins on utilization, but can degrade the application if a decent amount of traffic hits it during the rollout. Which one to choose is up to you.
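Both variants boil down to the deployment configuration on the service; a hedged example of switching an existing service to something between the two (the numbers are illustrative):

    # Allow up to 150% of desiredCount during a rollout, never dropping
    # below 100%: the update then happens in two iterations.
    aws ecs update-service \
        --cluster production \
        --service my-service \
        --deployment-configuration maximumPercent=150,minimumHealthyPercent=100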
Autoscaling
Besides scaling at the EC2 instance level, ECS also scales at the task level. Just as with an Autoscaling Group, you can configure triggers for an ECS service based on CloudWatch Alarms. The best option is to scale on CPU usage as a percentage of the value specified in the Task Definition.
An important point: the CPU parameter, as in Docker itself, is specified not in cores but in CPU units; processor time is then allotted according to which task has more units. In ECS terminology, one CPU core is equivalent to 1024 units.
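We drive this with CloudWatch alarms; as a simpler sketch of the same idea, a target-tracking policy on average service CPU can be set up through Application Auto Scaling roughly like this (resource id, capacities and the target value are illustrative, and depending on the account an IAM role may also need to be supplied):

    # Register the service's DesiredCount as a scalable target...
    aws application-autoscaling register-scalable-target \
        --service-namespace ecs \
        --resource-id service/production/my-service \
        --scalable-dimension ecs:service:DesiredCount \
        --min-capacity 4 --max-capacity 40

    # ...and keep average CPU utilization around 60% of the task definition value.
    aws application-autoscaling put-scaling-policy \
        --service-namespace ecs \
        --resource-id service/production/my-service \
        --scalable-dimension ecs:service:DesiredCount \
        --policy-name my-service-cpu \
        --policy-type TargetTrackingScaling \
        --target-tracking-scaling-policy-configuration '{
            "TargetValue": 60.0,
            "PredefinedMetricSpecification": {
              "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
            }
          }'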
Elastic Container Registry (ECR)
AWS's Docker image hosting service. In other words, you get a Docker registry without having to host it yourself. The plus is simplicity; the minus is that you cannot have more than one domain, and each service needs its own repository created separately.
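The per-service workflow is short; a sketch (account id, region and names are placeholders, and the get-login helper is the one that was current at the time):

    # One repository per service.
    aws ecr create-repository --repository-name my-service

    # Authenticate the local Docker daemon against ECR, then push.
    $(aws ecr get-login --no-include-email --region us-east-1)
    docker tag my-service:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-service:latest
    docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-service:latest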
Integration into existing infrastructure
Now for the fun part: how ECS took root at iFunny.
Deployment pipeline
ECS may be good as an orchestrator and resource scheduler, but it does not provide a deployment tool. In ECS terminology a deployment is a service update: you create a new revision of the Task Definition, update the service to point at that revision, wait for the update to finish, and roll back to the old revision if something goes wrong. At the time of writing, AWS had no ready-made tool that would do all of this in one step. There is a separate CLI for ECS, but it is closer to an analogue of Docker Compose than to a tool for deploying individual services.
Fortunately, the open source world makes up for this with the ecs-deploy utility. It is essentially a shell script of a few hundred lines, but it does its immediate job very well: you specify the cluster, the service and the Docker image you want to deploy, and it runs through the whole algorithm step by step. It will also roll back if the update fails, and clean out obsolete Task Definitions.
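A typical invocation looks roughly like this (flag names as in the version we used; cluster, service and image tag are illustrative):

    # Register a new task definition revision with this image, update the
    # service and wait until it stabilizes; roll back on failure.
    ./ecs-deploy \
        --cluster production \
        --service-name my-service \
        --image 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-service:20180301 \
        --timeout 300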
At first its only drawback was that it could not update a Task Definition in full, say if you wanted to change CPU limits or reconfigure the log driver. But it's a shell script, the easiest thing to tinker with, and the feature was added in a couple of hours. We still use it this way, updating services exclusively through Task Definitions stored in the root of each application's repository.
True, our pull request has been ignored for half a year now, along with a dozen others. So much for the downsides of open source.
Terraform
At iFunny, everything in AWS is provisioned through Terraform. The resources a service needs are no exception: besides the service itself, there are the Application Load Balancer with its Listeners and Target Group, an ECR repository, the first revision of the Task Definition, autoscaling alarms and the necessary DNS records.
The first idea was to wrap all of these resources into one Terraform module and use it every time a service is created. At first it looked great: just 20 lines and you have a production-ready service! But maintaining such a module turned out to be far more expensive over time. Since services are not homogeneous and new requirements keep appearing, we had to edit the module almost every time we used it.
To stop worrying about "syntactic sugar" we went back to square one, describing all resources in Terraform explicitly and wrapping only what can sensibly be wrapped into small modules: load balancing and autoscaling.
At some point the state grew so large that a single plan with refresh took about 5-7 minutes, and it could also be locked by another engineer who happened to be applying something at that moment. We solved this by splitting the one big state into several small ones, one per service.
Monitoring & logging
Everything here turned out extremely transparent and simple. We added a couple of new metrics for the utilization of cluster services and resources to the dashboards and alerts, so it is clearly visible when services start to scale and how well that ultimately works.
As before, we write logs to a local Fluentd agent, which ships them to Elasticsearch where they can be read in Kibana. Unlike Beanstalk, ECS supports any log driver Docker has, and it is configured as part of the Task Definition.
In AWS you can also try the awslogs driver, which shows logs right in the management console. It is a useful option if you don't have enough log volume to justify running and maintaining a separate log collection system.
Scaling & resource distribution
This is where most of the pain was. The service scaling strategy was chosen the long way, by trial and error. That experience made a few things clear:
- binpack on CPU certainly utilizes the cluster well, but at full load everything can lie down for a minute or two while Docker figures out how to divide CPU time under such conditions;
- no orchestrator in existence (ECS included) has the concept of dynamically rebalancing containers. For example, scaling during traffic peaks could be handled by adding new hosts to the cluster and having it redistribute tasks evenly, but those hosts will sit idle until a deployment kicks off on some service. This topic was hotly discussed for Docker Swarm, and it remains unresolved, most likely because it is hard to solve both conceptually and technically.
In the end we decided that under load the services scale instantly and in large increments, while the instances scale once 75% of the cluster's resource reservation is reached. This may not be the best option in terms of hardware utilization, but at least all services in the cluster run stably without getting in each other's way.
Pitfalls
Try to recall a case where adopting something new ended in a 100% happy ending for the engineers. You can't? The iFunny episode with ECS was no exception.
Lack of flexibility in healthcheck
Unlike Kubernetes, where you can flexibly configure liveness and readiness checks for a service, ECS has only one criterion: does the application return a 200 code (or any other you configure) on a single URL. And there are only two ways a service counts as bad: either the container did not start at all, or it started but does not respond to the health check.
This causes problems when, for example, a key part of a service breaks during a deploy but it still answers the health check. In that case you will have to roll back to the old version by hand.
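The one knob you do get is the health check on the ALB target group behind the service; a hedged sketch of tightening it (path, thresholds and the ARN are placeholders):

    # The only health check ECS effectively honours: the target group's.
    TG_ARN="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-service/0123456789abcdef"  # placeholder

    aws elbv2 modify-target-group \
        --target-group-arn "$TG_ARN" \
        --health-check-path /health \
        --health-check-interval-seconds 10 \
        --healthy-threshold-count 2 \
        --unhealthy-threshold-count 3 \
        --matcher HttpCode=200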
Lack of Service Discovery as such
AWS offers its own take on Service Discovery, but the solution looks, to put it mildly, so-so. The best option in this situation is to run a Consul agent + Registrator bundle on the hosts, which is what the iFunny team is doing now.
Raw implementation of scheduled task launches
If it isn't clear, I'm talking about cron. Only in June of last year did ECS introduce the concept of Scheduled Tasks, which lets you run tasks on a cluster on a schedule. The feature had long been awaited by customers, but its operation still feels raw for several reasons.
Firstly, the API does not create the task itself but two separate resources: a CloudWatch Events rule with the schedule and a CloudWatch Events target with the task launch parameters. From the outside it looks opaque. Secondly, there is no decent declarative tool for deploying these tasks.
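To make the opacity concrete, here is roughly what one such "cron job" looks like when assembled by hand via the CLI (names, ARNs and the schedule are placeholders):

    # Resource one: the rule with the schedule.
    aws events put-rule \
        --name my-service-cleanup \
        --schedule-expression "cron(0 3 * * ? *)"

    # Resource two: the target that tells the rule which task to run where.
    aws events put-targets \
        --rule my-service-cleanup \
        --targets '[{
          "Id": "1",
          "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/production",
          "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",
          "EcsParameters": {
            "TaskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/my-service-cleanup:3",
            "TaskCount": 1
          }
        }]'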
We tried to solve the problem with Ansible, but so far things are bad there with standardizing these tasks.
For now iFunny deploys them with a home-grown Python utility driven by a YAML description of the tasks, with plans to turn it into a full-fledged tool for deploying cron tasks on ECS.
Lack of direct communication between the cluster and hosts
When an EC2 instance is terminated for whatever reason, it is not deregistered from the cluster, and all tasks running on it simply die. Since the balancer never got a signal to remove the targets, it keeps sending requests to them until it works out on its own that the containers are unavailable. That takes 10-15 seconds, during which clients get a pile of server errors.
Today the problem can be solved with a Lambda function that reacts to an instance being removed from the Autoscaling Group and asks the cluster to move the tasks off that machine (instance draining, in ECS terminology). The instance is then killed only after all tasks have been drained from it. This works well, but a Lambda in the infrastructure always feels like a crutch; it could have been part of the platform's functionality.
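Stripped of the Lambda plumbing, the core of that hook is a single API call, shown here in CLI form (resolving the EC2 instance id to the container instance ARN is left out):

    # Put the instance that is about to be terminated into DRAINING so the
    # cluster reschedules its tasks before the machine actually goes away.
    aws ecs update-container-instances-state \
        --cluster production \
        --container-instances "$CONTAINER_INSTANCE_ARN" \
        --status DRAINING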
Lack of detailed monitoring
From a cluster, the AWS API gives only the number of registered machines and metrics for the share of reserved capacity; from a service, only the number of tasks and CPU and memory utilization as a percentage of what is set in the Task Definition. This is painful for devotees of the church of metrics. The lack of detail on resource usage by a specific container can play a nasty trick when debugging an overloaded service. Metrics for I/O and network utilization wouldn't hurt either.
Container deregistration in ALB
An important point deduced from the AWS documentation: the deregistration_delay parameter on the balancer is not a timeout for waiting for the target to deregister, but the full timeout. In other words, if the parameter is 30 seconds and your container is stopped after 15, the balancer will keep sending requests to the target and return 500s to clients.
The way out is to set the service's deregistration delay higher than the analogous ALB parameter. That seems obvious, but it is not clearly written anywhere in the documentation, which causes problems at first.
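The ALB side of that tuning is a single target group attribute; a sketch (the ARN and value are placeholders):

    # Keep the balancer's deregistration delay below the time the service
    # actually keeps serving after a stop is initiated.
    aws elbv2 modify-target-group-attributes \
        --target-group-arn "$TG_ARN" \
        --attributes Key=deregistration_delay.timeout_seconds,Value=15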
Vendor lock-in inside AWS
Like any AWS cloud service, ECS cannot be used outside AWS. If for some reason you are thinking about moving to Google Cloud or (for some reason) to Azure, you will have to redo your service orchestration from scratch.
Simplicity
Yes, ECS and the surrounding AWS products are so simple that it is hard to implement anything out of the ordinary in your application architecture. Say you need full HTTP/2 support for a service: you can't have it, because ALB does not support Server Push.
Or you need the application to accept requests at layer 4 (TCP or UDP, it doesn't matter), and again ECS offers no way to get that traffic to the service: ALB works only over HTTP/HTTPS, while the old ELB doesn't work with ECS services and sometimes mangles traffic altogether (as happened, for example, with gRPC).
Retrospective
Summing up all the pluses of orchestration listed at the beginning of the article, we can confidently say they are all real. iFunny now has:
- simple and painless deployment;
- less code and configuration in Ansible;
- management of application units instead of hosts;
- launching of services in production from scratch in 20-30 minutes, done directly by developers.
But the issue of resource utilization remains unresolved.
The final step in fully migrating applications to ECS was moving the core API. Although it went quickly, smoothly and without downtime, the question remained whether orchestrators make sense for large monolithic applications: one application unit often needs a whole host to itself, a reliable deployment requires keeping several idle machines as headroom, and configuration management is still present in one form or another. ECS did resolve many other issues for the better, but the fact remains that with monoliths you won't get much benefit from orchestration.
In terms of scale, the picture came out like this: 4 clusters (one of them a test environment), 36 services in production, around 210-230 running containers at peak, and 80 tasks running on a schedule. Time has shown that scaling up is much faster and easier with orchestration. But if you have a fairly small number of services and containers, it's worth asking whether you need orchestration at all.
As luck would have it, after all these battles AWS started rolling out its own hosted Kubernetes service, EKS. It is at a very early stage and there is no production feedback yet, but everyone understands that you can now get the most popular orchestration platform in AWS with a couple of clicks and still have access to most of its knobs. If we were choosing an orchestrator today, Kubernetes would be the priority thanks to its flexibility, rich functionality and the rapid development of the project.
AWS has also introduced ECS Fargate, which launches containers without your having to host EC2 instances at all. At iFunny we have already tried it on a couple of test services, but it is too early to draw conclusions about its capabilities.
P.S. The article turned out quite long, but our ECS stories don't end here. Ask any questions on the topic in the comments, or share your own success stories.