6 lessons from working with CloudFormation that I learned for life

I started working with CloudFormation 4 years ago. Since then, I have broken many infrastructures, even ones that were already in production. But every time I broke something, I learned something new. Drawing on this experience, I will share some of the most important lessons I have learned.

Lesson 1: Validate Changes Before Deploying

I learned this lesson soon after I started working with CloudFormation. I don't remember what I broke, but I remember clearly that I used the aws cloudformation update-stack command. This command simply rolls out the template without any review of the changes that will be deployed. I don't think I need to explain why you should review all changes before deploying them.

After this failure, I immediately changed the deployment pipeline, replacing the update-stack command with the create-change-set command:

# OPERATION is either "UPDATE" or "CREATE"
# TPL_PATH is assumed to be a file:// URL, e.g. file://template.yaml
changeset_id=$(aws cloudformation create-change-set \
    --change-set-name "$CHANGE_SET_NAME" \
    --stack-name "$STACK_NAME" \
    --template-body "$TPL_PATH" \
    --change-set-type "$OPERATION" \
    --parameters "$PARAMETERS" \
    --output text \
    --query Id)
aws cloudformation wait \
    change-set-create-complete --change-set-name "$changeset_id"
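The snippet above leaves open how OPERATION gets its value. Here is a minimal sketch of one way to derive it from the current stack status. The function names are hypothetical, and the mapping assumes that a stack which does not exist yet, or which was created by a change set that was never executed (status REVIEW_IN_PROGRESS), needs the CREATE type, while every other status takes UPDATE.

```shell
# Map a stack status to a --change-set-type value.
# An empty status means no stack with that name was found.
change_set_type_for_status() {
    local status="$1"
    case "$status" in
        ""|REVIEW_IN_PROGRESS) echo "CREATE" ;;
        *)                     echo "UPDATE" ;;
    esac
}

# Hypothetical wrapper: look up the current status, then pick the type.
# describe-stacks fails when the stack does not exist; the status is
# then left empty, which maps to CREATE.
detect_operation() {
    local stack_name="$1" status
    status=$(aws cloudformation describe-stacks \
        --stack-name "$stack_name" \
        --query 'Stacks[0].StackStatus' \
        --output text 2>/dev/null) || status=""
    change_set_type_for_status "$status"
}
```

With this in place, the pipeline could set OPERATION=$(detect_operation "$STACK_NAME"). Note that any lookup failure, including a network error, also maps to CREATE here. A CREATE change set against an existing stack is rejected by CloudFormation, so the mistake surfaces as an error rather than as an unintended deployment.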

Creating a change set does not affect the existing stack. Unlike the update-stack command, the change set approach does not actually deploy anything. Instead, it produces a list of changes that you can review before deployment. You can view the changes in the AWS console, but if you prefer to automate everything you can, check them in the CLI:

# this command is shown for demonstration purposes only.
# a real script should take pagination into account
aws cloudformation describe-change-set \
    --change-set-name "$changeset_id" \
    --query 'Changes[*].ResourceChange.{Action:Action,Resource:ResourceType,ResourceId:LogicalResourceId,ReplacementNeeded:Replacement}' \
    --output table

This command should produce output similar to the following:

--------------------------------------------------------------------
|                         DescribeChangeSet                        |
+---------+--------------------+----------------------+------------+
| Action  | ReplacementNeeded  |      Resource        | ResourceId |
+---------+--------------------+----------------------+------------+
|  Modify | True               |  AWS::ECS::Cluster   |  MyCluster |
|  Replace| True               |  AWS::RDS::DBInstance|  MyDB      |
|  Add    | None               |  AWS::SNS::Topic     |  MyTopic   |
+---------+--------------------+----------------------+------------+

Pay particular attention to changes that replace or delete a resource, that is, where ReplacementNeeded is True or where Action reports a removal (the CloudFormation API names this action Remove). These are the most dangerous changes and usually lead to data loss.
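Reviewing the list by eye does not scale, so a script can pick out the risky rows instead. Below is a minimal sketch, assuming the change set is fetched with --output text so that every change is one tab-separated row. The function name dangerous_changes and the column order are my own choices for illustration.

```shell
# Reads tab-separated rows: Action, Replacement, ResourceType, LogicalResourceId.
# Prints only the risky rows and returns 1 if at least one was found.
# "Remove" is the action name used by the API; "Delete" and "Replace" are
# matched too because they are the labels used in this article.
dangerous_changes() {
    awk -F'\t' '
        $1 == "Remove" || $1 == "Delete" || $1 == "Replace" ||
        $2 == "True"   || $2 == "Conditional" {
            print; found = 1
        }
        END { exit found ? 1 : 0 }
    '
}

# Example wiring (not executed here):
# aws cloudformation describe-change-set \
#     --change-set-name "$changeset_id" \
#     --query 'Changes[].ResourceChange.[Action,Replacement,ResourceType,LogicalResourceId]' \
#     --output text | dangerous_changes \
#     || { echo "Risky changes found, aborting" >&2; exit 1; }
```

Because only the risky rows are printed, they cannot scroll off the screen in a long change list, and the non-zero exit status lets a pipeline stop and ask for explicit confirmation.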

Once the changes have been reviewed, they can be deployed:

aws cloudformation execute-change-set --change-set-name "$changeset_id"
operation_lowercase=$(echo "$OPERATION" | tr '[:upper:]' '[:lower:]')
aws cloudformation wait "stack-${operation_lowercase}-complete" \
    --stack-name "$STACK_NAME"

Lesson 2: Use Stack Policy to Prevent Replacement or Removal of Stateful Resources

Sometimes just looking at the changes is not enough. We are all human and we all make mistakes. Shortly after we started using change sets, my teammate unknowingly performed a deployment that led to a database being replaced. Nothing terrible happened, because it was a testing environment.

Even though our scripts displayed a list of changes and asked for confirmation, the Replace change was overlooked because the list of changes was so long that it did not fit on the screen. And since this was a routine update in the testing environment, nobody paid much attention to the changes.

There are resources that you will never want to replace or remove. These are stateful services, such as an RDS database instance or an Elasticsearch cluster. It would be nice if AWS automatically refused to deploy when the operation would require removing such a resource. Fortunately, CloudFormation has a built-in way to do this. It is called a stack policy, and you can read more about it in the documentation:

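STACK_NAME=$1
RESOURCE_ID=$2
# A stack policy denies every update that no statement explicitly allows,
# so the first statement allows all updates and the second one then
# forbids replacing or deleting the protected resource.
POLICY_JSON=$(cat <<EOF
{
    "Statement" : [{
        "Effect" : "Allow",
        "Action" : "Update:*",
        "Principal": "*",
        "Resource" : "*"
    }, {
        "Effect" : "Deny",
        "Action" : [
            "Update:Replace",
            "Update:Delete"
        ],
        "Principal": "*",
        "Resource" : "LogicalResourceId/$RESOURCE_ID"
    }]
}
EOF
)
aws cloudformation set-stack-policy --stack-name "$STACK_NAME" \
    --stack-policy-body "$POLICY_JSON"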
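A stack usually has more than one stateful resource, for example a database and a search cluster. The sketch below generates a policy for any number of logical IDs. The function name build_stack_policy is hypothetical. It starts with a statement that allows all updates, on the understanding that a stack policy otherwise denies whatever is not explicitly allowed, and then adds one Deny statement per protected resource.

```shell
# Prints a stack policy that allows all updates except replacing or
# deleting the resources whose logical IDs are passed as arguments.
build_stack_policy() {
    local id
    printf '{\n  "Statement": [\n'
    printf '    {"Effect": "Allow", "Action": "Update:*", "Principal": "*", "Resource": "*"}'
    for id in "$@"; do
        printf ',\n    {"Effect": "Deny", "Action": ["Update:Replace", "Update:Delete"], "Principal": "*", "Resource": "LogicalResourceId/%s"}' "$id"
    done
    printf '\n  ]\n}\n'
}

# Example wiring (not executed here; MyDB and MySearch are made-up IDs):
# aws cloudformation set-stack-policy --stack-name "$STACK_NAME" \
#     --stack-policy-body "$(build_stack_policy MyDB MySearch)"
```

One statement per resource keeps the output easy to read and to diff when the list of protected resources changes.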

Lesson 3: Use UsePreviousValue When Updating a Stack with Secret Parameters

When you create an RDS MySQL instance, AWS requires you to provide MasterUsername and MasterUserPassword. Since it is better not to keep secrets in the source code, and I wanted to automate absolutely everything, I implemented a “smart mechanism” in which credentials are fetched from S3 before deployment, and if no credentials are found, new ones are generated and stored in S3.

These credentials were then passed as parameters to the aws cloudformation create-change-set command. While I was experimenting with the script, the connection to S3 was lost, and my “smart mechanism” treated this as a signal to generate new credentials.

If I had started using this script in a production environment and the connection problem had arisen again, it would have updated the stack with new credentials. In this particular case, nothing bad would have happened. However, I abandoned this approach and switched to another: providing credentials only once, when creating the stack. Later, when the stack needs updating, instead of specifying the secret value of the parameter, I simply use UsePreviousValue=true:

aws cloudformation create-change-set \
    --change-set-name "$CHANGE_SET_NAME" \
    --stack-name "$STACK_NAME" \
    --template-body "$TPL_PATH" \
    --change-set-type "UPDATE" \
    --parameters "ParameterKey=MasterUserPassword,UsePreviousValue=true"
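A stack normally has several parameters that should keep their current values during an update. Here is a minimal sketch of a helper that builds the corresponding --parameters entries. The function name previous_value_params is hypothetical. As far as I know, a parameter that is left out of --parameters falls back to its template default rather than to its previous value, so it seems safest to list every parameter you want to preserve.

```shell
# Prints one "ParameterKey=<name>,UsePreviousValue=true" entry per argument.
previous_value_params() {
    local key
    for key in "$@"; do
        printf 'ParameterKey=%s,UsePreviousValue=true\n' "$key"
    done
}

# Example wiring (not executed here; mapfile requires bash 4 or newer):
# mapfile -t params < <(previous_value_params MasterUsername MasterUserPassword)
# aws cloudformation create-change-set \
#     --change-set-name "$CHANGE_SET_NAME" \
#     --stack-name "$STACK_NAME" \
#     --template-body "$TPL_PATH" \
#     --change-set-type "UPDATE" \
#     --parameters "${params[@]}"
```

Passing the entries as separate array elements keeps each parameter in its own argument, which is the form the CLI shorthand syntax expects.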

Lesson 4: Use Rollback Configuration

Another team I worked with was using a CloudFormation feature called rollback configuration. I had not come across it before and quickly realized that it would make deploying my stacks even better. Now I use it every time I deploy my code to Lambda or ECS using CloudFormation.

How it works: you specify a CloudWatch alarm ARN in the --rollback-configuration parameter when you create the change set. Later, when you execute the change set, AWS monitors the alarm for at least one minute. It rolls back the deployment if the alarm changes state to ALARM during this time.

Below is an excerpt from a CloudFormation template in which I create a CloudWatch alarm that tracks a custom CloudWatch metric counting the errors in CloudWatch logs (the metric is created via a MetricFilter):

Resources:
  # this metric tracks number of errors in the cloudwatch logs. In this
  # particular case it's assumed logs are in json format and the error logs are
  # identified by level "error". See FilterPattern
  ErrorMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: !Ref LogGroup
      FilterPattern: !Sub '{$.level = "error"}'
      MetricTransformations:
      - MetricNamespace: !Sub "${AWS::StackName}-log-errors"
        MetricName: Errors
        MetricValue: 1
        DefaultValue: 0
  ErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-errors"
      Namespace: !Sub "${AWS::StackName}-log-errors"
      MetricName: Errors
      Statistic: Maximum
      ComparisonOperator: GreaterThanThreshold
      Period: 60 # in seconds, i.e. 1 minute
      EvaluationPeriods: 1
      Threshold: 0
      TreatMissingData: notBreaching
      ActionsEnabled: true

Now the alarm can be used as a rollback trigger when creating the change set:

ALARM_ARN=$1
ROLLBACK_TRIGGER=$(cat <<EOF
{
  "RollbackTriggers": [
    {
      "Arn": "$ALARM_ARN",
      "Type": "AWS::CloudWatch::Alarm"
    }
  ],
  "MonitoringTimeInMinutes": 1
}
EOF
)
aws cloudformation create-change-set \
    --change-set-name "$CHANGE_SET_NAME" \
    --stack-name "$STACK_NAME" \
    --template-body "$TPL_PATH" \
    --change-set-type "UPDATE" \
    --rollback-configuration "$ROLLBACK_TRIGGER"
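When several alarms should guard the same deployment, the JSON can be generated instead of written by hand. The sketch below uses a hypothetical helper, build_rollback_config, which takes the monitoring time followed by the alarm ARNs. The limits it checks (at most 180 minutes and at most 5 triggers) are the ones I understand CloudFormation to enforce, so treat them as assumptions.

```shell
# Usage: build_rollback_config <minutes> <alarm-arn>...
# Prints a rollback configuration document, or returns 1 on invalid input.
build_rollback_config() {
    local minutes="$1" arn sep=""
    case "$minutes" in
        ''|*[!0-9]*)
            echo "monitoring time must be a whole number of minutes" >&2
            return 1 ;;
    esac
    shift
    if [ "$minutes" -gt 180 ] || [ "$#" -gt 5 ]; then
        echo "expected at most 180 minutes and at most 5 alarms" >&2
        return 1
    fi
    printf '{\n  "RollbackTriggers": ['
    for arn in "$@"; do
        printf '%s\n    {"Arn": "%s", "Type": "AWS::CloudWatch::Alarm"}' "$sep" "$arn"
        sep=","
    done
    printf '\n  ],\n  "MonitoringTimeInMinutes": %d\n}\n' "$minutes"
}

# Example wiring (not executed here):
# ROLLBACK_TRIGGER=$(build_rollback_config 1 "$ALARM_ARN") || exit 1
```

Validating the input before calling the CLI means a bad value is reported by the script itself rather than by a failed API request in the middle of a deployment.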

Lesson 5: Make Sure You Deploy the Latest Version of the Template

It is easy to deploy a version of the CloudFormation template that is not the latest one, and doing so can cause a lot of damage. It happened to us once: a developer did not pull the latest changes from Git and unknowingly deployed a previous version of the stack. This led to downtime for the application that used this stack.

Something simple, like adding a check to see if a branch is up to date before doing the deployment, would be fine (assuming git is your version control tool):

git fetch
HEADHASH=$(git rev-parse HEAD)
UPSTREAMHASH=$(git rev-parse master@{upstream})
if [[ "$HEADHASH" != "$UPSTREAMHASH" ]] ; then
   echo "Branch is not up to date with origin. Aborting"
   exit 1
fi

Lesson 6: Don't Reinvent the Wheel

Deploying with CloudFormation might seem easy. You just need a bunch of bash scripts that run AWS CLI commands.

4 years ago, I started with simple scripts that called the aws cloudformation create-stack command. Soon, the script was no longer simple. Each lesson learned made it more and more complex. It was not only complicated but also full of bugs.

Now I work in a small IT department. Experience has shown that each team has its own way of deploying CloudFormation stacks. And that's bad. It would be better if everyone used a single approach. Fortunately, there are many tools that help you deploy and configure CloudFormation stacks.

These lessons will help you avoid mistakes.
