
The world is not perfect

The world is not perfect. At any moment, something may go wrong. Fortunately, most of us do not launch rockets into space or build aircraft.
A modern person depends on the apps on their phone, and our job is to make sure that at any moment, under any circumstances, they can open an app and see pictures of cats.
People are not perfect. We constantly make mistakes. We make typos, forget things, or give in to laziness. A person can simply go on a bender or get hit by a car.
Hardware is not perfect. Hard drives die. Data centers lose their uplinks. Processors overheat and power grids fail.
Software is not perfect. Memory leaks. Connections drop. Replicas break and data vanishes into oblivion.
Shit happens, as our overseas friends say. What can we do about all this? The answer is trivially simple: nothing. We can test endlessly, spin up a ton of environments, copy production, and keep a hundred thousand backup servers, but it still will not save us: the world is not perfect.
The only right decision here is to come to terms with it. You need to accept the world as it is and minimize losses. Every time you set up a new service, remember: it will break at the most inopportune moment.
It will definitely break. You will definitely make a mistake. The hardware is sure to fail. The cluster will surely fall apart. And by the laws of this imperfect world, it will happen exactly when you least expect it.
What do most of us do to fool everyone (including ourselves)? We set up alerts. We write clever metrics, collect logs, and create alerts: thousands, hundreds of thousands of alerts. Our mailboxes are full. Our phones are bursting with SMS messages and calls. We seat entire floors of people to stare at charts. And when we lose access to the service yet again, the post-mortem begins: what did we forget to monitor?
All this is just the appearance of reliability. No amount of alerts, metrics, and monitoring will help.
Today they called you and you fixed the service, and no one noticed that anything had broken. Tomorrow you are off to the mountains. And the day after tomorrow you are on a bender. People are not perfect. Fortunately, we are engineers: we live in an imperfect world and learn to beat it.
So why should you have to wake up at night, or read mail in the morning instead of having coffee? Why should a business depend on one person and on whether he is able to work? Why? I do not understand.
I just understand that you can’t live like that, and I don’t want to live like that. And the answer is simple: Automate it (yes, with a capital letter). We need more than just alerts and calls at night. We need automatic responses to these messages. We must be sure that the system can fix itself. The system must be flexible and able to change.
Unfortunately, we do not have a smart enough AI yet. Fortunately, all our problems are formalizable.
I don’t have a silver bullet, but I have a Proof of Concept for AWS.
AWS Lambda
Serverless - first of all, what is not running cannot break.
Event-based - it receives an event, processes it, and shuts down.
Runs on the JVM - which means you can reuse all the experience from the Java world (and means that I can use Clojure).
Third-party - there is no need to monitor and support AWS Lambda itself.
The pipeline is as follows:
Event -> SNS Topic -> AWS Lambda -> Reaction
By the way, an SNS topic can have several endpoints. So you can simply add an email endpoint and receive the same notifications by mail. And we can extend the Lambda function to make the notifications much more useful: for example, send alerts together with charts right away, or add SMS sending.
A complete example of a single Lambda function can be found at: github.com/lowl4tency/aws-lambda-example
This Lambda function kills all nodes in an ELB that are not InService.
Code walkthrough
In this example, we will kill all nodes that are not in the InService state. By the way, the whole Lambda function takes about 50 lines of code in a single file, which makes it easy to maintain and easy to get started with.
Any Clojure project starts with project.clj.
I used the official Java SDK and the excellent Amazonica library, which is a wrapper around this SDK. To avoid dragging in too much, we exclude the parts of the SDK that we do not need:
[amazonica "0.3.52" :exclusions [com.amazonaws/aws-java-sdk]]
[com.amazonaws/aws-java-sdk-core "1.10.62"]
[com.amazonaws/aws-lambda-java-core "1.1.0"]
[com.amazonaws/aws-java-sdk-elasticloadbalancing "1.11.26"
 :exclusions [joda-time]]
[com.amazonaws/aws-java-sdk-ec2 "1.10.62"
 :exclusions [joda-time]]
[com.amazonaws/aws-lambda-java-events "1.1.0"
 :exclusions [com.amazonaws/aws-java-sdk-dynamodb
              com.amazonaws/aws-java-sdk-kinesis
              com.amazonaws/aws-java-sdk-cognitoidentity
              com.amazonaws/aws-java-sdk-sns
              com.amazonaws/aws-java-sdk-s3]]
For greater flexibility, each Lambda function reads its settings from a plain edn configuration file. To be able to handle events, we need to slightly modify the namespace declaration so that the generated class implements RequestStreamHandler (the :require clauses are omitted here):
(ns aws-lambda-example.core
  (:gen-class
   :implements [com.amazonaws.services.lambda.runtime.RequestStreamHandler]))
Entry point. We read the input event, process it with handle-event, and write the result to the output stream as JSON.
(defn -handleRequest
  "Parser of input and generator of JSON output"
  [this is os context]
  (let [w (io/writer os)]
    ;; Parse the incoming JSON, turn string keys into keywords,
    ;; handle the event and serialize the result back to JSON.
    (-> (io/reader is)
        json/read
        walk/keywordize-keys
        handle-event
        (json/write w))
    (.flush w)))
Workhorse:
(defn handle-event [event]
  (let [instances (get-elb-instances-status
                   (:load-balancer-name
                    (edn/read-string (slurp (io/resource "config.edn")))))
        unhealthy (unhealthy-elb-instances instances)]
    (when (seq unhealthy)
      (pprint "The next instances are unhealthy: ")
      (pprint unhealthy)
      (ec2/terminate-instances :instance-ids unhealthy))
    {:message (get-in event [:Records 0 :Sns :Message])
     :elb-instance-ids (mapv :instance-id instances)}))
We get the list of nodes in the ELB and filter them by status. All nodes that are in the InService state are removed from the list. The rest are terminated.
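The config.edn file that handle-event reads is not shown in the article. A minimal sketch of what it might contain, assuming the only setting is the :load-balancer-name key that handle-event looks up (the name "my-elb" is made up for illustration, and the text is read from a string here instead of from the classpath):

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical contents of resources/config.edn; "my-elb" is a made-up name.
(def config-text "{:load-balancer-name \"my-elb\"}")

;; edn/read-string turns the text into a plain Clojure map.
(def config (edn/read-string config-text))

(:load-balancer-name config)
;; => "my-elb"
```

Keeping the ELB name in a resource file means the same code can be packaged for a different load balancer by changing only the config.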
Everything that we print with pprint goes to CloudWatch Logs. Since the Lambda is not running all the time and there is no way to connect a REPL to it, this can be quite useful for debugging.
    {:message (get-in event [:Records 0 :Sns :Message])
     :elb-instance-ids (mapv :instance-id instances)}))
At this point, the whole structure that we build and return from this function is written out as JSON, and we see it as the execution result in the Lambda web interface.
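To make the get-in path less magical, here is a small sketch of how the message is pulled out of an SNS notification. The event below is a heavily trimmed, hypothetical example of what the parsed JSON looks like (real events carry many more fields, and the subject and message texts are made up):

```clojure
(require '[clojure.walk :as walk])

;; Trimmed, hypothetical SNS event as it looks right after JSON parsing:
;; all keys are still strings.
(def raw-event
  {"Records" [{"EventSource" "aws:sns"
               "Sns" {"Subject" "ALARM: unhealthy hosts"
                      "Message" "hypothetical alarm payload"}}]})

;; The same step the handler performs before calling handle-event.
(def event (walk/keywordize-keys raw-event))

;; First record, its :Sns part, then the :Message field.
(get-in event [:Records 0 :Sns :Message])
;; => "hypothetical alarm payload"
```

If the path does not exist (for example, when the function is invoked by hand with an empty test event), get-in simply returns nil instead of throwing.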
In the unhealthy-elb-instances function, we filter our list and keep the instance-id only for those nodes that the ELB considers out of service. That is, we take the list of instances and filter them by state.
(defn unhealthy-elb-instances [instances-status]
  (->>
   instances-status
   (remove #(= (:state %) "InService"))
   (map :instance-id)))
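Because this function is pure, it is easy to try out at the REPL without touching AWS. A small sketch with hypothetical sample data (the instance ids are made up; the states are the ones a classic ELB reports: InService, OutOfService, and Unknown):

```clojure
(defn unhealthy-elb-instances
  "Returns the ids of all instances whose state is not \"InService\"."
  [instances-status]
  (->> instances-status
       (remove #(= (:state %) "InService"))
       (map :instance-id)))

;; Hypothetical sample data; the ids are made up for illustration.
(def sample-status
  [{:instance-id "i-aaa" :state "InService"}
   {:instance-id "i-bbb" :state "OutOfService"}
   {:instance-id "i-ccc" :state "Unknown"}])

(unhealthy-elb-instances sample-status)
;; => ("i-bbb" "i-ccc")
```

Note that anything other than InService counts as unhealthy here, so a node in the Unknown state is terminated as well.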
In the get-elb-instances-status function, we call the API method and get a list of all nodes with their statuses for one specific ELB:
(defn get-elb-instances-status [elb-name]
  (->>
   (elb/describe-instance-health :load-balancer-name elb-name)
   :instance-states
   (map get-health-status)))
For convenience, we drop everything we do not need and build a list that contains only the information we are interested in: the instance-id and the state of each instance.
(defn get-health-status [instance]
  {:instance-id (:instance-id instance)
   :state (:state instance)})
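The trimming step can also be checked without calling AWS. The sketch below feeds get-health-status a hand-written map that imitates what describe-instance-health returns through Amazonica; the exact set of extra keys (:reason-code, :description) and all the values are an assumption for illustration:

```clojure
(defn get-health-status [instance]
  {:instance-id (:instance-id instance)
   :state       (:state instance)})

;; Assumed response shape; ids, reason codes and descriptions are made up.
(def sample-response
  {:instance-states
   [{:instance-id "i-aaa" :state "InService"
     :reason-code "N/A" :description "N/A"}
    {:instance-id "i-bbb" :state "OutOfService"
     :reason-code "Instance" :description "Health checks failed"}]})

;; Same threading as in get-elb-instances-status, minus the API call.
(def trimmed
  (->> sample-response
       :instance-states
       (map get-health-status)))

trimmed
;; => ({:instance-id "i-aaa", :state "InService"}
;;     {:instance-id "i-bbb", :state "OutOfService"})
```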
Finally, unhealthy-elb-instances (shown above) filters this list, removing the nodes that are in the InService state.
And that’s all: 50 lines that let you sleep through the night and calmly go to the mountains.
Deployment
For ease of deployment, I use a simple bash script:
#!/bin/bash
# Uploads the standalone jar as a new AWS Lambda function
aws lambda create-function --debug \
  --function-name example \
  --handler aws-lambda-example.core \
  --runtime java8 \
  --memory-size 256 \
  --timeout 59 \
  --role arn:aws:iam::611066707117:role/lambda_exec_role \
  --zip-file fileb://./target/aws-lambda-example-0.1.0-SNAPSHOT-standalone.jar
Set up an alert and attach it to the SNS topic. Subscribe the Lambda to the SNS topic as an endpoint. Then calmly go to the mountains or fall under a car.
By the way, thanks to this flexibility, you can program any behavior of the system, driven not only by system metrics but also by business metrics.
Thanks.