Options for building highly available systems in AWS. Overcoming outages. Part 1

Even cloud industry giants like Amazon have hardware issues. In light of the recent outages in the US East-1 region, this article may prove useful.


Fault tolerance is one of the key characteristics of any cloud system. Yet every day many applications are designed and deployed on AWS without it in mind. The reasons range from not knowing how to properly design a fault-tolerant system to the high cost of building a fully highly available system from AWS services. This article highlights several solutions that help survive provider hardware outages and build a more resilient solution on top of the AWS infrastructure.
A typical Internet application consists of the following tiers: DNS, load balancer, web server, application server, database, and cache. Let's take this stack and examine in detail the main points that must be taken into account when building a highly available system:
  • Building a highly available system in AWS
  • High availability at the web server / application server level
  • High availability at the load balancing / DNS level
  • High availability at the database level
  • Building a highly available system between AWS availability zones
  • Building a highly available system between AWS regions
  • Building a highly available system between different cloud and hosting providers

(The last three items will be covered in Part 2.)

High availability at the web server / application server level

To prevent this tier from becoming a single point of failure (SPOF), it is common practice to run a web application on two or more EC2 virtual server instances. This provides higher fault tolerance than a single server. Application servers and web servers can be configured either with or without health checks. The following are the most common architectural solutions for highly available systems that use health checks:

[image]
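As a rough illustration of the multi-instance setup above, the sketch below launches two web server instances in different availability zones using the boto3 SDK (a modern tool; the original article predates it). The AMI ID, key pair name, and instance type are hypothetical placeholders.

```python
# Sketch: launch two web server instances in different availability zones
# so that one host failure does not take the whole tier down.
# The AMI ID, key pair name, and instance type are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

for zone in ["us-east-1a", "us-east-1b"]:
    ec2.run_instances(
        ImageId="ami-12345678",      # placeholder image with the web app baked in
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        KeyName="my-key",            # placeholder key pair
        Placement={"AvailabilityZone": zone},
    )
```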


Key points to pay attention to when building such a system:

  • Since the current AWS infrastructure does not support multicast at the application level, data must be synchronized over regular unicast TCP. For example, Java applications can use JGroups, Terracotta NAM, or similar software to synchronize data between servers. In the simplest case you can use one-way synchronization with rsync; a more universal and reliable solution is a distributed network file system such as GlusterFS.
  • You can use Memcached on EC2, ElastiCache, or Amazon DynamoDB to store user data and session information (see the session-store sketch after this list). For greater reliability, an ElastiCache cluster can be deployed across several AWS availability zones.
  • Using Elastic IP to switch between servers is not recommended for highly critical systems, as reassignment can take up to two minutes.
  • User data and sessions can be stored in a database. Use this mechanism with caution and evaluate the read/write latency it adds.
  • Files and documents uploaded by users should be stored on a network file system such as NFS or a Gluster storage pool, or in Amazon S3.
  • An Amazon ELB or reverse proxy session stickiness policy should be enabled if sessions are not synchronized through a shared store, database, or similar mechanism. This approach provides high availability, but does not provide fault tolerance at the application level.
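As a minimal sketch of the shared-session idea from the list above: the snippet keeps session data in a Memcached endpoint (such as an ElastiCache node), so any web server can serve any user. The endpoint address and TTL are assumptions, and pymemcache is just one of several suitable clients.

```python
# Sketch: keep user sessions in a shared Memcached/ElastiCache node so that
# any web server can serve any user and a failed server loses nothing.
# The endpoint address and TTL are hypothetical placeholders.
import json
from pymemcache.client.base import Client

# ElastiCache exposes a plain Memcached endpoint; this address is made up.
sessions = Client(("my-cache.abc123.use1.cache.amazonaws.com", 11211))

def save_session(session_id, data, ttl=3600):
    sessions.set("session:" + session_id, json.dumps(data).encode("utf-8"), expire=ttl)

def load_session(session_id):
    raw = sessions.get("session:" + session_id)
    return json.loads(raw) if raw is not None else None

save_session("42", {"user": "alice", "cart": [1, 2, 3]})
print(load_session("42"))
```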


High availability at the load balancing / DNS level

The DNS / load balancing level is the main entry point into a web application. There is no point in building complex clusters and heavily replicated web farms at the application and database levels without making the DNS / LB level highly available. If the load balancer is a single point of failure, its failure makes the entire system unavailable. The following are the most common solutions for providing high availability at the load balancer level:

[image]

1) Using Amazon Elastic Load Balancer as a highly available load balancer. Amazon ELB automatically distributes application load across multiple EC2 servers. The service provides more than basic application fault tolerance: it gradually scales the capacity across which the load is distributed according to the intensity of incoming traffic. It can serve several thousand simultaneous connections and flexibly expands as the load grows. ELB is inherently a fault-tolerant component that can recover from failures on its own: when the load increases, additional ELB EC2 virtual machines are added automatically at the ELB level. This eliminates a single point of failure, and the load balancing mechanism keeps working even if some of the ELB EC2 virtual machines fail. Amazon ELB also automatically checks the availability of the services it balances and, in case of problems, routes requests to the servers that remain available. Amazon ELB can be configured either for simple round-robin load balancing without health checks, or with session stickiness and health checks. If session synchronization is not implemented, even session stickiness cannot guarantee the absence of application errors when one of the servers fails and its users are redirected to an available server.
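As a sketch of the setup just described, using the modern boto3 SDK as a stand-in for the era's tooling: create a classic ELB spanning two availability zones, attach a health check, and register the web servers. The balancer name, health-check target, and instance IDs are placeholder assumptions.

```python
# Sketch: classic ELB with a health check, spanning two availability zones.
# Balancer name, AZs, health-check target, and instance IDs are placeholders.
import boto3

elb = boto3.client("elb", region_name="us-east-1")

elb.create_load_balancer(
    LoadBalancerName="web-lb",
    AvailabilityZones=["us-east-1a", "us-east-1b"],
    Listeners=[{"Protocol": "HTTP", "LoadBalancerPort": 80, "InstancePort": 80}],
)

# Route traffic only to instances that answer on /health.
elb.configure_health_check(
    LoadBalancerName="web-lb",
    HealthCheck={
        "Target": "HTTP:80/health",
        "Interval": 30,
        "Timeout": 5,
        "UnhealthyThreshold": 2,
        "HealthyThreshold": 2,
    },
)

elb.register_instances_with_load_balancer(
    LoadBalancerName="web-lb",
    Instances=[{"InstanceId": "i-0aaa"}, {"InstanceId": "i-0bbb"}],
)
```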

2) Sometimes applications require:
- Complex load balancing with the possibility of caching (Varnish)
- Specific load balancing algorithms (a minimal sketch of least connections follows this list):
  - Least connections: servers with fewer active connections receive more requests
  - Weighted least connections: servers with fewer active connections and more capacity receive more requests
  - Distribution by destination (destination hash scheduling)
  - Distribution by source (source hash scheduling)
  - Locality-based least-connection scheduling: servers with fewer active connections receive more requests, taking destination IP addresses into account
- Handling large short-term bursts of load
- A fixed IP address on the load balancer
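To make the least-connections algorithm above concrete, here is a minimal in-process sketch of the selection policy. It is not tied to any particular balancer, and the backend addresses are made up.

```python
# Sketch: the least-connections policy - always pick the backend that
# currently has the fewest active connections. Backend addresses are made up.
active = {"10.0.0.1": 0, "10.0.0.2": 0, "10.0.0.3": 0}

def pick_backend():
    return min(active, key=active.get)

def handle_request():
    backend = pick_backend()
    active[backend] += 1      # connection opened
    try:
        pass                  # ... proxy the request to `backend` here ...
    finally:
        active[backend] -= 1  # connection closed

for _ in range(5):
    handle_request()
```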

In all of the above cases, Amazon ELB is not suitable. It is better to use third-party balancers or reverse proxies such as Nginx, Zeus, HAProxy, or Varnish. At the same time, you must ensure there is no single point of failure; the simplest way to do this is to run several balancers. The Zeus reverse proxy has built-in clustering functionality; for the other services, round-robin DNS distribution is required. We will look at this mechanism in more detail below, but first let's identify a few key points to consider when building a robust load balancing system in AWS:

  • Several Nginx or HAProxy instances can be configured to provide high availability in AWS; these services can check the availability of a service and distribute requests between the available servers.
  • Nginx or HAProxy can be configured for plain round-robin distribution if the application does not support health checks. Both services also support session stickiness, but if session synchronization is not ensured properly, this does not guarantee the absence of application-level errors when one server fails.
  • Horizontal scaling of load balancers is preferable to vertical scaling: it increases the number of individual machines performing the balancing function, eliminating a single point of failure. To scale balancers such as Nginx and HAProxy, you need to develop your own scripts and system images; using Amazon Auto Scaling is not recommended in this case.
  • To monitor the availability of the balancer servers, you can use Amazon CloudWatch or third-party monitoring systems such as Nagios, Zabbix, or Icinga, and if one of the servers becomes unavailable, start a new balancer instance within a few minutes using scripts and the EC2 command-line utilities (a minimal failover sketch follows this list).
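A minimal sketch of the failover idea from the last point, using boto3 as a stand-in for the EC2 command-line utilities mentioned above; the health-check URL and the balancer AMI are assumptions:

```python
# Sketch: if a balancer stops answering health checks, launch a replacement
# instance from a prebuilt balancer image. URL and AMI ID are placeholders.
import boto3
import urllib.request

def balancer_alive(url="http://balancer.example.com/ping", timeout=5):
    try:
        return urllib.request.urlopen(url, timeout=timeout).getcode() == 200
    except OSError:
        return False

if not balancer_alive():
    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.run_instances(
        ImageId="ami-balancer1",   # placeholder image with Nginx/HAProxy preconfigured
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
    )
```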


Now let's discuss the level that sits above the balancer: DNS. Amazon Route 53 is a highly available, reliable, and scalable DNS service. It can effectively route user requests both to Amazon services such as EC2, S3, and ELB, and outside the AWS infrastructure. Route 53 is essentially a fault-tolerant managed DNS server and can be configured either through the command-line interface or through the web console. The service supports both round-robin and weighted load distribution and can route requests to individual EC2 servers behind a balancer as well as to Amazon ELB. With plain round-robin distribution, health checking and failover to available servers do not work and must be handled at the application level.
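As a sketch of the weighted distribution just mentioned, here are two weighted Route 53 records splitting traffic between two balancers (boto3; the hosted zone ID, domain, and IP addresses are placeholder assumptions):

```python
# Sketch: two weighted A records so Route 53 splits traffic between
# two balancer IPs. Zone ID, domain, and addresses are placeholders.
import boto3

r53 = boto3.client("route53")

for identifier, ip, weight in [("lb-a", "203.0.113.10", 60), ("lb-b", "203.0.113.20", 40)]:
    r53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.com.",
                "Type": "A",
                "SetIdentifier": identifier,   # required for weighted records
                "Weight": weight,
                "TTL": 60,
                "ResourceRecords": [{"Value": ip}],
            },
        }]},
    )
```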

High availability at the database level

Data is the most valuable part of any application, and designing high availability at the database level is the highest priority in any highly available system. To eliminate a single point of failure at the database level, it is common practice to use multiple database servers with data replication between them, either as a cluster or as a master-slave pair. Let's look at the most popular solutions to this problem within AWS:

[image]

1) Using master-slave replication.
We can use one EC2 server as the primary (master) and one or more as secondary servers (slaves). If these servers are in the public cloud, you need to use Elastic IP addresses; in a private cloud (VPC), the servers can reach each other through private IP addresses. In this mode the databases can use asynchronous replication. When the main database server fails, we can switch a secondary server to master mode using our own scripts (a sketch follows below), thereby ensuring high availability. Replication between the servers can run in Active-to-Active or Active-to-Passive mode. In the first case, write operations (and the intermediate read and write operations that accompany them) must be performed on the primary server, while read operations can be performed on the secondary server. In the second case, all read and write operations are performed only on the primary server, and go to the secondary server only after it has been switched to master mode because the primary is unavailable. It is recommended to use EBS-backed EC2 instances for the database servers to ensure reliability and durability at the disk level. For additional performance and data integrity, the EC2 database servers can be configured with various RAID options within AWS.
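A minimal sketch of the "switch the secondary to master with our own scripts" step, assuming MySQL asynchronous replication and the PyMySQL client; host names, credentials, and the Elastic IP are placeholders, and a real failover script needs far more safety checks:

```python
# Sketch: promote a MySQL slave to master after the primary fails, then
# move the shared Elastic IP to it. Hosts, credentials, and the EIP are
# placeholders; production failover needs many more safeguards.
import boto3
import pymysql

slave = pymysql.connect(host="10.0.1.20", user="admin", password="secret", autocommit=True)
with slave.cursor() as cur:
    cur.execute("STOP SLAVE")                  # stop applying replication events
    cur.execute("RESET SLAVE ALL")             # forget the old master entirely
    cur.execute("SET GLOBAL read_only = OFF")  # start accepting writes

# Repoint the public entry address at the promoted server (EC2-Classic style;
# in a VPC you would pass AllocationId instead of PublicIp).
ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.associate_address(InstanceId="i-0slave123", PublicIp="198.51.100.5")
```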

2) MySQL NDB Cluster.
We can configure two or more MySQL EC2 servers as SQL and data nodes for data storage, plus one management MySQL server, to create a cluster. The data nodes in the cluster replicate data between each other synchronously, so read and write operations can be distributed across all data storage nodes simultaneously. When one of the storage nodes in the cluster fails, another remains active and processes all incoming requests. If a public cloud is used, Elastic IP addresses are required for each server in the cluster; if a private cloud is used, internal IP addresses can be used. It is recommended to use EBS-backed EC2 instances for the database servers to ensure reliability and durability at the disk level.

3) Using availability zones with RDS.
If we use Amazon RDS MySQL at the database level, we can create one master server in one availability zone and a hot-standby server in another availability zone. In addition, we can have several secondary Read Replica servers in several availability zones. The primary and standby RDS nodes replicate data between themselves synchronously, while the Read Replica servers use asynchronous replication. When the master RDS server becomes unavailable, the hot standby automatically takes over at the same address within a few minutes. All write operations (and the intermediate read and write operations that accompany them) must be performed on the master server; read operations can be performed on the Read Replica servers. All RDS instances use EBS storage.
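A sketch of the RDS layout above in boto3: a Multi-AZ MySQL instance (master plus synchronous hot standby) and one read replica in another availability zone. All identifiers, sizes, and credentials are placeholder assumptions.

```python
# Sketch: Multi-AZ RDS MySQL (automatic hot standby) plus a read replica
# in a different availability zone. All identifiers are placeholders.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Master with a synchronously replicated standby in another AZ.
rds.create_db_instance(
    DBInstanceIdentifier="app-db",
    Engine="mysql",
    DBInstanceClass="db.t3.small",
    AllocatedStorage=50,
    MasterUsername="admin",
    MasterUserPassword="change-me",
    MultiAZ=True,
)

# Asynchronous read replica for scaling reads.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="app-db-replica-1",
    SourceDBInstanceIdentifier="app-db",
    AvailabilityZone="us-east-1c",
)
```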

The remaining items will be discussed in the second part:
  • Building a highly available system between AWS availability zones
  • Building a highly available system between AWS regions
  • Building a highly available system between different cloud and hosting providers


Original article: harish11g.blogspot.in/2012/06/aws-high-availability-outage.html
Posted by Harish Ganesan
