The AWS Shared Responsibility Model for Resiliency gives cloud architects a framework for how to work with AWS to build more resilient, reliable, and, ultimately, available applications on AWS. Resiliency on AWS is different from resiliency on-premises or in private clouds, but it boils down to this: AWS is responsible for the resiliency of the cloud (including all the infrastructure and hardware), and you’re responsible for the resiliency of your application in the cloud.

But what does that mean when it comes to actually designing the architecture for your applications? The Reliability Pillar of the AWS Well-Architected Framework lays out best practices, and while many of them are pretty straightforward, there are a few high-level best practices that we’ve found are essential for companies running performant, reliable applications on AWS.

Availability vs. resiliency vs. reliability

You wouldn’t believe how many times we hear the terms availability, resiliency, and reliability used interchangeably within the same conversation. But each of these terms has a specific meaning with specific metrics and goals. As you build your AWS architecture, it’s a good idea to align around a core vocabulary that can help define your efforts.

  • Availability - A direct measure of uptime and downtime, usually expressed as a percentage of uptime (e.g. 99.99%) or an amount of downtime (e.g. 52.60 min/yr or 4.38 min/mo). This is a customer-facing metric calculated from the ratio of uptime to total time (see the quick calculation after this list).
  • Resiliency - A measure of how well a system can recover and adapt when there are disruptions, increased (or decreased) error rates, network interruptions, and so on. The more resilient a system is, the more it can respond correctly when changes occur.
  • Reliability - A measure of the ability of a workload to perform its intended function correctly and consistently when it’s expected to. The more reliable your systems are, the more you and your customers can have confidence in them.
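
To see how those downtime numbers fall out, here’s a quick back-of-the-envelope calculation (a plain Python sketch, not tied to any AWS service) that turns an availability target into a downtime budget:

```python
# Convert an availability target into an allowed downtime budget.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes, averaging in leap years

def downtime_budget(availability_pct: float) -> tuple[float, float]:
    """Return (minutes of downtime per year, minutes per month) for a given availability %."""
    downtime_fraction = 1 - availability_pct / 100
    per_year = MINUTES_PER_YEAR * downtime_fraction
    return per_year, per_year / 12

yearly, monthly = downtime_budget(99.99)
print(f"99.99% availability allows ~{yearly:.2f} min/yr or ~{monthly:.2f} min/mo of downtime")
# -> roughly 52.60 min/yr and 4.38 min/mo, matching the figures above
```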

Customers and leadership often look at availability, which comes from efforts to improve the resiliency and reliability of systems. Here’s a good way to think about it:

Reliability is about the actions your organization takes to ensure systems perform as expected, resiliency is about improving how well your systems respond when things change, and availability is the measurable result of those efforts.

Key cloud resilience design principles

Cloud systems are ephemeral, adapting and changing as needed in response to customer demands and infrastructure shifts. When designing your AWS architecture, you want to incorporate core resilience design principles to make sure that it can automatically respond correctly.

Embrace AWS Scalability and autoscaling

One of the biggest advantages of the AWS cloud is its ability to scale horizontally to adapt to changes in demand. When done correctly, your system scales out and in as needed, keeping your costs low while making sure your applications have the resources they need for surges in users.

In practice, you should adopt a bias towards a distributed model. Instead of a few large resources, distribute the workload among several smaller ones. The increased granularity lets the system provision closer to the right level of resources, and it also improves resiliency: rather than concentrating risk in a single point of failure, smaller, distributed resources reduce the impact if any one of them goes down.
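
As a minimal sketch of what this can look like in practice, here’s a target tracking scaling policy applied to an EC2 Auto Scaling group with boto3 (the group name and CPU target are placeholder values for illustration):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# A target tracking policy scales the group out and in automatically to hold
# average CPU near the target, so no one has to add or remove instances by hand.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",          # placeholder: your Auto Scaling group
    PolicyName="keep-cpu-near-50-percent",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,                 # scale out above ~50% CPU, in below it
    },
)
```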

Incorporate AWS redundancy and availability zones

Servers and networks will go down or become unavailable. Even with the high levels of availability from AWS servers, things like floods, bad equipment, power outages, and more can take down servers and data centers.

Fortunately, AWS makes redundancy as easy as spinning up replicated resources in different availability zones and regions. Make sure your architecture can handle sudden infrastructure outages by building in redundancy and making sure it fails over correctly in case servers or resources suddenly become unavailable.
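
Here’s one sketch of building that redundancy in: an Auto Scaling group created with boto3 that spreads instances across subnets in three different Availability Zones (the group name, launch template, and subnet IDs are placeholders):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Spreading the group across subnets in different Availability Zones means a
# single data center outage doesn't take down every instance at once.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",                          # placeholder name
    MinSize=2,
    MaxSize=6,
    DesiredCapacity=2,
    LaunchTemplate={"LaunchTemplateName": "web-template"},   # placeholder launch template
    # One subnet per Availability Zone (placeholder subnet IDs).
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222,subnet-cccc3333",
    HealthCheckType="ELB",          # replace instances that fail load balancer health checks
    HealthCheckGracePeriod=300,
)
```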

Integrate automation for managing change and failure

AWS architectures have a lot of moving parts, and a lot of policies controlling how those parts move. Whenever possible, try to design your architecture to manage these policies using automation instead of having to manually track performance metrics or roll out changes.

On the change management level, use automation to change your infrastructure (including changes to the automation itself). Not only does this create a method for tracking and reviewing changes, it also makes it easier to roll back if something goes wrong.
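
One way to put this into practice is to propose infrastructure changes as CloudFormation change sets rather than editing resources by hand. The sketch below uses boto3, with a placeholder stack name, change set name, and template file:

```python
import boto3

cloudformation = boto3.client("cloudformation")

# Propose the change as a change set instead of modifying resources directly:
# the diff can be reviewed, and CloudFormation rolls back automatically if the
# update fails partway through.
with open("template.yaml") as f:           # placeholder template file
    template_body = f.read()

cloudformation.create_change_set(
    StackName="payments-service",          # placeholder stack name
    ChangeSetName="add-read-replica",      # placeholder change set name
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],
)

# After a human (or a pipeline gate) reviews the proposed changes:
# cloudformation.execute_change_set(StackName="payments-service",
#                                   ChangeSetName="add-read-replica")
```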

Automation should also be used for gathering, monitoring, and responding to KPI metrics. Aim to create automatic notification and tracking of failures, as well as automated recovery processes that can work around or repair the failure, such as restarting problem pods or rerouting traffic to alternate resources.
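
As a small illustration of automated recovery, the boto3 sketch below creates a CloudWatch alarm that uses the built-in EC2 recover action to restart an instance that fails its system status checks (the instance ID and region are placeholders):

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# If the instance fails its system status checks, the alarm fires and the
# built-in "recover" action moves it to healthy hardware without paging a human.
cloudwatch.put_metric_alarm(
    AlarmName="auto-recover-web-1",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_System",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:automate:us-east-1:ec2:recover"],  # built-in EC2 recovery action
)
```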

Account for dependencies and microservices

Over the last few years, AWS architectures have generally evolved away from monolithic applications to distributed microservice architectures. In fact, AWS recommends that monolithic architecture should be avoided whenever possible. Even if you have to use monolithic architectures, they should be modular so they can be migrated to service-oriented or microservice architectures in the future.

The distributed nature of microservice architectures makes them more resilient and adaptable, but it also creates a web of dependencies between services—and these become points of failure if a dependency is down. Soft dependencies allow your system to continue running, albeit usually in a diminished capacity, while hard, or critical, dependencies will crash your system and create an outage. 

Dependencies are a major cause of cascading failures, so you want to make sure your architecture behaves as designed. Map dependencies with automation so hidden or unknown dependencies don’t cause outages, and test your system’s response to outages to verify whether each dependency is hard or soft. Ideally, work out how to transition all hard dependencies into soft ones.
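
For example, a hard dependency can often be softened with a timeout and a fallback. The sketch below wraps a hypothetical recommendations service so that a failure degrades the response instead of crashing the caller:

```python
import requests

FALLBACK_RECOMMENDATIONS = ["bestsellers"]  # safe default when the dependency is down

def get_recommendations(user_id: str) -> list[str]:
    """Treat the recommendations service as a soft dependency.

    If the call fails or times out, degrade gracefully instead of letting
    the failure cascade into the critical path.
    """
    try:
        resp = requests.get(
            f"https://recommendations.internal/users/{user_id}",  # hypothetical service
            timeout=0.5,  # fail fast rather than hanging the caller
        )
        resp.raise_for_status()
        return resp.json()["items"]
    except requests.RequestException:
        # Soft dependency: the page still renders, just with generic content.
        return FALLBACK_RECOMMENDATIONS
```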

Develop and document clear standards for your distributed system

Most workloads on AWS are either service-oriented architecture or microservices architecture. (In fact, AWS used to use service-oriented architecture, but has more recently embraced microservices.) No matter which architecture you build, documentation and defined standards are essential for consistency across your organization, especially at an enterprise scale.

Start by defining your architecture, then move on to defining the standards necessary to keep that architecture resilient and reliable. The Well-Architected Framework specifically calls out these best practices to prevent failures:

Identify which kind of distributed system is required

Based on your use case, document what kind of distributed system is needed. Hard real-time distributed systems require responses to be given synchronously and rapidly, which, in turn, gives them the most stringent reliability requirements. Soft real-time systems have a more generous time window of minutes or more for response, while offline systems handle responses through batch or asynchronous processing, giving them the most flexible reliability requirements.

Implement loosely-coupled dependencies

In tightly coupled systems, a change in one component requires a change in all other components, which can degrade performance or lead to outages. Loosely coupled dependencies, including queuing systems, streaming systems, workflows, and load balancers, have more flexibility—which means you can isolate failures more.

Whenever possible, standardize around loosely-coupled dependencies to give your system greater resiliency. And if a dependency needs to be tightly coupled, then make sure you document this and test accordingly.
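
As an illustration, here’s a sketch of decoupling two services with an SQS queue using boto3; the queue URL and the fulfill() handler are hypothetical placeholders:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # placeholder queue

# Producer: instead of calling the downstream service directly (a tight coupling),
# put the work on a queue. If the consumer is slow or down, messages simply wait.
def submit_order(order: dict) -> None:
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(order))

# Consumer: pulls work at its own pace and deletes messages only after success,
# so a crash mid-processing means the message is retried rather than lost.
def process_orders() -> None:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for message in resp.get("Messages", []):
        order = json.loads(message["Body"])
        fulfill(order)  # hypothetical handler for the actual work
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```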

Make all responses idempotent

An idempotent service completes each request exactly once: if the service receives multiple identical requests, the effect is the same as a single request. Idempotent services give your system greater resiliency because a client can safely retry the same request and it will still only be processed once.

Distributed systems inherently make it difficult to ensure that each action is performed exactly once, so standardizing around idempotency tokens helps improve your resiliency and reliability.
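
Here’s a minimal sketch of the idea: the client sends the same idempotency token on every retry, and the service returns the stored result instead of repeating the work (an in-memory dict stands in for a durable store such as DynamoDB):

```python
# Results already produced, keyed by idempotency token.
_processed: dict[str, dict] = {}

def charge_card(idempotency_token: str, amount_cents: int) -> dict:
    if idempotency_token in _processed:
        # Duplicate request: same effect as the first call, no double charge.
        return _processed[idempotency_token]

    result = {"status": "charged", "amount_cents": amount_cents}  # stand-in for the real charge
    _processed[idempotency_token] = result
    return result

first = charge_card("order-42-attempt", 1999)
retry = charge_card("order-42-attempt", 1999)   # network retry with the same token
assert first == retry                            # processed exactly once
```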

Trust, but verify through resilience testing

A modern AWS architecture has millions of moving parts and hundreds or even thousands of engineers all working together. No matter how well you design your architecture and document your standards, the constantly changing nature of your systems means that potential points of failure and reliability risks will pop up. It doesn’t matter whether it’s a misconfiguration, a typo in the code, or an expired certificate—if it could create an outage, then it’s a risk that needs to be addressed.

That’s where resilience testing comes in.

Using your defined standards as a guide, resilience testing uses Fault Injection to verify that your systems respond the way they should. For AWS, you’ll want to start with a suite of core tests that cover scalability, redundancy, and dependencies. If you’re running Kubernetes, you should also set up automatic reliability risk monitoring.

As you become more comfortable with testing, you can add additional tests based on the standards you defined above to make sure your AWS architecture is as resilient as you designed it to be.
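
If you want to trigger fault injection experiments programmatically on AWS, one option is the AWS Fault Injection Service (FIS) via boto3, sketched below with a placeholder experiment template ID; resilience testing platforms like Gremlin expose similar capabilities through their own tooling:

```python
import boto3

fis = boto3.client("fis")

# Start a pre-defined fault injection experiment (for example, one that stops
# instances in a single Availability Zone), then check its status and compare
# your availability and error-rate dashboards against your defined standards.
experiment = fis.start_experiment(
    experimentTemplateId="EXT1a2b3c4d5e6f7",   # placeholder experiment template ID
)
experiment_id = experiment["experiment"]["id"]

state = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
print(f"Experiment {experiment_id} is currently: {state}")
```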
