There’s a common misconception about running workloads in the cloud: the cloud provider is responsible for reliability. After all, they’re hosting the infrastructure, services, and APIs. That leaves little else for their customers to manage, other than the workloads themselves…right?

In truth, even the largest cloud providers can’t take full responsibility for reliability. There are still reliability risks lurking in your containers, your virtual machine instances, and your serverless functions. This creates a split responsibility for reliability, where your provider is responsible for making their platform resilient, while you’re responsible for the resilience of the workloads you deploy onto that infrastructure. But what does this look like in practice?

In this blog, we’ll look at how resiliency on AWS differs from on-prem. We’ll also discuss AWS’ Shared Responsibility Model for Resiliency, and what it means for teams like yours.

How is resiliency on AWS different from resiliency on-prem?

In an on-premises environment, your organization is in full control of its entire environment, from the hardware up to the containers and applications running on it. You (as in, your organization) have full visibility into the hardware: you understand how it works, you can design its architecture, and you can configure it to be as resilient as you like. If an incident occurs, you can directly troubleshoot and mitigate problems without having to wait on a third-party provider.

In the cloud, practically everything below the operating system level is inaccessible. You no longer know what hardware your applications are running on; you no longer have insight into hardware and OS-level reliability risks; and you can no longer monitor these systems for potential incidents. You now rely on the cloud vendor to maintain these systems on your behalf and alert you to any potential problems. Cloud providers expose metrics from the workloads you have running on their platforms (e.g. EC2 instance metrics, Kubernetes metrics, and container metrics), but these are much more limited than if you had direct access.

The primary benefit is that not having to maintain infrastructure means you can focus entirely on building and deploying your applications. This saves time, frees up your IT and DevOps teams to do more value-adding work, and lets you deploy faster. And while you don’t have control over lower-level reliability risks, cloud vendors offer service level agreements (SLAs) promising minimum levels of reliability. If they fail to meet those reliability targets, you can be reimbursed (typically in the form of service credits), whereas on-premises, you’d have to absorb those costs yourself.

What is the AWS Shared Responsibility Model for Resiliency?

The AWS Shared Responsibility Model for Resiliency is a framework describing the areas of the cloud computing model that AWS is responsible for maintaining. It also defines the parts that you, the customer, are responsible for. It’s part of the Reliability pillar of the AWS Well-Architected Framework, which is AWS’ official guidance on how to build resilient and efficient applications on AWS.

Note
The Shared Responsibility Model was originally written for security, but its concepts apply to other cloud areas.

What this means in practice is that AWS is solely responsible for the resiliency of its infrastructure and the services it provides. This includes rack servers, data centers, network connections, the AWS console and API, and everything else that goes into running AWS as a platform. The customer is responsible for the resiliency of any services built on top of that platform, such as virtual machine clusters, private networks, Kubernetes clusters, and network API gateways.

To put it simply (and in AWS’ own words): AWS is responsible for the resiliency of the cloud, and customers are responsible for resiliency in the cloud.

Image © AWS: https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/shared-responsibility-model-for-resiliency.html

Let’s look at each of the customer layers.

Networking, quotas, and constraints

AWS refers to these as Foundational requirements, since they impact nearly every aspect of your environment.

Networking considerations

Networking is at the heart of all cloud computing workloads. Without good networking infrastructure, services can’t communicate with each other or your customers. Cloud networking infrastructure is also extremely complex and varied: there are virtual private clouds (VPCs), numerous DNS resolvers, public and private IP addresses to manage, security groups to create and assign, and inter-availability zone (AZ) connections to make. While you don’t have to worry about laying cable or installing hardware, you do need to consider how best to apply the options and services AWS makes available to you.
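
As one small illustration of the planning involved, the sketch below carves a VPC address range into one subnet per AZ using Python’s standard ipaddress module. The CIDR block, the prefix length, and the AZ names are assumptions chosen for the example, not values recommended by AWS.

```python
import ipaddress
from itertools import islice


def plan_subnets(vpc_cidr: str, az_names: list[str], new_prefix: int) -> dict[str, str]:
    """Assign one non-overlapping subnet of the VPC range to each AZ."""
    vpc = ipaddress.ip_network(vpc_cidr)
    # subnets() yields consecutive, non-overlapping blocks of the requested size.
    blocks = list(islice(vpc.subnets(new_prefix=new_prefix), len(az_names)))
    if len(az_names) > len(blocks):
        raise ValueError("VPC range is too small for one subnet per AZ")
    return {az: str(block) for az, block in zip(az_names, blocks)}


# Hypothetical VPC range and AZ names, for illustration only.
plan = plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b", "us-east-1c"], 20)
for az, cidr in plan.items():
    print(az, cidr)
```

Planning address space this way up front leaves room to add subnets later without renumbering the ones already in use.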

Quota and constraint considerations

Cloud infrastructure is easy to provision, which also means it’s easy to over-provision. On AWS, provisioning an EC2 instance with 128 CPU cores and 256 GiB of RAM takes just a few clicks and a minute or two for the instance to come online. You could argue that you need the extra overhead for traffic spikes and growth, but until those happen, you’ll be paying for unused computing resources.

Architecture planning means balancing the need for resources to handle customer traffic and scale against the costs of operating these resources. You’ll need to understand the load that customers put on each of your services, how increasing capacity changes your ability to handle that load, and what your limits are in terms of increasing capacity. Systems can only scale so much before costs and/or hardware prohibit it, and testing is necessary for finding that balance.

A common solution is to use Auto Scaling groups (ASGs), which dynamically add and remove nodes based on demand. As your existing instances approach their CPU or memory limits, AWS can automatically provision additional nodes and route traffic to them via load balancers or API gateways. You can cap the number of nodes an ASG can add to set an upper cost boundary, but on average, ASGs reduce costs by provisioning only as much capacity as needed at any given moment.
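
To make the cost tradeoff concrete, here is a minimal sketch that compares instance-hours for a size-limited scaling group against a fleet statically sized for peak load. It is a simplified model, not AWS’ actual scaling algorithm, and the load trace, per-instance capacity, and size limits are all made-up numbers.

```python
import math


def instances_needed(load_rps: float, per_instance_rps: float,
                     min_size: int, max_size: int) -> int:
    """Instances required for the load, clamped to the group's size limits."""
    raw = math.ceil(load_rps / per_instance_rps)
    return max(min_size, min(max_size, raw))


# Hypothetical hourly request rates (requests per second) over one day.
hourly_load = [200] * 8 + [900] * 4 + [1500] * 4 + [600] * 8
PER_INSTANCE_RPS = 250     # assumed capacity of a single instance
MIN_SIZE, MAX_SIZE = 2, 5  # assumed group limits; MAX_SIZE caps cost

scaled_hours = sum(
    instances_needed(load, PER_INSTANCE_RPS, MIN_SIZE, MAX_SIZE)
    for load in hourly_load
)
# Static provisioning sized for the daily peak, running all 24 hours.
static_hours = math.ceil(max(hourly_load) / PER_INSTANCE_RPS) * len(hourly_load)

print("instance-hours with scaling:", scaled_hours)
print("instance-hours sized for peak:", static_hours)
```

With these assumed numbers, the cap of 5 leaves the group one instance short during the peak hours, which is exactly the cost-versus-capacity balance that load testing should inform.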

Change management

Change is inevitable, especially in the cloud. Systems, applications, and services are constantly in flux, whether it’s engineering teams pushing new deployments into production, or AWS updating hardware and software components in their data centers. Your workloads must be flexible enough to adapt to these changes, while also being managed so that all changes are tracked.

For example, Amazon EKS periodically releases new versions of Kubernetes and removes old/obsolete versions. When you start a cluster update, EKS will automatically handle deploying new nodes, but it’s up to you to ensure that your Deployments, DaemonSets, plugins, Helm charts, and other resources can run on the new version. Again, AWS takes responsibility for the underlying infrastructure, and leaves the applications to you.
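
One way to approach that check is to scan your manifests for API versions that the target Kubernetes version no longer serves. The sketch below assumes the manifests are already parsed into dictionaries; the lookup table is illustrative and deliberately incomplete, and the manifest names are hypothetical. Consult the Kubernetes deprecated API migration guide for the full list before relying on anything like this.

```python
# Illustrative, incomplete table: (apiVersion, kind) -> Kubernetes minor
# version (1.x) in which that API version stopped being served.
REMOVED_IN = {
    ("extensions/v1beta1", "Ingress"): 22,
    ("networking.k8s.io/v1beta1", "Ingress"): 22,
    ("batch/v1beta1", "CronJob"): 25,
    ("policy/v1beta1", "PodDisruptionBudget"): 25,
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): 26,
}


def find_blockers(manifests: list[dict], target_minor: int) -> list[str]:
    """Return kind/name of manifests whose API version is gone in 1.<target_minor>."""
    blockers = []
    for m in manifests:
        removed = REMOVED_IN.get((m["apiVersion"], m["kind"]))
        if removed is not None and target_minor >= removed:
            blockers.append(f'{m["kind"]}/{m["metadata"]["name"]}')
    return blockers


# Hypothetical manifests, already parsed into dictionaries.
manifests = [
    {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "web"}},
    {"apiVersion": "policy/v1beta1", "kind": "PodDisruptionBudget", "metadata": {"name": "web-pdb"}},
    {"apiVersion": "batch/v1beta1", "kind": "CronJob", "metadata": {"name": "report"}},
]
print(find_blockers(manifests, target_minor=25))
```

Running a check like this in CI before starting a cluster update turns a surprise failure into a routine pre-upgrade task.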

Failure management

Despite best efforts, failures still happen. Servers shut down without warning, data centers flood, and engineers accidentally send the wrong commands. While it’s AWS’ responsibility to recover from infrastructure failures within the terms defined in their Service Level Agreements (SLAs), this doesn’t change the fact that you’re offline and unable to serve your customers.

AWS knows this, and so AWS provides additional services and controls for mitigating these failures. The most common is redundancy, which involves running two or more instances of your workloads in separate availability zones (AZs) or regions. This way, if one AZ or region goes offline, you can reroute customer traffic to a backup AZ or region with minimal service interruption. This process varies between AWS services and depends on your application architecture, so it’s your responsibility to set it up and test it yourself.
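
A rough way to reason about the value of redundancy is to estimate the combined availability of several copies. The sketch below assumes each copy fails independently and that failover is instant and always succeeds; neither holds fully in practice, and the 99% figure is an assumed number, so treat the result as an upper bound rather than a prediction.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of N redundant copies, assuming independent failures."""
    return 1 - (1 - single) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime in hours over a 365-day year."""
    return (1 - availability) * 365 * 24


for copies in (1, 2, 3):
    a = combined_availability(0.99, copies)  # 0.99 is an assumed figure
    print(copies, f"{a:.6f}", f"{downtime_hours_per_year(a):.2f} h/year")
```

Under these assumptions, going from one copy to two cuts expected downtime from about 87.6 hours per year to under one hour. Correlated failures and slow or untested failover erode that gain, which is why the failover path itself needs testing.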

Workload architecture

Workload architecture encompasses the very design of your applications and services: how they’re deployed, the platform(s) they run on, how services are separated, and how they intercommunicate. A modern example is service-oriented architecture (SOA), where applications are divided into small interdependent units that work together. Platforms like EKS and Lambda support SOA, but to get the most benefit from any architecture, you need to design your workloads correctly.

For example, Kubernetes (and by extension, EKS) makes it very easy to create replicas of your containers and distribute them across multiple nodes for redundancy in case of a failure. You could use Horizontal Pod Autoscaling to automatically deploy new pods as demand increases; you could set pod disruption budgets (PDBs) to maintain a minimum number of pods even during disruptions; or you could simply increase the replica count of a deployment. Regardless of the method you choose, it’s your responsibility to be aware of these options, how to configure them, and what settings work best for your specific workloads.
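
To give a feel for how two of these options behave, the sketch below models the core ratio that the Kubernetes documentation gives for Horizontal Pod Autoscaling, plus a simplified pod disruption budget check. It is a simplified model: the real autoscaler also applies a tolerance, stabilization windows, and pod readiness rules, and the workload numbers here are hypothetical.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float, min_replicas: int,
                         max_replicas: int) -> int:
    """Core HPA ratio: ceil(current * currentMetric / targetMetric), clamped."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))


def pdb_allows_eviction(healthy_pods: int, min_available: int) -> bool:
    """A voluntary eviction is allowed only if enough pods stay healthy."""
    return healthy_pods - 1 >= min_available


# Hypothetical workload: 4 pods averaging 90% CPU against a 60% target.
print(hpa_desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10))
# With 3 healthy pods and minAvailable of 3, no pod may be evicted.
print(pdb_allows_eviction(healthy_pods=3, min_available=3))
```

Working through numbers like these for your own services helps you pick targets and budgets deliberately instead of accepting defaults.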

Continuous testing

Reliability isn’t a one-and-done initiative. Systems change over time, bugs and regressions emerge, and AWS continuously updates their services with new versions and functionality. All throughout this process, reliability risks can emerge and unexpectedly impact your systems. For this reason, you need to treat reliability as an ongoing and repeatable practice, just like you would QA testing or CI/CD deployments.

This doesn’t just mean firing off a test suite on every build. Reliability is as much an engineering mindset as it is about running tests. Engineers need to be vigilant and consider the different failure modes that could impact their services. Ideally, these risks will be detected early in the software development life cycle, lowering the likelihood that they reach production. Continuous testing helps ensure that new risks are discovered and fixed while your team is also busy developing and pushing new features.

Additionally, reliability is a learning experience. When your team experiences an incident or outage, take the time to understand how the problem emerged and how it impacted your systems. After you deploy a fix, create a test to prove that the fix works, then make this test part of your regular reliability testing process to prevent regressions. Learning from real-world failures is incredibly effective, since it makes the risk and impact of failure more tangible for engineers.
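
As a minimal sketch of turning an incident into a regression test, suppose an outage was caused by a page failing whenever a recommendations service was down, and the fix was to fall back to a static list. The service, the function names, and the fallback value below are all hypothetical.

```python
class DependencyDown(Exception):
    """Raised by the (hypothetical) recommendations client on failure."""


def get_recommendations(fetch, fallback=("bestsellers",)):
    """Return live recommendations, or a static fallback if the call fails."""
    try:
        return list(fetch())
    except DependencyDown:
        # The fix under test: degrade gracefully instead of failing the page.
        return list(fallback)


def test_falls_back_when_dependency_is_down():
    def failing_fetch():
        raise DependencyDown("simulated outage")

    assert get_recommendations(failing_fetch) == ["bestsellers"]


def test_uses_live_data_when_dependency_is_healthy():
    assert get_recommendations(lambda: ["a", "b"]) == ["a", "b"]


# A runner such as pytest would discover these; calling them directly also works.
test_falls_back_when_dependency_is_down()
test_uses_live_data_when_dependency_is_healthy()
print("regression tests passed")
```

A unit-level test like this only simulates the failure in code. Pairing it with a test that injects the same failure into a running environment gives more confidence that the fix holds under real conditions.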

How to manage resiliency of your AWS infrastructure

Now that you know where your resiliency responsibilities are as an AWS customer, how do you go about making sure your services and applications are resilient? The AWS Well-Architected Framework is a great place to start, as it lists the steps you can take today to fix resiliency concerns.

If you’re looking for a more automated solution (e.g. you have a large deployment with a lot of resources), Gremlin’s Reliability Management platform provides a way to define a team- or organization-wide resiliency standard for AWS, automatically monitor infrastructure for reliability risks, and test systems to better understand your reliability posture. The goal? Automate reliability in the cloud so teams can confidently focus more on delivering features rather than responding to incidents. 

Using Detected Risks, you can get a near-instantaneous report on each of your deployed services and the reliability risks they’re susceptible to. You’ll also get detailed information on what each risk means, how it could impact your service, and how to fix it. You can also test your services’ resiliency to issues you can expect to encounter, such as latency, dependency failures, expired TLS certificates, exhausted compute resources, unexpected system failures, and much more.

To see it for yourself, sign up for a free 30-day trial, or get a demo from one of our reliability experts.

Andre Newman
Sr. Reliability Specialist