Reducing cloud reliability risks with the AWS Well-Architected Framework
Designing and deploying applications in the cloud can be a labyrinthine exercise. There are dozens of cloud providers, each offering dozens of services, and each of those services has any number of configurations. How are you supposed to architect your systems in a way that gives your customers the best possible experience?
AWS recognized this, and in response, they created the AWS Well-Architected Framework (WAF) to guide customers. But what is the WAF, how does it work, and how can it help you deploy faster, reduce costs, and become more reliable? Read this blog to learn more.
What is the AWS Well-Architected Framework (WAF)?
The AWS WAF is a collection of best practices for running workloads in AWS’ cloud platform. Unlike other cloud optimization guides, it was written by AWS experts and is an official part of their training materials and customer enablement. The framework is built around six pillars:
The Six Pillars of the AWS WAF
The six pillars are: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability.
Each pillar is an entry point into an entire manual's worth of training material, step-by-step action items, and compliance checklists. Each pillar covers operational best practices, as well as anti-patterns to avoid. To prevent this blog from getting too long, we'll focus on two in particular: Operational Excellence and Reliability.
Why is the Well-Architected Framework important?
Like all major cloud providers, AWS has a shared responsibility model with its customers. AWS' operations teams are responsible for maintaining AWS' core infrastructure and services, but it's the customer's responsibility to architect their workloads for resiliency. For example, AWS promises 99.5% availability for individual Amazon EC2 instances. If you need greater availability, you'll need to set up redundant instances. This is where the shared part comes in: both you and AWS must work together to get the most benefit from the platform.
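The payoff of redundancy is easy to quantify. As a minimal sketch (standard probability, assuming instances fail independently; this is an illustration, not an AWS-published formula):

```python
def combined_availability(single: float, count: int) -> float:
    """Availability of N redundant instances, assuming independent failures.

    The system is down only when every instance is down at once, so
    availability = 1 - (1 - single) ** count.
    """
    return 1 - (1 - single) ** count

# One EC2 instance at the 99.5% instance-level SLA:
print(combined_availability(0.995, 1))  # 0.995

# Two redundant instances: unavailability drops from 0.5% to 0.0025%.
print(combined_availability(0.995, 2))  # 0.999975
```

The independence assumption is optimistic (instances in the same availability zone share failure modes), which is exactly why the WAF recommends spreading redundancy across zones.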
The WAF presents many scenarios like these across all aspects of AWS. Reading it will teach you which areas of AWS you have control over, how making specific changes affects your workloads, and how to focus your efforts so you can get the most out of AWS.
The Operational Excellence and Reliability pillars of the WAF
Understanding the WAF Operational Excellence pillar
"[Operational excellence is] a commitment to build software correctly while consistently delivering a great customer experience.”
— AWS Well-Architected Framework, Operational Excellence Pillar
Operational Excellence focuses on reducing maintenance and incident response times, replacing manual operations with automation and infrastructure management tools, and refining operations procedures so teams can respond to incidents quickly and effectively. This pillar is further split into four "best practice areas":
Organization: How well-aligned is your organization on expectations, priorities, and business goals? Do your engineering teams understand their roles in deploying, operating, and maintaining applications in the cloud? Do they know how their work impacts business outcomes? Is it clear which teams own which workloads, and what their responsibilities are in managing those workloads? Does your organization foster a culture of collaboration, where team members are encouraged to make improvements, share learnings, and ask for resources when needed?
Prepare: Before deploying your workloads to the cloud, you need to understand how they work, what is considered good performance, and how to troubleshoot problems when they emerge. Observability is a key part of this, since it’s the only way to truly understand what’s going on inside of your services and systems. This section also focuses heavily on infrastructure management using configuration and deployment management tools, infrastructure as code (IaC), and version control, since cloud platforms enable infrastructure management in a way normal on-premises datacenters don’t. Preparation also extends to incident response, since automation isn’t foolproof. If something goes wrong, engineers need to know how to troubleshoot and address the issue.
Operate: This section applies the work done in the "Prepare" stage to running workloads in production. Once your workloads are live, ensure your observability tools are monitoring them and that your teams are ready to respond to issues when they happen. Make sure you have clear metrics set with KPIs in place around things like mean time to detection (MTTD) and mean time to recovery (MTTR).
Evolve: This final section recommends that you and your team continuously review and improve the processes established in the previous sections. Encourage team members to share their knowledge and findings, set up communication channels for feedback, perform regular operational reviews, and set aside time for making improvements. From an operations perspective, consider running GameDays to test your incident response processes and runbooks, and update them if needed. Systems and processes change over time, so start with your oldest runbooks and move to your more recent ones.
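KPIs like MTTD and MTTR are simple averages over incident records. A minimal sketch of how you might compute them (the incident timestamps here are made up for illustration):

```python
from datetime import datetime, timedelta

def mean_duration(intervals: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean length of (start, end) intervals.

    The same calculation serves both MTTD (fault occurred -> alert fired)
    and MTTR (alert fired -> service recovered).
    """
    total = sum((end - start for start, end in intervals), timedelta())
    return total / len(intervals)

# Hypothetical incident log: (alert fired, service recovered).
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 45)),    # 45 min
    (datetime(2024, 2, 10, 14, 0), datetime(2024, 2, 10, 14, 15)), # 15 min
]
print(mean_duration(incidents))  # 0:30:00
```

Tracking this number over time (rather than per incident) is what tells you whether the "Evolve" work is actually paying off.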
Understanding the WAF Reliability pillar
“Resiliency is the ability of a workload to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions, such as misconfigurations or transient network issues.”
— AWS Well-Architected Framework, Reliability Pillar
Reliable workloads perform their intended functions correctly and consistently. This pillar focuses on that consistency, as well as failure management and recovery, change management, and architecting for resilience. This section also frames high availability in terms of "nines"; e.g. three nines is 99.9% uptime, or roughly 8.8 hours of downtime per year.
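The "nines" arithmetic is easy to make concrete. A quick sketch (pure math, no AWS specifics):

```python
def annual_downtime_hours(nines: int) -> float:
    """Maximum downtime per year (in hours) for a given number of nines.

    Three nines = 99.9% availability, i.e. up to 0.1% of the year down.
    Uses a 365-day year (8,760 hours).
    """
    availability = 1 - 10 ** -nines
    return (1 - availability) * 365 * 24

print(f"Three nines: {annual_downtime_hours(3):.2f} hours/year")  # 8.76
print(f"Four nines:  {annual_downtime_hours(4):.2f} hours/year")  # 0.88
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why the jump from three to four nines tends to require architectural changes (multi-AZ, automated failover) rather than just better processes.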
Key to understanding the reliability pillar is knowing where AWS’ responsibilities end and yours begin. This is called the Shared Responsibility Model for Resiliency. In short, AWS is responsible for the resiliency of the cloud, and you are responsible for resiliency in the cloud. To put it differently, AWS provides a reliable scaffolding for you to build on, but you’re responsible for the reliability of whatever you build on top of it.
Reliability includes Disaster Recovery (DR) and Business Continuity Planning (BCP). How well can you recover from failures? Do you have comprehensive recovery plans in place, and have you tested that they work? When did you last validate them, and are they due for a refresh? It’s inevitable that failures will occur even on a highly reliable platform, and having working response plans can greatly reduce your time to recovery.
AWS provides example implementations for meeting availability goals. These include: monitoring workloads, designing adaptable systems, using fault isolation and fault tolerance, backing up data, and planning for DR. You can read more about these example designs in the Reliability pillar documentation.
How Gremlin helps you meet AWS WAF targets
“You can’t consider your workload to be resilient until you hypothesize how your workload will react to failures, inject those failures to test your design, and then compare your hypothesis to the testing results.”
— AWS Reliability Pillar announcement blog
You’ve identified your operational requirements, defined your infrastructure as code, deployed your workloads, implemented change management, and created a comprehensive Disaster Recovery plan. Now, how do you test all of these processes to confirm that your systems really are well-architected?
While it’s possible to check and compare your systems against the examples outlined in the WAF, this doesn’t really tell you how your systems will respond to real-world conditions. For instance, imagine you just finished following this guide to achieving 99.95% availability. How do you know you’ll be alerted when the database fails over, or that your DNS entries will properly fail over to a static website, or that data in your backup availability zone is properly replicated? To answer these questions, we need a more active method of testing reliability.
Gremlin lets you take a proactive approach to testing the resiliency of your cloud applications. It starts with adding your host, container, and Kubernetes-based services to Gremlin. For each service, Gremlin automatically creates a suite of reliability tests that include redundancy tests, scalability tests, and tests for slow and failed dependencies. During each test, Gremlin uses Amazon CloudWatch (or your observability tool of choice) to monitor the state of your services and ensure they're still available. This proves that your services are resilient to various failure modes. And while Gremlin provides a built-in test suite based on industry best practices, you can create your own fully customized test suites.
How to meet and validate the WAF Reliability pillar
The reliability pillar focuses on workloads adapting to different conditions, and there's no better way to prove the adaptability of your workloads than by applying those conditions. For example, if you wanted to test your ability to fail over to a secondary availability zone, the best approach is to recreate a zone failure. AWS is unlikely to let you take down an entire zone, but you can still recreate a zone failure using Gremlin's Redundancy: Zone experiment. This experiment blocks network traffic to an entire availability zone: if the service that you're testing is truly zone-redundant, it should still respond to user requests and health checks. This has the same impact as a zone failure, but with the benefit of being easy to recover from. As soon as the experiment finishes, Gremlin reverts your systems to normal.
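The property a zone experiment verifies empirically can also be stated as a simple capacity check. A hypothetical sketch (the function and the instance layout are illustrative, not part of Gremlin's API):

```python
from collections import Counter

def survives_zone_loss(instance_zones: list[str], min_healthy: int) -> bool:
    """Check whether a service keeps enough capacity if any single
    availability zone goes dark.

    instance_zones lists the AZ of each running instance; min_healthy is
    the minimum instance count the service needs to serve traffic.
    """
    per_zone = Counter(instance_zones)
    total = sum(per_zone.values())
    # Worst case: the zone holding the most instances is the one lost.
    return all(total - count >= min_healthy for count in per_zone.values())

# Three instances spread across three zones survive any single-zone loss,
# provided two healthy instances are enough:
print(survives_zone_loss(["us-east-1a", "us-east-1b", "us-east-1c"], 2))  # True
# All capacity concentrated in one zone does not:
print(survives_zone_loss(["us-east-1a", "us-east-1a"], 1))  # False
```

A static check like this only inspects topology; the zone experiment goes further by confirming that load balancers, DNS, and health checks actually route around the lost zone in practice.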
While your experiment’s running, use your observability tools to monitor the impact on your systems and applications. Are requests still being handled? Is there an uptick in errors or latency? Are replacement instances being spun up automatically, or is manual intervention required? Keep a record of your observations, as this will tell you what to fix. After you implement fixes, repeat the experiment to validate your fixes.
How to meet and validate the WAF Operational Excellence pillar
Running experiments is a great way to test one-off conditions, but when it comes to operationalizing reliability, teams need a more standardized and automated way of running tests. Gremlin enables this with reliability tests and test suites.
Reliability tests are repeatable experiments that test a specific aspect of a service (redundancy, scalability, etc.), and depending on how the service performs, return a score ranging from 0 to 100. Test suites consist of one or more reliability tests and are applied to each service in your Gremlin team. This lets you, for example, create a test suite that only tests for adherence to the specific sections of the WAF that apply to your team. Test suites are fully customizable by your Gremlin admin, so you can create test suites specific to your team's needs and goals.
Test suites also make full use of Health Checks, so you can have Gremlin monitor your observability metrics and alerts while it's running tests, and safely fall back if a test causes a service to fail or take too long to respond. The results of each test feed into a per-service reliability score, which ranges from 0% to 100% and tells you at a glance how reliable your service is. Running and passing reliability tests increases the score, while failing tests or letting tests expire decreases it.
Use Gremlin to evaluate and test WAF compliance
The AWS Well-Architected Framework is a massive document, and our goal is to help you find the right path to implementing and adhering to it. Whether you’re adopting the WAF or another operational framework, we're here to help you ensure your services are resilient and available for your users.
Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30-day trial.