Building reliable distributed systems
What reliability means for distributed systems, the challenges in building reliable systems, and how to overcome these challenges using Chaos Engineering.
All applications—no matter their function, complexity, or programming language—need reliable systems to run on. Without reliable systems, we would not have the large-scale, networked computing infrastructure we have today. But what does reliability really mean in terms of technology and systems?
What is a reliable system?
Reliability is the ability of a system to remain available over a period of time. Reliable systems are able to serve customer traffic consistently without errors or interruptions. While reliable technology has always been important (as shown by the growing practice of site reliability engineering, or SRE), it’s becoming increasingly important for modern applications.
The reason is that the modern world runs on distributed systems. We use these systems to manage everything from online banking to telecommunications to healthcare. As these systems play increasingly important roles in everyday life, customer expectations for availability are also increasing. Disruptions don’t just affect our ability to generate revenue; they also impact the lives of our customers, whether it’s delayed flights caused by a computer outage or an operational error taking down core banking services. According to Gartner, companies lose an average of $336,000 per hour of downtime, and for companies that provide critical services or are in regulated industries, these costs are even higher.
Why building reliable systems is difficult
Engineering reliable distributed systems is challenging, and even the most well-designed systems are prone to unexpected failures. This is due to two main factors: increasing development velocity, and increasing system complexity.
Software development velocity is increasing
DevOps and agile have rapidly accelerated the software development life cycle, allowing organizations to innovate and deliver value to customers faster than ever. However, this increased velocity requires us to change the way we test software. Traditional Quality Assurance (QA) practices can no longer keep up with the pace and complexity of modern software development, nor can they test for emergent behaviors or infrastructure-level failure modes. As a result, failure rates increase and more failure modes make their way into production.
Systems are becoming more complex
Cloud computing and cloud-native architectures like microservices have made our applications much more dynamic and scalable, but they’ve also added significant complexity to our environments. We replaced static, long-lived monoliths with discrete, loosely coupled containers. We replaced hand-provisioned on-premises data centers with automated cloud platforms that manage our servers for us. We replaced manual deployments with CI/CD tools that compile, test, and deploy our applications for us. Our environments are larger than ever, and they’re being managed by automated platforms that orchestrate, scale, and maintain our workloads without our input.
Automation has removed much of the administrative overhead of managing systems, but it has also caused our systems to exhibit more emergent behaviors. These behaviors are difficult to anticipate, especially with traditional testing measures. Not only are automated tools imperfect, but we don’t always understand how they’ll work in every situation. In addition, many of our workloads are ephemeral: services constantly start and stop based on demand or available resources. This makes it even harder to troubleshoot problems, as a failed component might no longer exist by the time engineers detect the failure. Combine this with the fact that cloud platforms limit our ability to manage and observe the underlying infrastructure, and reliability becomes a much bigger problem.
All of this adds to the complexity of maintaining applications, and as complexity increases, so does the risk of failure. Each new automated tool, platform, orchestration service, and dependency added to our environments increases the risk of an outage, as a failure in any one component could potentially bring down the entire system. Even minor changes to configurations and deployment processes can affect applications in unexpected ways. Nonetheless, we expect site reliability engineers (SREs) to keep these systems and services running with minimal downtime.
How do we make our systems more reliable?
We’ve explored the underlying factors that cause poor reliability, but we don’t yet have a clear picture of how poor reliability manifests in systems. How do our systems and applications fail, and how do we address and mitigate these failures?
System reliability is a broad topic that varies depending on environment and team, but we can summarize it into distinct categories. These categories—which we call reliability principles—are meant to guide reliability engineering efforts so that we can identify and address failure modes more effectively. This takes the guesswork out of reliability by highlighting potential failure modes and reliability problems, providing clear objectives and corrective actions, and guiding SREs through the process of addressing them. These principles include dependency failures, scaling, self-healing, and redundancy and disaster recovery.
Dependency failures
Dependencies are software components that provide additional functionality to an application or service. They include internal services provided by other teams in the organization; managed third-party services (e.g. software as a service, or SaaS); and self-managed external services, such as those hosted on platform as a service (PaaS) or infrastructure as a service (IaaS) platforms. Dependencies create additional points of failure that can cause our own services to fail, unless we design them to tolerate dependency failures.
Strategies for mitigating dependency failures include:
- Using automated health checks—periodic requests sent to check the availability and responsiveness of a service—to detect failed or unhealthy dependencies.
- Implementing retry and timeout logic around calls to dependencies.
- Failing over to backup dependencies or alternative code paths in case of an outage.
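As a rough illustration of the retry and failover strategies above, the following Python sketch retries a dependency call with exponential backoff and falls back to alternative providers when the primary keeps failing. The names `call_with_retry` and `call_with_failover` and the default values are hypothetical, and the sketch assumes each callable enforces its own timeout (for example, through its client library's timeout setting).

```python
import time
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class DependencyUnavailable(Exception):
    """Raised when the primary and every fallback provider have failed."""


def call_with_retry(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Invoke `call`, retrying with exponential backoff on any exception."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


def call_with_failover(providers: Sequence[Callable[[], T]], **retry_kwargs) -> T:
    """Try each provider in order, retrying each one before moving on."""
    errors: List[Exception] = []
    for provider in providers:
        try:
            return call_with_retry(provider, **retry_kwargs)
        except Exception as exc:
            errors.append(exc)
    raise DependencyUnavailable(errors)
```

Passing `sleep` as a parameter is a design choice that keeps the backoff testable without real delays. A chaos experiment that adds latency or drops traffic to the primary dependency is one way to check that logic like this behaves as expected.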
Scaling
Scalability is the ability to dynamically allocate resources to a service in response to changes in demand. Distributed systems commonly use both vertical scaling (increasing the capacity of a single host) and horizontal scaling (increasing the total capacity of a service by distributing it across multiple hosts).
Strategies for mitigating scaling failures include:
- Using auto-scaling mechanisms to dynamically adjust capacity in response to demand.
- Validating auto-scaling rules and scaling behavior.
- Implementing and configuring load balancing.
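To make "validating auto-scaling rules" more concrete, here is a minimal Python sketch of a proportional scaling rule, loosely modeled on the formula documented for the Kubernetes Horizontal Pod Autoscaler (desired replicas = ceil(current replicas × current metric / target metric)). The function name, the replica limits, and the use of integer utilization percentages are assumptions made for illustration; a real autoscaler also applies tolerances and stabilization windows that are omitted here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: int,  # observed average utilization, in percent
    target_utilization: int,   # utilization the rule tries to maintain, in percent
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Return the replica count a proportional scaling rule would request."""
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    # Clamp to the configured bounds so a spike cannot scale without limit.
    return max(min_replicas, min(max_replicas, raw))
```

Working out the expected replica count for a given load ahead of time gives us a concrete prediction to compare against what the platform actually does when an experiment consumes compute resources.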
Self-healing
Self-healing is the ability of a service to automatically detect and recover from internal failures. Services that require human intervention to recover from incidents result in longer outages, put more stress on your incident response teams, and are at greater risk of a failed recovery.
Strategies for mitigating self-healing failures include:
- Automatically detecting and responding to failed services, such as re-deploying workloads to healthy hosts.
- Restoring or replacing a failed service or host.
- Maintaining service availability when a host is offline.
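The re-deployment strategy above can be sketched as a single reconciliation step: given where workloads are currently placed and which hosts are healthy, move anything on a failed host to the least-loaded healthy host. This is a simplified illustration, not how any particular orchestrator works; the `reschedule` function and the host and workload names are hypothetical.

```python
from typing import Dict, List, Set


def reschedule(placements: Dict[str, str], healthy_hosts: Set[str]) -> Dict[str, str]:
    """Return new workload -> host placements with failed hosts evacuated."""
    if not healthy_hosts:
        raise RuntimeError("no healthy hosts available")
    load = {host: 0 for host in healthy_hosts}
    result: Dict[str, str] = {}
    displaced: List[str] = []
    for workload, host in placements.items():
        if host in healthy_hosts:
            result[workload] = host  # leave healthy placements untouched
            load[host] += 1
        else:
            displaced.append(workload)
    for workload in displaced:
        # Pick the least-loaded healthy host; ties break by host name.
        target = min(sorted(load), key=lambda h: load[h])
        result[workload] = target
        load[target] += 1
    return result
```

For example, with `api` on `host-a` and both `worker` and `cache` on a failed `host-b`, calling `reschedule` with healthy hosts `host-a` and `host-c` keeps `api` in place and spreads the two displaced workloads across the healthy hosts.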
Redundancy and Disaster Recovery
Redundancy is the process of replicating data and services to provide a fallback in case of a failure. It reduces the risk of a system failure becoming an outage by spreading the risk across multiple replicas. With the proper failover mechanisms in place, redundant systems can continue operating during a failure with minimal or no service interruption or data loss.
Disaster recovery is the practice of restoring operations after a disruptive event, such as a major outage or natural disaster. Teams that perform disaster recovery planning (DRP) are better prepared to respond to these events and bring their services back online faster. Redundancy is an essential part of disaster recovery planning, as it helps organizations reduce their recovery time objective (RTO), recovery point objective (RPO), and mean time to recovery (MTTR).
Strategies for mitigating redundancy failures include:
- Using observability tools to automatically detect major outages, such as a regional or zonal outage.
- Replicating services and data across multiple data centers, availability zones, or regions, and using load balancing tools to re-route traffic during outages.
- Planning for and proactively testing region evacuations.
- Creating and validating disaster recovery runbooks.
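One way to validate a disaster recovery runbook is to run a drill and compare the measured results against the recovery objectives. The Python sketch below assumes the common definitions of RTO (the maximum acceptable time to restore service) and RPO (the maximum acceptable window of lost data, measured in time). The class and field names are hypothetical, and timestamps are plain seconds for simplicity.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class RecoveryObjectives:
    rto_seconds: float  # maximum acceptable time to restore service
    rpo_seconds: float  # maximum acceptable window of lost data


@dataclass(frozen=True)
class DrillResult:
    outage_start: float           # when the simulated outage began
    service_restored: float       # when traffic was served again
    last_replicated_write: float  # newest write present on the replica


def evaluate_drill(
    objectives: RecoveryObjectives, drill: DrillResult
) -> Dict[str, Union[float, bool]]:
    """Compare a drill's measured recovery against the stated objectives."""
    recovery_time = drill.service_restored - drill.outage_start
    # Writes made after the last replicated write and before the outage are lost.
    data_loss_window = max(0.0, drill.outage_start - drill.last_replicated_write)
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "meets_rto": recovery_time <= objectives.rto_seconds,
        "meets_rpo": data_loss_window <= objectives.rpo_seconds,
    }
```

A drill that restores service in 240 seconds against a 300-second RTO, but loses 90 seconds of writes against a 60-second RPO, would pass the first check and fail the second. That result points at replication lag rather than the failover procedure itself.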
How do we apply these principles?
These principles provide guidance on how to identify and prevent failure modes, but even if we design our systems with as many resiliency mechanisms as possible, we don’t know their effectiveness until we either experience an incident or proactively test them. Traditional forms of testing aren’t comprehensive enough to test our dynamic, complex, modern systems, and so we need a new approach. That approach is Chaos Engineering.
What is Chaos Engineering?
Chaos Engineering is the practice of intentionally experimenting on a system by injecting failure, observing how the system responds, and using our observations to improve the system’s resilience. Chaos Engineering lets us simulate conditions that can result in failure—such as excessive resource consumption, network latency, and system outages—in a safe and controlled way. This helps expose latent and unexpected failure modes hidden deep within our systems, allowing us to implement fixes and validate our resilience before deploying to production.
With Chaos Engineering, we can apply the above reliability principles to real-world scenarios, such as:
- Degrading network performance between our service and a dependency to simulate a dependency failure.
- Stopping a resource such as a Kubernetes pod or virtual machine instance to validate self-healing mechanisms.
- Consuming compute resources to trigger auto-scaling rules.
- Dropping all traffic to a cluster, availability zone, or cloud compute region to simulate a regional outage.
A safe approach to improving reliability
Chaos Engineering provides a safe and controlled approach to reliability testing. Instead of causing random, uncontrolled failures, we structure our tests around chaos experiments. These are carefully designed experiments that test our understanding of how our systems work with the goal of making them more resilient.
A chaos experiment starts with a hypothesis about how we expect our systems to behave under certain conditions. A hypothesis can cover any scenario, but we can use the reliability principles listed above to guide our initial experiments. We can ask questions such as “What happens to my service if...”
- ...it loses connection to a critical dependency, such as a database? Does it continue serving requests, or does it return errors to our users?
- ...the cluster runs out of compute capacity? Can we increase capacity without having to stop, restart, or redeploy our services?
- ...the host that our service is running on fails? Can we replicate the service to another host and redirect users without interruptions?
- ...the data center that our services are running in is taken offline by a fire, earthquake, or flood? Do we have redundant systems in another location that we can fail over to? Are our engineers prepared to execute a disaster recovery plan?
Next, we determine the best way to test this hypothesis. For example, when testing a connection to a critical dependency, we need a way to simulate a failure in the network connection between our service and our dependency. Traditionally, this would mean changing firewall rules or blocking a port on a router or switch. Chaos Engineering tools like Gremlin simplify this by providing a framework for injecting failure into systems in a controlled way, so we can test our systems under failure conditions without causing actual harm.
Chaos Engineering tools give us full control over how we inject failure, including the type of failure, the number of systems affected (called the blast radius), the severity of the failure (called the magnitude), and the duration. When starting with Chaos Engineering, keep the blast radius and magnitude low to focus reliability testing on a single system, application, or service. As we learn more about our systems and work on building their resilience, we can gradually expand the scope of our testing by increasing the blast radius and magnitude. This lets us uncover problems incrementally, as opposed to testing the entire system all at once.
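The parameters described above can be captured in a small data structure. The following Python sketch is purely illustrative and is not Gremlin's API: `ChaosExperiment`, `select_targets`, and `escalate` are hypothetical names, blast radius is modeled as a fraction of hosts, and targets are chosen in sorted order only to keep the example deterministic (a real tool might select them randomly or by tag).

```python
from dataclasses import dataclass, replace
from typing import List, Sequence


@dataclass(frozen=True)
class ChaosExperiment:
    hypothesis: str        # what we expect the system to do
    failure_type: str      # e.g. "latency", "cpu", "shutdown"
    blast_radius: float    # fraction of hosts targeted, in (0, 1]
    magnitude: float       # severity; the unit depends on failure_type
    duration_seconds: int  # how long the failure is injected

    def __post_init__(self) -> None:
        if not 0 < self.blast_radius <= 1:
            raise ValueError("blast_radius must be in (0, 1]")
        if self.magnitude <= 0 or self.duration_seconds <= 0:
            raise ValueError("magnitude and duration_seconds must be positive")


def select_targets(hosts: Sequence[str], blast_radius: float) -> List[str]:
    """Pick the subset of hosts covered by the blast radius (at least one)."""
    count = max(1, int(len(hosts) * blast_radius))
    return sorted(hosts)[:count]


def escalate(
    experiment: ChaosExperiment,
    radius_factor: float = 2.0,
    magnitude_factor: float = 2.0,
) -> ChaosExperiment:
    """Return a copy with a larger blast radius (capped at 1.0) and magnitude."""
    return replace(
        experiment,
        blast_radius=min(1.0, experiment.blast_radius * radius_factor),
        magnitude=experiment.magnitude * magnitude_factor,
    )
```

Starting with a blast radius of 0.25 and escalating only after the system passes reflects the gradual approach: each step widens the scope from a known-good baseline.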
Safety is also a critical part of Chaos Engineering. Failure injection comes with the risk of inadvertently causing a larger failure than intended. This is why it’s critical to have a way to quickly halt and roll back an experiment. With Gremlin, we can manually stop experiments, or use automatic Status Checks to continuously monitor our systems for failure and halt experiments that grow out of control.
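The halt-and-roll-back behavior described above can be sketched as a small runner that polls a health check while the failure is active. This is a hypothetical illustration of the idea, not Gremlin's Status Checks implementation; `run_experiment` and its parameters are assumed names, and the inject, rollback, and status-check callables are supplied by the caller.

```python
import time
from typing import Callable


def run_experiment(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    status_check: Callable[[], bool],
    checks: int,
    interval_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Inject a failure, poll a health check, and always roll back.

    Returns "halted" if the status check fails during the experiment,
    or "completed" if every check passes.
    """
    inject()
    try:
        for _ in range(checks):
            if not status_check():
                return "halted"  # stop early: the system is unhealthy
            sleep(interval_seconds)
        return "completed"
    finally:
        # Runs on early return, normal completion, and exceptions alike.
        rollback()
```

Placing the rollback in a `finally` block means the injected failure is removed whether the experiment finishes, is halted, or raises an unexpected error.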
As we run experiments and uncover new failure modes, we should keep track of our discoveries and our fixes. This is how we track incremental reliability improvements over time and see how our systems are becoming more reliable as a result of our reliability initiative. Once we’ve completed an experiment and fixed the failure mode, we can automate our chaos experiments as part of our CI/CD pipeline to continuously check for regressions.
With Gremlin, you can continuously validate the reliability of your systems and services against these and other categories of failure. Gremlin makes it easy to design and run experiments on your systems, schedule attacks for continuous validation, and integrate with your CI/CD platform. Gremlin includes a full suite of Scenarios for testing resilience to common failure modes including dependency failures, auto-scaling failures, host failures, DNS outages, and region evacuation. You can get started by requesting a free trial.