Chaos Engineering is a powerful tool that can help you prove that your enterprise IT systems comply with regulations and standards covering risk management and disaster recovery.
Governance. Risk management. Compliance. Often these three terms are combined into a strategy (GRC) for managing an enterprise-level company’s adherence to regulations covering critical applications. The three aspects are simple to define.
Governance provides oversight of how things like information technology (IT) are managed with respect to the business goals of the company. We wrote about a related topic previously, change management. In that article we advocate for looser governance in some aspects of software development. Most industries have strict IT policies and procedures that are firm requirements when it comes to other development aspects, especially data management.
Risk management covers seeking out, finding, and mitigating potential problems that could negatively impact business needs, including aspects like information security, data security, and system outages. Reliability is the goal of mitigating risk, and some practices are better than others.
Compliance ensures that the business follows laws, regulations, and standards that control things like data security requirements, business continuity, and disaster recovery. Compliance can be difficult because it varies not only by industry or part of a company, but also by location and by political boundaries such as national borders. The nuances of compliance across countries can be difficult to manage.
Some industries require compliance with regulations in order to function. Others can successfully and legally operate without any compliance requirements. Compliance is a broad bucket that can mean anything from a self-prescribed practice to externally-imposed laws or governmental regulations. There are some good articles out there with tips for getting started. We would like to add an additional item to the list of tips for success.
Chaos Engineering is preventative medicine that creates carefully-crafted experiments. These chaos experiments include a controlled and limited blast radius to find out what actually happens to a system when something goes wrong, for the ultimate purpose of enhancing reliability.
Systems today are more complex than ever, and managing all of the moving pieces is a Herculean task that no human can complete. We strive for high availability (HA) using technologies like onsite or offsite virtualization, or by outsourcing compute and data storage to a cloud provider.
Some of us even have hybrid clouds where we keep sensitive data in house at all times, in company-owned data centers, while moving other processing to an external cloud. We not only use a service-oriented architecture (SOA) but even microservices in containers scattered across the globe.
Today, we must assume that failure of components will eventually occur and we must have solutions in place to detect and automate repair quickly, before business needs and customers are impacted. This is why Netflix created Chaos Monkey and why we have expanded far beyond that foundation with ways to test your system to prevent major failures in production. Amazon does it. Google does it. And a growing number of Gremlin customers are proving the value of Chaos Engineering every day.
This complexity and risk are why we hire site reliability engineers (SRE) to protect and, when needed, restore our system’s capabilities to a proper steady state after a malfunction or component failure. These engineering teams need every tool available at their disposal to do their work at the highest level of effectiveness.
This is also how Chaos Engineering can help. Practitioners of Chaos Engineering test their systems against strict criteria to prove they work as designed or to find flaws and fix them while they are still small and before they impact the system in a big way. This practice translates well to testing a system against the rules and conditions of regulatory standards.
The simple answer is “it depends.” Specifically, it depends on your industry and your location. For example, banking and commerce have different standards than health care on some facets of IT work and overlap on others.
In this section we will discuss and compare three common and fairly well-known standards: SOC 2, GDPR, and CCPA. We will then explore ways Chaos Engineering can serve as a strategic control process, assisting with the formation, execution, and management of regulation-related plans. You can follow a similar pattern using other frameworks, as Adrian Cockcroft does in the article Failure Modes and Continuous Resilience with ISO standard analysis techniques like Failure Mode and Effects Analysis (FMEA) and Systems Theoretic Process Analysis (STPA).
System and Organization Controls (SOC) 2 is an auditing procedure built on a set of criteria for evaluating the controls at service organizations. These controls cover how customer data is managed, based on five “trust service principles”: security, availability, processing integrity, confidentiality, and privacy, as they relate to the systems that a service organization uses to process users’ data.
The criteria were developed by the American Institute of CPAs, which sets national standards and rules in the United States related to accounting and also certifies both practitioners for practice and organizations for compliance. It is a sibling of SOC 1, which is focused on internal control over financial reporting.
Certification for SOC 2 is issued by outside auditors. There are two types of reports. Type I describes a vendor’s systems and whether their design is suitable to meet relevant trust principles. Type II is more significant in scope. It takes the Type I description of how things should function, including the plan created showing how to satisfy the established trust criteria, and gives it to an auditor.
Over a 6-12 month period, the auditor(s) review a sample of systems and process documentation to confirm that the company is adhering to what is described in the Type I report and that the controls are effective at satisfying the underlying criteria. At the end of the observation period, a report is submitted to the company indicating any deviations found from what was expected. It is the responsibility of the company to evaluate whether those deviations pose an undue risk to the business. Gremlin has undergone a SOC 2 Type II audit with no significant deviations found (that’s the SOC terminology for passing).
SOC 2 compliance is not a legal requirement for software as a service (SaaS) and cloud computing vendors. It is a voluntary process that some vendors choose to undergo because some potential customers simply will not do business with a SaaS vendor who is not certified. Vendors who are audited regularly and certified as compliant earn a greater level of trust from their customers.
General Data Protection Regulation or GDPR is a set of data privacy laws from the European Union (EU) that standardizes how data may be handled across the member countries. It is comprehensive and detailed, including clearly stated rules for businesses and organizations and also clearly defined rights for citizens.
Unlike SOC 2, GDPR is not voluntary. It is compulsory. Any entity doing business in the EU must adhere to the mandatory stipulations of the law, even if that business is not headquartered in the EU. Every citizen in the European Economic Area (EEA) is protected by this law, regardless of where the business storing or processing data is headquartered. If you have customers from any one of the European member states, those customers are under the jurisdiction of GDPR and so is the data belonging to those customers.
Many of the stipulations have to do with consent and with how data is used, accessed, and deleted. There are also regulations surrounding data protection. While there is no official certifying body, we can say that as far as we are aware and according to our legal counsel, Gremlin is in compliance with the GDPR.
The California Consumer Privacy Act (CCPA) is a state data privacy law that applies to any for-profit business that:
- Has a gross annual revenue of more than $25 million or
- Derives more than 50% of its annual income from the sale of California consumer personal information or
- Buys, sells, or shares the personal information of more than 50,000 California consumers annually.
This legislation was made in California, but any enterprise that fits any one of the above criteria and does business in the state must comply with the statute or stop doing business there, regardless of where the business calls home. With California being the fifth-largest economy in the world, businesses have a strong motivation to comply. In addition, bills or bill drafts relating to data privacy have been introduced or filed in at least 25 states and in Puerto Rico, so this is only going to grow.
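The applicability criteria above amount to a simple "any one of three" check. The following is a minimal sketch of that logic, using only the thresholds quoted in this article. The `BusinessProfile` type, its field names, and the helper functions are hypothetical names introduced for illustration; this is not legal advice, and the statute itself should be consulted for the exact definitions.

```python
from dataclasses import dataclass
from typing import List

# Thresholds as quoted in this article (illustrative, not legal advice).
REVENUE_THRESHOLD_USD = 25_000_000
INCOME_SHARE_THRESHOLD = 0.50
CONSUMER_COUNT_THRESHOLD = 50_000


@dataclass
class BusinessProfile:
    """Hypothetical summary of the facts the three criteria depend on."""
    does_business_in_california: bool
    gross_annual_revenue_usd: float
    income_share_from_selling_ca_personal_info: float  # fraction, 0.0 to 1.0
    ca_consumers_handled_per_year: int


def criteria_met(profile: BusinessProfile) -> List[str]:
    """Return the names of the criteria this business satisfies."""
    met = []
    if profile.gross_annual_revenue_usd > REVENUE_THRESHOLD_USD:
        met.append("revenue")
    if profile.income_share_from_selling_ca_personal_info > INCOME_SHARE_THRESHOLD:
        met.append("income_share")
    if profile.ca_consumers_handled_per_year > CONSUMER_COUNT_THRESHOLD:
        met.append("consumer_count")
    return met


def must_comply(profile: BusinessProfile) -> bool:
    """One criterion is enough, but only if the business operates in the state."""
    return profile.does_business_in_california and bool(criteria_met(profile))


# A small SaaS vendor with modest revenue but many California users.
small_saas = BusinessProfile(True, 4_000_000, 0.0, 80_000)
print(criteria_met(small_saas), must_comply(small_saas))
```

Note that every comparison is strict ("more than"), so a business sitting exactly on a threshold does not meet that criterion in this sketch.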
The three standards we chose to include here have some strong similarities. All of them cover the proper handling of data with respect to privacy. To keep data private, you must have implemented strong security practices.
Only one standard we described in this article, SOC 2, is voluntary. However, it has the strongest stipulations surrounding measures like the physical security of data access. In addition, SOC 2 covers oversight of the entire organization and the controls governing the security, availability, and processing integrity of all of the systems involved in processing users’ data, as well as the confidentiality and privacy of that information. SOC 2 includes much broader criteria than either GDPR or CCPA.
On the surface, GDPR and CCPA are very similar. Both are mandatory in their respective locales. Both cover the privacy of personal data and how that data may be handled. Both are enforced by the government entity that enacted them.
One significant difference between the GDPR and CCPA is that in some cases the CCPA only covers and considers data that was provided by a consumer and excludes data that was purchased by or acquired through third parties. The GDPR focuses on all data related to the EU consumer and citizen.
Unlike the GDPR, the CCPA does not cover medical data, although other regulations like the California Confidentiality of Medical Information Act (CMIA) and the US-wide Health Insurance Portability and Accountability Act (HIPAA) do.
Chaos engineering as a controls strategy. Yes, you read that right. This crazy “chaos” thing meeting the staid world of governance, risk, and compliance? Yes, because, in fact, that’s what chaos engineering excels at: identifying risk in systems that are too complex to feasibly test in other ways. We’ve been using the concept of “environment” to contain risk, and “testing in production” is one of the worst things a team can be accused of. But what happens when it’s no longer feasible to fully test in pre-production environments? A brave new world awaits.
Charles Betz, Principal Analyst at Forrester, in Predictions 2020: Three Big Changes in Store for DevOps
Some aspects of regulatory compliance may be outside the scope of Chaos Engineering tests, such as testing whether our password storage is properly encrypted or our multi-factor authentication (MFA) requirement is properly limiting access.
Other aspects, like the testing of disaster recovery planning and solutions, are well within the realm of chaos testing. Chaos Engineering excels at finding risk in complex systems.
Use Chaos Engineering to:
- Test that your data redundancy and automated failover methods work properly and that you meet your recovery time objective (RTO) and recovery point objective (RPO)
- Find out whether the automated mitigation you have designed for the loss of one database in a redundant set spins up new instances as intended. This demonstrates data loss prevention, because you have tested your backups and know they work before the last original copy is gone. It also shows that data recovery would work, even though it should be unnecessary because the data was never lost in the first place.
- Confirm that your data preservation methods are resilient to network errors; do write commands back up and overload the preservation database?
- Discover whether your data processing monitoring properly records events and sends alerts when incident or disaster recovery thresholds are crossed and notifications or actions are expected
- Determine the performance of your system under heavy load, including aspects like high CPU usage, increasing network latency or packet loss, and even unusual disk I/O conditions
- Demonstrate that your system can successfully handle a cyber attack like a DDoS
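As an illustration of the first item in the list above, the sketch below checks the timestamps collected during a failover experiment against an RTO and an RPO. It assumes you can record three moments from your own monitoring: when the failure was injected, when the service was healthy again, and the newest write that survived on the replica. The `FailoverObservation` type and `evaluate_recovery` function are hypothetical names, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FailoverObservation:
    """Hypothetical timestamps gathered from monitoring during one experiment."""
    failure_injected_at: datetime    # when the attack took the primary down
    service_restored_at: datetime    # when health checks passed again
    last_durable_write_at: datetime  # newest write that survived on the replica


def evaluate_recovery(obs: FailoverObservation, rto: timedelta, rpo: timedelta) -> dict:
    """Compare observed recovery time and data-loss window to the objectives."""
    recovery_time = obs.service_restored_at - obs.failure_injected_at
    # Writes made after the last durable write and before the failure are lost.
    data_loss_window = max(
        obs.failure_injected_at - obs.last_durable_write_at, timedelta(0)
    )
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "rto_met": recovery_time <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Example: failover took 90 seconds and the last 5 seconds of writes were lost.
t0 = datetime(2020, 1, 1, 12, 0, 0)
obs = FailoverObservation(
    failure_injected_at=t0,
    service_restored_at=t0 + timedelta(seconds=90),
    last_durable_write_at=t0 - timedelta(seconds=5),
)
result = evaluate_recovery(obs, rto=timedelta(minutes=2), rpo=timedelta(seconds=10))
print(result["rto_met"], result["rpo_met"])
```

Keeping the result of each run, rather than only a pass or fail flag, gives you a record of measured recovery times that can be shown to an auditor.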
It is no longer feasible to test for many of these things in any environment except production, especially when trying to prove that our overall system is going to be reliable under extreme conditions. We cannot reproduce those extreme conditions across our complex system in the simplicity of a staging or test environment.
What we can do is design chaos experiments that are carefully limited in scope, with a controlled blast radius that protects important components. We start with an extremely limited magnitude for the chaos engineering attack, something that will yield good monitoring data, but which has no ability to cause serious harm. GameDays are a great way to begin.
As we gain confidence in our system, we increase the magnitude of the attack until we are confident that the system can withstand even highly unusual disaster conditions with limited or no business impact. Limiting the impact of failures also limits downtime and helps us meet our service-level objectives (SLOs) and service-level agreements (SLAs).
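The "start small, then increase the magnitude" approach can be sketched as a loop that halts as soon as an abort condition is breached. This is a minimal illustration under stated assumptions: `run_attack` is a hypothetical callable standing in for whatever tooling injects the fault and returns an observed error rate, and `simulated_latency_attack` is a made-up stand-in so the example runs on its own.

```python
from typing import Callable, Iterable, List, Tuple


def escalate_attack(
    run_attack: Callable[[int], float],
    magnitudes: Iterable[int],
    max_error_rate: float,
) -> Tuple[List[int], bool]:
    """Run attacks from smallest to largest magnitude.

    Stops at the first magnitude whose observed error rate exceeds
    max_error_rate. Returns the magnitudes the system withstood and a flag
    that is True if the escalation was halted early.
    """
    withstood: List[int] = []
    for magnitude in sorted(magnitudes):
        observed_error_rate = run_attack(magnitude)
        if observed_error_rate > max_error_rate:
            return withstood, True  # abort: do not escalate any further
        withstood.append(magnitude)
    return withstood, False


def simulated_latency_attack(added_latency_ms: int) -> float:
    """Stand-in for a real attack: errors appear once latency reaches 400 ms."""
    return 0.0 if added_latency_ms < 400 else 0.05


withstood, halted = escalate_attack(
    simulated_latency_attack, magnitudes=[100, 200, 400, 800], max_error_rate=0.01
)
print(withstood, halted)
```

In this simulated run the system withstands 100 ms and 200 ms of added latency, and the escalation halts at 400 ms without ever attempting 800 ms, which is the point of the abort condition.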
If done systematically and well, most problems that we find using Chaos Engineering can be solved and mitigated against in production before there is any customer impact at all. Building our systems up to withstand problems more serious and severe than the limits imposed by regulatory standards is one way to demonstrate conclusively to any auditor that we are in compliance.
Chaos Engineering is a relatively new practice. Some applications of the practice are well-known and in increasingly common use today. Small amounts of fault injection give us data that helps us make our systems more and more fault-tolerant.
We will close this article with some thoughts.
Traditional disaster recovery testing, where you evacuate to a hot site data center, takes weeks of planning and more weeks to implement and test. For often outdated compliance reasons, this is still a box that needs to be checked.
We wrote about this in an article titled Updating the Industry’s Reliability Practices, where we share some opinionated thoughts about how Gremlin makes it easy and safe to use failure to validate engineering assumptions and bring to light unknown points of failure in your systems and processes. It is easy to take those assumptions and broaden them to include proving how and whether we comply with various standards.
When we study and test our systems with Chaos Engineering, we can learn and confirm that these systems can withstand even major disaster scenarios. We are convinced that this can be done to the point of being confident that we meet or exceed the requirements of even strict standards.
When we break small things on purpose and fix the issues we find, we prevent real-world events from causing larger failures and more severe problems. This is a cost-effective method of not only demonstrating compliance with regulations, but exceeding their requirements and consistently pleasing our customers.
This article proposes an idea for an application of Chaos Engineering that is just starting to take shape. As time passes, we are likely to find more and more ways to use this growing science to continue to deliver increasing business value, including providing ways to speed up the data-gathering process necessary for demonstrating regulatory compliance.