Self-service reliability with Internal Developer Platforms and Chaos Engineering
Up until the early 2000s, developers and Ops (then known simply as IT) had separate and often competing objectives, separate department leadership, separate key performance indicators by which they were judged, and often worked on separate floors or even in separate buildings. The result was siloed teams concerned only with their own priorities: long hours, botched releases, and unhappy customers.
DevOps was introduced in 2008 to address such dysfunctional workflows and bring everyone under the same roof. Throughout the 2010s, this new approach to software delivery, together with cloud, containerization and microservice architectures, brought about a great deal of improvement in how teams could get code out of the door and onto their new scalable infrastructures.
While it had massive positive implications across the industry, this evolution also made things a lot more complicated. In most modern setups, just to deploy and test a code change, a developer needs to touch between 5 and 10 different tools, run bash scripts maintained by the Ops team, and (try to) understand kubectl and Helm charts, and the list only grows, depending on how “mature” your setup is. For enterprise and high-growth engineering teams in particular, testing and deploying can involve all sorts of cross-team dependencies, waiting times and, ultimately, frustration for everyone involved.
Chaos Engineering and Internal Developer Platforms (IDPs) have emerged as the natural answers to these new cloud native challenges.
No amount of QA or other traditional testing can verify whether your application, its various services, or the entire system will respond reliably under any condition, whether "working as designed" or under extreme loads and unusual circumstances. Because Chaos Engineering can test the quality of code at runtime and has the potential for both automated and manual forms of testing, it’s become an indispensable tool in the modern QA toolbox.
Providing Ops teams with a standardized way to set application configuration baselines and establish golden paths for the rest of the engineering organization, IDPs are increasingly the go-to for teams that need a scalable approach to deploying and testing their apps and services, without compromising on the end Developer Experience. IDPs give teams the key to true DevOps, by enabling actual developer self-service.
In this article we’ll discuss the benefits of IDPs and Chaos Engineering, showing how Humanitec and Gremlin, the respective market leaders in both Internal Developer Platforms and Chaos Engineering, can work together to create a next generation deployment and testing experience for teams.
Developer self-service with Humanitec IDP
Internal Developer Platforms are the core glue to any modern Ops setup. They allow application developers to self-serve any tech and tool they need, autonomously and with no need for support from a central Ops team. Ops focus on setting baseline configurations and golden paths and let developers interact independently and effortlessly with the underlying infrastructure. This eliminates Ops overhead and unnecessary dependencies, while making full developer self-service possible.
Using Humanitec, Ops teams can wire up their whole setup and orchestrate their infrastructure from one control plane. They easily manage app configurations as well as roles and permissions across their entire organization. Developers can then spin up fully provisioned environments, add any image and resources (like DBs, DNS, storage, etc.) and deploy through a unified UI, CLI or API. They can then manage deployments, performing rollbacks and diffs, versioning configurations the same way they do with code in Git. They also gain multiple layers of visibility into their applications, at the container, cluster and app level.
Teams that implemented an IDP using Humanitec have increased their deployment frequency by 4x. Waiting times have dropped to nearly zero in most cases, as engineers can self-serve what they need, when they need it. MTTR is reduced on average by 60%.
Build resilience with Gremlin
Chaos Engineering is the practice of intentionally performing experiments on systems for the purpose of improving their resilience. This is done by injecting measured amounts of failure into the system, observing how it responds, and using these observations to identify and fix failure points. Teams that frequently practice Chaos Engineering see their availability increase to over 99.9%, lower mean time to resolution (MTTR), and fewer high severity incidents.
The goal of Chaos Engineering isn’t to create chaos, but to mitigate chaos. While it does involve injecting failure, the size and scope of this failure is carefully controlled and designed specifically to uncover issues that can’t be uncovered using traditional forms of testing. These include technical issues (such as component failures), as well as operational issues (such as validating monitoring configurations and incident response procedures).
Once you’ve used Gremlin to run chaos experiments and build resilience into your systems, the next step is to automate your experiments for continuous validation. Systems change over time, especially as developers push new code. Gremlin provides a REST API that you can use to run chaos experiments as part of your CI/CD pipeline, ensuring a consistently high level of reliability with each new build.
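As a sketch of what this looks like in a CI/CD job, the snippet below builds a request body for a CPU attack and shows where the call to Gremlin's REST API would go. The endpoint URL, attack arguments, and target fields are assumptions based on Gremlin's public API conventions, not exact schemas; check the Gremlin API reference before using them.

```python
import json

# Assumed endpoint for triggering an attack via the Gremlin REST API.
GREMLIN_API = "https://api.gremlin.com/v1/attacks/new"

def build_cpu_attack(app_label, percent=50, length_s=60):
    """Build an illustrative request body for a CPU attack that targets a
    random container carrying the given app label."""
    return {
        "command": {
            "type": "cpu",
            # "-l" = attack length in seconds, "-p" = CPU percentage
            "args": ["-l", str(length_s), "-p", str(percent)],
        },
        "target": {
            "type": "Random",
            "containers": {"labels": {"app": app_label}},
        },
    }

payload = build_cpu_attack("checkout-service")
body = json.dumps(payload)
# In a CI step you would send this with your team's API key, e.g.:
# requests.post(GREMLIN_API, data=body,
#               headers={"Authorization": "Key <GREMLIN_API_KEY>",
#                        "Content-Type": "application/json"})
```

Gating the pipeline on the experiment's outcome (rather than merely firing it) is what turns this into continuous validation: a build only ships if the system stayed healthy under the injected failure.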
How to integrate Humanitec IDP with Gremlin
Every time your engineers deploy a change to your services or applications, it’s best practice to ensure your apps and infrastructure setup are stable enough to withstand a Chaos Engineering test. By integrating your Internal Developer Platform with Gremlin, you enable your developers not only to self-serve the tech they need and deploy autonomously, but also to stress test those deployments in one highly functional, stable delivery flow.
To implement this unified flow for your team, you need to wire up your infrastructure to Humanitec and set up Gremlin in your cluster, defining the chaos engineering tests you’d like to run after each deployment. Once that is all set up you can connect your IDP to trigger the Gremlin API through Humanitec webhook events.
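For a rough idea of the shape of such a webhook registration, the JSON below points a post-deployment trigger at the Gremlin API with the team's API key in a header. The field names, trigger name, and header support here are assumptions for illustration; consult the Humanitec API documentation for the exact webhook schema.

```json
{
  "id": "gremlin-attack-on-deploy",
  "url": "https://api.gremlin.com/v1/attacks/new",
  "triggers": ["post_deploy"],
  "headers": {
    "Authorization": "Key <GREMLIN_API_KEY>",
    "Content-Type": "application/json"
  }
}
```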
After each deployment, the webhook sends an API POST request to Gremlin and triggers your predefined attacks. You can check the test results in your Gremlin dashboard or feed them back into the IDP for your developers to consume directly. We put together a video tutorial showing how the integration works in detail.
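If you prefer to put a small relay service between the IDP and Gremlin (for example, to run attacks only against certain environments), its decision logic might look like the following. The Humanitec event fields ("type", "environment", "app") and the Gremlin attack body are illustrative assumptions, not the exact schemas of either API.

```python
import json

def handle_deployment_event(event):
    """Inspect a deployment webhook event and, for staging deployments,
    return a Gremlin attack request body; return None to ignore the event."""
    if event.get("type") != "deployment.finished":
        return None
    if event.get("environment") != "staging":
        return None  # only run chaos tests against non-production deploys
    return {
        "command": {"type": "shutdown", "args": ["-d", "1"]},
        "target": {
            "type": "Random",
            "containers": {"labels": {"app": event.get("app", "")}},
        },
    }

attack = handle_deployment_event(
    {"type": "deployment.finished", "environment": "staging", "app": "demo"}
)
# When attack is not None, the relay would POST serialized to the Gremlin
# REST API with an "Authorization: Key <team-api-key>" header.
serialized = json.dumps(attack)
```

Filtering on the environment in the relay keeps destructive experiments away from production until your team is ready to run them there deliberately.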
In this video you will see the setup of a deployment webhook in Humanitec. After an application is deployed, this webhook will trigger a Chaos Engineering test via the Gremlin API. While this test is killing pods for the application, you can see in the Kubernetes cluster (shown via k9scli.io) that the restart count is going up. This shows that our cluster is able to quickly detect and restart failed pods without significant downtime.
With internal developer platforms and Chaos Engineering, you can automate development workflows and ensure that each new deployment meets your reliability standards. Create a self-service pipeline that lets your developers release faster, while also ensuring high availability and fewer incidents. To see how to integrate Gremlin with Humanitec, see this tutorial on triggering an attack on deployment.
Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30 day trial.