
Gremlin’s KubeCon ’25 reliability track
Headed to KubeCon North America next week in Atlanta? So are we! With over 300 talks, there’s something for everyone who’s into Kubernetes. But you’ve only got three days, so we’re here to help you narrow it down with our selection of reliability talks to check out!
Oh, and don’t forget to catch Gremlin at booth #1044 in Hall B4 (near the Demo Theater). We’ll have swag, and we’ll be talking about our latest integrations and releases, including Reliability Intelligence.
The Unofficial Kubernetes Reliability Track at KubeCon NA 2025
But what about reliability? The Multi-million dollar Kubernetes cost optimization question
Wed. 2:15pm
Zain Malik from Exostellar and Nibir Bora from Clean Compute present nine strategies for reducing Kubernetes costs while doing the impossible: improving reliability in the process. Their talk covers common pitfalls, including pod binning strategies, tuning requests and limits, handling API server pressure, and even how to use spot nodes without sacrificing reliability or predictability. This talk is valuable no matter where you are on your Kubernetes journey.
Zain and Nibir will host this talk on Wednesday at 2:15 p.m. in Building B, rooms B308-309.
Building resilient cloud-native infrastructure in the second decade
Wed. 5:30pm
Operational resilience is a huge focus for the CNCF, which is why its Technical Advisory Group (TAG) is hosting a session on the projects it helps maintain. This talk extends beyond Kubernetes, explaining how to integrate Observability, Business Continuity, Resource Optimization, Cost Efficiency, and Day 2 Operations to achieve Operational Resilience.
This talk is on Wednesday at 5:30 p.m. in the Thomas Murphy Ballroom, rooms 2-3.
Kubernetes and etcd: Common pitfalls and how to avoid them
Tues. 4:15pm
We’re no strangers to common Kubernetes failure modes, and neither are Broadcom’s Nabarun Pal and Arka Saha. This talk focuses on etcd, the distributed key-value store behind nearly every Kubernetes cluster, and the many ways it can fail. Nabarun and Arka will explore common causes of etcd failures, debugging methods, and best practices for operating etcd reliably.
This talk is on Tuesday at 4:15 p.m. in Thomas Murphy Ballroom 4.
Turn up the heat: Driving cloud native innovation into real-world impact
Wed. 9:07am
Industry leaders from Mailchimp, Bloomberg, Airbnb, and ByteDance share insights into the cloud-native strategies their companies implemented. While each leader has their own incredible story to tell, the one we’re most excited about comes from Maura Kelly of Mailchimp, who oversaw an on-prem migration to Kubernetes with 99.997% availability without slowing development velocity!
This talk is Wednesday at 9:07 a.m. in Exhibit Hall B2.
Other talks that piqued our interest
We love real-world experience and hearing the stories of how others have built reliable Kubernetes deployments. While these aren’t as directly related to reliability and Chaos Engineering, we’re excited to learn from these experts!
- From Code to Cluster: Orchestrating 100K+ Kubernetes Deployments with 1 Pipeline: Using built-in Kubernetes reliability tools to manage large-scale clusters.
- Making Application Rollouts Observable, Actionable and Boring: How LinkedIn manages successful large-scale deployments and identifies failures.
- Fix First, Investigate Later: How an observability tool caused a massive network failure. Maybe we can tie this to a Gremlin test?
- Beyond the Dashboard: Modern Observability for Platform Engineering at Scale: Panel of observability vendors and experts. Not necessarily reliability-focused, but tangential.
Looking forward to seeing all of you in Atlanta. Want to set up time to chat? Drop us a line and our team will reach out!
Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30-day trial.