Gremlin’s KubeCon ’25 reliability track

Headed to KubeCon North America next week in Atlanta? So are we! With more than 300 talks on the schedule, there’s something for everyone who’s into Kubernetes. But you only have three days, so we’re here to help you narrow it down with our selection of reliability talks to check out!

Oh, and don’t forget to catch Gremlin at booth #1044 in Hall B4 (near the Demo Theater). We’ll have swag, and we’ll be talking about our latest integrations and releases, including Reliability Intelligence.

The Unofficial Kubernetes Reliability Track at KubeCon NA 2025

But what about reliability? The Multi-million dollar Kubernetes cost optimization question

Wed. 2:15pm

Zain Malik from Exostellar and Nibir Bora from Clean Compute present nine strategies for reducing Kubernetes costs while doing the impossible: improving reliability in the process. Their talk covers common pitfalls and topics such as pod binning strategies, tuning requests and limits, handling API server pressure, and even using spot nodes without sacrificing reliability or predictability. This talk is valuable no matter where you are on your Kubernetes journey.

Zain and Nibir will host this talk on Wednesday at 2:15 p.m. in Building B, rooms B308-309.

Building resilient cloud-native infrastructure in the second decade

Wed. 5:30pm

Operational resilience is a huge focus for the CNCF, which is why its Technical Advisory Group (TAG) is hosting a session on the projects it helps maintain. This talk extends beyond Kubernetes, explaining how to integrate Observability, Business Continuity, Resource Optimization, Cost Efficiency, and Day 2 Operations to achieve Operational Resilience.

This talk is on Wednesday at 5:30 p.m. in the Thomas Murphy Ballroom, rooms 2-3.

Kubernetes and etcd: Common pitfalls and how to avoid them

Tues. 4:15pm

We’re no strangers to common Kubernetes failure modes, and neither are Broadcom’s Nabarun Pal and Arka Saha. This talk focuses on etcd, the distributed key-value store behind all Kubernetes clusters, and the many ways it can fail. Nabarun and Arka will explore common causes of etcd failures, debugging methods, and best practices for operating etcd reliably.

This talk is on Tuesday at 4:15 p.m. in Thomas Murphy Ballroom 4.

Turn up the heat: Driving cloud native innovation into real-world impact

Wed. 9:07am

Industry leaders from Mailchimp, Bloomberg, Airbnb, and ByteDance share insights into the cloud-native strategies their companies implemented. While each leader has their own incredible story to tell, the one we’re most excited about is Maura Kelly of Mailchimp, who oversaw an on-prem migration to Kubernetes that maintained 99.997% availability without slowing development velocity!

This talk is on Wednesday at 9:07 a.m. in Exhibit Hall B2.

Other talks that piqued our interest

We love real-world experience and hearing the stories of how others have built reliable Kubernetes deployments. While these aren’t as directly related to reliability and Chaos Engineering, we’re excited to learn from these experts!

From Code to Cluster: Orchestrating 100K+ Kubernetes Deployments with 1 Pipeline: Using built-in Kubernetes reliability tools to manage large-scale clusters.
Making Application Rollouts Observable, Actionable and Boring: How LinkedIn manages successful large-scale deployments and identifies failures.
Fix First, Investigate Later: How an observability tool caused a massive network failure. A failure like this could be a good candidate for a Gremlin test.
Beyond the Dashboard: Modern Observability for Platform Engineering at Scale: Panel of observability vendors and experts. Not necessarily reliability-focused, but tangential.

We’re looking forward to seeing all of you in Atlanta. Want to set up time to chat? Drop us a line and our team will reach out!
