The Dual Approach in Scaling: Chaos Engineering and Performance Engineering

“Do you want the most blazing program that works most of the time? Or do you want a program that maybe runs a little bit slower but it lets you sleep at night because it’s solid? I’ll go with solid every time”  - Bill Kennedy

Most enterprises are all too familiar with the struggles and complexities of scaling their environments and applications. Whether these applications live on premises, in a cloud environment, or somewhere in between in a hybrid state, engineering teams keep returning to an age-old question: “Can my application and environment scale? If this is successful, and we attract the users the business expects, will everything work as I’ve designed it to?” Vast improvements in technology have helped ease some of these fears, and the practice of Performance Engineering has been around for decades. Tools ranging from commercial offerings such as Micro Focus LoadRunner Professional to open source ones like JMeter have helped engineers put their systems to the test. This is an important practice for ensuring that systems perform and scale to meet customer (or business) expectations. It is the performance engineering group's responsibility to validate that a system can handle an influx of users during peak events such as Cyber Monday or a big promotional sale.

But performance testing often happens in a stable environment. These tests are usually run under ideal conditions that differ from real-world conditions: there are no service issues, regional outages, or any of the thousands of other complications found in on-premises or, more complex still, cloud-native environments. Simply put, scaling is incomplete unless it is coupled with resilience. It won’t mean much that your systems can scale if they are offline.

Despite our best efforts to scale, we’re often left in the dark about where we stand from a resiliency perspective. Groups should be asking themselves, “I know my application can handle 50k users, but can it handle those 50k users amidst a critical infrastructure outage or the outage of a dependent service?” If we don’t know the answer, then it doesn’t matter whether our systems can scale, because they might not even be up. Consider a simple analogy: the world’s tallest building, the Burj Khalifa in Dubai, which stands at a staggering 2,717 ft. We could equate performance engineering with the ability to make it the tallest building in the world. But a tall building that nobody can access, or that falls over in high winds, isn't very impressive. Simply look at how many other features were built into the tower so that it could withstand earthquakes, high winds, and, in all likelihood, failures in other portions of the building. This is the resiliency angle, and it is just as important as the performance one. You wouldn’t want to do one without the other.

At Gremlin, we want to take a proactive approach to making resilience a core part of an enterprise's DNA. How can we layer in the benefits of Chaos Engineering and marry them with the proven benefits of Performance Engineering? We believe that companies that adopt both approaches will not only be able to scale, but will scale in a way that keeps resiliency top of mind. This dual approach will allow these groups to reassure the business and delight end users. Benefits can include fewer incidents, higher availability, and more robust, scalable systems.