Originally published on Forbes.com.
In today's world, where nearly every business is an online business that relies on software, downtime has a direct impact on the business metrics we all care about. Simply put, if your website is down, you are losing money. Real money.
For example, in 2017, organizations were losing an average of $100,000 for every hour of downtime on their site. A single outage can cost a company millions of dollars. After a technological failure stranded thousands of British Airways passengers in May 2017, the CEO explained that the failure had cost the company £80 million ($102.19 million). That doesn't include the negative impact outages can have on your brand reputation, or the engineering cost of retroactively figuring out the root cause of an issue.
These numbers should be alarming, but too often chief technology officers (CTOs) and chief information officers (CIOs) look at downtime as something that should never happen (which is unrealistic) or as an unavoidable part of doing business (which is untrue). The fact of the matter is, the more an organization treats operations as a first-class citizen, the more engineering effort it saves in the long run, effort that could instead go into shipping features and addressing customer requests. The time and resources spent digging into metrics, writing post-mortems and compiling a laundry list of action items after a major incident can all be saved by shifting more of the operations burden upfront.
This is why I believe in taking a proactive approach to avoiding downtime and building up resiliency. By practicing chaos engineering and continuously testing how your system responds to stressors, you can locate and repair failures before they affect your customers or become public. If you're not familiar with the discipline of chaos engineering, I have answered many of the frequently asked questions about it elsewhere. The high-level idea is that, similar to a flu shot, you thoughtfully inject failure into your systems to avoid more serious problems down the line.
It may sound counter-intuitive, but given the rise of microservices and distributed cloud architectures, there's simply no way to understand how all these moving parts behave without injecting some failure thoughtfully and on your own terms. If you wait until something breaks in production, you have already hurt the customers who were trying to use your service. The damage isn't limited to your bottom line in the moment: you can also lose customer trust and spend a lot of engineering resources figuring out what happened after the fact.
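To make the idea of injecting failure on your own terms a little more concrete, here is a minimal sketch in Python. It is only an illustration, not a real chaos engineering tool: the names `with_fault_injection`, `fetch_recommendations`, `render_homepage` and `run_experiment` are all hypothetical, and the "dependency" is an in-process function standing in for a downstream service. The experiment deliberately fails a fraction of dependency calls and then checks whether the page still responds with a degraded result instead of an error.

```python
import random


class InjectedFailure(Exception):
    """Raised by the fault injector in place of a real dependency error."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("failure injected by experiment")
        return func(*args, **kwargs)
    return wrapper


def fetch_recommendations(user_id):
    """Hypothetical downstream dependency (stands in for a network call)."""
    return [f"item-{user_id}-1", f"item-{user_id}-2"]


def render_homepage(user_id, fetch):
    """Serve the page even when the recommendations dependency fails."""
    try:
        items = fetch(user_id)
    except Exception:
        items = []  # fallback: degrade gracefully instead of erroring out
    return {"status": 200, "recommendations": items}


def run_experiment(failure_rate, requests=1000, seed=42):
    """Send requests through the faulty dependency and summarize the impact."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_fetch = with_fault_injection(fetch_recommendations, failure_rate, rng)
    responses = [render_homepage(i, flaky_fetch) for i in range(requests)]
    ok = sum(1 for r in responses if r["status"] == 200)
    degraded = sum(1 for r in responses if not r["recommendations"])
    return {"requests": requests, "ok": ok, "degraded": degraded}


if __name__ == "__main__":
    # Start with a small blast radius, then raise the failure rate.
    for rate in (0.0, 0.3, 1.0):
        print(rate, run_experiment(rate))
```

In a real system the injection would happen at the network or infrastructure level and the blast radius would be widened gradually, but the shape of the experiment is the same: state what "healthy" means, inject a controlled failure, and check whether customers would have noticed.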