I hope you’ve heard of Netflix’s famed Chaos Monkey and are familiar with the general concept of “Chaos Engineering”. I love Netflix: in terms of reliability, fault tolerance, and high availability, they’re at the top, with systems that could be considered beyond the reach of “mere mortals”. They run systems designed to regularly destroy their other systems in production. They do this to ensure that their systems are built to withstand the Chaos of modern technology. You may not think you can achieve something similar with your own systems, but there are small steps you can take as part of your DevOps development process to begin your journey towards more resilient applications.

My own (unwilling) taste of Chaos

A few years ago I accidentally destroyed a development environment. I’d made some manual changes to an AKS Cluster while familiarising myself with a feature. I’d updated the Terraform template to match the manual changes I’d made, then run it with -auto-approve. I watched as it determined that the state didn’t match and that resource re-creation was required. I hit ctrl+c a second too late, and the AKS Cluster started to delete itself. For Kubernetes, there are two parts to orchestrating a Cluster: creating the cluster and underlying infrastructure, and configuring and deploying services to the cluster. The reason this was so immediately upsetting was that I’d done a lot of work getting the nginx-ingress deployments working on my cluster, and the prospect of going through the whole process again was daunting. As I stared at the screen, I got a message on Slack: “When will the environment be back up? We need to begin implementing Application Insights.”

Luckily, I’ve been in the DevOps game for a few years, and while I hadn’t committed the Kubernetes configuration to git yet, I’d saved it all as I was working through the process. I responded on Slack that there had been a few issues and that it would be an hour. I rolled up my sleeves and waited for the AKS Cluster to re-create. After twenty minutes, I was looking at a newly formed Kubernetes cluster, with the cluster prompting me to download the kube-config file. I connected to the cluster, created the nginx-ingress service, and configured cert-manager. The namespaces appeared and the pods seemed to be running: so far, so good. I deployed the mock middleware server; the pods came online, the ingress rules started working, and the SSL certificate was issued.

After five minutes, I browsed to the address and it was working, SSL and all. Finally, I deployed the application and it just worked! I actually couldn’t believe it: the last time I’d deleted something in production, I spent two days rebuilding it (that was about six years ago, go easy on me). This thankfully short, yet stressful, exercise created a paradigm shift within me: before I deploy anything and hand it off, I need to delete it first. If I can’t re-create it, then it isn’t ready for use. It’s that simple.

Living on the Edge

One of the biggest problems in the technology industry at large is the culture of mistaking “Proof of Concept” for “Minimum Viable Product”. The key word here is viability: how can applications that fail so easily be considered viable? The unvarnished truth is that they aren’t. It’s only through sheer luck that many businesses make it through the shaky phase of initial deployments without going out of business and descending into true chaos. Dodgy deployments become the standard, and inevitably, years on, we’re still as susceptible to critical failures as we were when the first “Hello, World!” was pushed to production.

The Elephant in the Room: Complexity and Where to Begin

One of the reasons I, and everyone else, get into situations where an environment may be unrecoverable is that the environment has become so complex. Years of spaghetti deployments, modifying code on production servers, and critical incidents leave a house of cards that is ready to fall over at the smallest breeze. Where can you possibly begin? The answer: start small.

The next time you deploy a new application, make it as automated as possible. Delete and rebuild it before you hand it over to the customer. We don’t all have the luxury of greenfield deployments, but the pace of IT means there is always at least something new. If you support a large application, just try moving one of the services onto automated infrastructure. Just start, and piece by piece your stress will decrease and your environments will get easier to administer. Reduce complexity, increase automation, and be more resilient to any chaos that may descend on your systems.
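To make the "delete and rebuild before handover" habit concrete, here is a minimal Python sketch of what such a drill could look like. It is an illustration, not the exact process I used. The directory names "infra" and "k8s", and the function names rebuild_drill_commands and run_drill, are hypothetical. It assumes Terraform manages the infrastructure and plain manifests are applied with kubectl, and it only ever belongs in an environment you are allowed to destroy.

```python
import subprocess
from typing import Callable, List


def rebuild_drill_commands(infra_dir: str, manifests_dir: str) -> List[List[str]]:
    """Return the ordered commands for a destroy-and-rebuild drill.

    Both directory arguments are hypothetical placeholders for wherever your
    Terraform templates and Kubernetes manifests live.
    """
    return [
        # Tear the environment down on purpose.
        ["terraform", f"-chdir={infra_dir}", "destroy", "-auto-approve"],
        # Re-create the cluster and underlying infrastructure.
        ["terraform", f"-chdir={infra_dir}", "apply", "-auto-approve"],
        # Re-deploy the services. This assumes kubectl already points at the
        # rebuilt cluster; fetching fresh credentials is platform-specific
        # and deliberately left out of this sketch.
        ["kubectl", "apply", "-f", manifests_dir],
    ]


def run_drill(commands: List[List[str]], run: Callable = subprocess.run) -> bool:
    """Run each command in order and stop at the first failure.

    `run` is injectable so the drill can be dry-run or tested without
    touching real infrastructure.
    """
    for cmd in commands:
        result = run(cmd)
        if result.returncode != 0:
            print("Drill failed at:", " ".join(cmd))
            return False
    return True
```

Calling run_drill(rebuild_drill_commands("infra", "k8s")) would execute the three steps for real. If it returns False, the environment isn't ready to hand over yet, and you have found that out before your customer did.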

Having trouble with unicorn environments?

Still working with a ten-year-old environment because you’re afraid you can’t rebuild it if it gets deleted? Get in touch and we can discuss a way to secure your environments from disaster and ensure a spanner in the works only causes a hiccup and not a bang!
