Black swan events might make you rethink your business continuity and disaster recovery plans—but there’s a catch. Isn’t there always?
Black swan events are supposed to be once-in-a-hundred (or a thousand) year happenings. But they seem to happen much more frequently: the once-in-a-century storm two years after the last once-in-a-century storm, the worldwide pandemic, or the software update that halted thousands of companies’ Windows systems.
All of these events seem so rare as to be improbable. Yet here we are, watching some of the world’s biggest organizations recover 8.5 million computers from a software update that put them into a cycling Blue Screen of Death (BSOD).
The CrowdStrike outage is just the latest in a long line of improbable events to throw a wrench into business IT. Let’s look at what you need to think about from now on, and trust me, it won’t be about your Endpoint Detection and Response (EDR). Every EDR vendor’s engineering team will burn the midnight oil to ensure this never happens again.
There are lessons to learn from an event like this—especially if you weren’t one of those companies with the sort of tech staff that digs into this stuff for fun and brings a barcode scanner to work with them the next day.
Let’s look at the short term first, and then take on lessons that stand the test of time and weather all the chaos that economics and technology can throw at us.
The Cloud Doesn’t Make IT Infallible
Regarding IT business continuity, the cloud has shifted a lot of heavy lifting onto the big vendors, but the risk has not entirely been removed. Ask me on the wrong day and I might even suggest that it’s just given us all different problems to tackle.
If you wrench your own infrastructure, replacing on-premises servers and software licensing with the cloud has removed a lot of physical equipment, configuration, and licensing headaches. But it’s vital to remember the old cynic’s take on the cloud: it’s just someone else’s computer. That someone else, however, probably has more resources to bring to bear on problems than you or I do.
Disaster recovery and business continuity for most organizations’ computing services are now managed by the big cloud and Software as a Service (SaaS) providers. With applications and server-side computing moved to the cloud, the problem (or so some may think) is now owned by Microsoft, Google, or Amazon.
It might be easy to take client computing and connectivity for granted, too. Until it all goes south…
Let’s break that last statement down a bit:
- Uptime, compute, storage, and fidelity are now the responsibility of cloud providers—for anything not taking place on your endpoints or network.
- The recent CrowdStrike BSOD outage has highlighted a problem with this: if your endpoints are stuck in purgatory, you can’t reach all that cloudy goodness.
- On top of that, the interconnectedness of cloud, SaaS, endpoints, and automated updates can make for a toxic mix and some weird and wonderful outages.
- The cloud might make outages someone else’s problem, but a loss of service can be catastrophic for many businesses. Knowing it’s not your fault is cold comfort if your company’s screens are blank.
The big thing in all this is that cloud uptime just didn’t matter in this particular event (July 2024). The problem was any Windows system running CrowdStrike: PCs, servers, and Windows machines in Azure. If all of your endpoints received the fateful update, there was no way of reaching your cloud services unless you had an alternate method of access.
However, the end effect would have been much the same had any of several other problems occurred.
Other Black Swans Compound the Central Lesson
COVID might feel like a distant bad dream, but the impact of a global lockdown tested companies like few other events. The business continuity and disaster recovery lessons from it are well worth a look. Overnight, entire industries sent their workers home from the office indefinitely, and all kinds of ways of working had to be reinvented. There are stories of IT departments raiding their storage rooms for old laptops, building infrastructure to support online collaboration overnight, and securing it for users connecting over their home broadband—sometimes with their own devices.
Okay, What’s the Takeaway?
Think about all the unexpected events that have hit your business in recent times.
How many did you anticipate? I’m guessing the answer to that is going to be somewhere between ‘none’ and ‘not many.’ Because they’re rare and unusual, they are almost by definition unexpected.
Yet when you face an event that is expected—such as a power outage, flood, or powerful storm—there’s procedure, experience, anticipation, improvisation, and resilience from many corners of your business. The best thing you can do when updating or making a business continuity plan is to expect unexpected events—and build on top of the experience you have of anticipated events.
Make the Most of the Small Business Superpower
Small and medium businesses have a secret power over bigger organizations, and that’s flexibility and adaptability. Turning around a department of 100 staff takes longer than talking to Gill and Bob about what’s needed. Relationships are often time-forged and replies are immediate—there’s less hiding behind email when you know each other personally.
You can get stuff done promptly when there’s a will to do it and not as many hoops to jump through.
Keep Your Planning Simple and Flexible
Another way to promptly respond is to keep your planning simple.
Focus on classes of problems (e.g., endpoint outages, cloud service outages, network or power interruptions) rather than building detailed scenarios for once-in-a-lifetime events. Leave the movie plots to screenwriters, software companies, government agencies, and service providers.
Separate these problems from their potential causes where it makes sense. A network outage or a data center fire has similar effects to a DDoS attack or a ransomware incident (although the resolution and recovery would be quite different).
One example is keeping a reserve of older PCs in off-site storage, something that was helpful both during the early days of COVID lockdowns and the recent CrowdStrike BSOD episode.
Look After the Important Stuff
Identify what’s most important to your business.
If you’re selling to consumers online, the payment system falling over is a big problem. This is less of a concern if you’re selling to other businesses.
Also, consider your employees.
What do they need to do their job safely and effectively without taking on dangerous levels of risk? What policies (such as working remotely or core hours) can you let slide in an emergency, and what are the absolute rocks you’ll have to work around?
Adopt and Rely on the Wisdom You Already Have
Think about the emergencies you have already handled successfully.
Events like winter storms, flooding, and power outages occur more frequently than a global pandemic or systemic IT collapse. There are plenty of best-case examples of how your company and its employees have fixed these problems in the past.
Convene the Disaster Club
Have a crisis team—even if it’s just a list of people who should be managing the crisis. If you can spare the people, have a second team as well.
You want people who can take a longer view. The urgent shouldn’t drown out the important.
Don’t Panic: Rely on Business Continuity and Disaster Recovery Experience
Finally (and I can’t stress this enough), don’t plan for the last disaster.
In some cases, it’s unlikely to repeat itself soon. Unless it truly is what insurers like to call an “Act of God”—weather, disease, the whole four horsemen stuff—then a lot of motivated and smart people have probably gone to great lengths to stop the same disaster from happening again.
Now, this does not mean you should ignore these as possibilities. But if you’ve already got good business continuity and disaster recovery practices and experience in other areas, you can rely on them to help you handle whatever incident is thrown your way.
What Next?
If it’s been a while, it’s a good time to review and test your organization’s business continuity and disaster recovery plans, even if that only entails reading through them with a core group and coming to a conclusion.
If you don’t have a plan, creating one needs to become your top priority.
Consider:
- The best plans are often a plan for a plan.
- Get the right structures in place to handle different types of effects on your organization.
- Don’t write an essay.
- Improvise and lean on your own experience to find a creative fix, as illustrated by the IT staff who resurrected all those PCs with the barcode scanners we mentioned earlier.
And, as always, if you need help building your plans, reviewing your plans, or adding to your business continuity and disaster recovery capabilities, don’t hesitate to reach out to us.