Preparing for a network failure
It’s a fact of computing life that things go wrong. Steve Cassidy explores the measures you can take to reduce recovery times when the Bad Thing happens
In business, it can be easy to sleepwalk into a crisis. That’s especially true when it comes to your network. Just sticking with a system that works (for now) can be enough to get you into trouble. The lifetime of a business-grade network switch can be as long as a decade, so if you set up rock-solid infrastructure but don’t take another look until the day it gets shut down by a dirty power cut, lightning strike or whatever, you might find it takes more than a reboot to get your business operating again. Let’s look at some precautions and responses that can minimise the pain.
Don’t rely on software fixes
Most network switches and routers run their own operating systems. You can usually manage them from a web browser, or an elaborate piece of intermediary software; often you can also log in and control them directly with a command-line interface.
But when that vital bit of hardware goes wrong, this software layer is likely to be a casualty (in fact, corrupted firmware may be the cause of your woes). Don’t assume you’ll be able to rely on built-in fault detection or remediation: you’re safer working on the premise that you’ll have to reload the firmware and rebuild your settings.
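If you accept that you may have to rebuild your settings from scratch, it helps to have a known-good copy of them stored somewhere other than the device itself. What follows is only a minimal sketch in Python, assuming you have already exported the configuration as text by whatever means your switch or router provides. The function names, file naming scheme and archive folder are illustrative assumptions, not part of any vendor’s tooling.

```python
import hashlib
from pathlib import Path


def archive_config(config_text: str, device: str, stamp: str, archive_dir: Path) -> Path:
    """Store one exported config plus a SHA-256 sidecar file; return the config path.

    `device` and `stamp` are free-form labels chosen by you (hypothetical naming scheme).
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    data = config_text.encode("utf-8")

    # Write raw bytes so the stored file matches the checksum on every platform.
    config_path = archive_dir / f"{device}-{stamp}.cfg"
    config_path.write_bytes(data)

    # Record the checksum next to the config, e.g. core-sw1-2024-01-01.sha256
    digest = hashlib.sha256(data).hexdigest()
    config_path.with_suffix(".sha256").write_text(digest, encoding="utf-8")
    return config_path


def verify_config(config_path: Path) -> bool:
    """Return True if the stored config still matches its recorded checksum."""
    expected = config_path.with_suffix(".sha256").read_text(encoding="utf-8").strip()
    actual = hashlib.sha256(config_path.read_bytes()).hexdigest()
    return actual == expected
```

Running verify_config on a saved file before you load it back onto replacement hardware is a cheap way to check that the backup has not itself been corrupted in the meantime.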
This is one of those areas where the cloud makes things worse. Most data centres use multi-tenant management arrangements, in which the fixing tool, monitoring tool and performance tool are all accessed through the same remote-access software platform. This means that if that platform is inaccessible, you have very little scope for hands-on emergency network surgery.
This realisation dawned on me in the early 2010s, when I began to see more and more Facebook posts from friends and colleagues depicting an anonymous building, with a caption of “back here AGAIN”. The reason is that, at the highest level of cloud management, there often comes a point when someone has to pick up their laptop with all the switch firmware configuration utilities and architecture images, drive out to some out-of-town data centre in the middle of the night, and start crawling around in the back of a network cabinet. If you’re using hosted resources, it’s best to accept that you may need to visit them from time to time.