When technology fails us, we’re at a loss. Just think about the last time your phone died. Didn’t it seem like the longest day of your life?
Maybe that’s an exaggeration, but many businesses are completely reliant on technology to run smoothly. If you’ve ever been in a restaurant when its point-of-sale system goes down, you know what I mean. The kitchen comes to a halt as staff try to figure out which orders made it in before the crash and which didn’t. Everyone on staff has to adjust to a new way of ordering food and drinks. No one can add up a check and figure out the tax quickly. Service becomes excruciatingly slow. Nerves are frayed.
Our organizations are no different. Technology enables us to do more with less. But the more we leverage technology, the more dependent we become too.
Why systems are designed to fail—what the what?
In our highly connected world, a shockingly vast number of systems are involved in fulfilling the simplest of requests, from sending an email to making an online payment. While it seems like these systems rarely fail, the truth is that they fail regularly, and that’s expected.
Because it’s impossible to prevent all errors, modern systems are designed to fail in manageable ways. Multiple layers of redundancy are built into system design, and components are meant to be quickly replaced, not fixed.
This level of engineering is difficult, exacting work, and far outside the reach of most organizations. No ordinary organization can hope to match the resources and reliability of major platforms. It makes more sense to trust our operations and data to these platforms than to try managing them ourselves.
The human element in system failure
So if these systems are so stable, what causes major outages, like the recent Salesforce and Google outages, and the Google Calendar crash last month? Most platforms are extremely stable when running in their steady state, governed by battle-tested rules and algorithms. But, they’re vulnerable when modified—and modifications are made by people. Ah, there’s the rub.
Manual changes made by well-meaning people are one of the greatest persistent challenges to platforms. The best companies approach modifications with a high degree of rigor. Changes to their systems are governed by playbooks and reviewed by many sets of eyes. This approach reduces error, but as long as humans are involved, there will be errors. The Salesforce and Google outages provide evidence for this unfortunate reality.
Even though human error will occasionally cause an outage, these platforms are still the best possible choice, and almost always the most economical. The problem is not the technology, but rather our increased dependence on it.
Prepare for the unexpected: a contingency plan for nonprofits
Our organizations could not fulfill their missions without technology, so we must be ready to mitigate the effects when it misbehaves. You need a contingency plan in place, so you’re prepared for the unexpected, but not unlikely, system failure.
The next outage might come from your payment processor, your website host, or cloud-hosted software such as your online learning platform, content management system, or even Salesforce. Or a system might fail because of a power outage, loss of internet, cyberattack, building fire, or earthquake.
Every organization needs a business continuity and disaster recovery (BC/DR) plan in place for essential operations, especially for areas like human services program management where the impact of an outage could be critical. A BC/DR plan helps you manage your resources so you can recover from disruption and return to normal operations as soon as possible.
The business continuity plan includes the policies and procedures your organization will follow to resume mission-critical operations in the aftermath of a disaster, for example, a crisis communications plan, a staff responsibility matrix, and business and services impact analyses.
The disaster recovery plan lays out the processes your organization will follow to recover from a disaster, for example, network restoration and data restoration from backups.
Take the time to think about a Plan B or backup you can switch to for each mission-critical activity. For example, have a contract with two payment processors, even if you only use one. Keep critical contacts in your phone, not just in email or Slack.
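For teams with a developer on hand, the two-processor idea can be sketched in a few lines of Python. This is only an illustration of the Plan B pattern, not any real processor’s API: `ProcessorUnavailable`, `charge_with_failover`, `primary_charge`, and `backup_charge` are all hypothetical names, and the two charge functions are stand-ins that simulate an outage and a successful charge.

```python
from __future__ import annotations

from typing import Callable, Sequence


class ProcessorUnavailable(Exception):
    """Hypothetical error a processor client raises when it cannot be reached."""


def charge_with_failover(
    amount_cents: int,
    processors: Sequence[tuple[str, Callable[[int], str]]],
) -> tuple[str, str]:
    """Try each processor in order and return (processor_name, receipt_id)."""
    errors = []
    for name, charge in processors:
        try:
            return name, charge(amount_cents)
        except ProcessorUnavailable as exc:
            # Record the failure and fall through to the next processor.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all payment processors failed: " + "; ".join(errors))


# Stand-in functions for illustration only; a real integration would call
# each vendor's own client library here.
def primary_charge(amount_cents: int) -> str:
    raise ProcessorUnavailable("simulated outage")


def backup_charge(amount_cents: int) -> str:
    return f"backup-receipt-{amount_cents}"


result = charge_with_failover(
    2500,
    [("primary", primary_charge), ("backup", backup_charge)],
)
print(result)  # ('backup', 'backup-receipt-2500')
```

The point of the sketch is the ordering: the backup is only used when the primary reports that it is unavailable, so day-to-day operations are unchanged until the day you need Plan B.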
Talk with your technology partners about how your systems will behave in a failure state, and what your options are. For example, many systems support rollback or versioning, which reverts the system to a previous working state, like a big “undo” button. In this scenario, you may recover quickly, but you will likely lose any data entered since that earlier state was saved. For critical systems, consider a more aggressive backup strategy so you lose as little data as possible.
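To see why backup frequency matters, here is a small worked example in Python. It assumes, for illustration, that a restore takes you back to the most recent backup made before the failure, so everything entered after that backup is lost. The function names, dates, and schedules are all hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def restore_point(backups: list[datetime], failure: datetime) -> datetime:
    """Return the most recent backup taken at or before the failure."""
    candidates = [b for b in backups if b <= failure]
    if not candidates:
        raise ValueError("no backup exists before the failure")
    return max(candidates)


def data_loss_window(backups: list[datetime], failure: datetime) -> timedelta:
    """Time between the restore point and the failure: the work that is lost."""
    return failure - restore_point(backups, failure)


# Hypothetical failure at 3:30 p.m. on January 10.
failure = datetime(2024, 1, 10, 15, 30)

# Schedule A: one backup each night at 2:00 a.m.
nightly = [datetime(2024, 1, day, 2, 0) for day in (8, 9, 10)]

# Schedule B: a backup at the top of every hour on January 10.
hourly = [datetime(2024, 1, 10, hour, 0) for hour in range(24)]

nightly_loss = data_loss_window(nightly, failure)
hourly_loss = data_loss_window(hourly, failure)

print(nightly_loss)  # 13:30:00
print(hourly_loss)   # 0:30:00
```

Under these assumptions, the nightly schedule loses thirteen and a half hours of work, while the hourly schedule loses thirty minutes. That gap is the trade-off to raise with your technology partners when deciding how aggressive a backup strategy each system deserves.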
Technology is wonderful, until it isn’t. Given the complexity of our connected world and the human tendency to make mistakes, system outages are inevitable. But we can prepare for them so they don’t ruin our day or the day of our supporters and members.