When was the last time we checked the backups?
If you want to make the CTO or ops team shiver, there might be no better question. That's because, usually, no one has the answer. And that is terrifying!
So, do you know the answer? If you do, I'm willing to bet you are the outlier. Unfortunately, backups are not something we as an industry are good enough at. It might be because it's boring, because you have never needed them, or because you think your system could never fail. The truth is, you will need backups one day, and there is more than one way your backups can fail. Just checking that you have a backup might not be enough. So let's look at the 5 things that need to be in place for you to sleep well at night!
Are you taking backups at all?
Step 1 is to actually be taking backups. Terrifyingly enough, a lot of companies need to start here. There are many things you need backups of:
- Disk storage
- Source code
While losing the documentation for your system might not end your company right away, you can stumble into issues further down the road. Storage is generally cheap, and trying to save a few bucks on not backing up the things you might see as "less important" will cause you a lot of extra work down the road.
At the very least, you must be sure that you are backing up all business-critical data and information. Ask yourself (or the boss): "If this piece of data disappeared right now, would we have a problem?" If the answer is yes, then back it up!
Do you have redundant backups?
Two is one, one is none
Having a single backup is obviously better than none, but not a whole lot. It can create a false sense of security, when in reality, the smallest error can result in you having no copies of your data. You should have a number of redundant backups in case there are issues with either the integrity or availability of your primary backup.
One event you are likely to encounter is a corrupted backup file. In a complex system there are plenty of issues that can arise during backup and corrupt the file. A process might be killed, a virtual machine might get moved to a different server and not start up properly, or the network might drop out; the list goes on. While we can't rely on every backup being usable, we can create several backups, increasing the chances that at least one of them will be able to restore the system.
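One way to make use of several copies is to record a checksum when the backup is created and, at restore time, pick the first copy that still matches it. The sketch below assumes you stored a SHA-256 digest alongside the backup; the function names and the idea of a stored digest are illustrative assumptions, not something your backup tool necessarily provides.

```python
import hashlib
from pathlib import Path
from typing import Iterable, Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_intact_copy(copies: Iterable[Path], expected_digest: str) -> Optional[Path]:
    """Return the first copy whose digest matches the recorded one.

    Returns None if every copy is missing or fails the check.
    """
    for copy in copies:
        if copy.is_file() and sha256_of(copy) == expected_digest:
            return copy
    return None
```

A matching checksum only tells you the file is unchanged since the digest was taken. It does not prove the backup was complete in the first place, so it complements a real restore test rather than replacing one.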
The number of backups needed will be individual to every company and system, but the more users you have, or data you store, the more copies it makes sense to have, as the consequence of data loss increases.
Do you have independent backups?
There is a minor, but important difference between redundant and independent backups. While having redundant backups ensures there are multiple copies should one be corrupt or lost, there are a host of other issues that this won't save you from. For this, we need independence between the backups.
Where are your backups stored? The same server as production? Same data centre? Same city? Yeah, that's not gonna cut it if your server cuts out, the data centre burns down or the city is flooded or loses power. If your backups are not independent, they are not really redundant either.
You should also consider the "blast radius" of operations and incidents that can impact the backups. If all backups are managed by the same system, a single lost credential might be enough to compromise all the backups. Similarly, you may be using the same vendor for all backup storage. What if they are suddenly unavailable for a prolonged period, go bankrupt, or have a disgruntled employee who runs rm -rf / for the heck of it?
You should have backups that are stored in truly independent ways. Ideally, no single person in your organisation should have access to all copies, or even manage the contracts with all the vendors. You want to make sure that no one can cause catastrophic damage to your company – neither on purpose, nor by accident.
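A simple way to reason about independence is to write down what each copy depends on and look for anything they all have in common. The sketch below is a hypothetical inventory check: the BackupCopy fields (vendor, location, credential) and the example values are assumptions chosen for illustration, and you would extend them with whatever dependencies matter in your setup.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BackupCopy:
    """One backup copy and the things it depends on (hypothetical fields)."""
    name: str
    vendor: str
    location: str
    credential: str


def shared_dependencies(copies: List[BackupCopy]) -> Dict[str, str]:
    """Return every attribute where all copies share the same value.

    Each entry is a single point of failure: losing it affects every copy.
    """
    shared: Dict[str, str] = {}
    for attribute in ("vendor", "location", "credential"):
        values = {getattr(copy, attribute) for copy in copies}
        if len(values) == 1:
            shared[attribute] = values.pop()
    return shared


inventory = [
    BackupCopy("nightly-a", vendor="vendor-1", location="oslo", credential="ops-key"),
    BackupCopy("nightly-b", vendor="vendor-2", location="frankfurt", credential="ops-key"),
]
print(shared_dependencies(inventory))  # {'credential': 'ops-key'}
```

In this made-up inventory the copies sit with different vendors in different cities, but one lost credential would still expose both of them.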
Do your backups actually work?
So, redundant backups are in place, but that's just half the battle. Do they actually work? And how do you check them?
You should set up systems that monitor your backups and ensure that new copies are being made. It would be ideal to test every backup automatically, but this will not be feasible for most. If you are instead able to make some assumptions about your backups, you can test those assumptions. If you back up a database or file share, you might expect the size of the backup to never decrease significantly. You can then check that the new copy is at least as large as the average of the last 10 copies. If the backup completes unusually fast or slow, you might also want to be alerted, as this could indicate that something vital has changed.
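The size and duration checks described above can be sketched in a few lines. This is a minimal example, assuming you already record the size and duration of each backup somewhere. The function name, the 50% duration tolerance, and the warning texts are illustrative assumptions that you would tune to your own system.

```python
from statistics import mean
from typing import List, Sequence


def check_backup(
    new_size: float,
    new_duration: float,
    past_sizes: Sequence[float],
    past_durations: Sequence[float],
    window: int = 10,
    duration_tolerance: float = 0.5,  # assumed: alert if more than 50% off the average
) -> List[str]:
    """Compare the newest backup against the last `window` backups.

    Returns a list of warnings; an empty list means nothing looked unusual.
    """
    warnings: List[str] = []

    recent_sizes = list(past_sizes)[-window:]
    if recent_sizes and new_size < mean(recent_sizes):
        warnings.append("backup is smaller than the average of recent backups")

    recent_durations = list(past_durations)[-window:]
    if recent_durations:
        typical = mean(recent_durations)
        if abs(new_duration - typical) > duration_tolerance * typical:
            warnings.append("backup duration deviates from the recent average")

    return warnings
```

For example, with ten earlier backups of 100 GB that each took 60 minutes, a new 90 GB backup that finishes in 20 minutes would produce both warnings.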
Has something changed without you noticing?
This might be the scariest thing. When everything is in place, and the wheels are turning, things can suddenly break. If you use a cloud function to send files to a backup location, it can stop working. Your backup location might run out of disk space, or your vendor might silently have changed important factors like their internal backup frequency.
The best way to mitigate these risks is to monitor your backups. Your monitoring should be able to alert you automatically if there are issues. Relying on a routine where someone has to remember to check the backups every Monday is a recipe for disaster.
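The most basic automated check is a freshness check: if the newest backup is older than expected, something has stopped working. The sketch below assumes a daily backup and a hypothetical 26-hour limit to leave some slack. How you find the timestamp of the latest backup and how you deliver the alert depend on your own tooling and are left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_is_stale(
    last_backup_at: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=26),  # assumed limit for a daily backup
) -> bool:
    """Return True if there is no backup at all, or the newest one is too old."""
    if last_backup_at is None:
        return True
    return now - last_backup_at > max_age


now = datetime(2024, 1, 10, 8, 0, tzinfo=timezone.utc)
last = datetime(2024, 1, 8, 2, 0, tzinfo=timezone.utc)
if backup_is_stale(last, now):
    print("ALERT: no recent backup found")
```

Run on a schedule by something other than the backup job itself, a check like this can catch the case where the job silently stops running.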
If part of your backup strategy relies on a vendor's internal backup systems, you should verify these with the vendor at least once a year. If your deal with the vendor is for them to back up your file share once a day, and do tape backups once a week, you need to ensure that nothing has changed. Large vendors can lose track of their agreements over time, and suddenly adjust the backup interval without giving notice.
Some of the suggestions here might seem a bit extreme, and present too large a cost in dollars and man-hours. And maybe they do for your organisation. But if your business would not survive a data loss, then making the effort to ensure your backups are rock solid is a small price to pay. When that fateful day comes that you need the backups, you will be grateful to your past self when recovery is possible thanks to a fresh and safely stored backup!