Large scale exercises for your dev team

We need to run exercises in order to be prepared for the worst. Whether your team runs safety critical or business critical systems, it is the responsibility of the team to be able to respond if there is a disaster.

Photo by Aleksey Oryshchenko / Unsplash

When shit hits the proverbial fan and it's time to enact your disaster recovery plan, your heart rate will be through the roof. There will be a hundred things that were never planned for, and untold distractions. That's why it's imperative that you and your team have exercised such a scenario before. I'll take you through why we need exercises, how to select your scenarios, and how to execute and evaluate the exercise for best effect.

We practice crisis scenarios because we as humans are inherently bad at acting rationally in a situation that fires up our crisis response. It is easy to panic and either freeze or do the first thing that comes to mind. Neither is especially productive. It is only after practicing time and time again that we are able to establish a routine that will overcome the panic of a crisis. The one crisis that most of us have practiced regularly throughout our lives is a fire. When a fire alarm sounds, we know that we must find a safe way to leave the building we are in. We have trained ourselves to equate the sound of the fire alarm with the action of getting out of the building.

If you are ever put in charge of a commercial building, you must retrain this response. You no longer simply leave the building. Instead, you start the procedures for finding and containing the fire, evacuating people, and informing the arriving firefighters of the situation. This requires a lot of practice, and involves the fire alarm being triggered over and over again, until your instinctive response changes and you can do the routine by heart. I have been through this kind of training twice, and can attest to how important it is the day you have a real alarm.

In situations where fire is the worst imaginable scenario, fire drills are conducted with extreme frequency. In the video below, you will see how submariners practice for fires. They have a fire drill every eight hours, every day!

Smarter every day: How to fight fire or flooding on a nuclear submarine

For IT teams, the worst possible scenario varies. It might be that your service requires extreme uptime but can tolerate data loss, or the other way around. You might be processing some very sensitive information that must never be exposed to a third party. Or there might be a whole range of worst-case scenarios involving different third-party providers. The first step in creating a valuable exercise for your team will be to identify the most severe scenarios your business can be subjected to. If you do risk assessments of your systems (which you should), these will be the scenarios with the highest consequence rating. Just like with a fire on a submarine, we practice the scenario not because it is likely, but because it is disastrous if it occurs.
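As a rough illustration of picking scenarios from a risk assessment, here is a minimal Python sketch. The `Risk` class, the 1 to 5 scales and the example register are all made up for this post; your own risk assessment will have its own format. The point is only that we sort by consequence first and use likelihood as a tie-breaker at most.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    scenario: str
    likelihood: int   # hypothetical scale: 1 (rare) to 5 (frequent)
    consequence: int  # hypothetical scale: 1 (minor) to 5 (disastrous)


def exercise_candidates(risks, top=3):
    """Return the scenarios with the highest consequence rating.

    Likelihood only breaks ties between equally severe scenarios.
    """
    ranked = sorted(
        risks,
        key=lambda r: (r.consequence, r.likelihood),
        reverse=True,
    )
    return ranked[:top]


# Made-up register, for illustration only.
register = [
    Risk("Primary database lost, backups unusable", likelihood=1, consequence=5),
    Risk("Customer data exposed to a third party", likelihood=2, consequence=5),
    Risk("Payment provider outage", likelihood=3, consequence=3),
    Risk("Slow dashboard queries", likelihood=5, consequence=1),
]

for risk in exercise_candidates(register, top=2):
    print(risk.consequence, risk.scenario)
```

Note that the slow dashboard, the most likely item in this register, never makes the list. That is the submarine logic applied to a spreadsheet.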

Planning

Photo by Alvaro Reyes / Unsplash

Having identified a list of scenarios that can be exercised, we have to make a plan for how we can exercise effectively and safely. An effective exercise is as close to a real world situation as possible. We want every participant to feel the pressure and adrenaline as if the crisis is happening for real. This will expose the cracks in our plans, the errors in the documentation, and show us how difficult the task will really be. That being said, I do not think it is a great idea to throw your team into an exercise without them being prepared, at least not until you have done this a few times. We can make an exercise more realistic in several ways:

  • Set up a copy of the production environment, and spend some time loading it with realistic test data. This will help the team have a realistic experience when they start searching for a root cause. Risking deleting the data of "Test User 1483" might not strike as much fear into the team as deleting the data of "Ronald Johnson".
  • Simulate outside actors like customers, company executives and the media. Something as simple as creating realistic press clippings mentioning the issue, or an email from the CTO asking for an update, can go a long way. Taking it a step further, you can ask some of the higher-ups to participate.
  • When the team starts mastering exercises, throw in multiple failures as part of the same scenario. Sticking to the world of submarines: Conducting a missile launch drill at the same time as a fire drill should get your crew on their toes. (This is part of the plot of Crimson Tide, a must-watch 1995 submarine/cold war movie.)
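For the first point in the list, realistic test data does not have to be complicated. Below is a small Python sketch that generates users with human-sounding names instead of "Test User 1483". The name lists, the `realistic_users` function and the record layout are assumptions for illustration; adapt them to whatever your system actually stores.

```python
import random

# Hypothetical name pools; extend these to fit your user base.
FIRST_NAMES = ["Ronald", "Maria", "Aisha", "Lars", "Ingrid", "Chen"]
LAST_NAMES = ["Johnson", "Garcia", "Khan", "Berg", "Olsen", "Wang"]


def realistic_users(count, seed=0):
    """Generate `count` fake users with realistic-looking names.

    A fixed seed makes the data set reproducible between exercises.
    """
    rng = random.Random(seed)
    users = []
    for user_id in range(1, count + 1):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        users.append({
            "id": user_id,
            "name": f"{first} {last}",
            # example.com is reserved for documentation and testing,
            # so a stray email from the exercise reaches nobody.
            "email": f"{first}.{last}.{user_id}@example.com".lower(),
        })
    return users


for user in realistic_users(3, seed=42):
    print(user["id"], user["name"], user["email"])
```

The data must still be fake. Copying real customer records into an exercise environment creates exactly the kind of exposure scenario you are trying to practice for.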

A central part of the exercise planning will be the scenario design. This can be as simple as deciding that the exercise will start with the participants being informed of a service outage, or it can be a complex and convoluted playbook that simulates outside actors. If you go down the playbook road, you must be prepared to improvise, and some acting skills might come in handy. Your playbook might call for a simulated failure of a central component. But what if the participants have already shut down this part of the system? You might have to skip steps, move ahead in the scenario, or come up with new (but still realistic) problems on the fly. It is possible to have plans for some of these eventualities ahead of time, but it seldom makes sense to try to cover all outcomes.

When the exercise has been designed in a realistic manner, what remains is to assign roles. You need an exercise leader, who oversees the running of the exercise, and calls the shots. There should also be observers, who have the main task of - you guessed it - observing the exercise. You will need to know who said what, to whom and when. Did the database administrator use 4 attempts before executing the right scripts to perform a failover? These details will matter in the evaluation, and must be recorded by the observers.
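Pen and paper works fine for observers, but if you prefer something structured, a timestamped log can be sketched in a few lines of Python. The `ObserverLog` class and its fields are my own invention for this example, not an established tool. It records who said or did what, to whom and when, and can count attempts afterwards.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Observation:
    at: datetime
    who: str
    to_whom: str
    what: str


@dataclass
class ObserverLog:
    entries: list = field(default_factory=list)

    def note(self, who, to_whom, what, at=None):
        """Record one observation, timestamped now (UTC) unless `at` is given."""
        if at is None:
            at = datetime.now(timezone.utc)
        self.entries.append(Observation(at, who, to_whom, what))

    def attempts(self, who, keyword):
        """Count observations by `who` whose text mentions `keyword`."""
        return sum(
            1 for entry in self.entries
            if entry.who == who and keyword.lower() in entry.what.lower()
        )


log = ObserverLog()
log.note("DBA", "team", "Ran failover script against the wrong host")
log.note("DBA", "team", "Ran failover script, permission denied")
log.note("Developer", "DBA", "Asked about failover status")
log.note("DBA", "exercise leader", "Ran failover script, replica promoted")

print(log.attempts("DBA", "failover"))
```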

Last but not least, you need the team that will perform the scenario in the exercise. This should be the people who are most likely to face the situation in real life. However, you should take care to not only include the people with the longest tenure and most experience. Your newcomers need the training as well, though there might be little value in throwing someone to the wolves on day 1.

Execution

Photo by Sigmund / Unsplash

Executing the exercise should be as easy as following your planned scenario. Usually, the exercise starts when an error is triggered in the system used for the test, or when a simulated external incident happens. At this point your team will start trying to identify and solve the issue. The main job of the exercise leader is to ensure that everything goes to plan (hint: it usually won't), and to coordinate with the rest of the organization.

If your scenario has a clear playbook, the actions in the playbook will help drive the exercise forward. This might include a call from the CEO, an urgent message from the data center, or some files being deleted by a virus.
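A playbook like this can be written down as a simple timeline of events. The Python sketch below is one hypothetical way to do it: each `Inject` has a planned minute and an optional check for whether it still makes sense, so the exercise leader can skip a step if the participants have already shut down the component it targets. All names and the shape of the `state` dictionary are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Inject:
    minute: int                  # planned time, in minutes after exercise start
    description: str
    # Optional check against the current exercise state; None means "always relevant".
    still_relevant: Optional[Callable[[dict], bool]] = None


def due_injects(playbook, elapsed_minutes, state, delivered):
    """Return injects that are due, not yet delivered, and still make sense."""
    due = []
    for inject in sorted(playbook, key=lambda i: i.minute):
        if inject.minute > elapsed_minutes:
            continue  # not time yet
        if inject.description in delivered:
            continue  # already played
        if inject.still_relevant is not None and not inject.still_relevant(state):
            continue  # e.g. the team already shut this component down
        due.append(inject)
    return due


playbook = [
    Inject(5, "Email from CTO asking for an update"),
    Inject(15, "Cache node fails", lambda state: state["cache_running"]),
    Inject(30, "Call from the CEO"),
]

# Twenty minutes in, and the team has already taken the cache offline.
state = {"cache_running": False}
for inject in due_injects(playbook, 20, state, delivered=set()):
    print(inject.minute, inject.description)
```

The code only tells you what to skip. Coming up with a realistic replacement problem is still the exercise leader's job.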

Keeping the exercise realistic must be balanced against the need for the exercise to be safe. In live exercises there is always the danger that the exercise will bleed over into the "real world". Your team might pick up the phone and ask the data center to reboot a bunch of servers, and accidentally include a production machine. The on-call might see an alarm and interpret it as a real system failure. An engineer might connect to the wrong environment and delete the production database. There is a very real risk that an exercise will interfere with day-to-day operations. As such, there must be some ground rules:

  • The entire organization must be informed that an exercise will take place. This will make people think twice before acting on surprising information.
  • The exercise must be led by someone who has full visibility and is available to the organization. Should anyone have any uncertainty about what is real or not, the exercise leader must be able to clarify this immediately.
  • All written and verbal communication that is part of the exercise must start with the phrase "Exercise Exercise Exercise". This will get tedious, but it will prevent misunderstandings. When someone calls the internal emergency hotline with a real issue during the exercise, it is crucial that no time is wasted figuring out if it is a real situation or not. It is recommended that real world emergencies are communicated with the phrase "No play", to indicate that this is not part of the exercise.
  • All participants must be available at all times, and ready to abort the exercise. If a real world issue happens, something goes wrong, or there are other reasons to stop the scenario, the exercise leader will broadcast the message "Bulldog Bulldog Bulldog - Exercise aborted". (I have no idea why the word "Bulldog" has been chosen for this, but it is the internationally agreed upon word to stop or abort an exercise 🤷‍♂️ My best guess is that it is a word that is difficult to mistake for other words when transmitted by radio.)
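If your exercise communication runs through chat or a ticketing tool, the phrases above can be enforced with a few lines of code. This Python sketch is only an assumption about how you might do it: the function names and the "unclear" category are mine, while the three phrases are the ones from the ground rules.

```python
EXERCISE_PREFIX = "Exercise Exercise Exercise"
REAL_WORLD_PREFIX = "No play"
ABORT_PHRASE = "Bulldog Bulldog Bulldog"


def exercise_message(text):
    """Prefix an outgoing exercise message so it cannot be mistaken for a real one."""
    return f"{EXERCISE_PREFIX}: {text}"


def classify(message):
    """Classify an incoming message by its opening phrase.

    Anything without a known phrase is "unclear" and should go to the
    exercise leader, who can say immediately whether it is real.
    """
    # Collapse whitespace and ignore case, so small typing slips still match.
    normalized = " ".join(message.split()).lower()
    if normalized.startswith(ABORT_PHRASE.lower()):
        return "abort"
    if normalized.startswith(REAL_WORLD_PREFIX.lower()):
        return "real"
    if normalized.startswith(EXERCISE_PREFIX.lower()):
        return "exercise"
    return "unclear"


print(classify(exercise_message("Primary database is not responding")))
print(classify("No play - smoke in the server room"))
print(classify("Bulldog Bulldog Bulldog - Exercise aborted"))
print(classify("The database is down"))
```

The abort phrase is checked first on purpose: stopping the exercise should win over everything else.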

Evaluation

Photo by Redd F / Unsplash

After the exercise is completed, everyone deserves a nice break, perhaps some snacks or an extra delicious lunch. But after that, it's time for perhaps the most important step: Evaluation. There is little value to the exercise if we cannot systematically learn something from it.

Your evaluation should bring everyone together, and give all participants room to talk about their experience. What did the scenario look like from their perspective? What did they find problematic, and what tools or training were they missing? The observers should share their experience after the participants, to ensure that their observations don't colour the recollections of the people performing the tasks.

With everyone's take on what happened, we have to identify the learning points. Where did we go wrong, what could have been easier, and what worked well? The learning points must be taken back to the teams and brought up with leadership, so that actual changes are made, the teams are better prepared, and a real-life scenario is handled even better.


Having regular exercises is both fun and invaluable learning for your team! There is no doubt in my mind that any organization that runs a system that is critical for the business should do exercises regularly.