Recovery Testing: The Neglected Child of QA!

Failure… the word sounds bad, and everyone hates it. You don’t even need to say the whole word: its first letter alone already means bad news in a gradebook.

We’re so afraid of failure that we always try to avoid it, even to the point of fooling ourselves into believing that our system will never fail.

But let’s face the facts: anything that can fail will fail at some point, as Murphy’s law puts it.

This is where recovery testing comes into play: we need to test how our system handles failure (and it should handle it better than some of us humans do).

Let’s see what we need to know to do recovery testing.

1. Assess how much downtime you can afford

The most common way of expressing this is in “nines”, meaning the number of nines in the availability figure.

Nines      | Availability (fraction) | Availability (%) | Downtime per year
one nine   | 0.9                     | 90%              | 36.5 days
two nines  | 0.99                    | 99%              | 3.65 days
three nines| 0.999                   | 99.9%            | 8.76 hours
four nines | 0.9999                  | 99.99%           | 52.56 minutes
five nines | 0.99999                 | 99.999%          | 5.26 minutes
six nines  | 0.999999                | 99.9999%         | 31.5 seconds
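The downtime figures in the table follow directly from the availability fraction. As a minimal sketch, assuming a 365-day year (which is what the numbers in the table imply), the arithmetic looks like this. The function names are made up for illustration.

```python
# Sketch: convert a number of "nines" into the downtime allowed per year.
# Assumption: a 365-day year, which matches the figures in the table.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def availability(nines: int) -> float:
    """Availability as a fraction, e.g. 3 nines -> 0.999."""
    return 1 - 10 ** (-nines)


def downtime_seconds_per_year(nines: int) -> float:
    """Maximum downtime per year, in seconds, for a given number of nines."""
    return SECONDS_PER_YEAR * 10 ** (-nines)


if __name__ == "__main__":
    for n in range(1, 7):
        seconds = downtime_seconds_per_year(n)
        print(f"{n} nine(s): {availability(n):.6f} -> {seconds / 3600:.2f} hours per year")
```

Running the same calculation against your own target makes the trade-off concrete: each extra nine cuts the downtime budget by a factor of ten.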


Different types of organizations can afford different amounts of downtime. Ideally, software would never go down, but resource constraints mean that not every organization can afford that level of availability. And since not every organization needs very high availability, some downtime is often acceptable.

Organizations that require high availability include:

  • Finance and banking
  • Large-scale retail
  • Health care
  • Public services
  • Security and defense

2. What failures might happen?

Next on the checklist is identifying the different kinds of failure that can happen to our system. Although several common failures apply to most systems, the full list can vary greatly depending on the location or nature of the system. So, as always in software testing, this step requires creativity.

Most common failures to handle:

  • System process or transaction failure
  • Network failure
  • Database failure
  • Backup failure
  • Power failure
  • Natural disaster

3. Create a plan for handling these failures

Having identified the problems that are likely to happen, we can now plan how to handle these failures. Some of the plans may involve writing better and more robust code, some may be more about infrastructure, and some may be about how people should respond to the disaster.
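On the "more robust code" side, one common pattern is retrying an operation that fails because of a temporary network problem. The following is only a sketch of that idea, not a recommendation for any particular system: the function name `call_with_retry`, its parameters, and the choice of `ConnectionError` as the retryable error are all assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller decide what to do
            # Wait 0.5s, 1s, 2s, ... (with the default base_delay) before retrying.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so that a test can replace the real delay with a fake one and run instantly.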

4. Test the plans

If all of the previous steps have been followed faithfully, we now need to test whether the system (and the team responsible for it) can handle failures as it is supposed to.

This can involve a range of tests, such as:

  • Removing the network connection while the system is running
  • Restarting the machine(s) in the middle of a process
  • Restoring data from backup media
  • Testing how the system behaves during a power interruption
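Tests like these can also be rehearsed in miniature before touching real machines. The sketch below simulates the "restart in the middle of a process" case: a toy job records its progress in a checkpoint file, a crash is injected partway through, and the test checks that a second run picks up where the first one stopped. Everything here (`process_items`, `SimulatedCrash`, the checkpoint format) is hypothetical and stands in for whatever your system actually does.

```python
import json
import os
import tempfile


class SimulatedCrash(Exception):
    """Stands in for a machine restart in the middle of a job."""


def process_items(items, checkpoint_path, crash_after=None):
    """Process items in order, recording progress so a restart can resume."""
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["done"]
    processed = []
    for index in range(done, len(items)):
        if crash_after is not None and index == crash_after:
            raise SimulatedCrash(f"crashed before item {index}")
        processed.append(items[index].upper())  # the "work" in this toy job
        with open(checkpoint_path, "w") as f:
            json.dump({"done": index + 1}, f)
    return processed


def test_job_resumes_after_restart():
    items = ["a", "b", "c", "d"]
    with tempfile.TemporaryDirectory() as tmp:
        checkpoint = os.path.join(tmp, "checkpoint.json")
        try:
            process_items(items, checkpoint, crash_after=2)
        except SimulatedCrash:
            pass  # the "machine" went down after finishing two items
        # After the restart, only the remaining items should be processed.
        assert process_items(items, checkpoint) == ["C", "D"]


if __name__ == "__main__":
    test_job_resumes_after_restart()
    print("job resumed from its checkpoint after the simulated restart")
```

A simulation like this only checks the recovery logic. Pulling the real network cable or power plug is still needed to test the infrastructure and the team.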

5. Repeat the process from step 1

Passing recovery testing is good, but it is a continuous process. As the business grows, it may be able to afford less and less downtime, while more and more risks arise that can bring the system down. This calls for new plans, and new test plans as well.

Always Ready to Handle Failures

As we can see, recovery testing is another non-functional requirement that takes good planning and, most importantly, good execution to pass.

A good habit is to always be ready to handle failures, and Netflix is a very good example. The company built “Chaos Monkey”, a tool that randomly kills instances and services within Netflix’s AWS (Amazon Web Services) infrastructure. This pushes developers to make sure that each individual component can handle these failures as they occur.
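To get a feel for the idea, here is a toy, in-memory imitation. It is not Netflix’s tool and does not touch AWS; the `Instance`, `Service`, and `unleash_monkey` names are invented for this sketch. It kills one randomly chosen instance and relies on the service failing over to a healthy one.

```python
import random


class Instance:
    """A pretend server that can be switched off."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class Service:
    """Routes each request to the first instance that answers."""

    def __init__(self, instances):
        self.instances = instances

    def handle(self, request):
        for instance in self.instances:
            try:
                return instance.handle(request)
            except ConnectionError:
                continue  # fail over to the next instance
        raise RuntimeError("no healthy instances left")


def unleash_monkey(service, rng):
    """Kill one randomly chosen live instance and return it."""
    victim = rng.choice([i for i in service.instances if i.alive])
    victim.alive = False
    return victim


if __name__ == "__main__":
    service = Service([Instance(f"node-{n}") for n in range(3)])
    victim = unleash_monkey(service, random.Random())
    print(f"monkey killed {victim.name}")
    print(service.handle("GET /"))  # still answered by a surviving instance
```

If the service only had one instance, the same experiment would end in an error, which is exactly the kind of weakness this style of testing is meant to expose.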

Moral of the story: never neglect to face your chaos monkey(s). Do recovery testing! 🙂

Wikipedia – Recovery testing
Wikipedia – Nines
Netflix Chaos Monkey
Disaster Recovery Journal
