The Importance of Resilience Testing

Software testing involves a range of techniques and methodologies for verifying every aspect of the software, including its functionality, performance, and security. One aspect that is easy to overlook is how a system behaves when things go wrong, and this is where resilience testing comes into play.

Resilience testing is vital to ensure that applications work well in real-life, chaotic conditions. It assesses an application's ability to withstand, and recover from, challenging situations.

So we asked industry experts to explore the role and importance of resilience testing.


What is resilience testing?

According to Paul Davison, Managing Director at Seriös Group, resilience testing is a non-functional test technique that involves testing a solution's ability to continue providing an acceptable level of service to the business when under stress, and/or when issues impact one or more of the system's components. It can also help ensure we are better equipped to deal with, and recover from, failures.

Annarita De Biase, QA Manager at Soldo, adds that resilience testing is a test category whose goal is to observe how the systems we work on behave under certain limit conditions.

For example, do you always know what happens if just one of your core services is no longer available? What if the same happened to a non-core service? Or if a server is disconnected for whatever reason? Or if data are no longer accessible?

These are real nightmares for a company (developers, QA engineers, sysadmins, business people), but by simulating them somehow and observing how our platform reacts, we can be ready to face issues and mitigate them.


How to test resilience?

Annarita points out that resilience testing is based on observing the system under particular conditions.

First, it requires an analysis of the business. What are the core functionalities of your platform? What kind of data do they need?

Once this information is clear, it is possible to start planning resilience tests. "Planning" is a particularly odd word in this case, because what we are really trying to do is "plan the chaos", or at least anticipate the greatest possible number of random incidents.

The next step is the chaos simulation, whose goal is to break the system. So, hopefully in a dedicated environment very similar to the production one, people start shutting down servers, injecting malicious code into the system to simulate a hacker attack, making some core data unavailable, or doing whatever else they think could cause problems.
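The steps Annarita describes can be sketched as a toy experiment in Python. Everything here is hypothetical (the service names, the degradation logic): it only illustrates the shape of "take one piece down, observe how the platform responds".

```python
# Hypothetical model of a platform: three services, two of them core.
SERVICES = ("auth", "catalog", "recommendations")
CORE = {"auth", "catalog"}  # the business analysis told us these are core

def handle_request(state: dict) -> str:
    """Serve a request, degrading gracefully when a non-core service is down."""
    if not all(state[s] for s in CORE):
        return "error"          # a core outage is a failure we must detect
    if not state["recommendations"]:
        return "ok-degraded"    # non-core outage: respond without that feature
    return "ok"

def run_experiment(victim: str) -> str:
    """Take one service down (a stand-in for shutting off a server) and observe."""
    state = {s: True for s in SERVICES}
    state[victim] = False
    return handle_request(state)

# In a real chaos run the victim is picked at random; we iterate for determinism.
for victim in SERVICES:
    print(victim, run_experiment(victim))
```

The observed results (core outages fail, non-core outages degrade) are exactly the data that gets collected and analysed in the next step.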

After these tests, data on the consequences of the simulation are collected and analysed, and activities are planned to address the problems found.

On the other hand, Paul shares an analogy for testing resilience: you get a flat tire on your car and you know you have a spare in the boot, but:

  • Do you have the tools needed to change it?
  • Do you know how to change it?
  • Is the tire inflated?

You could undertake a ‘dry run’ and change the wheel to test this out on your drive, but would it be the same trying to do it in the rain in the dark by the side of a busy road?

Hence, resilience testing means exploring the 'what if' scenarios to determine what the impact on the system's capability would be should failures occur. There are many different ways to do this.

Netflix took an interesting approach to test resilience by building Chaos Monkey, a tool that randomly disables production instances to test common types of failure without customer impact. The name came from the idea of unleashing a wild monkey in a data center to bring down servers and chew through cables.

Perhaps more interesting, he continues, is their approach to running this testing during a business day, with engineers on standby to address problems. This allows them to learn lessons and build automatic recovery mechanisms for those ‘what if’ scenarios that would cause them significant problems. The success of this approach inspired them to extend the concept and they created a virtual ‘Simian Army’ including tools that test latency, identify non-conformant components, and health check components.

By testing to identify weaknesses, such as components not configured to auto-scale, they can proactively correct these potential failure points and bring them back into service before they cause issues or failures.
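The "automatic recovery mechanism" idea can be sketched as a simple reconciliation loop. The fleet model and instance IDs below are hypothetical; a real implementation would call a cloud provider's API rather than manipulate an in-memory set.

```python
# Desired number of running instances; a real system would read this from config.
DESIRED_SIZE = 3

def reconcile(fleet, counter=0):
    """Replace missing instances until the fleet is back at its desired size."""
    fleet = set(fleet)
    while len(fleet) < DESIRED_SIZE:
        counter += 1
        fleet.add(f"i-replacement-{counter}")
    return fleet

fleet = {"i-aaa", "i-bbb", "i-ccc"}
fleet.discard("i-bbb")       # the chaos tool terminates an instance at random
fleet = reconcile(fleet)     # the recovery mechanism notices and heals the fleet
print(len(fleet))
```

Running chaos experiments during business hours, as Netflix does, is what proves that a loop like this actually fires when an instance disappears, while engineers are on hand if it does not.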

Other organisations still opt for a more structured approach using non-functional requirements. Typically, this involves testing each potential point of failure within a solution to validate that, when any component fails, any requests being routed to it are redirected to alternate component(s) that perform the same function.

These tests would involve bringing down elements of the solution to simulate a failure and would include scenarios to ensure that:

  • The alternate component(s) could handle the required volume of requests within the required timescales
  • Any relevant monitoring/alerting tooling reacts as expected
  • Recovery actions can be undertaken to bring the failed component back into service in a timely manner
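As a toy illustration of the first and third scenarios above (the component names and routing logic are hypothetical), a failover test might bring a component down, check that the alternate absorbs the full request volume, and then verify the recovery action:

```python
class Component:
    """A minimal stand-in for a deployed component that serves requests."""
    def __init__(self, name: str):
        self.name = name
        self.up = True
        self.handled = 0

    def handle(self) -> str:
        self.handled += 1
        return f"handled by {self.name}"

def route(primary: Component, alternate: Component) -> str:
    """Send the request to the primary, falling back to the alternate on failure."""
    target = primary if primary.up else alternate
    if not target.up:
        raise RuntimeError("no healthy component available")
    return target.handle()

primary, alternate = Component("app-1"), Component("app-2")
primary.up = False                      # simulate bringing the component down
responses = [route(primary, alternate) for _ in range(100)]

# The alternate must absorb the full request volume while the primary is down.
assert all(r == "handled by app-2" for r in responses)
assert alternate.handled == 100

primary.up = True                       # recovery action: bring it back
assert route(primary, alternate) == "handled by app-1"
print("failover and recovery verified")
```

A real test would of course hit live infrastructure and also check that monitoring and alerting fired as expected, which no in-process sketch can capture.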


Why and when to use resilience testing?

No system or application will run without failure forever, regardless of how well it is built/designed, Paul points out.

Indeed, the key is to understand how well, or even if, the solution will function under failure conditions. This means that Service Management teams are in an informed position in terms of understanding how long and how well the service can function without any given component being available. The testing also ensures the team understands what needs to be done to recover from the failure and has proven the capability to do so.

Resilience is essential within any IT solution these days, he states, but individual organisations will determine how critical it is to their IT strategy based on their approach to risk. A retail shop selling 2% of its products online may not want to plough a lot of money into building fully resilient infrastructure and testing it.

Conversely, financial institutions that could be hugely impacted reputationally should their systems be seen to fail are highly likely to prioritise resilience. In Public Sector areas, such as welfare or healthcare, a lack (or failure) of resilience within a solution could lead to human hardship or loss of life which would again make it a high priority area from a test perspective.

For Annarita, there are no valid reasons not to use resilience testing. Perhaps in the case of extremely simple platforms in very young organisations it can be postponed, but once the architecture grows beyond a couple of services, or the number of users grows beyond ten, "resilience" becomes a real need.


The benefits & the challenges

Resilience testing gives you deep knowledge of your platform and allows testers to take quick action when problems arise, Annarita notes.

You can be well prepared for a seemingly simple release and then, during deployment to the production environment, or soon after, "something" happens that wipes the smile off everyone's face. In test environments, all the unit, functional, and integration tests can pass, and still something can go wrong because of "something": a random incident we could not prevent.

Well, "chaos" cannot be totally planned, but resilience testing can help make our platforms more stable and "ready", and make our company more efficient through fast reactions.

Moreover, Annarita underlines that, like every kind of test, the most challenging part is the planning.

As a technical QA, Annarita likes getting involved in architecture and DevOps work, and even though she thinks planning is the most difficult and complex part of the process, it is also the most interesting. Developers, QA engineers, DevOps engineers, business people, and so on must try to simulate the chaos, and even though they will never cover 100% of the possible scenarios (as with every other kind of testing), they need to try. Collaboration is key to making sense of the "chaos" everyone is trying to face.

In a world where consumer expectations are ever-increasing, it is critical that organisations ensure any issues or failures of their software or service cause minimal disruption and, wherever possible, are invisible to the end user or customer. Hence, Paul notes that undertaking resilience testing won't prevent failures, but it does mean that when they occur you know how the system will perform and what corrective actions are necessary.

Many organisations still regard the fact that resilience is built into the design as sufficient: if there are two suitably sized load balancers, why do we need to test? They don't consider the implications of network bandwidth or other downstream components.

Even in organisations where the need to undertake resilience testing is fully understood, there can be challenges. This is normally because the testing is unlikely to be wholly undertaken by the test team. Resilience testing is a significant undertaking requiring a representative environment and skilled, knowledgeable resources to support it. The resources needed often come from the platform and service management teams, who are busy keeping the lights on for the production platform.


The future of resilience testing

For Paul, consumer expectations are ever-increasing, and organisations providing systems or services that experience frequent outages or reduced functionality risk losing customers. This, coupled with the move to cloud-based infrastructure, which brings new options in terms of resilience capabilities, will help ensure that resilience testing remains an area of growth.

It is critical that organisations understand the areas of risk within their solution and how they deal with those ‘what if’ scenarios.

Annarita believes that "resilience" is more important than ever. We, as human beings, faced a time during which we made incredible sacrifices to go on with our lives even as a pandemic affected the entire world. We have learned how to live in a potentially different way, with the precise goal of getting back to normal life, but with the awareness of being able to live in different conditions.

Well, the same goes for our platforms (software and hardware). We have to act so that they are ready to face unexpected events, so that they are strong enough, and so that their end users can go on using them.


Special thanks to Paul Davison and Annarita De Biase for their insights on the topic!