Resilience engineering: an introduction

Over the past years, people at Luminis have become involved in the design and development of new types of systems. Systems that implement requirements beyond stability and robustness. Systems that need to be resilient. Systems that can cope with unforeseen disturbances.

What is resilience?

If you look up the terms resilient, resilience, or resiliency, a whole new world of definitions and interpretations emerges before your eyes. Clearly, resilience goes back a long way, starting in the world of materials and ecosystems before finding its way into psychology and even systems theory.

So yes, buzzwords. Who can live without them? This got me thinking about the term: what does resilience really mean, and is it used correctly?

For many developers, architects, and analysts, resilience appears as something new. It isn’t. System safety and cybersecurity engineers have been leading the field of resiliency. Combined with the academic research on this topic, their work has proven to be foundational. In fact, if you’re interested in this field or are planning to design and develop a resilient system, you might as well start at that part of the domain.

The standard definition of resilience is a system’s ability to recover quickly from disturbances. This definition works OK in general terms but falls short when describing the resilience of complex systems, and even more so in the case of systems of systems. And what if you were to consider the resiliency of an energy network or networked defense system? Which characteristics come to mind?

I think that resilience does not just refer to how well a system can handle disturbances and variations that fall outside its designed adaptive mechanisms. It’s also about the design of the mechanisms to recover from these situations. Strictly speaking: in a perfect world, truly resilient systems can handle both anticipated and unanticipated change.

Now we have a better definition. Resilience is about:

  1. The ability to recover from faults.
  2. The persistence of service reliability when facing change.

This is the essence of adaptive behavior.

What is resilience engineering?

Now that we have a good definition of resilience, let’s dive into the specifics of designing and developing resilient systems. Enter resilience engineering.

Hanging out under the same umbrella as chaos engineering, resilience engineering is a way of building systems designed to withstand failure and change. It acknowledges that change and faults are constant and sometimes unexpected, and it provides engineered ways for systems to cope with them.

Good resilience engineering produces systems that can activate the appropriate adaptive behavior. Resilience can be built into a system. It also offers perspectives on critical areas like cybersecurity, safety, and operations.

Here are some example results of resilience engineering:

  • A system that autonomously starts using the next best CPU after the cloud provider stops providing the previous one.
  • A system that can scale up by acquiring servers from a different availability zone in the same region when all the servers in its zone fail.
  • A system that adapts to hardware configuration changes by maintaining its features without human intervention.
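The second example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Zone type, the pick_zone function, and the zone names are hypothetical and do not correspond to any cloud provider's API. The point is that the fallback decision is made by the system itself rather than by an operator.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """Hypothetical view of one availability zone in a region."""
    name: str
    healthy_servers: int


def pick_zone(zones: list[Zone], preferred: str) -> Zone:
    """Return the preferred zone if it still has healthy servers,
    otherwise the zone in the same region with the most healthy servers."""
    by_name = {zone.name: zone for zone in zones}
    home = by_name.get(preferred)
    if home is not None and home.healthy_servers > 0:
        return home
    # The preferred zone is down: fall back to the best remaining zone.
    candidates = [zone for zone in zones if zone.healthy_servers > 0]
    if not candidates:
        raise RuntimeError("no healthy zone available in this region")
    return max(candidates, key=lambda zone: zone.healthy_servers)


zones = [Zone("zone-a", 0), Zone("zone-b", 3), Zone("zone-c", 5)]
print(pick_zone(zones, preferred="zone-a").name)  # prints "zone-c"
```

A real system would read zone health from its provider and could weigh cost or latency as well; the sketch only shows the shape of the decision.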

What is autonomous adaptive behavior?

The defining trait of resilience is autonomous adaptive behavior. If you provide a SaaS product and your systems go down, you no longer have a product.

Humans have long been the primary agent in making systems adapt: investigating failure, getting things up and running again, and thus making systems resilient to failure. It used to be the work of people on standby.

But human labor is too costly, error-prone, and slow in the age of cloud systems. How Complex Systems Fail by Richard Cook presents an interesting overview of common ways that systems fail. Half of the points it lists have to do with human error or the necessity of human intervention.

Fortunately, software comes to the rescue. Software has proven to be the go-to mechanism to implement adaptive behavior. With the emergence of cloud computing and infrastructure abstractions like Docker and Kubernetes, more and more software is taking over people’s work.

How do you design resilience?

When thinking about resilience, it is usually difficult to avoid terms like failure prevention and robustness. What is more, one of our best practices is to find weak spots in systems or designs and reinforce them. It’s the way we arrive at mechanisms like redundancy, firewalls, and endless procedures.

Resilience takes another approach. Instead of reducing failure, it strengthens success. Furthermore, assuming it is possible to build supporting adaptive mechanisms, the system can display emergent, resilient behavior.

Here are a few rules for building resilience into systems.

Always react to failures and (unforeseen) change

When errors occur, teams need to understand what normal, desired behavior is and act accordingly to restore that. When a failure or unforeseen change occurs, and there is no correct response, you are not adapting.

(Note: not responding to failures or disturbances is a key characteristic of the organizational death spiral.)
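One way to make "normal, desired behavior" explicit is to write it down as data and compare it with what is actually observed. The sketch below assumes a hypothetical reconcile function and made-up service names; it is not taken from any particular orchestrator, although tools like Kubernetes follow a similar desired-state idea.

```python
def reconcile(desired: dict[str, int], observed: dict[str, int]) -> list[str]:
    """Compare desired and observed replica counts and list corrective actions.

    Both dictionaries map a (hypothetical) service name to a replica count.
    """
    actions = []
    for service, want in sorted(desired.items()):
        have = observed.get(service, 0)
        if have < want:
            actions.append(f"start {want - have} x {service}")
        elif have > want:
            actions.append(f"stop {have - want} x {service}")
    return actions


desired = {"api": 3, "worker": 2}
observed = {"api": 1, "worker": 2}
print(reconcile(desired, observed))  # prints ['start 2 x api']
```

An empty list means the system is already in its desired state. Anything else is a failure or change that calls for a response.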

Always log correctly in a comprehensive logging infrastructure

The key to successfully treating failures and disturbances is identifying them as early as possible and finding their root cause. Having a comprehensive logging infrastructure in place that allows you to build good logging messages is essential. The underlying infrastructure should be able to:

  • Help identify errors quickly.
  • Support an initial root cause analysis.
  • Allow staff to handle and treat the errors with ease.
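As a minimal sketch of what a good logging message could look like, the example below uses Python's standard logging module to emit one JSON object per event. The field names component and cause, the logger name, and the message are assumptions made for illustration. The idea is that structured fields make errors easier to find and give a first hint at the root cause.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object that is easy to search."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical context fields, supplied through `extra=...`.
            "component": getattr(record, "component", None),
            "cause": getattr(record, "cause", None),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payments")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Emits a single JSON line on stderr with the fields defined above.
log.error("charge failed", extra={"component": "card-gateway", "cause": "timeout"})
```

In a real setup the handler would ship these lines to a central log store rather than to the console.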

Design a metrics system and stick to it

One of the biggest problems when encountering faults or disturbances is the subjective interpretation of information. Therefore, it is important to base your designed resiliency on a limited set of important metrics. Mainstream metrics in the world of cloud computing are:

  • Service or system availability.
  • Mean time between failures (MTBF).
  • Mean down time (MDT).

And let’s not forget good old metrics like response time, latency, and bandwidth.

(Note: mean time to failure (MTTF) is the counterpart of MTBF for non-repairable systems, and mean time to repair (MTTR) covers only the repair portion of the down time.)
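The three mainstream metrics are related, and a short calculation shows how. The sketch below assumes that MTBF counts only the up time between failures, in which case availability can be estimated as MTBF / (MTBF + MDT). Definitions of MTBF differ between sources, so treat the formula as one common convention. The function names and the sample numbers are invented for illustration.

```python
def availability(mtbf_hours: float, mdt_hours: float) -> float:
    """Estimate availability, assuming MTBF counts only up time between failures."""
    return mtbf_hours / (mtbf_hours + mdt_hours)


def summarize(uptimes: list[float], downtimes: list[float]) -> dict[str, float]:
    """Derive MTBF, MDT, and availability from observed up and down periods (hours)."""
    mtbf = sum(uptimes) / len(uptimes)
    mdt = sum(downtimes) / len(downtimes)
    return {"mtbf": mtbf, "mdt": mdt, "availability": availability(mtbf, mdt)}


# Three made-up failure cycles: hours of up time, each followed by hours of down time.
stats = summarize(uptimes=[700.0, 740.0, 720.0], downtimes=[1.0, 3.0, 2.0])
print(stats["mtbf"], stats["mdt"])    # prints 720.0 2.0
print(round(stats["availability"], 4))  # prints 0.9972
```

Agreeing on the calculation up front is part of sticking to the metrics: everyone reads the same number the same way.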

How do you test resilience?

The easiest way to test a resilient system is to wait for failure to happen and then hope for the best. A better way is to simulate failure before it happens in real life. In other words: simulating chaos.

As I mentioned before, resilience engineering and chaos engineering are related. The latter is often used to test the resilience of a system:

Chaos engineering, a practice developed at Netflix, aims to help test a system’s resiliency by proactively throwing common and unexpected failures at a system. The original idea was an experiment: how can engineers build the system to be more resilient before bad things happen, instead of waiting until after the event?

This led to the creation of Chaos Monkey, a tool that simulated common failures in the system’s infrastructure. Like its namesake, the tool acts like a monkey rampaging through a data center, unplugging and cutting cords wherever it goes.
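To give a feel for the idea at a much smaller scale, here is a sketch of fault injection in plain Python. It is not how Chaos Monkey works. FaultInjector, call_with_retry, and the profile lookup are hypothetical names, and failures are injected on a fixed schedule (every third call) so the experiment is repeatable.

```python
class FaultInjector:
    """Wrap a callable and make every n-th call fail, like a tiny chaos experiment."""

    def __init__(self, func, fail_every: int):
        self.func = func
        self.fail_every = fail_every
        self.calls = 0
        self.injected = 0

    def __call__(self, *args):
        self.calls += 1
        if self.calls % self.fail_every == 0:
            self.injected += 1
            raise ConnectionError("injected failure")
        return self.func(*args)


def call_with_retry(dependency, arg, attempts: int = 3):
    """The adaptive behavior under test: retry a few times before giving up."""
    for _ in range(attempts):
        try:
            return dependency(arg)
        except ConnectionError:
            continue
    raise RuntimeError("dependency unavailable after retries")


lookup = FaultInjector(lambda user_id: f"profile-{user_id}", fail_every=3)
results = [call_with_retry(lookup, user_id) for user_id in range(10)]

print(len(results), lookup.injected)  # prints 10 4
```

All ten requests succeed even though four failures were injected, which is the kind of evidence a chaos experiment looks for. Removing the retry and rerunning shows what the failure would have looked like to users.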

The next step is moving away from tests and ensuring resilience thinking is baked into the way everything is built. Ways to segue into that state of mind are creating a chaos engineering team, doing chaos architecture, and organizing game days where teams focus all their efforts on testing their systems’ resiliency.

Conclusion

Resilience engineering is all about adapting to expected and unexpected changes. A resilient system expects failure and can find ways to keep on working reliably.