Building resilient connections using a circuit breaker


Software systems are increasingly integrated systems. They integrate with datastores, with logging and monitoring systems, with services provided by other components (microservices), and with SaaS offerings from external parties. When software components run in the cloud, they need to be as stateless as possible, easy to scale, and resilient to failures of other components. One way to create more resilience is decoupling, for example through asynchronous actions or queues. In some situations, however, synchronous communication is required. When the user needs direct feedback and our component has to call another system, we can place a circuit breaker between our component and the connection to that system. The circuit breaker monitors the calls to the external system and prevents us from continuously trying to call it even when it is down. It also gives us information about the current state and some basic statistics.

This blog post summarises what we need from a circuit breaker. Through a demo using the excellent Resilience4j library, we show several aspects of the circuit breaker in action.

Important circuit breaker concepts

The best-known circuit breaker is the one everybody has at home. According to Wikipedia, it is an electrical switch that opens in case of an overload or a short circuit. By opening, the switch prevents the overload or short circuit from damaging the appliances connected to it.

In this post, Martin Fowler explains the concepts behind the circuit breaker and why it is interesting for software components. A circuit breaker monitors for failures (short circuit) and slow responses (overload). If the rate of failures or slow responses becomes too high, the circuit breaker opens, and no calls get past it any longer. After a cool-down period, the circuit breaker lets a few calls pass to check for improvements. If those calls go through, the circuit breaker's state goes back to closed, and all calls can pass again.
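To make this state machine concrete, here is a minimal, simplified sketch of these transitions in Java. It is purely illustrative and not the Resilience4j implementation; the class name, failure threshold, and cool-down duration are made up for the example.

enum State { CLOSED, OPEN, HALF_OPEN }

// Illustrative only: count consecutive failures, open above a threshold,
// and allow a probe call again after a cool-down period.
class NaiveCircuitBreaker {
    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0L;

    private static final int FAILURE_THRESHOLD = 5;       // failures before opening
    private static final long COOL_DOWN_MILLIS = 10_000L; // wait before probing again

    synchronized boolean allowCall() {
        if (state == State.OPEN
                && System.currentTimeMillis() - openedAt > COOL_DOWN_MILLIS) {
            state = State.HALF_OPEN; // cool-down over, let a probe call through
        }
        return state != State.OPEN;
    }

    synchronized void onSuccess() {
        failures = 0;
        state = State.CLOSED; // the other system recovered, close again
    }

    synchronized void onFailure() {
        failures++;
        if (state == State.HALF_OPEN || failures >= FAILURE_THRESHOLD) {
            state = State.OPEN; // trip the breaker and start the cool-down
            openedAt = System.currentTimeMillis();
        }
    }
}

Resilience4j implements a more sophisticated version of this idea: it tracks failure and slow-call rates over a sliding window instead of a simple counter, as we will see in the configuration below.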

With a circuit breaker in place, we want insight into its state and into all the calls that pass through it: the current failure rate, the slow-call rate, the number of successful calls, and the number of failed calls. Next to monitoring, we also want to configure thresholds, rates, and recovery times to suit our needs.

Introduction of the Resilience4j library

Googling for Java and circuit breaker gives a lot of posts. In the past, I used a project called Hystrix. Hystrix integrates well with Spring, but it also pulls in a lot of libraries, and it is now in maintenance mode, so its use is discouraged. Another library that is mentioned a lot is Resilience4j. Besides the circuit breaker, Resilience4j provides additional components such as a retry mechanism, a rate limiter, and a time limiter. For the monitoring part, it has integrations with Micrometer and Grafana, and it also comes with integrations for Spring Boot and Spring Cloud.

I am going to use Resilience4j to demo the concepts. You can find the sample code on GitHub. The demo contains a basic Spring Boot application to integrate with over REST. This REST application has endpoints that return a success, an error, a slow response, or a timeout.

The source code shown here focuses on the Resilience4j code. You can find all the other code on GitHub; the project contains a readme file to get you started.

First, we configure the circuit breaker. Some configuration options deal with thresholds, for example the failure rate within a window. Others deal with the open state and how fast we can go back to the half-open state. You can also configure which exceptions should trip the circuit breaker and which should not. Because the test I am running is fast, the values are low; we need them this low to make sure we see all the different conditions.

CircuitBreakerConfig circuitBreakerConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(10) // 10% of calls resulting in an error trips the breaker
    .slowCallRateThreshold(50) // 50% of calls being too slow trips the breaker
    .waitDurationInOpenState(Duration.ofMillis(10)) // Wait 10 milliseconds before going into half-open state
    .slowCallDurationThreshold(Duration.ofMillis(50)) // A call taking more than 50 milliseconds counts as slow
    .permittedNumberOfCallsInHalfOpenState(5) // Allow a maximum of 5 calls in half-open state
    .minimumNumberOfCalls(10) // Require at least 10 calls in the window before calculating rates
    .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.TIME_BASED) // Use a time-based, not count-based, window
    .slidingWindowSize(5) // Record 5 seconds of calls in the window
    .recordExceptions(UniformInterfaceException.class) // Exception thrown by the REST client, counted as a failure
    .ignoreExceptions(DummyException.class) // Business exception that is not a failure for the circuit breaker
    .build();
this.circuitBreaker = CircuitBreaker.of("dummyBreaker", circuitBreakerConfig);
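The sample creates the circuit breaker directly from the configuration. In a larger application you would typically go through a CircuitBreakerRegistry, so several breakers can share a default configuration. A short sketch of that variant, reusing the circuitBreakerConfig from above (not what the demo does):

// Create breakers through a registry so multiple breakers can share one configuration
CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(circuitBreakerConfig);
CircuitBreaker dummyBreaker = registry.circuitBreaker("dummyBreaker");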

With the configured circuit breaker, we can start wrapping method calls. Resilience4j uses Java's functional interfaces (such as Supplier) to decorate method calls.

public void call(String message, LocalDateTime timestamp) {
    // Decorate the call to the remote endpoint with the circuit breaker
    Supplier<String> stringSupplier = circuitBreaker.decorateSupplier(
        () -> dummyEndpoint.executeCall(message, LocalDateTime.now())
    );
    try {
        String test = stringSupplier.get();
        LOGGER.info("{}: {}", timestamp.toString(), test);
    } catch (UniformInterfaceException e) {
        // The remote call failed; the circuit breaker records this as a failure
        LOGGER.info("We have found an exception with message: {}", message);
    } catch (CallNotPermittedException e) {
        // The circuit breaker is open and did not let the call through
        LOGGER.info("The circuitbreaker is now Open, so calls are not permitted");
    }
    printMetrics();
}
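If you do not need the intermediate Supplier, the circuit breaker also offers execute methods that decorate and invoke the call in one go. A short sketch using the same dummyEndpoint as above:

// Decorate and execute in one step; throws CallNotPermittedException when the breaker is open
String response = circuitBreaker.executeSupplier(
    () -> dummyEndpoint.executeCall(message, LocalDateTime.now())
);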

Of course, you want to know what is going on inside the circuit breaker. You can request metrics through the Metrics API, and through the event publisher you can register callbacks for successes, failures, state changes, and so on.

private void printMetrics() {
    CircuitBreaker.Metrics metrics = this.circuitBreaker.getMetrics();
    String rates = String.format("Rate of failures: %.2f, slow calls: %.2f",
        metrics.getFailureRate(),
        metrics.getSlowCallRate()
    );
    String calls = String.format("Calls: success %d, failed %d, not permitted %d, buffered %d",
        metrics.getNumberOfSuccessfulCalls(),
        metrics.getNumberOfFailedCalls(),
        metrics.getNumberOfNotPermittedCalls(),
        metrics.getNumberOfBufferedCalls()
    );
    String slow = String.format("Slow: total %d, success %d, failed %d",
        metrics.getNumberOfSlowCalls(),
        metrics.getNumberOfSlowSuccessfulCalls(),
        metrics.getNumberOfSlowFailedCalls()
    );
    LOGGER.info(rates);
    LOGGER.info(calls);
    LOGGER.info(slow);
}

private void addLogging(CircuitBreaker circuitBreaker) {
    circuitBreaker.getEventPublisher()
        .onSuccess(event -> LOGGER.info("SUCCESS"))
        .onError(event -> LOGGER.info("ERROR - {}", event.getThrowable().getMessage()))
        .onIgnoredError(event -> LOGGER.info("IGNORED_ERROR - {}", event.getThrowable().getMessage()))
        .onReset(event -> LOGGER.info("RESET"))
        .onStateTransition(event -> LOGGER.info("STATE_TRANSITION - {} > {}",
            event.getStateTransition().getFromState(), event.getStateTransition().getToState()));
}

That’s it. You now know what you need to use the Resilience4j library and create a circuit breaker. I leave the other features that Resilience4j offers for you to explore. The final bit is about running the sample and seeing that it works.

Running the sample

You can find the source code in the following GitHub project:

https://github.com/jettro/resilience4j-demo

The sample loops over three different calls: one to the dummy endpoint, one to the error endpoint, and one to the slow endpoint. We configured a minimum of 10 calls in the window. In the beginning, the following is logged:

Rate of failures: -1,00, slow calls: -1,00
Calls: success 5, failed 2, not permitted 0, buffered 7
Slow: total 2, success 0, failed 2

No rates are recorded yet, but after the next error call that changes. There are now enough calls in the window, and the failure rate (30%) is above the configured threshold of 10%. Therefore, the circuit breaker moves into the OPEN state.

STATE_TRANSITION - CLOSED > OPEN
Rate of failures: 30,00, slow calls: 30,00
Calls: success 7, failed 3, not permitted 0, buffered 10
Slow: total 3, success 0, failed 3

After 10 milliseconds, the circuit breaker changes state to half-open:

STATE_TRANSITION - OPEN > HALF_OPEN

But as we keep getting errors, it goes back into OPEN after a few calls.

References

https://en.wikipedia.org/wiki/Circuit_breaker

https://martinfowler.com/bliki/CircuitBreaker.html

https://resilience4j.readme.io/docs/circuitbreaker

https://github.com/Netflix/Hystrix/wiki/How-it-Works