Traceable tests


Have you ever spent hours or days trying to figure out why some API test is failing? Whenever something like that happens to me, my immediate thought is: what could have helped me find this problem faster?

Unit tests are much better at this: when they fail, you just run them again locally, set a breakpoint if you have to, and find out what went wrong. But with tests that run against your API (yes, you need those too!) it can be much harder to figure out exactly what code got executed. There are all sorts of boundaries that make this hard, especially when those tests run as part of your pipeline (yes, you need to do that too!). You can’t just set a breakpoint in your pipeline, and even if you have logging in your backend, how do you know which log entries resulted from which test scenario?

I’ve been playing with the idea of linking backend logs to tests for some time. Years back, I had a plan to add the name of the test to a custom HTTP header when calling the API from tests. In the backend I could add that value to the MDC context that we attach to log entries. The result would be that for each log entry, you could see which test triggered it.

Déjà vu

Somehow I never got around to implementing that idea. Last year, though, I helped out a team working on an API-only service that ran into the very same problem. Their system consists of a set of microservices, and some of the API calls go through several of these services to handle a request. In addition to unit tests, one of their automated validations is a regression test suite that runs against the API to validate expected behavior.

So far, so good, and that test suite was really valuable. It allowed the team to quickly implement new functionality without spending a lot of time checking that existing functionality kept working.

You know what’s coming now… whenever a test would fail, it was a nightmare to figure out why.

Luckily, I had just finished hooking up all the services to an Elastic Cloud environment, with APM-powered tracing and all the good stuff. The traces gave excellent insight into what happened inside and between the microservices. However, it was still really hard to link a specific test failure to a set of traces.

OpenTelemetry FTW 😃

I still had that earlier idea in the back of my head, and when I saw the team struggling with these failing tests, a new one popped up: what if I linked the traces to the tests?

The typical way to pass trace IDs across service calls is with a trace header. OpenTelemetry uses the W3C traceparent header for that, and the Elastic APM agent used in the services picks it up out of the box.

My first idea was to return the traceparent header in the API responses and then log it with each test. However, that’s not really how the traceparent header is supposed to be used: you’re supposed to include it in the calling request. So that’s what I tried next.

Show me the code!

The first thing I did was locate how the test suite was making its API calls. It turned out this was done with Apache HttpClient, which offers a nice way to inject additional headers through request interceptors.

import static TraceInstrumentationInit.openTelemetry;
import static TraceInstrumentationInit.tracer;

// Interceptor that will inject the traceparent header into every outgoing request.
public class HttpClientTraceInstrumentation implements HttpProcessor {
  @Override
  public void process(HttpRequest request, HttpContext context) {
    injectTraceHeaders(request);
  }

  @Override
  public void process(HttpResponse response, HttpContext context) {
  }

  private static void injectTraceHeaders(HttpRequest request) {
    // TODO
  }

  public static HttpClient registerTraceInstrumentation(
        DefaultHttpClient httpClient) {
    HttpClientTraceInstrumentation traceInstrumentation = 
      new HttpClientTraceInstrumentation();
    httpClient.addRequestInterceptor(traceInstrumentation);
    httpClient.addResponseInterceptor(traceInstrumentation);
    return httpClient;
  }
}

OK, so that gives us a way to inject the traceparent header. At first I figured I could generate the trace ID myself. However, the traceparent value is not just a trivial UUID; it has a very specific structure, defined by the W3C Trace Context specification, and I wouldn’t advise trying to generate it yourself.
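
For reference, a traceparent value packs four hex-encoded fields into a single header: a version, the trace ID, the parent (span) ID, and trace flags. The example below is taken from the W3C Trace Context specification:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01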

Luckily, OpenTelemetry has excellent SDKs that make it easy to generate and inject these headers. So we need a dependency on the OpenTelemetry API and SDK.

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>io.opentelemetry</groupId>
      <artifactId>opentelemetry-bom</artifactId>
      <version>1.18.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
<dependencies>
  <dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-api</artifactId>
  </dependency>
  <dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-sdk</artifactId>
  </dependency>
</dependencies>
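
One side note on dependencies: the HTTP_METHOD and HTTP_URL attribute constants used in the instrumentation further down come from OpenTelemetry’s SemanticAttributes class, which ships in a separate artifact. If you want those constants instead of plain attribute names, you would add something along these lines (the semconv artifact is versioned separately, with -alpha versions at the time; check the version that matches your SDK):

<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-semconv</artifactId>
  <version>1.18.0-alpha</version>
</dependency>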

Next, we had to initiate a trace for each scenario, and a sub-span for each API call being made.

After looking at the codebase, we saw we could add that to an existing base class that all tests already used to report scenario results to TestRail. Personally, I’m not a big fan of using base classes for tests: it can be hard to maintain a sensible hierarchy, and the inheritance creates tight coupling between tests, which we don’t want. In JUnit, for example, I would use @ExtendWith to keep such responsibilities separated and easier to reuse, and to avoid the static instances you see here. But we didn’t want to overhaul the complete test suite just to introduce this; better not to mix in a big refactoring like that. The main goal at this point was to validate that the traces could work.
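
For what it’s worth, the @ExtendWith route could look roughly like the sketch below. This is not what we built: TraceExtension is a hypothetical name, and it assumes a startTrace variant that accepts a plain test name instead of a scenario object.

import org.junit.jupiter.api.extension.AfterEachCallback;
import org.junit.jupiter.api.extension.BeforeEachCallback;
import org.junit.jupiter.api.extension.ExtensionContext;

// Sketch only: a JUnit 5 extension as an alternative to base-class hooks.
public class TraceExtension implements BeforeEachCallback, AfterEachCallback {
  private final ScenarioTraceInstrumentation instrumentation =
    new ScenarioTraceInstrumentation();

  @Override
  public void beforeEach(ExtensionContext context) {
    // hypothetical startTrace overload that accepts the test's display name
    instrumentation.startTrace(context.getDisplayName());
  }

  @Override
  public void afterEach(ExtensionContext context) {
    instrumentation.endTrace();
  }
}

Tests would then opt in with @ExtendWith(TraceExtension.class) instead of inheriting the behavior from a base class.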

As you can see in the Hooks class below, we also pass the traceId to TestRail; more on that later.

// Hooks around every scenario: start a trace before, end it after,
// and report the result (including the trace id) to TestRail.
public class Hooks extends BaseSteps {
  private static final TestRailReporter testRailReporter =
    TestRailReporterFactory.get();
  private static final ScenarioTraceInstrumentation instrumentation =
    new ScenarioTraceInstrumentation();

  @Before
  public void setup(Scenario scenario) {
    instrumentation.startTrace(scenario);
  }

  @After
  public void teardown(Scenario scenario) {
    String traceId = instrumentation.endTrace();
    testRailReporter.reportResultToTestRail(scenario, traceId);
  }
}

For every scenario, we start a new span and add attributes that link it to the scenario. We also do some rudimentary logging per scenario, making it easy to spot in the test logs which scenario ran at which point and what the trace ID was for that run.

import static TraceInstrumentationInit.tracer;

public class ScenarioTraceInstrumentation {
  private final ThreadLocalSpan spanState = new ThreadLocalSpan();

  public void startTrace(Scenario scenario) {
    // One root span per scenario; the spans for the individual HTTP calls are
    // created while this span is current and therefore become its children.
    Span span = spanState.start(
      tracer.spanBuilder(scenario.getName())
            .setSpanKind(SpanKind.CLIENT)
    );
    span.setAttribute("test.scenario_id", scenario.getId());

    System.out.println("|---------\n"
      + "| SCENARIO: " + scenario.getId() + " - " + scenario.getName() + "\n"
      + "| trace.id: " + span.getSpanContext().getTraceId());
  }

  public String endTrace() {
    String traceId = spanState.traceId();
    spanState.end();
    return traceId;
  }
}

To keep track of span state, I used a ThreadLocal, exposing a simple API to start and end the span that we could use from both ScenarioTraceInstrumentation and HttpClientTraceInstrumentation. As you can see, I did get a bit lazy here and used a Pair to store both the Span and the Scope; a record would make the code easier to read (see the sketch after the class below).

class ThreadLocalSpan {
  private final ThreadLocal<Pair<Span, Scope>> state = new ThreadLocal<>();

  Span start(SpanBuilder spanBuilder) {
    Span span = spanBuilder.startSpan();
    Scope scope = span.makeCurrent();
    state.set(Pair.of(span, scope));
    return span;
  }

  String traceId() {
    return state.get().getLeft().getSpanContext().getTraceId();
  }

  void end() {
    Pair<Span, Scope> pair = state.get();
    pair.getRight().close();
    pair.getLeft().end();
    state.remove(); // clear the ThreadLocal so state does not leak into reused threads
  }
}
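
For completeness, the record-based variant mentioned above could look like the sketch below (assuming Java 16+). SpanWithScope is a made-up name, and it uses the same OpenTelemetry types as the class above.

// Sketch: a record makes the two pieces of state explicit and easier to read than a Pair.
record SpanWithScope(Span span, Scope scope) {}

class ThreadLocalSpan {
  private final ThreadLocal<SpanWithScope> state = new ThreadLocal<>();

  Span start(SpanBuilder spanBuilder) {
    Span span = spanBuilder.startSpan();
    state.set(new SpanWithScope(span, span.makeCurrent()));
    return span;
  }

  String traceId() {
    return state.get().span().getSpanContext().getTraceId();
  }

  void end() {
    SpanWithScope current = state.get();
    current.scope().close();
    current.span().end();
    state.remove(); // clear the ThreadLocal so state does not leak into reused threads
  }
}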

The OpenTelemetry SDK also needs some initialization. To keep things simple for this first experiment, and since some of the other code was static already, we just created two package-scoped statics to be used from ScenarioTraceInstrumentation and HttpClientTraceInstrumentation.

class TraceInstrumentationInit {
  static final OpenTelemetry openTelemetry = initOpenTelemetry();
  static final Tracer tracer = openTelemetry.getTracer("regressionTest");

  private static OpenTelemetry initOpenTelemetry() {
    // Note: no exporter is configured; these test-side spans only exist to
    // generate trace ids and propagate them via the traceparent header.
    SdkTracerProvider sdkTracerProvider = SdkTracerProvider
        .builder().build();
    OpenTelemetrySdk sdk = OpenTelemetrySdk.builder()
        .setTracerProvider(sdkTracerProvider)
        .setPropagators(ContextPropagators.create(
          W3CTraceContextPropagator.getInstance()
        ))
        .build();
    Runtime.getRuntime().addShutdownHook(
      new Thread(sdkTracerProvider::close)
    );
    return sdk;
  }
}

And finally, we could put all the pieces together in the HttpClientTraceInstrumentation shown below.

We create an additional span for the request [1] and then inject the traceparent header into the request [2].

The OpenTelemetry SDK code is very generic, allowing you to inject headers into pretty much anything, from HTTP requests to messages in all sorts of client libraries. That genericity does make the code a bit hard to read, but essentially you need to pass it a setter method that accepts two string arguments: the header name and the header value. The HttpRequest from HttpClient has exactly such a method: setHeader [3].

Last but not least, we also log every request being made [4], giving a nice overview in the test logs of what API calls are being made for each scenario.

import static TraceInstrumentationInit.openTelemetry;
import static TraceInstrumentationInit.tracer;

public class HttpClientTraceInstrumentation implements HttpProcessor {
  // W3C trace context header name (HTTP header lookups are case-insensitive)
  private static final String W3C_TRACEPARENT_HEADER = "traceparent";

  private static final TextMapPropagator textMapPropagator =
    openTelemetry.getPropagators().getTextMapPropagator();

  private static final TextMapSetter<HttpRequest> setter =
    HttpRequest::setHeader; // [3] http client setHeader method

  private final ThreadLocalSpan spanState = new ThreadLocalSpan();

  @Override
  public void process(HttpRequest request, HttpContext context) {
    Span span = spanState.start(
      tracer.spanBuilder("/")
          .setSpanKind(SpanKind.CLIENT)
    );
    try (Scope ignored = span.makeCurrent()) {
      // [1] add span details
      // HTTP_METHOD and HTTP_URL are the attribute constants from OpenTelemetry's
      // SemanticAttributes class (in the opentelemetry-semconv artifact)
      span.setAttribute(HTTP_METHOD, request.getRequestLine().getMethod());
      span.setAttribute(HTTP_URL, request.getRequestLine().getUri());
      span.setAttribute("component", "http");

      injectTraceHeader(request);
      logRequest(request);
    }
  }

  @Override
  public void process(HttpResponse response, HttpContext context) {
    spanState.end();
  }

  private static void injectTraceHeader(HttpRequest request) {
    // [2] let opentelemetry sdk propagate any required headers
    textMapPropagator.inject(Context.current(), request, setter);
  }

  private static void logRequest(HttpRequest request) {
    System.out.println( // [4]
      "|- " + request.getRequestLine() + "\n" +
      "|  ^- " + request.getLastHeader(W3C_TRACEPARENT_HEADER));
  }
}

Everything coming together

Now when the tests run, their logs clearly show each scenario, its trace ID, and the API calls being invoked. And for every API call, the complete traceparent header is shown. With these details propagated to our backend, we can now look up all the calls for a scenario in Elastic, or even search there for a span ID to locate the trace of a specific API call.

|---------
| SCENARIO: T2748705 - Validate if method PUT was added to controller and it's possible to use it
| trace.id: d28a64332015de5d324cb3e0f0380eba
|- PUT http://.../api/...
| ^- traceparent: ...
...
...
|---------
| SCENARIO: ...
| trace.id: ...
|- GET http://.../api/...
| ^- traceparent: ...
...
...

Integrate existing tools

At this client, the team used TestRail to get a central view of all test runs. To find the cause of a failing test, wouldn’t it be nice if we could skip sifting through logs altogether? As you saw before, we had the trace ID available when posting results to TestRail, so we added a link there that takes you directly to the corresponding trace in Elastic.
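
The reporting code itself depends on the TestRail client in use, but the link can be as simple as a Kibana APM URL filtered on the trace ID. Below is a rough sketch; KIBANA_BASE_URL and the exact URL format are assumptions that will differ per Elastic setup.

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class ElasticTraceLinks {
  // Assumption: the base URL of your Kibana instance
  private static final String KIBANA_BASE_URL = "https://kibana.example.com";

  // Builds a link to the APM traces view, filtered on the given trace id.
  // The /app/apm/traces?kuery=... path is an assumption; adjust to your Kibana version.
  static String traceLink(String traceId) {
    String kuery = URLEncoder.encode("trace.id : \"" + traceId + "\"", StandardCharsets.UTF_8);
    return KIBANA_BASE_URL + "/app/apm/traces?kuery=" + kuery;
  }
}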

Link to Elastic trace integrated into TestRail

Results!

When you follow that link to Elastic, it shows you all the API calls (transactions) that were part of that test scenario (trace).

Trace in Elastic

From there you can navigate to each API call and look at individual traces. Elastic shows a lot of useful info, like a latency distribution chart that makes it super easy to quickly select all slow calls or all failing calls.

Trace details in Elastic

There is also a dedicated errors section, which allows you to quickly locate details for any errors that occurred as part of the scenario traces.

Trace errors in Elastic

Tip: If you want to take a closer look at what’s inside these kinds of traces, there is a nice free Elastic demo environment where you can play around with traces in the APM section.

Conclusion

With just a little bit of code, we managed to dramatically reduce the time needed to pinpoint test failures, which shows once again that optimizing for quick work loops and short feedback cycles is crucial for improving software delivery performance.