Tracing Services, OpenTracing and Jaeger


    In our projects we use microservice architecture. If performance bottlenecks arise, a lot of time is spent on monitoring and analyzing logs. When logging timings of individual operations into a log file, it is usually difficult to understand what led to the call of these operations, to track the sequence of actions or the time offset of one operation relative to another in different services.

    To minimize manual labor, we decided to use one of the tracing tools. About how and for what it is possible to use tracing and how we did it, and we will discuss this article.

    What problems can be solved with trace

    1. Find performance bottlenecks both within a single service and in the entire execution tree between all participating services. For instance:

      • Many short consecutive calls between services, for example, to geocoding or to a database.
      • Long waiting for input input, for example, transferring data over a network or reading from disk.
      • Long data parsing.
      • Long operations requiring cpu.
      • Pieces of code that are not needed to get the final result and can be deleted or run delayed.
    2. Clearly understand in what sequence what is called and what happens when the operation is performed.

      It can be seen that, for example, the Request came to the WS service -> the WS service supplemented the data through the R service -> then sent the request to the V service -> the V service loaded a lot of data from the R service -> went to the P service -> the P service went off again to the service R -> service V ignored the result and went to the service J -> and only then returned the answer to the WS service, while continuing to calculate something else in the background.

      Without such a trace or detailed documentation for the whole process, it is very difficult to understand what is happening the first time you look at the code, and the code is scattered across different services and hidden behind a bunch of bins and interfaces.
    3. Collection of information about the execution tree for subsequent pending analysis. At each stage of execution, you can add information to the trace that is available at this stage and then figure out what input data led to a similar scenario. For instance:

      • User ID
      • Rights
      • Type of method selected
      • Log or execution error
    4. Turn traces into a subset of metrics and further analysis as metrics.

    What can trace logging. Span

    In tracing, there is the concept of span, it is an analogue of one log to the console. The span has:

    • The name, usually the name of the method that was executed
    • The name of the service in which the span was generated
    • Own unique ID
    • Some meta information in the form of key / value, which was pledged to it. For example, method parameters or the method ended with an error or not
    • The start and end times of this span
    • Parent span id

    Each span is sent to the span collector for saving to the database for later viewing as soon as it has completed its execution. In the future, you can build a tree of all spans by connecting by the id of the parent. In the analysis, you can find, for example, all spans in some service that took more than some time. Further, going to a specific span, see the whole tree above and below this span.


    Opentracing, Jagger and how we implemented it for our projects

    There is a general Opentracing standard that describes how and what should be assembled without being tied to a specific implementation in any language. For example, in Java, all work with traces is done through the general Opentracing API, and under it, for example, Jaeger or an empty default implementation that does nothing can be hidden.

    We use Jaeger as an implementation of Opentracing. It consists of several components:


    • Jaeger-agent is a local agent that usually stands on each machine and services are logged into it on the local default port. If there is no agent, then the traces of all services on this machine are usually turned off
    • Jaeger-collector - all agents send collected traces to it, and he puts them in the selected database
    • The database is their preferred cassandra, but we use elasticsearch, there are implementations for a couple of other databases and an in memory implementation that does not save anything to disk
    • Jaeger-query is a service that goes to the database and gives already collected traces for analysis
    • Jaeger-ui is a web interface for searching and viewing traces, it goes to jaeger-query


    A separate component is the implementation of opentracing jaeger for specific languages, through which spans are sent to jaeger-agent.

    Connecting Jagger in Java comes down to simulating the io.opentracing.Tracer interface, after which all the traces through it will fly off to the real agent.


    You can also connect opentracing-spring-cloud-starter and an implementation from Jaeger opentracing-spring-jaeger-cloud-starter which configures automatically tracing for everything that passes through these components, for example, http requests to controllers, database requests via jdbc etc.

    Tracing Logging in Java

    Somewhere at the very highest level, the first Span should be created, this can be done automatically, for example, by the spring controller when a request is received, or manually if there is none. Further it is transmitted through Scope below. If some method below wants to add Span, it takes the current activeSpan from Scope, creates a new Span and says that its parent received activeSpan, and makes the new Span active. When external services are called, the current active span is transferred to them, and those services create new spans with reference to this span.

    All work goes through the Tracer instance, you can get it through the DI mechanism, or GlobalTracer.get () as a global variable if the DI mechanism does not work. By default, if the tracer was not initialized, NoopTracer will return which does nothing.

    Further, the current scope is obtained from the tracer through the ScopeManager, a new scope is created from the current one with the binding of the new span, and then the created Scope is closed, which closes the created span and returns the previous Scope to the active state. Scope is tied to a stream, so when multi-threaded programming, you must not forget to transfer the active span to another stream, for further activation of Scope of another stream with reference to this span.

    io.opentracing.Tracer tracer = ...; // GlobalTracer.get()
    void DoSmth () {
       try (Scope scope = tracer.buildSpan("DoSmth").startActive(true)) {
    void DoOther () {
        Span span = tracer.buildSpan("someWork").start();
        try (Scope scope = tracer.scopeManager().activate(span, false)) {
            // Do things.
        } catch(Exception ex) {
            Tags.ERROR.set(span, true);
            span.log(Map.of(Fields.EVENT, "error", Fields.ERROR_OBJECT, ex, Fields.MESSAGE, ex.getMessage()));
        } finally {
    void DoAsync () {
        try (Scope scope = tracer.buildSpan("ServiceHandlerSpan").startActive(false)) {
            final Span span = scope.span();
            doAsyncWork(() -> {
                // STEP 2 ABOVE: reactivate the Span in the callback, passing true to
                // startActive() if/when the Span must be finished.
                try (Scope scope = tracer.scopeManager().activate(span, false)) {

    For multi-threaded programming, there is also a TracedExecutorService and similar wrappers that automatically forward the current span to the stream when launching asynchronously tasks:

    private ExecutorService executor = new TracedExecutorService(
        Executors.newFixedThreadPool(10), GlobalTracer.get()

    For external http requests, there is TracingHttpClient

    HttpClient httpClient = new TracingHttpClientBuilder().build();

    The problems we are facing

    • Beans and DI do not always work if tracer is not used in a service or component, then Autowired Tracer may not work and you will have to use GlobalTracer.get ().
    • Annotations do not work if it is not a component or service, or if a method call comes from a neighboring method of the same class. You need to be careful, check what works, and use the manual creation of the trace if @Traced does not work. You can also screw an additional compiler for java annotations, then they should work everywhere.
    • In the old spring and spring boot, the auto-configuration of opentraing spring cloud does not work due to bugs in DI, then if you want traces in the spring components to work automatically, you can do it by analogy with master / opentracing-spring-jaeger-starter / src / main / java / io / opentracing / contrib / java / spring / jaeger / starter /
    • Try with resources does not work in groovy, you must use try finally.
    • Each service should have its own under which the traces will be logged. What does a separate name for sale and test, so as not to interfere with them together.
    • If you use GlobalTracer and tomcat, then all services running in this tomcat have one GlobalTracer, so they will all have the same service name.
    • When adding traces to a method, you need to be sure that it is not called in the loop many times. It is necessary to add one common trace for all calls, which guarantees the total time of work. Otherwise, excess load will be created.
    • Once in jaeger-ui they made too large requests for a large number of traces and since they did not wait for an answer they did it again. As a result, jaeger-query began to eat a lot of memory and slow down the elastic. Helped restart jaeger-query

    Sampling, storing and viewing traces

    There are three types of trace sampling :

    1. Const which sends and saves all traces.
    2. Probabilistic which filters traces with some given probability.
    3. Ratelimiting which limits the number of traces per second. You can configure these options on the client, either on the jaeger-agent or in the collector. Now we have const 1 in the stack of valuators, since there are not very many requests, but they take a long time. In the future, if this will exert excessive load on the system, you can limit it.

    If you use cassandra then by default it stores traces in only two days. We use elasticsearch and traces are stored for the whole time and are not deleted. A separate index is created for each day, for example, jaeger-service-2019-03-04. In the future, you need to configure the automatic cleaning of old traces.

    In order to watch the courses you need:

    • Choose a service by which you want to filter the traces, for example tomcat7-default for a service that is running on a tomato and cannot have a name.
    • Next, select the operation, the time interval and the minimum operation time, for example from 10 seconds, to take only long runs.
    • Go to one of the tracks and see what was slowing down there.

    Also, if some id of the request is known, then you can find a trace by this id through a tag search, if this id is logged in the span of the trace.




    Also popular now: