Trace one request across the whole system

Scenario

A checkout takes 8 seconds. Logs exist in six services, Kafka, and the payment API—but no one line tells the story. You need a single trace: gateway → order service → inventory → DB → Kafka consumer → partner HTTP, with timing on each hop, so you can see where time went and why retries look like different requests.

After reading, you should be able to:

Why — logs alone do not show the critical path

In microservices, one user action fans out across processes and machines. Each service logs independently; timestamps and thread names do not line up across hosts. A distributed trace is one tree of spans sharing a trace_id, each span recording an operation (HTTP call, DB query, message publish) with start time and duration. The waterfall view answers: which hop consumed the 8 seconds?

Core concepts

Term                  Meaning
Trace                 End-to-end journey; one trace_id (128-bit hex)
Span                  One unit of work; has span_id, parent, service name, attributes
Root span             Entry point (e.g. gateway received request)
Child span            Work triggered by parent (DB call inside service B)
Context propagation   Pass trace id + parent span id to the next hop

Where traces break in real systems

What — follow one request end to end

  1. Get a trace id: from a support ticket, an access log (X-Trace-Id), or a slow-request alert in your APM (Jaeger, Tempo, Honeycomb, Datadog, etc.).
  2. Open the trace waterfall: sort spans by start time and follow the longest bars down from the root; that chain is the critical path.
  3. Read span attributes: http.route, db.statement (sanitized), messaging.destination, error, http.status_code.
  4. Check for gaps: a parent HTTP client span 3s long with no child in service C → propagation failure, or C is not instrumented.
  5. Separate retries: multiple sibling HTTP spans with the same route often mean retries; compare their status codes and errors to explain inconsistent outcomes.
  6. Cross-link to logs: filter logs by trace_id (the same value is in the MDC) to get one narrative across services.
  7. Compare a normal and a slow trace: same endpoint, different trace ids. Which span is extra, or longer?
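
Step 2 can be approximated in code once the spans are exported. The sketch below is a minimal illustration, not a backend API: SpanRow and its fields are hypothetical names, and "self time" (a span's duration minus the summed durations of its direct children, floored at zero) is a simplification that ignores children running in parallel. The sample data is the checkout trace from this article.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SlowestHop {
    /** Hypothetical span row; field names are illustrative, not a backend's export schema. */
    static final class SpanRow {
        final String spanId;
        final String parentId; // null for the root span
        final String name;
        final long durationMs;

        SpanRow(String spanId, String parentId, String name, long durationMs) {
            this.spanId = spanId;
            this.parentId = parentId;
            this.name = name;
            this.durationMs = durationMs;
        }
    }

    /** Self time = own duration minus the summed durations of direct children, floored at zero. */
    static Map<String, Long> selfTimes(List<SpanRow> spans) {
        Map<String, Long> childTotal = new HashMap<>();
        for (SpanRow s : spans) {
            if (s.parentId != null) {
                childTotal.merge(s.parentId, s.durationMs, Long::sum);
            }
        }
        Map<String, Long> self = new HashMap<>();
        for (SpanRow s : spans) {
            long children = childTotal.getOrDefault(s.spanId, 0L);
            self.put(s.spanId, Math.max(0L, s.durationMs - children));
        }
        return self;
    }

    /** Returns the span with the largest self time, or null for an empty trace. */
    static SpanRow slowest(List<SpanRow> spans) {
        Map<String, Long> self = selfTimes(spans);
        SpanRow best = null;
        for (SpanRow s : spans) {
            if (best == null || self.get(s.spanId) > self.get(best.spanId)) {
                best = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<SpanRow> trace = new ArrayList<>();
        trace.add(new SpanRow("g", null, "gateway POST /checkout", 8200));
        trace.add(new SpanRow("o", "g", "order-svc createOrder", 8100));
        trace.add(new SpanRow("i", "o", "order-svc INSERT", 45));
        trace.add(new SpanRow("h", "o", "order-svc HTTP inventory", 120));
        trace.add(new SpanRow("r", "h", "inventory reserve", 110));
        trace.add(new SpanRow("k", "o", "order-svc publish Kafka", 8));
        trace.add(new SpanRow("c", "o", "payment-consumer charge", 7800));
        trace.add(new SpanRow("p", "c", "payment POST /charge", 7790));

        SpanRow hop = slowest(trace);
        // prints: payment POST /charge self=7790ms
        System.out.println(hop.name + " self=" + selfTimes(trace).get(hop.spanId) + "ms");
    }
}
```

Ranking by self time rather than raw duration keeps the root and other wrapper spans from always winning, which is usually what "where did the time go" is asking.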

Propagation (W3C Trace Context)

Standard HTTP header:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             00                                 version
             4bf92f3577b34da6a3ce929d0e0e4736   trace_id (32 hex)
             00f067aa0ba902b7                   parent span_id (16 hex)
             01                                 flags (01 = sampled)

tracestate: vendor-specific hints (optional)

Every outbound call must inject context; every inbound call extracts and creates a child span. Same for Kafka record headers.
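
To make extract and inject concrete, here is a minimal sketch of what a propagator does with the header, written against the JDK only. The TraceParent class and its method names are hypothetical, it handles version 00 only, and in practice the OpenTelemetry propagators do this for you; the point is to show which fields survive a hop and which one changes.

```java
import java.security.SecureRandom;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TraceParent {
    // Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
    private static final Pattern V00 =
        Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");
    private static final SecureRandom RANDOM = new SecureRandom();

    final String traceId;
    final String spanId;
    final boolean sampled;

    TraceParent(String traceId, String spanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    /** Extract: returns null for a missing or malformed header so the caller can start a new trace. */
    static TraceParent parse(String header) {
        if (header == null) {
            return null;
        }
        Matcher m = V00.matcher(header.trim());
        if (!m.matches()) {
            return null;
        }
        String traceId = m.group(1);
        String spanId = m.group(2);
        if (traceId.matches("0+") || spanId.matches("0+")) {
            return null; // all-zero ids are invalid
        }
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) != 0;
        return new TraceParent(traceId, spanId, sampled);
    }

    /** New span in the same trace: keeps trace_id and the sampled flag, draws a fresh span_id. */
    TraceParent child() {
        long id;
        do {
            id = RANDOM.nextLong();
        } while (id == 0L);
        return new TraceParent(traceId, String.format("%016x", id), sampled);
    }

    /** Inject: the value to send as the outbound traceparent header. */
    String toHeader() {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }

    public static void main(String[] args) {
        TraceParent inbound =
            parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        TraceParent serverSpan = inbound.child(); // this service's span; its parent is inbound.spanId
        System.out.println(serverSpan.toHeader()); // same trace_id, new span_id
    }
}
```

If the trace_id changes between two hops, one side dropped the header or failed to parse it, which is exactly the gap described in step 4 above.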

What a fully propagated trace looks like (checkout)

[gateway] POST /checkout          8200ms total
  └─ [order-svc] createOrder      8100ms
       ├─ [order-svc] INSERT …     45ms
       ├─ [order-svc] HTTP inventory  120ms
       │    └─ [inventory] reserve      110ms
       ├─ [order-svc] publish Kafka     8ms
       └─ [payment-consumer] charge    7800ms  ← bottleneck
            └─ [payment] POST /charge   7790ms

Tools (pick one backend)

How — instrument and operate tracing in production

Minimum viable platform

  1. Adopt OpenTelemetry for all new services.
  2. Gateway injects or accepts traceparent; forwards to origins.
  3. Log pattern includes trace_id and span_id (MDC).
  4. Collector exports to trace store; retention 7–14 days (cost vs debuggability).
  5. Runbook: “get trace id from log → open UI.”

Java / Spring (conceptual)

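// Auto: OTel Java agent, attached with -javaagent:opentelemetry-javaagent.jar
//   OTEL_SERVICE_NAME=order-svc
//   OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
//   OTEL_EXPORTER_OTLP_PROTOCOL=grpc   (4317 is the gRPC port; agent 2.x defaults to http/protobuf on 4318)

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.context.Scope;

// Manual span for a business step.
// `tracer` is an io.opentelemetry.api.trace.Tracer, e.g. GlobalOpenTelemetry.getTracer("order-svc").
Span span = tracer.spanBuilder("applyDiscount").startSpan();
try (Scope scope = span.makeCurrent()) {
  // work; spans started in this block become children of applyDiscount
} catch (RuntimeException e) {
  span.recordException(e);           // attach the exception to the span
  span.setStatus(StatusCode.ERROR);  // mark the span as failed in the waterfall
  throw e;
} finally {
  span.end();                        // always end the span, or it is never exported
}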

Kafka propagation
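
Kafka has no HTTP headers, so the same traceparent value travels in record headers: the producer writes it before send, the consumer reads it before starting its span. The sketch below models record headers as a plain Map<String, byte[]> so it runs without the Kafka client; KafkaTraceHeaders and its methods are hypothetical names. With the real client the equivalent calls would be record.headers().add(key, bytes) on the producer side and record.headers().lastHeader(key) on the consumer side, and the OTel Java agent normally does both for you.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class KafkaTraceHeaders {
    static final String KEY = "traceparent";

    /** Producer side: copy the current context into the record headers before send. */
    static void inject(Map<String, byte[]> headers, String traceparent) {
        headers.put(KEY, traceparent.getBytes(StandardCharsets.UTF_8));
    }

    /** Consumer side: read the context back; null means no parent, so a new trace starts. */
    static String extract(Map<String, byte[]> headers) {
        byte[] raw = headers.get(KEY);
        return raw == null ? null : new String(raw, StandardCharsets.UTF_8);
    }

    /** Second dash-separated field of a traceparent value is the trace_id. */
    static String traceIdOf(String traceparent) {
        return traceparent.split("-")[1];
    }

    public static void main(String[] args) {
        String producerContext = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";

        Map<String, byte[]> recordHeaders = new HashMap<>(); // stand-in for Kafka record headers
        inject(recordHeaders, producerContext);

        String consumerContext = extract(recordHeaders);
        // prints: true
        System.out.println(traceIdOf(producerContext).equals(traceIdOf(consumerContext)));
    }
}
```

Header values are bytes, not strings, so both sides must agree on the encoding; a consumer that copies payloads into a new record without copying headers silently starts a new trace.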

Sampling strategy

Strategy                        Tradeoff
Head-based (decide at start)    Cheap; may drop the one slow trace you need
Tail-based (keep errors/slow)   Better for incidents; needs collector support
Always sample staging           Full fidelity for debugging
100% on canary                  Short window during risky deploy
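
The difference between the first two rows is what information the decision can use. The sketch below is an illustration under assumptions, not the OpenTelemetry sampler or a collector policy: Sampling, headSample and tailKeep are hypothetical names, and the thresholds are made up. The head decision sees only the trace id; the tail decision sees the finished trace and falls back to the head rule for unremarkable traffic.

```java
final class Sampling {
    /**
     * Head-based: decide from the trace id alone, before any work is done.
     * Deterministic, so every service that applies the same ratio reaches the same verdict.
     */
    static boolean headSample(String traceId, double ratio) {
        if (ratio >= 1.0) {
            return true;
        }
        if (ratio <= 0.0) {
            return false;
        }
        // Use the low 64 bits of the 128-bit id as a pseudo-random number.
        long low = Long.parseUnsignedLong(traceId.substring(16), 16);
        long bucket = low >>> 1; // drop one bit so the value is non-negative
        return bucket < (long) (ratio * Long.MAX_VALUE);
    }

    /**
     * Tail-based: decide after the trace is complete.
     * Always keep errors and slow traces; sample the rest at a baseline ratio.
     */
    static boolean tailKeep(String traceId, long rootDurationMs, boolean anyError,
                            long slowThresholdMs, double baselineRatio) {
        if (anyError || rootDurationMs >= slowThresholdMs) {
            return true;
        }
        return headSample(traceId, baselineRatio);
    }

    public static void main(String[] args) {
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        // The 8.2 s checkout is kept even with a 0% baseline because it crosses the slow threshold.
        System.out.println(tailKeep(traceId, 8200, false, 2000, 0.0)); // prints: true
        // A fast, error-free request is dropped at a 0% baseline.
        System.out.println(tailKeep(traceId, 150, false, 2000, 0.0));  // prints: false
    }
}
```

The tail rule needs every span of a trace in one place before it can decide, which is why the table lists collector support as its cost.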

Custom spans worth adding

Verify instrumentation

  1. Single test request: trace shows all expected services, no orphans.
  2. Log line trace id matches UI trace id.
  3. Kafka path: producer and consumer share trace id.
  4. Load test: collector and backend handle span volume; alerts on export failures.
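
Check 1 can be automated against an exported test trace. The sketch below assumes you can reduce the trace to two simple structures, a map of span id to parent id and the set of service names seen; OrphanCheck and its methods are hypothetical names, not part of any tracing backend.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class OrphanCheck {
    /**
     * Input maps span id to parent id (null for the root).
     * An orphan is a span whose parent id does not appear anywhere in the trace.
     */
    static List<String> orphans(Map<String, String> parentBySpan) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : parentBySpan.entrySet()) {
            String parent = e.getValue();
            if (parent != null && !parentBySpan.containsKey(parent)) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result); // stable output for assertions and reports
        return result;
    }

    /** Services you expected on the path but that contributed no span. */
    static Set<String> missingServices(Set<String> expected, Set<String> seen) {
        Set<String> missing = new TreeSet<>(expected);
        missing.removeAll(seen);
        return missing;
    }

    public static void main(String[] args) {
        Map<String, String> parents = new HashMap<>();
        parents.put("gateway-1", null);
        parents.put("order-1", "gateway-1");
        parents.put("reserve-1", "http-inventory-1"); // the client span was never exported

        // prints: [reserve-1]
        System.out.println(orphans(parents));

        Set<String> expected = new HashSet<>(Arrays.asList("gateway", "order-svc", "inventory"));
        Set<String> seen = new HashSet<>(Arrays.asList("gateway", "order-svc"));
        // prints: [inventory]
        System.out.println(missingServices(expected, seen));
    }
}
```

An orphan usually means the parent span was dropped or never exported; a missing service usually means the context never reached it.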

Interview one-liner

“I use one trace id propagated with W3C traceparent through HTTP and Kafka, auto-instrument JDBC and clients with OpenTelemetry, and read the waterfall to find the longest span on the critical path—then jump to logs with the same trace id.”
