Trace one request across the whole system

Scenario

A checkout takes 8 seconds. Logs exist in six services, Kafka, and the payment API—but no one line tells the story. You need a single trace: gateway → order service → inventory → DB → Kafka consumer → partner HTTP, with timing on each hop, so you can see where time went and why retries look like different requests.

After reading, you should be able to:

Explain trace, span, parent/child, and W3C traceparent propagation.
Follow one trace_id through HTTP, messaging, and JDBC spans.
Spot broken propagation (orphan spans, missing downstream segments).
Instrument Java services with OpenTelemetry and tie traces to logs.

Why — logs alone do not show the critical path

In microservices, one user action fans out across processes and machines. Each service logs independently; timestamps and thread names do not line up across hosts. A distributed trace is one tree of spans sharing a trace_id, each span recording an operation (HTTP call, DB query, message publish) with start time and duration. The waterfall view answers: which hop consumed the 8 seconds?

Core concepts

Term	Meaning
Trace	End-to-end journey; one `trace_id` (128-bit, written as 32 hex characters)
Span	One unit of work; has `span_id` (64-bit, written as 16 hex characters), parent, service name, attributes
Root span	Entry point (e.g. gateway received request)
Child span	Work triggered by parent (DB call inside service B)
Context propagation	Pass trace id + parent span id to the next hop

Where traces break in real systems

Header not forwarded (gateway strips traceparent).
Async thread without context attach: spans started on the worker thread lose their parent, and logs lose correlation.
Kafka consumer starts a new trace instead of continuing the producer trace.
Partner API black box — no child span unless you wrap the client.
Heavy sampling drops the slow trace you need.
Clock skew between hosts: each span's duration is measured locally and stays accurate, but start offsets can shift, so a child may appear to begin before its parent.
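The async-thread failure in the list above is easiest to see in a small experiment. The sketch below uses only the JDK: a plain ThreadLocal stands in for the tracing context, and the class and method names (TraceContextDemo, wrap, observe) are made up for illustration. It shows the mechanism, not the OpenTelemetry implementation; with the OpenTelemetry API the equivalent step is, to the best of my knowledge, wrapping the task with Context.current().wrap(...).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Stand-in for a tracing context: the current trace id lives in a ThreadLocal. */
final class TraceContextDemo {
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    /** Captures the caller's trace id now and restores it on whichever thread runs the task. */
    static <T> Callable<T> wrap(Callable<T> task) {
        String captured = TRACE_ID.get();
        return () -> {
            String previous = TRACE_ID.get();
            TRACE_ID.set(captured);
            try {
                return task.call();
            } finally {
                // Leave the worker thread as we found it.
                if (previous == null) {
                    TRACE_ID.remove();
                } else {
                    TRACE_ID.set(previous);
                }
            }
        };
    }

    /** Returns { trace id seen by a plain task, trace id seen by a wrapped task }. */
    static String[] observe() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Callable<String> readTraceId = TRACE_ID::get;
        TRACE_ID.set("4bf92f3577b34da6a3ce929d0e0e4736");
        try {
            Future<String> plain = pool.submit(readTraceId);
            Future<String> wrapped = pool.submit(wrap(readTraceId));
            return new String[] { plain.get(), wrapped.get() };
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            TRACE_ID.remove();
            pool.shutdown();
        }
    }
}
```

The plain task runs on a pool thread that never had the context attached, so it sees no trace id; the wrapped task sees the caller's id. Spans and log lines produced on that pool thread behave the same way.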

What — follow one request end to end

Get a trace id: from a support ticket, an access log (X-Trace-Id), or a slow-request alert in your APM (Jaeger, Tempo, Honeycomb, Datadog, etc.).
Open the trace waterfall: sort spans by start time and find the longest bar on the critical path.
Read span attributes: http.route, db.statement (sanitized), messaging.destination, error, http.status_code. Exact names depend on the semantic-convention version in use; newer releases rename some of them (for example http.response.status_code).
Check for gaps: a parent HTTP client span that lasts 3s with no child in service C means either propagation failed or C is not instrumented.
Separate retries: multiple sibling HTTP client spans with the same route usually mean retries; compare their status codes and durations to see which attempt failed.
Cross-link to logs: filter logs by trace_id (the same value is in MDC) to get one narrative across services.
Compare a normal and a slow trace: same endpoint, different trace ids. Which span is extra, and which is longer?
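The last step, comparing a normal and a slow trace, can be sketched as a small diff over span durations. This is an illustration only: the TraceDiff class, the Delta record, and the idea of keying spans by name are assumptions, and a real trace with retries would have several spans under one name that need aggregating first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class TraceDiff {

    /** One comparison row; normalMs is 0 when the span exists only in the slow trace. */
    record Delta(String span, long normalMs, long slowMs) {
        long extraMs() {
            return slowMs - normalMs;
        }
    }

    /**
     * Compares span durations (milliseconds, keyed by span name) of a slow trace
     * against a normal trace of the same endpoint, largest increase first.
     */
    static List<Delta> compare(Map<String, Long> normal, Map<String, Long> slow) {
        List<Delta> deltas = new ArrayList<>();
        for (Map.Entry<String, Long> e : slow.entrySet()) {
            long baseline = normal.getOrDefault(e.getKey(), 0L);
            deltas.add(new Delta(e.getKey(), baseline, e.getValue()));
        }
        deltas.sort(Comparator.comparingLong(Delta::extraMs).reversed());
        return deltas;
    }
}
```

Feed it the span durations of both traces and read the top rows. Parent spans enclose their children, so they grow as well; the innermost span with a large increase is the one to inspect, and a row with normalMs of 0 is a span that only the slow trace has.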

Propagation (W3C Trace Context)

Standard HTTP header:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             │  │                                │                └─ flags (01 = sampled)
             │  │                                └─ parent span_id (16 hex)
             │  └─ trace_id (32 hex)
             └─ version

tracestate: vendor-specific hints (optional)

Every outbound call must inject context; every inbound call extracts and creates a child span. Same for Kafka record headers.
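To make extract and inject concrete, here is a minimal JDK-only sketch of handling a traceparent value. The TraceParent record and its method names are invented for this example; it accepts only version 00 and ignores tracestate, so treat it as a teaching aid and let the OpenTelemetry propagators do this in real services.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.regex.Pattern;

/** Minimal model of a version-00 W3C traceparent header value. */
record TraceParent(String traceId, String parentSpanId, boolean sampled) {

    // version 00, 32 hex trace id, 16 hex parent span id, 2 hex flags (lowercase)
    private static final Pattern FORMAT =
            Pattern.compile("00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}");

    /** Extract: parse an inbound header value; throws if it is malformed. */
    static TraceParent parse(String header) {
        if (header == null || !FORMAT.matcher(header).matches()) {
            throw new IllegalArgumentException("malformed traceparent: " + header);
        }
        String[] parts = header.split("-");
        // All-zero ids are invalid.
        if (parts[1].equals("0".repeat(32)) || parts[2].equals("0".repeat(16))) {
            throw new IllegalArgumentException("all-zero id in traceparent: " + header);
        }
        boolean sampledFlag = (Integer.parseInt(parts[3], 16) & 0x01) != 0;
        return new TraceParent(parts[1], parts[2], sampledFlag);
    }

    /** Inject: header value for an outbound call made by the span with the given id. */
    String toHeader(String currentSpanId) {
        return "00-" + traceId + "-" + currentSpanId + "-" + (sampled ? "01" : "00");
    }

    /** Random non-zero 64-bit span id as 16 lowercase hex characters. */
    static String newSpanId() {
        long id;
        do {
            id = ThreadLocalRandom.current().nextLong();
        } while (id == 0);
        return String.format("%016x", id);
    }
}
```

A service would call parse on the inbound header, create its own span id with newSpanId, and send toHeader(thatSpanId) downstream. The trace id never changes along the way; only the parent span id does.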

What a healthy trace looks like (checkout)

[gateway] POST /checkout          8200ms total
  └─ [order-svc] createOrder      8100ms
       ├─ [order-svc] INSERT …     45ms
       ├─ [order-svc] HTTP inventory  120ms
       │    └─ [inventory] reserve      110ms
       └─ [order-svc] publish Kafka     8ms
            └─ [payment-consumer] charge    7800ms  ← bottleneck
                 └─ [payment] POST /charge   7790ms

Tools (pick one backend)

OpenTelemetry — vendor-neutral SDK + collector → Jaeger, Tempo, Zipkin, cloud APM.
Java — OpenTelemetry Java agent (auto-instrument JDBC, HTTP); or Spring Boot 3 + Micrometer Tracing.
Query — by trace id, service, duration, error flag.

How — instrument and operate tracing in production

Minimum viable platform

Adopt OpenTelemetry for all new services.
Gateway injects or accepts traceparent; forwards to origins.
Log pattern includes trace_id and span_id (MDC).
Collector exports to trace store; retention 7–14 days (cost vs debuggability).
Runbook: “get trace id from log → open UI.”
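The runbook step can be scripted. The sketch below assumes a log layout containing trace_id=<32 hex> and a Jaeger-style UI path of /trace/<id>; both are assumptions to adapt to your own log pattern and backend, and the TraceIdFromLog class is hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TraceIdFromLog {

    // Assumed log layout: "... trace_id=<32 hex> span_id=<16 hex> ..."
    private static final Pattern TRACE_ID =
            Pattern.compile("\\btrace_id=([0-9a-f]{32})\\b");

    /** Pulls the trace id out of one log line, if the line carries one. */
    static Optional<String> traceId(String logLine) {
        Matcher m = TRACE_ID.matcher(logLine);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Builds a UI link; the "/trace/<id>" path is an assumption (Jaeger-style). */
    static Optional<String> traceUrl(String uiBaseUrl, String logLine) {
        return traceId(logLine).map(id -> uiBaseUrl + "/trace/" + id);
    }
}
```

An empty result is itself a finding: a log line without a trace id usually means the log pattern is missing the MDC keys or the code ran outside any span.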

Java / Spring (conceptual)

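// Auto: attach the OTel Java agent with -javaagent:opentelemetry-javaagent.jar
//   OTEL_SERVICE_NAME=order-svc
//   OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
//   OTEL_EXPORTER_OTLP_PROTOCOL=grpc   (4317 is the OTLP gRPC port; http/protobuf uses 4318)

// Manual span for a business step
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

Tracer tracer = GlobalOpenTelemetry.getTracer("order-svc");
Span span = tracer.spanBuilder("applyDiscount").startSpan();
try (Scope scope = span.makeCurrent()) {
  // work; spans started in here become children of applyDiscount
} catch (RuntimeException e) {
  span.recordException(e);          // keep the failure visible in the waterfall
  span.setStatus(StatusCode.ERROR);
  throw e;
} finally {
  span.end();
}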

Kafka propagation

Producer: inject traceparent into record headers.
Consumer: extract context before processing; child span linked to producer span.
Consumer lag shows as gap between publish span end and consume span start.
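The three points above can be sketched without a Kafka client by modelling record headers as a map from key to bytes, which mirrors how Kafka stores header values. KafkaTraceHeaders and ParentContext are hypothetical names, and the sampled flag is fixed for brevity; instrumented clients do the equivalent work for you.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;

final class KafkaTraceHeaders {
    static final String KEY = "traceparent";

    /** Trace id and parent span id recovered from a record. */
    record ParentContext(String traceId, String parentSpanId) {}

    /** Producer side: write the publish span's identity into the record headers. */
    static void inject(Map<String, byte[]> headers, String traceId, String publishSpanId) {
        // Flags are fixed to 01 (sampled) to keep the sketch short.
        String value = "00-" + traceId + "-" + publishSpanId + "-01";
        headers.put(KEY, value.getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Consumer side: read the context before processing. An empty result is the
     * case where a consumer would start a new trace instead of continuing one.
     */
    static Optional<ParentContext> extract(Map<String, byte[]> headers) {
        byte[] raw = headers.get(KEY);
        if (raw == null) {
            return Optional.empty();
        }
        String[] parts = new String(raw, StandardCharsets.UTF_8).split("-");
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16) {
            return Optional.empty();
        }
        return Optional.of(new ParentContext(parts[1], parts[2]));
    }

    /** Consumer lag: gap between publish span end and consume span start. */
    static long lagMillis(long publishEndEpochMs, long consumeStartEpochMs) {
        return consumeStartEpochMs - publishEndEpochMs;
    }
}
```

The lag figure compares timestamps taken on two different hosts, so clock skew affects it directly; treat small or negative values with suspicion.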

Sampling strategy

Strategy	Tradeoff
Head-based (decide at start)	Cheap; may drop the one slow trace you need
Tail-based (keep errors/slow)	Better for incidents; needs collector support
Always sample staging	Full fidelity for debugging
100% on canary	Short window during risky deploy
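The difference between the first two rows comes down to when the decision is made and what it can see. The sketch below is a simplified illustration, not the OpenTelemetry sampler or collector implementation; SamplingSketch, the thresholds, and the parameter names are assumptions.

```java
final class SamplingSketch {

    /** Maps the low 64 bits of a 32-hex trace id to a number in [0, 1). */
    static double uniform(String traceIdHex) {
        long low = Long.parseUnsignedLong(traceIdHex.substring(16), 16);
        return (low >>> 11) / (double) (1L << 53);
    }

    /**
     * Head-based: decided from the trace id alone, before any work has run.
     * Deriving the decision from the id means every service reaches the same verdict.
     */
    static boolean headSample(String traceIdHex, double ratio) {
        return uniform(traceIdHex) < ratio;
    }

    /**
     * Tail-based: decided once the trace is complete, so errors and slow
     * requests can always be kept; the rest falls back to a baseline ratio.
     */
    static boolean tailKeep(boolean hasError, long rootDurationMs,
                            long slowThresholdMs, String traceIdHex, double baselineRatio) {
        return hasError
                || rootDurationMs >= slowThresholdMs
                || headSample(traceIdHex, baselineRatio);
    }
}
```

With a 10% head ratio the 8-second checkout is kept only if its trace id happens to fall in the sampled range; with the tail policy and, say, a 2000 ms threshold it is always kept, at the price of buffering whole traces in the collector until they finish.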

Custom spans worth adding

Feature-flag evaluation, cache get/put, idempotency check, fraud rules engine.
Anything not visible in auto HTTP/DB spans but on the critical path.

Verify instrumentation

Single test request: trace shows all expected services, no orphans.
Log line trace id matches UI trace id.
Kafka path: producer and consumer share trace id.
Load test: collector and backend handle span volume; alerts on export failures.
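The first check in this list lends itself to automation once spans can be fetched from the backend. A minimal sketch, assuming the spans of one trace have already been loaded into memory; TraceCheck and its Span record are hypothetical and not tied to any backend's API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class TraceCheck {

    /** parentSpanId is null for the root span. */
    record Span(String spanId, String parentSpanId, String service) {}

    /** Orphans: spans that name a parent which is not part of the trace. */
    static List<Span> orphans(List<Span> trace) {
        Set<String> ids = new HashSet<>();
        for (Span s : trace) {
            ids.add(s.spanId());
        }
        List<Span> result = new ArrayList<>();
        for (Span s : trace) {
            if (s.parentSpanId() != null && !ids.contains(s.parentSpanId())) {
                result.add(s);
            }
        }
        return result;
    }

    /** Expected services that contributed no span to this trace. */
    static Set<String> missingServices(List<Span> trace, Set<String> expected) {
        Set<String> missing = new TreeSet<>(expected);
        for (Span s : trace) {
            missing.remove(s.service());
        }
        return missing;
    }
}
```

The two checks catch different failures. An orphan usually means the parent span was dropped or never exported; a missing service usually means propagation broke and that service started a trace of its own under a different trace id.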

Interview one-liner

“I use one trace id propagated with W3C traceparent through HTTP and Kafka, auto-instrument JDBC and clients with OpenTelemetry, and read the waterfall to find the longest span on the critical path—then jump to logs with the same trace id.”