Trace one request across the whole system
Scenario
A checkout takes 8 seconds. Logs exist in six services, Kafka, and the payment API—but no one line tells the story. You need a single trace: gateway → order service → inventory → DB → Kafka consumer → partner HTTP, with timing on each hop, so you can see where time went and why retries look like different requests.
After reading, you should be able to:
- Explain trace, span, parent/child, and W3C traceparent propagation.
- Follow one trace_id through HTTP, messaging, and JDBC spans.
- Spot broken propagation (orphan spans, missing downstream segments).
- Instrument Java services with OpenTelemetry and tie traces to logs.
Why — logs alone do not show the critical path
In microservices, one user action fans out across processes and machines.
Each service logs independently; timestamps and thread names do not line up across hosts.
A distributed trace is one tree of spans sharing a trace_id, each span recording an operation (HTTP call, DB query, message publish) with start time and duration.
The waterfall view answers: which hop consumed the 8 seconds?
Core concepts
| Term | Meaning |
|---|---|
| Trace | End-to-end journey; one trace_id (128 bits, written as 32 hex characters) |
| Span | One unit of work; has span_id, parent, service name, attributes |
| Root span | Entry point (e.g. gateway received request) |
| Child span | Work triggered by parent (DB call inside service B) |
| Context propagation | Pass trace id + parent span id to the next hop |
Where traces break in real systems
- Header not forwarded (gateway strips traceparent).
- Async thread without context attach — logs lose correlation.
- Kafka consumer starts a new trace instead of continuing the producer trace.
- Partner API black box — no child span unless you wrap the client.
- Heavy sampling drops the slow trace you need.
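- Clock skew between hosts (less of a problem than with logs, because each span's duration is measured on one host; skew mostly shifts where spans sit on the waterfall).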
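The "async thread without context attach" failure is easy to reproduce in plain Java. The sketch below assumes the trace id lives in a ThreadLocal, which is how MDC behaves; `AsyncContext` and its `wrap` helper are hypothetical names for illustration, not an OpenTelemetry API. With the OTel SDK, `Context.current().wrap(...)` or `Context.taskWrapping(executor)` fills the same role.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical stand-in for MDC or the OTel Context: one trace id per thread.
final class AsyncContext {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    static void set(String traceId) { TRACE_ID.set(traceId); }
    static String get() { return TRACE_ID.get(); }
    static void clear() { TRACE_ID.remove(); }

    // Capture the caller's trace id now; restore it on whichever thread runs the task.
    static <T> Supplier<T> wrap(Supplier<T> task) {
        String captured = get();
        return () -> {
            String previous = get();
            set(captured);
            try {
                return task.get();
            } finally {
                if (previous == null) clear(); else set(previous);
            }
        };
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            set("4bf92f3577b34da6a3ce929d0e0e4736");
            Supplier<String> readTraceId = AsyncContext::get;

            // The pool thread has its own (empty) ThreadLocal slot.
            String lost = CompletableFuture.supplyAsync(readTraceId, pool).join();
            String kept = CompletableFuture.supplyAsync(wrap(readTraceId), pool).join();

            System.out.println("plain task sees:   " + lost);
            System.out.println("wrapped task sees: " + kept);
        } finally {
            pool.shutdown();
            clear();
        }
    }
}
```

The plain task prints null because the worker thread never received the context; the wrapped task prints the caller's trace id.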
What — follow one request end to end
- Get a trace id: from a support ticket, an access log (X-Trace-Id), or a slow-request alert in APM (Jaeger, Tempo, Honeycomb, Datadog, etc.).
- Open the trace waterfall: sort spans by start time; find the longest bar (critical path).
- Read span attributes: http.route, db.statement (sanitized), messaging.destination, error, http.status_code.
- Check for gaps: a parent HTTP client span is 3s long but has no child in service C → propagation failure, or C is not instrumented.
- Separate retries: multiple sibling HTTP spans with the same route often mean retries; tie them to inconsistent outcomes.
- Cross-link to logs: filter logs by trace_id (same value in MDC) to get one narrative across services.
- Compare a normal and a slow trace: same endpoint, different trace ids. Which span is extra or longer?
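The gap and retry checks from the steps above can be sketched over a flat list of spans. This assumes a simplified span row (`SpanRow`, a hypothetical shape, not any backend's export format) and made-up sample data; real traces carry more fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TraceChecks {
    // Hypothetical flattened span row; "client" marks an outbound HTTP call.
    record SpanRow(String spanId, String parentId, String service, String name,
                   boolean client, long startMs, long durationMs) {}

    // Client spans at least minMs long with no child span at all:
    // the callee dropped the context or is not instrumented.
    static List<SpanRow> gaps(List<SpanRow> spans, long minMs) {
        List<SpanRow> result = new ArrayList<>();
        for (SpanRow s : spans) {
            if (!s.client() || s.durationMs() < minMs) continue;
            boolean hasChild = spans.stream().anyMatch(c -> s.spanId().equals(c.parentId()));
            if (!hasChild) result.add(s);
        }
        return result;
    }

    // Sibling client spans with the same parent and name: likely retries of one call.
    static Map<String, List<SpanRow>> retries(List<SpanRow> spans) {
        Map<String, List<SpanRow>> groups = new LinkedHashMap<>();
        for (SpanRow s : spans) {
            if (!s.client()) continue;
            groups.computeIfAbsent(s.parentId() + "|" + s.name(), k -> new ArrayList<>()).add(s);
        }
        groups.values().removeIf(g -> g.size() < 2);
        return groups;
    }

    // Made-up trace: the inventory call has no downstream span, the payment call ran twice.
    static List<SpanRow> sample() {
        return List.of(
            new SpanRow("a", null, "gateway", "POST /checkout", false, 0, 8200),
            new SpanRow("b", "a", "order-svc", "createOrder", false, 50, 8100),
            new SpanRow("c", "b", "order-svc", "HTTP inventory", true, 100, 3000),
            new SpanRow("d", "b", "order-svc", "HTTP payment", true, 3200, 1000),
            new SpanRow("e", "b", "order-svc", "HTTP payment", true, 4300, 1000),
            new SpanRow("f", "d", "payment", "POST /charge", false, 3210, 980),
            new SpanRow("g", "e", "payment", "POST /charge", false, 4310, 980));
    }

    public static void main(String[] args) {
        List<SpanRow> spans = sample();
        gaps(spans, 1000).forEach(s -> System.out.println("gap after: " + s.name()));
        retries(spans).forEach((key, group) ->
            System.out.println("possible retries: " + key + " x" + group.size()));
    }
}
```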
Propagation (W3C Trace Context)
Standard HTTP header:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             │  │                                │                └─ flags (01 = sampled)
             │  │                                └─ parent span_id (16 hex)
             │  └─ trace_id (32 hex)
             └─ version
tracestate: vendor-specific hints (optional)
Every outbound call must inject context; every inbound call extracts and creates a child span. Same for Kafka record headers.
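To make the header layout concrete, here is a minimal sketch of extract and inject for a version-00 traceparent value. `TraceParent` and its methods are illustrative names, not the OpenTelemetry propagator API, and the span model is simplified: a real service would create a server span first and send the id of its outbound client span.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Version-00 traceparent value: version-traceid-parentid-flags, lowercase hex.
record TraceParent(String version, String traceId, String parentSpanId, String flags) {
    private static final Pattern FORMAT = Pattern.compile(
        "([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    // Returns null for a header the receiver should ignore (and then start a new trace).
    static TraceParent parse(String header) {
        if (header == null) return null;
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) return null;
        boolean zeroTrace = m.group(2).chars().allMatch(c -> c == '0');
        boolean zeroSpan = m.group(3).chars().allMatch(c -> c == '0');
        // Version ff and all-zero ids are invalid in W3C Trace Context.
        if (m.group(1).equals("ff") || zeroTrace || zeroSpan) return null;
        return new TraceParent(m.group(1), m.group(2), m.group(3), m.group(4));
    }

    // Lowest flag bit is the "sampled" flag.
    boolean sampled() {
        return (Integer.parseInt(flags, 16) & 0x01) != 0;
    }

    // Same trace id and flags, fresh span id: the value to send on the next outbound call.
    TraceParent withNewSpanId() {
        long id;
        do {
            id = ThreadLocalRandom.current().nextLong();
        } while (id == 0);
        return new TraceParent(version, traceId, String.format("%016x", id), flags);
    }

    String toHeader() {
        return version + "-" + traceId + "-" + parentSpanId + "-" + flags;
    }

    public static void main(String[] args) {
        TraceParent incoming = parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        TraceParent outgoing = incoming.withNewSpanId();
        System.out.println("trace_id: " + incoming.traceId() + " sampled: " + incoming.sampled());
        System.out.println("outbound traceparent: " + outgoing.toHeader());
    }
}
```

The trace id is identical on both sides of the hop; only the span id changes, which is what lets the backend stitch parent and child together.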
What a healthy trace looks like (checkout)
[gateway] POST /checkout 8200ms total
└─ [order-svc] createOrder 8100ms
├─ [order-svc] INSERT … 45ms
├─ [order-svc] HTTP inventory 120ms
│ └─ [inventory] reserve 110ms
├─ [order-svc] publish Kafka 8ms
└─ [payment-consumer] charge 7800ms ← bottleneck
└─ [payment] POST /charge 7790ms
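One way to read the waterfall above programmatically is to compute each span's self time (its duration minus its children's durations) and pick the largest. This is a sketch that assumes children run one after another inside their parent, as the diagram draws them; `CheckoutWaterfall` and `SpanNode` are hypothetical names, and the durations are copied from the example.

```java
import java.util.List;

final class CheckoutWaterfall {
    record SpanNode(String label, long durationMs, List<SpanNode> children) {
        // Time spent in this span itself, assuming children do not overlap.
        long selfTimeMs() {
            long inChildren = children.stream().mapToLong(SpanNode::durationMs).sum();
            return Math.max(0L, durationMs - inChildren);
        }

        // Span with the largest self time in this subtree.
        SpanNode hottest() {
            SpanNode best = this;
            for (SpanNode child : children) {
                SpanNode candidate = child.hottest();
                if (candidate.selfTimeMs() > best.selfTimeMs()) best = candidate;
            }
            return best;
        }
    }

    static SpanNode span(String label, long ms, SpanNode... children) {
        return new SpanNode(label, ms, List.of(children));
    }

    // The checkout trace from the diagram.
    static SpanNode checkout() {
        return span("[gateway] POST /checkout", 8200,
            span("[order-svc] createOrder", 8100,
                span("[order-svc] INSERT", 45),
                span("[order-svc] HTTP inventory", 120,
                    span("[inventory] reserve", 110)),
                span("[order-svc] publish Kafka", 8),
                span("[payment-consumer] charge", 7800,
                    span("[payment] POST /charge", 7790))));
    }

    public static void main(String[] args) {
        SpanNode hot = checkout().hottest();
        // prints: [payment] POST /charge self time 7790ms
        System.out.println(hot.label() + " self time " + hot.selfTimeMs() + "ms");
    }
}
```

Self time matters because the gateway and createOrder bars are just as long as the payment bar, yet they spend almost all of that time waiting.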
Tools (pick one backend)
- OpenTelemetry — vendor-neutral SDK + collector → Jaeger, Tempo, Zipkin, cloud APM.
- Java — OpenTelemetry Java agent (auto-instrument JDBC, HTTP); or Spring Boot 3 + Micrometer Tracing.
- Query — by trace id, service, duration, error flag.
How — instrument and operate tracing in production
Minimum viable platform
- Adopt OpenTelemetry for all new services.
- Gateway injects or accepts traceparent; forwards to origins.
- Log pattern includes trace_id and span_id (MDC).
- Collector exports to trace store; retention 7–14 days (cost vs debuggability).
- Runbook: “get trace id from log → open UI.”
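Once every log line carries trace_id and span_id, stitching one request's story together is a filter and a sort. The sketch below assumes a made-up line layout (`<ISO-8601 instant> <service> trace_id=… span_id=… <message>`); your log pattern will differ, and `LogCorrelation` is a hypothetical helper, not part of any logging library.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogCorrelation {
    // Assumed layout: <instant> <service> trace_id=<32 hex> span_id=<16 hex> <message>
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) (\\S+) trace_id=([0-9a-f]{32}) span_id=([0-9a-f]{16}) (.*)$");

    record Entry(Instant at, String service, String traceId, String spanId, String message) {}

    // Returns null for lines that carry no trace context.
    static Entry parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) return null;
        return new Entry(Instant.parse(m.group(1)), m.group(2), m.group(3), m.group(4), m.group(5));
    }

    // All lines of one trace, across services, in time order.
    static List<Entry> narrative(List<String> lines, String traceId) {
        return lines.stream()
            .map(LogCorrelation::parse)
            .filter(e -> e != null && e.traceId().equals(traceId))
            .sorted(Comparator.comparing(Entry::at))
            .toList();
    }

    // Made-up log lines from three services, deliberately out of order.
    static List<String> sampleLines() {
        String t = "4bf92f3577b34da6a3ce929d0e0e4736";
        return List.of(
            "2024-05-01T10:00:00.120Z order-svc trace_id=" + t + " span_id=00f067aa0ba902b7 createOrder started",
            "2024-05-01T10:00:00.010Z gateway trace_id=" + t + " span_id=1111111111111111 POST /checkout received",
            "2024-05-01T10:00:00.050Z inventory trace_id=aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa span_id=2222222222222222 reserve",
            "2024-05-01T10:00:08.200Z gateway trace_id=" + t + " span_id=1111111111111111 POST /checkout done",
            "line without trace context");
    }

    public static void main(String[] args) {
        for (Entry e : narrative(sampleLines(), "4bf92f3577b34da6a3ce929d0e0e4736")) {
            System.out.println(e.at() + " " + e.service() + " " + e.message());
        }
    }
}
```

Sorting by timestamp across hosts is only as good as the hosts' clocks, so treat the order of lines a few milliseconds apart as approximate.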
Java / Spring (conceptual)
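// Auto: OTel Java agent, attached with -javaagent:opentelemetry-javaagent.jar
// OTEL_SERVICE_NAME=order-svc
// OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
// OTEL_EXPORTER_OTLP_PROTOCOL=grpc   (4317 is the OTLP gRPC port; http/protobuf normally uses 4318)
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

// Manual span for a business step; it becomes a child of whatever span is current
Tracer tracer = GlobalOpenTelemetry.getTracer("order-svc");
Span span = tracer.spanBuilder("applyDiscount").startSpan();
try (Scope scope = span.makeCurrent()) {
    // work
} finally {
    span.end(); // always end the span, or it never reaches the exporter
}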
Kafka propagation
- Producer: inject traceparent into record headers.
- Consumer: extract context before processing; child span linked to producer span.
- Consumer lag shows as gap between publish span end and consume span start.
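The producer and consumer halves can be sketched without a broker. The example uses a plain `Map<String, byte[]>` as a stand-in for Kafka record headers (header values are bytes in Kafka); `KafkaTraceHeaders` and its methods are hypothetical helpers, and with the OTel agent the Kafka client instrumentation does this for you.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

final class KafkaTraceHeaders {
    static final String KEY = "traceparent";

    // Producer side: write the current context into the record headers.
    static void inject(Map<String, byte[]> headers, String traceId, String spanId) {
        String value = "00-" + traceId + "-" + spanId + "-01";
        headers.put(KEY, value.getBytes(StandardCharsets.UTF_8));
    }

    // Consumer side: the producer's trace id, or null if the header is missing or malformed.
    static String extractTraceId(Map<String, byte[]> headers) {
        byte[] raw = headers.get(KEY);
        if (raw == null) return null;
        String[] parts = new String(raw, StandardCharsets.UTF_8).split("-");
        boolean valid = parts.length == 4 && parts[1].matches("[0-9a-f]{32}");
        return valid ? parts[1] : null;
    }

    // Continue the producer's trace when possible; otherwise a new trace starts here,
    // which is exactly the "consumer starts a new trace" failure mode.
    static String continueOrStart(Map<String, byte[]> headers) {
        String fromProducer = extractTraceId(headers);
        return fromProducer != null ? fromProducer : UUID.randomUUID().toString().replace("-", "");
    }

    // Consumer lag as seen in the trace: gap between publish end and consume start.
    static long lagMs(long publishEndMs, long consumeStartMs) {
        return Math.max(0L, consumeStartMs - publishEndMs);
    }

    public static void main(String[] args) {
        Map<String, byte[]> headers = new HashMap<>();
        inject(headers, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7");
        System.out.println("with header:    " + continueOrStart(headers));
        System.out.println("without header: " + continueOrStart(new HashMap<>()));
        System.out.println("lag: " + lagMs(1_000, 1_450) + "ms");
    }
}
```

The two timestamps in `lagMs` come from different hosts, so a small or negative gap may be clock skew, not a fast consumer.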
Sampling strategy
| Strategy | Tradeoff |
|---|---|
| Head-based (decide at start) | Cheap; may drop the one slow trace you need |
| Tail-based (keep errors/slow) | Better for incidents; needs collector support |
| Always sample staging | Full fidelity for debugging |
| 100% on canary | Short window during risky deploy |
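The head vs tail tradeoff in the table can be shown with two small decision functions. This is a sketch, not a collector configuration: `headSample` derives a deterministic yes/no from the trace id so every service agrees, and `tailKeep` decides after the trace has finished. Both names and the thresholds are assumptions for illustration.

```java
final class SamplingSketch {
    // Head-based: decide at the start, from the trace id alone.
    // Uses the low 64 bits of the id as a pseudo-random number in [0, 2^63).
    static boolean headSample(String traceId, double ratio) {
        if (ratio >= 1.0) return true;
        if (ratio <= 0.0) return false;
        long low = Long.parseUnsignedLong(traceId.substring(16), 16);
        long nonNegative = low & Long.MAX_VALUE; // drop the sign bit
        return nonNegative < (long) (ratio * Long.MAX_VALUE);
    }

    // Tail-based: decide once the whole trace is known; keep errors and slow requests.
    static boolean tailKeep(boolean hasError, long rootDurationMs, long slowThresholdMs) {
        return hasError || rootDurationMs >= slowThresholdMs;
    }

    public static void main(String[] args) {
        String slowCheckout = "4bf92f3577b34da6a3ce929d0e0e4736";
        // The 8.2s checkout is dropped by a 10% head sampler but kept by the tail rule.
        System.out.println("head 10%: " + headSample(slowCheckout, 0.10));
        System.out.println("tail (slow >= 2000ms): " + tailKeep(false, 8200, 2000));
    }
}
```

For this particular trace id the head sampler at 10% answers false, so the slow trace would never reach the backend; the tail rule keeps it because its duration is known by then.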
Custom spans worth adding
- Feature-flag evaluation, cache get/put, idempotency check, fraud rules engine.
- Anything not visible in auto HTTP/DB spans but on the critical path.
Verify instrumentation
- Single test request: trace shows all expected services, no orphans.
- Log line trace id matches UI trace id.
- Kafka path: producer and consumer share trace id.
- Load test: collector and backend handle span volume; alerts on export failures.
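The first three checks in this list can be automated against the spans of one test request. The sketch assumes you can fetch those spans from your backend as simple rows; `TraceVerifier` and `ExportedSpan` are hypothetical names, and the ids in the sample are made up.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class TraceVerifier {
    record ExportedSpan(String traceId, String spanId, String parentId, String service) {}

    // Returns a list of human-readable problems; empty means the trace looks healthy.
    static List<String> problems(List<ExportedSpan> spans, Set<String> expectedServices) {
        List<String> problems = new ArrayList<>();
        Set<String> traceIds = new TreeSet<>();
        Set<String> services = new TreeSet<>();
        Set<String> spanIds = new HashSet<>();
        for (ExportedSpan s : spans) {
            traceIds.add(s.traceId());
            services.add(s.service());
            spanIds.add(s.spanId());
        }
        if (traceIds.size() != 1) problems.add("expected 1 trace id, found " + traceIds.size());
        for (String svc : new TreeSet<>(expectedServices)) {
            if (!services.contains(svc)) problems.add("missing service: " + svc);
        }
        int roots = 0;
        for (ExportedSpan s : spans) {
            if (s.parentId() == null) roots++;
            else if (!spanIds.contains(s.parentId())) problems.add("orphan span: " + s.spanId());
        }
        if (roots != 1) problems.add("expected 1 root span, found " + roots);
        return problems;
    }

    public static void main(String[] args) {
        Set<String> expected = Set.of("gateway", "order-svc", "payment-consumer");
        List<ExportedSpan> broken = List.of(
            new ExportedSpan("t1", "g1", null, "gateway"),
            new ExportedSpan("t1", "o1", "g1", "order-svc"),
            // Consumer did not extract the Kafka header and started its own trace.
            new ExportedSpan("t2", "p1", null, "payment-consumer"));
        problems(broken, expected).forEach(System.out::println);
    }
}
```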
Interview one-liner
“I use one trace id propagated with W3C traceparent through HTTP and Kafka, auto-instrument JDBC and clients with OpenTelemetry, and read the waterfall to find the longest span on the critical path—then jump to logs with the same trace id.”