Support cannot correlate complaints to logs
Scenario
A user reports “payment failed at 2:15pm.” Support searches logs by email and sees thousands of lines across twelve services—none clearly tied to that click. Engineering greps user id and guesses. Incidents drag on because the platform never standardized one id per request in headers, logs, and responses. You need a platform-level fix, not another ad-hoc field in one service.
After reading, you should be able to:
- Define correlation id vs request id vs trace id and when they align.
- List what to add at gateway, services, messaging, and log pipeline.
- Implement propagation in Java (MDC, filters, async).
- Give support a reliable lookup path and engineering indexed search.
Why — without a shared key, every team invents one
A correlation id is an opaque identifier that labels one logical attempt through the system. It lets humans and tools stitch HTTP access logs, application logs, audit events, and traces into one story. Without it, you rely on fuzzy matches (user id + timestamp) that collide under load, so lines from unrelated requests get merged and the result looks like one inconsistent request.
IDs you will see (align the platform)
| ID | Scope | Typical source |
|---|---|---|
| Correlation id | Business/support-facing; may equal trace id | Gateway generates or accepts from client |
| Request id | One HTTP request attempt | Per inbound call; new id on retry |
| Trace id (OpenTelemetry) | Full distributed trace tree | Tracer; W3C traceparent |
| Span id | One operation inside a service | Auto from tracer |
Practical standard: expose one id to support (e.g. correlation_id) that matches trace_id when tracing is enabled; always log both request id and trace id for engineering.
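The "correlation id matches trace id when tracing is enabled" rule can be sketched as below. This is a minimal illustration, not a replacement for an OpenTelemetry SDK: it only understands version `00` of the W3C `traceparent` header (`00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>`), and the class name `TraceParent` and the fallback policy in `correlationId` are assumptions made for this example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class TraceParent {
    // Version 00 layout: version, 32-hex trace id, 16-hex parent id, 2-hex flags.
    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    private TraceParent() {}

    /** Returns the trace id, or empty when the header is absent or malformed. */
    public static Optional<String> traceId(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        String traceId = m.group(1);
        // An all-zero trace id is defined as invalid by the W3C spec.
        if (traceId.chars().allMatch(c -> c == '0')) {
            return Optional.empty();
        }
        return Optional.of(traceId);
    }

    /** Hypothetical policy: use the trace id when present, else the supplied fallback id. */
    public static String correlationId(String traceparentHeader, String fallback) {
        return traceId(traceparentHeader).orElse(fallback);
    }
}
```

With this policy, support and engineering search on the same value whenever a trace exists, and the fallback (for example a gateway-generated UUID) covers untraced traffic.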
What breaks today
- Gateway does not generate or forward ids.
- Services log only `userId`: a high-cardinality grep, not unique per click.
- Async workers and Kafka consumers start “fresh” with no parent id.
- Mobile app does not display the id returned by API.
- Log platform does not index `trace_id`, so searches time out.
What — add to the platform (checklist)
- Pick standard header names (document in API guide)
  - `X-Request-ID`: per HTTP attempt (UUID)
  - `X-Correlation-ID`: end-user/support reference (may span related calls)
  - `traceparent`: W3C trace propagation (see the tracing guide)
- Edge / API gateway — generate ids, attach to upstream request, return in response header and error JSON body.
- Every service inbound filter: extract headers → MDC / OpenTelemetry context; do not reject requests that arrive without ids (generate them instead), and never drop an inbound id silently.
- Outbound HTTP & messaging — inject same headers on RestTemplate/WebClient/Feign; Kafka record headers.
- Structured logs (required fields): `trace_id`, `span_id`, `request_id`, `correlation_id`, `service`, `pod`, `user_id` (if auth), `tenant_id`, `http.route`, `outcome`
- Log indexer (Loki/ELK/Datadog): index `trace_id` and `correlation_id` as keyword fields.
- Return id to clients: response header + field in error payload so support can ask the user to copy it.
- Support console or runbook — “Search logs where correlation_id = …” and “Open trace in Jaeger.”
- Fallback when id missing (legacy) — narrow time window + user id + endpoint; acknowledge ambiguity.
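The outbound-propagation item in the checklist can be sketched with the JDK's own `java.net.http` client, so the idea is visible without any framework. The class `OutboundCorrelation`, the method `withIds`, and the `ids` map (a stand-in for whatever holds the request context, such as the MDC) are hypothetical names for this example. In a Spring service the same mapping would typically live in a RestTemplate/WebClient/Feign interceptor, and for Kafka in record headers.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Map;

public final class OutboundCorrelation {
    private OutboundCorrelation() {}

    /**
     * Builds an outbound GET request that carries the current ids.
     * Keys in {@code ids} follow the structured-log field names used in this guide.
     */
    public static HttpRequest withIds(URI target, Map<String, String> ids) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(target).GET();
        copy(ids, "correlation_id", "X-Correlation-ID", builder);
        copy(ids, "request_id", "X-Request-ID", builder);
        copy(ids, "traceparent", "traceparent", builder);
        return builder.build();
    }

    // Adds the header only when a non-blank value exists, so no empty ids are sent.
    private static void copy(Map<String, String> ids, String key, String header,
                             HttpRequest.Builder builder) {
        String value = ids.get(key);
        if (value != null && !value.isBlank()) {
            builder.header(header, value);
        }
    }
}
```

Skipping blank values matters: an empty `X-Correlation-ID` header would otherwise be accepted downstream as a real id and defeat the lookup.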
Support workflow (target state)
- User provides correlation id from app error screen or email receipt.
- Support searches log platform → all services, one timeline.
- Escalation opens trace waterfall with same id.
How — implement in Java and roll out
Inbound filter (Spring sketch)
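import jakarta.servlet.*;            // javax.servlet.* on Spring Boot 2.x
import jakarta.servlet.http.*;
import java.io.IOException;
import java.util.UUID;
import org.slf4j.MDC;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

@Component
public class CorrelationFilter extends OncePerRequestFilter {
    @Override
    protected void doFilterInternal(HttpServletRequest req, HttpServletResponse res, FilterChain chain)
            throws ServletException, IOException {
        String requestId = headerOrUuid(req, "X-Request-ID");
        String correlationId = headerOrUuid(req, "X-Correlation-ID");
        MDC.put("request_id", requestId);
        MDC.put("correlation_id", correlationId);
        res.setHeader("X-Request-ID", requestId);        // set before the chain so error responses carry the ids too
        res.setHeader("X-Correlation-ID", correlationId);
        try { chain.doFilter(req, res); }
        finally { MDC.clear(); }                         // pooled request threads must not leak ids
    }
    // Reuse the caller's header when present; otherwise mint a random UUID.
    private static String headerOrUuid(HttpServletRequest req, String name) {
        String value = req.getHeader(name);
        return (value == null || value.isBlank()) ? UUID.randomUUID().toString() : value;
    }
}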
With OpenTelemetry, prefer the tracer’s trace_id in MDC via Micrometer bridge so logs and traces always match.
Async and thread pools
- A Spring `TaskDecorator` copies MDC to worker threads.
- `@Async`, `CompletableFuture`, reactive: use context propagation libraries (Micrometer Context Propagation).
- Never log from a bare `new Thread()` without context.
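The capture-and-restore idea behind a `TaskDecorator` can be sketched in plain Java. To keep the example dependency-free, a `ThreadLocal` map stands in for SLF4J's MDC; `ContextCopy` and its methods are hypothetical names. In a real service the snapshot would come from `MDC.getCopyOfContextMap()`, the restore would use `MDC.setContextMap(...)`, and `wrap` would be the body of `TaskDecorator.decorate(Runnable)`.

```java
import java.util.HashMap;
import java.util.Map;

public final class ContextCopy {
    // Stand-in for SLF4J's MDC: a per-thread map of logging fields.
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);

    private ContextCopy() {}

    public static void put(String key, String value) { CONTEXT.get().put(key, value); }

    public static String get(String key) { return CONTEXT.get().get(key); }

    public static void clear() { CONTEXT.remove(); }

    /** Snapshots the submitting thread's context now and restores it around the task later. */
    public static Runnable wrap(Runnable task) {
        // Taken on the caller thread, at submission time.
        Map<String, String> captured = new HashMap<>(CONTEXT.get());
        return () -> {
            Map<String, String> previous = CONTEXT.get();
            CONTEXT.set(new HashMap<>(captured));
            try {
                task.run();
            } finally {
                // Pooled threads are reused; put back what was there so ids do not leak.
                CONTEXT.set(previous);
            }
        };
    }
}
```

Usage would look like `executor.execute(ContextCopy.wrap(task))`. The snapshot has to be taken when the task is submitted, not when it runs, because by then the code is already on the worker thread and the request's ids are out of reach.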
Logback pattern example
%d{ISO8601} [%thread] %-5level %logger - trace=%X{trace_id} req=%X{request_id} corr=%X{correlation_id} - %msg%n
Rollout plan
| Phase | Deliverable |
|---|---|
| 1 | Gateway + top 3 services log trace_id |
| 2 | Indexer fields + support runbook |
| 3 | All producers/consumers on Kafka |
| 4 | Client apps show correlation id on errors |
| 5 | CI lint: integration test asserts log line contains id |
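The phase 5 check can be sketched as a small helper that an integration test calls on captured log lines. It assumes the Logback pattern shown in this guide (`trace=… req=… corr=… - message`) and that a missing MDC key renders as an empty string, so an unset id shows up as `corr=` followed directly by a space. `LogLineLint` and `hasAllIds` are hypothetical names.

```java
import java.util.regex.Pattern;

public final class LogLineLint {
    // Each id must be followed by at least one non-space character.
    private static final Pattern IDS =
            Pattern.compile("trace=\\S+ req=\\S+ corr=\\S+ - ");

    private LogLineLint() {}

    /** True when the line contains non-empty trace, request, and correlation ids. */
    public static boolean hasAllIds(String logLine) {
        return logLine != null && IDS.matcher(logLine).find();
    }
}
```

A test would capture output from one request (for example through an in-memory appender) and fail the build if any application log line returns `false`.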
Privacy and security
- Use random UUIDs — do not embed email, account number, or PAN in correlation ids.
- Restrict log search RBAC; correlation id is not authentication.
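One way to enforce the "random UUIDs only" rule at the edge is to accept an inbound id only when it has the canonical UUID shape and to mint a fresh one otherwise. The sketch below shows that policy; it is an assumption for illustration, not something the platform already does, and `CorrelationIds.acceptOrMint` is a hypothetical name. If the platform standardizes on trace ids (32 hex characters, no dashes) instead, the pattern would need to change accordingly.

```java
import java.util.Locale;
import java.util.UUID;
import java.util.regex.Pattern;

public final class CorrelationIds {
    // Canonical textual UUID: 8-4-4-4-12 hex digits.
    private static final Pattern UUID_SHAPE = Pattern.compile(
            "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}");

    private CorrelationIds() {}

    /** Keeps a well-formed inbound id (lower-cased); replaces anything else with a random UUID. */
    public static String acceptOrMint(String inbound) {
        if (inbound != null && UUID_SHAPE.matcher(inbound).matches()) {
            return inbound.toLowerCase(Locale.ROOT);
        }
        // Emails, account numbers, or free text never reach the logs as an id.
        return UUID.randomUUID().toString();
    }
}
```

Rejecting free-form values also keeps client-supplied text with spaces or newlines out of log lines.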
Verify
- One API call → same id in gateway access log, service log, downstream service log.
- Support test ticket: find full path in < 2 minutes.
- Async job log includes parent request’s ids.
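The first check in the list (same id in gateway, service, and downstream logs) can be automated with a small helper. This sketch assumes every log line uses the `corr=<id>` token from the Logback pattern in this guide; `SameIdCheck` and its methods are hypothetical names.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public final class SameIdCheck {
    private static final Pattern CORR = Pattern.compile("corr=(\\S+)");

    private SameIdCheck() {}

    /** Extracts the correlation id from one log line, if it has one. */
    static Optional<String> correlationId(String line) {
        Matcher m = CORR.matcher(line);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** True when every line carries a correlation id and all of them are identical. */
    public static boolean allShareOneId(List<String> lines) {
        if (lines.isEmpty()) {
            return false;
        }
        Set<Optional<String>> distinct = lines.stream()
                .map(SameIdCheck::correlationId)
                .collect(Collectors.toSet());
        return distinct.size() == 1 && distinct.iterator().next().isPresent();
    }
}
```

Feeding it one line from each hop of a single API call gives a pass/fail answer. A `false` result means some hop either dropped the id or generated its own.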
Interview one-liner
“I generate a correlation id at the gateway, propagate it on every hop and Kafka message, put trace_id and request_id in structured logs and MDC—including async—and return the id to the client so support can search one key across all services.”