Support cannot correlate complaints to logs

Scenario

A user reports “payment failed at 2:15pm.” Support searches logs by email and sees thousands of lines across twelve services, none clearly tied to that click. Engineering greps by user id and guesses. Incidents drag on because the platform never standardized one id per request across headers, logs, and responses. You need a platform-level fix, not another ad-hoc field in one service.

After reading, you should be able to:

Define correlation id vs request id vs trace id and when they align.
List what to add at gateway, services, messaging, and log pipeline.
Implement propagation in Java (MDC, filters, async).
Give support a reliable lookup path and engineering indexed search.

Why — without a shared key, every team invents one

A correlation id is an opaque identifier that labels one logical attempt through the system. It lets humans and tools stitch HTTP access logs, application logs, audit events, and traces into one story. Without it, you rely on fuzzy matches (user id + timestamp) that collide under load: two unrelated requests from the same user in the same second get merged into one timeline, and the logs appear to contradict each other.

IDs you will see (align the platform)

ID	Scope	Typical source
Correlation id	Business/support-facing; may equal trace id	Gateway generates or accepts from client
Request id	One HTTP request attempt	Per inbound call; new id on retry
Trace id (OpenTelemetry)	Full distributed trace tree	Tracer; W3C `traceparent`
Span id	One operation inside a service	Auto from tracer

Practical standard: expose one id to support (e.g. correlation_id) that matches trace_id when tracing is enabled; always log both request id and trace id for engineering.
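
To make "correlation id matches trace id" concrete, here is a minimal sketch of pulling the trace id out of a W3C traceparent header. It assumes version 00 of the header only (four dash-separated fields: version, 32 hex trace id, 16 hex parent id, 2 hex flags); the class name TraceParent and its method are hypothetical, not part of any tracing library.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the trace id from a version-00 traceparent header. */
final class TraceParent {
    // version "00" - 32 lowercase hex trace id - 16 lowercase hex parent id - 2 hex flags
    private static final Pattern FORMAT =
        Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");
    private static final String ZERO_TRACE = "0".repeat(32);
    private static final String ZERO_PARENT = "0".repeat(16);

    private TraceParent() {}

    /** Returns the trace id, or empty if the header is missing or malformed. */
    static Optional<String> traceId(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        // All-zero trace ids and parent ids are invalid per the specification.
        if (ZERO_TRACE.equals(m.group(1)) || ZERO_PARENT.equals(m.group(2))) {
            return Optional.empty();
        }
        return Optional.of(m.group(1));
    }
}
```

A gateway could use the returned value as the correlation id when tracing is enabled and fall back to a generated id when the Optional is empty. In a service that already runs an OpenTelemetry SDK, the tracer's own context is the better source; this parser is only meant to show what the header contains.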

What breaks today

Gateway does not generate or forward ids.
Services log only userId, so a grep returns every request that user ever made, not the one click in question.
Async workers and Kafka consumers start “fresh” with no parent id.
Mobile app does not display the id returned by API.
Log platform does not index trace_id — searches time out.

What — add to the platform (checklist)

Pick standard header names (document in API guide)
- X-Request-ID — per HTTP attempt (UUID)
- X-Correlation-ID — end-user/support reference (may span related calls)
- traceparent — W3C Trace Context propagation (see the tracing guide)
Rule: if client sends id, validate format and length; else generate at edge.
Edge / API gateway — generate ids, attach to upstream request, return in response header and error JSON body.
Every service inbound filter — extract headers → MDC / OpenTelemetry context; do not reject a request because an id is missing or malformed (substitute a generated one), and never drop an id silently.
Outbound HTTP & messaging — inject same headers on RestTemplate/WebClient/Feign; Kafka record headers.
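
The "validate format and length; else generate" rule from the header checklist can be sketched as below. The accepted character set and the 8 to 64 length bounds are assumptions chosen for illustration, and EdgeIds is a hypothetical name; pick limits that fit your own API guide.

```java
import java.util.UUID;
import java.util.regex.Pattern;

/** Hypothetical edge helper: accept a well-formed client id or mint a new one. */
final class EdgeIds {
    // Assumed policy: 8 to 64 characters of letters, digits, '.', '_' or '-'.
    // Excluding whitespace and control characters keeps ids safe to echo
    // into response headers and log lines.
    private static final Pattern SAFE = Pattern.compile("^[A-Za-z0-9._-]{8,64}$");

    private EdgeIds() {}

    static String acceptOrGenerate(String clientValue) {
        if (clientValue != null && SAFE.matcher(clientValue).matches()) {
            return clientValue;
        }
        // Missing or malformed: replace rather than reject the request.
        return UUID.randomUUID().toString();
    }
}
```

Replacing a bad value instead of failing the request keeps the id mechanism from becoming a new source of errors, while the pattern check stops a client from injecting line breaks or oversized strings into logs.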

Structured logs (required fields)

trace_id, span_id, request_id, correlation_id,
service, pod, user_id (if auth), tenant_id, http.route, outcome

Log indexer — Loki/ELK/Datadog: index trace_id and correlation_id as keyword fields.
Return id to clients — response header + field in error payload so support can ask user to copy it.
Support console or runbook — “Search logs where correlation_id = …” and “Open trace in Jaeger.”
Fallback when id missing (legacy) — narrow time window + user id + endpoint; acknowledge ambiguity.
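
For the messaging item in the checklist, the following sketch shows the shape of copying ids into message headers on the producer side and restoring them on the consumer side. It deliberately uses a plain Map<String, byte[]> as a stand-in for Kafka record headers so it runs without a broker or client library; MessageIds and its methods are hypothetical names. With the real kafka-clients API the equivalent calls would be record.headers().add(name, bytes) when producing and record.headers().lastHeader(name) when consuming.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: moves correlation ids in and out of message headers. */
final class MessageIds {
    // Header names taken from the platform standard above.
    static final String[] PROPAGATED = {"X-Correlation-ID", "X-Request-ID", "traceparent"};

    private MessageIds() {}

    /** Producer side: copy known ids from the current context into headers. */
    static Map<String, byte[]> toHeaders(Map<String, String> context) {
        Map<String, byte[]> headers = new LinkedHashMap<>();
        for (String name : PROPAGATED) {
            String value = context.get(name);
            if (value != null) {
                headers.put(name, value.getBytes(StandardCharsets.UTF_8));
            }
        }
        return headers;
    }

    /** Consumer side: rebuild the context before any application logging happens. */
    static Map<String, String> fromHeaders(Map<String, byte[]> headers) {
        Map<String, String> context = new LinkedHashMap<>();
        for (String name : PROPAGATED) {
            byte[] raw = headers.get(name);
            if (raw != null) {
                context.put(name, new String(raw, StandardCharsets.UTF_8));
            }
        }
        return context;
    }
}
```

On the consumer, the restored values would go into MDC before the handler runs and be cleared afterwards, mirroring what the inbound HTTP filter does.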

Support workflow (target state)

User provides correlation id from app error screen or email receipt.
Support searches log platform → all services, one timeline.
Escalation opens trace waterfall with same id.

How — implement in Java and roll out

Inbound filter (Spring sketch)

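import jakarta.servlet.FilterChain; // javax.servlet.* before Spring 6 / Boot 3
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.util.UUID;
import org.slf4j.MDC;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;
@Component
public class CorrelationFilter extends OncePerRequestFilter {
  @Override
  protected void doFilterInternal(HttpServletRequest req, HttpServletResponse res, FilterChain chain) throws ServletException, IOException {
    String requestId = headerOrUuid(req, "X-Request-ID");
    String correlationId = headerOrUuid(req, "X-Correlation-ID");
    MDC.put("request_id", requestId);
    MDC.put("correlation_id", correlationId);
    res.setHeader("X-Request-ID", requestId); // set before the chain runs, while the response is uncommitted
    res.setHeader("X-Correlation-ID", correlationId);
    try { chain.doFilter(req, res); }
    finally { MDC.clear(); } // pooled threads must not leak ids into the next request
  }
  // Reuse the inbound header when present; otherwise generate a random UUID.
  private static String headerOrUuid(HttpServletRequest req, String name) {
    String value = req.getHeader(name);
    return (value == null || value.isBlank()) ? UUID.randomUUID().toString() : value;
  }
}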

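With OpenTelemetry, prefer putting the tracer’s trace_id into MDC (for example via the Micrometer Tracing OpenTelemetry bridge) so logs and traces always match. Check that the MDC key the bridge writes is the same key your log pattern reads.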

Async and thread pools

Spring TaskDecorator copies MDC to worker threads.
@Async, CompletableFuture, reactive: use context propagation libraries (Micrometer Context Propagation).
Never log from a bare new Thread() without context.
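
What a TaskDecorator does for MDC can be illustrated with the standard library alone: snapshot the context on the submitting thread, install it on the worker, and restore the worker's previous state afterwards. In this sketch LogContext is a hypothetical ThreadLocal-based stand-in for SLF4J's MDC; with the real MDC the snapshot and restore steps would be MDC.getCopyOfContextMap() and MDC.setContextMap(...).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a per-thread logging context such as SLF4J's MDC. */
final class LogContext {
    private static final ThreadLocal<Map<String, String>> CTX =
        ThreadLocal.withInitial(HashMap::new);

    private LogContext() {}

    static void put(String key, String value) { CTX.get().put(key, value); }
    static String get(String key) { return CTX.get().get(key); }
    static Map<String, String> copy() { return new HashMap<>(CTX.get()); }
    static void restore(Map<String, String> snapshot) { CTX.set(new HashMap<>(snapshot)); }

    /** Wraps a task so it runs with the context of the thread that submitted it. */
    static Runnable propagating(Runnable task) {
        // Captured now, on the submitting thread.
        Map<String, String> captured = copy();
        return () -> {
            // Executed later, on the worker thread.
            Map<String, String> previous = copy();
            restore(captured);
            try {
                task.run();
            } finally {
                // Put the worker's own context back so ids do not leak between tasks.
                restore(previous);
            }
        };
    }
}
```

Usage would look like executor.submit(LogContext.propagating(() -> doWork())). The important detail is where the snapshot happens: wrapping must occur on the request thread at submission time, which is why a decorator on the executor is more reliable than asking every caller to remember it.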

Logback pattern example

%d{ISO8601} [%thread] %-5level %logger - trace=%X{trace_id} req=%X{request_id} corr=%X{correlation_id} - %msg%n

Rollout plan

Phase	Deliverable
1	Gateway + top 3 services log `trace_id`
2	Indexer fields + support runbook
3	All producers/consumers on Kafka
4	Client apps show correlation id on errors
5	CI lint: integration test asserts log line contains id
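
The phase 5 check can be as small as a helper that an integration test calls on captured log output. The sketch below assumes log lines follow the Logback pattern shown earlier, where the correlation id appears as corr=<value>; LogLineCheck is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical test helper: checks that a log line carries the expected correlation id. */
final class LogLineCheck {
    // Matches the "corr=<value>" token produced by the pattern above.
    // An empty MDC value renders as "corr=" followed by a space, which does not match.
    private static final Pattern CORR = Pattern.compile("\\bcorr=(\\S+)");

    private LogLineCheck() {}

    static boolean hasCorrelationId(String logLine, String expectedId) {
        Matcher m = CORR.matcher(logLine);
        return m.find() && m.group(1).equals(expectedId);
    }
}
```

An integration test would send one request with a known X-Correlation-ID, capture the service's log output, and assert that hasCorrelationId returns true for every line emitted while handling that request.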

Privacy and security

Use random UUIDs — do not embed email, account number, or PAN in correlation ids.
Restrict log search RBAC; correlation id is not authentication.

Verify

One API call → same id in gateway access log, service log, downstream service log.
Support test ticket: find full path in < 2 minutes.
Async job log includes parent request’s ids.

Interview one-liner

“I generate a correlation id at the gateway, propagate it on every hop and Kafka message, put trace_id and request_id in structured logs and MDC—including async—and return the id to the client so support can search one key across all services.”