Support cannot correlate complaints to logs

Scenario

A user reports “payment failed at 2:15pm.” Support searches logs by email and sees thousands of lines across twelve services, none clearly tied to that click. Engineering greps for the user id and guesses. Incidents drag on because the platform never standardized one id per request across headers, logs, and responses. You need a platform-level fix, not another ad-hoc field in one service.

After reading, you should be able to:

Why — without a shared key, every team invents one

A correlation id is an opaque identifier that labels one logical attempt through the system. It lets humans and tools stitch HTTP access logs, application logs, audit events, and traces into one story. Without it, you rely on fuzzy matches (user id + timestamp) that collide under load: lines from unrelated requests get merged into one timeline, and the result looks like an “inconsistency” that never happened.

IDs you will see (align the platform)

| ID | Scope | Typical source |
|---|---|---|
| Correlation id | Business/support-facing; may equal trace id | Gateway generates or accepts from client |
| Request id | One HTTP request attempt | Per inbound call; new id on retry |
| Trace id (OpenTelemetry) | Full distributed trace tree | Tracer; W3C traceparent |
| Span id | One operation inside a service | Auto from tracer |

Practical standard: expose one id to support (e.g. correlation_id) that matches trace_id when tracing is enabled; always log both request id and trace id for engineering.
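One way to make correlation_id match trace_id is to read the trace id out of the incoming traceparent header and fall back to a generated id when tracing is off. The sketch below assumes W3C Trace Context version 00 only (version-traceid-parentid-flags, lowercase hex); the class and method names (TraceParent, correlationIdFor) are hypothetical, and a real tracer library would normally do this parsing for you.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TraceParent {
  // Version 00 layout: 2 hex version, 32 hex trace id, 16 hex parent id, 2 hex flags.
  private static final Pattern FORMAT =
      Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");
  private static final String ZERO_TRACE = "0".repeat(32);
  private static final String ZERO_PARENT = "0".repeat(16);

  // Returns the trace id, or empty if the header is absent or malformed.
  static Optional<String> traceId(String header) {
    if (header == null) {
      return Optional.empty();
    }
    Matcher m = FORMAT.matcher(header.trim());
    if (!m.matches()) {
      return Optional.empty();
    }
    // The spec treats all-zero trace ids and parent ids as invalid.
    if (m.group(1).equals(ZERO_TRACE) || m.group(2).equals(ZERO_PARENT)) {
      return Optional.empty();
    }
    return Optional.of(m.group(1));
  }

  // Assumed policy: use the trace id as the support-facing id when tracing is on.
  static String correlationIdFor(String traceparent, String fallbackId) {
    return traceId(traceparent).orElse(fallbackId);
  }
}
```

With this policy, a support agent and an engineer are looking at the same 32-character value whenever a trace exists, and at the gateway-generated fallback otherwise.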

What breaks today

What — add to the platform (checklist)

  1. Pick standard header names (document in API guide)
    • X-Request-ID — per HTTP attempt (UUID)
    • X-Correlation-ID — end-user/support reference (may span related calls)
    • traceparent — W3C trace propagation — tracing guide
    Rule: if client sends id, validate format and length; else generate at edge.
  2. Edge / API gateway — generate ids, attach to upstream request, return in response header and error JSON body.
  3. Every service inbound filter: extract headers → MDC / OpenTelemetry context; do not reject a request because an id is missing or malformed (generate a replacement instead), and never drop an id silently.
  4. Outbound HTTP & messaging — inject same headers on RestTemplate/WebClient/Feign; Kafka record headers.
  5. Structured logs (required fields)
    trace_id, span_id, request_id, correlation_id,
    service, pod, user_id (if auth), tenant_id, http.route, outcome
  6. Log indexer — Loki/ELK/Datadog: index trace_id and correlation_id as keyword fields.
  7. Return id to clients — response header + field in error payload so support can ask user to copy it.
  8. Support console or runbook — “Search logs where correlation_id = …” and “Open trace in Jaeger.”
  9. Fallback when id missing (legacy) — narrow time window + user id + endpoint; acknowledge ambiguity.
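The rule in step 1 (validate a client-supplied id, otherwise generate one at the edge) can be sketched as a small helper. The allowed character set and the 8 to 64 length bounds are assumptions chosen for illustration, not a platform requirement from this guide; EdgeIds and acceptOrGenerate are hypothetical names. Restricting the characters also keeps line breaks out of log lines and headers.

```java
import java.util.UUID;
import java.util.regex.Pattern;

final class EdgeIds {
  // Assumed policy: 8 to 64 characters from letters, digits, '.', '_' and '-'.
  private static final Pattern SAFE_ID = Pattern.compile("^[A-Za-z0-9._-]{8,64}$");

  // Keeps a well-formed client id; replaces a missing or malformed one with a UUID.
  static String acceptOrGenerate(String clientValue) {
    if (clientValue != null) {
      String trimmed = clientValue.trim();
      if (SAFE_ID.matcher(trimmed).matches()) {
        return trimmed;
      }
    }
    return UUID.randomUUID().toString();
  }
}
```

Replacing a bad value instead of rejecting the request means a misbehaving client still gets a searchable id back in the response.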

Support workflow (target state)

  1. User provides correlation id from app error screen or email receipt.
  2. Support searches log platform → all services, one timeline.
  3. Escalation opens trace waterfall with same id.

How — implement in Java and roll out

Inbound filter (Spring sketch)

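import jakarta.servlet.FilterChain; // jakarta.* assumes Spring Boot 3+; use javax.servlet.* on Boot 2
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.util.UUID;
import org.slf4j.MDC;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

@Component
public class CorrelationFilter extends OncePerRequestFilter {
  @Override
  protected void doFilterInternal(HttpServletRequest req, HttpServletResponse res, FilterChain chain) throws ServletException, IOException {
    String requestId = headerOrUuid(req, "X-Request-ID");
    String correlationId = headerOrUuid(req, "X-Correlation-ID");
    MDC.put("request_id", requestId);
    MDC.put("correlation_id", correlationId);
    res.setHeader("X-Request-ID", requestId); // set before the chain runs, while the response is still uncommitted
    res.setHeader("X-Correlation-ID", correlationId);
    try { chain.doFilter(req, res); }
    finally { MDC.remove("request_id"); MDC.remove("correlation_id"); } // pooled threads must not leak ids into the next request
  }
  private static String headerOrUuid(HttpServletRequest req, String name) {
    String value = req.getHeader(name); // reuse the caller's id when present, otherwise mint one
    return (value == null || value.isBlank()) ? UUID.randomUUID().toString() : value;
  }
}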

With OpenTelemetry, prefer putting the tracer’s trace_id into MDC (for example via the Micrometer Tracing bridge) so that logs and traces carry the same id.

Async and thread pools
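MDC is thread-local, so ids set by the inbound filter do not follow work handed to an executor unless something copies them across. A minimal sketch of the usual fix is to capture the caller's context when the task is submitted and restore it on the worker thread. To stay runnable on the JDK alone, the example uses a plain ThreadLocal map as a stand-in for MDC; with SLF4J the equivalent calls would be MDC.getCopyOfContextMap() and MDC.setContextMap(...). ContextPropagation and withCallerContext are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

final class ContextPropagation {
  // Stand-in for SLF4J's MDC (an assumption made so the sketch needs no dependencies).
  static final ThreadLocal<Map<String, String>> CONTEXT =
      ThreadLocal.withInitial(HashMap::new);

  // Snapshots the submitting thread's ids now; installs them around the task later.
  static Runnable withCallerContext(Runnable task) {
    Map<String, String> captured = new HashMap<>(CONTEXT.get());
    return () -> {
      Map<String, String> previous = CONTEXT.get();
      CONTEXT.set(new HashMap<>(captured));
      try {
        task.run();
      } finally {
        // Restore whatever the worker thread had, so ids do not leak into the next task.
        CONTEXT.set(previous);
      }
    };
  }
}
```

A caller would submit executor.execute(ContextPropagation.withCallerContext(job)) instead of executor.execute(job). Frameworks offer hooks for the same idea (for example a task decorator on a Spring thread pool executor), which avoids wrapping every call site by hand.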

Logback pattern example

%d{ISO8601} [%thread] %-5level %logger - trace=%X{trace_id} req=%X{request_id} corr=%X{correlation_id} - %msg%n

Rollout plan

| Phase | Deliverable |
|---|---|
| 1 | Gateway + top 3 services log trace_id |
| 2 | Indexer fields + support runbook |
| 3 | All producers/consumers on Kafka |
| 4 | Client apps show correlation id on errors |
| 5 | CI lint: integration test asserts log line contains id |

Privacy and security

Verify

  1. One API call → same id in gateway access log, service log, downstream service log.
  2. Support test ticket: find full path in < 2 minutes.
  3. Async job log includes parent request’s ids.
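These checks, and the CI lint in phase 5 of the rollout, come down to asserting that a captured log line carries non-empty ids. The sketch below assumes lines rendered with the Logback pattern shown earlier, where a missing MDC key renders as an empty string; LogLineCheck and missingIds are hypothetical names, and a JSON log layout would need a different parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogLineCheck {
  // Matches the "trace=... req=... corr=... - " segment of the pattern layout.
  private static final Pattern IDS =
      Pattern.compile("trace=(\\S*) req=(\\S*) corr=(\\S*) - ");
  private static final String[] NAMES = {"trace_id", "request_id", "correlation_id"};

  // Returns the names of ids that are absent or empty in the given log line.
  static List<String> missingIds(String logLine) {
    List<String> missing = new ArrayList<>();
    Matcher m = IDS.matcher(logLine);
    boolean found = m.find();
    for (int i = 0; i < NAMES.length; i++) {
      // If the segment is not present at all, every id counts as missing.
      if (!found || m.group(i + 1).isEmpty()) {
        missing.add(NAMES[i]);
      }
    }
    return missing;
  }
}
```

An integration test could capture one log line per service for a single API call and fail the build when missingIds returns a non-empty list, or when the correlation id differs between services.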

Interview one-liner

“I generate a correlation id at the gateway, propagate it on every hop and Kafka message, put trace_id and request_id in structured logs and MDC—including async—and return the id to the client so support can search one key across all services.”
