Observability & Production Readiness

Production Spring apps need three pillars: metrics (how fast and how much), traces (where time went across services), and logs (what happened with context). Boot wires Micrometer, Actuator, and tracing bridges so you instrument once and export to Prometheus, Zipkin, or your log aggregator.

Level: mid to senior · Spring Boot 3.x

Micrometer & metrics

Micrometer is a vendor-neutral metrics facade—like SLF4J for logging. You instrument with MeterRegistry once; Boot auto-configures exporters for Prometheus, Datadog, CloudWatch, and more via classpath dependencies.

flowchart LR
  App[Spring Boot app] --> MR[MeterRegistry]
  MR --> JVM[JVM / Tomcat / JDBC metrics]
  MR --> Custom[Your counters & timers]
  MR --> Prom[Prometheus registry]
  Prom --> EP["/actuator/prometheus"]
  EP --> Graf[Grafana dashboards]
Dependencies
implementation "org.springframework.boot:spring-boot-starter-actuator"
runtimeOnly "io.micrometer:micrometer-registry-prometheus"
application.yml — expose endpoints
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  metrics:
    tags:
      application: ${spring.application.name}
      environment: ${spring.profiles.active:default}   # fallback when no profile is active

Boot already registers HTTP server metrics, JVM memory/GC, datasource pool stats, and cache metrics (from the Caching chapter). Add business metrics where SLIs are defined—orders placed, payments failed, queue depth.

MeterRegistry, Counter, Gauge, Timer, DistributionSummary

Meter | Purpose | Example
Counter | Monotonically increasing count | orders.created, cache.misses
Gauge | Current value (up or down) | queue.size, active.sessions
Timer | Duration + count of events | http.server.requests, payment.latency
DistributionSummary | Distribution of values (not time) | payload.bytes, batch.size
Custom metrics
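import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.util.concurrent.atomic.AtomicInteger;
import org.springframework.stereotype.Service;

@Service
class OrderService {
  private final Counter ordersCreated;
  private final Timer checkoutTimer;
  // Kept in a field: the registry holds only a weak reference to gauge state
  private final AtomicInteger pendingOrders = new AtomicInteger();

  OrderService(MeterRegistry registry) {
    // Fixed, low-cardinality tag (channel=web)
    ordersCreated = registry.counter("orders.created", "channel", "web");
    checkoutTimer = registry.timer("checkout.duration");
    // The gauge reads the AtomicInteger each time the registry is sampled
    registry.gauge("orders.pending", pendingOrders);
  }

  Order placeOrder(OrderRequest req) {
    // record() times the lambda and returns its result
    return checkoutTimer.record(() -> {
      pendingOrders.incrementAndGet();
      try {
        Order order = doCheckout(req);
        ordersCreated.increment();   // counted only when checkout succeeds
        return order;
      } finally {
        pendingOrders.decrementAndGet();
      }
    });
  }
}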

@Timed and @Counted

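@Timed and @Counted record metrics through AspectJ aspects (TimedAspect and CountedAspect), so Spring AOP must be on the classpath, for example via spring-boot-starter-aop. Because the aspects work through Spring proxies, self-invoked methods are not measured. Prefer explicit timers in hot paths where you need custom tags; use annotations for broad method coverage.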

@Timed / @Counted
@Timed(value = "payment.charge", description = "Payment gateway round-trip")
@Counted(value = "payment.attempts")
public PaymentResult charge(ChargeRequest req) {
  return gateway.charge(req);
}

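// Register the aspects (if Boot is not auto-configuring them); @Counted needs its own aspect
@Configuration
@EnableAspectJAutoProxy
class MetricsAspectConfig {
  @Bean
  TimedAspect timedAspect(MeterRegistry registry) {
    return new TimedAspect(registry);
  }

  @Bean
  CountedAspect countedAspect(MeterRegistry registry) {
    return new CountedAspect(registry);
  }
}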

Custom tags — avoid high cardinality

Tags (labels) multiply time series: http.server.requests{uri="/users/123"} creates one series per user ID. Prometheus and Grafana choke on millions of unique label combinations.

Safe tags | Unsafe tags (high cardinality)
method, status, outcome | userId, orderId, raw URL paths
exception (class name, not message) | stackTrace, SQL text
Normalized URI templates: /api/orders/{id} | Full query strings with unique params
⚠ Pitfall

Boot's HTTP metrics use URI templates when possible. If you add custom tags in filters, never tag with unbounded values. Use logs or traces for per-request detail—not metric labels.
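
A minimal sketch of the normalization idea, for cases where you tag metrics yourself and only have the raw path. The class name UriTagNormalizer and the {id} placeholder are assumptions for illustration, not a Micrometer or Boot API. It collapses numeric and UUID path segments and drops the query string so the tag value stays bounded.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class UriTagNormalizer {
    // Purely numeric segments, e.g. /users/123
    private static final Pattern NUMERIC = Pattern.compile("\\d+");
    // Canonical 8-4-4-4-12 hex UUID segments
    private static final Pattern UUID_SEGMENT = Pattern.compile(
            "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}");

    private UriTagNormalizer() {}

    static String normalize(String rawPath) {
        // Query strings are unbounded by nature, so they never reach the tag
        int queryStart = rawPath.indexOf('?');
        String path = queryStart >= 0 ? rawPath.substring(0, queryStart) : rawPath;
        return Arrays.stream(path.split("/", -1))
                .map(UriTagNormalizer::normalizeSegment)
                .collect(Collectors.joining("/"));
    }

    private static String normalizeSegment(String segment) {
        boolean looksLikeId = NUMERIC.matcher(segment).matches()
                || UUID_SEGMENT.matcher(segment).matches();
        return looksLikeId ? "{id}" : segment;
    }

    public static void main(String[] args) {
        System.out.println(normalize("/users/123"));                  // /users/{id}
        System.out.println(normalize("/api/orders/42?expand=items")); // /api/orders/{id}
        System.out.println(normalize("/actuator/health"));            // /actuator/health
    }
}
```

Free-form segments such as slugs still pass through unchanged, so an allowlist of known route templates is the safer choice when the routes are known up front.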

Prometheus integration

With micrometer-registry-prometheus, scrape GET /actuator/prometheus for text exposition format. Prometheus pulls metrics on an interval; your app does not push (unless you add Pushgateway for batch jobs).

prometheus.yml scrape config
scrape_configs:
  - job_name: order-service
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ["order-service:8080"]

Example scraped lines (the _bucket series appear only when percentile histograms are enabled for the meter):

Prometheus exposition (sample)
# HELP http_server_requests_seconds Duration of HTTP server request handling
# TYPE http_server_requests_seconds histogram
http_server_requests_seconds_bucket{method="GET",uri="/api/orders/{id}",status="200",le="0.1"} 842.0
http_server_requests_seconds_count{method="GET",uri="/api/orders/{id}",status="200"} 1204.0
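
Bucket counters are cumulative: the le="0.1" series counts every request that finished in 0.1 s or less, so dividing it by the _count series gives the share of requests inside that bound. A small sketch of that arithmetic with the sample values above; the class and method names are illustrative only.

```java
import java.util.Locale;

final class BucketShare {
    private BucketShare() {}

    /** Share of observations at or below a bucket's upper bound (le). */
    static double shareWithin(double cumulativeBucketCount, double totalCount) {
        if (totalCount <= 0) {
            return Double.NaN; // no observations yet, so the ratio is undefined
        }
        return cumulativeBucketCount / totalCount;
    }

    public static void main(String[] args) {
        double share = shareWithin(842.0, 1204.0);
        // prints "69.9% of requests finished within 100 ms"
        System.out.println(String.format(Locale.ROOT,
                "%.1f%% of requests finished within 100 ms", share * 100));
    }
}
```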

Grafana dashboard basics

  • RED method: Rate (requests/sec), Errors (5xx ratio), Duration (p50/p95/p99 latency) per service
  • JVM panels: heap used, GC pause, thread count from jvm.* metrics
  • USE for dependencies: Utilization, Saturation, Errors on DB pool (hikaricp.connections.*)
  • Variables: $application, $environment from common tags
  • Alerts: p99 latency > SLO, error rate > 1%, pod restarts; route them to PagerDuty/Slack
💡 Pro Tip

Import Grafana dashboard ID 4701 (JVM Micrometer) or 12900 (Spring Boot 2.1+ statistics) as a starting point—then customize for your SLIs. Correlate batch job metrics from Spring Batch with nightly run windows.

Distributed tracing

Metrics tell you that latency spiked; traces tell you which hop—API gateway, auth service, database, external payment API. A trace is a tree of spans, each with timing and metadata, propagated across HTTP and messaging boundaries.

sequenceDiagram
  participant C as Client
  participant A as API Service
  participant B as Payment Service
  participant D as Database
  C->>A: traceparent: 00-abc-span1
  A->>B: traceparent: 00-abc-span2
  B->>D: JDBC span
  B-->>A: response
  A-->>C: response

Trace, Span, and W3C TraceContext

Concept | Description
Trace | End-to-end request journey; shared traceId
Span | Single operation (HTTP call, DB query, Kafka publish) with spanId, parent reference, duration
Propagation | Headers carry context to downstream services: W3C traceparent / tracestate
W3C traceparent header
// version-traceId-spanId-flags
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
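
To make the field layout concrete, here is a hedged sketch of a parser for the four-field header shown above. The TraceParent record is a name invented for this example, not a Micrometer or Brave type, and it only handles the version-00 layout; a real propagator also deals with tracestate and future versions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parsed fields of a four-field W3C traceparent header (illustrative type). */
record TraceParent(String version, String traceId, String parentId, boolean sampled) {

    // Four lowercase-hex fields of 2, 32, 16 and 2 characters, separated by dashes
    private static final Pattern FORMAT = Pattern.compile(
            "([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed traceparent: " + header);
        }
        String traceId = m.group(2);
        String parentId = m.group(3);
        // The spec treats all-zero IDs as invalid
        if (isAllZero(traceId) || isAllZero(parentId)) {
            throw new IllegalArgumentException("All-zero trace or parent ID: " + header);
        }
        // Bit 0 of the flags byte is the "sampled" flag
        boolean sampled = (Integer.parseInt(m.group(4), 16) & 0x01) != 0;
        return new TraceParent(m.group(1), traceId, parentId, sampled);
    }

    private static boolean isAllZero(String hex) {
        return hex.chars().allMatch(c -> c == '0');
    }

    public static void main(String[] args) {
        TraceParent tp = parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        // prints "4bf92f3577b34da6a3ce929d0e0e4736 sampled=true"
        System.out.println(tp.traceId() + " sampled=" + tp.sampled());
    }
}
```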

Micrometer Tracing (Spring Boot 3)

Micrometer Tracing replaces Spring Cloud Sleuth in Boot 3. It provides a tracing API; you choose a bridge implementation on the classpath.

Bridge | Backend | Dependency
Brave | Zipkin (default in many Boot apps) | micrometer-tracing-bridge-brave + zipkin-reporter-brave
OpenTelemetry | OTLP collector → Jaeger, Tempo, vendor APM | micrometer-tracing-bridge-otel + OTLP exporter
Brave + Zipkin setup
management:
  tracing:
    sampling:
      probability: 0.1        # 10% in prod; 1.0 in dev
  zipkin:
    tracing:
      endpoint: http://zipkin:9411/api/v2/spans
OpenTelemetry OTLP
management:
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces

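Boot auto-instruments Spring MVC and WebFlux server requests, plus RestTemplate and WebClient instances built from the auto-configured builders. Other integrations vary: Kafka observations are switched on through properties, scheduled-task observations depend on the Spring version in use, and JDBC spans usually come from an add-on such as Datasource Micrometer. Custom spans: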

Custom span with Observation API
@Service
class FraudCheckService {
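  private final ObservationRegistry observations;

  // Constructor injection; without it the final field is never assigned
  FraudCheckService(ObservationRegistry observations) {
    this.observations = observations;
  }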

  boolean check(Order order) {
    return Observation.createNotStarted("fraud.check", observations)
        .lowCardinalityKeyValue("payment.method", order.paymentMethod())
        .observe(() -> runRules(order));
  }
}
📌 Version Note

Spring Cloud Sleuth 3.x is for Boot 2.x only. Migrating to Boot 3: remove Sleuth, add micrometer-tracing bridge, update config keys from spring.sleuth.* to management.tracing.*.

Baggage propagation

Baggage is key-value context that rides along with the trace across services—tenant ID, experiment flag, support tier. Unlike span tags, baggage is forwarded to downstream calls (with size limits).

Baggage fields
management:
  tracing:
    baggage:
      remote-fields: tenant-id, experiment
      correlation:
        fields: tenant-id, experiment
Set baggage in a filter
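@Component
class TenantBaggageFilter implements WebFilter {
  // Brave API (brave.baggage.BaggageField); create the field once and reuse it
  private static final BaggageField TENANT_ID = BaggageField.create("tenant-id");

  @Override
  public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
    String tenant = exchange.getRequest().getHeaders().getFirst("X-Tenant-Id");
    if (tenant != null) {
      // Updates the current trace context; returns false (no-op) if no span is in scope on this thread
      TENANT_ID.updateValue(tenant);
    }
    return chain.filter(exchange);
  }
}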
⚠ Pitfall

Baggage propagates on every outbound call—keep it small (tenant, locale). Never put PII or JWTs in baggage; it may leak to third-party services in the trace path.

🎯 Interview

"How do traces relate to logs?" — Same traceId appears in MDC log fields and trace UI. Jump from Grafana log line → Tempo/Zipkin trace → slow span. That's the observability golden path.


Logging best practices

Logs are the narrative of production—errors, audit events, and debug trails. Spring Boot defaults to SLF4J (API) + Logback (implementation). Structure them for machines (JSON) while keeping human-readable local dev output.

SLF4J + Logback vs Log4j2

Aspect | Logback (Boot default) | Log4j2
Setup | Zero config out of the box | Exclude Logback, add spring-boot-starter-log4j2
Async | AsyncAppender | AsyncLogger (LMAX Disruptor, very fast)
When to switch | Most apps stay here | High-throughput logging, existing Log4j2 ecosystem
SLF4J usage — parameterized messages
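@Slf4j
@RequiredArgsConstructor
@Service
class PaymentService {
  private final PaymentGateway gateway; // hypothetical client type, injected via the Lombok constructor
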
  void charge(String orderId, BigDecimal amount) {
    log.info("Charging order {} amount {}", orderId, amount);
    try {
      gateway.charge(amount);
    } catch (PaymentException ex) {
      log.warn("Payment failed for order {}: {}", orderId, ex.getMessage());
      throw ex;
    }
  }
}

Structured logging — JSON for aggregation

Plain text logs break parsers in ELK, Loki, or CloudWatch. JSON lines with fixed fields (timestamp, level, message, traceId) enable reliable queries.

logstash-logback-encoder
implementation "net.logstash.logback:logstash-logback-encoder:7.4"
JSON log line (production)
{"@timestamp":"2026-06-04T10:15:00.123Z","level":"INFO","logger":"c.a.OrderService",
 "message":"Order created","orderId":"ord_42","traceId":"4bf92f3577b34da6",
 "spanId":"00f067aa0ba902b7","tenant-id":"acme"}

MDC — Mapped Diagnostic Context

MDC is a per-thread (or per-reactive-context) map injected into every log line. Populate at request entry: request ID, user, trace ID—clear on exit to avoid leaks in thread pools.

Servlet filter — MDC
@Component
class RequestMdcFilter extends OncePerRequestFilter {
  @Override
  protected void doFilterInternal(HttpServletRequest req, HttpServletResponse res,
                                  FilterChain chain) throws ServletException, IOException {
    try {
      MDC.put("requestId", req.getHeader("X-Request-Id") != null
          ? req.getHeader("X-Request-Id") : UUID.randomUUID().toString());
      MDC.put("userId", resolveUserId(req).orElse("anonymous"));
      // traceId/spanId auto-populated by Micrometer Tracing when correlation enabled
      chain.doFilter(req, res);
    } finally {
      MDC.clear();
    }
  }
}
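
Why the finally block matters: pooled threads are reused, so anything left in a thread-bound map is visible to the next request on that thread. The sketch below imitates MDC with a plain ThreadLocal and a single-thread pool; the class and method names are made up for the demo and no logging library is involved.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ThreadLocalLeakDemo {
    // Stand-in for MDC: one value per thread
    private static final ThreadLocal<String> USER = new ThreadLocal<>();

    private ThreadLocalLeakDemo() {}

    /** Runs two "requests" back to back on one pooled thread; returns what the second one sees. */
    static String secondRequestSees(boolean clearOnExit) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Runnable firstRequest = () -> {
                try {
                    USER.set("alice");      // like MDC.put("userId", ...)
                } finally {
                    if (clearOnExit) {
                        USER.remove();      // like MDC.clear()
                    }
                }
            };
            Callable<String> secondRequest = USER::get;
            pool.submit(firstRequest).get();
            return pool.submit(secondRequest).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("without clear: " + secondRequestSees(false)); // without clear: alice
        System.out.println("with clear: " + secondRequestSees(true));     // with clear: null
    }
}
```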

For WebFlux, use ContextRegistry / Reactor hooks—MDC does not propagate across reactive threads automatically without Micrometer's context propagation.

Log levels

Level | Use
TRACE | Fine-grained flow (rare in prod)
DEBUG | Dev troubleshooting; enable per-package in prod briefly
INFO | Business events, startup, healthy state changes
WARN | Recoverable issues, deprecated usage, retry succeeded
ERROR | Failures requiring attention; include context, not stack traces for expected cases
💡 Pro Tip

Log at INFO for state changes auditors care about (order placed, refund issued). Log at DEBUG for developer detail. Never log secrets, full card numbers, or raw JWTs—mask or omit.
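
One way to follow the masking advice is to pass sensitive values through a helper before they reach the logger. The helper below is a sketch under that assumption (LogMasking is not a library class); it keeps only the last four digits of a card number.

```java
final class LogMasking {
    private LogMasking() {}

    /** Masks all but the last four digits; separators such as spaces or dashes are dropped. */
    static String maskCardNumber(String cardNumber) {
        if (cardNumber == null) {
            return null;
        }
        String digits = cardNumber.replaceAll("[^0-9]", "");
        if (digits.length() <= 4) {
            // Too short to reveal anything safely: mask everything
            return "*".repeat(digits.length());
        }
        int hidden = digits.length() - 4;
        return "*".repeat(hidden) + digits.substring(hidden);
    }

    public static void main(String[] args) {
        System.out.println(maskCardNumber("4111 1111 1111 1111")); // ************1111
    }
}
```

A call site would then look like log.info("Card {} authorized", LogMasking.maskCardNumber(card)), so the raw value never enters the log pipeline.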

Async logging appender

Logging synchronously blocks the caller on disk I/O. An async appender buffers to a background thread—critical for latency-sensitive paths.
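
The mechanism can be sketched with a bounded queue and one worker thread. This is an illustration of the idea only, not Logback's implementation: AsyncSink is a made-up class, and it drops lines when the buffer is full, whereas Logback's AsyncAppender has its own policy controlled by queueSize, discardingThreshold, and neverBlock.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

final class AsyncSink implements AutoCloseable {
    // Unique instance used as a stop marker; compared by identity on purpose
    private static final String STOP = new String("stop");

    private final BlockingQueue<String> queue;
    private final Thread worker;

    AsyncSink(int queueSize, Consumer<String> slowSink) {
        this.queue = new ArrayBlockingQueue<>(queueSize);
        this.worker = new Thread(() -> {
            try {
                // The slow write happens here, off the caller's thread
                for (String line = queue.take(); line != STOP; line = queue.take()) {
                    slowSink.accept(line);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "async-log-worker");
        this.worker.setDaemon(true);
        this.worker.start();
    }

    /** Never blocks the caller; returns false when the buffer is full and the line is dropped. */
    boolean log(String line) {
        return queue.offer(line);
    }

    /** Flushes everything queued so far, then stops the worker. */
    @Override
    public void close() {
        try {
            queue.put(STOP);
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        List<String> written = new CopyOnWriteArrayList<>();
        try (AsyncSink sink = new AsyncSink(16, written::add)) {
            for (int i = 1; i <= 5; i++) {
                sink.log("line " + i);
            }
        }
        System.out.println(written.size() + " lines written"); // 5 lines written
    }
}
```

Dropping versus blocking is the central trade-off: dropping protects request latency, blocking protects log completeness.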

Logback configuration — logback-spring.xml

Use logback-spring.xml (not logback.xml) so Boot can apply springProfile extensions.

src/main/resources/logback-spring.xml
<configuration>
  <include resource="org/springframework/boot/logging/logback/defaults.xml"/>

  <springProfile name="local">
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
      <encoder>
        <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} trace=%X{traceId} - %msg%n</pattern>
      </encoder>
    </appender>
    <root level="INFO">
      <appender-ref ref="CONSOLE"/>
    </root>
    <logger name="com.acme" level="DEBUG"/>
  </springProfile>

  <springProfile name="prod">
    <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
      <encoder class="net.logstash.logback.encoder.LogstashEncoder">
        <includeMdcKeyName>traceId</includeMdcKeyName>
        <includeMdcKeyName>spanId</includeMdcKeyName>
        <includeMdcKeyName>requestId</includeMdcKeyName>
        <includeMdcKeyName>userId</includeMdcKeyName>
      </encoder>
    </appender>
    <appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
      <queueSize>8192</queueSize>
      <discardingThreshold>0</discardingThreshold>
      <appender-ref ref="JSON"/>
    </appender>
    <root level="INFO">
      <appender-ref ref="ASYNC"/>
    </root>
  </springProfile>
</configuration>
Profile-specific levels in YAML
# application-prod.yml
logging:
  level:
    root: INFO
    com.acme: INFO
    org.springframework.web: WARN
    org.hibernate.SQL: WARN
🌍 Real World

In Kubernetes, log to stdout only and let the node agent (Fluent Bit, Promtail) ship logs to your aggregator. Rotate files only for bare-metal legacy deploys. Use pod labels such as app and version in Loki/ELK queries, alongside the traceId log field.

Health & readiness

Actuator's health endpoint tells orchestrators whether your pod should receive traffic. Since Boot 2.3, liveness and readiness are separate probes: Kubernetes restarts the container when liveness fails and removes the pod from load balancing when readiness fails.

Probe | Question | K8s action on failure
Liveness | Is the process alive (not deadlocked)? | Restart container
Readiness | Can it accept traffic (DB up, cache warm)? | Remove from Service endpoints
Startup (optional) | Has slow initialization finished? | Hold off liveness checks until it passes
Enable probe endpoints
management:
  endpoint:
    health:
      probes:
        enabled: true
      show-details: when_authorized
  endpoints:
    web:
      exposure:
        include: health,prometheus

Endpoints:

  • GET /actuator/health/liveness — always UP unless app is broken
  • GET /actuator/health/readiness — DOWN when dependencies unavailable
  • GET /actuator/health — aggregate view
Kubernetes probes
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

Custom HealthIndicator

Register beans implementing HealthIndicator or ReactiveHealthIndicator (WebFlux). They appear under /actuator/health and affect readiness when grouped.

Custom health check
@Component
class PaymentGatewayHealth implements HealthIndicator {
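  private final PaymentClient client;

  // Constructor injection; without it the final field is never assigned
  PaymentGatewayHealth(PaymentClient client) {
    this.client = client;
  }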

  @Override
  public Health health() {
    try {
      client.ping();
      return Health.up().withDetail("gateway", "reachable").build();
    } catch (Exception ex) {
      return Health.down(ex).withDetail("gateway", "unreachable").build();
    }
  }
}

Readiness group (application.yml)
# Health groups are configured through properties; include only critical dependencies.
# readinessState keeps Boot's own availability state in the group.
management:
  endpoint:
    health:
      group:
        readiness:
          include: readinessState,db,redis,paymentGatewayHealth
⚠ Pitfall

Do not put slow external checks on liveness—a flaky partner API would restart your pods in a loop. Liveness should be lightweight (JVM responsive). Degrade readiness when dependencies fail.
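
The split can be pictured without any framework: liveness answers from the process alone, while readiness aggregates dependency checks. The sketch below uses invented names (ProbeSketch, addReadinessCheck) and plain JDK types; it is not how Actuator is implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

final class ProbeSketch {
    private final Map<String, BooleanSupplier> readinessChecks = new LinkedHashMap<>();

    void addReadinessCheck(String name, BooleanSupplier check) {
        readinessChecks.put(name, check);
    }

    /** Liveness makes no dependency calls: being able to answer shows the process is responsive. */
    boolean live() {
        return true;
    }

    /** Readiness passes only if every dependency check passes; a throwing check counts as failed. */
    boolean ready() {
        for (BooleanSupplier check : readinessChecks.values()) {
            try {
                if (!check.getAsBoolean()) {
                    return false;
                }
            } catch (RuntimeException ex) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ProbeSketch probes = new ProbeSketch();
        probes.addReadinessCheck("db", () -> true);
        probes.addReadinessCheck("paymentGateway", () -> {
            throw new IllegalStateException("gateway unreachable");
        });
        // prints "live=true ready=false"
        System.out.println("live=" + probes.live() + " ready=" + probes.ready());
    }
}
```

With the gateway down, the pod stops receiving traffic but is not restarted, which is the behavior the pitfall above asks for.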

Graceful shutdown

On SIGTERM (K8s pod deletion, rolling deploy), stop accepting new work, drain in-flight requests, then exit. Boot 2.3+ supports graceful shutdown for embedded web servers.

Graceful shutdown config
server:
  shutdown: graceful

spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s
sequenceDiagram
  participant K8s as Kubernetes
  participant Pod as Spring Boot Pod
  participant LB as Load Balancer
  K8s->>Pod: SIGTERM
  Pod->>LB: Readiness DOWN (stop new traffic)
  Pod->>Pod: Complete in-flight requests (≤30s)
  Pod->>Pod: Close connections, flush logs
  Pod->>K8s: Process exit 0

Combine with:

  • terminationGracePeriodSeconds on the pod ≥ shutdown timeout + buffer
  • Optional preStop hook delay so endpoint removal propagates before SIGTERM reaches the app
  • Kafka consumers: @PreDestroy or SmartLifecycle to commit offsets cleanly
  • Batch jobs: let running steps finish or use Spring Batch restart semantics
Listen for shutdown phase
@Slf4j
@Component
class DrainOnShutdown {
  @EventListener
  void onShutdown(ContextClosedEvent event) {
    log.info("Shutdown initiated — draining work");
    // stop schedulers, close custom thread pools
  }
}
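
For the "close custom thread pools" step, a common JDK pattern is shutdown, bounded wait, then forced stop. The helper below is a sketch of that pattern; PoolDrainer is a hypothetical name, and the timeout you pass would normally stay below timeout-per-shutdown-phase.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class PoolDrainer {
    private PoolDrainer() {}

    /** Returns true if all tasks finished within the timeout, false if stragglers were interrupted. */
    static boolean drain(ExecutorService pool, Duration timeout) {
        pool.shutdown(); // reject new tasks, keep running queued and in-flight ones
        try {
            if (pool.awaitTermination(timeout.toMillis(), TimeUnit.MILLISECONDS)) {
                return true;
            }
            pool.shutdownNow(); // timeout hit: interrupt whatever is still running
            return false;
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() -> System.out.println("in-flight task finished"));
        System.out.println("drained cleanly: " + drain(pool, Duration.ofSeconds(5)));
    }
}
```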
🎯 Interview

"Liveness vs readiness?" — Liveness: restart if app is stuck. Readiness: stop sending traffic if app cannot serve (DB down). Wrong liveness check = restart storm; wrong readiness = 503s to users.

🔬 Under the Hood

Graceful shutdown registers with Spring's lifecycle: web server stops accepting connections first, then beans destroy in reverse dependency order. timeout-per-shutdown-phase forces exit if drain exceeds limit—balance with K8s terminationGracePeriodSeconds.