Observability & Production Readiness

Production Spring apps need three pillars: metrics (how fast and how much), traces (where time went across services), and logs (what happened with context). Boot wires Micrometer, Actuator, and tracing bridges so you instrument once and export to Prometheus, Zipkin, or your log aggregator.

Level: mid to senior · Spring Boot 3.x

Micrometer & metrics

Micrometer is a vendor-neutral metrics facade—like SLF4J for logging. You instrument with MeterRegistry once; Boot auto-configures exporters for Prometheus, Datadog, CloudWatch, and more via classpath dependencies.

flowchart LR
  App[Spring Boot app] --> MR[MeterRegistry]
  MR --> JVM[JVM / Tomcat / JDBC metrics]
  MR --> Custom[Your counters & timers]
  MR --> Prom[Prometheus registry]
  Prom --> EP["/actuator/prometheus"]
  EP --> Graf[Grafana dashboards]

implementation "org.springframework.boot:spring-boot-starter-actuator"
runtimeOnly "io.micrometer:micrometer-registry-prometheus"

management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  metrics:
    tags:
      application: ${spring.application.name}
      environment: ${spring.profiles.active:default}   # fallback avoids a startup failure when no profile is active

Boot already registers HTTP server metrics, JVM memory/GC, datasource pool stats, and cache metrics (from the Caching chapter). Add business metrics where SLIs are defined—orders placed, payments failed, queue depth.

MeterRegistry, Counter, Gauge, Timer, DistributionSummary

Meter	Purpose	Example
Counter	Monotonically increasing count	orders.created, cache.misses
Gauge	Current value (up or down)	queue.size, active.sessions
Timer	Duration + count of events	http.server.requests, payment.latency
DistributionSummary	Distribution of values (not time)	payload.bytes, batch.size

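import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.util.concurrent.atomic.AtomicInteger;
import org.springframework.stereotype.Service;

@Service
class OrderService {
  private final Counter ordersCreated;
  private final Timer checkoutTimer;
  // Kept as a field: the gauge only holds a weak reference to the object it samples
  private final AtomicInteger pendingOrders = new AtomicInteger();

  OrderService(MeterRegistry registry) {
    ordersCreated = registry.counter("orders.created", "channel", "web");
    checkoutTimer = registry.timer("checkout.duration");
    registry.gauge("orders.pending", pendingOrders);
  }

  Order placeOrder(OrderRequest req) {
    // Timer.record times the supplier and returns its result
    return checkoutTimer.record(() -> {
      pendingOrders.incrementAndGet();
      try {
        Order order = doCheckout(req);
        ordersCreated.increment(); // counted only when checkout succeeds
        return order;
      } finally {
        pendingOrders.decrementAndGet();
      }
    });
  }
}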

@Timed and @Counted

@Timed and @Counted record metrics through AOP aspects (TimedAspect and CountedAspect from micrometer-core), so they need Spring AOP on the classpath, for example via spring-boot-starter-aop. Prefer explicit timers in hot paths where you need custom tags; use annotations for broad method coverage.

@Timed(value = "payment.charge", description = "Payment gateway round-trip")
@Counted(value = "payment.attempts")
public PaymentResult charge(ChargeRequest req) {
  return gateway.charge(req);
}

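// Register the aspects yourself unless Boot auto-configures them
// (management.observations.annotations.enabled=true on Boot 3.2+)
@Configuration
@EnableAspectJAutoProxy
class MetricsAspectConfig {
  @Bean
  TimedAspect timedAspect(MeterRegistry registry) {
    return new TimedAspect(registry);
  }

  // @Counted is ignored unless its own aspect is registered as well
  @Bean
  CountedAspect countedAspect(MeterRegistry registry) {
    return new CountedAspect(registry);
  }
}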

Custom tags — avoid high cardinality

Tags (labels) multiply time series: http.server.requests{uri="/users/123"} creates one series per user ID. Prometheus and Grafana choke on millions of unique label combinations.

Safe tags	Unsafe tags (high cardinality)
method, status, outcome	userId, orderId, raw URL paths
exception (class name, not message)	stackTrace, SQL text
Normalized URI templates: /api/orders/{id}	Full query strings with unique params

⚠ Pitfall

Boot's HTTP metrics use URI templates when possible. If you add custom tags in filters, never tag with unbounded values. Use logs or traces for per-request detail—not metric labels.
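
When a custom filter has only the raw path to work with, one way to keep a tag bounded is to collapse ID-like segments into a template before tagging. The sketch below is a minimal illustration using only the JDK; the class name, the `{id}` placeholder, and the two patterns (numeric IDs and UUIDs) are assumptions, not something Boot or Micrometer provides.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: collapses raw request paths into bounded templates for use as a metric tag. */
final class UriTagNormalizer {
    // Numeric IDs and UUIDs are the usual source of unbounded path segments.
    private static final Pattern UUID_SEGMENT = Pattern.compile(
            "/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}(?=/|$)");
    private static final Pattern NUMERIC_SEGMENT = Pattern.compile("/\\d+(?=/|$)");

    private UriTagNormalizer() {}

    static String normalize(String rawUri) {
        // Drop the query string first: parameter values are almost always unbounded.
        int q = rawUri.indexOf('?');
        String path = q >= 0 ? rawUri.substring(0, q) : rawUri;
        path = UUID_SEGMENT.matcher(path).replaceAll("/{id}");
        return NUMERIC_SEGMENT.matcher(path).replaceAll("/{id}");
    }

    public static void main(String[] args) {
        System.out.println(normalize("/api/orders/123?expand=items")); // /api/orders/{id}
        System.out.println(normalize("/users/42/orders/7"));           // /users/{id}/orders/{id}
    }
}
```

Segments such as `/v2` are left alone because they are not purely numeric. Where the framework already knows the matched route template, prefer that over pattern guessing.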

Prometheus integration

With micrometer-registry-prometheus, scrape GET /actuator/prometheus for text exposition format. Prometheus pulls metrics on an interval; your app does not push (unless you add Pushgateway for batch jobs).

scrape_configs:
  - job_name: order-service
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ["order-service:8080"]

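Example scraped lines (the _bucket series appear only when management.metrics.distribution.percentiles-histogram.http.server.requests=true; labels abbreviated):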

# HELP http_server_requests_seconds Duration of HTTP server request handling
# TYPE http_server_requests_seconds histogram
http_server_requests_seconds_bucket{method="GET",uri="/api/orders/{id}",status="200",le="0.1"} 842.0
http_server_requests_seconds_count{method="GET",uri="/api/orders/{id}",status="200"} 1204.0
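
Because Prometheus histogram buckets are cumulative, the two sample values above are enough to estimate how many requests finished within 100 ms: divide the `le="0.1"` bucket by the total count. The snippet below is a worked arithmetic sketch with an illustrative class name. A real dashboard would compute the same ratio over a rate window rather than on raw totals.

```java
import java.util.Locale;

/** Worked arithmetic on the two sample lines above; the class name is illustrative. */
final class BucketShare {
    private BucketShare() {}

    /** Fraction of observations at or below a bucket's upper bound. */
    static double shareWithin(double bucketCount, double totalCount) {
        if (totalCount <= 0) {
            return Double.NaN; // no traffic yet, so the ratio is undefined
        }
        return bucketCount / totalCount;
    }

    public static void main(String[] args) {
        double share = shareWithin(842.0, 1204.0);
        // prints: 69.9% of requests finished within 100 ms
        System.out.println(String.format(Locale.ROOT, "%.1f%% of requests finished within 100 ms", share * 100));
    }
}
```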

Grafana dashboard basics

RED method — Rate (requests/sec), Errors (5xx ratio), Duration (p50/p95/p99 latency) per service
JVM panels — heap used, GC pause, thread count from jvm.* metrics
USE for dependencies — Utilization, Saturation, Errors on DB pool (hikaricp.connections.*)
Variables — $application, $environment from common tags
Alerts — p99 latency > SLO, error rate > 1%, pod restarts—wire to PagerDuty/Slack

💡 Pro Tip

Import Grafana dashboard ID 4701 (JVM Micrometer) or 12900 (Spring Boot 2.1+ statistics) as a starting point—then customize for your SLIs. Correlate batch job metrics from Spring Batch with nightly run windows.

Distributed tracing

Metrics tell you that latency spiked; traces tell you which hop—API gateway, auth service, database, external payment API. A trace is a tree of spans, each with timing and metadata, propagated across HTTP and messaging boundaries.

sequenceDiagram
  participant C as Client
  participant A as API Service
  participant B as Payment Service
  participant D as Database
  C->>A: traceparent: 00-abc-span1
  A->>B: traceparent: 00-abc-span2
  B->>D: JDBC span
  B-->>A: response
  A-->>C: response

Trace, Span, and W3C TraceContext

Concept	Description
Trace	End-to-end request journey; shared traceId
Span	Single operation (HTTP call, DB query, Kafka publish) with spanId, parent reference, duration
Propagation	Headers carry context to downstream services—W3C traceparent / tracestate

// version-traceId-spanId-flags
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
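
To make the header layout concrete, here is a small parser for the version-00 form shown above. It is a sketch for illustration only: the `TraceParent` record is hypothetical and not part of Micrometer Tracing or Brave, which handle propagation for you.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative parser for the version-00 traceparent layout; not a library type. */
record TraceParent(String version, String traceId, String parentId, boolean sampled) {

    // version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex), all lowercase
    private static final Pattern FORMAT =
            Pattern.compile("([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed traceparent: " + header);
        }
        // All-zero IDs are invalid under the W3C Trace Context rules.
        if (m.group(2).chars().allMatch(c -> c == '0') || m.group(3).chars().allMatch(c -> c == '0')) {
            throw new IllegalArgumentException("All-zero trace or parent id: " + header);
        }
        int flags = Integer.parseInt(m.group(4), 16);
        // The lowest flag bit carries the sampling decision.
        return new TraceParent(m.group(1), m.group(2), m.group(3), (flags & 0x01) != 0);
    }

    public static void main(String[] args) {
        TraceParent tp = parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println(tp.traceId() + " sampled=" + tp.sampled());
    }
}
```

The third field is the ID of the calling span, which is why each hop in the diagram sends a different value there while the trace ID stays the same.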

Micrometer Tracing (Spring Boot 3)

Micrometer Tracing replaces Spring Cloud Sleuth in Boot 3. It provides a tracing API; you choose a bridge implementation on the classpath.

Bridge	Backend	Dependency
Brave	Zipkin (default in many Boot apps)	micrometer-tracing-bridge-brave + zipkin-reporter-brave
OpenTelemetry	OTLP collector → Jaeger, Tempo, vendor APM	micrometer-tracing-bridge-otel + OTLP exporter

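# Option 1: Brave bridge reporting to Zipkin
management:
  tracing:
    sampling:
      probability: 0.1        # 10% in prod; 1.0 in dev (applies to either bridge)
  zipkin:
    tracing:
      endpoint: http://zipkin:9411/api/v2/spans

# Option 2: OpenTelemetry bridge exporting over OTLP
management:
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces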

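With a tracing bridge on the classpath, Boot instruments Spring MVC and WebFlux server requests, RestTemplate and WebClient instances built from the auto-configured builders, and, on recent 3.x versions, scheduled tasks. Kafka observation is opt-in (spring.kafka.template.observation-enabled and spring.kafka.listener.observation-enabled), and JDBC spans need an extra library such as datasource-micrometer. Custom spans: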

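@Service
class FraudCheckService {
  private final ObservationRegistry observations;

  // Constructor injection; without it the final field is never assigned
  FraudCheckService(ObservationRegistry observations) {
    this.observations = observations;
  }

  boolean check(Order order) {
    return Observation.createNotStarted("fraud.check", observations)
        .lowCardinalityKeyValue("payment.method", order.paymentMethod())
        .observe(() -> runRules(order));
  }
}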

📌 Version Note

Spring Cloud Sleuth 3.x is for Boot 2.x only. Migrating to Boot 3: remove Sleuth, add micrometer-tracing bridge, update config keys from spring.sleuth.* to management.tracing.*.

Baggage propagation

Baggage is key-value context that rides along with the trace across services—tenant ID, experiment flag, support tier. Unlike span tags, baggage is forwarded to downstream calls (with size limits).

management:
  tracing:
    baggage:
      remote-fields: tenant-id, experiment
      correlation:
        fields: tenant-id, experiment

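import brave.baggage.BaggageField;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import org.springframework.web.server.WebFilter;
import org.springframework.web.server.WebFilterChain;
import reactor.core.publisher.Mono;

// Brave-specific: BaggageField comes from the Brave bridge, not the Micrometer Tracing API
@Component
class TenantBaggageFilter implements WebFilter {
  @Override
  public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
    String tenant = exchange.getRequest().getHeaders().getFirst("X-Tenant-Id");
    if (tenant != null) {
      // updateValue returns false when no trace context is current on this thread,
      // so a span must already be in scope when this filter runs
      BaggageField.create("tenant-id").updateValue(tenant);
    }
    return chain.filter(exchange);
  }
}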

⚠ Pitfall

Baggage propagates on every outbound call—keep it small (tenant, locale). Never put PII or JWTs in baggage; it may leak to third-party services in the trace path.

🎯 Interview

"How do traces relate to logs?" — Same traceId appears in MDC log fields and trace UI. Jump from Grafana log line → Tempo/Zipkin trace → slow span. That's the observability golden path.


Logging best practices

Logs are the narrative of production—errors, audit events, and debug trails. Spring Boot defaults to SLF4J (API) + Logback (implementation). Structure them for machines (JSON) while keeping human-readable local dev output.

SLF4J + Logback vs Log4j2

Aspect	Logback (Boot default)	Log4j2
Setup	Zero config out of the box	Exclude Logback, add spring-boot-starter-log4j2
Async	AsyncAppender	AsyncLogger (LMAX disruptor—very fast)
When to switch	Most apps stay here	High-throughput logging, existing Log4j2 ecosystem

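@Slf4j
@Service
@RequiredArgsConstructor
class PaymentService {
  private final PaymentGateway gateway; // hypothetical client type, injected through the Lombok constructor

  void charge(String orderId, BigDecimal amount) {
    log.info("Charging order {} amount {}", orderId, amount);
    try {
      gateway.charge(amount);
    } catch (PaymentException ex) {
      // Expected failure: log the message only and let the caller decide what to do
      log.warn("Payment failed for order {}: {}", orderId, ex.getMessage());
      throw ex;
    }
  }
}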

Structured logging — JSON for aggregation

Plain text logs break parsers in ELK, Loki, or CloudWatch. JSON lines with fixed fields (timestamp, level, message, traceId) enable reliable queries.

implementation "net.logstash.logback:logstash-logback-encoder:7.4"

{"@timestamp":"2026-06-04T10:15:00.123Z","level":"INFO","logger":"c.a.OrderService",
 "message":"Order created","orderId":"ord_42","traceId":"4bf92f3577b34da6a3ce929d0e0e4736",
 "spanId":"00f067aa0ba902b7","tenant-id":"acme"}

MDC — Mapped Diagnostic Context

MDC is a per-thread (or per-reactive-context) map injected into every log line. Populate at request entry: request ID, user, trace ID—clear on exit to avoid leaks in thread pools.

@Component
class RequestMdcFilter extends OncePerRequestFilter {
  @Override
  protected void doFilterInternal(HttpServletRequest req, HttpServletResponse res,
                                  FilterChain chain) throws ServletException, IOException {
    String requestId = req.getHeader("X-Request-Id");
    try {
      MDC.put("requestId", requestId != null ? requestId : UUID.randomUUID().toString());
      MDC.put("userId", resolveUserId(req).orElse("anonymous"));
      // traceId/spanId auto-populated by Micrometer Tracing when correlation enabled
      chain.doFilter(req, res);
    } finally {
      // Remove only the keys set here; MDC.clear() would also wipe entries owned by other filters
      MDC.remove("requestId");
      MDC.remove("userId");
    }
  }
}
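
The finally block in the filter above matters because servlet containers reuse worker threads. The following JDK-only sketch uses a plain ThreadLocal as a stand-in for MDC (which is likewise bound to the executing thread) to show what a second request would see when the first one does not clean up. The class and method names are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Plain ThreadLocal stand-in for MDC; demonstrates stale context on a pooled thread. */
final class MdcLeakDemo {
    private static final ThreadLocal<String> USER = new ThreadLocal<>();

    private MdcLeakDemo() {}

    /** Runs two "requests" on one pooled thread and returns what the second one sees. */
    static String secondRequestSees(boolean cleanUp) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Runnable firstRequest = () -> {
            USER.set("alice");
            try {
                // ... handle request 1 ...
            } finally {
                if (cleanUp) {
                    USER.remove();
                }
            }
        };
        // Request 2 never sets a user, yet it runs on the same worker thread.
        Callable<String> secondRequest = () -> String.valueOf(USER.get());
        try {
            pool.submit(firstRequest).get();
            return pool.submit(secondRequest).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("without cleanup: " + secondRequestSees(false)); // alice
        System.out.println("with cleanup:    " + secondRequestSees(true));  // null
    }
}
```

Without cleanup, log lines from an anonymous request would be attributed to the previous user.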

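For WebFlux, MDC does not follow a request across reactive threads on its own. Enable Reactor's automatic context propagation (spring.reactor.context-propagation=auto on Boot 3.2+, or Hooks.enableAutomaticContextPropagation()) so the current observation, and with it the traceId and spanId MDC entries, is restored on each thread hop. Custom MDC keys need their own ThreadLocalAccessor registered with ContextRegistry.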

Log levels

Level	Use
TRACE	Fine-grained flow (rare in prod)
DEBUG	Dev troubleshooting; enable per-package in prod briefly
INFO	Business events, startup, healthy state changes
WARN	Recoverable issues, deprecated usage, retry succeeded
ERROR	Failures requiring attention; include context, not stack traces for expected cases

💡 Pro Tip

Log at INFO for state changes auditors care about (order placed, refund issued). Log at DEBUG for developer detail. Never log secrets, full card numbers, or raw JWTs—mask or omit.
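
Masking is easiest to enforce when it happens in one place, before a value reaches a log statement. The helpers below are a minimal sketch with hypothetical names. The "keep the last four digits" rule is a common convention assumed here, not a compliance recommendation.

```java
/** Hypothetical masking helpers to call before a value reaches a log statement. */
final class LogMasking {
    private LogMasking() {}

    /** Keeps the last four digits of a card number and stars out the rest. */
    static String maskCard(String pan) {
        String digits = pan.replaceAll("\\D", "");
        if (digits.length() < 4) {
            return "****"; // too short to reveal anything safely
        }
        return "*".repeat(digits.length() - 4) + digits.substring(digits.length() - 4);
    }

    /** Replaces a credential with a fixed marker so neither content nor length leaks. */
    static String maskAuthorization(String headerValue) {
        boolean bearer = headerValue.regionMatches(true, 0, "Bearer ", 0, 7);
        return bearer ? "Bearer [REDACTED]" : "[REDACTED]";
    }

    public static void main(String[] args) {
        System.out.println(maskCard("4111 1111 1111 1111"));        // ************1111
        System.out.println(maskAuthorization("Bearer eyJhbGciOi")); // Bearer [REDACTED]
    }
}
```

A call site would then read `log.info("Charging card {}", LogMasking.maskCard(cardNumber))`, so the raw value never enters the logging pipeline.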

Async logging appender

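Synchronous logging blocks the calling thread on appender I/O. An async appender hands events to a bounded queue drained by a background thread, which matters on latency-sensitive paths. The trade-off is what happens when the queue fills: by default Logback's AsyncAppender starts discarding TRACE, DEBUG, and INFO events once less than 20% of the queue is free, and blocks callers when it is completely full. Setting discardingThreshold to 0 keeps every event; setting neverBlock to true drops events instead of blocking.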
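
An async appender's buffer is bounded, so when it fills the appender must either make the caller wait or lose events. The sketch below models only that hand-off with a JDK queue. It is a simplified illustration with hypothetical names, and Logback's real AsyncAppender does considerably more.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified model of an async appender's hand-off queue (drop-when-full variant). */
final class AsyncLogBuffer {
    private final BlockingQueue<String> queue;
    private final AtomicInteger dropped = new AtomicInteger();

    AsyncLogBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Non-blocking hand-off: the caller never waits, but the event is lost when the queue is full. */
    boolean offer(String event) {
        boolean accepted = queue.offer(event);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** What a background writer thread would call in a loop; returns null when the queue is empty. */
    String poll() {
        return queue.poll();
    }

    int dropped() {
        return dropped.get();
    }

    public static void main(String[] args) {
        AsyncLogBuffer buffer = new AsyncLogBuffer(2);
        buffer.offer("event-1");
        buffer.offer("event-2");
        System.out.println(buffer.offer("event-3")); // false: queue full and nothing is draining it
        System.out.println(buffer.dropped());        // 1
    }
}
```

Swapping `queue.offer` for `queue.put` gives the opposite policy: no events are lost, but the request thread stalls until the writer catches up.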

Logback configuration — logback-spring.xml

Use logback-spring.xml (not logback.xml) so Boot can apply springProfile extensions.

<configuration>
  <include resource="org/springframework/boot/logging/logback/defaults.xml"/>

  <springProfile name="local">
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
      <encoder>
        <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} trace=%X{traceId} - %msg%n</pattern>
      </encoder>
    </appender>
    <root level="INFO">
      <appender-ref ref="CONSOLE"/>
    </root>
    <logger name="com.acme" level="DEBUG"/>
  </springProfile>

  <springProfile name="prod">
    <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
      <encoder class="net.logstash.logback.encoder.LogstashEncoder">
        <includeMdcKeyName>traceId</includeMdcKeyName>
        <includeMdcKeyName>spanId</includeMdcKeyName>
        <includeMdcKeyName>requestId</includeMdcKeyName>
        <includeMdcKeyName>userId</includeMdcKeyName>
      </encoder>
    </appender>
    <appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
      <queueSize>8192</queueSize>
      <discardingThreshold>0</discardingThreshold>
      <appender-ref ref="JSON"/>
    </appender>
    <root level="INFO">
      <appender-ref ref="ASYNC"/>
    </root>
  </springProfile>
</configuration>

# application-prod.yml
logging:
  level:
    root: INFO
    com.acme: INFO
    org.springframework.web: WARN
    org.hibernate.SQL: WARN

🌍 Real World

In Kubernetes, log to stdout only and let the node agent (Fluent Bit, Promtail) ship to your aggregator. Rotate files only for bare-metal legacy deploys. Label pods with app and version so Loki/ELK queries can filter on them alongside the traceId in each log line.

Health & readiness

Actuator's health endpoint tells orchestrators whether your pod should receive traffic. Since Boot 2.3, liveness and readiness are exposed as separate probes: Kubernetes restarts the container when liveness fails and removes the pod from load balancing when readiness fails.

Probe	Question	K8s action if fail
Liveness	Is the process alive (not deadlocked)?	Restart container
Readiness	Can it accept traffic (DB up, cache warm)?	Remove from Service endpoints
Startup (optional)	Has slow initialization finished?	Hold liveness until pass

management:
  endpoint:
    health:
      probes:
        enabled: true
      show-details: when_authorized
  endpoints:
    web:
      exposure:
        include: health,prometheus

Endpoints:

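GET /actuator/health/liveness: reports the application's LivenessState; UP unless the app marks itself broken
GET /actuator/health/readiness: reports ReadinessState by default; goes DOWN for dependency failures only if you add those indicators to the readiness group
GET /actuator/health: aggregate view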

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

Custom HealthIndicator

Register beans implementing HealthIndicator or ReactiveHealthIndicator (WebFlux). They appear under /actuator/health and affect readiness when grouped.

@Component
class PaymentGatewayHealth implements HealthIndicator {
  private final PaymentClient client;

  // Constructor injection; without it the final field is never assigned
  PaymentGatewayHealth(PaymentClient client) {
    this.client = client;
  }

  @Override
  public Health health() {
    try {
      client.ping();
      return Health.up().withDetail("gateway", "reachable").build();
    } catch (Exception ex) {
      return Health.down(ex).withDetail("gateway", "unreachable").build();
    }
  }
}

# Group only critical deps for readiness (application.yml).
# Health groups are configured through properties, not a registry customizer bean.
management:
  endpoint:
    health:
      group:
        readiness:
          include: readinessState,db,redis,paymentGatewayHealth

⚠ Pitfall

Do not put slow external checks on liveness—a flaky partner API would restart your pods in a loop. Liveness should be lightweight (JVM responsive). Degrade readiness when dependencies fail.
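
The reason group membership matters is that a group reports the most severe status among its members. The model below is a sketch of that idea in plain Java. The severity order (DOWN, OUT_OF_SERVICE, UP, UNKNOWN) is assumed to mirror Boot's default and is configurable in a real application; the class itself is hypothetical.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Illustrative model of health-group aggregation; not Spring Boot's implementation. */
final class HealthGroupSketch {
    // Declared most severe first, so the enum's natural order doubles as the severity order.
    enum Status { DOWN, OUT_OF_SERVICE, UP, UNKNOWN }

    private HealthGroupSketch() {}

    /** A group reports the most severe status among the contributors it includes. */
    static Status aggregate(Map<String, Status> contributors, List<String> groupMembers) {
        return groupMembers.stream()
                .map(contributors::get)
                .filter(Objects::nonNull)
                .min(Comparator.naturalOrder())
                .orElse(Status.UNKNOWN);
    }

    public static void main(String[] args) {
        Map<String, Status> all = Map.of(
                "livenessState", Status.UP,
                "readinessState", Status.UP,
                "db", Status.UP,
                "paymentGatewayHealth", Status.DOWN);
        // Liveness ignores the gateway, so the pod is not restarted.
        System.out.println(aggregate(all, List.of("livenessState")));                                // UP
        // Readiness includes it, so the pod is taken out of rotation.
        System.out.println(aggregate(all, List.of("readinessState", "db", "paymentGatewayHealth"))); // DOWN
    }
}
```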

Graceful shutdown

On SIGTERM (K8s pod deletion, rolling deploy), stop accepting new work, drain in-flight requests, then exit. Boot 2.3+ supports graceful shutdown for embedded web servers.

server:
  shutdown: graceful

spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s

sequenceDiagram
  participant K8s as Kubernetes
  participant Pod as Spring Boot Pod
  participant LB as Load Balancer
  K8s->>Pod: SIGTERM
  Pod->>LB: Readiness DOWN (stop new traffic)
  Pod->>Pod: Complete in-flight requests (≤30s)
  Pod->>Pod: Close connections, flush logs
  Pod->>K8s: Process exit 0

Combine with:

terminationGracePeriodSeconds on the pod ≥ shutdown timeout + buffer
Optional preStop hook delay so the pod is removed from Service endpoints before the application receives SIGTERM
Kafka consumers: @PreDestroy or SmartLifecycle to commit offsets cleanly
Batch jobs: let running steps finish or use Spring Batch restart semantics
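
The first two items can be turned into a quick budget check. Kubernetes counts preStop time against the grace period, so it has to be added to Spring's drain timeout. The 5 s preStop sleep and 10 s buffer below are assumed values for illustration, and the class name is hypothetical.

```java
import java.time.Duration;

/** Worked arithmetic for the pod grace period; the 5 s and 10 s figures are assumptions. */
final class GracePeriodBudget {
    private GracePeriodBudget() {}

    /** Minimum terminationGracePeriodSeconds: preStop delay + Spring's drain timeout + safety buffer. */
    static Duration minimumGracePeriod(Duration preStopDelay, Duration shutdownPhaseTimeout, Duration buffer) {
        return preStopDelay.plus(shutdownPhaseTimeout).plus(buffer);
    }

    public static void main(String[] args) {
        Duration required = minimumGracePeriod(
                Duration.ofSeconds(5),   // assumed preStop sleep
                Duration.ofSeconds(30),  // spring.lifecycle.timeout-per-shutdown-phase from the config above
                Duration.ofSeconds(10)); // assumed buffer for log flush and JVM exit
        System.out.println(required.toSeconds() + "s"); // 45s
    }
}
```

The Kubernetes default grace period is 30 seconds, so with these numbers the pod spec would need an explicit, larger value. The timeout applies per shutdown phase, so applications with several slow phases may need more headroom.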

@Slf4j // Lombok supplies the log field used below
@Component
class DrainOnShutdown {
  @EventListener
  void onShutdown(ContextClosedEvent event) {
    log.info("Shutdown initiated, draining work");
    // stop schedulers, close custom thread pools
  }
}
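
For the "close custom thread pools" step, the usual JDK pattern is to stop intake, wait for in-flight tasks up to a deadline, and only then force termination. The helper below is a minimal sketch with hypothetical names. A shutdown listener like the one above could call it with a timeout shorter than the shutdown phase.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative drain helper for a custom thread pool. */
final class PoolDrainer {
    private PoolDrainer() {}

    /** Returns the number of queued tasks abandoned because the timeout elapsed. */
    static int drain(ExecutorService pool, long timeout, TimeUnit unit) {
        pool.shutdown(); // reject new tasks, keep running what was already submitted
        try {
            if (pool.awaitTermination(timeout, unit)) {
                return 0; // everything finished in time
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        List<Runnable> abandoned = pool.shutdownNow(); // interrupt workers, collect never-started tasks
        return abandoned.size();
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < 4; i++) {
            pool.execute(() -> completed.incrementAndGet());
        }
        int abandoned = drain(pool, 5, TimeUnit.SECONDS);
        System.out.println("completed=" + completed.get() + " abandoned=" + abandoned);
    }
}
```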

🎯 Interview

"Liveness vs readiness?" — Liveness: restart if app is stuck. Readiness: stop sending traffic if app cannot serve (DB down). Wrong liveness check = restart storm; wrong readiness = 503s to users.

🔬 Under the Hood

Graceful shutdown registers with Spring's lifecycle: web server stops accepting connections first, then beans destroy in reverse dependency order. timeout-per-shutdown-phase forces exit if drain exceeds limit—balance with K8s terminationGracePeriodSeconds.