Observability & Production Readiness
Production Spring apps need three pillars: metrics (how fast and how much), traces (where time went across services), and logs (what happened with context). Boot wires Micrometer, Actuator, and tracing bridges so you instrument once and export to Prometheus, Zipkin, or your log aggregator.
Micrometer & metrics
Micrometer is a vendor-neutral metrics facade—like SLF4J for logging. You instrument with MeterRegistry once; Boot auto-configures exporters for Prometheus, Datadog, CloudWatch, and more via classpath dependencies.
flowchart LR
    App[Spring Boot app] --> MR[MeterRegistry]
    MR --> JVM[JVM / Tomcat / JDBC metrics]
    MR --> Custom[Your counters & timers]
    MR --> Prom[Prometheus registry]
    Prom --> EP["/actuator/prometheus"]
    EP --> Graf[Grafana dashboards]
implementation "org.springframework.boot:spring-boot-starter-actuator"
runtimeOnly "io.micrometer:micrometer-registry-prometheus"
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  metrics:
    tags:
      application: ${spring.application.name}
      # falls back to "default" when no profile is active
      environment: ${spring.profiles.active:default}
Boot already registers HTTP server metrics, JVM memory/GC, datasource pool stats, and cache metrics (from the Caching chapter). Add business metrics where SLIs are defined—orders placed, payments failed, queue depth.
MeterRegistry, Counter, Gauge, Timer, DistributionSummary
| Meter | Purpose | Example |
|---|---|---|
| Counter | Monotonically increasing count | orders.created, cache.misses |
| Gauge | Current value (up or down) | queue.size, active.sessions |
| Timer | Duration + count of events | http.server.requests, payment.latency |
| DistributionSummary | Distribution of values (not time) | payload.bytes, batch.size |
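import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.util.concurrent.atomic.AtomicInteger;
import org.springframework.stereotype.Service;

@Service
class OrderService {
    private final Counter ordersCreated;
    private final Timer checkoutTimer;
    // Kept as a field: the registry holds gauge state objects only weakly
    private final AtomicInteger pendingOrders = new AtomicInteger();

    OrderService(MeterRegistry registry) {
        ordersCreated = registry.counter("orders.created", "channel", "web");
        checkoutTimer = registry.timer("checkout.duration");
        registry.gauge("orders.pending", pendingOrders);
    }

    Order placeOrder(OrderRequest req) {
        // record(...) times the supplier and returns its result
        return checkoutTimer.record(() -> {
            pendingOrders.incrementAndGet();
            try {
                Order order = doCheckout(req);
                ordersCreated.increment();
                return order;
            } finally {
                pendingOrders.decrementAndGet();
            }
        });
    }
}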
@Timed and @Counted
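Micrometer also offers annotation-driven metrics: @Timed and @Counted are applied by the TimedAspect and CountedAspect beans, which need Spring AOP (spring-boot-starter-aop) on the classpath. Recent Boot 3.x releases can register these aspects for you when management.observations.annotations.enabled is true. Prefer explicit timers in hot paths where you need custom tags; use annotations for broad method coverage.
// Applied only when the call goes through the Spring proxy (not on self-invocation)
@Timed(value = "payment.charge", description = "Payment gateway round-trip")
@Counted(value = "payment.attempts")
public PaymentResult charge(ChargeRequest req) {
    return gateway.charge(req);
}

// Enable aspects (if not using Boot auto-config for them)
@Configuration
@EnableAspectJAutoProxy
class MetricsAspectConfig {
    @Bean
    TimedAspect timedAspect(MeterRegistry registry) {
        return new TimedAspect(registry);
    }

    // @Counted is ignored unless a CountedAspect is registered as well
    @Bean
    CountedAspect countedAspect(MeterRegistry registry) {
        return new CountedAspect(registry);
    }
}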
Custom tags — avoid high cardinality
Tags (labels) multiply time series: http.server.requests{uri="/users/123"} creates one series per user ID. Prometheus and Grafana choke on millions of unique label combinations.
| Safe tags | Unsafe tags (high cardinality) |
|---|---|
| method, status, outcome | userId, orderId, raw URL paths |
| exception (class name, not message) | stackTrace, SQL text |
| Normalized URI templates: /api/orders/{id} | Full query strings with unique params |
Boot's HTTP metrics use URI templates when possible. If you add custom tags in filters, never tag with unbounded values. Use logs or traces for per-request detail—not metric labels.
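If a custom filter has to derive a tag from a raw request path, collapsing identifier-like segments keeps the label set bounded. The following is a minimal sketch in plain Java; the class name UriTagNormalizer and the rule that numeric or UUID-shaped segments count as identifiers are assumptions made for illustration, not something Boot or Micrometer provides.

```java
import java.util.regex.Pattern;

final class UriTagNormalizer {
    // Assumption for this sketch: purely numeric or UUID-shaped segments are identifiers.
    private static final Pattern ID_SEGMENT = Pattern.compile(
            "\\d+|[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}");

    /** Returns a template-like path that is safe to use as a metric tag value. */
    static String normalize(String rawUri) {
        // Query strings are dropped entirely: they are the most common source of unique values.
        int queryStart = rawUri.indexOf('?');
        String path = queryStart >= 0 ? rawUri.substring(0, queryStart) : rawUri;

        StringBuilder template = new StringBuilder();
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the leading slash and any doubled slashes
            }
            boolean looksLikeId = ID_SEGMENT.matcher(segment).matches();
            template.append('/').append(looksLikeId ? "{id}" : segment);
        }
        return template.length() == 0 ? "/" : template.toString();
    }
}
```

With this rule, /users/123 and /users/456 both map to /users/{id}, so they share one time series instead of creating one per user.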
Prometheus integration
With micrometer-registry-prometheus, scrape GET /actuator/prometheus for text exposition format. Prometheus pulls metrics on an interval; your app does not push (unless you add Pushgateway for batch jobs).
scrape_configs:
  - job_name: order-service
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ["order-service:8080"]
Example scraped line:
# HELP http_server_requests_seconds Duration of HTTP server request handling
# TYPE http_server_requests_seconds histogram
http_server_requests_seconds_bucket{method="GET",uri="/api/orders/{id}",status="200",le="0.1"} 842.0
http_server_requests_seconds_count{method="GET",uri="/api/orders/{id}",status="200"} 1204.0
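The _bucket series are cumulative counts per upper bound (le), and latency percentiles on a dashboard are estimated from them. The sketch below shows one way that estimate can be computed, using linear interpolation inside the bucket that contains the target rank. It is similar in spirit to PromQL's histogram_quantile but is not Prometheus's implementation; the class and method names are hypothetical.

```java
final class HistogramQuantile {
    /**
     * @param quantile         target quantile in [0, 1], e.g. 0.95
     * @param upperBounds      ascending bucket upper bounds (the "le" label); the last may be +Infinity
     * @param cumulativeCounts cumulative observation count for each bucket
     */
    static double estimate(double quantile, double[] upperBounds, double[] cumulativeCounts) {
        if (upperBounds.length == 0 || upperBounds.length != cumulativeCounts.length) {
            throw new IllegalArgumentException("bounds and counts must be non-empty and equal length");
        }
        double total = cumulativeCounts[cumulativeCounts.length - 1];
        if (total <= 0) {
            return Double.NaN; // no observations yet
        }
        double rank = quantile * total;
        for (int i = 0; i < upperBounds.length; i++) {
            if (cumulativeCounts[i] < rank) {
                continue; // target rank lies in a later bucket
            }
            double lowerBound = i == 0 ? 0.0 : upperBounds[i - 1];
            if (Double.isInfinite(upperBounds[i])) {
                return lowerBound; // +Inf bucket: fall back to the highest finite bound
            }
            double lowerCount = i == 0 ? 0.0 : cumulativeCounts[i - 1];
            double inBucket = cumulativeCounts[i] - lowerCount;
            if (inBucket <= 0) {
                return lowerBound;
            }
            // Assume observations are spread evenly inside the bucket.
            return lowerBound + (upperBounds[i] - lowerBound) * (rank - lowerCount) / inBucket;
        }
        return Double.NaN;
    }
}
```

Because the estimate can never be finer than the bucket boundaries, it helps to have bucket bounds close to the latency objectives you alert on.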
Grafana dashboard basics
- RED method — Rate (requests/sec), Errors (5xx ratio), Duration (p50/p95/p99 latency) per service
- JVM panels — heap used, GC pause, thread count from jvm.* metrics
- USE for dependencies — Utilization, Saturation, Errors on DB pool (hikaricp.connections.*)
- Variables — $application, $environment from common tags
- Alerts — p99 latency > SLO, error rate > 1%, pod restarts—wire to PagerDuty/Slack
Import Grafana dashboard ID 4701 (JVM Micrometer) or 12900 (Spring Boot 2.1+ statistics) as a starting point—then customize for your SLIs. Correlate batch job metrics from Spring Batch with nightly run windows.
Distributed tracing
Metrics tell you that latency spiked; traces tell you which hop—API gateway, auth service, database, external payment API. A trace is a tree of spans, each with timing and metadata, propagated across HTTP and messaging boundaries.
sequenceDiagram
    participant C as Client
    participant A as API Service
    participant B as Payment Service
    participant D as Database
    C->>A: traceparent: 00-abc-span1
    A->>B: traceparent: 00-abc-span2
    B->>D: JDBC span
    B-->>A: response
    A-->>C: response
Trace, Span, and W3C TraceContext
| Concept | Description |
|---|---|
| Trace | End-to-end request journey; shared traceId |
| Span | Single operation (HTTP call, DB query, Kafka publish) with spanId, parent reference, duration |
| Propagation | Headers carry context to downstream services—W3C traceparent / tracestate |
// version-traceId-spanId-flags
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
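To make the header layout concrete, here is a small plain-Java sketch that parses a version-00 traceparent value and builds the header for a downstream call. In a real application the tracing bridge does this for you; the TraceParent class and its method names are hypothetical, and only the sampled flag bit is carried over.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TraceParent {
    // version "00", 32 hex chars of trace id, 16 hex chars of parent (span) id, 2 hex chars of flags
    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    final String traceId;
    final String parentId;
    final boolean sampled;

    private TraceParent(String traceId, String parentId, boolean sampled) {
        this.traceId = traceId;
        this.parentId = parentId;
        this.sampled = sampled;
    }

    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("not a version-00 traceparent: " + header);
        }
        String traceId = m.group(1);
        String parentId = m.group(2);
        if (traceId.matches("0+") || parentId.matches("0+")) {
            throw new IllegalArgumentException("all-zero ids are invalid: " + header);
        }
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) != 0; // lowest bit = sampled
        return new TraceParent(traceId, parentId, sampled);
    }

    /** Header for an outbound call: same trace id, the current span becomes the parent. */
    String forDownstream(String currentSpanId) {
        return "00-" + traceId + "-" + currentSpanId + "-" + (sampled ? "01" : "00");
    }
}
```

The trace id stays constant across every hop while the span id changes, which is what lets a trace UI stitch spans from different services into one tree.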
Micrometer Tracing (Spring Boot 3)
Micrometer Tracing replaces Spring Cloud Sleuth in Boot 3. It provides a tracing API; you choose a bridge implementation on the classpath.
| Bridge | Backend | Dependency |
|---|---|---|
| Brave | Zipkin (default in many Boot apps) | micrometer-tracing-bridge-brave + zipkin-reporter-brave |
| OpenTelemetry | OTLP collector → Jaeger, Tempo, vendor APM | micrometer-tracing-bridge-otel + OTLP exporter |
# Sampling plus Zipkin reporter
management:
  tracing:
    sampling:
      probability: 0.1 # 10% in prod; 1.0 in dev
  zipkin:
    tracing:
      endpoint: http://zipkin:9411/api/v2/spans
---
# OTLP exporter (separate snippet)
management:
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces
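Boot auto-instruments Spring MVC and WebFlux, plus RestTemplate and WebClient instances built from the auto-configured builders. Kafka observations have to be switched on (spring.kafka.template.observation-enabled and spring.kafka.listener.observation-enabled), scheduled-task observations arrive with Spring Framework 6.1 (Boot 3.2), and JDBC spans need an additional library such as datasource-micrometer. Custom spans: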
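@Service
class FraudCheckService {
    private final ObservationRegistry observations;

    // Constructor injection; without it the final field is never assigned
    FraudCheckService(ObservationRegistry observations) {
        this.observations = observations;
    }

    boolean check(Order order) {
        // Produces a timer and, with a tracing bridge present, a span named "fraud.check"
        return Observation.createNotStarted("fraud.check", observations)
                .lowCardinalityKeyValue("payment.method", order.paymentMethod())
                .observe(() -> runRules(order));
    }
}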
Spring Cloud Sleuth 3.x is for Boot 2.x only. Migrating to Boot 3: remove Sleuth, add micrometer-tracing bridge, update config keys from spring.sleuth.* to management.tracing.*.
Baggage propagation
Baggage is key-value context that rides along with the trace across services—tenant ID, experiment flag, support tier. Unlike span tags, baggage is forwarded to downstream calls (with size limits).
management:
  tracing:
    baggage:
      remote-fields: tenant-id, experiment
      correlation:
        fields: tenant-id, experiment
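@Component
class TenantBaggageFilter implements WebFilter {
    @Override
    public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
        String tenant = exchange.getRequest().getHeaders().getFirst("X-Tenant-Id");
        if (tenant != null) {
            // Brave-specific API (brave.baggage.BaggageField). It updates the current
            // trace context and returns false when none is active on this thread,
            // so this filter must run after tracing has started.
            BaggageField.create("tenant-id").updateValue(tenant);
        }
        return chain.filter(exchange);
    }
}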
Baggage propagates on every outbound call—keep it small (tenant, locale). Never put PII or JWTs in baggage; it may leak to third-party services in the trace path.
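One way to enforce that discipline is an explicit allow-list applied before values are written to baggage. The sketch below is plain Java and independent of any tracing library; BaggageAllowList, its constructor parameters, and the length cap are all hypothetical choices for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class BaggageAllowList {
    private final Set<String> allowedKeys;
    private final int maxValueLength;

    BaggageAllowList(Set<String> allowedKeys, int maxValueLength) {
        this.allowedKeys = allowedKeys;
        this.maxValueLength = maxValueLength;
    }

    /** Keeps only allow-listed keys whose values are present and short enough to forward. */
    Map<String, String> filter(Map<String, String> candidates) {
        Map<String, String> forwarded = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : candidates.entrySet()) {
            String value = entry.getValue();
            boolean allowed = allowedKeys.contains(entry.getKey());
            boolean smallEnough = value != null && value.length() <= maxValueLength;
            if (allowed && smallEnough) {
                forwarded.put(entry.getKey(), value);
            }
        }
        return forwarded;
    }
}
```

Anything not named in the allow-list, such as an Authorization header value, is dropped by default rather than relying on every caller to remember the rule.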
"How do traces relate to logs?" — Same traceId appears in MDC log fields and trace UI. Jump from Grafana log line → Tempo/Zipkin trace → slow span. That's the observability golden path.
Logging best practices
Logs are the narrative of production—errors, audit events, and debug trails. Spring Boot defaults to SLF4J (API) + Logback (implementation). Structure them for machines (JSON) while keeping human-readable local dev output.
SLF4J + Logback vs Log4j2
| | Logback (Boot default) | Log4j2 |
|---|---|---|
| Setup | Zero config out of the box | Exclude Logback, add spring-boot-starter-log4j2 |
| Async | AsyncAppender | AsyncLogger (LMAX disruptor—very fast) |
| When to switch | Most apps stay here | High-throughput logging, existing Log4j2 ecosystem |
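@Slf4j // Lombok generates the "log" field
@Service
class PaymentService {
    // PaymentGateway is a placeholder type for this example
    private final PaymentGateway gateway;

    PaymentService(PaymentGateway gateway) {
        this.gateway = gateway;
    }

    void charge(String orderId, BigDecimal amount) {
        // Parameterized messages avoid string concatenation when the level is disabled
        log.info("Charging order {} amount {}", orderId, amount);
        try {
            gateway.charge(amount);
        } catch (PaymentException ex) {
            // Expected failure: log the message only, then let the caller decide
            log.warn("Payment failed for order {}: {}", orderId, ex.getMessage());
            throw ex;
        }
    }
}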
Structured logging — JSON for aggregation
Plain text logs break parsers in ELK, Loki, or CloudWatch. JSON lines with fixed fields (timestamp, level, message, traceId) enable reliable queries.
implementation "net.logstash.logback:logstash-logback-encoder:7.4"
{"@timestamp":"2026-06-04T10:15:00.123Z","level":"INFO","logger":"c.a.OrderService",
"message":"Order created","orderId":"ord_42","traceId":"4bf92f3577b34da6",
"spanId":"00f067aa0ba902b7","tenant-id":"acme"}
MDC — Mapped Diagnostic Context
MDC is a per-thread (or per-reactive-context) map injected into every log line. Populate at request entry: request ID, user, trace ID—clear on exit to avoid leaks in thread pools.
@Component
class RequestMdcFilter extends OncePerRequestFilter {
    @Override
    protected void doFilterInternal(HttpServletRequest req, HttpServletResponse res,
                                    FilterChain chain) throws ServletException, IOException {
        String requestId = req.getHeader("X-Request-Id");
        try {
            MDC.put("requestId", requestId != null ? requestId : UUID.randomUUID().toString());
            MDC.put("userId", resolveUserId(req).orElse("anonymous"));
            // traceId/spanId auto-populated by Micrometer Tracing when correlation enabled
            chain.doFilter(req, res);
        } finally {
            // Remove only the keys set here; MDC.clear() would also wipe traceId/spanId
            MDC.remove("requestId");
            MDC.remove("userId");
        }
    }
}
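The reason for cleaning up on exit is easiest to see with a pooled thread. The following plain-Java sketch uses a ThreadLocal map as a stand-in for SLF4J's MDC (an assumption made so the example runs without a logging library) and shows what a second, anonymous request sees when the first one does or does not clean up.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class MdcLeakDemo {
    // Stand-in for SLF4J's MDC: one mutable map per thread.
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);

    /**
     * Runs two "requests" on the same pooled thread and returns the userId the
     * second request finds in its context, although it never set one itself.
     */
    static String userSeenByNextRequest(boolean cleanUpInFinally) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Runnable firstRequest = () -> {
                CONTEXT.get().put("userId", "alice");
                try {
                    // ... handle request 1 ...
                } finally {
                    if (cleanUpInFinally) {
                        CONTEXT.get().remove("userId");
                    }
                }
            };
            Callable<String> secondRequest = () -> CONTEXT.get().get("userId");

            pool.submit(firstRequest).get();
            return pool.submit(secondRequest).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Without cleanup the second request is logged as alice; with cleanup its context is empty. Servlet containers reuse worker threads in the same way, which is why the filter resets its keys in a finally block.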
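For WebFlux, MDC is not carried across reactive threads on its own. Enable Reactor's automatic context propagation (Hooks.enableAutomaticContextPropagation(), or spring.reactor.context-propagation=auto in Boot 3.2+) so Micrometer's context-propagation library restores the trace fields from the Reactor Context. Custom MDC keys need their own ThreadLocalAccessor registered with ContextRegistry.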
Log levels
| Level | Use |
|---|---|
| TRACE | Fine-grained flow (rare in prod) |
| DEBUG | Dev troubleshooting; enable per-package in prod briefly |
| INFO | Business events, startup, healthy state changes |
| WARN | Recoverable issues, deprecated usage, retry succeeded |
| ERROR | Failures requiring attention; include context, not stack traces for expected cases |
Log at INFO for state changes auditors care about (order placed, refund issued). Log at DEBUG for developer detail. Never log secrets, full card numbers, or raw JWTs—mask or omit.
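Masking can be applied as a last line of defense before a message reaches an appender. The sketch below is deliberately simple: the LogMasker class is hypothetical, and its two patterns (16 contiguous digits, and three dot-separated base64url segments starting with eyJ) are rough heuristics rather than a complete detector for card numbers or tokens.

```java
import java.util.regex.Pattern;

final class LogMasker {
    // Heuristic: exactly 16 contiguous digits; the last four are kept for support lookups.
    private static final Pattern CARD_NUMBER = Pattern.compile("\\b\\d{12}(\\d{4})\\b");
    // Heuristic: JWTs are three base64url segments and the header usually starts with "eyJ".
    private static final Pattern JWT =
            Pattern.compile("\\beyJ[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+");

    static String mask(String message) {
        String masked = CARD_NUMBER.matcher(message).replaceAll("************$1");
        return JWT.matcher(masked).replaceAll("[REDACTED_JWT]");
    }
}
```

Not logging the value in the first place remains the safer option; a masker like this only catches what its patterns anticipate.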
Async logging appender
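Synchronous logging makes the calling thread wait for the appender's I/O. An async appender hands events to a bounded in-memory queue that a background thread drains, which matters on latency-sensitive paths. With Logback's AsyncAppender, a discardingThreshold of 0 keeps every event, so a full queue blocks callers unless neverBlock is set to true.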
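The mechanism can be sketched in plain Java as a bounded queue plus one worker thread. This is not Logback's AsyncAppender; AsyncLogSink is a hypothetical class, it drops events when the queue is full (comparable to running Logback with neverBlock enabled), and an in-memory list stands in for the slow destination.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.TimeUnit;

final class AsyncLogSink implements AutoCloseable {
    private final BlockingQueue<String> queue;
    private final List<String> written = new CopyOnWriteArrayList<>(); // stand-in for a file or socket
    private final Thread worker;
    private volatile boolean closed;

    AsyncLogSink(int queueSize) {
        this.queue = new ArrayBlockingQueue<>(queueSize);
        this.worker = new Thread(this::drainLoop, "async-log-worker");
        this.worker.setDaemon(true);
        this.worker.start();
    }

    /** Never blocks the caller; returns false when the event was dropped. */
    boolean log(String event) {
        return !closed && queue.offer(event);
    }

    private void drainLoop() {
        try {
            while (true) {
                String event = queue.poll(20, TimeUnit.MILLISECONDS);
                if (event != null) {
                    written.add(event); // the slow I/O happens here, off the request thread
                } else if (closed) {
                    return; // queue is empty and no more events will arrive
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Stops accepting events and waits until everything already queued is written. */
    @Override
    public void close() {
        closed = true;
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    List<String> written() {
        return written;
    }
}
```

The trade-off is visible in log(): the caller never waits, but a burst larger than the queue loses events. Flushing on close matters for the same reason during shutdown.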
Logback configuration — logback-spring.xml
Use logback-spring.xml (not logback.xml) so Boot can apply springProfile extensions.
<configuration>
    <include resource="org/springframework/boot/logging/logback/defaults.xml"/>

    <springProfile name="local">
        <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
            <encoder>
                <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} trace=%X{traceId} - %msg%n</pattern>
            </encoder>
        </appender>
        <root level="INFO">
            <appender-ref ref="CONSOLE"/>
        </root>
        <logger name="com.acme" level="DEBUG"/>
    </springProfile>

    <springProfile name="prod">
        <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
            <encoder class="net.logstash.logback.encoder.LogstashEncoder">
                <includeMdcKeyName>traceId</includeMdcKeyName>
                <includeMdcKeyName>spanId</includeMdcKeyName>
                <includeMdcKeyName>requestId</includeMdcKeyName>
                <includeMdcKeyName>userId</includeMdcKeyName>
            </encoder>
        </appender>
        <appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
            <queueSize>8192</queueSize>
            <discardingThreshold>0</discardingThreshold>
            <appender-ref ref="JSON"/>
        </appender>
        <root level="INFO">
            <appender-ref ref="ASYNC"/>
        </root>
    </springProfile>
</configuration>
# application-prod.yml
logging:
  level:
    root: INFO
    com.acme: INFO
    org.springframework.web: WARN
    org.hibernate.SQL: WARN
In Kubernetes, log to stdout only and let the node agent (Fluent Bit, Promtail) ship logs to your aggregator. Rotate files only for bare-metal legacy deploys. Have the agent attach pod labels such as app and version so Loki/ELK queries can filter on them alongside the logged traceId.
Health & readiness
Actuator's health endpoint tells orchestrators whether your pod should receive traffic. Since Boot 2.3, liveness and readiness are exposed as separate health groups; Kubernetes uses the first to decide when to restart a container and the second to decide when to remove it from load balancing.
| Probe | Question | K8s action if fail |
|---|---|---|
| Liveness | Is the process alive (not deadlocked)? | Restart container |
| Readiness | Can it accept traffic (DB up, cache warm)? | Remove from Service endpoints |
| Startup (optional) | Has slow initialization finished? | Hold liveness until pass |
management:
  endpoint:
    health:
      probes:
        enabled: true
      show-details: when_authorized
  endpoints:
    web:
      exposure:
        include: health,prometheus
Endpoints:
- GET /actuator/health/liveness — always UP unless app is broken
- GET /actuator/health/readiness — DOWN when dependencies unavailable
- GET /actuator/health — aggregate view
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
Custom HealthIndicator
Register beans implementing HealthIndicator or ReactiveHealthIndicator (WebFlux). They appear under /actuator/health and affect readiness when grouped.
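@Component
class PaymentGatewayHealth implements HealthIndicator {
    private final PaymentClient client;

    // Constructor injection; without it the final field is never assigned
    PaymentGatewayHealth(PaymentClient client) {
        this.client = client;
    }

    @Override
    public Health health() {
        try {
            client.ping();
            return Health.up().withDetail("gateway", "reachable").build();
        } catch (Exception ex) {
            // down(ex) records the exception type and message as a detail
            return Health.down(ex).withDetail("gateway", "unreachable").build();
        }
    }
}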
# application.yml: group only critical deps for readiness.
# Health groups are defined through configuration properties.
management:
  endpoint:
    health:
      group:
        readiness:
          include: readinessState,db,redis,paymentGatewayHealth
Do not put slow external checks on liveness—a flaky partner API would restart your pods in a loop. Liveness should be lightweight (JVM responsive). Degrade readiness when dependencies fail.
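The effect of grouping can be illustrated with a small model of status aggregation. This plain-Java sketch assumes the most severe status among the included contributors wins, in the order DOWN, OUT_OF_SERVICE, UP, UNKNOWN, which mirrors Spring Boot's default status order; HealthGroups and the contributor names are hypothetical and this is not Actuator's code.

```java
import java.util.Map;
import java.util.Set;

final class HealthGroups {
    // Declared most severe first, so a lower ordinal means a worse status.
    enum Status { DOWN, OUT_OF_SERVICE, UP, UNKNOWN }

    /** Returns the most severe status among the contributors named in the group. */
    static Status aggregate(Map<String, Status> contributors, Set<String> include) {
        Status worst = null;
        for (Map.Entry<String, Status> entry : contributors.entrySet()) {
            if (!include.contains(entry.getKey())) {
                continue; // contributors outside the group cannot affect it
            }
            Status status = entry.getValue();
            if (worst == null || status.ordinal() < worst.ordinal()) {
                worst = status;
            }
        }
        return worst == null ? Status.UNKNOWN : worst;
    }
}
```

A failing partner API drags down any group that includes it and leaves other groups untouched, which is the argument for keeping external checks out of liveness.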
Graceful shutdown
On SIGTERM (K8s pod deletion, rolling deploy), stop accepting new work, drain in-flight requests, then exit. Boot 2.3+ supports graceful shutdown for embedded web servers.
server:
  shutdown: graceful
spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s
sequenceDiagram
    participant K8s as Kubernetes
    participant Pod as Spring Boot Pod
    participant LB as Load Balancer
    K8s->>Pod: SIGTERM
    Pod->>LB: Readiness DOWN (stop new traffic)
    Pod->>Pod: Complete in-flight requests (≤30s)
    Pod->>Pod: Close connections, flush logs
    Pod->>K8s: Process exit 0
Combine with:
- terminationGracePeriodSeconds on the pod ≥ shutdown timeout + buffer
- Optional preStop hook delay (for example, a short sleep) so endpoint removal propagates to load balancers before the app stops accepting connections
- Kafka consumers: @PreDestroy or SmartLifecycle to commit offsets cleanly
- Batch jobs: let running steps finish or use Spring Batch restart semantics
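@Slf4j // Lombok generates the "log" field used below
@Component
class DrainOnShutdown {
    // Fired when the application context starts closing
    @EventListener
    void onShutdown(ContextClosedEvent event) {
        log.info("Shutdown initiated, draining work");
        // stop schedulers, close custom thread pools
    }
}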
"Liveness vs readiness?" — Liveness: restart if app is stuck. Readiness: stop sending traffic if app cannot serve (DB down). Wrong liveness check = restart storm; wrong readiness = 503s to users.
Graceful shutdown registers with Spring's lifecycle: the web server stops accepting connections first, then beans are destroyed in reverse dependency order. timeout-per-shutdown-phase caps how long each phase waits for in-flight work; once it elapses, shutdown proceeds regardless, so keep it below the pod's terminationGracePeriodSeconds.
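The same stop, drain, then force pattern applies to thread pools you own. The following is a minimal plain-Java sketch using java.util.concurrent; GracefulDrain is a hypothetical helper, and the timeout you pass would normally be derived from the shutdown phase budget.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

final class GracefulDrain {
    /**
     * Stops accepting new tasks, waits up to the timeout for queued and running
     * tasks to finish, and interrupts whatever is left after that.
     *
     * @return true if all work completed within the timeout
     */
    static boolean drain(ExecutorService pool, Duration timeout) {
        pool.shutdown(); // reject new tasks; already submitted ones still run
        try {
            if (pool.awaitTermination(timeout.toMillis(), TimeUnit.MILLISECONDS)) {
                return true;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        pool.shutdownNow(); // timeout or interrupt: cancel remaining work
        return false;
    }
}
```

A helper like this could be called from a ContextClosedEvent listener or a @PreDestroy method; a false return is worth logging because it means work was cut off.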