Observability in Distributed Systems
Twelve services handle one checkout click—when latency spikes at 2 a.m., you cannot SSH into a single JVM and tail a file. Observability exports logs, metrics, and traces so you can ask arbitrary questions about production: where time went, which service errored, and what one request did—joined by correlation IDs and SLOs that turn on-call from guesswork into math.
Observability vs monitoring
Monitoring reacts to known failure modes; observability lets you investigate unknown failures by exploring high-cardinality telemetry you did not predefine dashboards for.
Monitoring is reactive and dashboard-driven: CPU above 90%, disk full, health check failed, queue depth over threshold. You anticipated the failure modes, wired alerts, and on-call knows what each page means. It works when the system is stable and failure shapes are finite.
Observability is proactive investigation: the ability to understand internal system state from exported outputs without redeploying debug code. Microservices multiply unknowns—partial degradation, retry storms, one slow dependency in a fan-out graph, tenant-specific routing bugs. After deploy you will ask questions you did not plan for: “Why did only EU tenants see 503 on checkout between 03:12 and 03:18?” Observability assumes those questions and exports enough context—logs, metrics, traces—to answer them.
| Dimension | Monitoring | Observability |
|---|---|---|
| Mindset | Known unknowns — alert when X | Unknown unknowns — explore why |
| Cardinality | Low — aggregates only | High — per-request, per-trace drill-down |
| Primary tool | Threshold alerts on metrics | Ad-hoc queries across signals |
| On-call use | “Something broke — run runbook” | “Users hurt — find root cause fast” |
Mature platforms combine both: SLO-based alerts (monitoring) plus trace and log correlation (observability). Click a spike in Grafana → drill to exemplar trace → jump to correlated JSON log lines—a workflow impossible with metrics alone.
Define observability as the ability to understand internal state from external outputs. A typical story: a support ticket carries a trace_id, the Jaeger waterfall points at a slow Inventory DB query, and the root cause turns out to be a missing index. Monitoring alone would only have shown elevated p99.
The three pillars — logs, metrics, traces
Each signal answers different questions. Using only one leaves blind spots during incidents; all three together form a complete picture.
Logs are discrete, timestamped events—narrative records of what happened on one code path. They carry arbitrary context: order IDs, error stack traces, business decisions. Strength: forensic detail for a single request. Weakness: volume and cost—grep does not scale without indexing; high-cardinality search is expensive.
Metrics are numeric aggregates over time—counters, gauges, histograms with low-cardinality labels. They answer “how much” and “how fast” at fleet scale: requests per second, error ratio, p99 latency. Strength: cheap storage, long retention, fast alerting. Weakness: you cannot metric every user ID—aggregates hide individual failures.
Traces are trees of spans showing causality and timing across services—one trace follows one request end-to-end. They answer “where did the time go?” and “which hop failed?” Strength: critical path across ten services. Weakness: sampling required at scale; instrumentation effort; async gaps without careful propagation.
| Signal | Best for | Typical store | Alerting? |
|---|---|---|---|
| Logs | Debug one request, audit trail, error context | Elasticsearch, Loki, CloudWatch Logs | Rarely — too noisy |
| Metrics | SLOs, capacity, fleet health, burn rate | Prometheus, Datadog, CloudWatch Metrics | Primary alert source |
| Traces | Latency breakdown, dependency map, root cause | Jaeger, Tempo, Zipkin, Honeycomb | Via derived metrics or tail sampling |
flowchart LR
APP[Microservices] --> OTEL[OpenTelemetry SDK]
OTEL --> LOG[Log backend]
OTEL --> PROM[Prometheus]
OTEL --> TRACE[Trace backend]
PROM --> GRAF[Grafana]
TRACE --> GRAF
LOG --> GRAF
Why all three: metrics fire the alert (“checkout p99 > 2s”); traces show the slow span (Payment provider timeout); logs explain the business context (Stripe returned 402, card declined). Without metrics you discover outages from Twitter. Without traces you grep twelve services by timestamp. Without logs you see a slow span but not why it retried three times.
Golden rule: express fleet health as metrics; express per-request forensics as logs and traces. Never log PII or secrets—see Security → API hardening.
Trace context propagation — W3C and B3
A trace only works if every hop forwards the same context. Standard headers prevent each team inventing incompatible propagation.
W3C Trace Context (preferred)
The W3C standard defines traceparent and optional tracestate HTTP headers. traceparent format: version-trace_id-parent_id-flags (e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01). trace_id identifies the whole request tree; parent_id is the calling span; flags bit 0 indicates sampled. gRPC uses equivalent metadata keys; message queues embed the same values in record headers.
tracestate carries vendor-specific hints (sampling priority, tenant routing)—keep it small; never put secrets in baggage or tracestate.
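To make the traceparent layout concrete, here is a minimal parsing sketch. The class name TraceParent and its fields are hypothetical, and it accepts only the four-field layout described above; production services would normally rely on the OpenTelemetry propagators rather than hand-rolled parsing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative traceparent parser; names are hypothetical, not a library API. */
final class TraceParent {
    // version-trace_id-parent_id-flags, all lowercase hex
    private static final Pattern FORMAT = Pattern.compile(
            "^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    final String traceId;
    final String parentId;
    final boolean sampled;

    private TraceParent(String traceId, String parentId, boolean sampled) {
        this.traceId = traceId;
        this.parentId = parentId;
        this.sampled = sampled;
    }

    /** Returns null for a missing or malformed header so the caller can start a new trace. */
    static TraceParent parse(String header) {
        if (header == null) {
            return null;
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return null;
        }
        String version = m.group(1);
        String traceId = m.group(2);
        String parentId = m.group(3);
        // Version ff and all-zero IDs are invalid.
        if (version.equals("ff")
                || traceId.equals("0".repeat(32))
                || parentId.equals("0".repeat(16))) {
            return null;
        }
        int flags = Integer.parseInt(m.group(4), 16);
        // Bit 0 of the flags byte is the sampled flag.
        return new TraceParent(traceId, parentId, (flags & 0x01) != 0);
    }
}
```

Returning null instead of throwing keeps a malformed header from failing the request; the service simply starts a fresh trace.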
B3 (Zipkin legacy, still common)
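B3 predates W3C and remains in older Spring Cloud Sleuth estates and some proxies. Headers include X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId, X-B3-Sampled (1/0), and the compact single-header b3 form. The OpenTelemetry Collector receives exported spans rather than request headers, so header translation happens in the services or proxies on the request path: during migration, configure propagators to accept both formats (for example OTEL_PROPAGATORS=tracecontext,b3multi) so mixed fleets interoperate.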
| Header | Standard | Purpose |
|---|---|---|
| traceparent | W3C | trace_id + parent span_id + sampled flag |
| tracestate | W3C | Vendor extensions, sampling hints |
| X-B3-TraceId | B3 | 64- or 128-bit trace identifier |
| X-B3-SpanId | B3 | 64-bit current span id |
| X-B3-Sampled | B3 | Whether trace is recorded |
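As an illustration of how the two header families map onto each other, the sketch below builds a traceparent value from B3 multi-header values. The class name B3ToW3C is hypothetical. It assumes header names arrive in exactly the casing shown (real HTTP headers are case-insensitive), and it treats an absent X-B3-Sampled as not sampled, which is a simplification.

```java
import java.util.Map;

/** Illustrative B3 to W3C mapping; not a library API. */
final class B3ToW3C {
    /** Returns a traceparent value, or null when required B3 IDs are missing or malformed. */
    static String toTraceparent(Map<String, String> headers) {
        String traceId = headers.get("X-B3-TraceId");
        String spanId = headers.get("X-B3-SpanId");
        if (traceId == null || spanId == null) {
            return null;
        }
        traceId = traceId.toLowerCase();
        spanId = spanId.toLowerCase();
        if (!traceId.matches("[0-9a-f]{16}|[0-9a-f]{32}") || !spanId.matches("[0-9a-f]{16}")) {
            return null;
        }
        // B3 allows 64-bit trace IDs; traceparent needs 128 bits, so left-pad with zeros.
        if (traceId.length() == 16) {
            traceId = "0".repeat(16) + traceId;
        }
        // Assumption: only an explicit "1" counts as sampled.
        String flags = "1".equals(headers.get("X-B3-Sampled")) ? "01" : "00";
        return "00-" + traceId + "-" + spanId + "-" + flags;
    }
}
```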
sequenceDiagram
participant GW as API Gateway
participant Ord as Order Service
participant Inv as Inventory Service
GW->>Ord: traceparent W3C
Ord->>Inv: forward traceparent
Note over GW,Inv: Same trace_id in every span
Load balancers or API gateways that strip unknown headers break traces silently. Allowlist traceparent, tracestate, and B3 headers in every proxy config.
Sampling strategies — head, tail, and adaptive
100% trace capture at production traffic volume bankrupts storage and adds latency. Sampling decides which traces to keep while preserving debuggability.
Head-based sampling
Decision at trace start—usually the ingress gateway or first service. “Keep 10% of all traces” via random or consistent hash. Pros: simple, predictable cost, no buffering. Cons: you may discard the one slow/error trace you needed—bad luck on a 1% sample rate. Spring Boot: management.tracing.sampling.probability: 0.1.
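The "consistent hash" variant can be sketched as a pure function of the trace_id, so every service that sees the same trace reaches the same keep/drop decision without coordination. HeadSampler is a hypothetical name; the idea resembles ratio-based samplers in tracing SDKs but is not their exact algorithm.

```java
/** Deterministic head sampler keyed on trace_id (illustrative sketch). */
final class HeadSampler {
    private final double probability;
    private final long threshold;

    HeadSampler(double probability) {
        if (probability < 0.0 || probability > 1.0) {
            throw new IllegalArgumentException("probability must be within [0, 1]");
        }
        this.probability = probability;
        // Scale the probability onto the non-negative long range.
        this.threshold = (long) (probability * Long.MAX_VALUE);
    }

    /** Same trace_id always yields the same decision. */
    boolean shouldSample(String traceIdHex) {
        if (traceIdHex == null || !traceIdHex.matches("[0-9a-f]{32}")) {
            throw new IllegalArgumentException("expected 32 lowercase hex characters");
        }
        if (probability >= 1.0) {
            return true;
        }
        // Low 64 bits of the 128-bit trace_id, with the sign bit cleared.
        long low = Long.parseUnsignedLong(traceIdHex.substring(16), 16) & Long.MAX_VALUE;
        return low < threshold;
    }
}
```

Because the decision depends only on the ID, a downstream service configured with the same rate never keeps a trace whose root was dropped upstream.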
Tail-based sampling
Decision after trace completes—buffer spans in OpenTelemetry Collector, then keep traces matching rules: status=error, duration > 2s, specific attribute (tenant=enterprise), or random remainder to fill quota. Pros: always retain interesting traces. Cons: memory buffering, complexity, slight export delay. Essential for high-traffic prod where head-only sampling misses rare failures.
Adaptive sampling
Dynamically adjust rate based on traffic volume, error rate, or SLO burn—Honeycomb and Datadog offer this natively; self-hosted stacks approximate with Collector processors plus rate limits per service. During incidents, temporarily raise sample rate for affected services; lower during steady state.
| Strategy | When to use | Trade-off |
|---|---|---|
| Head 1–10% | Default prod baseline | May miss rare paths |
| Head 100% | Staging, load test, low-traffic services | Storage cost |
| Tail keep errors | Prod high traffic | Collector memory |
| Adaptive | Variable traffic, incident mode | Vendor or custom logic |
E-commerce teams often run 5–10% head sampling in prod plus tail rules: keep all 5xx, all traces > 2s, 1% random baseline. Staging keeps 100% for regression comparison.
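The tail rules above (keep all 5xx, keep slow traces, keep a small random baseline) reduce to a short decision function once a trace is complete. The sketch below is illustrative only: TailSampler and SpanSummary are hypothetical names, the 2000 ms threshold mirrors the text, and "trace slower than 2 s" is approximated as "any span slower than 2 s". In a real deployment this logic lives in the Collector, not in application code.

```java
import java.util.List;
import java.util.function.DoubleSupplier;

/** Illustrative tail-sampling decision over a completed trace. */
final class TailSampler {
    /** Minimal view of a finished span; fields are assumptions for this sketch. */
    static final class SpanSummary {
        final int httpStatus;
        final long durationMs;

        SpanSummary(int httpStatus, long durationMs) {
            this.httpStatus = httpStatus;
            this.durationMs = durationMs;
        }
    }

    /**
     * @param baselineRate fraction of uninteresting traces to keep, e.g. 0.01
     * @param random       source of values in [0, 1); injectable for tests
     */
    static boolean keep(List<SpanSummary> trace, double baselineRate, DoubleSupplier random) {
        for (SpanSummary span : trace) {
            if (span.httpStatus >= 500) {
                return true; // rule 1: any server error
            }
            if (span.durationMs > 2000) {
                return true; // rule 2: slow span
            }
        }
        // rule 3: random baseline so healthy traffic stays visible
        return random.getAsDouble() < baselineRate;
    }
}
```

Passing the random source as a parameter keeps the rule order testable: errors and slow traces are kept regardless of the dice roll.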
OpenTelemetry — the standard
Instrument once, export anywhere. OTel unifies traces, metrics, and logs under one vendor-neutral API—the CNCF standard every backend speaks.
Components
- API — interfaces in application code (Span, Meter, Logger)
- SDK — implementation: sampling, batching, resource attributes (service.name, deployment.environment)
- Instrumentation libraries — auto hooks for Spring Web, JDBC, Kafka, gRPC, HTTP clients
- Collector — receive OTLP, process (filter, sample, enrich), export to Jaeger, Tempo, Prometheus, Loki
Java teams often start with opentelemetry-javaagent.jar attached to the JVM—zero-code HTTP, DB, and messaging spans. Add manual spans for business operations (place-order, charge-payment) where auto-instrumentation stops at framework boundaries. Spring Boot 3 integrates via Micrometer Tracing bridge exporting OTLP.
flowchart LR
SVC[Spring Boot pods] -->|OTLP gRPC 4317| COL[OTel Collector]
COL --> TEMPO[Tempo or Jaeger]
COL --> PROM[Prometheus remote write]
COL --> LOG[Loki exporter]
# OTel Collector: tail sampling policies (a trace is kept if any policy matches)
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 2000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
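import io.opentelemetry.api.trace.Span;
import io.opentelemetry.instrumentation.annotations.WithSpan;

@WithSpan("place-order")
public OrderId placeOrder(PlaceOrderCommand cmd) {
    // Business attribute on the current span; keep it low-cardinality and free of PII.
    Span.current().setAttribute("order.line_count", cmd.lines().size());
    return orderRepository.save(cmd.toOrder()).id();
}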
Baggage propagates optional key-values (tenant region, experiment flag) alongside trace context—use sparingly, never for secrets or large payloads.
Jaeger, Zipkin, and Tempo as backends
All ingest OTLP from the Collector; they differ in storage model, ops burden, and Grafana integration.
| Backend | Storage | Strengths | Typical fit |
|---|---|---|---|
| Jaeger | Cassandra, Elasticsearch, Badger, memory | Mature UI, K8s operator, query by tags | Teams wanting dedicated trace UI |
| Zipkin | In-memory, Elasticsearch, Cassandra | Simple, lightweight, B3 native | Legacy Sleuth, small deployments |
| Grafana Tempo | Object storage — S3, GCS, Azure Blob | Cheap at scale, native Grafana, TraceQL | Cloud-native, LGTM stack (Loki Grafana Tempo Mimir) |
Jaeger offers service dependency graphs, comparison UI, and adaptive sampling plugins. Operational cost rises with Elasticsearch/Cassandra unless you use object-storage backends via Jaeger v2 components.
Tempo stores blocks in object storage, so cost scales with retention GB, not indexed span count. Query via Grafana Explore or TraceQL ({ resource.service.name = "order-service" && duration > 1s }). Pair with Loki for “logs for this trace_id” and Prometheus for exemplars linking metrics to traces.
Indexed trace stores (Elasticsearch-backed Jaeger) enable rich search but explode cost at billions of spans. Tempo trades ad-hoc search for object-storage economics—know your query patterns before choosing.
Trace-based debugging — root cause across ten services
A trace is a tree of spans. The waterfall view answers which hop ate 900 ms of a 1 s budget—and whether errors propagated or were masked by retries.
Span kinds: SERVER (incoming HTTP), CLIENT (outbound call), PRODUCER/CONSUMER (messaging), INTERNAL (in-process). Parent-child links preserve causality; links connect async work started before parent span ended. Tag spans with business context: order.id, payment.provider—not PII.
Incident workflow with traces
- Alert fires — checkout p99 SLO burn (from Prometheus)
- Grafana exemplar or Loki log line yields trace_id
- Jaeger/Tempo waterfall — Payment CLIENT span 820 ms, Inventory SERVER 15 ms
- Drill Payment span — Stripe timeout after retry; circuit breaker half-open
- Mitigate — extend timeout temporarily, scale payment pods, disable promotion flag
flowchart TB
GW[Gateway 12ms] --> Ord[Order 45ms]
Ord --> Cat[Catalog 8ms]
Ord --> Inv[Inventory 22ms]
Ord --> Pay[Payment 820ms]
Ord --> Notif[Notification async]
Pay --> Stripe[Stripe API timeout]
Critical path analysis: Catalog and Inventory run parallel—longest branch (Payment) dominates user latency. Optimize Payment first; caching Catalog does nothing if Payment p99 is 800 ms. During canary deploys, compare trace latency distributions between stable and canary versions—pairs with Service Mesh → Canary traffic split.
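Critical path analysis can be sketched as a walk over the span tree that always follows the slowest child. The example below is a simplified model, not how any trace backend computes it: SpanNode is a hypothetical type, the numbers from the diagram are treated as each span's own time, all children of a span are assumed to run in parallel, and the async Notification branch is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative span tree node for critical path analysis. */
final class SpanNode {
    final String name;
    final long selfMs;
    final List<SpanNode> children = new ArrayList<>();

    SpanNode(String name, long selfMs) {
        this.name = name;
        this.selfMs = selfMs;
    }

    SpanNode add(SpanNode child) {
        children.add(child);
        return this;
    }

    /** Longest root-to-leaf chain, assuming sibling spans run in parallel. */
    List<SpanNode> criticalPath() {
        List<SpanNode> slowestBranch = new ArrayList<>();
        long slowestMs = -1;
        for (SpanNode child : children) {
            List<SpanNode> branch = child.criticalPath();
            long branchMs = totalMs(branch);
            if (branchMs > slowestMs) {
                slowestMs = branchMs;
                slowestBranch = branch;
            }
        }
        List<SpanNode> path = new ArrayList<>();
        path.add(this);
        path.addAll(slowestBranch);
        return path;
    }

    static long totalMs(List<SpanNode> path) {
        long sum = 0;
        for (SpanNode node : path) {
            sum += node.selfMs;
        }
        return sum;
    }
}
```

With the checkout numbers (Gateway 12, Order 45, Catalog 8, Inventory 22, Payment 820), the path is Gateway, Order, Payment for a total of 877 ms; removing Catalog entirely would not change it.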
Common patterns in multi-service traces: retry amplification (one user request → five downstream attempts visible as repeated CLIENT spans), missing spans (service not instrumented—gap in waterfall), clock skew (child starts before parent—use relative duration not absolute timestamps), and fire-and-forget async (trace ends at gateway while Kafka consumer span appears orphaned without link).
Correlation ID pattern — one thread through the system
Support sends “order 8f2a failed”—you need every log and span for that journey, not grep by timestamp hoping clock skew cooperates.
Request ID (X-Request-Id) — human-friendly identifier generated at the edge gateway or accepted from client if UUID-shaped and validated. Propagate on every outbound HTTP header and Kafka message envelope. Appears in support tools and API responses for user-facing correlation.
Trace ID — from W3C traceparent; ties all spans and should appear in every structured log via MDC. Best practice: populate log MDC from OpenTelemetry context automatically—Java agent or Micrometer tracing bridge sets trace_id and span_id without manual filter code in every service.
sequenceDiagram
participant GW as API Gateway
participant Ord as Order Service
participant Inv as Inventory
participant K as Kafka
GW->>Ord: traceparent plus X-Request-Id
Ord->>Inv: forward headers
Ord->>K: headers in record
Note over GW,K: Same trace_id in logs and spans
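import java.util.Optional;
import java.util.UUID;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import org.springframework.web.server.WebFilter;
import org.springframework.web.server.WebFilterChain;
import reactor.core.publisher.Mono;

@Component
public class CorrelationFilter implements WebFilter {
    @Override
    public Mono<Void> filter(ServerWebExchange ex, WebFilterChain chain) {
        // Accept the client's ID only if it is UUID-shaped; otherwise mint a new one.
        String requestId = Optional.ofNullable(ex.getRequest().getHeaders().getFirst("X-Request-Id"))
                .filter(id -> id.matches("[0-9a-f-]{36}"))
                .orElseGet(() -> UUID.randomUUID().toString());
        // Echo the ID so clients and support tools can quote it (set avoids duplicate headers).
        ex.getResponse().getHeaders().set("X-Request-Id", requestId);
        // Expose the ID to downstream reactive operators via the Reactor context.
        return chain.filter(ex)
                .contextWrite(ctx -> ctx.put("request_id", requestId));
    }
}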
Gateway should reject or replace malformed IDs—never trust client-supplied IDs for auth, only correlation. Document headers in OpenAPI per Service Design API standards.
RED and USE — what to measure
Two mnemonic frameworks prevent dashboard sprawl: RED for request-driven services, USE for resources (CPU, disk, queues, pools).
RED — for services
Every synchronous microservice exposing HTTP or gRPC should dashboard these three:
- Rate — requests per second; traffic volume and capacity planning input
- Errors — ratio of failed requests (5xx, timeouts, gRPC UNAVAILABLE); split by dependency when possible
- Duration — latency distribution: p50, p95, p99—never alert on average alone
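A small worked example shows why averages hide tail latency. The sketch below uses the nearest-rank percentile definition on raw samples; LatencyStats is a hypothetical helper, and real dashboards estimate percentiles from histogram buckets rather than from raw samples.

```java
import java.util.Arrays;

/** Illustrative latency summary over raw samples. */
final class LatencyStats {
    static double average(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(Double.NaN);
    }

    /** Nearest-rank percentile; p is a whole number from 1 to 100. */
    static long percentile(long[] samplesMs, int p) {
        if (samplesMs.length == 0 || p < 1 || p > 100) {
            throw new IllegalArgumentException("need samples and 1 <= p <= 100");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        // Rank is the smallest position covering p percent of the samples.
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }
}
```

For 98 requests at 50 ms and 2 requests at 3000 ms, the average is 109 ms and p50 is 50 ms, while p99 is 3000 ms: the two users who waited three seconds only show up in the tail percentile.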
Spring Boot + Micrometer expose http.server.requests with method, status, uri tags—normalize uri to templated paths (/orders/{id}) or cardinality explodes. Istio sidecars export equivalent RED without code—see Service Mesh → Observability.
USE — for resources
- Utilization — fraction of time busy: CPU, memory pressure, JDBC pool active connections
- Saturation — work waiting: queue depth, thread pool rejections, disk IO wait, Kafka consumer lag
- Errors — device/software errors: OOM kills, TCP retransmits, disk read failures
Healthy RED on Order Service while connection pool saturation climbs predicts outage in ten minutes—RED alone misses resource exhaustion. Watch bulkhead rejections and retry storms from Resilience → Tuning as leading indicators.
Dashboard layout per service: SLO burn strip on top, RED row middle, USE/resource row bottom. Same layout everywhere—on-call muscle memory at 3 a.m.
Prometheus — pull-based scraping and PromQL
Prometheus scrapes metrics HTTP endpoints on an interval, stores time series in TSDB, and powers PromQL alerts—the de facto standard in Kubernetes.
Pull model
Prometheus pulls from /actuator/prometheus or /metrics every 15–30s. Kubernetes: Prometheus Operator ServiceMonitor CRD selects services by label; PodMonitor for direct pod scrape. Short-lived batch jobs use Pushgateway sparingly—easy to misuse with stale metrics.
Metric types and PromQL basics
- Counter — monotonic (total requests); use rate() or increase() over a window
- Gauge — point-in-time (queue depth, memory)
- Histogram — buckets for latency SLOs; histogram_quantile(0.99, ...)
# Request rate (R in RED)
sum(rate(http_server_requests_seconds_count{application="order-service"}[5m]))

# Error ratio (E in RED)
sum(rate(http_server_requests_seconds_count{application="order-service",status=~"5.."}[5m]))
  / sum(rate(http_server_requests_seconds_count{application="order-service"}[5m]))

# p99 latency (D in RED)
histogram_quantile(0.99,
  sum(rate(http_server_requests_seconds_bucket{application="order-service"}[5m])) by (le))
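To see what histogram_quantile does with cumulative buckets, here is a simplified sketch of the estimate: find the bucket containing the target rank, then interpolate linearly inside it. HistogramQuantile is a hypothetical name and the sketch omits edge cases that Prometheus handles (negative bounds, malformed buckets), so treat it as an approximation of the idea rather than the exact implementation.

```java
/** Illustrative quantile estimate from cumulative histogram buckets. */
final class HistogramQuantile {
    /**
     * @param q                quantile in [0, 1]
     * @param upperBounds      ascending finite bucket bounds (the le values), without +Inf
     * @param cumulativeCounts cumulative count per bound, plus one final entry for +Inf (the total)
     */
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        double total = cumulativeCounts[cumulativeCounts.length - 1];
        if (total == 0) {
            return Double.NaN;
        }
        double rank = q * total;
        for (int i = 0; i < upperBounds.length; i++) {
            if (cumulativeCounts[i] >= rank) {
                // The lowest bucket is assumed to start at 0.
                double lower = (i == 0) ? 0.0 : upperBounds[i - 1];
                double below = (i == 0) ? 0.0 : cumulativeCounts[i - 1];
                double inBucket = cumulativeCounts[i] - below;
                if (inBucket == 0) {
                    return lower;
                }
                // Linear interpolation: assumes observations are spread evenly in the bucket.
                return lower + (upperBounds[i] - lower) * (rank - below) / inBucket;
            }
        }
        // Rank falls in the +Inf bucket: report the highest finite bound.
        return upperBounds[upperBounds.length - 1];
    }
}
```

The interpolation explains why bucket boundaries matter for SLOs: with bounds at 1 s and 2 s, any p99 between them is an estimate, so place a bucket boundary exactly at the SLO threshold.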
Label hygiene and recording rules
Each unique label combination is a time series. A raw label such as uri="/orders/12345" creates one series per order ID, potentially millions, which can exhaust Prometheus memory. Recording rules pre-aggregate hot queries (job:http_requests:rate5m) for faster dashboards. Exemplars attach trace_id to histogram buckets, so Grafana can jump from a latency spike to an example trace.
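One way to keep the uri label bounded is to collapse ID-like path segments into a placeholder before tagging. The sketch below is a hypothetical helper for custom metrics (UriTemplates is not a Micrometer or Spring class), and it assumes IDs are either numeric or UUID-shaped.

```java
import java.util.regex.Pattern;

/** Illustrative path normalizer for low-cardinality metric labels. */
final class UriTemplates {
    private static final Pattern UUID_SEGMENT = Pattern.compile(
            "/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}(?=/|$)");
    private static final Pattern NUMERIC_SEGMENT = Pattern.compile("/\\d+(?=/|$)");

    /** Drops the query string and replaces ID-like segments with {id}. */
    static String normalize(String path) {
        String withoutQuery = path.split("\\?", 2)[0];
        String result = UUID_SEGMENT.matcher(withoutQuery).replaceAll("/{id}");
        return NUMERIC_SEGMENT.matcher(result).replaceAll("/{id}");
    }
}
```

With this, /orders/12345 and /orders/67890 both become /orders/{id}, so the label contributes one series instead of one per order.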
Alerting on log counts in Elasticsearch is expensive and laggy, and it duplicates what Prometheus already does. Use the right signal: metrics for aggregates, logs for forensics.
Micrometer and Spring Boot Actuator
Micrometer is the metrics facade in Spring Boot 3—one API, export to Prometheus, OTLP, Datadog without rewriting instrumentation.
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  metrics:
    tags:
      application: ${spring.application.name}
      environment: ${ENVIRONMENT:local}
  tracing:
    sampling:
      probability: 0.1
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces
Protect /actuator/prometheus with network policy—not public internet. Custom metrics: Counter for orders placed, Timer for payment latency, Gauge for queue depth. Low-cardinality tags only: region, payment_method—not user_id. Resilience4j circuit breaker metrics integrate automatically—alert when breaker stays OPEN.
Grafana — service dashboards and SLO dashboards
Grafana queries Prometheus, Loki, Tempo, and Jaeger in one UI—dashboards become shared language between dev, SRE, and product during incidents.
Service dashboards
Start with user journeys (Checkout, Search), not pod lists. Per service panel row: RED metrics with variables for namespace, cluster, service. Bottom row: USE—CPU, JVM heap, DB pool, Kafka lag. One dashboard template cloned per service keeps on-call consistent.
SLO dashboards
Dedicated board per critical journey: SLI gauge (current availability/latency), error budget remaining (%), burn rate over 1h/6h/24h windows. Multi-burn visualization from Google SRE workbook—fast burn pages, slow burn tickets. Link panels to runbooks: “If checkout error rate > 2% for 5m → check Stripe status, scale payment pods, RUNBOOK-042.”
| Alert type | Example | Action |
|---|---|---|
| Symptom | Checkout p99 > 2s for 10m | Page on-call — user impact |
| Cause | Inventory pod restart loop | Ticket — may explain symptom |
| Capacity | CPU > 70% sustained 1h | Scale HPA — see Deployment |
Route pages via Grafana unified alerting or Prometheus Alertmanager by severity and team. Page humans on symptom-based SLO burn—not every CPU blip. Silence windows during planned maintenance with documented annotations.
Alerting rules — SLI, SLO, SLA, and error budget
An SLO is a target users feel—99.9% of checkout requests succeed in under 2 seconds—not “three nines on CPU.” Error budgets translate reliability into shared currency.
| Term | Meaning | Example |
|---|---|---|
| SLI | Measurable indicator of service level | Ratio of checkout HTTP 200 with latency < 2s |
| SLO | Target SLI over rolling window | 99.5% of checkouts meet SLI over 30 days |
| SLA | Contract with customer penalties | 99.9% monthly or credits — legal/commercial |
| Error budget | Allowed unreliability = 100% − SLO | 0.5% budget ≈ 3.6 h of failed service per 30 days at 99.5% |
SLI implementation in Prometheus: ratio of good events to total over window. Good events: status=~"2..", le="2" on histogram bucket or dedicated success counter. SLO target 99.5% over 30d → alert when burn rate consumes budget too fast.
# Fast burn: 14.4x budget consumption in 1h → page
# Slow burn: 6x budget consumption in 6h → ticket
# When budget exhausted → freeze risky releases
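The burn-rate thresholds above follow from simple arithmetic, sketched here with hypothetical helper names. Burn rate is the observed error ratio divided by the allowed error ratio (1 − SLO); a rate of 1 spends the budget exactly over the SLO window.

```java
/** Illustrative error budget arithmetic for a rolling SLO window. */
final class BurnRate {
    /** Observed error ratio divided by the allowed error ratio (1 - SLO). */
    static double burnRate(double errorRatio, double sloTarget) {
        return errorRatio / (1.0 - sloTarget);
    }

    /** Fraction of the whole-window budget consumed by burning at a rate for some hours. */
    static double budgetConsumed(double rate, double hours, double windowDays) {
        return rate * hours / (windowDays * 24.0);
    }

    /** Hours until the budget is gone if the burn rate stays constant. */
    static double hoursToExhaustion(double rate, double windowDays) {
        return windowDays * 24.0 / rate;
    }
}
```

For a 99.5% SLO over 30 days, a 7.2% error ratio is a 14.4× burn; sustained for 1 h it consumes 2% of the budget and would exhaust it in 50 h. A 6× burn sustained for 6 h consumes 5%.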
flowchart TB
SLI[Measure SLI from metrics] --> SLO{Meet SLO?}
SLO -->|yes| SHIP[Allow feature releases]
SLO -->|burning| SLOW[Investigate and fix]
SLO -->|exhausted| FREEZE[Freeze risky changes]
Tie resilience timeouts to SLO math: checkout budget 2 s with five sequential hops—no single hop gets 2 s. Aligns with Resilience → Tuning. Canary promotion gates on SLO—see Deployment.
One SLI/SLO pair: “SLI = successful checkout under 2s; SLO = 99.5% over 30d; alert on 14.4× burn in 1h.” Shows implementation depth, not definitions only.
ELK Stack — Elasticsearch, Logstash, Kibana
The classic centralized logging stack: ingest, parse, index full text, search and visualize in Kibana. Powerful—and expensive at scale.
Elasticsearch stores inverted indexes of log fields—fast full-text search, aggregations, and complex filters (trace_id:abc AND level:ERROR AND service:order-service). Logstash (or Beats/Filebeat) ships logs from apps, parses grok/JSON, enriches with Kubernetes metadata, forwards to ES. Kibana provides Discover, dashboards, and alerting (prefer metric alerts in Prometheus; log alerts for security anomalies).
flowchart LR
POD[App pods stdout] --> FB[Filebeat]
FB --> LS[Logstash]
LS --> ES[Elasticsearch]
ES --> KB[Kibana]
Strengths: mature ecosystem, rich query DSL, security/compliance features (index lifecycle, frozen tiers). Weaknesses: indexing every field is costly—hot/warm/cold tier planning required; cardinality on high-volume INFO logs adds up fast. Mitigate with index templates mapping trace_id as keyword, sampling DEBUG, and ILM policies deleting indices after 7–30 days.
Elastic Agent and Fleet simplify deployment on Kubernetes—DaemonSet collects container logs, adds pod labels automatically. Alternative ingest: Fluent Bit lighter than Logstash for high-volume K8s estates.
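ELK excels when you need full-text search across arbitrary message content. If queries are always by label (namespace, service) plus a trace_id filter, Loki is often much cheaper, on the order of 10× in some deployments; verify against your own volume and retention.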
Loki + Grafana — logs without indexing everything
Loki indexes labels (like Prometheus indexes metric names), not full log line content—dramatically cheaper at Kubernetes scale.
Logs stream to Loki with label sets: {namespace="prod", app="order-service", pod="order-7x2k"}. LogQL queries filter by labels first, then grep-like filter on line content—{app="order-service"} |= "trace_id=4bf92f". Pair with Grafana for unified view: metrics spike → same dashboard → Loki logs → Tempo trace via derived fields.
Promtail (or Grafana Alloy) tails container logs and extracts JSON fields into labels only where cardinality is bounded (level, app, namespace). Do not promote trace_id or user_id to labels: every distinct value creates a new stream. For the log→trace pivot, keep trace_id in the log line or in structured metadata (Loki 2.9+), which makes it searchable without index blow-up, and link it to Tempo through Grafana derived fields.
{app="order-service"} | json | trace_id="4bf92f3577b34da6a3ce929d0e0e4736" | level="ERROR"
| Aspect | Elasticsearch | Loki |
|---|---|---|
| Index model | Full-text inverted index | Labels + compressed chunks |
| Cost at scale | Higher — every field indexed | Lower — object storage friendly |
| Best queries | Arbitrary text search | Label-filtered stream grep |
| Grafana integration | Via Elasticsearch datasource | Native — LGTM stack |
Structured logging — JSON with trace and service fields
Plain text logs parsed with regex break when someone adds a colon. JSON logs index reliably and join to traces via shared fields.
Every log line should carry structured fields: timestamp, level, service, trace_id, span_id, request_id, plus domain IDs (order_id) when known. Use SLF4J MDC populated from OpenTelemetry context in a servlet filter or WebFlux filter—clear MDC after request to prevent thread-pool leakage.
{
  "timestamp": "2026-06-04T03:14:22.891Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "request_id": "req-8f2a-991c",
  "message": "Payment charge failed",
  "order_id": "ord-7721",
  "error.type": "StripeTimeoutException"
}
<appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
  <encoder class="net.logstash.logback.encoder.LogstashEncoder">
    <includeMdcKeyName>trace_id</includeMdcKeyName>
    <includeMdcKeyName>span_id</includeMdcKeyName>
    <includeMdcKeyName>request_id</includeMdcKeyName>
    <customFields>{"service":"${spring.application.name}"}</customFields>
  </encoder>
</appender>
Log levels: ERROR action-required, WARN recoverable anomaly, INFO business milestones, DEBUG off in prod (enable per-request via feature flag during incidents). Avoid full payloads—truncate and hash identifiers. OpenTelemetry Logs bridge (experimental) correlates logs to spans natively.
Logging inside tight loops or on every health check floods the pipeline and hides real errors. Sample debug or use metrics counters instead.
Log aggregation patterns in Kubernetes
Containers write to stdout/stderr; the platform must collect, label, and ship without app changes. Choose node-level, sidecar, or logging agent patterns deliberately.
Node-level collection (recommended default)
DaemonSet agent (Promtail, Fluent Bit, Filebeat) on every node tails /var/log/containers/*.log, enriches with Kubernetes API metadata (namespace, pod, container, labels), ships to Loki or Elasticsearch. Zero app change—apps log JSON to stdout. Lowest overhead for most microservices.
Sidecar pattern
Second container in pod tails shared volume or stdout relay—useful when app writes to file instead of stdout, or when you need local parsing before ship. Cost: extra container memory/CPU per pod—avoid fleet-wide unless required.
Direct export from app
App sends logs via OTLP or HTTP to Collector/Loki—useful for serverless or when DaemonSet access is restricted. Requires app library config; ensure backoff on collector outage so logging does not block requests.
| Pattern | Pros | Cons |
|---|---|---|
| DaemonSet | No app change, uniform labels | Node-level RBAC, shared fate on node |
| Sidecar | File-based apps, pod-local filter | Resource multiplier per pod |
| OTLP from app | Unified OTel pipeline | App must handle backpressure |
Standardize labels across cluster: app.kubernetes.io/name, environment, cluster name. Exclude kube-system noise at ingest. Rotate and cap log volume per namespace with quotas in multi-tenant clusters.
Tracing async flows — Kafka, outbox, broken traces
HTTP propagation is straightforward; message queues break traces unless you inject W3C context into record headers and start consumer spans explicitly.
When Order Service publishes OrderPlaced via transactional outbox to Kafka, embed traceparent in message headers (Spring Kafka + Micrometer tracing configures this). Inventory consumer starts CONSUMER span linked to producer span—trace UI shows async continuation, not orphan spans.
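The mechanics behind that propagation can be sketched with a plain map standing in for Kafka record headers. This is an illustration only: RecordHeaderPropagation is a hypothetical class, the header map replaces the real Kafka Headers type, and in practice the tracing library performs both steps.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative trace context hand-off through message headers. */
final class RecordHeaderPropagation {
    /** Producer side: copy the current traceparent into the record headers. */
    static void inject(Map<String, byte[]> recordHeaders, String traceparent) {
        recordHeaders.put("traceparent", traceparent.getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Consumer side: keep the trace_id and flags, but mint a new span id for the
     * CONSUMER span. Returns null when no usable context is present.
     */
    static String continueTrace(Map<String, byte[]> recordHeaders) {
        byte[] raw = recordHeaders.get("traceparent");
        if (raw == null) {
            return null; // no context: the consumer starts a new trace
        }
        String[] parts = new String(raw, StandardCharsets.UTF_8).split("-");
        if (parts.length != 4) {
            return null;
        }
        long spanId;
        do {
            spanId = ThreadLocalRandom.current().nextLong();
        } while (spanId == 0L); // an all-zero span id is invalid
        return parts[0] + "-" + parts[1] + "-" + String.format("%016x", spanId) + "-" + parts[3];
    }
}
```

The consumer's spans share the producer's trace_id, which is what lets the trace UI draw the async continuation instead of an orphan.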
Without propagation, on-call sees disconnected spans and assumes Inventory is fast while missing 30s consumer lag. Complement traces with consumer lag metrics and DLQ depth alerts— see Communication → Kafka/outbox.
Batch consumers processing 500 records in one poll create one giant span—use child spans per message or attribute batch size on parent span.
Mesh telemetry vs application instrumentation
Istio sees L4/L7 bytes; apps see business operations—duplicate spans if both create SERVER spans for the same request unless coordinated.
Mesh-only: uniform RED across polyglot services, mTLS verified traffic, no code deploy for basic metrics. App-only: business spans, custom metrics, works without sidecars on VMs. Hybrid (recommended): mesh for network metrics and mTLS audit; OTel in app for domain spans and log correlation; disable duplicate HTTP server spans in one layer via telemetry config.
Full comparison: Service Mesh → Observability. Kiali answers topology; Grafana answers SLOs; Jaeger/Tempo answers latency—same incident, three lenses.
Running 100% trace sampling in production bankrupts storage. Use tail sampling plus aggressive head sampling in prod, and reserve full sampling for dev and staging load tests.
Production observability checklist
Gate new services on telemetry completeness before production traffic—not as a post-launch cleanup ticket.
- Structured JSON logs with trace_id, span_id, request_id, service name—MDC cleared per request
- RED metrics via Micrometer/Prometheus; actuator scrape endpoint not public
- OpenTelemetry with W3C propagation on HTTP, gRPC, and Kafka; B3 translation at legacy boundaries
- Head + tail sampling configured in Collector; staging at 100% for regression
- Trace backend (Jaeger/Tempo) with retention policy aligned to incident needs (7–14 days typical)
- Grafana service dashboards + SLO dashboard with multi-window burn alerts and runbook links
- Centralized logging: Loki or ELK with DaemonSet collection and Kubernetes metadata labels
- Consumer lag and DLQ alerts for async paths
- Label cardinality reviewed—no unbounded uri or user_id metric tags
- Liveness vs readiness probes distinct; probe traffic not logged at INFO
- Exemplars enabled linking histograms to trace IDs where supported
- Postmortem template; on-call rotation with escalation path