Observability in Distributed Systems

Observability vs monitoring

Monitoring reacts to known failure modes; observability lets you investigate unknown failures by exploring high-cardinality telemetry you did not predefine dashboards for.

Monitoring is reactive and dashboard-driven: CPU above 90%, disk full, health check failed, queue depth over threshold. You anticipated the failure modes, wired alerts, and on-call knows what each page means. It works when the system is stable and failure shapes are finite.

Observability is proactive investigation: the ability to understand internal system state from exported outputs without redeploying debug code. Microservices multiply unknowns—partial degradation, retry storms, one slow dependency in a fan-out graph, tenant-specific routing bugs. After deploy you will ask questions you did not plan for: “Why did only EU tenants see 503 on checkout between 03:12 and 03:18?” Observability assumes those questions and exports enough context—logs, metrics, traces—to answer them.

Dimension	Monitoring	Observability
Mindset	Known unknowns — alert when X	Unknown unknowns — explore why
Cardinality	Low — aggregates only	High — per-request, per-trace drill-down
Primary tool	Threshold alerts on metrics	Ad-hoc queries across signals
On-call use	“Something broke — run runbook”	“Users hurt — find root cause fast”

Mature platforms combine both: SLO-based alerts (monitoring) plus trace and log correlation (observability). Click a spike in Grafana → drill to exemplar trace → jump to correlated JSON log lines—a workflow impossible with metrics alone.

🎯 Interview Tip

Define observability as “ability to understand internal state from external outputs.” Story: support ticket with trace_id → Jaeger waterfall → slow Inventory DB query → missing index. Monitoring would only show elevated p99.

The three pillars — logs, metrics, traces

Each signal answers different questions. Using only one leaves blind spots during incidents; all three together form a complete picture.

Logs are discrete, timestamped events—narrative records of what happened on one code path. They carry arbitrary context: order IDs, error stack traces, business decisions. Strength: forensic detail for a single request. Weakness: volume and cost—grep does not scale without indexing; high-cardinality search is expensive.

Metrics are numeric aggregates over time—counters, gauges, histograms with low-cardinality labels. They answer “how much” and “how fast” at fleet scale: requests per second, error ratio, p99 latency. Strength: cheap storage, long retention, fast alerting. Weakness: you cannot metric every user ID—aggregates hide individual failures.

Traces are trees of spans showing causality and timing across services—one trace follows one request end-to-end. They answer “where did the time go?” and “which hop failed?” Strength: critical path across ten services. Weakness: sampling required at scale; instrumentation effort; async gaps without careful propagation.

Signal	Best for	Typical store	Alerting?
Logs	Debug one request, audit trail, error context	Elasticsearch, Loki, CloudWatch Logs	Rarely — too noisy
Metrics	SLOs, capacity, fleet health, burn rate	Prometheus, Datadog, CloudWatch Metrics	Primary alert source
Traces	Latency breakdown, dependency map, root cause	Jaeger, Tempo, Zipkin, Honeycomb	Via derived metrics or tail sampling

flowchart LR
  APP[Microservices] --> OTEL[OpenTelemetry SDK]
  OTEL --> LOG[Log backend]
  OTEL --> PROM[Prometheus]
  OTEL --> TRACE[Trace backend]
  PROM --> GRAF[Grafana]
  TRACE --> GRAF
  LOG --> GRAF

Why all three: metrics fire the alert (“checkout p99 > 2s”); traces show the slow span (Payment provider timeout); logs explain the business context (Stripe returned 402, card declined). Without metrics you discover outages from Twitter. Without traces you grep twelve services by timestamp. Without logs you see a slow span but not why it retried three times.

Golden rule: express fleet health as metrics; express per-request forensics as logs and traces. Never log PII or secrets—see Security → API hardening.

Trace context propagation — W3C and B3

A trace only works if every hop forwards the same context. Standard headers prevent each team inventing incompatible propagation.

W3C Trace Context (preferred)

The W3C standard defines traceparent and optional tracestate HTTP headers. traceparent format: version-trace_id-parent_id-flags (e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01). trace_id identifies the whole request tree; parent_id is the calling span; flags bit 0 indicates sampled. gRPC uses equivalent metadata keys; message queues embed the same values in record headers.
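To make the header layout concrete, here is a minimal parsing sketch that uses only the JDK. The `TraceParent` record and its method names are hypothetical, it handles the version 00 layout only, and in a real service the OpenTelemetry propagator would do this for you.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for a parsed traceparent header (version 00 layout only). */
record TraceParent(String version, String traceId, String parentId, int flags) {

    // version (2 hex) - trace_id (32 hex) - parent_id (16 hex) - flags (2 hex), lowercase
    private static final Pattern FORMAT =
        Pattern.compile("([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    static Optional<TraceParent> parse(String header) {
        if (header == null) return Optional.empty();
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) return Optional.empty();
        // The spec treats version ff and all-zero IDs as invalid.
        if (m.group(1).equals("ff") || allZero(m.group(2)) || allZero(m.group(3))) {
            return Optional.empty();
        }
        return Optional.of(new TraceParent(
            m.group(1), m.group(2), m.group(3), Integer.parseInt(m.group(4), 16)));
    }

    private static boolean allZero(String hex) {
        return hex.chars().allMatch(c -> c == '0');
    }

    /** Bit 0 of the flags byte is the sampled flag. */
    boolean sampled() {
        return (flags & 0x01) != 0;
    }

    /** Header for an outbound call: same trace_id, the calling span becomes the parent. */
    String childHeader(String callingSpanId) {
        return version + "-" + traceId + "-" + callingSpanId + "-" + String.format("%02x", flags);
    }
}
```

Parsing the example header above yields the trace_id `4bf92f3577b34da6a3ce929d0e0e4736` with the sampled flag set; `childHeader` shows that only the parent_id changes from hop to hop.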

tracestate carries vendor-specific hints (sampling priority, tenant routing)—keep it small; never put secrets in baggage or tracestate.

B3 (Zipkin legacy, still common)

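B3 predates W3C and remains in older Spring Cloud Sleuth estates and some proxies. Headers include X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId, X-B3-Sampled (1/0), and the compact single-header b3 form. Header translation has to happen where requests are handled, not in the Collector (which only receives exported spans): configure OpenTelemetry SDKs or agents to extract and inject both formats (for example OTEL_PROPAGATORS=tracecontext,b3multi) so mixed fleets interoperate during migration.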
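The mapping between the two header families is mechanical, which the following sketch illustrates. The `B3ToW3c` class is hypothetical, it assumes lowercase hex IDs and an exact-case header map (real HTTP headers are case-insensitive), and it treats a missing sampling decision as "not sampled", which is a simplification.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: builds a W3C traceparent value from multi-header B3. */
final class B3ToW3c {

    static Optional<String> toTraceparent(Map<String, String> headers) {
        String traceId = headers.get("X-B3-TraceId");
        String spanId = headers.get("X-B3-SpanId");
        if (traceId == null || spanId == null) return Optional.empty();
        // B3 allows 64-bit (16 hex) or 128-bit (32 hex) trace IDs; span IDs are 64-bit.
        if (!traceId.matches("[0-9a-f]{16}|[0-9a-f]{32}") || !spanId.matches("[0-9a-f]{16}")) {
            return Optional.empty();
        }
        // W3C trace IDs are always 128-bit, so left-pad a 64-bit B3 ID with zeros.
        String padded = traceId.length() == 16 ? "0000000000000000" + traceId : traceId;
        // Simplification: anything other than an explicit "1" is emitted as not sampled.
        String flags = "1".equals(headers.get("X-B3-Sampled")) ? "01" : "00";
        return Optional.of("00-" + padded + "-" + spanId + "-" + flags);
    }
}
```

The current B3 span ID becomes the W3C parent_id because, from the receiver's point of view, the caller's span is the parent of whatever span it starts next.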

Header	Standard	Purpose
traceparent	W3C	trace_id + parent span_id + sampled flag
tracestate	W3C	Vendor extensions, sampling hints
X-B3-TraceId	B3	64- or 128-bit trace identifier
X-B3-SpanId	B3	64-bit current span id
X-B3-Sampled	B3	Whether trace is recorded

sequenceDiagram
  participant GW as API Gateway
  participant Ord as Order Service
  participant Inv as Inventory Service
  GW->>Ord: traceparent W3C
  Ord->>Inv: forward traceparent
  Note over GW,Inv: Same trace_id in every span

⚠️ Pitfall

Load balancers or API gateways that strip unknown headers break traces silently. Allowlist traceparent, tracestate, and B3 headers in every proxy config.

Sampling strategies — head, tail, and adaptive

100% trace capture at production traffic volume bankrupts storage and adds latency. Sampling decides which traces to keep while preserving debuggability.

Head-based sampling

Decision at trace start—usually the ingress gateway or first service. “Keep 10% of all traces” via random or consistent hash. Pros: simple, predictable cost, no buffering. Cons: you may discard the one slow/error trace you needed—bad luck on a 1% sample rate. Spring Boot: management.tracing.sampling.probability: 0.1.
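A consistent-hash head sampler can be sketched as below: the decision is derived from the trace_id itself, so any service that evaluates the same ID at the same rate reaches the same verdict. This is an illustration in the spirit of ratio-based samplers, not the OpenTelemetry SDK's implementation; `HeadSampler` is a hypothetical name and the input is assumed to be a valid 32-character hex trace_id.

```java
/** Hypothetical deterministic head sampler keyed on the trace_id. */
final class HeadSampler {
    private final double probability;
    private final long threshold;

    HeadSampler(double probability) {
        if (probability < 0.0 || probability > 1.0) {
            throw new IllegalArgumentException("probability must be in [0, 1]");
        }
        this.probability = probability;
        // Fraction of the non-negative 63-bit range that counts as "keep".
        this.threshold = (long) (probability * Long.MAX_VALUE);
    }

    boolean shouldSample(String traceIdHex) {
        if (probability >= 1.0) return true;
        // Use the low 64 bits of the trace_id as the pseudo-random value.
        long low64 = Long.parseUnsignedLong(traceIdHex.substring(16), 16);
        // Clear the sign bit so the comparison runs over a non-negative range.
        return (low64 & Long.MAX_VALUE) < threshold;
    }
}
```

Because the decision is a pure function of the ID, raising the rate from 10% to 50% keeps every trace that was kept before and adds more, which makes before/after comparisons easier than with independent random draws.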

Tail-based sampling

Decision after trace completes—buffer spans in OpenTelemetry Collector, then keep traces matching rules: status=error, duration > 2s, specific attribute (tenant=enterprise), or random remainder to fill quota. Pros: always retain interesting traces. Cons: memory buffering, complexity, slight export delay. Essential for high-traffic prod where head-only sampling misses rare failures.
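The keep/drop rule applied once a trace is complete can be sketched in a few lines. `SpanSummary` and `TailPolicy` are hypothetical types for illustration; a real deployment would express the same rules as Collector configuration rather than application code, and the buffering step is omitted here.

```java
import java.util.List;

/** Hypothetical, minimal view of a finished span. */
record SpanSummary(String name, long durationMs, boolean error) {}

/** Hypothetical tail-sampling rule: keep errors, keep slow traces, keep a small baseline. */
final class TailPolicy {
    private final long slowThresholdMs;
    private final int baselinePercent;

    TailPolicy(long slowThresholdMs, int baselinePercent) {
        this.slowThresholdMs = slowThresholdMs;
        this.baselinePercent = baselinePercent;
    }

    /** Called only after all spans of the trace have been buffered. */
    boolean keep(String traceId, List<SpanSummary> spans) {
        boolean anyError = spans.stream().anyMatch(SpanSummary::error);
        long longestMs = spans.stream().mapToLong(SpanSummary::durationMs).max().orElse(0);
        if (anyError || longestMs > slowThresholdMs) {
            return true;
        }
        // Baseline: a stable bucket in [0, 100) derived from the trace_id.
        return Math.floorMod(traceId.hashCode(), 100) < baselinePercent;
    }
}
```

The cost named above is visible in the signature: `keep` needs the whole span list, so something has to hold every in-flight trace in memory until it finishes or a wait timer expires.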

Adaptive sampling

Dynamically adjust rate based on traffic volume, error rate, or SLO burn—Honeycomb and Datadog offer this natively; self-hosted stacks approximate with Collector processors plus rate limits per service. During incidents, temporarily raise sample rate for affected services; lower during steady state.

Strategy	When to use	Trade-off
Head 1–10%	Default prod baseline	May miss rare paths
Head 100%	Staging, load test, low-traffic services	Storage cost
Tail keep errors	Prod high traffic	Collector memory
Adaptive	Variable traffic, incident mode	Vendor or custom logic

📦 Real World

E-commerce teams often run 5–10% head sampling in prod plus tail rules: keep all 5xx, all traces > 2s, 1% random baseline. Staging keeps 100% for regression comparison.

OpenTelemetry — the standard

Instrument once, export anywhere. OTel unifies traces, metrics, and logs under one vendor-neutral API and wire protocol (OTLP). It is a CNCF project, and most observability backends now accept OTLP directly or via the Collector.

Components

API — interfaces in application code (Span, Meter, Logger)
SDK — implementation: sampling, batching, resource attributes (service.name, deployment.environment)
Instrumentation libraries — auto hooks for Spring Web, JDBC, Kafka, gRPC, HTTP clients
Collector — receive OTLP, process (filter, sample, enrich), export to Jaeger, Tempo, Prometheus, Loki

Java teams often start with opentelemetry-javaagent.jar attached to the JVM—zero-code HTTP, DB, and messaging spans. Add manual spans for business operations (place-order, charge-payment) where auto-instrumentation stops at framework boundaries. Spring Boot 3 integrates via Micrometer Tracing bridge exporting OTLP.

flowchart LR
  SVC[Spring Boot pods] -->|OTLP gRPC 4317| COL[OTel Collector]
  COL --> TEMPO[Tempo or Jaeger]
  COL --> PROM[Prometheus remote write]
  COL --> LOG[Loki exporter]

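processors:
  tail_sampling:
    # Buffer each trace's spans this long before deciding (example value; should exceed your slowest request)
    decision_wait: 10s
    policies:
      # A trace is kept if any policy below matches
      - name: errors        # every trace containing an error span
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow          # every trace slower than 2 s
        type: latency
        latency: { threshold_ms: 2000 }
      - name: baseline      # 5% random sample as a healthy-traffic baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }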

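import io.opentelemetry.api.trace.Span;
import io.opentelemetry.instrumentation.annotations.WithSpan;

// Wraps the method in a span named "place-order" (requires the Java agent or the OTel Spring starter)
@WithSpan("place-order")
public OrderId placeOrder(PlaceOrderCommand cmd) {
  // Business attribute on the current span; use counts and IDs, never PII
  Span.current().setAttribute("order.line_count", cmd.lines().size());
  return orderRepository.save(cmd.toOrder()).id();
}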

Baggage propagates optional key-values (tenant region, experiment flag) alongside trace context—use sparingly, never for secrets or large payloads.

Jaeger, Zipkin, and Tempo as backends

All ingest OTLP from the Collector; they differ in storage model, ops burden, and Grafana integration.

Backend	Storage	Strengths	Typical fit
Jaeger	Cassandra, Elasticsearch, Badger, memory	Mature UI, K8s operator, query by tags	Teams wanting dedicated trace UI
Zipkin	In-memory, Elasticsearch, Cassandra	Simple, lightweight, B3 native	Legacy Sleuth, small deployments
Grafana Tempo	Object storage (S3, GCS, Azure Blob)	Cheap at scale, native Grafana, TraceQL	Cloud-native, LGTM stack (Loki, Grafana, Tempo, Mimir)

Jaeger offers service dependency graphs, comparison UI, and adaptive sampling plugins. Operational cost rises with Elasticsearch/Cassandra unless you use object-storage backends via Jaeger v2 components.

Tempo stores blocks in object storage, so cost scales with retained gigabytes rather than indexed span count. Query via Grafana Explore or TraceQL ({ resource.service.name = "order-service" && duration > 1s }); service.name is a resource attribute, so it takes the resource. scope rather than span. Pair with Loki for “logs for this trace_id” and Prometheus for exemplars linking metrics to traces.

⚖️ Trade-off

Indexed trace stores (Elasticsearch-backed Jaeger) enable rich search but explode cost at billions of spans. Tempo trades ad-hoc search for object-storage economics—know your query patterns before choosing.

Trace-based debugging — root cause across ten services

A trace is a tree of spans. The waterfall view answers which hop ate 900 ms of a 1 s budget—and whether errors propagated or were masked by retries.

Span kinds: SERVER (incoming HTTP), CLIENT (outbound call), PRODUCER/CONSUMER (messaging), INTERNAL (in-process). Parent-child links preserve causality; links connect async work started before parent span ended. Tag spans with business context: order.id, payment.provider—not PII.

Incident workflow with traces

1. Alert fires: checkout p99 SLO burn (from Prometheus)
2. Grafana exemplar or Loki log line yields trace_id
3. Jaeger/Tempo waterfall: Payment CLIENT span 820 ms, Inventory SERVER 22 ms
4. Drill into the Payment span: Stripe timeout after retry; circuit breaker half-open
5. Mitigate: extend timeout temporarily, scale payment pods, disable promotion flag

flowchart TB
  GW[Gateway 12ms] --> Ord[Order 45ms]
  Ord --> Cat[Catalog 8ms]
  Ord --> Inv[Inventory 22ms]
  Ord --> Pay[Payment 820ms]
  Ord --> Notif[Notification async]
  Pay --> Stripe[Stripe API timeout]

Critical path analysis: Catalog and Inventory run parallel—longest branch (Payment) dominates user latency. Optimize Payment first; caching Catalog does nothing if Payment p99 is 800 ms. During canary deploys, compare trace latency distributions between stable and canary versions—pairs with Service Mesh → Canary traffic split.
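The critical-path reasoning can be reproduced with a small recursive walk over the span tree. This sketch assumes that the numbers in the diagram are each span's own (self) time and that every fan-out runs its children in parallel; both are simplifications, and `SpanNode` and `CriticalPath` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical span tree node: own time plus child calls. */
record SpanNode(String name, long selfMs, List<SpanNode> children) {
    static SpanNode leaf(String name, long selfMs) {
        return new SpanNode(name, selfMs, List.of());
    }
}

final class CriticalPath {

    /** Latency of a subtree when children run in parallel: self time plus the slowest child. */
    static long totalMs(SpanNode node) {
        long slowestChild = 0;
        for (SpanNode child : node.children()) {
            slowestChild = Math.max(slowestChild, totalMs(child));
        }
        return node.selfMs() + slowestChild;
    }

    /** Follows the slowest child at every level, from the root down to a leaf. */
    static List<String> path(SpanNode root) {
        List<String> names = new ArrayList<>();
        SpanNode current = root;
        while (current != null) {
            names.add(current.name());
            SpanNode slowest = null;
            for (SpanNode child : current.children()) {
                if (slowest == null || totalMs(child) > totalMs(slowest)) {
                    slowest = child;
                }
            }
            current = slowest;
        }
        return names;
    }
}
```

With Gateway 12 ms, Order 45 ms, and children Catalog 8 ms, Inventory 22 ms, Payment 820 ms, the path is Gateway, Order, Payment with a total of 877 ms. Setting Catalog to 0 ms leaves the total unchanged, which is the point made above about caching Catalog.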

Common patterns in multi-service traces: retry amplification (one user request → five downstream attempts visible as repeated CLIENT spans), missing spans (service not instrumented—gap in waterfall), clock skew (child starts before parent—use relative duration not absolute timestamps), and fire-and-forget async (trace ends at gateway while Kafka consumer span appears orphaned without link).

Correlation ID pattern — one thread through the system

Support sends “order 8f2a failed”—you need every log and span for that journey, not grep by timestamp hoping clock skew cooperates.

Request ID (X-Request-Id) — human-friendly identifier generated at the edge gateway or accepted from client if UUID-shaped and validated. Propagate on every outbound HTTP header and Kafka message envelope. Appears in support tools and API responses for user-facing correlation.

Trace ID — from W3C traceparent; ties all spans and should appear in every structured log via MDC. Best practice: populate log MDC from OpenTelemetry context automatically—Java agent or Micrometer tracing bridge sets trace_id and span_id without manual filter code in every service.

sequenceDiagram
  participant GW as API Gateway
  participant Ord as Order Service
  participant Inv as Inventory
  participant K as Kafka
  GW->>Ord: traceparent plus X-Request-Id
  Ord->>Inv: forward headers
  Ord->>K: headers in record
  Note over GW,K: Same trace_id in logs and spans

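import java.util.Optional;
import java.util.UUID;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import org.springframework.web.server.WebFilter;
import org.springframework.web.server.WebFilterChain;
import reactor.core.publisher.Mono;

@Component
public class CorrelationFilter implements WebFilter {
  @Override
  public Mono<Void> filter(ServerWebExchange ex, WebFilterChain chain) {
    String requestId = Optional.ofNullable(ex.getRequest().getHeaders().getFirst("X-Request-Id"))
        // Accept only the canonical 8-4-4-4-12 UUID shape; anything else is replaced
        .filter(id -> id.matches("[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"))
        .orElseGet(() -> UUID.randomUUID().toString());
    // set (not add) so the response carries exactly one X-Request-Id value
    ex.getResponse().getHeaders().set("X-Request-Id", requestId);
    return chain.filter(ex)
        .contextWrite(ctx -> ctx.put("request_id", requestId));
  }
}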

Gateway should reject or replace malformed IDs—never trust client-supplied IDs for auth, only correlation. Document headers in OpenAPI per Service Design API standards.

RED and USE — what to measure

Two mnemonic frameworks prevent dashboard sprawl: RED for request-driven services, USE for resources (CPU, disk, queues, pools).

RED — for services

Every synchronous microservice exposing HTTP or gRPC should dashboard these three:

Rate — requests per second; traffic volume and capacity planning input
Errors — ratio of failed requests (5xx, timeouts, gRPC UNAVAILABLE); split by dependency when possible
Duration — latency distribution: p50, p95, p99—never alert on average alone

Spring Boot + Micrometer expose http.server.requests with method, status, uri tags—normalize uri to templated paths (/orders/{id}) or cardinality explodes. Istio sidecars export equivalent RED without code—see Service Mesh → Observability.

USE — for resources

Utilization — fraction of time busy: CPU, memory pressure, JDBC pool active connections
Saturation — work waiting: queue depth, thread pool rejections, disk IO wait, Kafka consumer lag
Errors — device/software errors: OOM kills, TCP retransmits, disk read failures

Healthy RED on Order Service while connection pool saturation climbs is an early warning: once the pool is exhausted, requests queue and then fail, but RED shows nothing until that point. Watch bulkhead rejections and retry storms from Resilience → Tuning as leading indicators.

💡 Pro Tip

Dashboard layout per service: SLO burn strip on top, RED row middle, USE/resource row bottom. Same layout everywhere—on-call muscle memory at 3 a.m.

Prometheus — pull-based scraping and PromQL

Prometheus scrapes metrics HTTP endpoints on an interval, stores time series in TSDB, and powers PromQL alerts—the de facto standard in Kubernetes.

Pull model

Prometheus pulls from /actuator/prometheus or /metrics every 15–30s. Kubernetes: Prometheus Operator ServiceMonitor CRD selects services by label; PodMonitor for direct pod scrape. Short-lived batch jobs use Pushgateway sparingly—easy to misuse with stale metrics.

Metric types and PromQL basics

Counter — monotonic (total requests); use rate() or increase() over a window
Gauge — point-in-time (queue depth, memory)
Histogram — buckets for latency SLOs; histogram_quantile(0.99, ...)

# Request rate — R in RED
sum(rate(http_server_requests_seconds_count{application="order-service"}[5m]))

# Error ratio — E in RED
sum(rate(http_server_requests_seconds_count{application="order-service",status=~"5.."}[5m]))
/ sum(rate(http_server_requests_seconds_count{application="order-service"}[5m]))

# p99 latency — D in RED
histogram_quantile(0.99,
  sum(rate(http_server_requests_seconds_bucket{application="order-service"}[5m])) by (le))
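It helps to know how a quantile is estimated from buckets, because that explains why bucket boundaries limit accuracy. The sketch below shows linear interpolation inside the bucket that contains the target rank, which is the general approach `histogram_quantile` takes for classic histograms. It is a simplified illustration rather than Prometheus code; `HistogramQuantile` is a hypothetical name, and it assumes at least two buckets with the last one being +Inf.

```java
/** Hypothetical sketch of quantile estimation from cumulative histogram buckets. */
final class HistogramQuantile {

    /**
     * @param q           quantile in (0, 1], for example 0.99
     * @param upperBounds bucket upper bounds ("le"), ascending, last entry +Infinity
     * @param cumulative  cumulative observation count per bucket, same order
     */
    static double estimate(double q, double[] upperBounds, double[] cumulative) {
        int n = upperBounds.length;
        double total = cumulative[n - 1];
        if (total == 0) return Double.NaN;
        double rank = q * total;

        // Find the first bucket whose cumulative count reaches the rank.
        int i = 0;
        while (i < n - 1 && cumulative[i] < rank) i++;

        // Rank falls in the +Inf bucket: the best available answer is the highest finite bound.
        if (i == n - 1) return upperBounds[n - 2];

        double lowerBound = (i == 0) ? 0.0 : upperBounds[i - 1];
        double countBelow = (i == 0) ? 0.0 : cumulative[i - 1];
        // Assume observations are spread evenly inside the bucket.
        return lowerBound
            + (upperBounds[i] - lowerBound) * (rank - countBelow) / (cumulative[i] - countBelow);
    }
}
```

With bounds 0.5, 1, 2, +Inf and cumulative counts 900, 980, 995, 1000, the p99 estimate lands about two thirds of the way through the 1 s to 2 s bucket. A bucket boundary exactly at the SLO threshold (2 s here) makes the "good events" count exact even though the quantile itself is an estimate.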

Label hygiene and recording rules

Each unique label combination is a separate time series. A raw label such as uri="/orders/12345" creates one series per order ID, potentially millions, and can exhaust Prometheus memory. Recording rules pre-aggregate hot queries (job:http_requests:rate5m) for faster dashboards. Exemplars attach a trace_id to histogram buckets, so Grafana can jump from a latency spike to an example trace.
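Where a framework does not template paths for you (a hand-rolled metric or a custom gateway filter, for instance), path segments can be collapsed before they become label values. The following is a rough sketch with a hypothetical `UriTemplates` helper; it only recognises numeric and UUID segments, so other ID formats would need their own rules.

```java
import java.util.regex.Pattern;

/** Hypothetical helper that collapses ID-like path segments into a bounded label value. */
final class UriTemplates {

    // A whole path segment that is a canonical UUID.
    private static final Pattern UUID_SEGMENT = Pattern.compile(
        "/[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}(?=/|$)");

    // A whole path segment made only of digits.
    private static final Pattern NUMERIC_SEGMENT = Pattern.compile("/\\d+(?=/|$)");

    static String normalize(String rawPath) {
        // Query strings are unbounded too, so drop them first.
        String path = rawPath.split("\\?", 2)[0];
        String withoutUuids = UUID_SEGMENT.matcher(path).replaceAll("/{id}");
        return NUMERIC_SEGMENT.matcher(withoutUuids).replaceAll("/{id}");
    }
}
```

`normalize("/orders/12345")` returns `/orders/{id}`, so every order shares one series instead of creating its own.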

🚫 Anti-Pattern

Alerting on log counts in Elasticsearch—expensive, laggy, duplicates Prometheus. Use the right signal: metrics for aggregates, logs for forensics.

Micrometer and Spring Boot Actuator

Micrometer is the metrics facade in Spring Boot 3—one API, export to Prometheus, OTLP, Datadog without rewriting instrumentation.

management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  metrics:
    tags:
      application: ${spring.application.name}
      environment: ${ENVIRONMENT:local}
  tracing:
    sampling:
      probability: 0.1
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces

Protect /actuator/prometheus with network policy—not public internet. Custom metrics: Counter for orders placed, Timer for payment latency, Gauge for queue depth. Low-cardinality tags only: region, payment_method—not user_id. Resilience4j circuit breaker metrics integrate automatically—alert when breaker stays OPEN.

Grafana — service dashboards and SLO dashboards

Grafana queries Prometheus, Loki, Tempo, and Jaeger in one UI—dashboards become shared language between dev, SRE, and product during incidents.

Service dashboards

Start with user journeys (Checkout, Search), not pod lists. Per service panel row: RED metrics with variables for namespace, cluster, service. Bottom row: USE—CPU, JVM heap, DB pool, Kafka lag. One dashboard template cloned per service keeps on-call consistent.

SLO dashboards

Dedicated board per critical journey: SLI gauge (current availability/latency), error budget remaining (%), burn rate over 1h/6h/24h windows. Use the multi-window, multi-burn-rate alerting pattern from the Google SRE Workbook: fast burn pages, slow burn opens a ticket. Link panels to runbooks: “If checkout error rate > 2% for 5m → check Stripe status, scale payment pods, RUNBOOK-042.”

Alert type	Example	Action
Symptom	Checkout p99 > 2s for 10m	Page on-call — user impact
Cause	Inventory pod restart loop	Ticket — may explain symptom
Capacity	CPU > 70% sustained 1h	Scale HPA — see Deployment

Route pages via Grafana unified alerting or Prometheus Alertmanager by severity and team. Page humans on symptom-based SLO burn—not every CPU blip. Silence windows during planned maintenance with documented annotations.

Alerting rules — SLI, SLO, SLA, and error budget

An SLO is a target users feel—99.9% of checkout requests succeed in under 2 seconds—not “three nines on CPU.” Error budgets translate reliability into shared currency.

Term	Meaning	Example
SLI	Measurable indicator of service level	Ratio of checkout HTTP 200 with latency < 2s
SLO	Target SLI over rolling window	99.5% of checkouts meet SLI over 30 days
SLA	Contract with customer penalties	99.0% monthly or credits; legal/commercial, set looser than the internal SLO
Error budget	Allowed unreliability = 100% − SLO	0.5% budget ≈ 3.6 h of failure per 30 days at 99.5%

SLI implementation in Prometheus: ratio of good events to total over window. Good events: status=~"2..", le="2" on histogram bucket or dedicated success counter. SLO target 99.5% over 30d → alert when burn rate consumes budget too fast.

# Fast burn: 14.4x budget consumption in 1h → page
# Slow burn: 6x budget consumption in 6h → ticket
# When budget exhausted → freeze risky releases
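The arithmetic behind those thresholds is short enough to write out. In this sketch the burn rate is the observed error ratio divided by the budgeted error ratio (1 − SLO), and the share of budget consumed is the burn rate times the alert window over the SLO period. `BurnRate` and its method names are hypothetical.

```java
/** Hypothetical helper for error-budget and burn-rate arithmetic. */
final class BurnRate {

    /** How many times faster than "exactly on budget" errors are occurring. */
    static double burnRate(double observedErrorRatio, double sloTarget) {
        return observedErrorRatio / (1.0 - sloTarget);
    }

    /** Fraction of the whole period's budget used if this burn rate lasts for the window. */
    static double budgetConsumed(double burnRate, double windowHours, double periodHours) {
        return burnRate * windowHours / periodHours;
    }

    /** Error budget expressed as hours of total failure over the period. */
    static double budgetHours(double sloTarget, double periodHours) {
        return (1.0 - sloTarget) * periodHours;
    }
}
```

For a 99.5% SLO over 30 days (720 h), a 7.2% error ratio is a 14.4x burn, and one hour of that consumes 2% of the monthly budget. A 6x burn sustained for six hours consumes 5%, and the full budget corresponds to 3.6 hours of total failure.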

flowchart TB
  SLI[Measure SLI from metrics] --> SLO{Meet SLO?}
  SLO -->|yes| SHIP[Allow feature releases]
  SLO -->|burning| SLOW[Investigate and fix]
  SLO -->|exhausted| FREEZE[Freeze risky changes]

Tie resilience timeouts to SLO math: checkout budget 2 s with five sequential hops—no single hop gets 2 s. Aligns with Resilience → Tuning. Canary promotion gates on SLO—see Deployment.
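One way to keep per-hop timeouts inside the end-to-end budget is to carry a deadline through the request and cap each call at whatever time remains. The sketch below is an assumption about how that could look, not something the resilience libraries mentioned here provide out of the box; `RequestDeadline` is a hypothetical class, and the injectable clock exists only to make it testable.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

/** Hypothetical end-to-end deadline shared by all sequential hops of one request. */
final class RequestDeadline {
    private final LongSupplier nanoClock;
    private final long deadlineNanos;

    RequestDeadline(Duration totalBudget, LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
        this.deadlineNanos = nanoClock.getAsLong() + totalBudget.toNanos();
    }

    static RequestDeadline startingNow(Duration totalBudget) {
        return new RequestDeadline(totalBudget, System::nanoTime);
    }

    /** Time left before the user-facing budget is gone; never negative. */
    Duration remaining() {
        long left = deadlineNanos - nanoClock.getAsLong();
        return left > 0 ? Duration.ofNanos(left) : Duration.ZERO;
    }

    /** Timeout for the next hop: its own cap, or the remaining budget if that is smaller. */
    Duration timeoutFor(Duration perHopCap) {
        Duration left = remaining();
        return left.compareTo(perHopCap) < 0 ? left : perHopCap;
    }
}
```

With a 2 s checkout budget and an 800 ms cap per hop, a call made after 1.5 s has elapsed gets only 500 ms, and a call made after the deadline gets zero, which is the signal to fail fast instead of starting work the user will never see.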

🎯 Interview Tip

One SLI/SLO pair: “SLI = successful checkout under 2s; SLO = 99.5% over 30d; alert on 14.4× burn in 1h.” Shows implementation depth, not definitions only.

ELK Stack — Elasticsearch, Logstash, Kibana

The classic centralized logging stack: ingest, parse, index full text, search and visualize in Kibana. Powerful—and expensive at scale.

Elasticsearch stores inverted indexes of log fields—fast full-text search, aggregations, and complex filters (trace_id:abc AND level:ERROR AND service:order-service). Logstash (or Beats/Filebeat) ships logs from apps, parses grok/JSON, enriches with Kubernetes metadata, forwards to ES. Kibana provides Discover, dashboards, and alerting (prefer metric alerts in Prometheus; log alerts for security anomalies).

flowchart LR
  POD[App pods stdout] --> FB[Filebeat]
  FB --> LS[Logstash]
  LS --> ES[Elasticsearch]
  ES --> KB[Kibana]

Strengths: mature ecosystem, rich query DSL, security/compliance features (index lifecycle, frozen tiers). Weaknesses: indexing every field is costly—hot/warm/cold tier planning required; cardinality on high-volume INFO logs adds up fast. Mitigate with index templates mapping trace_id as keyword, sampling DEBUG, and ILM policies deleting indices after 7–30 days.

Elastic Agent and Fleet simplify deployment on Kubernetes—DaemonSet collects container logs, adds pod labels automatically. Alternative ingest: Fluent Bit lighter than Logstash for high-volume K8s estates.

⚖️ Trade-off

ELK excels when you need full-text search across arbitrary message content. If queries are always by label (namespace, service) plus a trace_id filter, Loki is usually much cheaper to run because it does not index log content.

Loki + Grafana — logs without indexing everything

Loki indexes only labels (much as Prometheus indexes metric labels), not the full content of each log line, which makes it dramatically cheaper at Kubernetes scale.

Logs stream to Loki with label sets: {namespace="prod", app="order-service", pod="order-7x2k"}. LogQL queries filter by labels first, then apply a grep-like filter to line content: {app="order-service"} |= "4bf92f3577b34da6". Pair with Grafana for a unified view: metrics spike → same dashboard → Loki logs → Tempo trace via derived fields.

Promtail (or Grafana Alloy) tails container logs and can promote extracted JSON fields to labels where cardinality allows (level, for example). Do not promote trace_id or user_id: both are unbounded, and every distinct value creates a new stream. Keep trace_id in the log line or in structured metadata (Loki 2.9+), which stays searchable without index blow-up, and use a Grafana derived field for the log→trace pivot.

{app="order-service"} | json | trace_id="4bf92f3577b34da6a3ce929d0e0e4736" | level="ERROR"

Aspect	Elasticsearch	Loki
Index model	Full-text inverted index	Labels + compressed chunks
Cost at scale	Higher — every field indexed	Lower — object storage friendly
Best queries	Arbitrary text search	Label-filtered stream grep
Grafana integration	Via Elasticsearch datasource	Native — LGTM stack

Structured logging — JSON with trace and service fields

Plain text logs parsed with regex break when someone adds a colon. JSON logs index reliably and join to traces via shared fields.

Every log line should carry structured fields: timestamp, level, service, trace_id, span_id, request_id, plus domain IDs (order_id) when known. Use SLF4J MDC populated from OpenTelemetry context in a servlet filter or WebFlux filter—clear MDC after request to prevent thread-pool leakage.

{
  "timestamp": "2026-06-04T03:14:22.891Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "request_id": "req-8f2a-991c",
  "message": "Payment charge failed",
  "order_id": "ord-7721",
  "error.type": "StripeTimeoutException"
}

<!-- logback-spring.xml: expose the Spring property to Logback before referencing it -->
<springProperty name="serviceName" source="spring.application.name"/>
<appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
  <encoder class="net.logstash.logback.encoder.LogstashEncoder">
    <includeMdcKeyName>trace_id</includeMdcKeyName>
    <includeMdcKeyName>span_id</includeMdcKeyName>
    <includeMdcKeyName>request_id</includeMdcKeyName>
    <customFields>{"service":"${serviceName}"}</customFields>
  </encoder>
</appender>

Log levels: ERROR action-required, WARN recoverable anomaly, INFO business milestones, DEBUG off in prod (enable per-request via feature flag during incidents). Avoid full payloads—truncate and hash identifiers. OpenTelemetry Logs bridge (experimental) correlates logs to spans natively.

⚠️ Pitfall

Logging inside tight loops or on every health check floods the pipeline and hides real errors. Sample debug or use metrics counters instead.

Log aggregation patterns in Kubernetes

Containers write to stdout/stderr; the platform must collect, label, and ship without app changes. Choose node-level, sidecar, or logging agent patterns deliberately.

Node-level collection (recommended default)

DaemonSet agent (Promtail, Fluent Bit, Filebeat) on every node tails /var/log/containers/*.log, enriches with Kubernetes API metadata (namespace, pod, container, labels), ships to Loki or Elasticsearch. Zero app change—apps log JSON to stdout. Lowest overhead for most microservices.

Sidecar pattern

Second container in pod tails shared volume or stdout relay—useful when app writes to file instead of stdout, or when you need local parsing before ship. Cost: extra container memory/CPU per pod—avoid fleet-wide unless required.

Direct export from app

App sends logs via OTLP or HTTP to Collector/Loki—useful for serverless or when DaemonSet access is restricted. Requires app library config; ensure backoff on collector outage so logging does not block requests.

Pattern	Pros	Cons
DaemonSet	No app change, uniform labels	Node-level RBAC, shared fate on node
Sidecar	File-based apps, pod-local filter	Resource multiplier per pod
OTLP from app	Unified OTel pipeline	App must handle backpressure

Standardize labels across cluster: app.kubernetes.io/name, environment, cluster name. Exclude kube-system noise at ingest. Rotate and cap log volume per namespace with quotas in multi-tenant clusters.

Tracing async flows — Kafka, outbox, broken traces

HTTP propagation is straightforward; message queues break traces unless you inject W3C context into record headers and start consumer spans explicitly.

When Order Service publishes OrderPlaced via transactional outbox to Kafka, embed traceparent in message headers (Spring Kafka + Micrometer tracing configures this). Inventory consumer starts CONSUMER span linked to producer span—trace UI shows async continuation, not orphan spans.
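Stripped of framework support, propagation over a queue is just writing the traceparent value into the record headers on publish and reading it back on consume. The sketch below uses a plain `Map<String, byte[]>` as a stand-in for Kafka record headers so that it runs without the Kafka client; `KafkaTraceHeaders` is a hypothetical helper, and in practice the instrumentation mentioned above does this for you.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper; the Map stands in for Kafka record headers (name to raw bytes). */
final class KafkaTraceHeaders {
    static final String TRACEPARENT = "traceparent";

    /** Producer side: copy the current trace context into the outgoing record. */
    static void inject(Map<String, byte[]> recordHeaders, String traceparent) {
        recordHeaders.put(TRACEPARENT, traceparent.getBytes(StandardCharsets.UTF_8));
    }

    /** Consumer side: recover the producer's context if the header survived the hop. */
    static Optional<String> extract(Map<String, byte[]> recordHeaders) {
        byte[] raw = recordHeaders.get(TRACEPARENT);
        return raw == null
            ? Optional.empty()
            : Optional.of(new String(raw, StandardCharsets.UTF_8));
    }

    /** The trace_id the consumer span must reuse so both sides share one trace. */
    static Optional<String> traceId(Map<String, byte[]> recordHeaders) {
        return extract(recordHeaders)
            .map(value -> value.split("-"))
            .filter(parts -> parts.length == 4)
            .map(parts -> parts[1]);
    }
}
```

An empty result from `extract` is the "broken trace" case: the consumer has no choice but to start a new root span, which is why outbox relays and connectors that rebuild records must copy headers through.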

Without propagation, on-call sees disconnected spans and assumes Inventory is fast while missing 30s consumer lag. Complement traces with consumer lag metrics and DLQ depth alerts— see Communication → Kafka/outbox.

⚠️ Pitfall

Batch consumers processing 500 records in one poll create one giant span—use child spans per message or attribute batch size on parent span.

Mesh telemetry vs application instrumentation

Istio sees L4/L7 bytes; apps see business operations—duplicate spans if both create SERVER spans for the same request unless coordinated.

Mesh-only: uniform RED across polyglot services, mTLS verified traffic, no code deploy for basic metrics. App-only: business spans, custom metrics, works without sidecars on VMs. Hybrid (recommended): mesh for network metrics and mTLS audit; OTel in app for domain spans and log correlation; disable duplicate HTTP server spans in one layer via telemetry config.

Full comparison: Service Mesh → Observability. Kiali answers topology; Grafana answers SLOs; Jaeger/Tempo answers latency—same incident, three lenses.

⚖️ Trade-off

100% trace sampling bankrupts storage—tail sampling plus aggressive head sampling in prod; full sampling only in dev/staging load tests.

Production observability checklist

Gate new services on telemetry completeness before production traffic—not as a post-launch cleanup ticket.

Structured JSON logs with trace_id, span_id, request_id, service name—MDC cleared per request
RED metrics via Micrometer/Prometheus; actuator scrape endpoint not public
OpenTelemetry with W3C propagation on HTTP, gRPC, and Kafka; B3 translation at legacy boundaries
Head + tail sampling configured in Collector; staging at 100% for regression
Trace backend (Jaeger/Tempo) with retention policy aligned to incident needs (7–14 days typical)
Grafana service dashboards + SLO dashboard with multi-window burn alerts and runbook links
Centralized logging: Loki or ELK with DaemonSet collection and Kubernetes metadata labels
Consumer lag and DLQ alerts for async paths
Label cardinality reviewed—no unbounded uri or user_id metric tags
Health vs readiness probes distinct; probe traffic not logged at INFO
Exemplars enabled linking histograms to trace IDs where supported
Postmortem template; on-call rotation with escalation path