Observability: Metrics, Logs & Tracing
A pod in CrashLoopBackOff tells you something failed—not why. Production Kubernetes observability stitches three signals: metrics (what is trending), logs (what happened in sequence), and traces (which hop added latency). This chapter covers the CNCF stack on vanilla K8s and the built-in OpenShift monitoring, logging, and tracing operators.
Metrics Stack
Kubernetes exposes two classes of metrics: resource usage (CPU/memory per pod and node via cAdvisor/kubelet) and cluster state (how many pods are ready, PVCs bound, nodes NotReady). The metrics stack layers exporters, a time-series database, dashboards, and alerting on top of those signals.
flowchart LR
subgraph apps["Workloads"]
APP["App /metrics\nRED + USE"]
end
subgraph k8s["Kubernetes metrics"]
MS["metrics-server\nkubectl top"]
KSM["kube-state-metrics\nobject state"]
NE["node-exporter\nhost metrics"]
end
subgraph prom["Prometheus ecosystem"]
PO["Prometheus Operator"]
SM["ServiceMonitor\nPodMonitor"]
PR["Prometheus TSDB"]
AM["AlertManager"]
GF["Grafana"]
end
APP --> SM
KSM --> SM
NE --> SM
MS -.->|"HPA/VPA only"| HPA["HPA / VPA"]
PO --> SM --> PR
PR --> GF
PR --> AM
metrics-server
Aggregates resource metrics from kubelets and exposes the metrics.k8s.io API group. Powers kubectl top pods/nodes and the Horizontal Pod Autoscaler (CPU/memory utilization). It is not a long-term store: it keeps only the most recent samples in memory, refreshed on a short scrape interval, and serves no history.
- Deployed as a Deployment in kube-system (or operator-managed on managed clouds)
- Requires read access to the kubelet; served through the API aggregation layer
- Without it: HPA shows <unknown> targets; kubectl top fails
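To make the metrics-server dependency concrete, here is a minimal sketch of an HPA that scales on CPU utilization. The Deployment name, namespace, replica bounds, and 70% target are illustrative assumptions, not values from this chapter. The utilization percentage is computed against the pod's CPU requests, so the target Deployment needs requests set.

```yaml
# Sketch only: names and thresholds are hypothetical.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api
  namespace: team-payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource          # Resource metrics are served by metrics.k8s.io (metrics-server)
      resource:
        name: cpu
        target:
          type: Utilization   # percentage of the pod's CPU requests
          averageUtilization: 70
```

If metrics-server is unavailable, an HPA like this is the one that would show unknown targets in kubectl get hpa.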
kube-state-metrics
Watches API objects and emits Prometheus metrics about desired vs actual state—not container CPU. Essential for cluster health dashboards and alerts on scheduling failures, crash loops, and PVC issues.
- kube_pod_status_phase — pod phase (Pending, Running, Failed, …)
- kube_deployment_status_replicas_unavailable — rollout problems
- kube_node_status_condition — NotReady, MemoryPressure, DiskPressure
- kube_persistentvolumeclaim_status_phase — PVC stuck Pending
Prometheus Operator
A Kubernetes operator that manages Prometheus, Alertmanager, and related CRDs. Instead of hand-editing prometheus.yml, you declare scrape targets via ServiceMonitor and PodMonitor resources. The operator generates config, handles TLS, and reloads Prometheus on CRD changes.
Common install: kube-prometheus-stack Helm chart (Prometheus Community)—ships Prometheus, Alertmanager, Grafana, node-exporter, and kube-state-metrics in one release.
ServiceMonitor
Selects Services by label and defines scrape endpoints (port name, path, interval, TLS). Each Prometheus instance chooses which ServiceMonitors to honor through its own serviceMonitorSelector. With kube-prometheus-stack the selector typically matches a release: <helm-release-name> label, so the ServiceMonitor must carry that same label.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payments-api
  namespace: team-payments
  labels:
    release: kube-prometheus-stack # must match Prometheus serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - team-payments
  selector:
    matchLabels:
      app: payments-api
  endpoints:
    - port: http-metrics # named port on the Service
      path: /actuator/prometheus
      interval: 30s
      scrapeTimeout: 10s
      honorLabels: true
    - port: http-metrics
      path: /metrics
      interval: 30s
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: "go_.*"
          action: drop # drop noisy Go runtime metrics
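The other half of the label match lives on the Prometheus custom resource. The fragment below is a sketch of what that selector could look like. The resource name and namespace are assumptions, and with the Helm chart you would normally set these fields through chart values rather than editing the CR directly.

```yaml
# Sketch only: resource name and namespace are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kube-prometheus-stack-prometheus
  namespace: monitoring
spec:
  # Only ServiceMonitors carrying this label are turned into scrape config.
  serviceMonitorSelector:
    matchLabels:
      release: kube-prometheus-stack
  # An empty selector means "look in every namespace";
  # leaving the field unset restricts discovery to the Prometheus namespace.
  serviceMonitorNamespaceSelector: {}
```

A ServiceMonitor that exists but never appears on the targets page is most often a mismatch on one of these two selectors.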
Critical metrics to alert on
| Scope | Metric / signal | Why it matters |
|---|---|---|
| Pod | kube_pod_container_status_restarts_total | CrashLoopBackOff precursor; alert on restart rate spike |
| Pod | kube_pod_status_phase{phase="Pending"} | Scheduling or image pull failures |
| Pod | container_cpu_usage_seconds_total / requests | CPU throttling, HPA denominator issues |
| Pod | container_memory_working_set_bytes | OOMKill risk before kubelet evicts |
| Node | kube_node_status_condition{condition="Ready",status="true"} | Node loss; workloads reschedule abruptly |
| Node | node_memory_MemAvailable_bytes | Node pressure → eviction of best-effort pods |
| Node | node_filesystem_avail_bytes | Disk full → kubelet/image pull failures, etcd risk on control plane |
| Cluster | apiserver_request_total{code=~"5.."} | Control plane errors; everything downstream breaks |
| Cluster | etcd_server_has_leader | etcd quorum loss = cluster brain death |
| Cluster | kube_deployment_status_replicas_unavailable | Rollout stuck; users see partial outage |
| Cluster | cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests | Capacity planning; recording rules pre-aggregate for dashboards |
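As an illustration of how a few of these signals could become alerts, here is a hedged PrometheusRule sketch. The rule names, thresholds, durations, and the team-payments namespace filter are assumptions chosen for the example; tune them against your own baseline before paging anyone.

```yaml
# Sketch only: thresholds and names are hypothetical starting points.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-api-alerts
  namespace: team-payments
  labels:
    release: kube-prometheus-stack   # assumed to match the Prometheus ruleSelector
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodRestartingFrequently
          # More than 3 restarts in 15 minutes usually precedes CrashLoopBackOff.
          expr: increase(kube_pod_container_status_restarts_total{namespace="team-payments"}[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} is restarting repeatedly"
        - alert: PodStuckPending
          expr: sum by (namespace, pod) (kube_pod_status_phase{namespace="team-payments", phase="Pending"}) > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} has been Pending for 15 minutes"
        - alert: DeploymentReplicasUnavailable
          expr: kube_deployment_status_replicas_unavailable{namespace="team-payments"} > 0
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Deployment {{ $labels.deployment }} has unavailable replicas"
```

The for clause keeps short blips during normal rollouts from firing; the severity label is what routing rules key on.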
Grafana
Visualization layer on top of Prometheus (and Loki, Tempo via data sources). Dashboard-as-code with grafonnet or JSON in git. Import community dashboards: 315 (Kubernetes cluster), 6417 (kube-state-metrics), 12006 (USE method). Tie every panel to a runbook link annotation.
AlertManager
Receives firing alerts from Prometheus, then deduplicates, groups, silences, and routes them to PagerDuty, Slack, or email. Define alerts as PrometheusRule CRDs; configure Alertmanager itself through AlertmanagerConfig (namespaced) or a Secret on vanilla installs.
- inhibition — suppress node-disk alerts when node-not-ready already firing
- group_by — one Slack message per deployment, not per pod
- severity labels — critical pages; warning tickets
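The grouping and severity routing described above could be expressed with a namespaced AlertmanagerConfig along these lines. The receiver names, Secret names and keys, channel, and timing values are all hypothetical. Inhibition is left out on purpose: the operator by default scopes an AlertmanagerConfig to alerts from its own namespace, so node-level inhibition usually belongs in the global Alertmanager configuration instead.

```yaml
# Sketch only: receivers, Secrets, and timings are hypothetical.
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: payments-routing
  namespace: team-payments
spec:
  route:
    receiver: slack-payments            # default: warnings go to Slack
    groupBy: ["alertname", "deployment"] # one notification per deployment, not per pod
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
    routes:
      - receiver: pagerduty-payments    # critical alerts page on-call
        matchers:
          - name: severity
            value: critical
            matchType: "="
  receivers:
    - name: slack-payments
      slackConfigs:
        - channel: "#payments-alerts"
          apiURL:                        # webhook URL read from a Secret in this namespace
            name: slack-webhook
            key: url
    - name: pagerduty-payments
      pagerdutyConfigs:
        - routingKey:
            name: pagerduty-key
            key: routingKey
```

The Alertmanager resource also has to select this object (alertmanagerConfigSelector) before any of it takes effect.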
$ kubectl get apiservice v1beta1.metrics.k8s.io -o wide
→ AVAILABLE column must be True for metrics-server
$ kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus
$ kubectl get servicemonitor -A
$ kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
→ open http://localhost:9090/targets; State should be UP

$ oc get clusteroperator monitoring
$ oc get servicemonitor -n openshift-user-workload-monitoring
$ oc adm top pods -A --sort-by=cpu | head -20
metrics-server scrapes kubelet /metrics/resource (cAdvisor subset) and aggregates in memory. Prometheus scrapes independently on its own interval. For CPU and memory targets, HPA does not read Prometheus; it queries metrics-server through the aggregation API (custom and external metrics need a separate adapter). Two different pipelines for two different purposes.
High-cardinality labels (user IDs, unbounded URL paths) explode Prometheus TSDB size and query latency. Bound labels at instrumentation time—use route="/users/:id" templates, not raw paths. Drop expensive labels in metricRelabelings if legacy apps cannot be fixed quickly.
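For the legacy-app case, a relabeling stopgap might look like the sketch below. The label names user_id and session_id and the metric name legacy_request_detail are invented for illustration. Be aware that dropping a label which is the only thing distinguishing two series makes them collide at scrape time, so dropping the whole metric is sometimes the safer option.

```yaml
# Sketch only: label and metric names are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: legacy-app
  namespace: team-payments
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: legacy-app
  endpoints:
    - port: http-metrics
      interval: 30s
      metricRelabelings:
        # Remove unbounded labels from every scraped series.
        - action: labeldrop
          regex: "user_id|session_id"
        # Or discard an entire metric that cannot be made bounded.
        - sourceLabels: [__name__]
          regex: "legacy_request_detail"
          action: drop
```

metricRelabelings run after the scrape and before ingestion, so the series never reach the TSDB, but the application still pays the cost of exposing them.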
Instrument apps with Prometheus client libraries: counters for requests/errors, histograms for latency (not summaries—histograms aggregate in PromQL). Expose on a dedicated port named in the Service (http-metrics) so ServiceMonitor can target it without scraping the main app port.
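A dedicated, named metrics port on the Service could be declared as in this sketch. The port numbers and the app label are assumptions; what matters is that the port name matches the port field in the ServiceMonitor endpoint, and that the ServiceMonitor selector matches the Service's labels rather than the pod's.

```yaml
# Sketch only: port numbers and labels are hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: payments-api
  namespace: team-payments
  labels:
    app: payments-api        # ServiceMonitor selectors match these Service labels
spec:
  selector:
    app: payments-api
  ports:
    - name: http             # main application traffic
      port: 8080
      targetPort: 8080
    - name: http-metrics     # referenced by name from the ServiceMonitor
      port: 9090
      targetPort: 9090
```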
"How do you monitor Kubernetes?" — Cover the three layers: kube-state-metrics for object health, cAdvisor/node-exporter for resource USE, app RED metrics via ServiceMonitor. Mention AlertManager routing and SLO-based alerts (error budget burn) over naive threshold paging.
OpenShift Monitoring Stack
OpenShift ships a managed Prometheus stack operated by the Cluster Monitoring Operator (CMO). Platform metrics live in openshift-monitoring; tenant workloads use user workload monitoring in a separate namespace. Do not fight the platform—extend it.
Built-in Prometheus (openshift-monitoring)
CMO deploys Prometheus, Alertmanager, Thanos querier, kube-state-metrics, node-exporter, and cluster monitoring RBAC. Scrapes platform components: API server, etcd, operators, CNI, registry. Dashboards appear in the OpenShift Console Observe → Metrics view without extra Grafana install.
- Prometheus pods: typically prometheus-k8s-0 and prometheus-k8s-1 (StatefulSet prometheus-k8s) in openshift-monitoring
- Alerting rules managed by CMO—platform SRE ownership
- Long-term storage via Thanos (optional external object store configuration)
Never install your own Prometheus in openshift-monitoring. That namespace is platform-managed. Competing scrape configs, duplicate operators, and manual edits are overwritten on upgrade. Use user workload monitoring for app metrics, or a dedicated monitoring namespace on vanilla patterns.
User workload monitoring
Enabled via cluster config (enableUserWorkload: true on the cluster-monitoring-config ConfigMap). Deploys a second Prometheus stack in openshift-user-workload-monitoring that scrapes ServiceMonitors in user namespaces.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
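Once enabled, user workload monitoring picks up ServiceMonitors and PodMonitors in user namespaces by default; label a namespace openshift.io/user-monitoring: "false" to exclude it. RBAC: a cluster admin grants the monitoring-edit role so project users can create ServiceMonitors, and monitoring-rules-edit for PrometheusRules, in their own namespaces.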
Console metrics
The OpenShift Console queries Thanos querier for both platform and user metrics. Developers see pod CPU/memory graphs on the Workloads → Pods detail page. Platform admins use Observe → Dashboards for etcd, API latency, and operator health.
flowchart TB
CMO["Cluster Monitoring Operator"]
subgraph platform["openshift-monitoring"]
PP["Prometheus\nplatform scrape"]
PA["Alertmanager"]
TQ["Thanos Querier"]
end
subgraph user["openshift-user-workload-monitoring"]
UP["Prometheus\nuser ServiceMonitors"]
UA["Alertmanager"]
end
CON["OpenShift Console\nObserve tab"] --> TQ
CMO --> platform
CMO --> user
UP --> TQ
PP --> TQ
$ # vanilla K8s: no openshift-monitoring namespace
$ kubectl get pods -n monitoring

$ oc get co monitoring
$ oc get prometheus -n openshift-monitoring
$ oc get prometheus -n openshift-user-workload-monitoring
$ oc get cm cluster-monitoring-config -n openshift-monitoring -o yaml
$ oc get prometheusrule -n team-payments
$ oc rsh -n openshift-monitoring prometheus-k8s-0
Platform Prometheus vs bring-your-own: User workload monitoring covers 80% of app teams without operating Prometheus. BYO (kube-prometheus-stack in a tenant namespace) makes sense when you need custom retention, federation to central TSDB, or multi-cluster Grafana—at the cost of another stack to patch.
Enterprise OCP teams route user-workload Alertmanager receivers to their own PagerDuty service—configured via user-workload-monitoring-config ConfigMap. Platform alerts stay with the cluster admin team; application SLO breaches go to product on-call.
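As a rough sketch of that setup, the ConfigMap below turns on a dedicated Alertmanager for user workloads and allows namespaced AlertmanagerConfig objects. The field names follow the OpenShift monitoring configuration as I understand it; availability depends on your OCP version, so treat this as an assumption to verify against the documentation for your release.

```yaml
# Sketch only: verify these keys against your OCP version before applying.
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      enabled: true                    # separate Alertmanager for user-defined alerts
      enableAlertmanagerConfig: true   # let teams define routing via AlertmanagerConfig
```

With this in place, each team can keep its PagerDuty or Slack receivers in its own namespace instead of asking the platform team to edit a shared configuration.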
Before upgrading OCP, check that oc get co monitoring reports AVAILABLE=True and DEGRADED=False. CMO rolls Prometheus versions forward as part of the cluster upgrade; custom edits to platform Prometheus CRs are unsupported and lost on reconcile.
Logging Stack
Container stdout/stderr is captured by the container runtime, under the kubelet's direction, and written to files under /var/log/pods/ on each node. A log collector DaemonSet tails those files, enriches with Kubernetes metadata, and forwards to a central store. Choose your backend based on query patterns, not hype.
flowchart LR
APP["Container\nstdout/stderr"]
KL["kubelet"]
LOGS["/var/log/pods/\n/var/log/containers/"]
COL["Collector DaemonSet\nFluent Bit / Vector"]
STORE["Backend"]
UI["Query UI"]
APP --> KL --> LOGS --> COL --> STORE --> UI
subgraph backends["Backend choices"]
EFK["EFK\nElasticsearch"]
LOKI["Loki\nlabel-indexed"]
end
STORE --- backends
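The collector stage of this pipeline is typically a DaemonSet with the node's log directory mounted read-only. The skeleton below is a sketch under several assumptions: the namespace, ServiceAccount, image tag, and resource figures are placeholders, and the Fluent Bit pipeline configuration (normally a mounted ConfigMap) and RBAC for the Kubernetes metadata lookup are omitted.

```yaml
# Sketch only: image tag, namespace, and resources are hypothetical; config and RBAC omitted.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
  labels:
    app.kubernetes.io/name: fluent-bit
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: fluent-bit
  template:
    metadata:
      labels:
        app.kubernetes.io/name: fluent-bit
    spec:
      serviceAccountName: fluent-bit     # needs read access to pods for metadata enrichment
      tolerations:
        - operator: Exists               # also run on tainted nodes, e.g. control plane
      containers:
        - name: fluent-bit
          image: cr.fluentbit.io/fluent/fluent-bit:3.1.9   # example tag; pin a version you have validated
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              memory: 256Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log        # covers /var/log/pods and /var/log/containers
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
```

One pod per node reads every container's log files, which is why a single noisy workload can affect log delivery for its neighbors.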
Collectors: Fluentd, Fluent Bit, Vector
| Agent | Characteristics | Typical role |
|---|---|---|
| Fluentd | Ruby/C, rich plugin ecosystem, higher memory footprint | Legacy EFK stacks; aggregation tier behind Fluent Bit |
| Fluent Bit | C, lightweight, CNCF graduated; Kubernetes filter built-in | Default node agent—Elastic, Loki, S3, Kafka outputs |
| Vector | Rust, VRL transform language, high throughput | Greenfield pipelines; replaces Fluent Bit when teams want programmable transforms |
EFK vs Loki
| Dimension | EFK (Elasticsearch + Fluent + Kibana) | Loki (+ Grafana) |
|---|---|---|
| Index model | Full-text inverted index on log content | Index labels only (namespace, pod, app); chunk storage cheap |
| Query style | Free-text search, complex aggregations | LogQL—filter by labels, parse JSON at query time |
| Cost at scale | Higher—RAM-heavy JVM, shard management | Lower—object storage (S3) for chunks |
| Best fit | Security analytics, full-text grep across unstructured logs | Kubernetes-native ops—correlate logs + metrics in Grafana |
Log sources on the node
- /var/log/pods/<ns>_<pod>_<uid>/<container>/<n>.log: CRI log format, one entry per line with timestamp, stream (stdout/stderr), a partial/full tag (P/F), and the message. JSON-per-line files came from Docker's legacy json-file driver
- /var/log/containers/: symlinks to pod logs; what most collectors tail
- /var/log/journal: systemd/kubelet logs (optional host journal input)
- /var/log/openshift/: OCP node and audit logs (platform-specific)
Structured JSON logging
Apps should emit one JSON object per line to stdout with stable field names (level, msg, trace_id, request_id). Collectors parse JSON in the pipeline (Fluent Bit parser filter, Vector VRL) so Loki/Elasticsearch can filter without regex hell. Correlate with traces by writing the trace_id and span_id from the active W3C trace context (the traceparent header) into every log line.
{
"timestamp": "2026-06-05T14:32:01.123Z",
"level": "ERROR",
"msg": "payment declined",
"service": "payments-api",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"span_id": "00f067aa0ba902b7",
"order_id": "ord-88421",
"http.status": 402,
"duration_ms": 127
}
OpenShift Logging Operator & ClusterLogForwarder
OpenShift Logging is managed by the Red Hat OpenShift Logging Operator, which replaced the legacy Ansible-based cluster-logging deployment. The ClusterLogForwarder (CLF) CR defines inputs (application, infrastructure, audit) and outputs (Elasticsearch, Loki, Kafka, CloudWatch, Splunk, syslog).
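apiVersion: observability.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  managementState: Managed
  outputs:
    - name: loki-out
      type: loki
      # Field placement differs between logging.openshift.io/v1 and
      # observability.openshift.io/v1; confirm with
      # `oc explain clusterlogforwarder.spec.outputs` on your cluster.
      url: https://loki-gateway.observability.svc:3100
      loki:
        tenantKey: kubernetes
  pipelines:
    - name: app-to-loki
      inputRefs:
        - application
      outputRefs:
        - loki-out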
$ kubectl logs -n logging -l app.kubernetes.io/name=fluent-bit --tail=50
$ kubectl get pods -n team-payments -l app=payments-api -o name | head -1 | xargs -I{} kubectl logs {} -n team-payments -c payments-api --tail=100
$ kubectl debug node/worker-1 -it --image=busybox -- chroot /host sh -c 'tail -5 /var/log/pods/team-payments_payments-api-*/payments-api/*.log'
→ the glob is quoted so it expands on the node, not in your local shell

$ oc get clusterlogforwarder -n openshift-logging
$ oc get csv -n openshift-logging
→ Logging is installed through OLM, so check its ClusterServiceVersion rather than a ClusterOperator
$ oc logs -n openshift-logging -l component=collector --tail=30
$ oc adm node-logs worker-1 -u kubelet | tail -20
Logging DEBUG in production without sampling floods collectors and storage—one chatty pod can saturate Fluent Bit buffers and drop logs cluster-wide. Set log level via env/ConfigMap; use dynamic debug endpoints for incident investigation only.
Sidecar vs DaemonSet collector: Sidecars (Fluent Bit per pod) isolate noisy neighbors but multiply memory overhead. DaemonSet node agents are the standard—one collector per node tails all pod logs with Kubernetes metadata filter.
Never log secrets, tokens, or PII in plaintext. Redact at source or in collector transforms. OpenShift audit logs (API access) flow through CLF separately from application logs—retain per compliance policy, restrict access in the log store RBAC.
In Grafana, link Loki log panels to Prometheus metrics and Tempo traces with derived fields on trace_id—one click from error log line to flame graph.
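One way to wire up that log-to-trace link is through Grafana data source provisioning. In the sketch below the service URLs and the tempo UID are assumptions, and the regex presumes JSON log lines carrying a trace_id field like the example earlier in this section.

```yaml
# Sketch only: URLs and UIDs are hypothetical.
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo-query-frontend.observability.svc:3200
  - name: Loki
    type: loki
    access: proxy
    url: http://loki-gateway.observability.svc
    jsonData:
      derivedFields:
        - name: TraceID
          # Capture the trace_id value from a JSON log line.
          matcherRegex: '"trace_id":\s*"(\w+)"'
          # $$ escapes the variable so provisioning does not treat it as an env var.
          url: "$${__value.raw}"
          # Internal link: open the captured ID in the Tempo data source.
          datasourceUid: tempo
```

With this provisioned, each matching log line in Explore gets a link that opens the trace in a split view.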
Distributed Tracing
Metrics show a spike; logs show an error message. Traces show the cross-service path: which hop added 800ms, where retries happened, whether the DB or the cache failed first. OpenTelemetry (OTel) is the vendor-neutral standard; backends store and query span data.
flowchart LR
APP["Instrumented app\nOTel SDK"]
AUTO["Auto-instrumentation\noperator injection"]
COL["OTel Collector\nDaemonSet / sidecar"]
BACK["Trace backend"]
UI["Trace UI\nGrafana / Jaeger"]
APP --> COL
AUTO --> APP
COL --> BACK --> UI
subgraph stores["Backends"]
TEMPO["Grafana Tempo"]
JAEGER["Jaeger"]
end
BACK --- stores
OpenTelemetry Operator
Kubernetes operator that manages OpenTelemetry Collector deployments and auto-instrumentation. Annotate a namespace or pod to inject an init container that adds the OTel Java/Node/Python/.NET agent without rebuilding images.
apiVersion: v1
kind: Namespace
metadata:
  name: team-payments
  annotations:
    instrumentation.opentelemetry.io/inject-java: "true"
---
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: java-instrumentation
  namespace: team-payments
spec:
  exporter:
    endpoint: http://otel-collector.observability.svc:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.1" # 10% head sampling in dev; tune per env
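To opt in a single workload rather than a whole namespace, the same annotation can go on the pod template. The sketch below assumes an Instrumentation resource named java-instrumentation in the same namespace; the image reference and replica count are placeholders. The annotation has to sit on the pod template metadata, not on the Deployment's own metadata, because injection happens when pods are admitted.

```yaml
# Sketch only: image and replica count are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: team-payments
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
      annotations:
        # "true" picks the Instrumentation in this namespace;
        # a name selects a specific one when several exist.
        instrumentation.opentelemetry.io/inject-java: "java-instrumentation"
    spec:
      containers:
        - name: payments-api
          image: registry.example.com/payments-api:1.0.0
          ports:
            - name: http
              containerPort: 8080
```

Existing pods are not modified; a rollout restart is needed before the agent appears.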
Auto-instrumentation
Language agents intercept HTTP/gRPC, database drivers, and messaging clients—emitting spans with minimal code changes. Trade-offs: black-box spans (less business context), agent CPU overhead, version compatibility with your runtime. For critical paths, add manual spans around business operations (processPayment).
Tempo
Grafana Tempo is a trace backend optimized for object storage (S3/GCS), without the heavy indexing of a Jaeger-on-Elasticsearch setup. Query via TraceQL in Grafana. Pairs naturally with Loki (logs) and Prometheus (metrics) in the Grafana stack. Deploy with the tempo-distributed Helm chart and feed it from an OTel Collector or Grafana Alloy pipeline.
Jaeger
CNCF graduated tracing system—collector, query UI, agent (legacy sidecar). Storage options: memory (dev), Cassandra, Elasticsearch, Badger. Still widely deployed; many teams migrate to Tempo for lower ops cost. Jaeger v2 converges on OpenTelemetry Collector internals.
OpenShift Distributed Tracing
Red Hat OpenShift Distributed Tracing Platform (built on the Jaeger and Tempo operators, with collection through OpenTelemetry) integrates with the console and Service Mesh. Install via OperatorHub; an OpenTelemetryCollector CR receives OTLP from apps, and a Jaeger or Tempo instance stores traces. Istio/Service Mesh generates spans automatically for mesh traffic, though applications must still forward the trace headers for those spans to join a single trace.
$ kubectl get opentelemetrycollector -A
$ kubectl get instrumentation -A
$ kubectl port-forward -n observability svc/jaeger-query 16686:16686
→ open http://localhost:16686 and search by service name
$ kubectl logs deploy/payments-api -c opentelemetry-auto-instrumentation-java

$ oc get jaeger -n openshift-distributed-tracing
$ oc get opentelemetrycollector -n openshift-tempo
$ oc get route -n openshift-distributed-tracing
W3C Trace Context (traceparent header) propagates trace IDs across services. The OTel Collector can batch, sample tail-based (keep errors/slow traces), and fan-out to multiple backends. Head sampling at 1% means 99% of traces are discarded at birth—acceptable for high-QPS if tail sampling catches anomalies.
"How do you debug latency in microservices on K8s?" — Metrics for SLI breach → TraceQL/Jaeger for slow span → logs filtered by trace_id. Mention sampling strategy (head vs tail) and avoiding trace cardinality explosion from unbounded span attributes.
Payment platforms commonly keep 100% of traces on the checkout path only, using an OTel Collector tail sampling policy: drop health-check spans, keep any trace where http.status_code >= 500 or duration > 2s.
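A Collector configuration along those lines might look like the following sketch. It assumes the contrib distribution (the tail_sampling and filter processors are not in the core build), and the attribute keys (url.path, http.route, http.status_code), the /healthz and /checkout values, and the Tempo endpoint are illustrative; newer semantic conventions name the status attribute http.response.status_code.

```yaml
# Sketch only: attribute keys, paths, and endpoints are hypothetical.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  filter/drop-health:
    error_mode: ignore
    traces:
      span:
        # Spans matching any condition are dropped before sampling.
        - 'attributes["url.path"] == "/healthz"'
  tail_sampling:
    decision_wait: 10s          # buffer spans this long before deciding per trace
    policies:                   # a trace is kept if ANY policy matches
      - name: keep-server-errors
        type: numeric_attribute
        numeric_attribute:
          key: http.status_code
          min_value: 500
          max_value: 599
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 2000
      - name: keep-checkout
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/checkout"]
  batch: {}

exporters:
  otlp:
    endpoint: tempo-distributor.observability.svc:4317
    tls:
      insecure: true            # plaintext inside the cluster; enable TLS where required

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [filter/drop-health, tail_sampling, batch]
      exporters: [otlp]
```

Tail sampling only works when every span of a trace reaches the same Collector instance, so scaled-out deployments usually put a load-balancing tier keyed on trace ID in front of the sampling Collectors.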
kubectl / oc Observability Commands
Before opening Grafana, the API server already exposes rich live signals. These commands are the first line of incident response—resource pressure, scheduling events, container crashes, and ad-hoc debugging without SSH to nodes.
Resource usage: top
Requires metrics-server (vanilla) or cluster metrics (OCP). Shows current CPU/memory—not historical. Sort to find noisy neighbors during node pressure incidents.
$ kubectl top nodes
$ kubectl top pods -A --sort-by=memory | head -15
$ kubectl top pod payments-api-7d4f8b-abc12 -n team-payments --containers

$ oc adm top nodes
$ oc adm top pods -n team-payments --sort-by=cpu
$ oc adm top pod payments-api-7d4f8b-abc12 --containers
Events: cluster audit trail
Kubernetes Events are short-lived (default 1h retention) and record FailedScheduling, BackOff, Unhealthy, and Evicted messages. Always check events when a pod is stuck Pending or CrashLooping.
$ kubectl get events -n team-payments --sort-by='.lastTimestamp'
$ kubectl get events -A --field-selector type=Warning | tail -20
$ kubectl events --for pod/payments-api-7d4f8b-abc12 -n team-payments

$ oc get events -n team-payments --sort-by='.lastTimestamp'
$ oc get events -A --field-selector type=Warning | tail -20
describe: spec + status + events
Combines object YAML highlights, conditions, and recent events in one view—the fastest way to understand why a Deployment isn't progressing or a PVC is Pending.
$ kubectl describe pod payments-api-7d4f8b-abc12 -n team-payments
$ kubectl describe node worker-2 | grep -A5 Conditions
$ kubectl describe pvc data-payments-api-0 -n team-payments

$ oc describe pod payments-api-7d4f8b-abc12 -n team-payments
$ oc describe node worker-2 | grep -A5 Conditions
Logs: current and previous container
kubectl logs tails the current container instance. --previous fetches logs from the crashed container before restart—essential for OOMKill and panic stack traces. Multi-container pods require -c <container>.
$ kubectl logs -f deploy/payments-api -n team-payments --all-containers=true
$ kubectl logs payments-api-7d4f8b-abc12 -n team-payments -c payments-api --previous
→ last crash output before restart
$ kubectl logs payments-api-7d4f8b-abc12 -n team-payments --since=1h --timestamps

$ oc logs -f deploy/payments-api -n team-payments --all-containers=true
$ oc logs payments-api-7d4f8b-abc12 -n team-payments -c payments-api --previous
$ oc logs payments-api-7d4f8b-abc12 -n team-payments --since=1h --timestamps
exec, port-forward, debug
Interactive debugging without exposing services publicly. kubectl debug creates ephemeral debug containers (enabled by default since K8s 1.23, stable in 1.25) or node shells, and is preferred over SSH on managed clouds.
$ kubectl exec -it payments-api-7d4f8b-abc12 -n team-payments -c payments-api -- /bin/sh
$ kubectl port-forward -n team-payments svc/payments-api 8080:8080
→ curl localhost:8080/actuator/health
$ kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
$ kubectl debug -it payments-api-7d4f8b-abc12 -n team-payments --image=nicolaka/netshoot --target=payments-api
$ kubectl debug node/worker-2 -it --image=busybox -- chroot /host crictl ps

$ oc exec -it payments-api-7d4f8b-abc12 -n team-payments -c payments-api -- /bin/sh
$ oc port-forward -n team-payments svc/payments-api 8080:8080
$ oc port-forward -n openshift-monitoring svc/prometheus-k8s 9091:9091
$ oc debug pod/payments-api-7d4f8b-abc12 -n team-payments --image=registry.redhat.io/ubi9/ubi-minimal
→ starts a debug copy of the pod; for an ephemeral container in the running pod, use kubectl debug --target
$ oc adm node-logs worker-2 -u kubelet | tail -30
Command quick reference
| Goal | kubectl | oc equivalent / notes |
|---|---|---|
| Pod CPU/memory now | kubectl top pods | oc adm top pods |
| Node resource pressure | kubectl top nodes | oc adm top nodes |
| Scheduling / pull errors | kubectl get events | oc get events |
| Crash before restart | kubectl logs --previous | oc logs --previous |
| Local access to metrics UI | kubectl port-forward svc/… | oc port-forward (same syntax) |
| Node-level logs (OCP) | kubectl debug node/… | oc adm node-logs |
Incident triage order: get pods → describe → logs --previous → get events → top. Only then escalate to Prometheus/Grafana for historical context. Saves minutes when the answer is ImagePullBackOff in Events.
kubectl logs --previous fails if the container has never restarted (only one container instance exists). For init container failures, use describe and kubectl logs -c init-container-name; init containers run, and fail, before the main container starts.
Use stern or kubectl logs -l app=… --prefix for multi-pod tailing during rollouts. On OCP, the Console Logs tab aggregates pod logs, which is useful for devs without CLI access.