Observability: Metrics, Logs & Tracing

Metrics Stack

Kubernetes exposes two classes of metrics: resource usage (CPU/memory per pod and node via cAdvisor/kubelet) and cluster state (how many pods are ready, PVCs bound, nodes NotReady). The metrics stack layers exporters, a time-series database, dashboards, and alerting on top of those signals.

flowchart LR
  subgraph apps["Workloads"]
    APP["App /metrics\nRED + USE"]
  end
  subgraph k8s["Kubernetes metrics"]
    MS["metrics-server\nkubectl top"]
    KSM["kube-state-metrics\nobject state"]
    NE["node-exporter\nhost metrics"]
  end
  subgraph prom["Prometheus ecosystem"]
    PO["Prometheus Operator"]
    SM["ServiceMonitor\nPodMonitor"]
    PR["Prometheus TSDB"]
    AM["AlertManager"]
    GF["Grafana"]
  end
  APP --> SM
  KSM --> SM
  NE --> SM
  MS -.->|"HPA/VPA only"| HPA["HPA / VPA"]
  PO --> SM --> PR
  PR --> GF
  PR --> AM

metrics-server

Aggregates resource metrics from kubelets and exposes the metrics.k8s.io API group. Powers kubectl top pods/nodes and the Horizontal Pod Autoscaler (CPU/memory utilization). It is not a long-term store: it scrapes kubelets on a short interval and keeps only the most recent samples in memory, so there is no history to query.

Deployed as a Deployment in kube-system (or operator-managed on managed clouds)
Requires read access to the kubelet on every node; registered with the API server through the aggregation layer (APIService v1beta1.metrics.k8s.io)
Without it: HPA shows <unknown> targets; kubectl top fails
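
To make the HPA dependency concrete, here is a minimal sketch of an autoscaling/v2 HorizontalPodAutoscaler that scales on the CPU utilization served by metrics-server. The Deployment name payments-api, the replica bounds, and the 70% target are illustrative assumptions, not values from a real cluster.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api              # hypothetical workload name
  namespace: team-payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 2                  # assumed bounds, tune per service
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # percent of the pods' CPU requests
```

Utilization is computed against the containers' CPU requests, so the target pods need requests set. If metrics-server is unavailable, this is the object whose target shows <unknown>.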

kube-state-metrics

Watches API objects and emits Prometheus metrics about desired vs actual state—not container CPU. Essential for cluster health dashboards and alerts on scheduling failures, crash loops, and PVC issues.

kube_pod_status_phase — pod phase (Pending, Running, Failed, …)
kube_deployment_status_replicas_unavailable — rollout problems
kube_node_status_condition — NotReady, MemoryPressure, DiskPressure
kube_persistentvolumeclaim_status_phase — PVC stuck Pending

Prometheus Operator

A Kubernetes operator that manages Prometheus, Alertmanager, and related CRDs. Instead of hand-editing prometheus.yml, you declare scrape targets via ServiceMonitor and PodMonitor resources. The operator generates config, handles TLS, and reloads Prometheus on CRD changes.

Common install: kube-prometheus-stack Helm chart (Prometheus Community)—ships Prometheus, Alertmanager, Grafana, node-exporter, and kube-state-metrics in one release.

ServiceMonitor

Selects Services by label and defines scrape endpoints (port name, path, interval, TLS). Prometheus instances select which ServiceMonitors to honor via their own serviceMonitorSelector, typically a release: <helm-release-name> label that must be present on the ServiceMonitor (release: kube-prometheus-stack in the example below).

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payments-api
  namespace: team-payments
  labels:
    release: kube-prometheus-stack   # must match Prometheus serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - team-payments
  selector:
    matchLabels:
      app: payments-api
  endpoints:
    - port: http-metrics              # named port on the Service
      path: /actuator/prometheus
      interval: 30s
      scrapeTimeout: 10s
      honorLabels: true
    - port: http-metrics
      path: /metrics
      interval: 30s
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: "go_.*"
          action: drop                   # drop noisy Go runtime metrics

Critical metrics to alert on

Scope	Metric / signal	Why it matters
Pod	kube_pod_container_status_restarts_total	CrashLoopBackOff precursor—alert on restart rate spike
	kube_pod_status_phase{phase="Pending"}	Scheduling or image pull failures
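	container_cpu_usage_seconds_total / requests	CPU saturation against requests (the HPA utilization denominator); throttling itself shows up in container_cpu_cfs_throttled_periods_total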
	container_memory_working_set_bytes	OOMKill risk before kubelet evicts
Node	kube_node_status_condition{condition="Ready",status="true"}	Node loss—workloads reschedule abruptly
	node_memory_MemAvailable_bytes	Node pressure → eviction of best-effort pods
	node_filesystem_avail_bytes	Disk full → kubelet/image pull failures, etcd risk on control plane
Cluster	apiserver_request_total{code=~"5.."}	Control plane errors—everything downstream breaks
	etcd_server_has_leader	etcd quorum loss = cluster brain death
	kube_deployment_status_replicas_unavailable	Rollout stuck—users see partial outage
	cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests	Capacity planning—recording rules pre-aggregate for dashboards
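
A few of the signals in the table can be turned into alerts with a PrometheusRule. The following is a rough sketch only: the rule names, thresholds, and for: durations are assumptions chosen for illustration, and the release label is assumed to match the Prometheus ruleSelector of a kube-prometheus-stack install.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-api-alerts            # hypothetical name
  namespace: team-payments
  labels:
    release: kube-prometheus-stack     # assumed to match the Prometheus ruleSelector
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodRestartingFrequently
          # more than 3 restarts in 15 minutes (assumed threshold)
          expr: increase(kube_pod_container_status_restarts_total{namespace="team-payments"}[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} keeps restarting"
        - alert: PodStuckPending
          expr: kube_pod_status_phase{namespace="team-payments", phase="Pending"} == 1
          for: 15m
          labels:
            severity: warning
        - alert: DeploymentReplicasUnavailable
          expr: kube_deployment_status_replicas_unavailable{namespace="team-payments"} > 0
          for: 10m
          labels:
            severity: critical
```

The for: clause keeps short-lived states, such as a pod that is Pending for a few seconds during a rollout, from paging anyone.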

Grafana

Visualization layer on top of Prometheus (and Loki, Tempo via data sources). Dashboard-as-code with grafonnet or JSON in git. Import community dashboards: 315 (Kubernetes cluster), 6417 (kube-state-metrics), 12006 (USE method). Tie every panel to a runbook link annotation.

Alertmanager

Receives firing alerts from Prometheus, then deduplicates, groups, silences, and routes them to PagerDuty, Slack, or email. Define alerts as PrometheusRule CRDs; configure Alertmanager itself through AlertmanagerConfig resources (namespaced) or, on vanilla installs, a Secret holding alertmanager.yaml.

inhibition — suppress node-disk alerts when node-not-ready already firing
group_by — one Slack message per deployment, not per pod
severity labels — critical pages; warning tickets
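
The three ideas above map onto a native Alertmanager configuration roughly as follows. This is a sketch under assumptions: the receiver names, the Slack channel, the alert names NodeNotReady and NodeDiskPressure, and the REPLACE_ME credentials are all placeholders rather than values from a real setup.

```yaml
route:
  receiver: slack-warnings             # default: warnings become chat messages or tickets
  group_by: [namespace, deployment]    # one notification per deployment, not per pod
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"          # critical alerts page the on-call
      receiver: pagerduty-oncall
receivers:
  - name: slack-warnings
    slack_configs:
      - channel: "#payments-alerts"    # hypothetical channel
        api_url: https://hooks.slack.com/services/REPLACE_ME
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: REPLACE_ME
inhibit_rules:
  # while a node is NotReady, mute disk alerts for the same node
  - source_matchers:
      - alertname="NodeNotReady"       # hypothetical alert names
    target_matchers:
      - alertname="NodeDiskPressure"
    equal: [node]
```

The equal: [node] clause limits the inhibition to alerts that share the same node label, so a disk alert on a healthy node still fires.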

$ kubectl get apiservice v1beta1.metrics.k8s.io -o wide
→ AVAILABLE column must be True for metrics-server
$ kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus
$ kubectl get servicemonitor -A
$ kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
→ open http://localhost:9090/targets and confirm State is UP
$ oc get clusteroperator monitoring
$ oc get servicemonitor -n openshift-user-workload-monitoring
$ oc adm top pods -A --sort-by=cpu | head -20

🔬 Under the Hood

metrics-server scrapes kubelet /metrics/resource (cAdvisor subset) and aggregates in memory. Prometheus scrapes independently on its own interval—HPA does not read Prometheus; it calls metrics-server via the aggregation API. Two different pipelines for two different purposes.

⚠️ Pitfall

High-cardinality labels (user IDs, unbounded URL paths) explode Prometheus TSDB size and query latency. Bound labels at instrumentation time—use route="/users/:id" templates, not raw paths. Drop expensive labels in metricRelabelings if legacy apps cannot be fixed quickly.
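
As a stopgap for a legacy app, a ServiceMonitor can drop the offending label at scrape time. The sketch below assumes a hypothetical app legacy-orders that attaches a user_id label to its metrics; both names are illustrative.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: legacy-orders                  # hypothetical app
  namespace: team-payments
  labels:
    release: kube-prometheus-stack     # assumed to match the serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: legacy-orders
  endpoints:
    - port: http-metrics
      interval: 30s
      metricRelabelings:
        - action: labeldrop            # remove the unbounded label before ingestion
          regex: user_id
```

Dropping a label can make previously distinct series identical, which Prometheus reports as duplicate samples. If that happens, drop the whole metric instead (action: drop on __name__) until the instrumentation is fixed.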

⚙️ Config

Instrument apps with Prometheus client libraries: counters for requests/errors, histograms for latency (not summaries—histograms aggregate in PromQL). Expose on a dedicated port named in the Service (http-metrics) so ServiceMonitor can target it without scraping the main app port.
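
A Service that follows this convention might look like the sketch below. The port numbers are assumptions; what matters is that the metrics port has its own name that the ServiceMonitor endpoint can reference.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payments-api
  namespace: team-payments
  labels:
    app: payments-api              # matched by the ServiceMonitor selector
spec:
  selector:
    app: payments-api
  ports:
    - name: http                   # main application traffic
      port: 8080
      targetPort: 8080
    - name: http-metrics           # dedicated metrics port, referenced by name
      port: 9090                   # assumed port number
      targetPort: 9090
```

Referencing the port by name rather than by number lets the container port change without touching the ServiceMonitor.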

🎯 Interview Tip

"How do you monitor Kubernetes?" — Cover the three layers: kube-state-metrics for object health, cAdvisor/node-exporter for resource USE, app RED metrics via ServiceMonitor. Mention AlertManager routing and SLO-based alerts (error budget burn) over naive threshold paging.

OpenShift Monitoring Stack

OpenShift ships a managed Prometheus stack operated by the Cluster Monitoring Operator (CMO). Platform metrics live in openshift-monitoring; tenant workloads use user workload monitoring in a separate namespace. Do not fight the platform—extend it.

Built-in Prometheus (openshift-monitoring)

CMO deploys Prometheus, Alertmanager, Thanos querier, kube-state-metrics, node-exporter, and cluster monitoring RBAC. Scrapes platform components: API server, etcd, operators, CNI, registry. Dashboards appear in the OpenShift Console Observe → Metrics view without extra Grafana install.

Prometheus pods: prometheus-k8s-0 (plus prometheus-k8s-1 on multi-node clusters), from the prometheus-k8s StatefulSet in openshift-monitoring
Alerting rules managed by CMO—platform SRE ownership
Long-term storage via Thanos (optional external object store configuration)

🔴 OpenShift

Never install your own Prometheus in openshift-monitoring. That namespace is platform-managed: competing scrape configs and duplicate operators conflict with CMO, and manual edits are overwritten on reconcile or upgrade. Use user workload monitoring for app metrics, or run a bring-your-own stack in a dedicated namespace of its own.

User workload monitoring

Enabled via cluster config (enableUserWorkload: true on the cluster-monitoring-config ConfigMap). Deploys a second Prometheus stack in openshift-user-workload-monitoring that scrapes ServiceMonitors in user namespaces.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true

ServiceMonitors and PodMonitors created in user namespaces are picked up by the user workload Prometheus without a special label; to exclude a namespace, label it openshift.io/user-monitoring: "false". RBAC: project admins can create ServiceMonitors/PrometheusRules in their namespaces.

Console metrics

The OpenShift Console queries Thanos querier for both platform and user metrics. Developers see pod CPU/memory graphs on the Workloads → Pods detail page. Platform admins use Observe → Dashboards for etcd, API latency, and operator health.

flowchart TB
  CMO["Cluster Monitoring Operator"]
  subgraph platform["openshift-monitoring"]
    PP["Prometheus\nplatform scrape"]
    PA["Alertmanager"]
    TQ["Thanos Querier"]
  end
  subgraph user["openshift-user-workload-monitoring"]
    UP["Prometheus\nuser ServiceMonitors"]
    UA["Alertmanager"]
  end
  CON["OpenShift Console\nObserve tab"] --> TQ
  CMO --> platform
  CMO --> user
  UP --> TQ
  PP --> TQ

$ # vanilla K8s — no openshift-monitoring namespace
$ kubectl get pods -n monitoring
$ oc get co monitoring
$ oc get prometheus -n openshift-monitoring
$ oc get prometheus -n openshift-user-workload-monitoring
$ oc get cm cluster-monitoring-config -n openshift-monitoring -o yaml
$ oc get prometheusrule -n team-payments
$ oc rsh -n openshift-monitoring prometheus-k8s-0

⚖️ Trade-off

Platform Prometheus vs bring-your-own: User workload monitoring covers 80% of app teams without operating Prometheus. BYO (kube-prometheus-stack in a tenant namespace) makes sense when you need custom retention, federation to central TSDB, or multi-cluster Grafana—at the cost of another stack to patch.

📦 Real World

Enterprise OCP teams route user-workload Alertmanager receivers to their own PagerDuty service—configured via user-workload-monitoring-config ConfigMap. Platform alerts stay with the cluster admin team; application SLO breaches go to product on-call.
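
A sketch of what that ConfigMap might contain is shown below. The alertmanager keys reflect my understanding of recent OCP releases and should be checked against the documentation for your version; the retention value is an arbitrary assumption.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      enabled: true                   # dedicated Alertmanager for user workloads
      enableAlertmanagerConfig: true  # teams define routing via AlertmanagerConfig in their namespaces
    prometheus:
      retention: 24h                  # assumed retention; tune per environment
```

With this in place, an application team can keep its PagerDuty routing in its own namespace while platform alerts continue to flow through the Alertmanager in openshift-monitoring.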

💡 Pro Tip

Before upgrading OCP, check that oc get co monitoring reports AVAILABLE=True, PROGRESSING=False, and DEGRADED=False. CMO rolls out new Prometheus versions as part of the cluster upgrade; custom edits to platform Prometheus CRs are unsupported and lost on reconcile.

Logging Stack

Container stdout/stderr is captured by the container runtime and written to files under /var/log/pods/ on each node, where the kubelet handles rotation. A log collector DaemonSet tails those files, enriches each record with Kubernetes metadata, and forwards it to a central store. Choose your backend based on query patterns, not hype.

flowchart LR
  APP["Container\nstdout/stderr"]
  KL["kubelet"]
  LOGS["/var/log/pods/\n/var/log/containers/"]
  COL["Collector DaemonSet\nFluent Bit / Vector"]
  STORE["Backend"]
  UI["Query UI"]
  APP --> KL --> LOGS --> COL --> STORE --> UI
  subgraph backends["Backend choices"]
    EFK["EFK\nElasticsearch"]
    LOKI["Loki\nlabel-indexed"]
  end
  STORE --- backends

Collectors: Fluentd, Fluent Bit, Vector

Agent	Characteristics	Typical role
Fluentd	Ruby/C, rich plugin ecosystem, higher memory footprint	Legacy EFK stacks; aggregation tier behind Fluent Bit
Fluent Bit	C, lightweight, CNCF graduated; Kubernetes filter built-in	Default node agent—Elastic, Loki, S3, Kafka outputs
Vector	Rust, VRL transform language, high throughput	Greenfield pipelines; replaces Fluent Bit when teams want programmable transforms

EFK vs Loki

Dimension	EFK (Elasticsearch + Fluent + Kibana)	Loki (+ Grafana)
Index model	Full-text inverted index on log content	Index labels only (namespace, pod, app); chunk storage cheap
Query style	Free-text search, complex aggregations	LogQL—filter by labels, parse JSON at query time
Cost at scale	Higher—RAM-heavy JVM, shard management	Lower—object storage (S3) for chunks
Best fit	Security analytics, full-text grep across unstructured logs	Kubernetes-native ops—correlate logs + metrics in Grafana

Log sources on the node

/var/log/pods/<ns>_<pod>_<uid>/<container>/<n>.log: CRI log format, one record per line as <timestamp> <stdout|stderr> <P|F> <message> (JSON per line only with the legacy Docker json-file driver)
/var/log/containers/: symlinks to pod logs; what most collectors tail
/var/log/journal: systemd/kubelet logs (optional host journal input)
/var/log/openshift/: OCP node and audit logs (platform-specific)

Structured JSON logging

Apps should emit one JSON object per line to stdout with stable field names (level, msg, trace_id, request_id). Collectors parse JSON in the pipeline (Fluent Bit parser filter, Vector VRL) so Loki/Elasticsearch can filter without regex hell. Correlate with traces by writing the trace_id and span_id from the active W3C trace context (the traceparent header) into every log line.

{
  "timestamp": "2026-06-05T14:32:01.123Z",
  "level": "ERROR",
  "msg": "payment declined",
  "service": "payments-api",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "order_id": "ord-88421",
  "http.status": 402,
  "duration_ms": 127
}

OpenShift Logging Operator & ClusterLogForwarder

OpenShift Cluster Logging is managed by the Logging Operator (replacing legacy cluster-logging Ansible ops). ClusterLogForwarder (CLF) CR defines inputs (application, infrastructure, audit) and outputs (Elasticsearch, Loki, Kafka, CloudWatch, Splunk, syslog).

apiVersion: observability.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
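spec:
  managementState: Managed
  serviceAccount:
    name: collector                  # assumption: a service account allowed to collect application logs
  outputs:
    - name: loki-out
      type: loki
      loki:                          # in the observability.openshift.io/v1 API the url sits under the output-type block
        url: https://loki-gateway.observability.svc:3100
        tenantKey: '{.kubernetes.namespace_name||"none"}'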
  pipelines:
    - name: app-to-loki
      inputRefs:
        - application
      outputRefs:
        - loki-out

$ kubectl logs -n logging -l app.kubernetes.io/name=fluent-bit --tail=50
$ kubectl get pods -n team-payments -l app=payments-api -o name | head -1 | xargs -I{} kubectl logs {} -c payments-api --tail=100
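$ kubectl debug node/worker-1 -it --image=busybox -- chroot /host sh -c 'tail -n 5 /var/log/pods/team-payments_payments-api-*/payments-api/*.log'
$ oc get clusterlogforwarder -n openshift-logging
$ oc get csv -n openshift-logging
→ the Logging operator is installed through OLM, so its status is on the ClusterServiceVersion rather than a ClusterOperator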
$ oc logs -n openshift-logging -l component=collector --tail=30
$ oc adm node-logs worker-1 --log-type=kubelet | tail -20

⚠️ Pitfall

Logging DEBUG in production without sampling floods collectors and storage—one chatty pod can saturate Fluent Bit buffers and drop logs cluster-wide. Set log level via env/ConfigMap; use dynamic debug endpoints for incident investigation only.

⚖️ Trade-off

Sidecar vs DaemonSet collector: Sidecars (Fluent Bit per pod) isolate noisy neighbors but multiply memory overhead. DaemonSet node agents are the standard—one collector per node tails all pod logs with Kubernetes metadata filter.

🔒 Security

Never log secrets, tokens, or PII in plaintext. Redact at source or in collector transforms. OpenShift audit logs (API access) flow through CLF separately from application logs—retain per compliance policy, restrict access in the log store RBAC.

💡 Pro Tip

In Grafana, link Loki log panels to Prometheus metrics and Tempo traces with derived fields on trace_id—one click from error log line to flame graph.
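
One way to wire this up is through Grafana data source provisioning. The sketch below assumes JSON logs with a trace_id field as in the example earlier, a Tempo data source whose uid is tempo, and a Loki URL that is purely illustrative.

```yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki-gateway.observability.svc   # assumed Loki endpoint
    jsonData:
      derivedFields:
        - name: TraceID
          # capture the value of the trace_id field from a JSON log line
          matcherRegex: '"trace_id":\s*"(\w+)"'
          # $$ escapes the variable so provisioning does not expand it
          url: '$${__value.raw}'
          datasourceUid: tempo                   # assumed uid of the Tempo data source
```

Each log line that matches the regex gets a link next to the extracted TraceID that opens the trace in the Tempo data source.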

Distributed Tracing

Metrics show a spike; logs show an error message. Traces show the cross-service path: which hop added 800ms, where retries happened, whether the DB or the cache failed first. OpenTelemetry (OTel) is the vendor-neutral standard; backends store and query span data.

flowchart LR
  APP["Instrumented app\nOTel SDK"]
  AUTO["Auto-instrumentation\noperator injection"]
  COL["OTel Collector\nDaemonSet / sidecar"]
  BACK["Trace backend"]
  UI["Trace UI\nGrafana / Jaeger"]
  APP --> COL
  AUTO --> APP
  COL --> BACK --> UI
  subgraph stores["Backends"]
    TEMPO["Grafana Tempo"]
    JAEGER["Jaeger"]
  end
  BACK --- stores

OpenTelemetry Operator

Kubernetes operator that manages OpenTelemetry Collector deployments and auto-instrumentation. Annotate a namespace or pod to inject an init container that adds the OTel Java/Node/Python/.NET agent without rebuilding images.

apiVersion: v1
kind: Namespace
metadata:
  name: team-payments
  annotations:
    instrumentation.opentelemetry.io/inject-java: "true"
---
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: java-instrumentation
  namespace: team-payments
spec:
  exporter:
    endpoint: http://otel-collector.observability.svc:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"          # 10% head sampling in dev; tune per env
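
Injection can also be scoped to a single workload by annotating the pod template instead of the namespace. The following sketch assumes the Instrumentation resource above and uses a placeholder image; the annotation value names the Instrumentation to apply.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: team-payments
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
      annotations:
        # "true" would select the namespace's Instrumentation; a name selects a specific one
        instrumentation.opentelemetry.io/inject-java: "java-instrumentation"
    spec:
      containers:
        - name: payments-api
          image: registry.example.com/payments-api:1.0.0   # placeholder image
          ports:
            - name: http
              containerPort: 8080
```

The annotation must sit on the pod template, not on the Deployment's own metadata, because the operator's webhook acts when pods are created.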

Auto-instrumentation

Language agents intercept HTTP/gRPC, database drivers, and messaging clients—emitting spans with minimal code changes. Trade-offs: black-box spans (less business context), agent CPU overhead, version compatibility with your runtime. For critical paths, add manual spans around business operations (processPayment).

Tempo

Grafana Tempo is a trace backend optimized for object storage (S3/GCS). It avoids the heavy indexing of a Jaeger-on-Elasticsearch setup. Query via TraceQL in Grafana. Pairs naturally with Loki (logs) and Prometheus (metrics) in the Grafana stack. Deploy with the tempo-distributed Helm chart and feed it from an OTel Collector or Grafana Alloy pipeline.

Jaeger

CNCF graduated tracing system—collector, query UI, agent (legacy sidecar). Storage options: memory (dev), Cassandra, Elasticsearch, Badger. Still widely deployed; many teams migrate to Tempo for lower ops cost. Jaeger v2 converges on OpenTelemetry Collector internals.

OpenShift Distributed Tracing

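Red Hat OpenShift Distributed Tracing Platform is delivered through operators (Tempo, and the older Jaeger-based operator) alongside the Red Hat build of OpenTelemetry, and integrates with the console and Service Mesh. Install via OperatorHub; an OpenTelemetryCollector CR receives OTLP from apps, and a Tempo or Jaeger instance stores traces. Istio/Service Mesh generates spans automatically for mesh traffic, although applications still need to propagate trace headers for those spans to join into one trace.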

$ kubectl get opentelemetrycollector -A
$ kubectl get instrumentation -A
$ kubectl port-forward -n observability svc/jaeger-query 16686:16686
→ open http://localhost:16686 — search by service name
$ kubectl logs deploy/payments-api -n team-payments -c opentelemetry-auto-instrumentation-java
$ oc get jaeger -n openshift-distributed-tracing
$ oc get opentelemetrycollector -n openshift-tempo
$ oc get route -n openshift-distributed-tracing

🔬 Under the Hood

W3C Trace Context (traceparent header) propagates trace IDs across services. The OTel Collector can batch, apply tail-based sampling (keep errors/slow traces), and fan out to multiple backends. Head sampling at 1% means 99% of traces are discarded at birth. That is acceptable for high-QPS services, but tail sampling can only keep anomalies among traces that survived the head decision, so pair it with a high (or 100%) head rate on the paths you care about.

🎯 Interview Tip

"How do you debug latency in microservices on K8s?" — Metrics for SLI breach → TraceQL/Jaeger for slow span → logs filtered by trace_id. Mention sampling strategy (head vs tail) and avoiding trace cardinality explosion from unbounded span attributes.

📦 Real World

Payment platforms often keep 100% of traces on the checkout path only, using an OTel Collector tail sampling policy: drop health-check spans, and keep any trace where http.status_code >= 500 or duration > 2s.
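
A policy along those lines could be sketched with the tail_sampling processor from the Collector contrib distribution. This is a fragment, not a full Collector config: the receivers, exporters, and pipeline wiring are omitted, the /checkout route value is an assumption, and dropping health-check spans would need a separate filter step that is not shown.

```yaml
processors:
  tail_sampling:
    decision_wait: 10s               # buffer spans this long before deciding per trace
    policies:
      - name: keep-server-errors
        type: numeric_attribute
        numeric_attribute:
          key: http.status_code
          min_value: 500
          max_value: 599
      - name: keep-slow-traces
        type: latency
        latency:
          threshold_ms: 2000         # keep traces longer than 2s
      - name: keep-checkout
        type: string_attribute
        string_attribute:
          key: http.route            # assumed attribute carrying the route
          values: ["/checkout"]
```

A trace is kept if any one of the policies matches, so checkout traces are retained in full while other traffic is kept only when it is slow or failing.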

kubectl / oc Observability Commands

Before opening Grafana, the API server already exposes rich live signals. These commands are the first line of incident response—resource pressure, scheduling events, container crashes, and ad-hoc debugging without SSH to nodes.

Resource usage: top

Requires metrics-server (vanilla) or the platform monitoring stack (OCP), both of which serve the metrics.k8s.io API. Shows current CPU/memory, not history. Sort to find noisy neighbors during node pressure incidents.

$ kubectl top nodes
$ kubectl top pods -A --sort-by=memory | head -15
$ kubectl top pod payments-api-7d4f8b-abc12 -n team-payments --containers
$ oc adm top nodes
$ oc adm top pods -n team-payments --sort-by=cpu
$ oc adm top pod payments-api-7d4f8b-abc12 -n team-payments --containers

Events: cluster audit trail

Kubernetes Events are short-lived (default 1h retention)—capture FailedScheduling, BackOff, Unhealthy, Evicted messages. Always check events when a pod is stuck Pending or CrashLooping.

$ kubectl get events -n team-payments --sort-by='.lastTimestamp'
$ kubectl get events -A --field-selector type=Warning | tail -20
$ kubectl get events --for pod/payments-api-7d4f8b-abc12 -n team-payments
$ oc get events -n team-payments --sort-by='.lastTimestamp'
$ oc get events -A --field-selector type=Warning | tail -20

describe: spec + status + events

Combines object YAML highlights, conditions, and recent events in one view—the fastest way to understand why a Deployment isn't progressing or a PVC is Pending.

$ kubectl describe pod payments-api-7d4f8b-abc12 -n team-payments
$ kubectl describe node worker-2 | grep -A5 Conditions
$ kubectl describe pvc data-payments-api-0 -n team-payments
$ oc describe pod payments-api-7d4f8b-abc12 -n team-payments
$ oc describe node worker-2 | grep -A5 Conditions

Logs: current and previous container

kubectl logs tails the current container instance. --previous fetches logs from the crashed container before restart—essential for OOMKill and panic stack traces. Multi-container pods require -c <container>.

$ kubectl logs -f deploy/payments-api -n team-payments --all-containers=true
$ kubectl logs payments-api-7d4f8b-abc12 -n team-payments -c payments-api --previous
→ last crash output before restart
$ kubectl logs payments-api-7d4f8b-abc12 -n team-payments --since=1h --timestamps
$ oc logs -f deploy/payments-api -n team-payments --all-containers=true
$ oc logs payments-api-7d4f8b-abc12 -n team-payments -c payments-api --previous
$ oc logs payments-api-7d4f8b-abc12 -n team-payments --since=1h --timestamps

exec, port-forward, debug

Interactive debugging without exposing services publicly. kubectl debug creates ephemeral debug containers (enabled by default since K8s 1.23, stable in 1.25) or node shells, and is preferred over SSH on managed clouds.

$ kubectl exec -it payments-api-7d4f8b-abc12 -n team-payments -c payments-api -- /bin/sh
$ kubectl port-forward -n team-payments svc/payments-api 8080:8080
→ curl localhost:8080/actuator/health
$ kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
$ kubectl debug -it payments-api-7d4f8b-abc12 -n team-payments --image=nicolaka/netshoot --target=payments-api
$ kubectl debug node/worker-2 -it --image=busybox -- chroot /host crictl ps
$ oc exec -it payments-api-7d4f8b-abc12 -n team-payments -c payments-api -- /bin/sh
$ oc port-forward -n team-payments svc/payments-api 8080:8080
$ oc port-forward -n openshift-monitoring svc/prometheus-k8s 9091:9091
$ oc debug pod/payments-api-7d4f8b-abc12 -n team-payments --image=registry.redhat.io/ubi9/ubi-minimal -c payments-api
→ starts a debug copy of the pod with a shell rather than an ephemeral container
$ oc adm node-logs worker-2 --log-type=kubelet | tail -30

Command quick reference

Goal	kubectl	oc equivalent / notes
Pod CPU/memory now	kubectl top pods	oc adm top pods
Node resource pressure	kubectl top nodes	oc adm top nodes
Scheduling / pull errors	kubectl get events	oc get events
Crash before restart	kubectl logs --previous	oc logs --previous
Local access to metrics UI	kubectl port-forward svc/…	oc port-forward (same syntax)
Node-level logs (OCP)	kubectl debug node/…	oc adm node-logs

💡 Pro Tip

Incident triage order: get pods → describe → logs --previous → get events → top. Only then escalate to Prometheus/Grafana for historical context. Saves minutes when the answer is ImagePullBackOff in Events.

⚠️ Pitfall

kubectl logs --previous fails if the container has never restarted, because no previous instance exists. For init container failures, read the describe output and run kubectl logs -c <init-container-name>; init containers run before the main container starts, so their logs are the only ones available at that point.

⚙️ Config

Use stern or kubectl logs -l app=… --prefix for multi-pod tailing during rollouts. On OCP, the Console Logs tab shows pod logs in the browser, which is useful for devs without CLI access.