
Observability & evaluation

Fifteen staff-depth scenarios on seeing and proving how LLM systems behave: semantic distributed tracing; privacy-tiered logging; versioned offline RAG eval; online feedback loops with abuse resistance; production grounding checks; latency waterfalls; retrieval SLIs; customer debug playbooks; synthetic probes; ethical A/B design; human labeling ops; red-team CI gates; cost attributes on spans; quality burn-rate alerts; and persona-specific dashboards from one telemetry core.

Interview stance. Observability for LLMs is not ‘logs plus Datadog’—it is tracing probabilistic behavior with business context. Panels expect you to separate infra uptime from answer quality, and to know what you cannot safely log.

131. How would you design an observability stack specifically for LLM-powered applications (logs, metrics, traces)?

Three pillars, one trace id. Correlating prompts, retrievals, tool calls, and streaming chunks requires a single request_id / OpenTelemetry trace root. Spans should name the semantic step (retrieve.hybrid, rerank.cross_encoder, llm.completion), not only “HTTP POST.”

Redaction by default. Log payloads with field-level redaction or hashes; store full text only in secured, TTL’d buckets when needed for replay. Security teams audit what leaves the VPC.

Metrics that matter. TTFT, inter-token latency percentiles, retrieval candidate count, rerank drop rate, validator reject rate, cost per request, tokens in/out by model version. Generic CPU metrics miss user-visible pain.

Structured events. Emit JSON lines for grounding decisions, citation ids, and prompt template ids so warehouse joins power weekly quality reviews—not grep.

Product linkage. Tag spans with feature, tenant_tier, experiment_id so PMs slice regressions without filing ‘mystery infra’ tickets.

Trace shape

flowchart LR
  R[Request] --> T[Trace]
  T --> RET[Retrieve span]
  T --> RN[Rerank span]
  T --> L[LLM span]
  T --> V[Validate span]
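
Sketch. A minimal version of the span convention above using the OpenTelemetry Python API; exporter wiring is omitted, and the retrieve/rerank/complete/validate helpers are stand-ins for your own stack.

from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

def answer(request_id: str, question: str, helpers) -> str:
    with tracer.start_as_current_span("request") as root:
        root.set_attribute("request_id", request_id)
        root.set_attribute("feature", "doc_qa")          # product linkage tags
        root.set_attribute("tenant_tier", "enterprise")
        with tracer.start_as_current_span("retrieve.hybrid") as s:
            candidates = helpers.retrieve(question)
            s.set_attribute("retrieval.candidate_count", len(candidates))
        with tracer.start_as_current_span("rerank.cross_encoder"):
            top = helpers.rerank(question, candidates)
        with tracer.start_as_current_span("llm.completion") as s:
            text, tokens_in, tokens_out = helpers.complete(question, top)
            s.set_attribute("llm.tokens_in", tokens_in)
            s.set_attribute("llm.tokens_out", tokens_out)
        with tracer.start_as_current_span("validate") as s:
            s.set_attribute("validator.rejected", not helpers.validate(text, top))
        return text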

132. How would you balance rich observability with privacy and compliance (PII in prompts and responses)?

Data classification drives retention. Tier-1 flows might log only hashes + aggregate stats; tier-2 allows truncated snippets with token masking; never ship card numbers to a third-party log vendor “temporarily.”

Techniques. Run PII detectors before write; pseudonymize user ids in dev/stage; separate security audit log (immutable, access controlled) from noisy app logs.

Replay sandboxes. When engineers debug, pull redacted bundles into ephemeral environments; access is ticketed and time-bound.

Regional logging. EU prompts never land in US-only log indices—architecture mirrors inference residency.

Interview candor. ‘Log everything’ and GDPR/HIPAA are incompatible; show you negotiate with Legal early.
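
Sketch. One way to run “PII detectors before write” plus dev/stage pseudonymization; the regexes are illustrative stand-ins for a real detector, and the field names are hypothetical.

import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def pseudonymize(user_id: str) -> str:
    # Stable hash: the same user correlates across events without
    # storing the raw identifier.
    return "u_" + hashlib.sha256(user_id.encode()).hexdigest()[:12]

def redact_event(event: dict) -> dict:
    out = dict(event)
    out["user_id"] = pseudonymize(event["user_id"])
    text = EMAIL.sub("<email>", event.get("prompt", ""))
    text = CARD.sub("<card>", text)
    out["prompt"] = text[:500]  # tier-2 style: truncated, masked snippet
    return out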

133. How would you build an offline evaluation pipeline for a RAG system before each release?

Gold sets. Curate labeled question–answer pairs with required evidence chunk ids, stratified by language, domain, and difficulty. Version datasets like code—rag_eval_v2026_04.

Metrics. Recall@k / nDCG on retrieval, answer exact match where applicable, citation faithfulness checks, and refusal correctness on adversarial ‘no evidence’ items.

Gates. CI fails merges that regress primary metrics beyond tolerance; flaky eval infra is worse than no eval—pin models, seeds, and hardware class.

Cost discipline. Nightly full suites; per-PR smoke subsets. Parallelize across shards.

Human spot checks. Sample diffs for narrative quality; automation catches ranking bugs, not tone regressions.
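
Sketch. Recall@k plus a merge gate of the kind described above; the metric choice, baseline source, and 0.02 tolerance are hypothetical.

def recall_at_k(retrieved_ids, gold_ids, k=10):
    # Fraction of required evidence chunks present in the top-k results.
    if not gold_ids:
        return 1.0
    return len(set(retrieved_ids[:k]) & set(gold_ids)) / len(gold_ids)

def ci_gate(current: float, baseline: float, tolerance: float = 0.02):
    # Fail the merge when the primary metric regresses beyond tolerance.
    if current < baseline - tolerance:
        raise SystemExit(
            f"recall@10 regressed: {current:.3f} vs baseline {baseline:.3f}"
        )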

134. How would you collect and use online user feedback (thumbs, edits, implicit signals) to improve LLM quality?

Explicit signals. Thumbs on answers + optional reason taxonomy (‘wrong facts’, ‘unsafe’, ‘off topic’). Keep UX lightweight; long surveys get selection bias from angry users only.

Implicit signals. Copy rate, time-to-first-regenerate, follow-up correction messages, support ticket linkage. Model these as weak labels with uncertainty.

Closed loop. Route high-value failures into labeling queues; tie back to prompt id, model id, and retrieval trace for root cause—not generic ‘bad AI’ buckets.

Gaming resistance. Rate-limit feedback per session; detect brigading on controversial surfaces.

Ethics. Tell users how feedback trains systems when legally required; allow opt-out where mandated.
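
Sketch. Implicit signals modeled as weak labels with explicit uncertainty, as above; the weights are hypothetical starting points, not calibrated values.

from dataclasses import dataclass

@dataclass
class WeakLabel:
    answer_id: str
    score: float   # -1.0 (bad) .. 1.0 (good)
    weight: float  # confidence in the signal

def label_from_signals(answer_id, thumbs=None, regenerated=False, copied=False):
    if thumbs is not None:               # explicit feedback dominates
        return WeakLabel(answer_id, 1.0 if thumbs else -1.0, 0.9)
    if regenerated:                      # implicit and noisy: low weight
        return WeakLabel(answer_id, -0.5, 0.3)
    if copied:
        return WeakLabel(answer_id, 0.5, 0.4)
    return WeakLabel(answer_id, 0.0, 0.0)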

135. How would you detect hallucinations or ungrounded answers in production—not only in offline benchmarks?

Entailment checks. Lightweight models or rules compare answer sentences to retrieved passages; flag low-overlap outputs for review or automatic downgrade.

Citation validators. Require citation ids in high-risk modes; regex/AST verify code blocks match sources when feasible.

Confidence proxies. Retrieval score gaps, reranker margin, and self-consistency sampling (two temps) highlight shaky regions—use sparingly due to cost.

Human audit queues. Stochastic sampling plus all flagged finance/medical answers where policy demands.

Metrics. Track grounding violation rate per cohort; alert when baseline shifts after deploy.
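
Sketch. The cheapest form of the entailment check above: a lexical overlap proxy, not a real entailment model, with a hypothetical flagging threshold.

def grounding_overlap(sentence: str, passages: list[str]) -> float:
    # Share of answer tokens that appear in the best-matching passage.
    tokens = set(sentence.lower().split())
    if not tokens or not passages:
        return 0.0
    return max(
        len(tokens & set(p.lower().split())) / len(tokens) for p in passages
    )

def is_ungrounded(sentence: str, passages: list[str], threshold=0.35) -> bool:
    # Low overlap routes the answer to review or automatic downgrade.
    return grounding_overlap(sentence, passages) < threshold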

136. How would you decompose and optimize end-to-end latency for an LLM application (time-to-first-token, retrieval, tools)?

Waterfall first. Measure DNS, auth, retrieval, rerank, prompt assembly, queue wait, GPU prefill, decode—without a breakdown teams optimize the wrong layer.

TTFT vs total. Streaming improves perceived latency; long tool round-trips dominate some agents—parallelize independent tools, cache deterministic retrieval for hot queries.

Async patterns. Push non-blocking work (logging, secondary enrichment) off the hot path.

Capacity. Autoscale on prefill queue depth, not only CPU; cold GPU adds seconds—keep warm pools for interactive tiers.

Interview link. Mention tail latency: p99 often set by one bad dependency—cap with deadlines and partial results.

Latency waterfall

flowchart LR
  A[Auth] --> R[Retrieve]
  R --> K[Rerank]
  K --> Q[Queue]
  Q --> P[Prefill]
  P --> D[Decode stream]
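
Sketch. Parallelizing independent tools as recommended above; the two tool coroutines are stand-ins with artificial delays to make the effect visible.

import asyncio

async def search_tool(q: str) -> str:      # stand-in, ~300 ms
    await asyncio.sleep(0.3)
    return f"search({q})"

async def calc_tool(q: str) -> str:        # stand-in, ~200 ms
    await asyncio.sleep(0.2)
    return f"calc({q})"

async def gather_tools(query: str):
    # Independent calls run concurrently: the critical path is
    # max(0.3, 0.2) seconds, not the 0.5-second sum.
    return await asyncio.gather(search_tool(query), calc_tool(query))

print(asyncio.run(gather_tools("q3 revenue")))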

137. How would you define and monitor service-level indicators for retrieval quality in production?

Proxy metrics. Zero-result rate, filtered-out rate post-ACL, avg top‑1 score drift, click-through on citations, ‘no evidence’ answer frequency.

Periodic deep eval. Nightly or weekly automated replay of gold queries against prod index snapshots—metric deltas catch silent index corruption.

Segmentation. Slice by tenant, language, corpus version; global averages hide one broken connector.

SLO pairing. Pair quality SLIs with freshness SLIs—fast wrong answers fail audits.

Actionability. Each alert should suggest likely owner (ingestion vs ranking vs ACL sync).
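
Sketch. A per-tenant slice of the zero-result SLI above, computed from structured retrieval events; the field names are hypothetical.

from collections import defaultdict

def zero_result_rate_by_tenant(events):
    # events: dicts like {"tenant": "acme", "candidate_count": 7}
    totals, zeros = defaultdict(int), defaultdict(int)
    for e in events:
        totals[e["tenant"]] += 1
        if e["candidate_count"] == 0:
            zeros[e["tenant"]] += 1
    return {t: zeros[t] / totals[t] for t in totals}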

138. How would you design a playbook for debugging a ‘bad LLM answer’ reported by a customer?

Time travel bundle. Support pulls session id → reconstruction package: model, prompt template id, retrieval trace, redacted tool I/O, validator outcome—no guessing in Slack threads.

Classify failure. Stale index, citation not retrieved, safety false positive, tool error masquerading as model creativity, user prompt ambiguity—each routes to different teams.

Repro lab. Re-run with frozen inputs on staging; diff against current prod to see if issue already fixed.

Communicate. Customer-facing RCA template distinguishes product limits from bugs.

Prevention. Every Sev2 adds a golden regression test before closing.
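
Sketch. One possible shape for the reconstruction package; every field maps to a tag already emitted on the trace (question 131), and the store lookup is a stand-in.

def build_replay_bundle(session_id: str, store) -> dict:
    t = store.trace_for_session(session_id)    # stand-in lookup
    return {
        "session_id": session_id,
        "model_id": t["model_id"],
        "prompt_template_id": t["prompt_template_id"],
        "retrieval_trace": t["retrieval"],     # chunk ids + scores
        "tool_io": t["tool_io"],               # pre-redacted payloads
        "validator_outcome": t["validator"],
    }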

139. How would you implement synthetic monitoring (probes) for LLM endpoints?

Representative prompts. Small canary set covering retrieval, tools, safety, multilingual—run every minute from multiple regions.

Assertions. Schema pass, latency bounds, refusal on forbidden category, deterministic math item—avoid LLM-judging-LLM loops in primary alerts.

Cost control. Use the cheapest model that still validates the infrastructure path; escalate to the full stack hourly.

Secrets. Probe accounts isolated from production data; synthetic corpuses avoid PII leakage.

Dashboards. Visualize probe history next to deploy markers for quick correlation.
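
Sketch. One probe with the plain assertions above, assuming a JSON-over-HTTPS endpoint; the URL shape, field names, and bounds are hypothetical.

import time
import requests

PROBE = {"prompt": "What is 17 * 24?", "expect": "408"}  # deterministic math item

def run_probe(url: str) -> dict:
    start = time.monotonic()
    resp = requests.post(url, json={"prompt": PROBE["prompt"]}, timeout=10)
    latency = time.monotonic() - start
    body = resp.json()
    return {
        "schema_ok": resp.status_code == 200 and "answer" in body,
        "latency_ok": latency < 5.0,                     # hypothetical bound
        "answer_ok": PROBE["expect"] in body.get("answer", ""),
    }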

140. How would you design an A/B test to compare two LLM models or prompts in production?

Unit of randomization. Sticky user or session buckets to avoid within-user flavor whiplash; power analysis upfront so you do not stop tests early on noise.

Primary metrics. Task success, revenue proxy, CSAT, safety incidents—not only BLEU. Guardrails on toxicity and cost.

Infrastructure. Feature flags + tracing tags ensure each event records arm id; warehouse queries must be trivial.

Duration & seasonality. Run through full weekly cycles; LLM quality can drift with world events.

Ethical stops. Automated halt if harm metric crosses threshold—experiments are not academia.
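
Sketch. Sticky bucketing via a deterministic hash, so a user sees one arm for the whole test regardless of server or session; the arm names are placeholders.

import hashlib

def assign_arm(user_id: str, experiment_id: str,
               arms=("control", "candidate")) -> str:
    # Hash of (experiment, user) is stable across sessions and servers;
    # log the returned arm id on every event so warehouse joins stay trivial.
    h = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return arms[int(h, 16) % len(arms)]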

141. How would you operationalize human evaluation (labeling pipelines, inter-rater reliability) for LLM outputs?

Rubrics. Score grounding, helpfulness, policy adherence on clear 1–5 scales with examples; ambiguous rubrics yield unusable data.

Calibration. Weekly golden items and drift checks; disagree rate triggers rubric refresh.

Workflow. Labeling UI shows evidence snippets side-by-side; timers and breaks reduce fatigue errors.

Vendor vs internal. Regulated data may forbid crowdsourcing—plan secure facilities or employee-only pools.

Linkage. Labels join telemetry for automatic dataset exports to fine-tuning (where allowed).
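
Sketch. Cohen's kappa as the inter-rater check behind the calibration step; a common rule of thumb treats values below roughly 0.6 as a rubric problem.

from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    # Agreement corrected for chance: (p_o - p_e) / (1 - p_e).
    n = len(rater_a)
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)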

142. How would you integrate red teaming and adversarial testing into the LLM development lifecycle?

Shift left. Run automated jailbreak suites on every prompt template change; block merges on severity-1 escalations.

Periodic deep red team. Quarterly human-led campaigns with scope docs and safe harbors; file issues with exploit reproductions.

Metrics. Track mean time to remediate critical findings; executives care about trend, not one-off headlines.

Cross-functional. Legal reviews outputs for regulated claims; product defines acceptable refusal rates.

Residual risk register. Document known gaps instead of pretending 100% jailbreak-proof.
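
Sketch. The merge-blocking gate described under “shift left”; the result schema and severity scale are hypothetical.

def red_team_gate(results: list[dict]) -> None:
    # results: [{"case_id": ..., "severity": 1-4, "passed": bool}, ...]
    sev1 = [r for r in results if r["severity"] == 1 and not r["passed"]]
    if sev1:
        ids = ", ".join(str(r["case_id"]) for r in sev1)
        raise SystemExit(f"blocking merge: severity-1 escalations on {ids}")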

143. How would you expose LLM inference cost and token usage inside observability tools for engineers and finance?

Span attributes. Attach billed tokens, estimated tokens, model tariff id, cache hit bool—joinable to finance’s cost table nightly.

Burn dashboards. Show cost per feature per hour with anomaly detection; separate one-off batch jobs from steady chat.

Allocations. Attribute provider invoices back to traces via shared cost_center tag; untagged spend becomes engineering debt.

Forecast hooks. Export weekly p95/p99 token distributions to FinOps models predicting next quarter.

Culture. Give engineers read-only cost views so they internalize caching wins.
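
Sketch. Cost as span attributes, reusing the OpenTelemetry span from question 131; the tariff table is a hypothetical stand-in for the provider's price sheet.

# USD per token, keyed by model tariff id (illustrative numbers).
TARIFF = {"model-x:2026-01": {"in": 2.5e-6, "out": 1.0e-5}}

def record_cost(span, tariff_id: str, tokens_in: int,
                tokens_out: int, cache_hit: bool) -> None:
    rate = TARIFF[tariff_id]
    span.set_attribute("llm.tariff_id", tariff_id)
    span.set_attribute("llm.cache_hit", cache_hit)
    span.set_attribute(
        "llm.cost_usd", tokens_in * rate["in"] + tokens_out * rate["out"]
    )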

144. How would you alert on LLM quality regressions (not just uptime) without drowning on-call in noise?

Layered signals. Combine offline eval deltas, online thumbs-down rate, synthetic probe failures, and retrieval zero-hit spikes—single-metric pages lie.

Burn-rate alerts. Page only when badness accelerates over SLO window; daily noise goes to tickets.

Ownership. Quality SLOs route to ML + product jointly; pure infra on-call should not root-cause grounding bugs at 3 AM without playbooks.

Tuning. Run fire drills on synthetic regressions to validate thresholds quarterly.

User impact. Weight alerts by affected MAU—niche failures can wait.
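
Sketch. A multiwindow burn-rate check of the kind above; 14.4 is the classic fast-burn factor (2% of a 30-day error budget consumed in one hour), and the windows are conventional choices, not requirements.

def burn_rate(bad: int, total: int, slo: float = 0.99) -> float:
    # 1.0 = exactly on budget; higher = budget burning faster than allowed.
    return 0.0 if total == 0 else (bad / total) / (1 - slo)

def should_page(fast: tuple, slow: tuple) -> bool:
    # fast = (bad, total) over ~5 min, slow = over ~1 h. Paging only when
    # both windows burn hot turns brief blips into tickets instead of pages.
    return burn_rate(*fast) >= 14.4 and burn_rate(*slow) >= 14.4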

145. How would you design observability dashboards for different audiences (execs, PMs, ML platform, SRE)?

Executives: Cost, MAU, incident counts, strategic quality index—no token histograms.

PMs: Feature-level success, experiment arms, top failure reasons from support tags.

ML platform: Model latency matrix, GPU saturation, queue depths, data drift proxies.

SRE: RED metrics, saturation, dependency health, synthetic probe boards.

Principle. Same underlying data model; curated lens per persona—avoid four tools disagreeing because of inconsistent SQL.

Recap — this section

Q · Takeaway
131 · Unified trace + redacted payloads; LLM-specific SLIs; structured quality events; product tags.
132 · Tiered retention + masking; split audit vs app logs; ephemeral replay; regional log stacks.
133 · Versioned gold data; multi-metric gates; CI tiers; human sampling.
134 · Explicit + implicit signals; trace-linked ticketing; anti-abuse; lawful transparency.
135 · Entailment + citation gates; multi-signal confidence; sampled and mandatory HITL; SLO on violations.
136 · Measured waterfall; TTFT vs total; tool calls can dominate; async offload; queue-depth autoscale.
137 · Online proxies + snapshot replays; dimensional slices; freshness pairing; owner-linked alerts.
138 · Immutable replay bundles; failure taxonomy; repro lab; regression tests from incidents.
139 · Multi-region canaries; cheap infra checks + periodic full stack; isolated data; deploy overlays.
140 · Sticky buckets; business-primary metrics; flag discipline; seasonality-aware duration; ethical auto-stop.
141 · Concrete rubrics; rater calibration; purpose-built UI; jurisdiction-appropriate workforce; pipeline to training.
142 · CI adversarial gates; quarterly human red team; exec-visible MTTR; explicit residual risk.
143 · Token + tariff on spans; feature burn charts; invoice reconciliation tags; FinOps feedback loop.
144 · Composite SLO signals; burn rates; shared ML/product ownership; rehearsed thresholds; MAU weighting.
145 · Persona-specific views on unified telemetry; exec distillation; ML + SRE depth split.
