Fifteen staff-depth reliability scenarios: SLO-aware breakers with real fallback stacks; idempotent retries and streaming resume protocols; schema-normalizing multi-provider gateways; checkpointed partial streams; tool-aware idempotency stores; RAG revision + lag semantics users can see; RPO/RTO-driven vector DR with game days; coordinated 429 handling; GPU-correct K8s probes; JSON repair ladders; governed DLQs; layered validators; non-blocking shadow with residency parity; sticky canaries with auto-rollback; and explicit degradation ladders that preserve trust.
Interview stance. LLM reliability is SRE plus stochastic failure modes: half-open breakers, streaming resumes, idempotent retries, and validators are table stakes. Always pair mechanical resilience with user-trustworthy degradation—empty errors are not ‘graceful.’
Circuit breakers need a stocked fallback lane (alternate model, canned path, or honest outage)—not faster 503s.
RAG is eventually consistent by physics: ship revision metadata, lag metrics, and UX that admits staleness.
Shadow and canary paths require automated rollback on grounding/cost SLO regressions—humans should only triage alerts, not stare at dashboards.
Treat malformed tool JSON and partial streams as first-class test cases in CI, not production surprises.
101. How would you design a circuit breaker for LLM API calls to handle provider outages gracefully?
Signals. Track rolling-window error rate, timeouts, and tail latency per upstream lane—not only HTTP 500, but p95 > SLO while tokens drain slowly (silent brownouts). Separate breakers per region/account key so one bad key does not take the whole product dark.
States. Open circuit fast enough to protect user patience; enter half-open only with bounded probe traffic and require N consecutive healthy mini-requests before closing. Hysteresis prevents flap on jittery providers.
Fallbacks. Pre-wired degrade stack: alternate model family, cached answer for FAQs, deterministic template path, or honest outage page. Empty 503s are not a design—write the copy and have legal review it for regulated verticals.
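A minimal breaker sketch under these assumptions (state names, thresholds, and the event hook are illustrative, not a specific library; record timeouts and SLO-busting latency as failures, not just 5xx):

```python
import time
from collections import deque

class LlmCircuitBreaker:
    """Per-lane breaker: rolling error window, fast open, bounded half-open probes,
    N consecutive healthy probes to close (hysteresis against flapping providers)."""

    def __init__(self, window_size=50, error_threshold=0.5,
                 open_seconds=30.0, probes_to_close=5):
        self.window = deque(maxlen=window_size)  # recent outcomes; callers should count
        self.error_threshold = error_threshold   # timeouts and slow brownouts as failures
        self.open_seconds = open_seconds
        self.probes_to_close = probes_to_close
        self.state = "closed"
        self.opened_at = 0.0
        self.healthy_probes = 0
        self.probes_issued = 0

    def allow_request(self) -> bool:
        if self.state == "open":
            if time.monotonic() - self.opened_at < self.open_seconds:
                return False
            self.state = "half_open"             # cool-down elapsed: probe cautiously
            self.healthy_probes = 0
            self.probes_issued = 0
        if self.state == "half_open":
            if self.probes_issued >= self.probes_to_close:
                return False                     # bounded probe traffic only
            self.probes_issued += 1
            return True
        return True                              # closed: normal traffic

    def record(self, success: bool) -> None:
        self.window.append(success)
        if self.state == "half_open":
            if not success:
                self._trip()
            else:
                self.healthy_probes += 1
                if self.healthy_probes >= self.probes_to_close:
                    self.state = "closed"
                    self.window.clear()
        elif self.state == "closed":
            failures = self.window.count(False)
            if len(self.window) == self.window.maxlen and \
               failures / len(self.window) >= self.error_threshold:
                self._trip()

    def _trip(self) -> None:
        self.state = "open"
        self.opened_at = time.monotonic()
        # emit a breaker-transition event here so postmortems can check the thresholds
```

When allow_request() returns False, the caller walks the pre-wired fallback stack above instead of returning an empty 503.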
Streaming. If circuit opens mid-stream, terminate with structured client error code so UIs can offer resume; do not hang TCP half-open.
Observability. Emit breaker transitions as events; postmortems should answer whether thresholds were tuned from real incidents or guessed.
102. How would you implement retry logic with exponential backoff for transient LLM API failures?
Idempotency contract. Only auto-retry when operation is observationally idempotent: same Idempotency-Key maps to one billed completion, or side effects are read-only. Non-idempotent tool calls need explicit user confirmation before second attempt.
Provider etiquette. Honor Retry-After when present; cap attempts and total wall-clock budget so one bad job does not occupy workers for hours.
Jitter. Full jitter on backoff avoids herd re-hit after coordinated outages; coordinate with region peers via small random floors.
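One way to combine these rules; the exception type, its retry_after attribute, and the budget numbers are assumptions for illustration:

```python
import random
import time

class TransientLlmError(Exception):
    """Raised by the transport layer for retryable failures (429/5xx/timeouts);
    may carry the provider's Retry-After in seconds."""
    retry_after: float | None = None

def retry_with_backoff(call, *, max_attempts=5, base=0.5, cap=30.0,
                       total_budget_s=120.0, idempotent=True):
    """Capped full-jitter backoff; honors Retry-After; never auto-retries
    non-idempotent work."""
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientLlmError as err:
            if not idempotent or attempt == max_attempts:
                raise                       # non-idempotent work needs explicit confirmation
            delay = err.retry_after         # prefer the provider's schedule over ours
            if delay is None:
                delay = random.uniform(0, min(cap, base * 2 ** attempt))  # full jitter
            if time.monotonic() - start + delay > total_budget_s:
                raise                       # wall-clock budget exhausted; free the worker
            time.sleep(delay)
```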
Streaming partials. When retrying, replay from last acknowledged token index if protocol supports it; otherwise dedupe completed segments in server buffer to avoid duplicate paragraphs.
Visibility. Tag retries in traces and billing—finance should see retry multiplier, not unexplained spend spikes.
103. How would you design a multi-provider LLM strategy (OpenAI primary, Anthropic fallback, Azure OpenAI tertiary)?
Normalization layer. Gateway speaks one internal schema (messages, tools, temperature caps, safety settings) and maps to each vendor’s quirks—different tool formats, different max context, different refusal behaviors.
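A sketch of the internal schema and the adapter seam it implies (all field and method names here are illustrative):

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class ChatRequest:
    """Internal canonical request; vendor shapes never leak upstream of the gateway."""
    messages: list[dict]                     # [{"role": ..., "content": ...}]
    tools: list[dict] = field(default_factory=list)
    temperature: float = 0.2
    max_output_tokens: int = 1024
    region_pin: str | None = None            # e.g. "eu-only" for regulated tenants

class ProviderAdapter(Protocol):
    name: str
    def supports(self, req: ChatRequest) -> bool: ...   # consult the capability matrix
    def to_vendor(self, req: ChatRequest) -> dict: ...  # map to this vendor's payload/quirks
    def from_vendor(self, raw: dict) -> dict: ...       # normalize the response back
```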
Health & routing. Continuous synthetic probes plus real-traffic error sampling drive a live routing table. Policy engine pins regulated workloads to specific regions (e.g., Azure EU-only) regardless of cost optimality.
Capability matrix. Versioned spreadsheet checked into repo: streaming yes/no, JSON mode quality, vision, function-call parity. CI blocks deploy if feature flags reference unsupported combos.
Eval discipline. Before promoting a lane, run goldens and safety harness across all candidates—fallback quality is not ‘good enough’ if it hallucinates citations.
Exit clarity. Document how fast you can repoint DNS or feature flags when a vendor changes ToS—executive questions arrive during geopolitical news cycles.
104. How would you handle partial LLM responses caused by network timeouts during streaming?
Persist progress. Server buffers partial assistant message with monotonic sequence numbers; client acknowledges or checkpoints periodically so resume does not duplicate text.
Protocol. Expose resume_cursor / continuation tokens in API; mobile clients especially need idempotent ‘continue generation’ after Wi-Fi flaps.
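A minimal server-side shape for the resume protocol (naming is illustrative; a real buffer would be durable, not in-process):

```python
from dataclasses import dataclass, field

@dataclass
class StreamBuffer:
    """Server-side partial-message buffer: tokens get monotonic sequence numbers
    so a reconnecting client can resume from its last acknowledged cursor."""
    tokens: list[str] = field(default_factory=list)
    finalized: bool = False

    def append(self, token: str) -> int:
        self.tokens.append(token)
        return len(self.tokens) - 1          # this token's sequence number

    def resume_from(self, cursor: int) -> list[str]:
        # Replay only what the client missed; nothing duplicated, nothing lost.
        return self.tokens[cursor + 1:]
```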
Billing integrity. Mark spans partial for analytics; reconcile tokens billed vs delivered; consider not charging full completion if policy allows—at minimum, prove good faith in disputes.
UX copy. Show ‘connection dropped—tap to continue’ not spinner panic; preserve prior tokens visibly so users trust nothing was lost.
Testing. Chaos inject RST packets in staging against long streams; measure data loss rate.
105. How would you design an idempotent LLM request system to prevent duplicate processing?
Keys. Client-generated UUID v4 in header scoped to tenant+user; server stores hash of canonical request payload → outcome mapping with TTL aligned to finance reconciliation window (often 24–72h).
Semantics. On duplicate, replay stored assistant message and usage record—do not re-run tools blindly if second request arrived after external side effects occurred unless tooling is provably idempotent.
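Sketch of the duplicate-replay path, assuming an in-memory stand-in for the real Redis or database table:

```python
import hashlib
import json
import time

class IdempotencyStore:
    """Maps (tenant, idempotency key, canonical payload hash) to a stored outcome.
    A duplicate replays the stored completion and usage record instead of re-running."""

    def __init__(self, ttl_seconds: int = 72 * 3600):   # align TTL with reconciliation window
        self._ttl = ttl_seconds
        self._entries: dict[tuple, tuple[float, dict]] = {}

    @staticmethod
    def _canonical_hash(payload: dict) -> str:
        return hashlib.sha256(
            json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
        ).hexdigest()

    def get_or_run(self, tenant: str, key: str, payload: dict, run) -> dict:
        lookup = (tenant, key, self._canonical_hash(payload))
        hit = self._entries.get(lookup)
        if hit and time.time() - hit[0] < self._ttl:
            return hit[1]                    # replay stored outcome, do not re-bill or re-run tools
        outcome = run(payload)               # first (or expired) execution
        self._entries[lookup] = (time.time(), outcome)
        return outcome
```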
Streaming. Same key resumes same stream fingerprint or returns finalized transcript; never interleave two live streams for one idempotency key.
Edge cases. Key rotation, mobile offline retries, and double-submit forms are classic bugs—document client SDK behavior explicitly.
Audit. Idempotency table is evidence in billing disputes—immutable append or WORM storage in regulated settings.
106. How would you ensure eventual consistency in a RAG system where the vector store is eventually in sync with the source DB?
Versioning. Attach monotonic source_rev and index_rev to each chunk; retrieval returns them in metadata so answer composer can refuse stale synthesis or display freshness banner.
Lag metrics. Export index_lag_seconds per corpus and surface to UX when thresholds exceeded (legal might require hard block).
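Illustrative shape for the revision metadata and the lag check (field names and the threshold are assumptions):

```python
import time
from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    text: str
    source_rev: int       # source document revision the chunk was extracted from
    index_rev: int        # revision actually present in the vector index
    indexed_at: float     # epoch seconds of last (re)indexing

MAX_LAG_SECONDS = 900     # illustrative; legal may require a hard block instead of a banner

def freshness_banner(chunks: list[RetrievedChunk]) -> str | None:
    """User-visible staleness warning when any cited chunk lags past the threshold."""
    if not chunks:
        return None
    worst_lag = max(time.time() - c.indexed_at for c in chunks)
    if worst_lag > MAX_LAG_SECONDS:
        return f"Sources may be up to {int(worst_lag // 60)} minutes out of date."
    return None
```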
Ordering. CDC per primary key with partition leaders; document out-of-order edge cases during bulk backfills and how you pause queries or pin to old snapshot.
Deletes. Tombstones must beat resurrect races—define consistency level (strong per doc id after delete ack vs eventual).
Interview candor. ‘Eventually consistent’ without user-visible semantics is an outage waiting to happen.
107. How would you design disaster recovery for a vector database containing millions of embeddings?
Rebuild vs replicate. Most teams can rebuild from object-store chunks + metadata manifests faster/cheaper than synchronous multi-master indexes—state the RPO: 'we accept N minutes of re-embedding' vs 'zero data loss', which demand very different budgets.
Artifacts. Versioned dumps of metadata, embedding model id, and chunk text; cross-region replication with immutability for ransomware resilience.
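One possible manifest record per corpus tier; every field name here is an assumption about what a rebuild needs:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CorpusManifest:
    """Everything needed to rebuild the index from object storage after a region loss."""
    corpus_id: str
    embedding_model: str        # exact model id; mixed models make a rebuilt index useless
    chunk_listing_uri: str      # immutable, cross-region-replicated listing of chunk objects
    metadata_dump_uri: str      # doc ids, ACLs, source_rev values
    checksum: str               # verify the secondary-region copy before cutting traffic over
    rpo_minutes: int            # agreed data-loss budget for this tier
```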
Drills. Quarterly game day: delete index in staging region, measure rebuild time, fix automation gaps. DR that only exists in slides fails audits.
Traffic switch. DNS or routing layer can steer reads to warm secondary region if primary region fails; validate embeddings identical via checksum sampling.
Scope. Separate DR tier for ‘nice to have’ vs ‘contractual SLA’ corpora; not everything needs hot standby.
108. How would you handle LLM API rate limit (429) errors in a high-traffic production system?
Global coordination. Central token-bucket in gateway keyed by provider account and workload class; inner services cannot each maintain naive local limits that collectively oversubscribe.
Retry-After discipline. Parse provider headers; pause all workers for that key until window resets—blind retries amplify ban risk.
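A sketch of the gateway-side gate, keyed per provider account and workload class (rate numbers and method names are illustrative):

```python
import threading
import time

class ProviderRateGate:
    """Gateway-level gate per (provider account, workload class): a shared token
    bucket plus a hard pause when the provider sends Retry-After."""

    def __init__(self, rate_per_s: float, burst: int):
        self._rate = rate_per_s
        self._capacity = burst
        self._tokens = float(burst)
        self._updated = time.monotonic()
        self._paused_until = 0.0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self._lock:
            now = time.monotonic()
            if now < self._paused_until:
                return False                     # provider told us to back off
            self._tokens = min(self._capacity,
                               self._tokens + (now - self._updated) * self._rate)
            self._updated = now
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True
            return False

    def on_429(self, retry_after_s: float) -> None:
        with self._lock:
            # Pause every worker sharing this key, not just the one that got throttled.
            self._paused_until = max(self._paused_until,
                                     time.monotonic() + retry_after_s)
```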
Prioritization. Interactive user chat beats batch re-embed; shed or delay low-priority queues with transparent job ETA.
Multi-key strategy. Procure additional paid throughput through legitimate contracts; shard tenants across keys carefully to avoid blast radius on revocation.
Product behavior. When throttled, show queue position or offer async completion—never infinite spinners.
109. How would you design a health check and readiness probe for an LLM service in Kubernetes?
Liveness vs readiness. Liveness: lightweight—process not wedged. Readiness: strict—model weights resident, CUDA allocator healthy, tokenizer loaded, and warm-up forward pass succeeded on expected batch shape.
Startup probe. 10–20+ GiB model pulls need long startup probes or InitContainers; premature readiness blackholes traffic to OOMing pods.
GPU specifics. Readiness should fail when ECC errors or thermal-throttle sensors trip, if your stack exposes them; avoid routing to unhealthy silicon.
Dependency checks. Optional: soft dependency on tokenizer download mirrors—fail readiness if staging cannot reach the artifact server, not only when the GPU sits idle.
Anti-pattern. Calling real LLM inference on every probe—too costly; keep a cheap health tensor separate from the user path.
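A rough shape for the probe handlers, assuming a PyTorch server with a Hugging Face style model and tokenizer already loaded; wire these behind the liveness and readiness endpoints of whatever HTTP framework the service uses:

```python
import torch

READY = {"warmed_up": False}

def liveness() -> tuple[int, str]:
    """Cheap: only proves the process is not wedged."""
    return 200, "ok"

def readiness(model, tokenizer) -> tuple[int, str]:
    """Strict: weights resident, CUDA usable, tokenizer loaded, warm-up pass done."""
    if model is None or tokenizer is None:
        return 503, "artifacts not loaded"
    if not torch.cuda.is_available():
        return 503, "no usable GPU"
    if not READY["warmed_up"]:
        try:
            with torch.inference_mode():
                # Tiny forward pass on the expected shape, not a real user request.
                ids = tokenizer("ping", return_tensors="pt").input_ids.to("cuda")
                model(ids)
            READY["warmed_up"] = True
        except RuntimeError as err:          # OOM, allocator, or driver trouble
            return 503, f"warm-up failed: {err}"
    return 200, "ready"
```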
110. How would you detect and recover from a situation where an LLM returns malformed JSON (broken structured output)?
Prevention. Prefer provider JSON/grammar modes or constrained decoding; validate tool schemas in the loop the model sees—this reduces but does not eliminate escapes.
Repair ladder. Deterministic linter pass → small ‘json surgeon’ model with tiny context → ask originating model once with ‘fix to schema’ nudge—cap loops.
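A sketch of the deterministic rung plus a single bounded re-ask hook (the small surgeon-model rung is omitted; reask_once is an assumed callback):

```python
import json

def repair_json(raw: str, reask_once=None) -> dict | None:
    """Repair ladder: deterministic fixes first, one bounded re-ask last, never eval()."""
    # Rung 1: cut surrounding prose/code fences by keeping the outermost {...} span.
    start, end = raw.find("{"), raw.rfind("}")
    candidate = raw[start:end + 1] if start != -1 and end > start else raw
    # Rung 2: strict parse, then a pass with common trailing-comma damage removed.
    for attempt in (candidate, candidate.replace(",}", "}").replace(",]", "]")):
        try:
            return json.loads(attempt)
        except json.JSONDecodeError:
            continue
    # Rung 3: one re-ask with a "fix this to match the schema" nudge; cap the loop at one.
    if reask_once is not None:
        try:
            return json.loads(reask_once(raw))
        except json.JSONDecodeError:
            pass
    return None   # escalate: ask a human for the missing field instead of silently dropping
```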
Telemetry. Classify failure modes (truncation vs illegal escape vs extra prose) to tune max_tokens or prompts upstream.
Safety. Never eval repair snippets; parse in sandbox.
User path. If repair fails, ask human for missing field instead of silently dropping tool call—agents stall mysteriously otherwise.
111. How would you design a dead letter queue for failed LLM processing jobs?
Envelope. Store a pointer to the payload in object storage, not the full prompt text when it contains PII—include tenant, job type, idempotency key, attempt count, and a categorized error (rate limit vs validation vs model refusal).
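One way to type the envelope; the failure taxonomy and field names are illustrative:

```python
from dataclasses import dataclass
from enum import Enum

class FailureClass(str, Enum):
    RATE_LIMIT = "rate_limit"
    VALIDATION = "validation"
    MODEL_REFUSAL = "model_refusal"
    TOOL_ERROR = "tool_error"

@dataclass(frozen=True)
class DlqEnvelope:
    """What lands on the DLQ: a pointer to the payload, never the PII itself."""
    tenant_id: str
    job_type: str
    idempotency_key: str
    payload_uri: str            # object-storage pointer; purged on the GDPR schedule
    attempt_count: int
    failure_class: FailureClass
    last_error: str             # truncated, scrubbed error message
    failed_at: float            # epoch seconds, for burn-rate dashboards
```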
Replay safety. The replay button reruns jobs through the same validators; optionally require human approval before replaying jobs that invoke tools after the fix.
Analytics. DLQ depth dashboards with burn-rate alerts; correlate spikes to bad deploys or vendor incidents.
Retention. Align expiry with GDPR—auto-purge sensitive payloads even inside DLQ.
Cultural. Treat a non-zero DLQ as normal, but trending growth as technical debt—schedule fix-it weeks.
112. How would you validate LLM output before it is returned to the user or passed to the next pipeline stage?
Schemas. Pydantic / JSON Schema for machine outputs; regex + allowlists for human text policies (no competitor slurs, no medical dosing unless approved). Include numeric sanity checks on domain objects (prices > 0, dates parseable).
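For example, a machine-output contract, assuming Pydantic v2 (the model, fields, and banned-phrase policy are invented for illustration):

```python
from datetime import date
from pydantic import BaseModel, Field, field_validator

class QuoteDraft(BaseModel):
    """Machine-facing output contract; the LLM's JSON must survive this before moving on."""
    customer_id: str = Field(min_length=1)
    price_usd: float = Field(gt=0)                   # numeric sanity: prices must be positive
    valid_until: date                                # dates must actually parse
    summary: str = Field(max_length=2000)

    @field_validator("summary")
    @classmethod
    def no_banned_phrases(cls, v: str) -> str:
        banned = ("guaranteed returns",)             # illustrative allow/deny policy hook
        if any(p in v.lower() for p in banned):
            raise ValueError("policy violation in summary")
        return v
```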
Layering. Fast deterministic validators before expensive cross-model review; fail closed on ambiguity in banking/health.
Secondary model. For high-risk paths, lightweight judge checks entailment with source chunks—still probabilistic, so combine with rules.
Telemetry. Validator reject rate per prompt version guides iteration.
UX. On failure, show ‘I could not produce a compliant answer’ with escalation—not silent truncation users mistake for truth.
113. How would you design a shadow mode for testing a new LLM model in production without impacting users?
Sampling. Async fork x% of traffic after primary response already committed; shadow path must not block latency SLO—use separate worker pool with backpressure discard for shadow only.
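A sketch of the non-blocking fork; the model and diff-store interfaces and the sample rate are assumptions:

```python
import asyncio
import random

SHADOW_SAMPLE_RATE = 0.02
shadow_queue: asyncio.Queue = asyncio.Queue(maxsize=500)   # bounded: discard, never block

async def handle_request(req, primary_model):
    # The primary path is fully committed to the user before any shadow work starts.
    response = await primary_model.complete(req)
    if random.random() < SHADOW_SAMPLE_RATE:
        try:
            shadow_queue.put_nowait((req, response))        # backpressure = drop the sample
        except asyncio.QueueFull:
            pass                                            # shadow loss is acceptable
    return response

async def shadow_worker(shadow_model, diff_store):
    while True:
        req, primary_response = await shadow_queue.get()
        shadow_response = await shadow_model.complete(req)  # separate pool, own rate limits
        diff_store.record_diff(req.id, primary_response, shadow_response)  # compact diff only
```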
Comparison. Offline diff metrics: schema pass rate, grounding overlap with citations, toxicity scores, cost delta. Store compact diffs, not full prompts if policy forbids.
Safety. Shadow still obeys data residency—no shipping EU prompts to US-only endpoints for convenience.
Guardrails. Kill switch if shadow spikes errors or cost; never retry shadow into user-visible paths.
Outcome. Promotion decision is a defined checklist, not vibes from one reviewer.
114. How would you design canary deployments for rolling out prompt changes in a production LLM system?
Surface area. Canary at tenant/region granularity for B2B, or user-hash sticky buckets for consumer—avoid ‘5% random’ that confuses reproducibility.
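Sticky bucketing can be as small as a hash (function and parameter names are illustrative):

```python
import hashlib

def canary_bucket(user_id: str, prompt_version: str, canary_percent: int) -> str:
    """Sticky assignment: the same user + prompt version always lands in the same arm,
    so sessions are reproducible and support can replay exactly what a user saw."""
    digest = hashlib.sha256(f"{user_id}:{prompt_version}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return "canary" if bucket < canary_percent else "control"
```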
Versioning. Prompt templates immutable with ids; no one edits prod text in place. Pair with model version in telemetry.
Human review. For major tone/legal prompts, require dual approval before canary knob moves past internal-only cohort.
Observability. Dashboard comparing canary vs control on same queries using held-out eval set continuously replayed.
115. How would you implement graceful degradation when the LLM service is overloaded — what do you show the user?
Ladder. (1) Shorten answers with summary mode, (2) disable agents/tools, (3) route to cheaper model with banner, (4) async email/queue full analysis, (5) read-only cached FAQs, (6) maintenance page with ETA—each step has prewritten UX and exec sign-off.
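The ladder and its triggers can live in config rather than in engineers' heads; the rung wording and thresholds below are placeholders pending the exec-approved copy:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DegradationRung:
    level: int
    action: str                # what the system does at this rung
    user_copy: str             # prewritten, exec-approved wording shown to the user

LADDER = [
    DegradationRung(1, "summary_mode",    "Answers are shorter than usual during high demand."),
    DegradationRung(2, "disable_tools",   "Some advanced features are temporarily unavailable."),
    DegradationRung(3, "cheaper_model",   "You're getting responses from our lightweight model."),
    DegradationRung(4, "async_queue",     "We'll email your full analysis within the hour."),
    DegradationRung(5, "cached_faq_only", "Live answers are paused; browse common answers below."),
    DegradationRung(6, "maintenance",     "We expect to be back at approximately {eta}."),
]

def current_rung(queue_age_s: float, gpu_util: float) -> DegradationRung | None:
    """Illustrative automatic trigger: pick the rung from load signals, not from a human on Zoom."""
    if queue_age_s < 5 and gpu_util < 0.85:
        return None                        # healthy: no degradation
    if queue_age_s < 30:
        return LADDER[0]
    if queue_age_s < 120:
        return LADDER[2]
    return LADDER[3]
```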
Transparency. Explain capacity honestly; users forgive delay more than invisible quality collapse.
Fairness. VIP tenants or paid SKUs may get reserved concurrency—declare product policy openly to avoid ‘secret favors’ perception internally.
Load shed triggers. Queue age, GPU utilization, or provider error budget—automatic, not engineer on Zoom guessing.
Postmortem. Degradation events feed capacity planning; if ladder rung 4 is common, the architecture is not sized correctly.
Example. Black Friday spike: mini model answers with banner + option to email full report later.