§17 — Capstone synthesis
Fifteen closing scenarios for staff-level fluency: 45-minute zero-to-scale narrative; regression bisection under version drift; dynamic whiteboard facilitation; knowing when deterministic systems win; LLM-specific launch checklists; viral traffic game days; prototype-to-SaaS hardening; consolidating duplicate AI programs; M&A technical diligence; post-merger stack merge; CISO-grade review storytelling; ADRs amidst rapid change; research-to-prod handoff contracts; what signals seniors listen for; and a socio-technical thesis for long-lived LLM infrastructure—the end of the guide series.
Interview stance. Capstone questions reward synthesis: you can thread retrieval, safety, cost, and governance in one story, admit when not to use LLMs, and close with how organizations—not only GPUs—make systems survive contact with reality.
Treat the interview as collaborative problem solving: trade tables beat monologues.
Show you know how acquisitions, audits, and migrations constrain architecture—not only greenfield.
End-to-end fluency means naming what you instrument on day one, not only boxes on slide four.
Closing with socio-technical humility is a feature: machines plus people plus law.
On this page
Q251 — Zero to scale interview
Q252 — Quality regression hunt
Q253 — Whiteboard constraints
Q254 — When not LLM
Q255 — Launch checklist
Q256 — 10x traffic spike
Q257 — Proto to SaaS
Q258 — Consolidate initiatives
Q259 — AI startup diligence
Q260 — Post-merger merge
Q261 — Security review deck
Q262 — ADRs for ML
Q263 — Research to prod
Q264 — What seniors listen for
Q265 — Socio-technical close
251. In a 45-minute interview, how would you narrate taking an enterprise copilot from prototype to reliable production at scale?
Anchor on user job. 60 seconds on who benefits and success metric—prevents architecture scatter.
Layered story. Data/ACL → ingest/index → retrieve/rerank → gen/validate → eval/observability → cost/reliability/security loops—each with explicit tradeoff.
Risks. Call out injection, staleness, and cost explosions early—senior signal.
Scaling hooks. Queues, caching, multi-tenant isolation, canaries—mention by name tied to pain.
Close. What you’d verify first in week one (instrumentation + goldens) vs hypothetical year-two polish.
252. How would you systematically hunt down a multi-step quality regression after a release (model, data, or code)?
Freeze scope. Record exact model id, prompt hash, index build, config flags for bad vs good era.
Bisect. Binary search deployments, data refreshes, and reranker versions, one variable at a time (see the sketch after this list).
Signals. Slice metrics: language, tenant, query length—localize blast radius before root cause.
Replay. Offline replay of failing traces on staging with controlled variables.
Fix. Add regression tests at lowest layer that failed (retrieval vs generation vs validator).
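A minimal bisection sketch, under assumptions: each deployment is described by a manifest pinning model id, prompt hash, index build, and flags, and `replay_is_good` stands in for an offline replay of golden traces; both names are hypothetical, not a real tool.

```python
# Bisection over release manifests; ReleaseManifest fields and replay_is_good()
# are illustrative stand-ins for your deployment metadata and replay harness.
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class ReleaseManifest:
    model_id: str      # exact model version served
    prompt_hash: str   # hash of the prompt template bundle
    index_build: str   # retrieval index build id
    config_flags: str  # serialized feature flags

def first_bad_release(
    releases: List[ReleaseManifest],                    # ordered: known good -> known bad
    replay_is_good: Callable[[ReleaseManifest], bool],  # offline replay of golden traces
) -> ReleaseManifest:
    """Binary-search the release history for the first manifest that fails replay."""
    lo, hi = 0, len(releases) - 1  # releases[0] known good, releases[-1] known bad
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if replay_is_good(releases[mid]):
            lo = mid   # regression landed after mid
        else:
            hi = mid   # regression is at or before mid
    return releases[hi]
```

Each probe replays the same frozen trace set, so only the manifest under test varies between runs.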
253. How would you drive a constrained system-design discussion when the interviewer keeps changing requirements mid-problem?
Clarify. Confirm non-functionals: latency, privacy, budget, scale—write them visibly.
Trade table. When requirements conflict, present 2–3 options with explicit sacrifices—not silent cherry-picking.
Timebox. Offer MVP slice first, then ‘if we had another quarter’ depth—shows senior pacing.
Pushback. Politely challenge impossible combos (‘global zero latency + full privacy + cheapest model’).
Close. Summarize decisions and open risks—interviewers score structured endings.
254. How would you recognize when a classical or deterministic system should replace or bound an LLM solution?
Criteria. Formal correctness needed (ledger math), sub-millisecond latency, or legally mandated determinism—LLM is wrong tool.
Hybrid. Let LLM propose plan; calculator/SQL executes; validator enforces—best of both (sketched after this list).
Cost. If tiny model + rules hits SLA, don’t stack transformers for fashion.
Risk. Safety-critical control systems—hard no unless human-in-loop and bounded scope.
Maturity. Saying ‘no LLM’ signals senior judgment, not weakness.
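A small sketch of that hybrid pattern, assuming a stubbed `propose_plan` in place of a real model call and a toy ledger schema; the point is that generation never performs the arithmetic.

```python
# LLM proposes, deterministic code executes, validator enforces.
# propose_plan() is a stub standing in for a model call.
from decimal import Decimal
from typing import Dict, List

def propose_plan(question: str) -> List[Dict]:
    # In production this would be an LLM call returning structured steps.
    return [{"op": "sum", "field": "amount"}]

def execute(plan: List[Dict], rows: List[Dict[str, Decimal]]) -> Decimal:
    # Deterministic executor: exact Decimal arithmetic, no generation involved.
    total = Decimal("0")
    for step in plan:
        if step["op"] == "sum":
            total += sum((r[step["field"]] for r in rows), Decimal("0"))
        else:
            raise ValueError(f"unsupported op: {step['op']}")
    return total

def validate(result: Decimal, rows: List[Dict[str, Decimal]]) -> Decimal:
    # Validator enforces an invariant the business already knows must hold.
    assert result == sum((r["amount"] for r in rows), Decimal("0"))
    return result

rows = [{"amount": Decimal("10.10")}, {"amount": Decimal("2.40")}]
print(validate(execute(propose_plan("total spend?"), rows), rows))  # 12.50
```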
255. What belongs on a staff-engineer launch checklist specifically for LLM-powered features beyond typical web apps?
Eval evidence. Offline suite + shadow/canary results attached to release ticket.
Safety. Red-team signoff tier appropriate to risk; moderation configs verified.
Observability. Tracing, cost dashboards, quality SLOs live before users—not day-two.
Data. ACL parity tests, erasure job verified, retention TTLs correct.
Rollback. One-click revert for model, prompt, and index version independently (see the gate sketch after this list).
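One way to make the checklist executable, sketched with illustrative field names rather than a real release-tooling schema; the rollback pins are kept separate so model, prompt, and index can each be reverted on their own.

```python
# Pre-launch gate: verify eval evidence, safety signoff, observability, data
# checks, and independent rollback pins before the flag flips -- not on day two.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class LaunchTicket:
    eval_report_url: str = ""       # offline suite + shadow/canary results
    redteam_signoff_tier: str = ""  # e.g. "tier-2", matched to feature risk
    dashboards_live: bool = False   # tracing, cost, quality SLOs
    acl_parity_passed: bool = False # data-plane checks (ACLs, erasure, TTLs)
    rollback: Dict[str, str] = field(default_factory=dict)  # model/prompt/index pins

def launch_blockers(t: LaunchTicket) -> List[str]:
    blockers = []
    if not t.eval_report_url:
        blockers.append("attach offline + canary eval evidence")
    if not t.redteam_signoff_tier:
        blockers.append("missing red-team signoff tier")
    if not t.dashboards_live:
        blockers.append("observability must be live before users")
    if not t.acl_parity_passed:
        blockers.append("ACL parity / retention checks failed or missing")
    for dim in ("model", "prompt", "index"):
        if dim not in t.rollback:
            blockers.append(f"no independent rollback pin for {dim}")
    return blockers

print(launch_blockers(LaunchTicket()))  # an empty list means clear to launch
```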
256. How would you architect a previously steady LLM service to survive a 10× traffic spike (viral launch or Black Friday)?
Capacity buffers. Pre-warm pools, queue-aware autoscale, multi-provider overflow lanes contracted ahead.
Shedding. Graceful degradation ladder with UX honesty; disable batch features before interactive core.
Caches. Semantic + exact caches for hottest prompts; precompute likely FAQs.
Financial. Hard spend circuit breakers so virality cannot vaporize runway overnight (sketched after this list).
Practice. Game-day load tests at 2× expected peak quarterly.
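A minimal sketch of a spend circuit breaker driving the degradation ladder; thresholds and feature names are assumptions for illustration, not product configuration.

```python
# Spend-driven degradation ladder: shed batch work first, protect the
# interactive core until the hard budget stop.
DEGRADATION_LADDER = [
    # (hourly-spend threshold in USD, features to shed at that level)
    (500,  ["batch_summaries"]),      # shed batch features first
    (800,  ["long_context_mode"]),    # then expensive options
    (1000, ["interactive_copilot"]),  # hard stop: the spend breaker trips
]

def features_to_disable(hourly_spend_usd: float) -> list[str]:
    """Return every feature that should be off at the current burn rate."""
    disabled: list[str] = []
    for threshold, features in DEGRADATION_LADDER:
        if hourly_spend_usd >= threshold:
            disabled.extend(features)
    return disabled

assert features_to_disable(600) == ["batch_summaries"]
assert features_to_disable(1200) == [
    "batch_summaries", "long_context_mode", "interactive_copilot"
]
```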
257. How would you migrate a single-tenant LLM prototype into a secure multi-tenant SaaS without rewriting from scratch?
Isolation audit. List every place tenant id assumed; add gateway enforcement + tests (see the middleware sketch after this list).
Data migration. Namespace indices and buckets; verify checksum after bulk copy.
Config. Remove hard-coded secrets; vault + per-tenant flags.
Ops. Central logging/metrics with tenant tags; on-call runbooks updated.
Phasing. Strangler pattern—move one surface at a time with parity checks.
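A gateway-level enforcement sketch, assuming the gateway injects an authenticated tenant header and data lives in per-tenant namespaces; both conventions are illustrative.

```python
# Reject any request whose authenticated tenant does not own the namespace
# it is trying to touch; tests like these back every finding from the audit.
class TenantIsolationError(Exception):
    pass

def enforce_tenant(request_headers: dict, resource_namespace: str) -> str:
    tenant_id = request_headers.get("x-authenticated-tenant")
    if not tenant_id:
        raise TenantIsolationError("no tenant on request; gateway must inject it")
    if not resource_namespace.startswith(f"tenant-{tenant_id}/"):
        raise TenantIsolationError(
            f"tenant {tenant_id} cannot touch {resource_namespace}"
        )
    return tenant_id

assert enforce_tenant({"x-authenticated-tenant": "acme"}, "tenant-acme/docs") == "acme"
try:
    enforce_tenant({"x-authenticated-tenant": "acme"}, "tenant-globex/docs")
except TenantIsolationError:
    pass  # cross-tenant access correctly refused
```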
258. How would you consolidate three competing internal LLM initiatives duplicating connectors and evals?
Inventory. Feature parity matrix; quantify maintenance burn and incident history.
Mandate. Exec-sponsored ‘one platform’ decision with sunset dates for forks.
Migrate. Offer compatibility shims so product teams move without a big-bang cutover (shim sketched after this list).
People. Staffing model: platform team owns core; satellite teams own domains.
Metrics. Track merged cost and defect rate—prove consolidation worked numerically.
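A compatibility-shim sketch under assumed interfaces: the legacy `ask()` signature and the consolidated `PlatformClient` are hypothetical, but the shape shows how call sites keep working while the core moves to the platform.

```python
# Keep the retired initiative's API alive while delegating to the platform.
class PlatformClient:
    def complete(self, *, prompt: str, tenant: str, model: str = "default") -> str:
        return f"[{model}/{tenant}] {prompt}"  # stand-in for the consolidated gateway

class LegacyCopilotShim:
    """Preserves the old surface so product teams migrate without a big bang."""
    def __init__(self, platform: PlatformClient, tenant: str):
        self._platform = platform
        self._tenant = tenant

    def ask(self, question: str) -> str:  # old signature, unchanged for callers
        return self._platform.complete(prompt=question, tenant=self._tenant)

shim = LegacyCopilotShim(PlatformClient(), tenant="support-team")
print(shim.ask("summarize ticket 4211"))
```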
259. What would you examine in technical due diligence when your company acquires an LLM-native startup?
Reality of IP. Data licenses, model lineage, customer contracts—verify claims run deeper than the pitch deck.
Tech debt. Shadow prompts, unscanned dependencies, critical artifacts living only on founders' laptops.
People. Retention packages for key ML engineers; bus factor on adapters.
Liabilities. Pending privacy complaints, unpayable API bills, unrealistic SLAs to customers.
Integration. High-level merge plan with 30/60/90 risk milestones—price the integration tax.
260. How would you integrate two organizations’ LLM stacks after a merger without duplicated spend and inconsistent policies?
Governance merge. Harmonize RAI policies to the stricter superset; single model allowlist (merge sketched after this list).
Infra. Consolidate gateways, tracing, and FinOps—pick winners based on capability, not politics alone.
Data. Deduplicate corpora and vector indexes; respect stricter retention regime.
Roadmap. Kill overlapping SKUs with customer comms and grandfathering windows.
Culture. Joint on-call and incident reviews accelerate trust.
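A sketch of the stricter-superset merge with an illustrative policy schema: numeric limits take the minimum, boolean requirements take the OR, the allowlist intersects, and blocked use cases union.

```python
# Merge two governance configs to the stricter value per control.
def merge_policies(a: dict, b: dict) -> dict:
    return {
        "max_retention_days": min(a["max_retention_days"], b["max_retention_days"]),
        "pii_redaction_required": a["pii_redaction_required"] or b["pii_redaction_required"],
        "model_allowlist": sorted(set(a["model_allowlist"]) & set(b["model_allowlist"])),
        "blocked_use_cases": sorted(set(a["blocked_use_cases"]) | set(b["blocked_use_cases"])),
    }

org_a = {"max_retention_days": 90, "pii_redaction_required": False,
         "model_allowlist": ["m-small", "m-large"], "blocked_use_cases": ["medical"]}
org_b = {"max_retention_days": 30, "pii_redaction_required": True,
         "model_allowlist": ["m-large"], "blocked_use_cases": ["legal-advice"]}
print(merge_policies(org_a, org_b))
```

Intersecting allowlists is the conservative default; models outside the intersection can still be re-admitted through the merged approval process rather than by silent carryover.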
261. How would you present an LLM system architecture to a chief information security officer for approval?
Data flow first. One crisp diagram with trust zones; call out every external API and log sink.
Controls map. STRIDE-lite walkthrough with mitigations—no hand-waving on injection.
Evidence. Pen test summary, SBOM highlights, DR plan, access model.
Gaps. Voluntarily disclose known weaknesses and remediation timeline—builds credibility.
Q&A prep. Anticipate BYOK, residency, model poisoning, supply chain—have answers or tracked risks.
262. How would you use architecture decision records (ADRs) for LLM systems where models and prompts change weekly?
Scope. ADR for structural choices—gateway pattern, index topology—not every prompt tweak.
Linkage. Reference prompt template families and model families affected; link to eval ticket.
Deprecation. Supersede ADRs explicitly when reversing course—history shows why.
Automation. Optionally generate ADR drafts from merge descriptions touching model config paths (sketched after this list).
Culture. Lightweight template—two pages max; enforce ‘who decides’ field.
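A sketch of that optional automation, with hypothetical watched paths and template fields; the required "who decides" field is left as a blank the author must fill.

```python
# Draft an ADR stub whenever a merge touches structural model-config paths;
# routine prompt tweaks fall through and produce nothing.
from datetime import date
from fnmatch import fnmatch

WATCHED_PATHS = ["configs/models/*", "gateway/routing/*"]  # structural, not prompt tweaks

def adr_stub_if_needed(merge_title: str, changed_files: list[str]) -> str | None:
    if not any(fnmatch(f, pat) for f in changed_files for pat in WATCHED_PATHS):
        return None  # routine change; no ADR draft
    return (
        f"# ADR: {merge_title}\n"
        f"Date: {date.today().isoformat()}\n"
        "Status: proposed\n"
        "Who decides: <required>\n"
        "Context: <link eval ticket>\n"
        "Decision: <fill in>\n"
        "Supersedes: <ADR id or none>\n"
    )

print(adr_stub_if_needed("Route tier-1 tenants to m-large", ["configs/models/tiers.yaml"]))
```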
263. How would you hand off an ML research prototype to a production platform team without dropping quality?
Contract. Clear acceptance criteria: latency, cost, safety, observability hooks—research signs off on measurables (see the sketch after this list).
Docs. Training recipe, data manifest, eval notebooks reproducible in CI container.
Shadow. Parallel run until production metrics match lab within tolerance.
Ownership. Named production DRI + researcher advisor for first month.
Pitfalls. Ban ‘works on my GPU’ artifacts—enforce standard packaging.
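A sketch of the acceptance contract as executable checks; metric names and thresholds are placeholders the research and platform teams would negotiate at signoff.

```python
# The handoff contract expressed as checks over shadow-run measurements.
ACCEPTANCE = {
    "p95_latency_ms": 1200,        # must be at or below
    "cost_per_request_usd": 0.03,  # must be at or below
    "safety_block_rate": 0.999,    # must be at or above
    "eval_score_vs_lab": 0.98,     # shadow run must reach 98% of lab quality
}

def handoff_gaps(measured: dict) -> list[str]:
    """Compare shadow-run measurements against the signed acceptance contract."""
    gaps = []
    for metric, bound in ACCEPTANCE.items():
        value = measured.get(metric)
        if value is None:
            gaps.append(f"{metric}: not instrumented")
        elif metric in ("p95_latency_ms", "cost_per_request_usd") and value > bound:
            gaps.append(f"{metric}: {value} exceeds {bound}")
        elif metric in ("safety_block_rate", "eval_score_vs_lab") and value < bound:
            gaps.append(f"{metric}: {value} below {bound}")
    return gaps

print(handoff_gaps({"p95_latency_ms": 900, "cost_per_request_usd": 0.05}))
```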
264. What signals do senior system-design interviewers listen for when candidates discuss LLM products?
Clarity. Can you separate facts, retrieval, policy, and generation responsibilities?
Humility. Do you admit partial mitigations and unknowns versus magic answers?
Metrics. Do you define success and failure operationally?
Tradeoffs. Do you name costs—latency, money, privacy—when proposing fixes?
Human factors. Support, legal, and change management appear—not only diagrams.
265. How would you summarize the thesis of operating LLM systems as socio-technical infrastructure—not ‘just another service’?
Probabilistic core. Outcomes vary; design for variance with monitoring and rollback—not denial.
Humans in loop. Experts, moderators, and customers co-produce quality—pipelines must respect their time.
Regulation velocity. Law moves; build configurable policy and regional cells early.
Iteration. LLM systems age like software and data products—continuous eval is maintenance, not a science fair.
Leadership. Staff engineers align hype, safety, and economics so teams can ship for years, not demos.
Recap — this section
Q Takeaway
251 Job-to-metric anchor; layered system story; early risk surfacing; concrete scale mechanisms; time-boxed plan.
252 Version freeze + bisect; dimensional slicing; trace replay; layer-specific regression tests.
253 Written NFRs; explicit tradeoffs; phased depth; professional constraint pushback; crisp recap.
254 Correctness/latency/legal tests; the LLM orchestrates tools rather than replacing truth; cost-aware modesty; safety bounds.
255 Eval attach + safety tier; prod observability first; data plane checks; independent rollback dimensions.
256 Pre-warm + overflow; degradation ladder; aggressive caching of hot prompts; spend breakers; practiced load tests.
257 Tenant-ID sweep + tests; namespaced data; secretless config; unified observability; strangler migration.
258 Duplication inventory; executive mandate + deadlines; migration shims; platform staffing model; KPI proof.
259 Data/IP/contracts; security debt; key-person risk; legal-financial liabilities; integration milestone plan.
260 Stricter-common governance; unified gateway+telemetry; data dedupe; SKU rationalization; joint ops culture.
261 Trust-zone diagram; threat-control mapping; evidence attachments; honest gap list; prepared hard questions.
262 ADR for enduring architecture; link eval tickets; supersession chain; light automation hooks; short template.
263 Measurable acceptance; reproducible artifacts; shadow parity; paired DRIs; standardized packaging.
264 Separation of concerns; honest limits; operational metrics; explicit tradeoffs; socio-technical awareness.
265 Embrace variance; human-centered ops; adaptive compliance; eval as maintenance; sustainable leadership framing.