Fifteen staff-depth scenarios on customizing weights responsibly: when RAG vs prompts vs fine-tune; multi-LoRA serving and VRAM realities; full FT vs PEFT tradeoffs; governed training corpora; promotion evals; forgetting controls; preference-tuning systems; immutable registries; signal-driven retraining; distillation economics; fault-tolerant training; tool-use SFT; fast rollback; per-tenant adapter ops; and synthetic augmentation without model collapse.
Interview stance. Fine-tuning questions are ML systems + governance, not “we ran LoRA in a notebook.” Panels want data provenance, eval gates, rollback, and serving economics tied together—especially when adapters multiply.
Defend why weights beat RAG or prompts for the scenario—and admit when they do not.
Adapters need the same versioning discipline as databases: base revision + adapter id or you WILL mismatch in prod.
Preference tuning without rubrics and variance control produces attractive demos and ugly policies.
Treat training data as a regulated asset: licensing, erasure, and opt-out are part of the design.
161. How would you decide between prompt engineering, RAG, and fine-tuning for a domain-specific LLM product?
Decision lens. Facts that change weekly belong in retrieval; stable style/format/tone/task patterns often amortize into weights; one-off behavior is cheapest in prompts until repetition proves otherwise.
RAG first for volatile truth. If correctness is “what’s in the contract today,” external knowledge beats stale weights—pair with citation discipline.
Fine-tune when. You have thousands of verified (input → output) pairs that encode procedure or idiom poorly captured by ICL—billing codes, internal DSLs, brand voice with guardrails—or you need cheaper inference via distillation.
Risk & velocity. Fine-tuning creates a new artifact to secure, eval, roll back, and explain to legal; factor pipeline time and GPU cost.
Hybrid default. Many prod systems use small adapter + strong RAG + tight prompts; interviewers reward naming failure modes of each fork.
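To make the fork concrete, here is a toy decision heuristic; the thresholds (e.g., ~1k verified pairs) and the ordering are assumptions to tune per product, not rules.

```python
# Illustrative sketch of the decision lens above; thresholds are assumptions.

def recommend_customization(
    facts_change_frequently: bool,
    verified_pairs: int,
    needs_stable_style_or_format: bool,
    repeated_task: bool,
) -> list[str]:
    """Return a stack of techniques, cheapest viable option first."""
    plan = ["prompt engineering"]                # one-off behavior starts here
    if facts_change_frequently:
        plan.append("RAG with citations")        # volatile truth lives in retrieval
    if repeated_task and needs_stable_style_or_format and verified_pairs >= 1_000:
        plan.append("PEFT fine-tune (adapter)")  # amortize stable behavior into weights
    return plan

print(recommend_customization(True, 5_000, True, True))
# ['prompt engineering', 'RAG with citations', 'PEFT fine-tune (adapter)']
```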
162. How would you architect inference for many LoRA / QLoRA adapters on top of one base model in production?
Serving patterns. Option A: a single serving stack loads hot adapters dynamically per request (dynamic adapter loading on a PagedAttention-based stack such as vLLM); Option B: route tenant → replica pool with pinned adapters; Option C: merge adapters offline for stable SKUs when the count is small.
VRAM math. Rank, target modules, batching, and concurrent adapters dominate feasibility—prototype before promising 10k hot adapters on one GPU (see the back-of-envelope sketch after the diagram below).
Latency. Adapter swaps add milliseconds to seconds; cap concurrent cold adapters with queueing and pre-warm tenants where revenue justifies it.
Versioning. Pair base_model_revision + adapter_id in every trace; forbid mismatched combos in CI.
Safety. Adapters are code-like artifacts—scan for data leakage in training sets, sign weights, and gate promotion behind eval gates.
Adapter serving

```mermaid
flowchart LR
  R[Request] --> RT[Route tenant]
  RT --> A[Load adapter]
  A --> B[Base weights]
  B --> INF[Inference]
```
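For the VRAM point above, a back-of-envelope sketch helps: each adapted linear layer adds roughly r·(d_in + d_out) LoRA parameters. The model shape below (32 blocks, hidden size 4096, q/v projections, r=16) is an assumed example, and the estimate ignores KV cache, activations, and fragmentation.

```python
# Back-of-envelope LoRA VRAM estimate: adapter weights only (no KV cache/activations).

def lora_adapter_bytes(num_target_layers: int, d_in: int, d_out: int,
                       rank: int, bytes_per_param: int = 2) -> int:
    # Each adapted linear layer adds two low-rank matrices: A (r x d_in) and B (d_out x r).
    return num_target_layers * rank * (d_in + d_out) * bytes_per_param

# Assumed example: 32 blocks, q_proj + v_proj adapted, hidden size 4096, r=16, bf16.
per_adapter = lora_adapter_bytes(num_target_layers=32 * 2, d_in=4096, d_out=4096, rank=16)
print(f"{per_adapter / 2**20:.1f} MiB per adapter")               # 16.0 MiB
print(f"{1000 * per_adapter / 2**30:.1f} GiB for 1000 adapters")  # ~15.6 GiB
```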
163. How would you compare full fine-tuning versus parameter-efficient fine-tuning for enterprise LLM customization?
Capacity vs cost. Full FT can reshape deeper behavior, but it demands multi-GPU clusters and careful regularization, and its ops are brittle; PEFT trades some ceiling for faster iteration and smaller artifacts.
Data scale. Tiny datasets often overfit with full FT; PEFT + strong regularizers + early stopping may generalize better per dollar.
Merge & update. Shipping weekly PEFT patches is realistic; full-model rebases are quarterly events requiring hardening.
IP & portability. Customers sometimes want weights they can air-gap—PEFT bundles are easier to move than 70B full finetunes unless you ship quantized merges.
Interview candor. Mention catastrophic forgetting and multi-task tradeoffs explicitly for full FT.
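On the PEFT side of the tradeoff, a minimal setup sketch, assuming the Hugging Face peft and transformers libraries; the model id and hyperparameters are placeholders, not recommendations.

```python
# Minimal LoRA setup sketch with Hugging Face peft; values are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("your-org/base-7b")  # placeholder model id
config = LoraConfig(
    r=16,                                  # rank caps capacity and artifact size
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # which linear layers get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()         # typically well under 1% of base parameters
```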
164. How would you govern training data for fine-tuning (licensing, PII, toxicity, consent)?
Provenance ledger. Every row traces to source system, license class, retention policy, and purpose-of-use—no anonymous CSVs in prod training.
PII & secrets. Scan, redact, or synthesize; block exports from regulated tables without DPIA sign-off.
Quality tiers. Gold human-labeled > curated auto > scraped web—weight or filter accordingly; document known biases.
Vendor policy. Using customer data to improve shared models needs contract language separate from inference-only DPIA.
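A sketch of what a provenance ledger row could carry, following the fields listed above; the names and enum values are assumptions for illustration, not a standard schema.

```python
# Hypothetical provenance ledger row; field names are illustrative.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class TrainingRow:
    row_id: str
    source_system: str      # e.g. "crm-prod", never an anonymous CSV
    license_class: str      # "owned" | "licensed" | "public-permissive" | "unknown"
    retention_until: date   # drives erasure / opt-out jobs
    purpose_of_use: str     # must match the DPIA the row was approved under
    pii_scanned: bool
    quality_tier: str       # "gold-human" | "curated-auto" | "scraped"

def admissible(row: TrainingRow) -> bool:
    """Only rows with a known license, a PII scan, and live retention enter manifests."""
    return row.pii_scanned and row.license_class != "unknown" and row.retention_until >= date.today()
```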
165. How would you evaluate a fine-tuned or aligned model before promoting it to production?
Layered eval. Intrinsic task suite (format adherence, tool JSON), domain goldens, safety red-team set, regressions on general capability (don’t ship a billing bot that forgets math).
Shadow & A/B. Sticky cohorts with automatic rollback on SLO breaches—same rigor as prompt changes but heavier artifacts.
Human spot checks. Experts review the edge-of-distribution cases models overfit to (rare policy corners).
Checkpoints. Compare several training steps; sometimes an earlier checkpoint generalizes better (late-stage fitting can be brittle).
Signoff. Model card diff published to stakeholders: data, compute, known limitations.
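One way to encode the layered eval as a promotion gate is sketched below; the suite names and thresholds are placeholders to wire into your own harness.

```python
# Promotion-gate sketch; gate names and thresholds are assumed placeholders.

GATES = {
    "domain_goldens_accuracy": 0.92,
    "tool_json_validity_rate": 0.99,
    "safety_redteam_pass_rate": 0.995,
    "general_capability_delta": -0.02,   # max allowed drop vs. base on a broad suite
}

def promote(candidate_scores: dict[str, float]) -> bool:
    failures = [name for name, threshold in GATES.items()
                if candidate_scores.get(name, float("-inf")) < threshold]
    if failures:
        print(f"Promotion blocked, failed gates: {failures}")
        return False
    return True
```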
166. How would you mitigate catastrophic forgetting when fine-tuning on a narrow domain?
Mixing ratios. Blend general instruction data with domain batches so the model retains refusal and multilingual basics—tune the ratio empirically (a sketch follows this list).
Regularization. KL to reference model, LoRA rank caps, early stopping—each dampens drift differently.
Continual learning posture. Versioned adapters per domain instead of one monolithic model when domains compete.
Monitoring. Canary prompts for baseline capabilities after each training job; alert on refusal rate collapses or toxicity spikes.
Fallback. Keep the base-model route alive behind a flag in case the adapter degenerates.
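A sketch of the mixing-ratio idea from the first bullet: every batch blends domain rows with general instruction data so baseline behavior stays in-distribution. The 30% fraction is an assumption to tune empirically.

```python
# Data-mixing sketch; assumes both pools are larger than a batch, 30% ratio is illustrative.
import random

def mixed_batches(domain_rows, general_rows, batch_size=32, general_fraction=0.3):
    """Yield batches that interleave domain examples with general instruction data."""
    n_general = int(batch_size * general_fraction)
    n_domain = batch_size - n_general
    while True:
        batch = random.sample(domain_rows, n_domain) + random.sample(general_rows, n_general)
        random.shuffle(batch)
        yield batch
```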
167. How would you design a human preference tuning (RLHF/DPO/IPO) pipeline for a product LLM—at a systems level?
Data loop. Collect pairwise preferences with adjudication, rater guidelines, and variance tracking—garbage preferences produce steerable but harmful policies.
Compute graph. DPO avoids separate reward model training but still needs reference model snapshots and stable distributed trainers; RLHF needs reward + KL control plumbing.
Safety rails. Preference data must exclude jailbreak wins; filter attempts to game the reward.
Stability. Learning rate, beta, and dataset size interact; snapshot checkpoints and auto-abort on eval cliffs.
Product truth. Align to company values reflected in rubrics—not only majority rater taste.
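To ground the beta and reference-model plumbing, here is the standard DPO objective in a minimal PyTorch sketch; the inputs are assumed to be per-example log-probabilities summed over response tokens.

```python
# Standard DPO loss sketch (PyTorch); inputs are per-example sequence log-probs.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), frozen snapshot
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,                    # strength of the implicit KL constraint
) -> torch.Tensor:
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # The preferred completion should out-score the rejected one under the implicit reward.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```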
168. How would you implement a model registry and release process for internally fine-tuned weights?
Artifacts. Immutable blobs with SHA256, training config hash, dataset id, base model id, eval report attachments—reproducibility is audit currency.
Stages. Development → staging → production with promotion gates; only registry pointers allowed in prod configs.
Access control. Encrypt at rest; scoped IAM; no anonymous HuggingFace pushes from laptops.
Deprecation. Sunset old checkpoints with automated client notifications and compatibility windows.
Interop. Standard export formats (safetensors, GGUF if needed) documented per serving stack.
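A sketch of registering an immutable checkpoint entry; the storage layout and field names are assumptions, and the point is that everything is content-addressed and fixed at registration time.

```python
# Registry-entry sketch; layout and field names are illustrative assumptions.
import hashlib
import json
from pathlib import Path

def register_checkpoint(weights_path: str, meta: dict) -> dict:
    # Real code would stream the hash; read_bytes() keeps the sketch short.
    digest = hashlib.sha256(Path(weights_path).read_bytes()).hexdigest()
    entry = {
        "sha256": digest,
        "base_model_id": meta["base_model_id"],
        "dataset_id": meta["dataset_id"],
        "training_config_hash": meta["training_config_hash"],
        "eval_report_uri": meta["eval_report_uri"],
        "stage": "development",            # promoted later through gated transitions
    }
    Path("registry").mkdir(exist_ok=True)
    # Write-once: a real registry rejects overwrites of an existing digest.
    Path(f"registry/{digest}.json").write_text(json.dumps(entry, indent=2))
    return entry
```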
169. How would you decide when to retrain or refresh adapters based on production signals?
Triggers. Sustained quality drift, new taxonomy launches, major doc corpus migrations, or regulatory template changes—not arbitrary quarterly retrain theater.
Data freshness meters. Track label age, class imbalance shifts, and user correction rates feeding back to training backlog.
Effort sizing. Estimate GPU-hours + labeling cost vs projected error reduction; sometimes prompt+RAG fix suffices.
Automated prep. ETL from feedback warehouse to training manifests with provenance joins.
Risk. Freeze training during incident weeks; coordinate with legal on new data ingestion.
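The trigger logic can be made explicit so retraining stays signal-driven rather than calendar-driven; the signal names and thresholds below are assumptions to calibrate against your feedback warehouse.

```python
# Retrain-trigger sketch; signal names and thresholds are assumed placeholders.

def should_retrain(signals: dict) -> tuple[bool, list[str]]:
    reasons = []
    if signals.get("quality_drift_weeks", 0) >= 3:
        reasons.append("sustained quality drift")
    if signals.get("user_correction_rate", 0.0) > 0.08:
        reasons.append("elevated user-correction rate")
    if signals.get("taxonomy_changed", False) or signals.get("corpus_migrated", False):
        reasons.append("taxonomy launch or corpus migration")
    if signals.get("median_label_age_days", 0) > 180:
        reasons.append("stale training labels")
    return bool(reasons), reasons
```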
170. How would you design a knowledge distillation pipeline from a large teacher model to a smaller deployable student?
Objectives. Soft labels (logits/KL), intermediate matching, or task-only CE—choose based on latency budget and risk of copying teacher mistakes.
Data generation. Synthetic prompts diversified by template; filter nonsense with validators and dedupe.
Evaluation. Student must be judged on its deployment task, not mimicry scores alone.
Serving payoff. Quantify p99 latency and $/1k tokens saved; distillation that saves pennies may not justify maintenance.
IP. Teacher terms may restrict mimicking—legal review before large-scale distillation from vendor APIs.
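The soft-label objective mentioned above is commonly implemented as a temperature-scaled KL term blended with task cross-entropy; alpha and T in the sketch are tuning knobs, not fixed values.

```python
# Soft-label distillation loss sketch (PyTorch): KL to teacher + hard-label CE.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale so gradients match the CE term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```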
171. How would you handle training instability, silent divergence, or GPU failures during long fine-tuning jobs?
Checkpointing. Frequent async checkpoints to durable object storage; resume from last good step automatically.
Metrics. Log train/val loss, grad norms, token accuracy; alert on NaNs or sudden loss spikes.
Hardware. Preemption-aware trainers (spot instances) with elastic restart; deterministic seeds for debugging.
Validation gates. Short nightly eval during multi-day runs to catch junk early.
Postmortem. Tag failed runs with their hyperparameters in a searchable registry—not a ‘lost weekend.’
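A minimal divergence guard, assuming per-step loss logging: abort on NaN/Inf or on a spike well above the recent average, then resume from the last good checkpoint. The window size and spike factor are illustrative.

```python
# Divergence-guard sketch; window size and spike factor are assumptions.
import math
from collections import deque

class DivergenceGuard:
    def __init__(self, window: int = 100, spike_factor: float = 3.0):
        self.history = deque(maxlen=window)
        self.spike_factor = spike_factor

    def check(self, loss: float) -> None:
        if math.isnan(loss) or math.isinf(loss):
            raise RuntimeError("NaN/Inf loss: stop and resume from the last good checkpoint")
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            if loss > self.spike_factor * baseline:
                raise RuntimeError(f"Loss spike: {loss:.3f} vs recent mean {baseline:.3f}")
        self.history.append(loss)
```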
172. How would you fine-tune or post-train a model to improve tool calling and structured JSON reliability?
Dataset design. Curate trajectories with diverse tool schemas, partial failures, and user corrections; include negative examples where model should refuse tools.
Formats. Match the exact chat template + tool grammar your gateway emits—one token of drift breaks prod.