Fifteen staff-depth scenarios on customizing weights responsibly: when RAG vs prompts vs fine-tune; multi-LoRA serving and VRAM realities; full FT vs PEFT tradeoffs; governed training corpora; promotion evals; forgetting controls; preference-tuning systems; immutable registries; signal-driven retraining; distillation economics; fault-tolerant training; tool-use SFT; fast rollback; per-tenant adapter ops; and synthetic augmentation without model collapse.
Interview stance. Fine-tuning questions are ML systems + governance, not “we ran LoRA in a notebook.” Panels want data provenance, eval gates, rollback, and serving economics tied together—especially when adapters multiply.
Defend why weights beat RAG or prompts for the scenario—and admit when they do not.
Adapters need the same versioning discipline as databases: base revision + adapter id or you WILL mismatch in prod.
Preference tuning without rubrics and variance control produces attractive demos and ugly policies.
Treat training data as a regulated asset: licensing, erasure, and opt-out are part of the design.
161. How would you decide between prompt engineering, RAG, and fine-tuning for a domain-specific LLM product?
Decision lens. Facts that change weekly belong in retrieval; stable style/format/tone/task patterns often amortize into weights; one-off behavior is cheapest in prompts until repetition proves otherwise.
RAG first for volatile truth. If correctness is “what’s in the contract today,” external knowledge beats stale weights—pair with citation discipline.
Fine-tune when. You have thousands of verified (input → output) pairs that encode procedure or idiom poorly captured by ICL—billing codes, internal DSLs, brand voice with guardrails—or you need cheaper inference via distillation.
Risk & velocity. Fine-tuning creates a new artifact to secure, eval, roll back, and explain to legal; factor pipeline time and GPU cost.
Hybrid default. Many prod systems use small adapter + strong RAG + tight prompts; interviewers reward naming failure modes of each fork.
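To make the fork concrete, here is a toy decision heuristic; the thresholds (e.g., ~1k verified pairs) and the ordering are assumptions to tune per product, not rules.

```python
# Illustrative sketch of the decision lens above; thresholds are assumptions.

def recommend_customization(
    facts_change_frequently: bool,
    verified_pairs: int,
    needs_stable_style_or_format: bool,
    repeated_task: bool,
) -> list[str]:
    """Return a stack of techniques, cheapest viable option first."""
    plan = ["prompt engineering"]                # one-off behavior starts here
    if facts_change_frequently:
        plan.append("RAG with citations")        # volatile truth lives in retrieval
    if repeated_task and needs_stable_style_or_format and verified_pairs >= 1_000:
        plan.append("PEFT fine-tune (adapter)")  # amortize stable behavior into weights
    return plan

print(recommend_customization(True, 5_000, True, True))
# ['prompt engineering', 'RAG with citations', 'PEFT fine-tune (adapter)']
```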
162. How would you architect inference for many LoRA / QLoRA adapters on top of one base model in production?
Serving patterns. Option A: a single serving stack loads hot adapters dynamically per request (dynamic adapter loading on a PagedAttention-based stack such as vLLM); Option B: route tenant → replica pool with pinned adapters; Option C: merge adapters offline for stable SKUs when the count is small.
VRAM math. Rank, target modules, batching, and concurrent adapters dominate feasibility—prototype before promising 10k hot adapters on one GPU (see the back-of-envelope sketch after the diagram below).
Latency. Adapter swaps add milliseconds to seconds; cap concurrent cold adapters with queueing and pre-warm tenants where revenue justifies it.
Versioning. Pair base_model_revision + adapter_id in every trace; forbid mismatched combos in CI.
Safety. Adapters are code-like artifacts—scan for data leakage in training sets, sign weights, and gate promotion behind eval gates.
Adapter serving

```mermaid
flowchart LR
  R[Request] --> RT[Route tenant]
  RT --> A[Load adapter]
  A --> B[Base weights]
  B --> INF[Inference]
```
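For the VRAM point above, a back-of-envelope sketch helps: each adapted linear layer adds roughly r·(d_in + d_out) LoRA parameters. The model shape below (32 blocks, hidden size 4096, q/v projections, r=16) is an assumed example, and the estimate ignores KV cache, activations, and fragmentation.

```python
# Back-of-envelope LoRA VRAM estimate: adapter weights only (no KV cache/activations).

def lora_adapter_bytes(num_target_layers: int, d_in: int, d_out: int,
                       rank: int, bytes_per_param: int = 2) -> int:
    # Each adapted linear layer adds two low-rank matrices: A (r x d_in) and B (d_out x r).
    return num_target_layers * rank * (d_in + d_out) * bytes_per_param

# Assumed example: 32 blocks, q_proj + v_proj adapted, hidden size 4096, r=16, bf16.
per_adapter = lora_adapter_bytes(num_target_layers=32 * 2, d_in=4096, d_out=4096, rank=16)
print(f"{per_adapter / 2**20:.1f} MiB per adapter")               # 16.0 MiB
print(f"{1000 * per_adapter / 2**30:.1f} GiB for 1000 adapters")  # ~15.6 GiB
```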
163. How would you compare full fine-tuning versus parameter-efficient fine-tuning for enterprise LLM customization?
Capacity vs cost. Full FT can reshape deeper behavior, but it demands multi-GPU clusters and careful regularization, and its ops are brittle; PEFT trades some ceiling for faster iteration and smaller artifacts.
Data scale. Tiny datasets often overfit with full FT; PEFT + strong regularizers + early stopping may generalize better per dollar.
Merge & update. Shipping weekly PEFT patches is realistic; full-model rebases are quarterly events requiring hardening.
IP & portability. Customers sometimes want weights they can air-gap—PEFT bundles are easier to move than 70B full finetunes unless you ship quantized merges.
Interview candor. Mention catastrophic forgetting and multi-task tradeoffs explicitly for full FT.
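On the PEFT side of the tradeoff, a minimal setup sketch, assuming the Hugging Face peft and transformers libraries; the model id and hyperparameters are placeholders, not recommendations.

```python
# Minimal LoRA setup sketch with Hugging Face peft; values are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("your-org/base-7b")  # placeholder model id
config = LoraConfig(
    r=16,                                  # rank caps capacity and artifact size
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # which linear layers get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()         # typically well under 1% of base parameters
```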
164. How would you govern training data for fine-tuning (licensing, PII, toxicity, consent)?
Provenance ledger. Every row traces to source system, license class, retention policy, and purpose-of-use—no anonymous CSVs in prod training.
PII & secrets. Scan, redact, or synthesize; block exports from regulated tables without DPIA sign-off.
Quality tiers. Gold human-labeled > curated auto > scraped web—weight or filter accordingly; document known biases.
Vendor policy. Using customer data to improve shared models needs contract language separate from inference-only DPIA.
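A sketch of what a provenance ledger row could carry, following the fields listed above; the names and enum values are assumptions for illustration, not a standard schema.

```python
# Hypothetical provenance ledger row; field names are illustrative.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class TrainingRow:
    row_id: str
    source_system: str      # e.g. "crm-prod", never an anonymous CSV
    license_class: str      # "owned" | "licensed" | "public-permissive" | "unknown"
    retention_until: date   # drives erasure / opt-out jobs
    purpose_of_use: str     # must match the DPIA the row was approved under
    pii_scanned: bool
    quality_tier: str       # "gold-human" | "curated-auto" | "scraped"

def admissible(row: TrainingRow) -> bool:
    """Only rows with a known license, a PII scan, and live retention enter manifests."""
    return row.pii_scanned and row.license_class != "unknown" and row.retention_until >= date.today()
```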
165. How would you evaluate a fine-tuned or aligned model before promoting it to production?
Layered eval. Intrinsic task suite (format adherence, tool JSON), domain goldens, safety red-team set, regressions on general capability (don’t ship a billing bot that forgets math).
Shadow & A/B. Sticky cohorts with automatic rollback on SLO breaches—same rigor as prompt changes but heavier artifacts.
Human spot checks. Experts review the edge-of-distribution cases models overfit to (rare policy corners).
Checkpoints. Compare several training steps; sometimes an earlier checkpoint generalizes better (late-stage fitting can be brittle).
Signoff. Model card diff published to stakeholders: data, compute, known limitations.
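One way to encode the layered eval as a promotion gate is sketched below; the suite names and thresholds are placeholders to wire into your own harness.

```python
# Promotion-gate sketch; gate names and thresholds are assumed placeholders.

GATES = {
    "domain_goldens_accuracy": 0.92,
    "tool_json_validity_rate": 0.99,
    "safety_redteam_pass_rate": 0.995,
    "general_capability_delta": -0.02,   # max allowed drop vs. base on a broad suite
}

def promote(candidate_scores: dict[str, float]) -> bool:
    failures = [name for name, threshold in GATES.items()
                if candidate_scores.get(name, float("-inf")) < threshold]
    if failures:
        print(f"Promotion blocked, failed gates: {failures}")
        return False
    return True
```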
166. How would you mitigate catastrophic forgetting when fine-tuning on a narrow domain?
Mixing ratios. Blend general instruction data with domain batches so the model retains refusal and multilingual basics—tune the ratio empirically (a sketch follows this list).
Regularization. KL to reference model, LoRA rank caps, early stopping—each dampens drift differently.
Continual learning posture. Versioned adapters per domain instead of one monolithic model when domains compete.
Monitoring. Canary prompts for baseline capabilities after each training job; alert on refusal rate collapses or toxicity spikes.
Fallback. Keep the base-model route alive behind a flag in case the adapter degenerates.
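A sketch of the mixing-ratio idea from the first bullet: every batch blends domain rows with general instruction data so baseline behavior stays in-distribution. The 30% fraction is an assumption to tune empirically.

```python
# Data-mixing sketch; assumes both pools are larger than a batch, 30% ratio is illustrative.
import random

def mixed_batches(domain_rows, general_rows, batch_size=32, general_fraction=0.3):
    """Yield batches that interleave domain examples with general instruction data."""
    n_general = int(batch_size * general_fraction)
    n_domain = batch_size - n_general
    while True:
        batch = random.sample(domain_rows, n_domain) + random.sample(general_rows, n_general)
        random.shuffle(batch)
        yield batch
```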
167. How would you design a human preference tuning (RLHF/DPO/IPO) pipeline for a product LLM—at a systems level?
Data loop. Collect pairwise preferences with adjudication, rater guidelines, and variance tracking—garbage preferences produce steerable but harmful policies.
Compute graph. DPO avoids separate reward model training but still needs reference model snapshots and stable distributed trainers; RLHF needs reward + KL control plumbing.
Safety rails. Preference data must exclude jailbreak wins; filter attempts to game the reward.
Stability. Learning rate, beta, and dataset size interact; snapshot checkpoints and auto-abort on eval cliffs.
Product truth. Align to company values reflected in rubrics—not only majority rater taste.
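To ground the beta and reference-model plumbing, here is the standard DPO objective in a minimal PyTorch sketch; the inputs are assumed to be per-example log-probabilities summed over response tokens.

```python
# Standard DPO loss sketch (PyTorch); inputs are per-example sequence log-probs.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), frozen snapshot
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,                    # strength of the implicit KL constraint
) -> torch.Tensor:
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # The preferred completion should out-score the rejected one under the implicit reward.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```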
168. How would you implement a model registry and release process for internally fine-tuned weights?
Artifacts. Immutable blobs with SHA256, training config hash, dataset id, base model id, eval report attachments—reproducibility is audit currency.
Stages. Development → staging → production with promotion gates; only registry pointers allowed in prod configs.
Access control. Encrypt at rest; scoped IAM; no anonymous HuggingFace pushes from laptops.
Deprecation. Sunset old checkpoints with automated client notifications and compatibility windows.
Interop. Standard export formats (safetensors, GGUF if needed) documented per serving stack.
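A sketch of registering an immutable checkpoint entry; the storage layout and field names are assumptions, and the point is that everything is content-addressed and fixed at registration time.

```python
# Registry-entry sketch; layout and field names are illustrative assumptions.
import hashlib
import json
from pathlib import Path

def register_checkpoint(weights_path: str, meta: dict) -> dict:
    # Real code would stream the hash; read_bytes() keeps the sketch short.
    digest = hashlib.sha256(Path(weights_path).read_bytes()).hexdigest()
    entry = {
        "sha256": digest,
        "base_model_id": meta["base_model_id"],
        "dataset_id": meta["dataset_id"],
        "training_config_hash": meta["training_config_hash"],
        "eval_report_uri": meta["eval_report_uri"],
        "stage": "development",            # promoted later through gated transitions
    }
    Path("registry").mkdir(exist_ok=True)
    # Write-once: a real registry rejects overwrites of an existing digest.
    Path(f"registry/{digest}.json").write_text(json.dumps(entry, indent=2))
    return entry
```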
169. How would you decide when to retrain or refresh adapters based on production signals?
Triggers. Sustained quality drift, new taxonomy launches, major doc corpus migrations, or regulatory template changes—not arbitrary quarterly retrain theater.
Data freshness meters. Track label age, class imbalance shifts, and user correction rates feeding back to training backlog.
Effort sizing. Estimate GPU-hours + labeling cost vs projected error reduction; sometimes prompt+RAG fix suffices.
Automated prep. ETL from feedback warehouse to training manifests with provenance joins.
Risk. Freeze training during incident weeks; coordinate with legal on new data ingestion.
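The trigger logic can be made explicit so retraining stays signal-driven rather than calendar-driven; the signal names and thresholds below are assumptions to calibrate against your feedback warehouse.

```python
# Retrain-trigger sketch; signal names and thresholds are assumed placeholders.

def should_retrain(signals: dict) -> tuple[bool, list[str]]:
    reasons = []
    if signals.get("quality_drift_weeks", 0) >= 3:
        reasons.append("sustained quality drift")
    if signals.get("user_correction_rate", 0.0) > 0.08:
        reasons.append("elevated user-correction rate")
    if signals.get("taxonomy_changed", False) or signals.get("corpus_migrated", False):
        reasons.append("taxonomy launch or corpus migration")
    if signals.get("median_label_age_days", 0) > 180:
        reasons.append("stale training labels")
    return bool(reasons), reasons
```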
170. How would you design a knowledge distillation pipeline from a large teacher model to a smaller deployable student?
Objectives. Soft labels (logits/KL), intermediate matching, or task-only CE—choose based on latency budget and risk of copying teacher mistakes.
Data generation. Synthetic prompts diversified by template; filter nonsense with validators and dedupe.
Evaluation. Student must be judged on its deployment task, not mimicry scores alone.
Serving payoff. Quantify p99 latency and $/1k tokens saved; distillation that saves pennies may not justify maintenance.
IP. Teacher terms may restrict mimicking—legal review before large-scale distillation from vendor APIs.
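The soft-label objective mentioned above is commonly implemented as a temperature-scaled KL term blended with task cross-entropy; alpha and T in the sketch are tuning knobs, not fixed values.

```python
# Soft-label distillation loss sketch (PyTorch): KL to teacher + hard-label CE.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale so gradients match the CE term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```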
171. How would you handle training instability, silent divergence, or GPU failures during long fine-tuning jobs?
Checkpointing. Frequent async checkpoints to durable object storage; resume from last good step automatically.
Metrics. Log train/val loss, grad norms, token accuracy; alert on NaNs or sudden loss spikes.
Hardware. Preemption-aware trainers (spot instances) with elastic restart; deterministic seeds for debugging.
Validation gates. Short nightly eval during multi-day runs to catch junk early.
Postmortem. Tag failed runs with their hyperparameters in a searchable registry—not a ‘lost weekend.’
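A minimal divergence guard, assuming per-step loss logging: abort on NaN/Inf or on a spike well above the recent average, then resume from the last good checkpoint. The window size and spike factor are illustrative.

```python
# Divergence-guard sketch; window size and spike factor are assumptions.
import math
from collections import deque

class DivergenceGuard:
    def __init__(self, window: int = 100, spike_factor: float = 3.0):
        self.history = deque(maxlen=window)
        self.spike_factor = spike_factor

    def check(self, loss: float) -> None:
        if math.isnan(loss) or math.isinf(loss):
            raise RuntimeError("NaN/Inf loss: stop and resume from the last good checkpoint")
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            if loss > self.spike_factor * baseline:
                raise RuntimeError(f"Loss spike: {loss:.3f} vs recent mean {baseline:.3f}")
        self.history.append(loss)
```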
172. How would you fine-tune or post-train a model to improve tool calling and structured JSON reliability?
Dataset design. Curate trajectories with diverse tool schemas, partial failures, and user corrections; include negative examples where model should refuse tools.
Formats. Match the exact chat template + tool grammar your gateway emits—one token of drift breaks prod.