
Multi-tenancy & platform

Fifteen staff-depth scenarios on running LLM infra as a product: fair-share GPU scheduling and bulkheads; multi-axis quotas with centralized enforcement; hard and soft tenant data boundaries; hierarchical feature flags; warm-pool SKUs vs cold paths; CMK and search reality; isolated onboarding backfills; tenant-scoped incident containment; fair-share schedulers with aging; vector namespace strategies; usage ledgers for chargeback; safe per-tenant customization; tier-upgrade migrations; regional data-plane cells; and SLA design that matches what you actually measure.

Interview stance. Multi-tenant LLM platforms are scheduling, isolation, and economics problems wearing an API. Interviewers want fairness + blast-radius containment, not ‘we trust our filters.’ Connect product SKUs to real infra: warm pools, regional cells, and quota math.

146. How would you mitigate noisy-neighbor problems in a shared LLM inference platform serving many tenants?

Fair scheduling. Weighted fair queuing or token buckets per tenant class so one viral customer cannot exhaust KV cache or dispatch threads.

Bulkheads. Separate pools for interactive chat vs batch embedding; optional dedicated GPU shards for premium SLAs, priced to cover the idle capacity.

Admission control. Reject or delay requests once a tenant exhausts its burst credits, returning actionable errors rather than 60-second hangs.

Observability. Per-tenant queue wait histograms expose abuse before finance forwards the angry email.

Product policy. Publish what ‘shared’ means; upsell reserved capacity with honest math.
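
A minimal sketch of the per-tenant token bucket described above, assuming each tier maps to an illustrative refill rate and burst size:

```python
import time
from dataclasses import dataclass, field

# Illustrative per-tier limits: (refill rate in requests/sec, burst capacity).
TIER_LIMITS = {"free": (1.0, 5), "pro": (10.0, 50), "enterprise": (50.0, 200)}

@dataclass
class TokenBucket:
    rate: float                                   # tokens added per second
    capacity: float                               # maximum burst size
    tokens: float = 0.0
    updated: float = field(default_factory=time.monotonic)

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

_buckets: dict[str, TokenBucket] = {}

def admit(tenant_id: str, tier: str) -> bool:
    """Gateway admission check: reject fast with an actionable error, not a long hang."""
    if tenant_id not in _buckets:
        rate, burst = TIER_LIMITS[tier]
        _buckets[tenant_id] = TokenBucket(rate=rate, capacity=burst, tokens=burst)
    return _buckets[tenant_id].allow()

print(admit("tenant-123", "pro"))   # True until the tenant burns through its burst credits
```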

147. How would you design per-tenant quotas (tokens, requests per minute, concurrency) for a multi-tenant LLM API?

Dimensions. RPM, concurrent streams, daily token envelopes, and tool-call budgets—different abusers hit different walls.

Soft vs hard. Warn at 80%, throttle at 100%, offer burst credits with decay for human UX spikes.

Enforcement. Central gateway counts authoritative usage; microservices cannot maintain divergent counters.

Override path. Sales-issued temporary lifts logged to finance and security.

Testing. Chaos-inject a single-tenant flood in staging; ensure other tenants stay green.
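
A sketch of the multi-axis quota check at a central gateway; the axes, limits, and soft/hard thresholds are illustrative, and the usage counters are assumed to be aggregated elsewhere:

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    OK = "ok"
    WARN = "warn"          # soft limit: serve, but warn the tenant
    THROTTLE = "throttle"  # hard limit: 429 with a retry-after hint

@dataclass
class Quota:
    limit: float
    soft_ratio: float = 0.8  # warn at 80% of the limit

# Illustrative per-tenant quotas across independent axes.
QUOTAS = {
    "rpm": Quota(600),
    "concurrent_streams": Quota(20),
    "daily_tokens": Quota(5_000_000),
    "tool_calls_per_day": Quota(10_000),
}

def evaluate(usage: dict[str, float]) -> dict[str, Verdict]:
    """Return a verdict per axis; different abusers hit different walls."""
    verdicts = {}
    for axis, quota in QUOTAS.items():
        used = usage.get(axis, 0.0)
        if used >= quota.limit:
            verdicts[axis] = Verdict.THROTTLE
        elif used >= quota.limit * quota.soft_ratio:
            verdicts[axis] = Verdict.WARN
        else:
            verdicts[axis] = Verdict.OK
    return verdicts

# Example: within RPM but over the daily token envelope.
print(evaluate({"rpm": 120, "daily_tokens": 5_200_000}))
```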

148. How would you architect data isolation for a multi-tenant RAG product so tenants cannot read each other’s documents?

Hard boundaries. Namespace per tenant in object storage + vector index; cross-tenant queries impossible at API layer, not only by filter discipline.

Cryptography. Optional per-tenant keys for encrypted chunks; key rotation coordinated with imports.

Metadata hygiene. CI tests for filter injection; pen tests for IDOR on doc ids.

Operational access. Support impersonation is audited, time-bound, and customer-visible in some industries.

Shared models. Even shared embeddings can leak via timing side channels—document honest residual risk for nation-state adversaries vs commercial SaaS expectations.
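
One way to make the hard boundary structural rather than a matter of filter discipline is to derive the storage namespace from the verified auth context, never from the request. A hypothetical sketch; the backend calls are placeholders:

```python
class TenantScopedStore:
    """Wraps an object/vector store so every call is forced through the caller's namespace."""

    def __init__(self, backend, tenant_id: str):
        self._backend = backend
        # Namespace comes from the verified auth context, never from request parameters.
        self._namespace = f"tenant::{tenant_id}"

    def get_document(self, doc_id: str):
        # The backend key is always prefixed; no code path accepts a caller-supplied
        # namespace, so a forged doc_id cannot cross tenants (hypothetical backend.get).
        return self._backend.get(f"{self._namespace}/{doc_id}")

    def search(self, query_vector, top_k: int = 10):
        # Vector queries are pinned to the tenant's collection, not filtered post hoc.
        return self._backend.query(collection=self._namespace, vector=query_vector, top_k=top_k)

def store_for_request(auth_context, backend) -> TenantScopedStore:
    # auth_context.tenant_id is assumed to be set by token verification at the gateway.
    return TenantScopedStore(backend, auth_context.tenant_id)
```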

149. How would you implement tenant-specific feature flags (models, tools, prompts) without configuration sprawl?

Layered config. Global defaults → segment overrides → tenant overrides → experiment arms; merges are deterministic and cached at edge.

Typing. Schema-validated flags prevent ‘stringly typed’ prod incidents where a typo disables safety.

Audit. Who flipped a flag, when, with ticket link—compliance teams ask within hours.

Rollout. Staged enablement with automatic rollback if per-tenant error budget burns.

DX. Self-service portal for safe toggles; risky changes require approval workflow.
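
A minimal sketch of the deterministic layered merge with schema validation as the final gate; the flag names, layers, and type checks are illustrative:

```python
from typing import Any

ALLOWED_FLAGS = {            # a tiny "schema": flag name -> expected type
    "model": str,
    "tools_enabled": bool,
    "max_output_tokens": int,
}

def merge_layers(*layers: dict[str, Any]) -> dict[str, Any]:
    """Later layers win; the merge order is fixed, so the result is deterministic and cacheable."""
    merged: dict[str, Any] = {}
    for layer in layers:
        merged.update(layer)
    return merged

def validate(flags: dict[str, Any]) -> dict[str, Any]:
    """Reject unknown flags and wrong types before they reach production traffic."""
    for name, value in flags.items():
        if name not in ALLOWED_FLAGS:
            raise ValueError(f"unknown flag: {name}")
        if not isinstance(value, ALLOWED_FLAGS[name]):
            raise TypeError(f"flag {name} expects {ALLOWED_FLAGS[name].__name__}")
    return flags

global_defaults = {"model": "base-model", "tools_enabled": False, "max_output_tokens": 1024}
segment = {"tools_enabled": True}
tenant = {"model": "base-model-large"}
experiment = {"max_output_tokens": 2048}

# Global defaults -> segment -> tenant -> experiment arm, validated once at the end.
print(validate(merge_layers(global_defaults, segment, tenant, experiment)))
```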

150. How would you handle model warm-keeping vs cold-start tradeoffs in a multi-tenant SaaS LLM platform?

Tiering. Premium tenants get min-ready replicas; free tier may hit cold paths with explicit latency messaging.

Autoscaler smarts. Predictive scale from regional calendars; idle shutdown with aggressive warmup probes before traffic returns.

Shared weights. Keep base model resident; swap LoRA adapters per tenant bucket to balance memory.

Cost transparency. Finance models idle GPU; product prices warmth into SKUs.

Fallback. Route overflow to serverless vendor burst lane under capacity contract.
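
A sketch of the shared-weights idea: the base model stays resident while per-tenant-bucket LoRA adapters cycle through a small LRU cache sized to GPU memory. The loader and capacity are placeholders:

```python
from collections import OrderedDict

class AdapterCache:
    """Keeps at most `capacity` adapters hot; the base model is loaded once and shared."""

    def __init__(self, load_adapter, capacity: int = 8):
        self._load = load_adapter              # e.g. fetches adapter weights from blob storage
        self._capacity = capacity
        self._cache: "OrderedDict[str, object]" = OrderedDict()

    def get(self, adapter_id: str):
        if adapter_id in self._cache:
            self._cache.move_to_end(adapter_id)   # mark as most recently used
            return self._cache[adapter_id]
        adapter = self._load(adapter_id)           # cold path: fetch and load onto the GPU
        self._cache[adapter_id] = adapter
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)        # evict the least recently used adapter
        return adapter

# Usage sketch: premium tenant buckets can be pre-warmed at deploy time.
cache = AdapterCache(load_adapter=lambda aid: f"weights::{aid}", capacity=2)
for tenant_bucket in ["premium-a", "premium-b", "free-pool"]:
    cache.get(tenant_bucket)
```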

151. How would you support customer-managed keys (CMK) or per-tenant encryption for embeddings and logs in an LLM platform?

KMS integration. Tenant KEK in CloudHSM/KMS; data keys wrapped per batch; revocation halts new decrypt within minutes.

Search implications. True CMK may limit server-side features—set expectations, do not promise magic hybrid search on ciphertext without homework.

Rotation. Re-wrap data keys; re-index strategy for vectors if required by crypto architecture.

Compliance narrative. Document shared responsibility: you secure platform ops; customer secures key custody.

Performance. Batch crypto ops; avoid per-token round trips.
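
A minimal envelope-encryption sketch using Fernet as a stand-in for the KMS/HSM operations; in production the tenant KEK never leaves the HSM and wrapping is a KMS API call, so treat this only as an illustration of the key hierarchy:

```python
from cryptography.fernet import Fernet

# Stand-in for the tenant KEK held in KMS/CloudHSM (illustrative only).
tenant_kek = Fernet(Fernet.generate_key())

def encrypt_batch(chunks: list[bytes]) -> tuple[bytes, list[bytes]]:
    """One fresh data key per batch; only the wrapped form of the data key is stored."""
    data_key = Fernet.generate_key()
    dek = Fernet(data_key)
    wrapped_key = tenant_kek.encrypt(data_key)
    return wrapped_key, [dek.encrypt(c) for c in chunks]

def decrypt_batch(wrapped_key: bytes, ciphertexts: list[bytes]) -> list[bytes]:
    """Revoking the KEK (denying the unwrap call) halts new decrypts within minutes."""
    dek = Fernet(tenant_kek.decrypt(wrapped_key))
    return [dek.decrypt(c) for c in ciphertexts]

wrapped, blobs = encrypt_batch([b"chunk-1", b"chunk-2"])
print(decrypt_batch(wrapped, blobs))
```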

152. How would you design onboarding for a new enterprise tenant indexing millions of docs without impacting existing customers?

Isolation. Dedicated ingestion workers or low-priority queues so backfills never starve interactive latency SLOs.

Phasing. Pilot corpus first; expand after quality sign-off; communicate index-lag expectations.

Resource caps. Per-tenant embed concurrency; spill to off-peak windows.

Health dashboards. Tenant sees connector status, DLQ depth, and search smoke tests—reduces ‘black hole’ anxiety.

Posture reviews. Complete the security checklist before broad tool access is enabled.
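
A sketch of the per-tenant embed-concurrency cap for backfills, assuming an async ingestion worker; the embedding call is a placeholder:

```python
import asyncio

BACKFILL_CONCURRENCY = 4   # illustrative cap per onboarding tenant

async def embed_chunk(chunk: str) -> list[float]:
    await asyncio.sleep(0.01)        # placeholder for the real embedding call
    return [0.0]

async def backfill(tenant_id: str, chunks: list[str]) -> int:
    """Backfill runs behind a small semaphore so it never starves interactive traffic."""
    sem = asyncio.Semaphore(BACKFILL_CONCURRENCY)

    async def worker(chunk: str):
        async with sem:
            return await embed_chunk(chunk)

    results = await asyncio.gather(*(worker(c) for c in chunks))
    return len(results)

print(asyncio.run(backfill("new-enterprise", [f"doc-{i}" for i in range(20)])))
```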

153. How would you contain blast radius if one tenant triggers a systemic failure (bad prompt, tool loop, dataset poisoning)?

Circuit breakers per tenant. Stop abusive sessions while keeping platform up; alert SOC if attack pattern emerges.

Poison controls. Rate limit uploads; hash-block known bad files; manual quarantine for suspicious corpora.

Global kill switches. Disable specific tools or models within seconds; rehearse quarterly.

Forensics. Preserve redacted traces for postmortem without blocking user traffic indefinitely.

Communication. Be honest on the status page about whether the premium tier is unaffected or the outage is widespread.
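
A minimal per-tenant circuit-breaker sketch: trip after a burst of failures, cool down, then allow a probe. Thresholds are illustrative:

```python
import time

class TenantBreaker:
    """Opens the circuit for one tenant after repeated failures without touching others."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            return True   # half-open: let a single probe through
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip: stop this tenant's sessions

breakers: dict[str, TenantBreaker] = {}

def guarded_call(tenant_id: str, call):
    breaker = breakers.setdefault(tenant_id, TenantBreaker())
    if not breaker.allow():
        raise RuntimeError("tenant temporarily suspended; platform stays up for others")
    try:
        result = call()
        breaker.record(True)
        return result
    except Exception:
        breaker.record(False)
        raise
```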

154. How would you design fair-share scheduling when GPU capacity is over-subscribed across thousands of tenants?

Weights. Contracted share translates to scheduler weights; short jobs can still be squeezed out by batch whales, so use aging to prevent starvation.

Preemption policy. Only preempt best-effort workloads; never drop paid interactive mid-stream without policy.

Transparency. Show estimated wait in API responses when queued.

Metering. Bill or throttle when exceeding purchased share; economics drive sustainability.

Hybrid cloud. Burst agreements with hyperscaler endpoints when local metal saturates.
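
A sketch of weight-plus-aging selection: effective priority grows with wait time on top of the tenant's contracted weight, so batch whales cannot starve short jobs indefinitely. Weights and the aging factor are illustrative:

```python
import time
from dataclasses import dataclass, field

@dataclass
class Job:
    tenant: str
    enqueued: float = field(default_factory=time.monotonic)

# Contracted share translated into scheduler weights (illustrative).
WEIGHTS = {"enterprise": 4.0, "pro": 2.0, "free": 1.0}
AGING_BOOST = 0.5   # priority gained per second of waiting, regardless of tier

def pick_next(queue: list[Job], tier_of: dict[str, str]) -> Job:
    """Highest effective priority wins; aging guarantees eventual service for everyone."""
    now = time.monotonic()

    def priority(job: Job) -> float:
        wait = now - job.enqueued
        return WEIGHTS[tier_of[job.tenant]] + AGING_BOOST * wait

    best = max(queue, key=priority)
    queue.remove(best)
    return best

queue = [Job("big-batch-tenant"), Job("small-free-tenant")]
tiers = {"big-batch-tenant": "enterprise", "small-free-tenant": "free"}
print(pick_next(queue, tiers).tenant)
```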

155. How would you implement vector index namespaces or collections per tenant at scale?

Partition strategy. Physical collections for big tenants; logical partitions with strict metadata filters for long tail—avoid one flat collection leaking via buggy filters.

Routing. Gateway maps tenant to collection id deterministically; tests assert no cross-pointer.

Migrations. Blue/green per tenant when upgrading embedding schema; shared infra still allows isolated cutovers.

Compaction & GC. Per-tenant tombstone sweeps after delete waves.

Cost. Charge for index GB and QPS; incentivize pruning stale vectors.
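
A deterministic routing sketch: large tenants get physical collections, the long tail shares hashed partitions with a mandatory tenant filter. Names and thresholds are illustrative:

```python
import hashlib

DEDICATED_TENANTS = {"acme-corp", "globex"}   # illustrative large tenants
SHARED_PARTITIONS = 64

def route(tenant_id: str) -> tuple[str, dict]:
    """Return (collection_name, mandatory_filter) for every vector query."""
    if tenant_id in DEDICATED_TENANTS:
        # Physical isolation: the collection itself is the boundary.
        return f"col_{tenant_id}", {}
    # Logical isolation: a stable hash picks the shard; the filter is non-optional.
    shard = int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16) % SHARED_PARTITIONS
    return f"col_shared_{shard:02d}", {"tenant_id": tenant_id}

print(route("acme-corp"))
print(route("tiny-startup"))
```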

156. How would you attribute shared LLM infrastructure costs to individual tenants for internal billing or external invoicing?

Meters. Tokens, GPU-seconds, storage GB, egress, support touches—allocate with the same idempotency rigor as customer billing.

Allocation rules. Shared overhead spread by contracted minimums + marginal usage; document fairness when one tenant drives model upgrade.

Adjustments. Credits for outages; dispute workflow with trace-backed evidence.

Forecasting. Show tenants projected month-end based on trailing week.

Trust. Immutable usage ledger they can audit beats opaque ‘trust us’ invoices.
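
A sketch of a tamper-evident usage ledger: each entry carries the hash of the previous one, so any retroactive edit breaks a chain the tenant can verify. Fields are illustrative:

```python
import hashlib, json, time

def _digest(entry: dict) -> str:
    return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()

class UsageLedger:
    def __init__(self):
        self.entries: list[dict] = []

    def append(self, tenant: str, meter: str, quantity: float) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else "genesis"
        entry = {"ts": time.time(), "tenant": tenant, "meter": meter,
                 "quantity": quantity, "prev": prev_hash}
        entry["hash"] = _digest(entry)          # hash covers everything but itself
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; a single altered entry invalidates everything after it."""
        prev = "genesis"
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != prev or _digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True

ledger = UsageLedger()
ledger.append("acme-corp", "gpu_seconds", 1234.5)
ledger.append("acme-corp", "tokens_out", 90210)
print(ledger.verify())
```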

157. How would you allow tenant-specific prompts or tools without forking the entire codebase for each customer?

Template system. Versioned prompt slots with safe variable injection; schema-guardrails on injected JSON.

Composable tools. Register tenant tool packs from allowlisted primitives; dangerous combos blocked by policy engine.

Linting. Static analysis on templates for hallucination-prone patterns before deploy.

Supportability. Map tenant template id to owner team; sunset old versions aggressively.

Escape hatches. True bespoke logic lives in bounded sandbox plugins, not ad-hoc prod edits.
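
A sketch of versioned prompt slots with whitelisted variable injection, so tenant overrides can only fill declared slots; the registry shape is illustrative:

```python
from string import Template

# Each template version declares exactly which variables it accepts.
TEMPLATES = {
    ("support_answer", "v3"): {
        "text": Template("You are a support assistant for $company.\nContext: $context\nQuestion: $question"),
        "variables": {"company", "context", "question"},
    },
}

def render(template_id: str, version: str, values: dict[str, str]) -> str:
    entry = TEMPLATES[(template_id, version)]
    extra = set(values) - entry["variables"]
    missing = entry["variables"] - set(values)
    if extra or missing:
        # Reject rather than silently injecting or dropping slots.
        raise ValueError(f"extra={sorted(extra)} missing={sorted(missing)}")
    return entry["text"].substitute(values)

print(render("support_answer", "v3",
             {"company": "Acme", "context": "Plan limits...", "question": "How do I raise my quota?"}))
```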

158. How would you design the technical path for a tenant upgrading from standard to premium isolation (dedicated models, VPC peering, stricter SLA)?

Blueprints. Infrastructure-as-code modules for each tier; diff shows new subnets, peering, and KMS links.

Data migration. Replicate vectors and blobs with checksum verification; parallel run until parity tests pass.

Traffic shifting. Feature flag routes reads to new stack; instant rollback pointer preserved 48h.

Compliance checks. Re-run pen tests + DPIA when scope changes.

Runbook. Time-bounded cutover window with executive comms template.
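
A small sketch of the checksum parity test run during the parallel phase: compare per-object digests between the old and new stacks before flipping the flag. The in-memory dicts stand in for real object listings:

```python
import hashlib

def checksum(blob: bytes) -> str:
    return hashlib.sha256(blob).hexdigest()

def parity_report(source: dict[str, bytes], target: dict[str, bytes]) -> dict[str, list[str]]:
    """Classify objects as missing or mismatched; cutover waits for an empty report."""
    report = {"missing": [], "mismatched": []}
    for key, blob in source.items():
        if key not in target:
            report["missing"].append(key)
        elif checksum(blob) != checksum(target[key]):
            report["mismatched"].append(key)
    return report

old_stack = {"doc-1": b"alpha", "doc-2": b"beta"}
new_stack = {"doc-1": b"alpha", "doc-2": b"BETA"}
print(parity_report(old_stack, new_stack))   # doc-2 blocks cutover
```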

159. How would you support per-tenant data residency (EU-only, US-only) in an LLM platform with shared components?

Regional cells. Nearly complete stacks per region—control plane may be global but data plane stays local; forbid accidental cross-region log shipping.

Routing. Tenant creation pins the home region; tokens are region-sticky for life unless an explicit migration project moves the tenant.

Model parity. Maintain capability matrix per region; some models lag—product must message gaps.

Vendor sprawl. Abstract gateways hide regional provider endpoints; tests enforce ‘no EU prompt hits US API.’

Erasure & legal. Regional subprocessors documented in each tenant's dossier.
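
A sketch of region pinning at the gateway: the home region is fixed at tenant creation and every call asserts that the serving cell matches it, rather than silently crossing regions. Region names and endpoints are illustrative; the cell layout is diagrammed below:

```python
HOME_REGION = {"tenant-a": "eu-west", "tenant-b": "us-east"}   # pinned at creation

REGIONAL_ENDPOINTS = {
    "eu-west": "https://eu.inference.internal",   # hypothetical cell endpoints
    "us-east": "https://us.inference.internal",
}

def endpoint_for(tenant_id: str, serving_region: str) -> str:
    """Refuse to serve a tenant from the wrong cell instead of silently crossing regions."""
    home = HOME_REGION[tenant_id]
    if serving_region != home:
        raise PermissionError(f"{tenant_id} is pinned to {home}; refusing to serve from {serving_region}")
    return REGIONAL_ENDPOINTS[home]

print(endpoint_for("tenant-a", "eu-west"))
```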

Regional cells

flowchart TB
  EU[EU cell] --- GL[Global control]
  US[US cell] --- GL
  T1[Tenant A EU] --> EU
  T2[Tenant B US] --> US

160. How would you define and sell differentiated SLAs to enterprise tenants on a shared LLM platform?

Measure honesty. Only promise what you instrument: p95 TTFT, monthly uptime excluding vendor force majeure, max ingestion lag, support response tiers.

Credits. Automatic service credits tied to measurable misses; avoids custom legal every deal.

Shared vs dedicated. Shared tiers get best-effort with high percentile targets; premium gets dedicated capacity pools with harder numbers.

Transparency. Status page with per-region components; postmortems within N days.

Exit ramps. Data portability guarantees reduce procurement fear; engineering must back the promise with export APIs.

Recap — this section

| Q | Takeaway |
| --- | --- |
| 146 | Fair queues + bulkheads; batch vs interactive isolation; admission control; per-tenant queue metrics. |
| 147 | Multi-axis quotas; soft/hard tiers; centralized enforcement; audited overrides; isolation tests. |
| 148 | Namespace isolation + mandatory filters; optional CMK; IDOR tests; audited support access. |
| 149 | Hierarchical deterministic config; validation; audit trail; per-tenant error budgets; governance UX. |
| 150 | SKU-driven warm pools; predictive autoscale; LoRA swap efficiency; priced idle capacity; vendor burst valve. |
| 151 | KMS-wrapped DEKs; feature honesty; rotation/reindex plans; shared-responsibility story; batched crypto. |
| 152 | Backfill isolation + throttles; phased crawl; tenant-visible health; security gates before tools. |
| 153 | Tenant-scoped breakers; upload/tool poison defenses; rehearsed global kills; crisp comms. |
| 154 | Weighted fair queue + aging; preemption rules; honest ETA; purchased share enforcement; hybrid burst. |
| 155 | Physical vs logical tenancy; deterministic routing; per-tenant migrations; GC discipline; storage pricing. |
| 156 | Granular meters + allocation policy; outage credits; forecasts; tamper-evident usage ledger. |
| 157 | Versioned templates + lint; composable tool registry; policy engine; deprecation hygiene; sandbox extensions. |
| 158 | IaC tier blueprints; verified data replication; flag-gated traffic shift; compliance replay; runbooked cutover. |
| 159 | Regional data-plane cells; sticky tenancy; parity matrix; gateway enforcement; legal pack per region. |
| 160 | Instrument-before-promise; automated credits; tier-matched architecture; public status; export commitments. |
