Fifteen questions on keeping LLM products fast enough, cheap enough, and stable under load—with back-of-the-envelope math, caching, queues, batching, Kubernetes, and honest limits of each technique.
16. Design LLM serving for ~1 million requests per day within ~$5,000/month
There is no universal answer—the bill is (requests × tokens_per_request × price_per_token) plus fixed costs (ingress, vector DB, compute). Interviewers want you to show the envelope and list the knobs before you guess a vendor.
Rough scale. One million requests per day is about 12 per second on average. Real traffic peaks higher (think 3–10×), so capacity planning targets peak, not the average.
Worked example (illustrative only). Suppose the typical call uses 1k input + 500 output tokens at blended ~$2 / 1M tokens (made-up round numbers). That is 1.5k tokens × 1M requests ≈ 1.5B tokens/day → naive arithmetic ≈ $3k/day at that fictional rate—so you cannot spend recklessly on a frontier model for every request and hope to stay near $5k unless most traffic is filtered or cached.
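The envelope above fits in a few lines of arithmetic (all prices and ratios are the same made-up round numbers as in the text, not real vendor rates):

```python
# Back-of-the-envelope cost model. Every constant here is a fictional
# round number matching the worked example in the text.
REQS_PER_DAY = 1_000_000
TOKENS_PER_REQ = 1_000 + 500            # 1k input + 500 output
PRICE_PER_M_TOKENS = 2.00               # blended $/1M tokens (made up)

tokens_per_day = REQS_PER_DAY * TOKENS_PER_REQ
cost_per_day = tokens_per_day / 1_000_000 * PRICE_PER_M_TOKENS
cost_per_month = cost_per_day * 30

avg_rps = REQS_PER_DAY / 86_400         # ~11.6 requests/second on average
peak_rps = avg_rps * 5                  # plan for a 3-10x peak; 5x shown here

print(f"{tokens_per_day / 1e9:.1f}B tokens/day, ${cost_per_day:,.0f}/day, "
      f"${cost_per_month:,.0f}/month, peak ~{peak_rps:.0f} rps")
```

Running the numbers this way makes the conclusion obvious: the naive all-frontier-model bill is ~$90k/month, roughly 18× the budget, so filtering, caching, and routing are not optional.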
Cost levers that actually move the needle:
Tiered models: route simple intents to a mini or open model; reserve expensive models for hard prompts.
Shorter prompts and answers: compress retrieval context, cap max_tokens, stop sequences for chatty models.
Batch / off-peak jobs: non-interactive work when spot GPUs or cheaper windows exist.
Self-host selectively: steady, high-volume workloads on owned GPUs can beat list API price—if you absorb ops cost honestly.
Feature flags: disable expensive tools or multi-step agents for free-tier users.
Figure 1 — Spend flows through the request path
```mermaid
flowchart LR
  R[Requests] --> F[Filters cache route]
  F -->|hit| C[Cache / rules]
  F -->|miss cheap| S[Small model]
  F -->|hard| L[Large model]
  C --> U[Users]
  S --> U
  L --> U
```
17. Implement semantic caching to cut API cost and latency
Exact caching keys on a hash of the full prompt—it only helps when users type the same thing byte-for-byte, which is rare in chat.
Semantic caching stores past (question, answer) pairs (or intermediate states) keyed by an embedding of the user query. A new query is embedded; you search a vector index of prior queries; if cosine similarity is above a threshold, you return the stored answer (or optionally re-validate with a cheap model).
Design choices: TTL per entry; max cache size; namespace per prompt version and model (never serve a GPT-4 answer when the live stack runs a different model); PII rules (do not cache secrets); optional lightweight verifier (“does this cached answer still apply?”) for fast-changing facts.
Example. Internal IT bot: “How do I reset VPN?” appears in dozens of phrasings. Embeddings cluster closely; first answer was expensive to produce; subsequent near-duplicates return in tens of milliseconds from Redis + vector metadata.
Figure 2 — Semantic cache lookup
```mermaid
flowchart TB
  Q[User query] --> E[Embed query]
  E --> VS[Vector search prior queries]
  VS --> T{Similarity above threshold?}
  T -->|yes| RET[Return cached answer]
  T -->|no| LLM[Call LLM then store pair]
```
18. Role of Kafka, RabbitMQ, or SQS in an LLM pipeline—and when to use one
A message queue sits between “something happened” and “something will be processed”—so producers are not blocked by slow consumers.
What queues buy you in LLM systems: absorbing bursts (marketing email triggers millions of summarizations); retries with backoff without losing work; fan-out to multiple workers; dead-letter queues for poison prompts; clearer backpressure than an unbounded in-memory list.
When you might skip them: strict sub-second synchronous chat with no buffering—though you may still queue side effects (analytics, indexing) asynchronously.
Pick primitives by need: SQS for simple AWS-native workloads; RabbitMQ for classic queues and routing; Kafka when you need a durable replayable log, high throughput, and stream processing downstream.
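A toy sketch of what the queue buys you, retries with backoff plus a dead-letter queue, using Python's stdlib `queue` as a stand-in for SQS/RabbitMQ/Kafka (`summarize` and its failure rate are invented for illustration):

```python
import queue
import random
import time

work_q: queue.Queue = queue.Queue()          # stand-in for SQS/RabbitMQ/Kafka
dead_letter_q: queue.Queue = queue.Queue()   # poison messages end up here
MAX_ATTEMPTS = 3

def summarize(doc: str) -> str:
    # Invented LLM call that fails ~30% of the time to exercise retries.
    if random.random() < 0.3:
        raise TimeoutError("provider 429/timeout")
    return f"summary of {doc}"

def worker() -> None:
    while not work_q.empty():
        doc, attempt = work_q.get()
        try:
            summarize(doc)
        except TimeoutError:
            if attempt + 1 >= MAX_ATTEMPTS:
                dead_letter_q.put(doc)              # give up -> DLQ
            else:
                time.sleep(2 ** attempt * 0.01)     # exponential backoff
                work_q.put((doc, attempt + 1))      # requeue, work is not lost

for i in range(5):
    work_q.put((f"doc-{i}", 0))
worker()
print("dead-lettered:", dead_letter_q.qsize())
```

A real broker adds the parts this sketch fakes: durable storage, visibility timeouts, and delivery across processes and machines.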
19. Async LLM pipeline for long-running tasks (reports, deep document analysis)
An HTTP request should not stay open for ten minutes while a model chews through a hundred-page PDF. Pattern: accept the job, return a job id, move work to workers, expose status via poll or webhook.
Must-haves: idempotency keys so duplicate submits do not double-charge; a persistent state machine (queued → running → succeeded/failed); checkpointing for multi-step pipelines; partial results where useful; a visibility timeout so stuck jobs return to the queue.
Example. “Generate quarterly compliance report” uploads files → job enqueued → worker chunks, embeds, summarizes sections, merges → user gets email + link when done; UI shows progress (“3/12 sections”).
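A minimal sketch of the accept-then-poll pattern, with an in-memory dict standing in for the persistent job table and a thread standing in for the worker pool (all names and payloads are illustrative):

```python
import threading
import uuid
from enum import Enum

class Status(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"

JOBS: dict[str, dict] = {}   # stand-in for a persistent job table

def submit(payload: str, idempotency_key: str) -> str:
    # Duplicate submits with the same key return the existing job id.
    for job_id, job in JOBS.items():
        if job["key"] == idempotency_key:
            return job_id
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"key": idempotency_key, "status": Status.QUEUED,
                    "payload": payload, "result": None}
    threading.Thread(target=run, args=(job_id,)).start()   # stand-in worker
    return job_id

def run(job_id: str) -> None:
    job = JOBS[job_id]
    job["status"] = Status.RUNNING
    job["result"] = f"report for {job['payload']}"   # slow LLM work goes here
    job["status"] = Status.SUCCEEDED

def status(job_id: str) -> Status:   # what the polling endpoint would return
    return JOBS[job_id]["status"]

jid = submit("q3-files", idempotency_key="tenant-42:q3-report")
dup = submit("q3-files", idempotency_key="tenant-42:q3-report")
assert jid == dup   # same key -> same job, no double-charge
```

The real version persists `JOBS` so a crashed worker can resume from its last checkpoint instead of restarting the pipeline.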
20. Request batching to optimize LLM inference throughput
Batching means processing multiple queries together on the GPU so matrix math stays saturated—higher **throughput**, often worse **per-query latency** for small batches.
Vendor batch APIs: some cloud APIs accept arrays of prompts in one HTTP call—good for **offline** scoring or non-interactive enrichment.
Self-hosted servers (vLLM, TGI, etc.): often implement continuous batching—new requests join an in-flight batch dynamically instead of waiting for a fixed batch to fill.
Interview nuance: interactive chat usually avoids giant static batches; you tune **max wait ms** vs throughput. For **offline** jobs, batch aggressively overnight.
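The max-wait-vs-throughput knob can be sketched as a static micro-batcher (continuous batching as in vLLM is more involved, admitting new requests into in-flight batches; this only shows the "fill the batch or hit the deadline" idea):

```python
import queue
import time

def batcher(in_q: queue.Queue, max_batch: int = 8, max_wait_ms: int = 50) -> list:
    """Collect requests until the batch is full or the wait deadline passes."""
    batch = [in_q.get()]                      # block until at least one request
    deadline = time.monotonic() + max_wait_ms / 1000
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                             # deadline hit: send a partial batch
        try:
            batch.append(in_q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch                              # forward to one GPU forward pass

q: queue.Queue = queue.Queue()
for i in range(10):
    q.put(f"prompt-{i}")
print(len(batcher(q)))   # 8: the batch filled before the deadline expired
```

Raising `max_wait_ms` improves GPU utilization for offline jobs; interactive chat keeps it small because every waiting millisecond is added latency.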
21. Scale self-hosted LLaMA / Mistral on Kubernetes for concurrent requests
Split concerns: a thin **API Deployment** (CPU) handles auth and HTTP; a **model server Deployment** (GPU) runs the heavyweight inference; they talk over cluster DNS or loopback sidecar patterns.
GPU node pools: schedule inference pods on nodes with the right GPU type; use resource requests/limits so two jobs do not oversubscribe VRAM and crash.
Scaling signals: scale replicas on **GPU utilization**, **request queue depth**, or **custom metrics** (pending prompts)—not only CPU.
Model weights: bake into the image (heavy), mount from object storage + init container, or use node-local cache—trade image pull time vs startup complexity.
Concurrency per replica: governed by batching settings, max sequences, and KV cache memory—document a **capacity model** (rough max QPS per A100 class card for your chosen model size).
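One way to write down such a capacity model, using rough LLaMA-7B-class numbers in fp16 (the layer/head counts, the 10 GB overhead margin, and the 80 GB card are all assumptions to replace with your own config):

```python
# Rough KV-cache capacity model for a 7B-class model in fp16 on one 80 GB GPU.
layers, heads, head_dim = 32, 32, 128     # assumed model architecture
bytes_per_param = 2                        # fp16
kv_per_token = 2 * layers * heads * head_dim * bytes_per_param  # K and V
weights_gb = 7e9 * bytes_per_param / 1e9                        # ~14 GB weights
gpu_gb = 80
kv_budget_gb = gpu_gb - weights_gb - 10    # assume ~10 GB activations/overhead

seq_len = 4096
kv_per_seq_gb = kv_per_token * seq_len / 1e9
max_concurrent = int(kv_budget_gb / kv_per_seq_gb)
print(f"KV cache: {kv_per_token / 1e6:.2f} MB/token, "
      f"{kv_per_seq_gb:.1f} GB per 4k sequence, ~{max_concurrent} concurrent")
```

Under these assumptions each 4k-token sequence pins about 2 GB of KV cache, capping one card in the mid-twenties of concurrent sequences, which is the kind of number the capacity doc should state per model and GPU class.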
22. Token budget manager to prevent runaway cost in a multi-user product
Runaway cost usually comes from unbounded context, agent loops, or a single tenant launching a script that hammers your gateway.
Layers of defense: **preflight estimate** (tiktoken-like) on prompts plus tool returns; **hard caps** per request, per session, per user/day, per tenant/month; **circuit trip** when spend velocity spikes; **admin alerts**; **graceful errors** (“daily limit reached”) instead of silent truncation surprises where safety depends on full context.
Implementation sketch: Redis or DynamoDB counters with atomic increment; authoritative **billing reconciliation** from provider usage logs nightly to fix estimation drift.
Figure 6 — Budget check on the hot path
```mermaid
flowchart LR
  REQ[Incoming request] --> EST[Estimate tokens]
  EST --> CHK{Under budget?}
  CHK -->|no| REJ[Reject or degrade]
  CHK -->|yes| LLM[Forward to LLM]
  LLM --> INC[Increment usage meters]
```
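A sketch of the hot-path check, with an in-memory counter and lock standing in for atomic Redis `INCRBY` operations (limits and user names are invented):

```python
import threading

class TokenBudget:
    """In-memory stand-in for Redis atomic counters, per-user daily caps."""
    def __init__(self, daily_limit: int):
        self.daily_limit = daily_limit
        self.used: dict[str, int] = {}
        self.lock = threading.Lock()

    def try_reserve(self, user: str, estimated_tokens: int) -> bool:
        with self.lock:
            if self.used.get(user, 0) + estimated_tokens > self.daily_limit:
                return False                        # reject or degrade gracefully
            self.used[user] = self.used.get(user, 0) + estimated_tokens
            return True

    def reconcile(self, user: str, actual_tokens: int, estimated_tokens: int) -> None:
        # Nightly correction from provider usage logs fixes estimation drift.
        with self.lock:
            self.used[user] += actual_tokens - estimated_tokens

budget = TokenBudget(daily_limit=10_000)
assert budget.try_reserve("alice", 6_000)
assert not budget.try_reserve("alice", 5_000)    # would exceed the daily cap
budget.reconcile("alice", actual_tokens=4_500, estimated_tokens=6_000)
assert budget.try_reserve("alice", 5_000)        # headroom freed after reconcile
```

Reserving the estimate up front and reconciling against actual usage later keeps the hot path to one counter operation per request.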
23. Rate limiter for an LLM API you expose to external customers
Rate limits protect **your** providers, **your** GPUs, and **fairness** between tenants.
Algorithms: **token bucket** (smooth bursts with a refill rate) and **sliding window** (hard cap in a moving minute) are common. Publish limits in docs (requests/min and **tokens/min** separately if possible).
Distributed counters: Redis or a dedicated edge rate-limit service so all gateways share state.
Tiered products: higher limits for paid plans via API key metadata. Return 429 with Retry-After and structured error body.
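A minimal token bucket along these lines (in a real multi-gateway deployment the per-tenant state lives in Redis rather than process memory; capacity and rate here are arbitrary):

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens per second."""
    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Lazy refill: add tokens for the time elapsed since the last call.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False    # caller returns 429 with a Retry-After header

bucket = TokenBucket(capacity=5, rate=1.0)   # 5-request burst, 1 req/s sustained
results = [bucket.allow() for _ in range(7)]
print(results)   # first 5 allowed, then throttled
```

Passing `cost=estimated_tokens` instead of 1 turns the same structure into the tokens/min limiter mentioned above.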
24. Efficient LLM streaming across load balancer and WebSocket
WebSockets: connections are often **stateful**; you may need **session affinity** (sticky) to the same pod—or a broker pattern where any pod can publish to the client’s channel via Redis pub/sub.
Heartbeat frames: periodic pings keep intermediaries from closing “idle” streams during long generation pauses.
Backpressure: if the user reads slowly, bound buffers so memory does not grow without limit; consider pausing upstream read if the framework allows.
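The bounded-buffer idea can be sketched with an `asyncio.Queue`: the producer blocks on `put()` when the slow client falls behind, instead of buffering the whole generation in memory (delays and sizes are illustrative):

```python
import asyncio

async def generate_tokens(n: int):
    for i in range(n):
        await asyncio.sleep(0)            # stand-in for model generation time
        yield f"tok{i}"

async def stream_to_client(buffer_size: int = 8) -> int:
    # Bounded queue: when the consumer reads slowly, put() blocks the
    # producer, so memory stays O(buffer_size) instead of O(output length).
    buf: asyncio.Queue = asyncio.Queue(maxsize=buffer_size)

    async def producer() -> None:
        async for tok in generate_tokens(100):
            await buf.put(tok)            # backpressure applies here
        await buf.put(None)               # end-of-stream sentinel

    prod = asyncio.create_task(producer())
    sent = 0
    while (tok := await buf.get()) is not None:
        await asyncio.sleep(0.001)        # simulate a slow client socket write
        sent += 1
    await prod
    return sent

print(asyncio.run(stream_to_client()))    # all 100 tokens, bounded memory
```

Heartbeats fit the same loop: a timer task that `put()`s a ping frame whenever no token has been sent for a few seconds.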
25. Job queue for batch inference (e.g. nightly summarization of 50,000 documents)
Partition work into **shards** (e.g. 500 batches of 100 docs) so failures are localized. Each task is **idempotent**: re-running after crash should not duplicate side effects (use dedupe keys in the DB).
Merge strategy: map phase produces per-doc summaries; reduce phase may need a **hierarchical summarize** (“summarize 500 summaries”) to respect context limits—schedule multiple waves.
Operational extras: DLQ for toxic inputs; **metrics** on backlog depth; **dynamic worker count** tied to queue age; cost dashboard in tokens × price.
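The sharding step can be sketched as follows, with a stable dedupe key per batch so re-runs after a crash are idempotent (batch size and key scheme are arbitrary choices):

```python
import hashlib

def shard(doc_ids: list[str], batch_size: int = 100) -> list[dict]:
    """Split work into batches with stable dedupe keys for idempotent re-runs."""
    tasks = []
    for i in range(0, len(doc_ids), batch_size):
        batch = doc_ids[i:i + batch_size]
        # Stable key: re-enqueueing the same batch after a crash produces the
        # same key, so workers can skip batches already marked done in the DB.
        key = hashlib.sha256(",".join(batch).encode()).hexdigest()[:16]
        tasks.append({"dedupe_key": key, "docs": batch})
    return tasks

docs = [f"doc-{i}" for i in range(50_000)]
tasks = shard(docs)
print(len(tasks), len(tasks[0]["docs"]))   # 500 batches of 100 docs
```

Each task is small enough that a single failure loses at most 100 documents of work, which the retry path then replays.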
26. Reduce inference latency—speculative decoding and caching strategies
Speculative decoding: a smaller **draft** model proposes several tokens quickly; the large **target** model verifies them in parallel. When the draft aligns with what the target would have produced, you skip forward—**faster time-to-complete** at the cost of running two models or specialized kernels.
Prefix / KV cache reuse: if many requests share a long system prompt or RAG context, reuse **cached key/value tensors** (implementation varies: some vendors expose “prompt caching”; self-hosted stacks support prefix caching).
Semantic cache (Q17) reduces latency for repeated informational queries entirely.
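A toy illustration of the speculative accept/verify loop, with canned word lists standing in for the draft and target models (real implementations verify all k draft tokens in a single batched forward pass and sample probabilistically; this only shows the control flow):

```python
def draft_model(prefix: list[str], k: int) -> list[str]:
    # Toy draft: cheaply guesses "the cat sat on the mat" word by word.
    canned = ["the", "cat", "sat", "on", "the", "mat"]
    return canned[len(prefix):len(prefix) + k]

def target_model(prefix: list[str]) -> str:
    # Toy target: the "true" next token; disagrees with the draft at position 4.
    canned = ["the", "cat", "sat", "on", "a", "mat"]
    return canned[len(prefix)] if len(prefix) < len(canned) else "<eos>"

def speculative_decode(k: int = 4) -> tuple[list[str], int]:
    out: list[str] = []
    rounds = 0
    while True:
        rounds += 1   # one verification round ~ one batched target forward pass
        for tok in draft_model(out, k):
            if target_model(out) == tok:
                out.append(tok)          # draft agreed: skip forward for free
            else:
                break                    # first mismatch: discard the rest
        nxt = target_model(out)          # target's own token (fix-up or bonus)
        if nxt == "<eos>":
            return out, rounds
        out.append(nxt)

tokens, rounds = speculative_decode()
print(" ".join(tokens), "| verification rounds:", rounds)
```

With these canned sequences the six-token output takes 2 verification rounds instead of 6 sequential target steps, which is exactly the speedup mechanism: the large model's calls are amortized over runs of accepted draft tokens.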
27. Benchmark and load-test an LLM application before production
Load testing LLM apps differs from CRUD APIs: latency depends on **output length**, **model**, **queueing**, and **tool calls**.
Define profiles: p50/p95 time-to-first-token, tokens/s, error rate, timeout rate, and **cost per synthetic user session**.
Tools: k6 or Locust with custom scripts that parse streamed responses; inject **concurrency ramps**; include **soak tests** (hours at moderate load) to find memory leaks in gateways.
Chaos: simulate provider 429/5xx and measure fallback paths. **Record** prompt templates and model versions with each test run for reproducibility.
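A sketch of collecting time-to-first-token percentiles from a streamed call (`fake_streamed_call` simulates queueing plus prefill delay; a real harness would parse SSE or WebSocket frames from the gateway):

```python
import random
import statistics
import time

def fake_streamed_call():
    # Stand-in for a streaming LLM call: queueing + prefill delay, then tokens.
    time.sleep(random.uniform(0.01, 0.05))
    for _ in range(20):
        yield "tok"

def measure_ttft(n: int = 50) -> list[float]:
    samples = []
    for _ in range(n):
        start = time.monotonic()
        stream = fake_streamed_call()
        next(stream)                              # first token arrives
        samples.append(time.monotonic() - start)
        for _ in stream:                          # drain the rest of the stream
            pass
    return samples

ttft = measure_ttft()
cuts = statistics.quantiles(ttft, n=100)          # 99 percentile cut points
p50, p95 = cuts[49], cuts[94]
print(f"TTFT p50={p50 * 1000:.0f}ms p95={p95 * 1000:.0f}ms")
```

The same loop extended with a tokens/s counter and per-session cost accumulator covers the remaining profile metrics.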
28. Dynamic model routing—simple queries to cheap models, hard ones to frontier models
Routing saves money and can improve latency for easy traffic.
Signals: **classifier** model or lightweight heuristic (length, intent label from prior turn, presence of code blocks, user tier); optional **semantic complexity** score from embeddings.
Safety rails: escalate automatically when confidence is low or when the user challenges (“appeal to GPT-4”); **shadow-run** the expensive model on a sample to measure quality drift.
Figure 8 — Router in front of model fleet
```mermaid
flowchart TB
  Q[Query] --> R{Router}
  R -->|easy| M1[Small / mini model]
  R -->|hard| M2[Frontier model]
  R -->|uncertain| M2
```
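A heuristic router along these lines can be a dozen lines before you reach for a classifier model (the rules, keywords, and model names below are invented for illustration):

```python
def route(query: str, user_tier: str = "free") -> str:
    """Cheap heuristic routing: escalate on any 'hard' signal or paid tier."""
    hard_signals = [
        len(query) > 500,                         # long, complex prompts
        "```" in query,                           # contains code blocks
        any(w in query.lower() for w in ("prove", "analyze", "architecture")),
    ]
    if user_tier == "enterprise" or any(hard_signals):
        return "frontier-model"
    return "mini-model"

print(route("What are your opening hours?"))            # -> mini-model
print(route("Analyze this architecture tradeoff..."))   # -> frontier-model
```

The uncertain branch in the figure maps to defaulting to the expensive model whenever the heuristics cannot confidently call the query easy.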
29. Auto-scaling an LLM inference cluster on AWS EKS or GCP GKE
CPU-centric HPA defaults are insufficient. Common pattern: **KEDA** or custom metrics adapters scaling on **queue length**, **requests waiting**, or **GPU utilization** exported from Prometheus.
30. Cold start latency for self-hosted LLMs on Kubernetes
Cold start includes: node provisioning, pulling a multi-GB container image, loading weights into GPU memory, compiling kernels the first time, and passing HTTP health checks.
Mitigations: **minimum replicas** during peak; **pre-warmed node pools**; **smaller images** or lazy layer pulls; **PVC** with cached weights on nodes; **readiness** gates that only pass after a dummy inference; **predictive scaling** before known events (keynotes, Monday morning).
For bursty demos, an **always-on warm pool** often beats aggressive scale-to-zero once you count user abandonment during the wait.