
Scalability & performance

Fifteen questions on keeping LLM products fast enough, cheap enough, and stable under load—with back-of-the-envelope math, caching, queues, batching, Kubernetes, and honest limits of each technique.

16. Design LLM serving for ~1 million requests per day within ~$5,000/month

There is no universal answer—the bill is (requests × tokens_per_request × price_per_token) plus fixed costs (ingress, vector DB, compute). Interviewers want to see you work the envelope and list the knobs before you name a vendor.

Rough scale. One million requests per day is about 12 per second on average. Real traffic peaks higher (think 3–10×), so capacity planning targets peak, not the average.

Worked example (illustrative only). Suppose the typical call uses 1k input + 500 output tokens at a blended ~$2 / 1M tokens (made-up round numbers). That is 1.5k tokens × 1M requests ≈ 1.5B tokens/day → naive arithmetic ≈ $3k/day, roughly $90k/month at that fictional rate—so you cannot send every request to a frontier model and hope to stay near $5k/month unless most traffic is filtered, cached, or routed to cheaper tiers.
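A quick sanity check of that arithmetic in Python (every number below is the same made-up round figure as above):

REQUESTS_PER_DAY = 1_000_000
TOKENS_PER_REQUEST = 1_000 + 500            # input + output, illustrative
PRICE_PER_MILLION_TOKENS = 2.00             # blended, fictional

tokens_per_day = REQUESTS_PER_DAY * TOKENS_PER_REQUEST
cost_per_day = tokens_per_day / 1_000_000 * PRICE_PER_MILLION_TOKENS
print(f"{tokens_per_day / 1e9:.1f}B tokens/day -> ${cost_per_day:,.0f}/day, ${cost_per_day * 30:,.0f}/month")
# 1.5B tokens/day -> $3,000/day, $90,000/month: nowhere near $5k/month
# without caching, cheaper tiers, or shorter prompts.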

Cost levers that actually move the needle:

  • Tiered models: route simple intents to a mini or open model; reserve expensive models for hard prompts.
  • Semantic + exact caching: identical or near-identical questions skip paid generation.
  • Shorter prompts and answers: compress retrieval context, cap max_tokens, stop sequences for chatty models.
  • Batch / off-peak jobs: non-interactive work when spot GPUs or cheaper windows exist.
  • Self-host selectively: steady, high-volume workloads on owned GPUs can beat list API price—if you absorb ops cost honestly.
  • Feature flags: disable expensive tools or multi-step agents for free-tier users.

Figure 1 — Spend flows through the request path

flowchart LR
  R[Requests] --> F[Filters cache route]
  F -->|hit| C[Cache / rules]
  F -->|miss cheap| S[Small model]
  F -->|hard| L[Large model]
  C --> U[Users]
  S --> U
  L --> U
            

17. Implement semantic caching to cut API cost and latency

Exact caching keys on a hash of the full prompt—it only helps when users type the same thing byte-for-byte, which is rare in chat.

Semantic caching stores past (question, answer) pairs (or intermediate states) keyed by an embedding of the user query. A new query is embedded; you search a vector index of prior queries; if cosine similarity is above a threshold, you return the stored answer (or optionally re-validate with a cheap model).

Design choices: TTL per entry; max cache size; namespace per prompt version and model (never serve a GPT-4 answer when the live stack runs a different model); PII rules (do not cache secrets); optional lightweight verifier (“does this cached answer still apply?”) for fast-changing facts.

Example. Internal IT bot: “How do I reset VPN?” appears in dozens of phrasings. Embeddings cluster closely; first answer was expensive to produce; subsequent near-duplicates return in tens of milliseconds from Redis + vector metadata.
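A minimal sketch of the lookup path, assuming `embed()` and `call_llm()` are whatever embedding and generation calls your stack already makes; a real deployment keeps the index in Redis or a vector DB and adds the TTL, namespacing, and PII rules above:

import numpy as np

SIM_THRESHOLD = 0.92                    # tune on your own traffic
cache = []                              # list of (embedding, answer); stand-in for a vector index

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(query, embed, call_llm):
    q_vec = embed(query)
    # Nearest prior query by cosine similarity (a vector index does this at scale).
    scored = [(cosine(q_vec, vec), ans) for vec, ans in cache]
    if scored:
        best_sim, best_ans = max(scored, key=lambda s: s[0])
        if best_sim >= SIM_THRESHOLD:
            return best_ans             # hit: skip paid generation entirely
    result = call_llm(query)            # miss: pay for the call, then remember it
    cache.append((q_vec, result))
    return result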

Figure 2 — Semantic cache lookup

flowchart TB
  Q[User query] --> E[Embed query]
  E --> VS[Vector search prior queries]
  VS --> T{Similarity above threshold?}
  T -->|yes| RET[Return cached answer]
  T -->|no| LLM[Call LLM then store pair]
            

18. Role of Kafka, RabbitMQ, or SQS in an LLM pipeline—and when to use one

A message queue sits between “something happened” and “something will be processed”—so producers are not blocked by slow consumers.

What queues buy you in LLM systems: absorbing bursts (marketing email triggers millions of summarizations); retries with backoff without losing work; fan-out to multiple workers; dead-letter queues for poison prompts; clearer backpressure than an unbounded in-memory list.

When you might skip them: strict sub-second synchronous chat with no buffering—though you may still queue side effects (analytics, indexing) asynchronously.

Pick primitives by need: SQS for simple AWS-native workloads; RabbitMQ for classic queues and routing; Kafka when you need a durable replayable log, high throughput, and stream processing downstream.
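A hedged worker-loop sketch against SQS with boto3 (the queue URL and `process()` are placeholders; RabbitMQ and Kafka consumers follow the same claim, process, acknowledge shape):

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/llm-jobs"   # placeholder

def worker_loop(process):
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,             # long polling keeps idle workers cheap
        )
        for msg in resp.get("Messages", []):
            try:
                process(msg["Body"])        # e.g. call the LLM gateway, store the result
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
            except Exception:
                # No delete: the message becomes visible again after the visibility
                # timeout and, after enough failed attempts, lands in the DLQ.
                pass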

Figure 3 — Producers and workers decoupled

flowchart LR
  P1[API] --> Q[[Queue]]
  P2[Scheduler] --> Q
  Q --> W1[Worker]
  Q --> W2[Worker]
  W1 --> GW[LLM gateway]
  W2 --> GW
            

19. Async LLM pipeline for long-running tasks (reports, deep document analysis)

An HTTP request should not be held open for ten minutes while a model chews through a hundred-page PDF. Pattern: accept the job, return a job id, move work to workers, expose status via poll or webhook.

Must-haves: idempotency keys so duplicate submits do not double-charge; persistent state machine (queued → running → succeeded/failed); checkpointing for multi-step pipelines; partial results if useful; visibility timeout so stuck jobs return to the queue.

Example. “Generate quarterly compliance report” uploads files → job enqueued → worker chunks, embeds, summarizes sections, merges → user gets email + link when done; UI shows progress (“3/12 sections”).
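A minimal in-process sketch of that lifecycle, where `run_pipeline` stands in for the chunk/embed/summarize/merge steps; a real system keeps `jobs` in a database and runs workers as separate processes:

import queue, threading, uuid

jobs = {}                      # job_id -> {"state", "result"}; persist this in production
work_q = queue.Queue()

def submit(payload, idempotency_key=None):
    # Re-submitting with the same key returns the existing job instead of double-charging.
    if idempotency_key and idempotency_key in jobs:
        return idempotency_key
    job_id = idempotency_key or str(uuid.uuid4())
    jobs[job_id] = {"state": "queued", "result": None}
    work_q.put((job_id, payload))
    return job_id              # client polls get_status(job_id) or receives a webhook

def get_status(job_id):
    return jobs[job_id]["state"]

def worker(run_pipeline):
    while True:
        job_id, payload = work_q.get()
        jobs[job_id]["state"] = "running"
        try:
            jobs[job_id]["result"] = run_pipeline(payload)
            jobs[job_id]["state"] = "succeeded"
        except Exception:
            jobs[job_id]["state"] = "failed"          # a retry policy may re-enqueue it

Workers start as threading.Thread(target=worker, args=(run_pipeline,)) here, or more realistically as separate pods pulling from a shared queue; the submit/status split is what keeps the HTTP path fast.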

Figure 4 — Async job lifecycle

stateDiagram-v2
  [*] --> Queued
  Queued --> Running: worker claims
  Running --> Succeeded: done
  Running --> Failed: error
  Failed --> Queued: retry policy
  Succeeded --> [*]
            

20. Request batching to optimize LLM inference throughput

Batching means processing multiple queries together on the GPU so matrix math stays saturated—higher **throughput**, often worse **per-query latency** for small batches.

Vendor batch APIs: some cloud APIs accept arrays of prompts in one HTTP call—good for **offline** scoring or non-interactive enrichment.

Self-hosted servers (vLLM, TGI, etc.): often implement continuous batching—new requests join an in-flight batch dynamically instead of waiting for a fixed batch to fill.

Interview nuance: interactive chat usually avoids giant static batches; you tune **max wait ms** vs throughput. For **offline** jobs, batch aggressively overnight.
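A sketch of that trade-off with asyncio micro-batching, where `run_batch` stands in for one batched forward pass and the 20 ms deadline is the knob you tune:

import asyncio

MAX_BATCH = 16
MAX_WAIT_S = 0.02                      # extra latency you accept in exchange for throughput
request_q = asyncio.Queue()

async def submit(prompt):
    fut = asyncio.get_running_loop().create_future()
    await request_q.put((prompt, fut))
    return await fut                   # resolves once its batch has been processed

async def batcher(run_batch):
    while True:
        prompt, fut = await request_q.get()            # wait for the first request
        batch = [(prompt, fut)]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(request_q.get(), remaining))
            except asyncio.TimeoutError:
                break
        outputs = await run_batch([p for p, _ in batch])   # one batched GPU call
        for (_, f), out in zip(batch, outputs):
            f.set_result(out)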

21. Scale self-hosted LLaMA / Mistral on Kubernetes for concurrent requests

Split concerns: a thin **API Deployment** (CPU) handles auth and HTTP; a **model server Deployment** (GPU) runs the heavyweight inference; they talk over cluster DNS or loopback sidecar patterns.

GPU node pools: schedule inference pods on nodes with the right GPU type; use resource requests/limits so two jobs do not oversubscribe VRAM and crash.

Scaling signals: scale replicas on **GPU utilization**, **request queue depth**, or **custom metrics** (pending prompts)—not only CPU.

Model weights: bake into the image (heavy), mount from object storage + init container, or use node-local cache—trade image pull time vs startup complexity.

Concurrency per replica: governed by batching settings, max sequences, and KV cache memory—document a **capacity model** (rough max QPS per A100 class card for your chosen model size).
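That capacity model can start as a few lines of arithmetic (all numbers below are placeholders; measure throughput on your own card and model):

import math

TOKENS_PER_SEC_PER_GPU = 1500          # measured generation throughput (placeholder)
AVG_OUTPUT_TOKENS = 400                # typical completion length (placeholder)
PEAK_QPS = 40                          # plan for peak, not average (placeholder)

qps_per_replica = TOKENS_PER_SEC_PER_GPU / AVG_OUTPUT_TOKENS      # ≈ 3.75 requests/s
replicas = math.ceil(PEAK_QPS / qps_per_replica)                  # ≈ 11 GPU pods at peak
print(f"{qps_per_replica:.2f} QPS per replica -> {replicas} replicas at peak")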

Figure 5 — API tier vs GPU inference tier

flowchart TB
  IN[Ingress] --> API[API pods CPU]
  API --> INF[Inference pods GPU]
  INF --> M["Model weights + vLLM/TGI"]
            

22. Token budget manager to prevent runaway cost in a multi-user product

Runaway cost usually comes from unbounded context, agent loops, or a single tenant launching a script that hammers your gateway.

Layers of defense: **preflight estimate** (tiktoken-like) on prompts plus tool returns; **hard caps** per request, per session, per user/day, per tenant/month; **circuit trip** when spend velocity spikes; **admin alerts**; **graceful errors** (“daily limit reached”) rather than silent truncation, which surprises users and is dangerous when safety depends on the full context.

Implementation sketch: Redis or DynamoDB counters with atomic increment; authoritative **billing reconciliation** from provider usage logs nightly to fix estimation drift.
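A sketch of the counter layer with redis-py (key names, limits, and the 48-hour expiry are illustrative; the nightly reconciliation against provider usage logs is a separate job):

import redis

r = redis.Redis()
DAILY_LIMIT = 200_000                          # tokens per user per day, illustrative

def check_and_reserve(user_id, estimated_tokens, day):
    key = f"tokens:{user_id}:{day}"
    used = r.incrby(key, estimated_tokens)     # atomic: safe across many gateways
    r.expire(key, 60 * 60 * 48)                # let old counters age out
    if used > DAILY_LIMIT:
        r.decrby(key, estimated_tokens)        # roll back the reservation
        return False                           # caller answers "daily limit reached"
    return True

def settle(user_id, estimated_tokens, actual_tokens, day):
    # After the call, replace the estimate with provider-reported usage.
    r.incrby(f"tokens:{user_id}:{day}", actual_tokens - estimated_tokens)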

Figure 6 — Budget check on the hot path

flowchart LR
  REQ[Incoming request] --> EST[Estimate tokens]
  EST --> CHK{Under budget?}
  CHK -->|no| REJ[Reject or degrade]
  CHK -->|yes| LLM[Forward to LLM]
  LLM --> INC[Increment usage meters]
            

23. Rate limiter for an LLM API you expose to external customers

Rate limits protect **your** providers, **your** GPUs, and **fairness** between tenants.

Algorithms: **token bucket** (smooth bursts with a refill rate) and **sliding window** (hard cap in a moving minute) are common. Publish limits in docs (requests/min and **tokens/min** separately if possible).

Distributed counters: Redis or a dedicated edge rate-limit service so all gateways share state.

Tiered products: higher limits for paid plans via API key metadata. Return 429 with Retry-After and structured error body.
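A single-process token-bucket sketch; a multi-gateway deployment keeps the same state in Redis (often behind a small Lua script) so all gateways share it:

import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` units per second."""
    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost=1.0):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                 # caller responds 429 with Retry-After

# Roughly 60 requests/min with bursts of 10; run a second bucket with cost=tokens
# to meter tokens/min separately.
requests_bucket = TokenBucket(capacity=10, rate=1.0)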

24. Efficient LLM streaming across load balancer and WebSocket

HTTP SSE: ensure reverse proxies disable response buffering—e.g. Nginx proxy_buffering off, reasonable read timeouts, HTTP/1.1 chunked transfer support.

WebSockets: connections are often **stateful**; you may need **session affinity** (sticky) to the same pod—or a broker pattern where any pod can publish to the client’s channel via Redis pub/sub.

Heartbeat frames: periodic pings keep intermediaries from closing “idle” streams during long generation pauses.

Backpressure: if the user reads slowly, bound buffers so memory does not grow without limit; consider pausing upstream read if the framework allows.
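A framework-agnostic sketch of the SSE frames with heartbeats and a bounded buffer (the `chunks` iterator stands in for whatever streaming client or model server you read from):

import queue, threading

HEARTBEAT_S = 15                     # ping before intermediaries decide the stream is idle

def sse_frames(chunks):
    buf = queue.Queue(maxsize=256)   # bounded: backpressure instead of unbounded memory

    def pump():
        for c in chunks:
            buf.put(c)               # blocks when the client reads slowly
        buf.put(None)

    threading.Thread(target=pump, daemon=True).start()
    while True:
        try:
            chunk = buf.get(timeout=HEARTBEAT_S)
        except queue.Empty:
            yield ": ping\n\n"       # SSE comment frame, ignored by clients
            continue
        if chunk is None:
            return
        yield f"data: {chunk}\n\n"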

25. Job queue for batch inference (e.g. nightly summarization of 50,000 documents)

Partition work into **shards** (e.g. 500 batches of 100 docs) so failures are localized. Each task is **idempotent**: re-running after crash should not duplicate side effects (use dedupe keys in the DB).

Merge strategy: map phase produces per-doc summaries; reduce phase may need a **hierarchical summarize** (“summarize 500 summaries”) to respect context limits—schedule multiple waves.

Operational extras: DLQ for toxic inputs; **metrics** on backlog depth; **dynamic worker count** tied to queue age; cost dashboard in tokens × price.
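A sketch of the shard/map/reduce shape, where `summarize`, `summarize_many`, and the `done_store` dedupe map are placeholders:

def shard(doc_ids, batch_size=100):
    return [doc_ids[i:i + batch_size] for i in range(0, len(doc_ids), batch_size)]

def map_task(batch, summarize, done_store):
    # Idempotent: a re-run after a crash skips documents already summarized.
    out = {}
    for doc_id in batch:
        if doc_id not in done_store:
            done_store[doc_id] = summarize(doc_id)
        out[doc_id] = done_store[doc_id]
    return out

def reduce_waves(summaries, summarize_many, fan_in=20):
    # Merge in waves so each merge prompt stays inside the context window.
    layer = list(summaries)
    while len(layer) > 1:
        layer = [summarize_many(layer[i:i + fan_in]) for i in range(0, len(layer), fan_in)]
    return layer[0]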

Figure 7 — Map-reduce style batch LLM

flowchart TB
  D[50k documents] --> Q[Queue of batches]
  Q --> W1[Workers map summarize]
  Q --> W2[Workers map summarize]
  W1 --> ACC[Accumulate chunk results]
  W2 --> ACC
  ACC --> R[Reduce merge waves]
            

26. Reduce inference latency—speculative decoding and caching strategies

Speculative decoding: a smaller **draft** model proposes several tokens quickly; the large **target** model verifies them in parallel. When the draft aligns with what the target would have produced, you skip forward—**faster time-to-complete** at the cost of running two models or specialized kernels.
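A toy sketch of the greedy acceptance rule, where `draft_next` and `target_next` are hypothetical callables returning each model's next token for a prefix; a real implementation scores all draft positions with a single target forward pass, which is where the speedup comes from:

def speculative_step(prefix, draft_next, target_next, k=4):
    # 1. The cheap draft model proposes k tokens.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. The target model verifies them; stop at the first disagreement and
    #    keep the target's own token there, so output quality matches the target.
    accepted, ctx = [], list(prefix)
    for t in draft:
        expected = target_next(ctx)
        if expected != t:
            accepted.append(expected)
            return accepted
        accepted.append(t)
        ctx.append(t)

    # 3. All k accepted: the verification pass yields one extra token for free.
    accepted.append(target_next(ctx))
    return accepted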

Prefix / KV cache reuse: if many requests share a long system prompt or RAG context, reuse **cached key/value tensors** (implementation varies: some vendors expose “prompt caching”; self-hosted stacks support prefix caching).

A semantic cache (Q17) removes generation latency entirely for repeated informational queries.

27. Benchmark and load-test an LLM application before production

Load testing LLM apps differs from CRUD APIs: latency depends on **output length**, **model**, **queueing**, and **tool calls**.

Define profiles: p50/p95 time-to-first-token, tokens/s, error rate, timeout rate, and **cost per synthetic user session**.

Tools: k6 or Locust with custom scripts that parse streamed responses; inject **concurrency ramps**; include **soak tests** (hours at moderate load) to find memory leaks in gateways.
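A bare-bones time-to-first-token probe (the URL and payload are placeholders; a k6 or Locust script wraps the same measurement in ramping concurrency):

import statistics, time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://chat.example.internal/v1/stream"      # placeholder endpoint

def one_session(prompt):
    start = time.monotonic()
    with requests.post(URL, json={"prompt": prompt}, stream=True, timeout=120) as resp:
        for _ in resp.iter_content(chunk_size=None):
            return time.monotonic() - start          # first streamed chunk = TTFT

def run(prompts, concurrency=20):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        ttfts = sorted(t for t in pool.map(one_session, prompts) if t is not None)
    print(f"p50 TTFT {statistics.median(ttfts):.2f}s, "
          f"p95 TTFT {ttfts[int(0.95 * len(ttfts))]:.2f}s")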

Chaos: simulate provider 429/5xx and measure fallback paths. **Record** prompt templates and model versions with each test run for reproducibility.

28. Dynamic model routing—simple queries to cheap models, hard ones to frontier models

Routing saves money and can improve latency for easy traffic.

Signals: **classifier** model or lightweight heuristic (length, intent label from prior turn, presence of code blocks, user tier); optional **semantic complexity** score from embeddings.

Safety rails: escalate automatically when confidence is low or when the user challenges (“appeal to GPT-4”); **shadow-run** the expensive model on a sample to measure quality drift.
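A heuristic router can start this small (the thresholds, tier names, and escalation signal are all assumptions you would tune against your own traffic):

def route(query, user_tier, classifier_confidence=None):
    """Return which model pool serves this query; 'frontier' is the expensive tier."""
    if user_tier == "free":
        return "small"                         # feature-flag expensive paths away
    if classifier_confidence is not None and classifier_confidence < 0.5:
        return "frontier"                      # uncertain: escalate rather than guess
    looks_hard = (
        len(query) > 2000                      # long, context-heavy prompt
        or "```" in query                      # code usually needs the big model
        or "step by step" in query.lower()
    )
    return "frontier" if looks_hard else "small"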

Figure 8 — Router in front of model fleet

flowchart TB
  Q[Query] --> R{Router}
  R -->|easy| M1[Small / mini model]
  R -->|hard| M2[Frontier model]
  R -->|uncertain| M2
            

29. Auto-scaling an LLM inference cluster on AWS EKS or GCP GKE

CPU-centric HPA defaults are insufficient. Common pattern: **KEDA** or custom metrics adapters scaling on **queue length**, **requests waiting**, or **GPU utilization** exported from Prometheus.
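The scaling arithmetic itself is simple; the hard part is exporting the right signal. A sketch of the proportional rule the HPA applies, using queued prompts per pod as the metric:

import math

def desired_replicas(current_replicas, queue_per_pod, target_queue_per_pod=5):
    # Scale replicas by the ratio of observed metric to target metric.
    ratio = queue_per_pod / target_queue_per_pod
    return max(1, math.ceil(current_replicas * ratio))

# 4 pods each seeing 12 queued prompts against a target of 5 -> scale to 10 pods.
print(desired_replicas(4, 12))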

Cluster autoscaler: add GPU **node pool** capacity when pending pods cannot schedule. Watch **bootstrap time**—cold nodes plus image pulls lengthen scale-out.

Scale-to-zero for GPUs sounds cheap but clashes with cold starts (Q30); many teams keep **minimum replicas > 0** during business hours.

Figure 9 — Metrics-driven scale loop

flowchart LR
  PROM[Metrics: queue GPU] --> HPA[HPA / KEDA]
  HPA --> REP[Replica count]
  REP --> CA[Cluster autoscaler nodes]
            

30. Cold start latency for self-hosted LLMs on Kubernetes

Cold start includes: node provisioning, pulling a multi-GB container image, loading weights into GPU memory, compiling kernels on the first request, and passing HTTP health checks.

Mitigations: **minimum replicas** during peak; **pre-warmed node pools**; **smaller images** or lazy layer pulls; **PVC** with cached weights on nodes; **readiness** gates that only pass after a dummy inference; **predictive scaling** before known events (keynotes, Monday morning).
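A sketch of the readiness-after-warm-inference idea, where `generate` is a placeholder for one short dummy generation wired into whatever HTTP framework serves the pod:

import threading

READY = threading.Event()

def warm_up(generate):
    generate("Hello")            # forces weight load and first-kernel compilation
    READY.set()                  # only now does the readiness route return 200

def readyz():
    return (200, "ok") if READY.is_set() else (503, "warming up")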

For bursty demos, an **always-on warm pool** often beats aggressive scale-to-zero once you count user abandonment during the wait.

Recap — Section 2

Q · Takeaway
16 · Envelope cost (tokens × price × volume); peak > average; tier, cache, cap, self-host where steady.
17 · Semantic cache via embeddings + similarity threshold; version by model/prompt.
18 · Queues decouple bursts, enable retries/DLQ; pick Kafka vs SQS by replay and ops model.
19 · Async jobs: id, workers, status, idempotency, checkpoints.
20 · Batching raises throughput; continuous batching trades wait ms for GPU efficiency.
21 · K8s: split API vs GPU inference; node pools; scale on GPU/queue signals.
22 · Budget manager: estimate + distributed counters + hard caps + alerts.
23 · Token + request limits; token bucket/sliding window; distributed state.
24 · Disable buffering; WebSocket stickiness or broker; heartbeats.
25 · Batch map-reduce with idempotent shards and merge waves.
26 · Spec decoding + KV/prefix caching + semantic cache.
27 · Profile TTFT/tokens/s; soak + chaos; versioned reproducibility.
28 · Router with escalation and optional shadow eval.
29 · KEDA/custom metrics; cluster autoscaler; beware GPU cold nodes.
30 · Min replicas, warm pools, image/weight caching, readiness after warm infer.
