
Core LLM application architecture

Fifteen scenario questions on how you structure a production-grade LLM product in an enterprise: not “call OpenAI from a script,” but teams, trust boundaries, cost, compliance, and how every layer earns its place.

1. Design an end-to-end LLM application for an enterprise: what are the core layers?

An enterprise product is a chain of trust decisions: who is the user, what may they see, how much can we spend, what do auditors need, and how do we recover when a model vendor blips? You carve that into layers with clear jobs so one team does not “accidentally” smuggle secrets, PII, or the wrong model into production.

Experience is the surface users touch: web, mobile, sometimes voice. It streams partial answers, shows citations, handles errors gracefully, and never stores provider API keys in the browser.

API / BFF (backend-for-frontend) is the first trusted server: authentication, tenant or company context, input validation, and rate limits. It shapes “what the user asked” into a structured request the rest of the stack understands.

Orchestration is the workflow brain: should we run retrieval (RAG)? Which index? Call a calculator or ticket tool? Block risky flows until a human approves? Product policy lives here, not as giant string literals scattered across HTTP handlers.

The LLM gateway is the controlled front door to every model vendor: keys, retries, routing, quotas, redacted logging, and failover. Every internal service should speak to the gateway—not directly to five different cloud APIs with five different retry implementations.

Knowledge & tools is where facts and actions live: vector search, relational data, CRM, and tool executors with per-user permissions. The model does not magically know your company; this layer grounds answers and constrains side effects.

The data platform ingests documents, chunks, embeds, and re-indexes when sources change. Without it, RAG goes stale and answers silently diverge from reality.

Observability & governance covers traces, cost by tenant, audit logs where required, and versioning of prompts and models so you can answer: “Exactly which configuration produced this answer on Tuesday at 2pm?”

Worked example (HR copilot). An employee in Germany asks about parental leave. The BFF attaches identity and region. Orchestration selects HR policy RAG in an EU-approved region. Retrieval applies access control so contractors never see full-time-only policies. The gateway calls the approved endpoint. The answer cites policy paragraphs instead of sounding confident but vague.

Figure 1 — Reference architecture (logical layers)

flowchart TB
  subgraph L1["1 Experience"]
    UI["Web / mobile / voice"]
  end
  subgraph L2["2 API / BFF"]
    API["Auth, validate, tenant"]
  end
  subgraph L3["3 Orchestration"]
    OR["RAG, agents, policies"]
  end
  subgraph L4["4 LLM gateway"]
    GW["Keys, route, meter, retry"]
  end
  subgraph L5["5 Knowledge and tools"]
    V[("Vector / search")]
    DB[("SQL / APIs")]
    TX["Tool execution"]
  end
  subgraph L6["6 Data platform"]
    ING["Ingest, chunk, embed"]
  end
  subgraph L7["7 Observability"]
    O["Traces, audit, versions"]
  end
  UI --> API --> OR --> GW
  OR --> V
  OR --> DB
  OR --> TX
  ING --> V
  GW --> O
  OR --> O
  TX --> O
            

2. Monolithic LLM pipeline vs microservices LLM architecture—difference and when to choose each

In a monolith, most of the system ships as one deployable (or a handful of them). The same process might handle HTTP, retrieval, and some ingestion. That is simple for a small team: one repo, one release train, easy debugging in a single trace. The downside is that scaling is coarse: you duplicate the entire app to add capacity, even if only the chat path is hot. Heavy batch work—for example parsing a 10,000-page upload—can steal CPU from live users unless you are disciplined about background threads, queues, and resource limits inside that process.

Microservices split those concerns into separate deployables: chat API, ingestion workers, gateway, evaluation jobs, and so on. Different teams can own different pieces; you scale GPU inference separately from stateless HTTP. You pay with network latency, more failure modes, and the need for distributed tracing to understand one user request end to end.

When to choose: early product and a small team often win with a modular monolith (clean modules, one process) so you move fast without pretending you are Netflix on day one. When multiple teams, different SLOs, or regulated isolation really matter, split out long-running and GPU-bound work so interactive latency stays predictable.

Figure 2 — One deployable vs many services

flowchart LR
  subgraph mono["Monolith"]
    M["HTTP + RAG + jobs in one unit"]
  end
  subgraph micro["Services"]
    S1["Chat API"]
    S2["Ingest workers"]
    S3["LLM gateway"]
    S1 --> S3
    S2 --> S3
  end
            

3. Design a stateless LLM API that scales horizontally

Horizontal scaling means adding more identical servers behind a load balancer as traffic grows. That only works if any server can complete any request given the information in the request (or loadable from shared storage).

If conversation memory lived only in one machine’s RAM, the next request from the same user might land on a different machine—so the model would “forget” everything. The fix is external session state: Redis for hot data, a database for durable transcripts, or a pattern where the client sends a compact continuation token. The application tier becomes a pool of stateless workers.

Also treat egress to model providers seriously: connection pools, bounded concurrency per replica, and timeouts so one thundering herd does not exhaust file descriptors or open millions of HTTP/2 streams.
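A minimal sketch of that shape, assuming Redis as the shared session store and a hypothetical call_gateway() client for the model side: the handler rebuilds conversation state from the store on every request, so any replica can serve any user.

# Sketch of a stateless chat handler: any replica can serve any request because
# conversation state lives in a shared store, not in process memory.
# Assumes a reachable Redis instance; call_gateway() is illustrative.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

SESSION_TTL_SECONDS = 60 * 60  # abandoned chats expire on their own

def handle_turn(session_id: str, user_message: str) -> str:
    key = f"session:{session_id}"
    history = json.loads(r.get(key) or "[]")          # rebuild state from the shared store
    history.append({"role": "user", "content": user_message})

    answer = call_gateway(history)                    # hypothetical LLM gateway client
    history.append({"role": "assistant", "content": answer})

    r.set(key, json.dumps(history), ex=SESSION_TTL_SECONDS)  # write back with a TTL
    return answer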

Figure 3 — Stateless replicas + shared session store

flowchart LR
  U[Users] --> LB[Load balancer]
  LB --> A[Replica 1]
  LB --> B[Replica 2]
  LB --> C[Replica N]
  A --> R[("Shared session store")]
  B --> R
  C --> R
  A --> GW[LLM gateway]
  B --> GW
  C --> GW
            

4. Handle session and conversation state in a distributed LLM system

Split state by how long it must live and how durable it must be. Recent turns and scratchpads used for tool calls are often kept in Redis with a time-to-live, so abandoned chats clean themselves up. If compliance requires a full transcript, the database is the source of truth; Redis may be only a cache that can be rebuilt.

Models have finite context windows, so you cannot paste an infinitely long thread. A common pattern: keep the last few turns verbatim and periodically summarize older history into a short paragraph stored like any other message. That summary is cheap to refresh with a smaller model or a background job.

Two devices or two tabs can write at once, so use atomic appends (compare-and-swap in Redis, or transactions in SQL) so message order is not corrupted. If Redis is empty after a restart but Postgres has history, reload from Postgres rather than silently starting a “new person” for a returning user.
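A sketch of the “recent turns verbatim, older turns summarized” pattern; summarize() stands in for a cheap small-model call and the storage layer is left out so the shape of the logic is visible.

# Keep the last few turns verbatim; fold older turns into a running summary.
# summarize() is a hypothetical small-model call.
MAX_VERBATIM_TURNS = 6

def build_context(summary: str, turns: list[dict]) -> list[dict]:
    """Messages actually sent to the model for the next call."""
    messages = []
    if summary:
        messages.append({"role": "system",
                         "content": f"Summary of earlier conversation: {summary}"})
    messages.extend(turns[-MAX_VERBATIM_TURNS:])      # recent turns stay verbatim
    return messages

def maybe_compact(summary: str, turns: list[dict]) -> tuple[str, list[dict]]:
    """When the thread grows, compress older turns into the summary."""
    if len(turns) <= MAX_VERBATIM_TURNS:
        return summary, turns
    older, recent = turns[:-MAX_VERBATIM_TURNS], turns[-MAX_VERBATIM_TURNS:]
    summary = summarize(summary, older)               # hypothetical cheap-model call
    return summary, recent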

Figure 4 — Hot cache vs durable transcript

flowchart TB
  APP["Chat service"]
  APP --> Redis[("Redis: hot turns")]
  APP --> PG[("Postgres: durable log")]
            

5. Architect an LLM platform for ten internal teams and products at once

This is a platform product, not “everyone shares one API key.” Isolation first: Team A’s documents must not surface in Team B’s retrieval. That is enforced with tenant filters on every query and separate indexes or namespaces when policy demands—not by hoping the prompt says the right thing.

Leverage second: one gateway, one observability standard, one SDK, one way to register tools and prompts. You do not want ten homemade rate limiters and ten different ways to leak PII into logs.

A control plane holds who may use which model, monthly spend caps, allowed connectors (Salesforce, Confluence, internal wiki), and policy profiles—for example Legal restricted to EU endpoints while DevTools uses a higher-throughput cheaper route.
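One way to make the isolation rule concrete: the tenant filter is applied in code on every retrieval call, never left to the prompt. The vector client and field names below are illustrative.

# Sketch: tenant isolation enforced on every retrieval call, not in the prompt.
# vector_index is a placeholder for whatever search client you use.
def retrieve(query_embedding, tenant_id: str, top_k: int = 5):
    if not tenant_id:
        raise ValueError("refusing to search without a tenant context")
    return vector_index.search(
        vector=query_embedding,
        filter={"tenant_id": tenant_id},   # hard filter applied server-side
        top_k=top_k,
    )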

Figure 5 — Many products, shared gateway, split knowledge

flowchart TB
  T1["Team A app"] --> GW["Shared LLM gateway"]
  T2["Team B app"] --> GW
  T3["Team C app"] --> GW
  GW --> IA[("Index A")]
  GW --> IB[("Index B")]
  CP["Control plane: IAM, quotas"] -.-> GW
            

6. Key non-functional requirements (NFRs) before architecting an LLM system

NFRs are the qualities beyond feature checklists: speed, cost, uptime, safety, auditability. Agree on them before drawing boxes.

  • Latency: time to first token versus complete answer; is streaming mandatory?
  • Availability: what happens when a provider is down—silent failure, backup model, queued async response?
  • Cost: monthly ceiling, per-tenant budget, and what to do when the ceiling is hit.
  • Data: residency, retention, no-training flags for vendor APIs, encryption expectations.
  • Quality & safety: acceptable error modes for your domain (medical vs marketing copy differ).
  • Observability & audit: what must be logged, what must never appear in logs.
  • Abuse & fairness: rate limits, content policies, account-level kill switches.

Without these, you risk building a Ferrari when the business needed a reliable bus—or the opposite.

7. Request–response lifecycle for a streaming LLM API in production

Streaming means tokens arrive while the user reads. The lifecycle should be deliberate at every hop.

  1. Client opens a request (SSE or WebSocket).
  2. Server validates auth and quota before opening the expensive upstream stream.
  3. Gateway calls the provider with streaming enabled.
  4. Chunks flow back through your stack; tune proxies so they do not buffer the whole response.
  5. On completion, persist final text and token usage; use background work where safe, but plan for crashes.
  6. If the client disconnects, cancel upstream where possible to avoid burning money.
  7. On mid-stream errors, return partial content plus a clear recovery path (retry, continuation).
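A minimal FastAPI sketch of steps 2, 4, and 6: validate before opening the expensive upstream stream, forward chunks as SSE, and stop when the client disconnects. stream_from_gateway() is a placeholder for your gateway client.

# Hedged sketch of the streaming path: validate first, forward chunks, stop on
# client disconnect. stream_from_gateway() is an assumed async generator.
from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/chat/stream")
async def chat_stream(request: Request):
    body = await request.json()
    if "message" not in body:                      # validate before spending money
        raise HTTPException(status_code=422, detail="message is required")

    async def event_stream():
        async for chunk in stream_from_gateway(body["message"]):  # hypothetical client
            if await request.is_disconnected():    # client gone: stop forwarding
                break
            yield f"data: {chunk}\n\n"             # SSE framing
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")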

Figure 6 — Streaming sequence

sequenceDiagram
  participant C as Client
  participant B as API BFF
  participant G as LLM gateway
  participant P as Provider
  C->>B: Chat request stream
  B->>B: Auth quota
  B->>G: Start stream
  G->>P: Stream completion
  loop Chunks
    P-->>G: token chunk
    G-->>B: chunk
    B-->>C: SSE chunk
  end
  P-->>G: end plus usage
  B->>B: Persist transcript
            

8. Separate prompt management, LLM calling, post-processing, and response delivery

Prompt management treats prompts as versioned artifacts—“system v12,” “RAG user template v5”—with variables filled from structured context, rollback when quality drops, and A/B experiments tied to metrics.

LLM calling is transport: streaming, timeouts, provider errors. It should not know your HTML or mobile layout.

Post-processing repairs structured output (JSON schema validation and a constrained retry), strips unsafe markup, enforces citations, and redacts patterns that look like account numbers.

Response delivery maps the canonical assistant turn to the channel: SSE for web, compact JSON for mobile, different framing for voice assistants.
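A sketch of the structured-output repair step, assuming Pydantic for schema validation and a hypothetical call_model() helper; the validation error is fed back once as a constrained retry.

# Validate model output against a schema; on failure, retry once with the error.
from pydantic import BaseModel, ValidationError

class TicketSummary(BaseModel):          # the schema the model must satisfy
    title: str
    priority: str
    owner: str

def parse_with_retry(prompt: str, max_attempts: int = 2) -> TicketSummary:
    error_hint = ""
    for _ in range(max_attempts):
        raw = call_model(prompt + error_hint)            # hypothetical LLM call
        try:
            return TicketSummary.model_validate_json(raw)
        except ValidationError as exc:
            error_hint = f"\nYour previous output failed validation: {exc}. Return valid JSON only."
    raise ValueError("model did not produce valid structured output")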

Figure 7 — Pipeline of concerns

flowchart LR
  PM["Versioned prompts"] --> ASM["Assemble messages"]
  ASM --> GW["LLM call"]
  GW --> RAW["Raw output"]
  RAW --> POST["Post-process"]
  POST --> OUT["Deliver to client"]
            

9. Fallback architecture when the primary LLM provider is down or rate-limited

429 rate limits mean “slow down”; 5xx or timeouts mean “something is sick.” A production system assumes both.

Use an ordered backup path: secondary region or vendor, then a smaller or self-hosted model with honest UX (“backup assistant—answers may be shorter”), then cache or canned responses when safe, then async for non-interactive tasks.

Retries should use exponential backoff and honor Retry-After headers—never tight loops that amplify outages. A circuit breaker stops calling a failing dependency for a short window so you fail fast into backups instead of burning everyone’s latency budget.
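A sketch of that retry-and-degrade logic; the provider clients and exception types are illustrative, but the shape is the point: honor Retry-After, back off with jitter, and fall through an ordered list of providers.

# Ordered fallback with backoff. Provider clients are assumed to raise
# RateLimited (429) or ProviderDown (5xx/timeout); both are illustrative.
import random
import time

class RateLimited(Exception):            # maps 429 responses
    def __init__(self, retry_after: float | None = None):
        self.retry_after = retry_after

class ProviderDown(Exception):           # maps 5xx responses and timeouts
    pass

def call_with_fallback(providers, request, max_retries: int = 3):
    for provider in providers:           # primary, secondary, tertiary, ...
        for attempt in range(max_retries):
            try:
                return provider.complete(request)        # hypothetical client call
            except RateLimited as exc:   # "slow down": back off, retry same provider
                time.sleep(exc.retry_after or (2 ** attempt + random.random()))
            except ProviderDown:         # "something is sick": move down the ladder
                break
    raise RuntimeError("all providers failed; degrade to cached or canned response")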

Figure 8 — Degradation ladder

stateDiagram-v2
  [*] --> Primary
  Primary --> Secondary: outage or CB open
  Secondary --> Tertiary: still failing
  Tertiary --> FallbackUX: cache or static
            

10. LLM backend with FastAPI or Spring Boot exposing a chat completion endpoint

The framework is less important than the shape of the service. You want strict request size limits, validated schemas, async or reactive I/O while waiting on networks, a streaming response primitive that matches your frontend, correlation IDs on every request, consistent timeouts, and secrets from environment or a vault—not Git.

Keep controllers thin: a service layer assembles context, calls the gateway, and hands results to a presenter. Persistence and analytics should not block time-to-first-token if you can help it.
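A minimal sketch of that shape with FastAPI: a strict request schema, a correlation ID on every request, and a thin controller that delegates to a hypothetical chat_service.

# Thin controller: validate, tag with a correlation ID, delegate to a service.
# chat_service is a placeholder for your orchestration layer.
import uuid
from fastapi import FastAPI, Header
from pydantic import BaseModel, Field

app = FastAPI()

class ChatRequest(BaseModel):
    message: str = Field(max_length=8_000)   # strict request size limit
    session_id: str

class ChatResponse(BaseModel):
    answer: str
    correlation_id: str

@app.post("/v1/chat", response_model=ChatResponse)
async def chat(req: ChatRequest,
               x_correlation_id: str | None = Header(default=None)):
    correlation_id = x_correlation_id or str(uuid.uuid4())
    answer = await chat_service.respond(req.session_id, req.message, correlation_id)  # hypothetical
    return ChatResponse(answer=answer, correlation_id=correlation_id)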

11. Middleware between the application and multiple LLM providers

Vendors agree on the idea of “messages in, text out,” but disagree on JSON field names, tool formats, streaming events, and error bodies. Middleware exposes one stable internal contract; adapters translate to each vendor.

Centralize cross-cutting behavior: attach billing tags per tenant, redact prompts in logs, enforce global max_tokens, normalize errors, and route cheap vs expensive models consistently.
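A sketch of the contract-plus-adapters idea: the internal types stay stable, each vendor gets an adapter, and cross-cutting rules such as a global max_tokens ceiling live in the middleware. Vendor SDK calls are intentionally omitted.

# One stable internal contract; one adapter per vendor behind it.
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class ChatMessage:
    role: str
    content: str

@dataclass
class Completion:
    text: str
    input_tokens: int
    output_tokens: int

class ProviderAdapter(ABC):
    @abstractmethod
    def complete(self, messages: list[ChatMessage], max_tokens: int) -> Completion:
        """Translate the internal contract to one vendor's API and back."""

class Middleware:
    def __init__(self, adapters: dict[str, ProviderAdapter]):
        self.adapters = adapters                      # map model name -> adapter

    def complete(self, model: str, messages: list[ChatMessage],
                 max_tokens: int = 1024) -> Completion:
        max_tokens = min(max_tokens, 4096)            # enforce a global ceiling
        return self.adapters[model].complete(messages, max_tokens)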

Figure 9 — Adapters behind one API

flowchart TB
  APP["Application"] --> MID["LLM middleware"]
  MID --> A1["OpenAI adapter"]
  MID --> A2["Anthropic adapter"]
  MID --> A3["Gemini adapter"]
            

12. What is an LLM gateway? Production responsibilities

An LLM gateway is the single controlled front door for model traffic from your systems—like an API gateway, but aware of tokens, models, and safety hooks.

It should authenticate internal callers, enforce quotas and spend, route by policy and latency needs, implement retries and circuit breaking, attach structured telemetry, scrub or partition logs for PII, hold and rotate secrets, and provide a consistent place to turn off or throttle access during incidents.

13. Data flow for a real-time LLM app targeting sub-500ms responses

Be honest in interviews: a large remote frontier model plus retrieval often exceeds half a second to the first token. Sub-500ms end-to-end usually forces tradeoffs: tiny context, regional proximity, aggressive caching, a fast small model, or UX that acknowledges immediately and streams the heavy answer.

Clarify the NFR: is 500ms first byte or a complete short answer? Those imply different topologies.
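One concrete piece of that tradeoff is an exact-match cache in front of the model; a rough sketch, with the cache client (Redis-like get/set) and fast_model left as placeholders:

# Cache-first path: answer from cache on a normalized-question hit, otherwise
# fall through to a small, nearby model. Clients are illustrative.
import hashlib

def answer_fast(question: str, cache, fast_model) -> str:
    key = "qa:" + hashlib.sha256(question.strip().lower().encode()).hexdigest()
    cached = cache.get(key)
    if cached:                                   # hit: no model call at all
        return cached
    answer = fast_model.complete(question)       # fast small model
    cache.set(key, answer, ex=300)               # short TTL; cached answers go stale
    return answer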

Figure 10 — Tight-latency path (conceptual)

flowchart LR
  U[User] --> E[Edge API]
  E --> C{Cache hit}
  C -->|yes| R[Return]
  C -->|no| V[Small retrieve]
  V --> M[Fast model]
  M --> U
            

14. Architecture for online real-time and offline batch modes

Online paths optimize for a human waiting: short timeouts, partial results, cancel on disconnect. Offline paths optimize throughput and cost: durable queues, idempotent workers, retries, checkpoints for huge batches.

Share the same gateway and prompt registry where possible so batch summaries and chat answers do not feel like two different products unless you intend a separate “batch personality.”
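A sketch of an idempotent offline worker: each job carries a deterministic key derived from its inputs, so redeliveries and retries do not double the work or the spend. Queue, store, and gateway clients are placeholders.

# Idempotent batch worker keyed on (document, prompt version).
import hashlib
import json

def job_key(doc_id: str, prompt_version: str) -> str:
    return hashlib.sha256(f"{doc_id}:{prompt_version}".encode()).hexdigest()

def process_job(job: dict, results_store, gateway):
    key = job_key(job["doc_id"], job["prompt_version"])
    if results_store.exists(key):                # already done on a previous attempt
        return
    summary = gateway.complete(job["prompt"])    # same gateway as the online path
    results_store.put(key, json.dumps({"doc_id": job["doc_id"], "summary": summary}))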

Figure 11 — Shared gateway, different ingress

flowchart TB
  subgraph onl["Online"]
    UI[Users] --> API[Sync API]
    API --> GW[Gateway]
  end
  subgraph off["Offline"]
    SCH[Scheduler] --> Q[Queue]
    Q --> W[Workers]
    W --> GW
  end
            

15. System that processes structured JSON/SQL, semi-structured CSV, and unstructured PDF/email

Users upload different shapes of truth. Treating a 500-page PDF like a twenty-field JSON file guarantees garbage in—or out.

Why “type detection” matters

You must route each payload to the right parser. A filename like report.csv can be wrong: people rename files, email strips metadata, and tools mislabel attachments.

MIME type (simple explanation)

When data arrives over HTTP, the client often sends a Content-Type header such as application/pdf. That is a hint about intent. It helps—but senders can be mistaken or even misleading, so you do not stop there.

Magic bytes (simple explanation)

Many formats begin with a short, fixed signature in the raw bytes. A PDF often starts with the characters %PDF; PNG and JPEG have their own signatures. By reading the first small chunk of the file, you identify the actual format even when the extension or MIME header lies. That reduces “we sent a PNG to the CSV parser” failures.
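A small sketch of signature sniffing with a handful of well-known magic numbers; real systems check more formats and still fall back to try-parsing when nothing matches.

# Read the first bytes and compare against known signatures before trusting the
# extension or the Content-Type header.
MAGIC_SIGNATURES = {
    b"%PDF": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"PK\x03\x04": "application/zip",   # also docx/xlsx containers
}

def sniff_type(payload: bytes, declared_mime: str | None = None) -> str:
    head = payload[:16]
    for signature, mime in MAGIC_SIGNATURES.items():
        if head.startswith(signature):
            return mime                  # trust the bytes over the header
    return declared_mime or "application/octet-stream"   # fall back to the hint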

Per-shape handling

Structured (JSON, SQL results): parse and validate with code, not with a model. Use the LLM to interpret or explain verified facts. For Text-to-SQL, constrain to read-only views and audited queries so numbers are not hallucinated.

Semi-structured (CSV): parse with a real CSV library (quotes, embedded commas). Often aggregate or sample before sending hundreds of thousands of cells into a model—summaries and histograms are cheaper and safer.
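A sketch of that “profile instead of paste” idea: parse with the standard library, then hand the model column names, a row count, and a small sample rather than every cell.

# Compact profile of a CSV payload for the model, built with stdlib csv.
import csv
import io

def profile_csv(raw_text: str, sample_rows: int = 5) -> dict:
    reader = csv.DictReader(io.StringIO(raw_text))
    rows = list(reader)
    return {
        "columns": reader.fieldnames or [],
        "row_count": len(rows),
        "sample": rows[:sample_rows],    # a handful of rows for the model to see
    }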

Unstructured (PDF, email): use layout-aware extraction; for email, strip quoted threads and footers so legal disclaimers do not drown the user’s question. Produce chunks with metadata: page, section, source id.

At the end, unify into a canonical representation the orchestrator can pass to RAG or prompts—always with provenance so answers can cite spreadsheet vs narrative document.

Figure 12 — Ingest router to unified chunks

flowchart TB
  IN["Upload"] --> DET{"Detect MIME + signature + try-parse"}
  DET --> STR["Structured helpers"]
  DET --> CSV["CSV pathway"]
  DET --> UNS["PDF / email pathway"]
  STR --> U["Canonical chunks + metadata"]
  CSV --> U
  UNS --> U
  U --> ORCH["Orchestration / RAG"]
            

Recap

1. Layered stack: experience → API → orchestration → gateway → knowledge → data → governance.
2. Monolith for early speed; services when ownership and SLOs diverge.
3. Stateless app tier + shared session store.
4. Hot cache + durable log + summarization for long threads.
5. Shared platform primitives with strict tenant isolation.
6. NFRs before boxes: latency, cost, compliance, quality.
7. Streaming: validate early, forward chunks, persist, cancel on disconnect.
8. Versioned prompts ≠ transport ≠ post-processing ≠ channel delivery.
9. Backoff, circuit breaker, ordered fallbacks, honest degraded UX.
10. Framework + async/streaming/timeouts/tracing matter more than brand.
11. Adapters per vendor behind one middleware contract.
12. Gateway = policy + resilience + metering at the model edge.
13. Sub-500ms usually needs small context, cache, fast model, or UX.
14. Same core model; sync online vs queued offline workers.
15. Detect real type; specialized parsers; unify with metadata and provenance.
