Fully offline, air-gapped AI assistant for defense
The network has zero internet: no OpenAI, no cloud vector DB, no phone-home telemetry. You still need a fully functional assistant over classified documents at 99.9% uptime. This guide covers the full stack—models, local inference, on-prem retrieval, sneaker-net updates, and hardware you can defend in a design review.
Scenario
Design a fully offline, air-gapped AI assistant for a defense client.
The client operates in a network with zero internet access. No API calls to OpenAI, no cloud vector DBs, no external dependencies. They need a fully functional AI assistant over classified documents with 99.9% uptime.
Design the full stack—model selection, local inference, vector storage, updates, and hardware considerations.
What you should be able to do after reading:
- Map every component to on-prem software with a supply-chain and accreditation story.
- Size GPU inference + vector + ingest for classified corpus scale and concurrent analysts.
- Explain how models and indexes update without the public internet (controlled transfer enclave).
- Hit 99.9% with redundancy, health checks, and graceful degradation—not wishful SLAs.
Step 0 — Constraints you state up front
| Constraint | Implication |
|---|---|
| No egress | All weights, containers, dependencies arrive via approved physical transfer |
| Classified data | Encryption at rest, need-to-know retrieval, audit every query |
| 99.9% uptime | At most ~8.76 h of downtime per year (0.1% of 8,760 h); requires N+1 inference and index replicas |
| No SaaS | Postgres, Milvus/Qdrant, vLLM, Keycloak—or equivalents already accredited on base |
| Human updates only | Change windows, rollback bundles, dual control for model promotion |
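The 99.9% row converts to a downtime budget with one line of arithmetic. A minimal sketch, assuming a 365-day year; the function name is illustrative and not part of any tool named in this guide.

```python
HOURS_PER_YEAR = 365 * 24  # 8760; assumes a non-leap year


def downtime_budget_hours(availability: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the maximum downtime, in hours, allowed by an availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_hours


# About 8.76 hours per year at 99.9%.
yearly_hours = downtime_budget_hours(0.999)

# About 43.2 minutes per 30-day month at the same target.
monthly_minutes = downtime_budget_hours(0.999, period_hours=30 * 24) * 60
```

Stating the monthly figure as well is useful in a review, because a single failed bundle promotion can consume most of a month's budget.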
Step 1 — Clarifying questions
| Question | Drives |
|---|---|
| Classification level (SECRET, TS/SCI)? | Hardware location, cross-domain rules |
| Corpus size and growth? | GPU count, disk, index sharding |
| Concurrent users? | Inference replicas, queue depth |
| Latency target per answer? | Model size (7B vs 70B) vs quality |
| Existing accredited OS / Kubernetes? | Deploy shape (K8s vs bare metal) |
| Allowed to fine-tune on classified text? | Adapter training inside enclave vs RAG-only |
Step 2 — The sixty-second answer
Three-zone architecture: (1) Transfer enclave—one-way ingest of signed update bundles from removable media; (2) Data plane—classified doc store, local embed model, self-hosted vector + lexical indexes on encrypted NVMe; (3) Serving plane—vLLM (or accredited runtime) on N+1 GPU nodes behind an internal API gateway with SSO and entitlements. RAG is fully local: retrieve with ACL filters → generate with citations → audit log to WORM storage.
Model: approved open-weight instruct model (e.g. 13B–70B class) quantized for throughput; embed model matched and frozen per corpus version. Uptime: active/active inference with N+1 capacity, replicated index shards, health-based drain, offline eval gate before any bundle promotion.
Phrase that lands well: “Air-gapped does not mean simple—it means every dependency is a supply-chain decision you can explain to security accreditation.”
Step 3 — Full stack architecture
flowchart TB
subgraph transfer [Transfer enclave - low side or diode]
MEDIA[Signed update media]
SCAN[Malware + hash verify]
STAGE[Staging registry]
end
subgraph classified [Classified enclave]
DOC[Classified document store]
ING[Ingest + OCR - on prem]
EMB[Local embed service]
VEC[(Vector DB cluster)]
LEX[(Lexical index)]
META[(Metadata Postgres)]
LLM[Inference cluster vLLM]
API[Assistant API + SSO]
AUD[Audit WORM store]
end
subgraph ops [Operations]
MON[On-prem metrics - no egress]
BK[Encrypted backup]
end
MEDIA --> SCAN --> STAGE
STAGE -->|one-way transfer| DOC
STAGE -->|model weights| LLM
DOC --> ING --> EMB
EMB --> VEC
ING --> LEX
ING --> META
API --> VEC
API --> LEX
API --> LLM
API --> AUD
LLM --> MON
VEC --> BK
Step 4 — Model selection (no cloud APIs)
LLM (generation)
| Criteria | Practical choice |
|---|---|
| Accreditation | Models already cleared by program office—or open weights with full SBOM scanned in transfer enclave |
| Quality vs hardware | 13B–34B instruct at INT4/AWQ for most desks; 70B if budget allows A100/H100 fleet |
| Context length | 8k–32k; RAG supplies facts—do not rely on huge context alone |
| License | Permissive for government use; document redistribution limits |
| Determinism | Fixed temperature for compliance answers; seed where supported |
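The "Quality vs hardware" row can be sanity-checked with a weights-only memory estimate. This is a rough sketch under stated assumptions: it counts parameter storage only, ignores KV cache, activations, and the per-group scale overhead that INT4 formats such as AWQ add, and uses an illustrative function name.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Compare FP16 and INT4 footprints for the model classes in the table.
for name, params in [("13B", 13), ("34B", 34), ("70B", 70)]:
    fp16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"{name}: FP16 ~{fp16:.1f} GB, INT4 ~{int4:.1f} GB (weights only)")
```

Whatever memory remains on the card after weights is what the runtime has for KV cache, which in turn bounds how many requests can be batched at once.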
Embedding + rerank (retrieval quality)
- Embed model running locally on CPU or small GPU pool (e.g. 100M–400M param sentence encoders).
- Optional cross-encoder reranker on CPU: a small model that gives a large gain in top-k relevance for acronyms and program names.
- Version lock: corpus tagged with `embed_model_version`; no mixed vectors in one index.
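The version lock can be enforced at write time rather than by convention. A minimal in-memory sketch, assuming the index wrapper owns the check; `VersionLockedIndex` and `EmbedVersionMismatch` are hypothetical names, not the API of Milvus, Qdrant, or any other product.

```python
from dataclasses import dataclass, field


class EmbedVersionMismatch(ValueError):
    """Raised when a vector comes from a different embed model than the index expects."""


@dataclass
class VersionLockedIndex:
    embed_model_version: str
    vectors: dict = field(default_factory=dict)  # chunk_id -> vector

    def upsert(self, chunk_id: str, vector: list, embed_model_version: str) -> None:
        # Reject the write instead of silently mixing vector spaces.
        if embed_model_version != self.embed_model_version:
            raise EmbedVersionMismatch(
                f"index expects {self.embed_model_version!r}, got {embed_model_version!r}"
            )
        self.vectors[chunk_id] = vector


index = VersionLockedIndex(embed_model_version="embed-v1")
index.upsert("chunk-001", [0.12, 0.88], embed_model_version="embed-v1")
```

A new embed model then means building a new index alongside the old one and cutting over, never re-embedding in place.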
Fine-tuning inside the enclave
Often RAG-only in v1 to avoid training-data governance pain. If adapters are allowed: train on sanitized pairs in transfer-approved pipeline; never export adapters out of enclave.
Step 5 — Local inference layer
Runtime
- vLLM or TensorRT-LLM on Linux GPU nodes—continuous batching, OpenAI-compatible internal API (not public).
- Container images built in transfer enclave, scanned, promoted; no `docker pull` from the internet on the classified side.
- Separate pools: chat generation GPUs vs embed/rerank CPUs to prevent starvation.
Serving pattern
Client → mTLS API gateway → AuthZ (groups/clearance)
→ RAG orchestrator → retrieve (vector + lexical)
→ prompt builder (citations only from retrieved chunks)
→ vLLM replica → output filter → audit → response
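The serving pattern above can be sketched as a single function with each stage injected as a callable. This is an illustration under assumptions, not a reference implementation: `answer`, `Chunk`, and the callable signatures are hypothetical, the output filter is a placeholder, and the real `generate` would call the internal inference endpoint.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Chunk:
    chunk_id: str
    text: str


def answer(
    user: str,
    groups: set,
    question: str,
    retrieve: Callable[[str, set], list],
    generate: Callable[[str], str],
    audit: Callable[[dict], None],
) -> str:
    """Run one request through the stages of the serving pattern."""
    # AuthZ: deny by default when SSO supplies no entitlements.
    if not groups:
        audit({"user": user, "outcome": "denied"})
        raise PermissionError("no entitlements")

    # Retrieval is expected to apply ACL filters using the caller's groups.
    chunks = retrieve(question, groups)

    # Prompt builder: the only context is the retrieved chunks, tagged by id.
    context = "\n".join(f"[{c.chunk_id}] {c.text}" for c in chunks)
    prompt = (
        "Answer using only the sources below and cite their ids.\n"
        f"{context}\n\nQuestion: {question}"
    )

    raw = generate(prompt)   # call to an inference replica
    response = raw.strip()   # placeholder for the real output filter

    audit({"user": user, "outcome": "answered",
           "chunk_ids": [c.chunk_id for c in chunks]})
    return response
```

Injecting the stages keeps the orchestrator testable in the test enclave with stubbed retrieval and generation.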
Throughput sizing (example to state)
200 concurrent analysts, ~2 queries/min active peak → plan 40–80 in-flight generations; 4× GPU nodes with 2× A100 80GB each running 34B AWQ often sufficient—with N+1 spare node.
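The 40–80 in-flight figure follows from Little's law (in-flight = arrival rate × time in system). The source does not state a per-answer latency; 6 to 12 seconds is assumed here because it is the range that reproduces the quoted numbers. The function name is illustrative.

```python
def in_flight_requests(users: int, queries_per_user_per_min: float, latency_s: float) -> float:
    """Little's law: average concurrent requests = arrival rate * time in system."""
    arrival_rate_per_s = users * queries_per_user_per_min / 60.0
    return arrival_rate_per_s * latency_s


# 200 analysts at 2 queries/min is about 6.7 queries per second.
low = in_flight_requests(200, 2, latency_s=6.0)    # about 40 concurrent generations
high = in_flight_requests(200, 2, latency_s=12.0)  # about 80 concurrent generations
```

If measured generation latency in the test enclave falls outside the assumed range, the in-flight target and the node count should be recomputed rather than carried over.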
Step 6 — Vector and lexical storage (on-prem only)
| Component | Option | Notes |
|---|---|---|
| Vector DB | Milvus, Qdrant, or Weaviate self-hosted | HA cluster, encrypted volumes, no license phone-home |
| Lexical | OpenSearch / Elasticsearch on-prem | Program numbers, NSNs, exact phrases |
| Metadata | Postgres + object store (MinIO on classified SAN) | ACL, classification label, doc version |
| Avoid | Pinecone, Weaviate Cloud, pgvector alone at 10M+ chunks | Cloud and scale limits |
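At query time the vector and lexical rows of this table each return their own ranked list, and something has to merge them. The source does not name a fusion method; the sketch below uses reciprocal rank fusion (RRF) as one common choice, with `k = 60` as the conventional default. Chunk ids are made up for illustration.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Merge several ranked lists of ids; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


vector_hits = ["c3", "c1", "c7"]    # from the vector DB
lexical_hits = ["c1", "c9", "c3"]   # from the lexical index
fused = reciprocal_rank_fusion([vector_hits, lexical_hits])
```

RRF needs only ranks, not comparable scores, which is convenient when the two engines score on unrelated scales.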
Security on indexes
- Every chunk carries `classification`, `compartments[]`, `owner_org`.
- Retrieval applies filter-first from SSO claims; never global search then filter.
- Disk encryption (FIPS-validated modules where required); keys in HSM or accredited KMS on-prem.
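Filter-first retrieval can be illustrated with plain lists: the ACL predicate runs before any similarity scoring, so chunks the caller may not see never enter the candidate set. This is a sketch under assumptions: the level ordering and the compartment subset rule are simplified stand-ins for the program's real policy, `owner_org` is carried but not used in the check, and a production vector DB would push the same filter into its query rather than scan in Python.

```python
from dataclasses import dataclass

# Illustrative ordering only; the real rules come from the program's security policy.
LEVELS = {"UNCLASSIFIED": 0, "CONFIDENTIAL": 1, "SECRET": 2, "TOP SECRET": 3}


@dataclass(frozen=True)
class IndexedChunk:
    chunk_id: str
    classification: str
    compartments: frozenset
    owner_org: str
    vector: tuple


def is_permitted(chunk: IndexedChunk, clearance: str, compartments: frozenset) -> bool:
    level_ok = LEVELS[chunk.classification] <= LEVELS[clearance]
    compartments_ok = chunk.compartments <= compartments  # every compartment must be held
    return level_ok and compartments_ok


def filter_first_search(chunks, query, clearance, compartments, top_k=3):
    # Step 1: restrict to permitted chunks BEFORE scoring anything.
    allowed = [c for c in chunks if is_permitted(c, clearance, compartments)]

    # Step 2: rank only the permitted chunks by dot-product similarity.
    def score(chunk):
        return sum(a * b for a, b in zip(chunk.vector, query))

    return sorted(allowed, key=score, reverse=True)[:top_k]
```

The opposite order (search globally, then drop forbidden hits) leaks information through result counts and timing even when no text is returned.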
Step 7 — Classified document RAG pipeline
- Ingest: from accredited CMS, file shares, scanned PDFs; virus scan in lower enclave before one-way push if policy allows.
- Parse/OCR: on-prem (Tesseract or approved commercial OCR bundle transferred offline).
- Chunk + label: attach classification metadata at ingest; it cannot be guessed at query time.
- Embed + index: async queue; incremental upsert.
- Query: hybrid retrieval + optional rerank → LLM with citation-only context.
- Audit: immutable log of user, clearance snapshot, chunk ids, model version, response hash.
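The audit step at the end of the pipeline can be sketched as a record builder. The fields mirror the ones listed above; the hash chain (`prev_hash`, `record_hash`) is an added assumption that makes tampering detectable on top of WORM storage, and all function and field names are illustrative.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(user, clearance, chunk_ids, model_version, response, prev_hash):
    """Build one audit entry; the response is stored as a hash, not as text."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "clearance_snapshot": clearance,
        "chunk_ids": list(chunk_ids),
        "model_version": model_version,
        "response_hash": hashlib.sha256(response.encode("utf-8")).hexdigest(),
        "prev_hash": prev_hash,
    }
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["record_hash"] = hashlib.sha256(payload).hexdigest()
    return record


def verify_chain(records) -> bool:
    """Recompute each record hash and check that each entry links to the previous one."""
    prev = None
    for rec in records:
        body = {k: v for k, v in rec.items() if k != "record_hash"}
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        if hashlib.sha256(payload).hexdigest() != rec["record_hash"]:
            return False
        if prev is not None and rec["prev_hash"] != prev:
            return False
        prev = rec["record_hash"]
    return True
```

Storing a hash of the response keeps classified answer text out of the log while still letting an investigator prove which answer was delivered.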
Step 8 — Updates without the internet
Transfer enclave workflow
- Build signed bundle on connected staging (SBOM, hashes, release notes): model weights, container images, index migrations, app binaries.
- Verify signatures + antivirus on removable media or one-way diode.
- Promote through dev → test → prod enclaves inside classified network with separate bundles.
- Offline eval suite runs on test enclave; compliance sign-off before prod.
- Prod rollout: blue/green inference deployment; vector index dual-write then cutover.
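The hash-verification part of the workflow above can be sketched with the standard library. Assumptions: the manifest is a mapping of relative path to SHA-256 hex digest, and the signature over the manifest itself has already been checked by whatever accredited signing tool the program uses (not shown). `verify_bundle` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte weights do not load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_bundle(bundle_dir: Path, manifest: dict) -> list:
    """Return a list of problems; an empty list means the bundle matches the manifest."""
    problems = []
    for rel_path, expected in manifest.items():
        target = bundle_dir / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_file(target) != expected:
            problems.append(f"hash mismatch: {rel_path}")

    # Files on the media that the manifest does not list are also rejected.
    for path in sorted(bundle_dir.rglob("*")):
        rel = path.relative_to(bundle_dir).as_posix()
        if path.is_file() and rel not in manifest:
            problems.append(f"unlisted file: {rel}")
    return problems
```

Rejecting unlisted files matters as much as checking listed ones: extra content on the media is exactly what an attacker would add.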
What gets updated how often
| Artifact | Frequency | Mechanism |
|---|---|---|
| New classified documents | Daily–continuous | Internal ingest only—no external media |
| LLM / embed model | Quarterly or as needed | Signed bundle; full regression eval |
| Application code | Patch windows | Same bundle pipeline |
| Threat / malware defs | Synced to enclave policy | Separate approved defs bundle |
Step 9 — Hardware and 99.9% uptime
Reference topology (adjust to program scale)
| Tier | Hardware | Role |
|---|---|---|
| Inference | 4–8× GPU servers (A100/H100/L40S per accreditation) | vLLM replicas, N+1 |
| Embed / rerank | 2× CPU-heavy nodes | Cheaper scale for indexing |
| Vector / search | 3+ nodes, NVMe RAID | Milvus/OpenSearch quorum |
| Storage | Encrypted SAN or distributed storage | Raw docs + backups |
| Network | Internal load balancers, no default route | mTLS everywhere |
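The value of the N+1 entry in the inference tier can be estimated with a k-of-n availability calculation. This is a sketch under strong assumptions: the 99% per-node availability is an invented illustrative figure, and node failures are treated as independent, which shared power, a shared switch, or a bad bundle would violate.

```python
from math import comb


def k_of_n_availability(n: int, k: int, node_availability: float) -> float:
    """Probability that at least k of n independent nodes are up."""
    p = node_availability
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Four nodes needed for peak load, each assumed 99% available.
no_spare = k_of_n_availability(n=4, k=4, node_availability=0.99)    # about 0.9606
with_spare = k_of_n_availability(n=5, k=4, node_availability=0.99)  # about 0.9990
```

Under these assumptions a single spare moves the tier from roughly 96% to just above 99.9%, which is why the target cannot be met by any single server however reliable.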
Uptime mechanics (99.9% = design, not a single server)
- Active/active inference behind LB; drain node on GPU ECC errors or latency SLO breach.
- Replicated vector shards + automated failover; rehearse shard recovery quarterly.
- Queue depth + backpressure—return 503 with retry-after instead of OOM crash.
- Degraded mode: retrieval-only + template summary if all LLM nodes down (still useful, still audited).
- RTO/RPO for index and doc store documented; encrypted backups air-gapped to tape or secondary vault.
99.9% over a year still allows planned maintenance—schedule bundle upgrades in windows with standby capacity.
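The backpressure bullet above can be sketched as a small admission controller that refuses work once a concurrency cap is reached. Assumptions: the cap and retry interval are illustrative values, the class name is hypothetical, and this version is not thread-safe (a real gateway would guard the counter with a lock or use its own rate-limiting feature).

```python
from dataclasses import dataclass


@dataclass
class AdmissionController:
    max_in_flight: int
    retry_after_s: int = 5
    in_flight: int = 0

    def try_admit(self):
        """Return (status, headers): 200 if admitted, 503 with Retry-After if full."""
        if self.in_flight >= self.max_in_flight:
            return 503, {"Retry-After": str(self.retry_after_s)}
        self.in_flight += 1
        return 200, {}

    def release(self) -> None:
        """Call when a generation finishes, whether it succeeded or failed."""
        self.in_flight = max(0, self.in_flight - 1)


gate = AdmissionController(max_in_flight=2)
first = gate.try_admit()    # admitted
second = gate.try_admit()   # admitted
third = gate.try_admit()    # rejected: cap reached
```

A fast, explicit 503 keeps the GPU nodes inside their memory envelope and gives clients a clear signal to retry, instead of letting a queue grow until a process is killed.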
Step 10 — Security and accreditation hooks
- No egress: network policies deny 0.0.0.0/0; DNS internal only; block metadata IP endpoints on hypervisors.
- Supply chain: SBOM per container; reproducible builds where possible.
- Logging: on-prem SIEM; no third-party analytics SDKs in UI.
- Prompt injection: treat documents as data; strip instruction-like patterns at ingest.
- Output control: classification banner on answers; block export formats that bypass DLP.
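The no-egress bullet above can be backed by an audit helper that checks configured destinations against the enclave's address ranges. This is a review-time sketch, not an enforcement mechanism (firewalls and network policies do the enforcing). The `10.0.0.0/8` range is a hypothetical enclave range; `169.254.169.254` is the link-local address commonly used for instance metadata services.

```python
import ipaddress

# Hypothetical internal range; replace with the enclave's accredited address plan.
INTERNAL_NETS = [ipaddress.ip_network("10.0.0.0/8")]

# Link-local metadata endpoint exposed by many virtualization stacks.
METADATA_IP = ipaddress.ip_address("169.254.169.254")


def egress_allowed(dest: str) -> bool:
    """Allow a destination only if it is inside an internal range and not the metadata IP."""
    addr = ipaddress.ip_address(dest)
    if addr == METADATA_IP:
        return False
    return any(addr in net for net in INTERNAL_NETS)


def audit_destinations(destinations):
    """Return the configured destinations that would violate the no-egress rule."""
    return [d for d in destinations if not egress_allowed(d)]
```

Running a check like this over every service's configured endpoints before promotion catches a stray public address in a config file before the firewall has to.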
Step 11 — Failure points and mitigations
| Failure | Impact | Mitigation |
|---|---|---|
| GPU node loss | Queue latency | N+1 replicas; autoscale queue workers |
| Bad model bundle | Wrong or toxic answers | Eval gate; instant rollback pointer; keep N-1 weights |
| Index corruption | Missed or failed retrieval | Rebuild from Postgres ledger; nightly checksum |
| ACL mapping bug | Over-delivery | Deny-by-default; penetration test per release |
| Transfer enclave compromise | Malware ingress | Hash + sign + dual control; one-way diode |
| Staff uploads malicious PDF | RCE on parser | Sandboxed parsers; strip active content |
| Capacity underestimate | SLO miss feels like outage | Load test in test enclave with synthetic corpus |
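The "instant rollback pointer; keep N-1 weights" mitigation in the table can be sketched as a registry that tracks the current and previous model versions. Names are hypothetical, and the `eval_passed` flag stands in for the offline eval gate and sign-off described earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelRegistry:
    current: str
    previous: Optional[str] = None  # N-1 weights kept on disk for rollback

    def promote(self, new_version: str, eval_passed: bool) -> None:
        """Point serving at a new version, but only after the eval gate passes."""
        if not eval_passed:
            raise ValueError(f"{new_version} did not pass the offline eval gate")
        self.previous, self.current = self.current, new_version

    def rollback(self) -> str:
        """Flip the pointer back to the N-1 version and return it."""
        if self.previous is None:
            raise RuntimeError("no previous version retained")
        self.current, self.previous = self.previous, None
        return self.current


registry = ModelRegistry(current="llm-2024q1")
registry.promote("llm-2024q2", eval_passed=True)
restored = registry.rollback()  # serving points at "llm-2024q1" again
```

Because rollback only moves a pointer, it takes effect as fast as replicas can reload, provided the N-1 weights were never deleted from local disk.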
Step 12 — How to walk through this in a design session
- 3 min — restate zero internet + classified + 99.9%.
- 7 min — three-zone diagram (transfer / classified / ops).
- 8 min — model + inference sizing and separation of embed vs chat GPUs.
- 7 min — on-prem vector + lexical + ACL retrieval.
- 8 min — sneaker-net update bundles and eval gates.
- 5 min — HA for 99.9% and degraded mode.
- Close — “Every box is ours; every update is signed.”
Step 13 — Goals → knobs
| Goal | Knob |
|---|---|
| Higher answer quality | Larger model (chosen with quantization in mind); better reranker; more GPU |
| Lower hardware cost | Smaller LLM + stronger RAG; INT4; fewer concurrent slots |
| Higher uptime | More inference replicas; faster index failover; degraded read-only mode |
| Faster accreditation | Reuse components that already hold an ATO; minimize novel dependencies |
| Safer disclosure | Stricter retrieval filters; citation-only generation; human review tier |
The one line to remember
An air-gapped defense assistant is an accredited private cloud in a room: signed bundles in, classified RAG and local inference in the middle, immutable audit out. The 99.9% uptime comes from redundant GPUs and indexes, not from any external SLA you cannot control.