Fully offline, air-gapped AI assistant for defense

The network has zero internet: no OpenAI, no cloud vector DB, no phone-home telemetry. You still need a fully functional assistant over classified documents at 99.9% uptime. This guide covers the full stack—models, local inference, on-prem retrieval, sneaker-net updates, and hardware you can defend in a design review.

Scenario

Design a fully offline, air-gapped AI assistant for a defense client.

The client operates in a network with zero internet access. No API calls to OpenAI, no cloud vector DBs, no external dependencies. They need a fully functional AI assistant over classified documents with 99.9% uptime.

Design the full stack—model selection, local inference, vector storage, updates, and hardware considerations.

What you should be able to do after reading:

Map every component to on-prem software with a supply-chain and accreditation story.
Size GPU inference + vector + ingest for classified corpus scale and concurrent analysts.
Explain how models and indexes update without the public internet (controlled transfer enclave).
Hit 99.9% with redundancy, health checks, and graceful degradation—not wishful SLAs.

Step 0 — Constraints you state up front

Constraint	Implication
No egress	All weights, containers, dependencies arrive via approved physical transfer
Classified data	Encryption at rest, need-to-know retrieval, audit every query
99.9% uptime	~8.7h downtime/year max—requires N+1 inference and index replicas
No SaaS	Postgres, Milvus/Qdrant, vLLM, Keycloak—or equivalents already accredited on base
Human updates only	Change windows, rollback bundles, dual control for model promotion

Step 1 — Clarifying questions

Question	Drives
Classification level (SECRET, TS/SCI)?	Hardware location, cross-domain rules
Corpus size and growth?	GPU count, disk, index sharding
Concurrent users?	Inference replicas, queue depth
Latency target per answer?	Model size (7B vs 70B) vs quality
Existing accredited OS / Kubernetes?	Deploy shape (K8s vs bare metal)
Allowed to fine-tune on classified text?	Adapter training inside enclave vs RAG-only

Step 2 — The sixty-second answer

Three-zone architecture: (1) Transfer enclave—one-way ingest of signed update bundles from removable media; (2) Data plane—classified doc store, local embed model, self-hosted vector + lexical indexes on encrypted NVMe; (3) Serving plane—vLLM (or accredited runtime) on N+1 GPU nodes behind an internal API gateway with SSO and entitlements. RAG is fully local: retrieve with ACL filters → generate with citations → audit log to WORM storage.

Model: approved open-weight instruct model (e.g. 13B–70B class) quantized for throughput; embed model matched and frozen per corpus version. Uptime: active/passive inference, replicated index shards, health-based drain, offline eval gate before any bundle promotion.

Phrase that lands well: “Air-gapped does not mean simple—it means every dependency is a supply-chain decision you can explain to security accreditation.”

Step 3 — Full stack architecture

flowchart TB
  subgraph transfer [Transfer enclave - low side or diode]
    MEDIA[Signed update media]
    SCAN[Malware + hash verify]
    STAGE[Staging registry]
  end
  subgraph classified [Classified enclave]
    DOC[Classified document store]
    ING[Ingest + OCR - on prem]
    EMB[Local embed service]
    VEC[(Vector DB cluster)]
    LEX[(Lexical index)]
    META[(Metadata Postgres)]
    LLM[Inference cluster vLLM]
    API[Assistant API + SSO]
    AUD[Audit WORM store]
  end
  subgraph ops [Operations]
    MON[On-prem metrics - no egress]
    BK[Encrypted backup]
  end
  MEDIA --> SCAN --> STAGE
  STAGE -->|one-way transfer| DOC
  STAGE -->|model weights| LLM
  DOC --> ING --> EMB
  EMB --> VEC
  ING --> LEX
  ING --> META
  API --> VEC
  API --> LEX
  API --> LLM
  API --> AUD
  LLM --> MON
  VEC --> BK

Step 4 — Model selection (no cloud APIs)

LLM (generation)

Criteria	Practical choice
Accreditation	Models already cleared by program office—or open weights with full SBOM scanned in transfer enclave
Quality vs hardware	13B–34B instruct at INT4/AWQ for most desks; 70B if budget allows A100/H100 fleet
Context length	8k–32k; RAG supplies facts—do not rely on huge context alone
License	Permissive for government use; document redistribution limits
Determinism	Fixed temperature for compliance answers; seed where supported

Embedding + rerank (retrieval quality)

Embed model running locally on CPU or small GPU pool (e.g. 100M–400M param sentence encoders).
Optional cross-encoder reranker on CPU—small model, big recall gain for acronyms and program names.
Version lock: corpus tagged with embed_model_version; no mixed vectors in one index.

Fine-tuning inside the enclave

Often RAG-only in v1 to avoid training-data governance pain. If adapters are allowed: train on sanitized pairs in transfer-approved pipeline; never export adapters out of enclave.

Step 5 — Local inference layer

Runtime

vLLM or TensorRT-LLM on Linux GPU nodes—continuous batching, OpenAI-compatible internal API (not public).
Container images built in transfer enclave, scanned, promoted—no docker pull from internet on classified side.
Separate pools: chat generation GPUs vs embed/rerank CPUs to prevent starvation.

Serving pattern

Client → mTLS API gateway → AuthZ (groups/clearance)
      → RAG orchestrator → retrieve (vector + lexical)
      → prompt builder (citations only from retrieved chunks)
      → vLLM replica → output filter → audit → response

Throughput sizing (example to state)

200 concurrent analysts, ~2 queries/min active peak → plan 40–80 in-flight generations; 4× GPU nodes with 2× A100 80GB each running 34B AWQ often sufficient—with N+1 spare node.

Step 6 — Vector and lexical storage (on-prem only)

Component	Option	Notes
Vector DB	Milvus, Qdrant, or Weaviate self-hosted	HA cluster, encrypted volumes, no license phone-home
Lexical	OpenSearch / Elasticsearch on-prem	Program numbers, NSNs, exact phrases
Metadata	Postgres + object store (MinIO on classified SAN)	ACL, classification label, doc version
Avoid	Pinecone, Weaviate Cloud, pgvector alone at 10M+ chunks	Cloud and scale limits

Security on indexes

Every chunk carries classification, compartments[], owner_org.
Retrieval applies filter-first from SSO claims—never global search then filter.
Disk encryption (FIPS-validated modules where required); keys in HSM or accredited KMS on-prem.

Step 7 — Classified document RAG pipeline

Ingest from accredited CMS, file shares, scanned PDFs—virus scan in lower enclave before one-way push if policy allows.
Parse/OCR on-prem (Tesseract or approved commercial OCR bundle transferred offline).
Chunk + label classification metadata at ingest—cannot be guessed at query time.
Embed + index async queue; incremental upsert.
Query hybrid retrieval + optional rerank → LLM with citation-only context.
Audit immutable log: user, clearance snapshot, chunk ids, model version, response hash.

Step 8 — Updates without the internet

Transfer enclave workflow

Build signed bundle on connected staging (SBOM, hashes, release notes): model weights, container images, index migrations, app binaries.
Verify signatures + antivirus on removable media or one-way diode.
Promote through dev → test → prod enclaves inside classified network with separate bundles.
Offline eval suite runs on test enclave; compliance sign-off before prod.
Prod rollout: blue/green inference deployment; vector index dual-write then cutover.

What gets updated how often

Artifact	Frequency	Mechanism
New classified documents	Daily–continuous	Internal ingest only—no external media
LLM / embed model	Quarterly or as needed	Signed bundle; full regression eval
Application code	Patch windows	Same bundle pipeline
Threat / malware defs	Synced to enclave policy	Separate approved defs bundle

Step 9 — Hardware and 99.9% uptime

Reference topology (adjust to program scale)

Tier	Hardware	Role
Inference	4–8× GPU servers (A100/H100/L40S per accreditation)	vLLM replicas, N+1
Embed / rerank	2× CPU-heavy nodes	Cheaper scale for indexing
Vector / search	3+ nodes, NVMe RAID	Milvus/OpenSearch quorum
Storage	Encrypted SAN or distributed storage	Raw docs + backups
Network	Internal load balancers, no default route	mTLS everywhere

Uptime mechanics (99.9% = design, not a single server)

Active/active inference behind LB; drain node on GPU ECC errors or latency SLO breach.
Replicated vector shards + automated failover; rehearse shard recovery quarterly.
Queue depth + backpressure—return 503 with retry-after instead of OOM crash.
Degraded mode: retrieval-only + template summary if all LLM nodes down (still useful, still audited).
RTO/RPO for index and doc store documented; encrypted backups air-gapped to tape or secondary vault.

99.9% over a year still allows planned maintenance—schedule bundle upgrades in windows with standby capacity.

Step 10 — Security and accreditation hooks

No egress: network policies deny 0.0.0.0/0; DNS internal only; block metadata IP endpoints on hypervisors.
Supply chain: SBOM per container; reproducible builds where possible.
Logging: on-prem SIEM; no third-party analytics SDKs in UI.
Prompt injection: treat documents as data; strip instruction-like patterns at ingest.
Output control: classification banner on answers; block export formats that bypass DLP.

Step 11 — Failure points and mitigations

Failure	Impact	Mitigation
GPU node loss	Queue latency	N+1 replicas; autoscale queue workers
Bad model bundle	Wrong or toxic answers	Eval gate; instant rollback pointer; keep N-1 weights
Index corruption	Miss retrieval	Rebuild from Postgres ledger; nightly checksum
ACL mapping bug	Over-delivery	Deny-by-default; penetration test per release
Transfer enclave compromise	Malware ingress	Hash + sign + dual control; one-way diode
Staff uploads malicious PDF	RCE on parser	Sandboxed parsers; strip active content
Capacity underestimate	SLO miss feels like outage	Load test in test enclave with synthetic corpus

Step 12 — How to walk through this in a design session

3 min — restate zero internet + classified + 99.9%.
7 min — three-zone diagram (transfer / classified / ops).
8 min — model + inference sizing and separation of embed vs chat GPUs.
7 min — on-prem vector + lexical + ACL retrieval.
8 min — sneaker-net update bundles and eval gates.
5 min — HA for 99.9% and degraded mode.
Close — “Every box is ours; every update is signed.”

Step 13 — Goals → knobs

Goal	Knob
Higher answer quality	Larger quant-aware model; better reranker; more GPU
Lower hardware cost	Smaller LLM + stronger RAG; INT4; fewer concurrent slots
Higher uptime	More inference replicas; faster index failover; degraded read-only mode
Faster accreditation	Reuse already ATO components; minimize novel dependencies
Safer disclosure	Stricter retrieval filters; citation-only generation; human review tier

The one line to remember

An air-gapped defense assistant is a accredited private cloud in a room: signed bundles in, classified RAG and local inference in the middle, immutable audit out—and 99.9% uptime comes from redundant GPUs and indexes, not from any external SLA you cannot control.