sharpbyte.dev

Fully offline, air-gapped AI assistant for defense

The network has zero internet: no OpenAI, no cloud vector DB, no phone-home telemetry. You still need a fully functional assistant over classified documents at 99.9% uptime. This guide covers the full stack—models, local inference, on-prem retrieval, sneaker-net updates, and hardware you can defend in a design review.

Scenario

Design a fully offline, air-gapped AI assistant for a defense client.

The client operates in a network with zero internet access. No API calls to OpenAI, no cloud vector DBs, no external dependencies. They need a fully functional AI assistant over classified documents with 99.9% uptime.

Design the full stack—model selection, local inference, vector storage, updates, and hardware considerations.

What you should be able to do after reading:

Step 0 — Constraints you state up front

Constraint | Implication
No egress | All weights, containers, and dependencies arrive via approved physical transfer
Classified data | Encryption at rest, need-to-know retrieval, audit of every query
99.9% uptime | At most ~8.76 h downtime per year; requires N+1 inference and index replicas
No SaaS | Postgres, Milvus/Qdrant, vLLM, Keycloak, or equivalents already accredited on base
Human updates only | Change windows, rollback bundles, dual control for model promotion
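
To make the 99.9% row concrete, the downtime budget can be computed directly. This is a minimal sketch; the function name `downtime_budget_hours` is illustrative, and a 365-day year and 30-day month are assumed.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, assuming a non-leap year


def downtime_budget_hours(availability: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the maximum downtime (in hours) allowed by an availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_hours


yearly_hours = downtime_budget_hours(0.999)
monthly_minutes = downtime_budget_hours(0.999, period_hours=30 * 24) * 60
print(f"99.9% over a year:  {yearly_hours:.2f} h")       # 8.76 h
print(f"99.9% over 30 days: {monthly_minutes:.1f} min")  # 43.2 min
```

The monthly figure is often the more useful one in a review: a single unplanned outage longer than about 43 minutes consumes a whole month of budget.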

Step 1 — Clarifying questions

Question | Drives
Classification level (SECRET, TS/SCI)? | Hardware location, cross-domain rules
Corpus size and growth? | GPU count, disk, index sharding
Concurrent users? | Inference replicas, queue depth
Latency target per answer? | Model size (7B vs 70B) vs quality
Existing accredited OS / Kubernetes? | Deploy shape (K8s vs bare metal)
Allowed to fine-tune on classified text? | Adapter training inside enclave vs RAG-only

Step 2 — The sixty-second answer

Three-zone architecture. (1) Transfer enclave: one-way ingest of signed update bundles from removable media; (2) Data plane: classified doc store, local embed model, self-hosted vector + lexical indexes on encrypted NVMe; (3) Serving plane: vLLM (or accredited runtime) on N+1 GPU nodes behind an internal API gateway with SSO and entitlements. RAG is fully local: retrieve with ACL filters → generate with citations → audit log to WORM storage.

Model: approved open-weight instruct model (e.g. 13B–70B class) quantized for throughput; embed model matched and frozen per corpus version. Uptime: active/passive inference, replicated index shards, health-based drain, offline eval gate before any bundle promotion.

Phrase that lands well: “Air-gapped does not mean simple—it means every dependency is a supply-chain decision you can explain to security accreditation.”

Step 3 — Full stack architecture

flowchart TB
  subgraph transfer [Transfer enclave - low side or diode]
    MEDIA[Signed update media]
    SCAN[Malware + hash verify]
    STAGE[Staging registry]
  end
  subgraph classified [Classified enclave]
    DOC[Classified document store]
    ING[Ingest + OCR - on prem]
    EMB[Local embed service]
    VEC[(Vector DB cluster)]
    LEX[(Lexical index)]
    META[(Metadata Postgres)]
    LLM[Inference cluster vLLM]
    API[Assistant API + SSO]
    AUD[Audit WORM store]
  end
  subgraph ops [Operations]
    MON[On-prem metrics - no egress]
    BK[Encrypted backup]
  end
  MEDIA --> SCAN --> STAGE
  STAGE -->|one-way transfer| DOC
  STAGE -->|model weights| LLM
  DOC --> ING --> EMB
  EMB --> VEC
  ING --> LEX
  ING --> META
  API --> VEC
  API --> LEX
  API --> LLM
  API --> AUD
  LLM --> MON
  VEC --> BK
    

Step 4 — Model selection (no cloud APIs)

LLM (generation)

Criterion | Practical choice
Accreditation | Models already cleared by the program office, or open weights with a full SBOM scanned in the transfer enclave
Quality vs hardware | 13B–34B instruct at INT4/AWQ for most desks; 70B if budget allows an A100/H100 fleet
Context length | 8k–32k tokens; RAG supplies the facts, so do not rely on a huge context alone
License | Permissive for government use; document redistribution limits
Determinism | Fixed temperature for compliance answers; seed where supported
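
The quality-vs-hardware row comes down largely to how much GPU memory the weights need at a given precision. A rough sketch of that arithmetic follows; `weight_memory_gib` is a hypothetical helper, and the estimate covers weights only. KV cache, activations, and quantization metadata (scales, zero points) add to it, so treat the result as a lower bound.

```python
def weight_memory_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


for size in (13, 34, 70):
    fp16 = weight_memory_gib(size, 16)
    int4 = weight_memory_gib(size, 4)
    print(f"{size}B: FP16 ~{fp16:.0f} GiB, 4-bit ~{int4:.0f} GiB (weights only)")
```

Whatever memory is left after loading the weights is what the runtime can spend on KV cache, which in turn bounds how many generations a node can run concurrently.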

Embedding + rerank (retrieval quality)

Fine-tuning inside the enclave

Often RAG-only in v1 to avoid training-data governance pain. If adapters are allowed, train on sanitized pairs in a transfer-approved pipeline and never export adapters out of the enclave.

Step 5 — Local inference layer

Runtime

Serving pattern

Client → mTLS API gateway → AuthZ (groups/clearance)
      → RAG orchestrator → retrieve (vector + lexical)
      → prompt builder (citations only from retrieved chunks)
      → vLLM replica → output filter → audit → response
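
The prompt-builder and output-filter steps in the flow above could look roughly like this. It is a sketch under stated assumptions: the `Chunk` type, the bracketed `[chunk_id]` citation format, and both function names are hypothetical, and a real output filter would check much more than citations.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Build a prompt whose only factual context is the retrieved chunks."""
    context = "\n".join(f"[{c.chunk_id}] {c.text}" for c in chunks)
    return (
        "Answer using only the sources below. Cite a source id in square "
        "brackets after each claim. If the sources do not contain the answer, "
        "say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def unknown_citations(answer: str, chunks: list[Chunk]) -> set[str]:
    """Return cited ids that were not in the retrieved set (ideally empty)."""
    allowed = {c.chunk_id for c in chunks}
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    return cited - allowed


chunks = [Chunk("doc-17#3", "Pump P-4 is serviced every 90 days.")]
prompt = build_prompt("How often is pump P-4 serviced?", chunks)
bad = unknown_citations("Every 90 days [doc-17#3]. See also [doc-99#1].", chunks)
print(bad)  # {'doc-99#1'}
```

An answer that cites an id outside the retrieved set can be rejected or regenerated before it reaches the audit and response steps.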

Throughput sizing (example to state)

200 concurrent analysts at ~2 queries/min each during active peak → plan for 40–80 in-flight generations. 4 GPU nodes with 2× A100 80GB each, running a 34B AWQ model, are often sufficient, with an N+1 spare node.
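
One way to arrive at the 40–80 figure is Little's law (average concurrency = arrival rate × time in system). The sketch below assumes 6–12 seconds per generated answer; that latency range is an assumption chosen to match the numbers above, not a measurement, and `in_flight_requests` is an illustrative name.

```python
def in_flight_requests(users: int, queries_per_user_per_min: float,
                       seconds_per_answer: float) -> float:
    """Little's law: average concurrency = arrival rate * time in system."""
    arrivals_per_sec = users * queries_per_user_per_min / 60.0
    return arrivals_per_sec * seconds_per_answer


for latency_s in (6, 12):
    n = in_flight_requests(users=200, queries_per_user_per_min=2,
                           seconds_per_answer=latency_s)
    print(f"{latency_s:>2} s per answer -> ~{n:.0f} in-flight generations")
```

The same formula shows why latency matters for hardware count: halving time per answer halves the concurrency each node has to sustain.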

Step 6 — Vector and lexical storage (on-prem only)

Component | Option | Notes
Vector DB | Milvus, Qdrant, or Weaviate self-hosted | HA cluster, encrypted volumes, no license phone-home
Lexical | OpenSearch / Elasticsearch on-prem | Program numbers, NSNs, exact phrases
Metadata | Postgres + object store (MinIO on classified SAN) | ACL, classification label, doc version
Avoid | Pinecone, Weaviate Cloud, pgvector alone at 10M+ chunks | Cloud and scale limits
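
Results from the vector and lexical indexes have to be merged into one ranking at query time. The guide does not name a fusion method; reciprocal rank fusion (RRF) is one common choice and is sketched here. The chunk ids are made up, and `k = 60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked id lists; each list contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["c7", "c2", "c9"]   # hypothetical ids from the vector index
lexical_hits = ["c2", "c4", "c7"]  # hypothetical ids from the lexical index
print(reciprocal_rank_fusion([vector_hits, lexical_hits]))
# ['c2', 'c7', 'c4', 'c9']
```

RRF uses only ranks, so it needs no score normalization between the two engines, which keeps the orchestrator independent of whichever vector DB and search engine end up accredited.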

Security on indexes

Step 7 — Classified document RAG pipeline

  1. Ingest: from accredited CMS, file shares, and scanned PDFs; virus scan in the lower enclave before the one-way push, if policy allows.
  2. Parse/OCR: on-prem (Tesseract or an approved commercial OCR bundle transferred offline).
  3. Chunk + label: attach classification metadata at ingest; it cannot be guessed at query time.
  4. Embed + index: async queue; incremental upsert.
  5. Query: hybrid retrieval + optional rerank → LLM with citation-only context.
  6. Audit: immutable log of user, clearance snapshot, chunk ids, model version, response hash.
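
The audit step in the pipeline above can be sketched as a record builder. The field names are illustrative, and chaining each entry to the previous entry's hash is an added assumption: it makes tampering detectable on top of whatever immutability the WORM store provides.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(entry: dict) -> str:
    """Hash a canonical JSON form so key order cannot change the result."""
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def audit_record(prev_hash: str, user: str, clearance: str, chunk_ids: list[str],
                 model_version: str, response_text: str) -> dict:
    """Build one audit entry; entry_hash covers all fields plus prev_hash."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "clearance_snapshot": clearance,
        "chunk_ids": sorted(chunk_ids),
        "model_version": model_version,
        "response_sha256": hashlib.sha256(response_text.encode("utf-8")).hexdigest(),
        "prev_hash": prev_hash,
    }
    entry["entry_hash"] = _digest(entry)
    return entry


def verify_chain(entries: list[dict], genesis: str = "0" * 64) -> bool:
    """Recompute every hash and check each entry links to the one before it."""
    prev = genesis
    for e in entries:
        body = {k: v for k, v in e.items() if k != "entry_hash"}
        if e["prev_hash"] != prev or _digest(body) != e["entry_hash"]:
            return False
        prev = e["entry_hash"]
    return True
```

Storing only the response hash, not the response text, keeps classified content out of the log while still letting an auditor confirm what was returned.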

Step 8 — Updates without the internet

Transfer enclave workflow

  1. Build signed bundle on connected staging (SBOM, hashes, release notes): model weights, container images, index migrations, app binaries.
  2. Verify signatures + antivirus on removable media or one-way diode.
  3. Promote through dev → test → prod enclaves inside classified network with separate bundles.
  4. Offline eval suite runs on test enclave; compliance sign-off before prod.
  5. Prod rollout: blue/green inference deployment; vector index dual-write then cutover.
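
The hash-verification half of step 2 might look like the following. This is a sketch, not an accredited tool: the `manifest.json` layout (`{"files": {"relative/path": "<sha256 hex>"}}`) is a hypothetical format, and verifying the signature on the manifest itself is left to whatever signing tool the program has approved.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte weights do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_bundle(bundle_dir: Path, manifest_name: str = "manifest.json") -> list[str]:
    """Return a list of problems; an empty list means every file matched."""
    manifest = json.loads((bundle_dir / manifest_name).read_text(encoding="utf-8"))
    problems = []
    for rel_path, expected in manifest["files"].items():
        target = bundle_dir / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_file(target) != expected:
            problems.append(f"hash mismatch: {rel_path}")
    # Anything on the media that the manifest does not list is also a finding.
    listed = set(manifest["files"]) | {manifest_name}
    for path in bundle_dir.rglob("*"):
        rel = path.relative_to(bundle_dir).as_posix()
        if path.is_file() and rel not in listed:
            problems.append(f"unlisted file: {rel}")
    return problems
```

Flagging unlisted files matters as much as checking hashes: extra content on removable media is exactly what the transfer enclave exists to catch.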

What gets updated how often

Artifact | Frequency | Mechanism
New classified documents | Daily to continuous | Internal ingest only; no external media
LLM / embed model | Quarterly or as needed | Signed bundle; full regression eval
Application code | Patch windows | Same bundle pipeline
Threat / malware defs | Synced to enclave policy | Separate approved defs bundle
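
The "full regression eval" for a model bundle implies a mechanical gate before anyone signs off. A minimal sketch, assuming every metric is higher-is-better and that a small regression tolerance is acceptable; the metric names and the 0.01 tolerance are placeholders, not recommended values.

```python
def promotion_gate(baseline: dict[str, float], candidate: dict[str, float],
                   max_regression: float = 0.01) -> list[str]:
    """Return reasons to block promotion; an empty list means the gate passes."""
    failures = []
    for metric, base_score in baseline.items():
        if metric not in candidate:
            failures.append(f"{metric}: missing from candidate run")
        elif candidate[metric] < base_score - max_regression:
            failures.append(
                f"{metric}: {candidate[metric]:.3f} < baseline {base_score:.3f}"
            )
    return failures


baseline = {"citation_accuracy": 0.92, "refusal_correctness": 0.97}
candidate = {"citation_accuracy": 0.94, "refusal_correctness": 0.90}
for reason in promotion_gate(baseline, candidate):
    print("BLOCK:", reason)
```

A gate like this only decides whether the bundle may proceed to human review; it does not replace compliance sign-off.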

Step 9 — Hardware and 99.9% uptime

Reference topology (adjust to program scale)

Tier | Hardware | Role
Inference | 4–8× GPU servers (A100/H100/L40S per accreditation) | vLLM replicas, N+1
Embed / rerank | 2× CPU-heavy nodes | Cheaper scale for indexing
Vector / search | 3+ nodes, NVMe RAID | Milvus/OpenSearch quorum
Storage | Encrypted SAN or distributed storage | Raw docs + backups
Network | Internal load balancers, no default route | mTLS everywhere

Uptime mechanics (99.9% = design, not a single server)

A 99.9% target still leaves about 8.76 hours of downtime per year, which can absorb planned maintenance; schedule bundle upgrades in change windows while standby capacity carries the load.
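
The effect of an N+1 spare can be estimated with a simple binomial model. This sketch assumes each node is up 99% of the time and that failures are independent; both are assumptions, and shared power, switches, or a bad bundle will correlate failures and make real numbers worse.

```python
from math import comb


def k_of_n_availability(n: int, k: int, node_availability: float) -> float:
    """P(at least k of n independent nodes are up)."""
    a = node_availability
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


need = 4  # nodes required to carry peak load (illustrative)
print(f"{need} of {need}:       {k_of_n_availability(need, need, 0.99):.5f}")      # 0.96060
print(f"{need} of {need + 1} (N+1): {k_of_n_availability(need + 1, need, 0.99):.5f}")  # 0.99902
```

Under these assumptions, four 99% nodes with no spare give roughly 96% availability, while a single spare lifts the tier just above 99.9%. That is the sense in which uptime is a design property rather than a property of any one server.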

Step 10 — Security and accreditation hooks

Step 11 — Failure points and mitigations

Failure | Impact | Mitigation
GPU node loss | Queue latency | N+1 replicas; autoscale queue workers
Bad model bundle | Wrong or toxic answers | Eval gate; instant rollback pointer; keep N-1 weights
Index corruption | Missed retrieval | Rebuild from Postgres ledger; nightly checksum
ACL mapping bug | Over-delivery | Deny-by-default; penetration test per release
Transfer enclave compromise | Malware ingress | Hash + sign + dual control; one-way diode
Staff uploads malicious PDF | RCE on parser | Sandboxed parsers; strip active content
Capacity underestimate | SLO miss feels like outage | Load test in test enclave with synthetic corpus
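
The deny-by-default mitigation for ACL mapping bugs can be illustrated with a small retrieval filter. The level ordering, the compartment names, and the `User` / `ChunkMeta` types are simplified stand-ins for the program's actual marking scheme; the point is only that a missing or unrecognized label results in no access.

```python
from dataclasses import dataclass
from typing import Optional

LEVELS = {"UNCLASSIFIED": 0, "CONFIDENTIAL": 1, "SECRET": 2, "TOP SECRET": 3}


@dataclass(frozen=True)
class User:
    clearance: str
    compartments: frozenset


@dataclass(frozen=True)
class ChunkMeta:
    chunk_id: str
    classification: Optional[str]
    compartments: Optional[frozenset]


def may_read(user: User, chunk: ChunkMeta) -> bool:
    """Deny by default: any missing or unknown label means no access."""
    if chunk.classification not in LEVELS or chunk.compartments is None:
        return False
    if user.clearance not in LEVELS:
        return False
    if LEVELS[user.clearance] < LEVELS[chunk.classification]:
        return False
    # The user must hold every compartment the chunk is tagged with.
    return chunk.compartments <= user.compartments


def filter_chunks(user: User, chunks: list) -> list:
    """Keep only the chunks this user is allowed to see."""
    return [c for c in chunks if may_read(user, c)]
```

Applying the filter before chunks reach the prompt builder means a labeling gap at ingest shows up as a missed retrieval instead of an over-delivery.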

Step 12 — How to walk through this in a design session

  1. 3 min — restate zero internet + classified + 99.9%.
  2. 7 min — three-zone diagram (transfer / classified / ops).
  3. 8 min — model + inference sizing and separation of embed vs chat GPUs.
  4. 7 min — on-prem vector + lexical + ACL retrieval.
  5. 8 min — sneaker-net update bundles and eval gates.
  6. 5 min — HA for 99.9% and degraded mode.
  7. Close — “Every box is ours; every update is signed.”

Step 13 — Goals → knobs

Goal | Knob
Higher answer quality | Larger quant-aware model; better reranker; more GPU
Lower hardware cost | Smaller LLM + stronger RAG; INT4; fewer concurrent slots
Higher uptime | More inference replicas; faster index failover; degraded read-only mode
Faster accreditation | Reuse components that already hold an ATO; minimize novel dependencies
Safer disclosure | Stricter retrieval filters; citation-only generation; human review tier

The one line to remember

An air-gapped defense assistant is an accredited private cloud in a room: signed bundles in, classified RAG and local inference in the middle, immutable audit out. The 99.9% uptime comes from redundant GPUs and indexes, not from any external SLA you cannot control.