Fifteen staff-depth scenarios on moving inference closer to users: cloud vs edge vs device tradeoffs; quantization and packaging realism; native runtime isolation; WebGPU/WASM ceilings; thermal and battery governance; signed OTA model delivery; local-first privacy architecture; offline sync and idempotent replay; confidence-based cloud escalation; RAM-conscious on-device RAG; PoP edge servers; weight protection and licensing; multi-GB download UX; production debug with minimal exfil; and air-gapped or export-controlled deployments.
Interview stance. Edge LLM work is systems + physics + policy: thermals, download bytes, browser sandboxes, and export law matter as much as perplexity. Demonstrate hybrid routing with explicit privacy gates—not ‘run 7B in the mobile keynote.’
Pick on-device when data must not leave or offline is mandatory—otherwise cloud economics usually win.
Quantization and runtime choices need device-matrix profiling, not one laptop benchmark.
OTA model shipping is a release engineering discipline: signed chunks, staged rollout, rollback.
Assume motivated attackers on weights—combine legal, crypto, and honest residual risk language.
191. How would you decide between on-device, edge-server, and cloud LLM inference for a product feature?
Decision axes. Privacy (does raw text ever leave the device?), latency SLO, offline necessity, model size ceiling, update cadence, and who pays for GPUs; none of these reduce to ‘tiny model good’ platitudes.
On-device wins. Always-on voice, field technicians without connectivity, and strict data-sovereignty requirements where even TLS to your VPC is politically blocked.
Cloud wins. Frontier reasoning, gigantic contexts, rapid prompt iteration, and centralized safety filters you can patch hourly.
Hybrid norm. Classify intent locally and escalate only hard subtasks; this requires honest fallbacks when the cloud is unreachable.
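Sketch. A minimal Python tier-selection function encoding the gates above; `FeatureProfile` and the latency threshold are illustrative assumptions, not a real API.

```python
from dataclasses import dataclass

@dataclass
class FeatureProfile:
    raw_text_may_leave_device: bool
    must_work_offline: bool
    p95_latency_budget_ms: int
    needs_frontier_reasoning: bool

def pick_inference_tier(f: FeatureProfile) -> str:
    # Hard gates first: privacy and offline trump everything else.
    if not f.raw_text_may_leave_device or f.must_work_offline:
        return "on-device"
    # Frontier reasoning or gigantic contexts force the cloud.
    if f.needs_frontier_reasoning:
        return "cloud"
    # Tight latency budgets, with network allowed, favor a regional PoP.
    if f.p95_latency_budget_ms < 300:
        return "edge"
    # Default: cloud economics usually win.
    return "cloud"
```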
199. How would you implement adaptive escalation from a tiny on-device model to a cloud model when local confidence is low?
Signals. Internal certainty proxies, retrieval emptiness, or a classifier on task difficulty; avoid shipping raw logits if policy forbids it.
Privacy gate. Explicit user consent token per escalation; redact or summarize before uplink when possible.
Cost. Cap daily escalations per tier; abuse-resistant metering.
Latency. Speculatively start the cloud path when local TTFT is slow; watch out for duplicate spend.
Eval. Measure escalation rate vs quality uplift; tune thresholds from data not intuition.
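Sketch. One way to wire the signals, privacy gate, and cost cap above into a single decision; the mean-logprob proxy and both numeric thresholds are illustrative and must be tuned from eval data.

```python
import time

DAILY_ESCALATION_CAP = 20      # illustrative per-tier cap
CONFIDENCE_FLOOR = -1.2        # mean token logprob; tune from eval data

class EscalationGate:
    def __init__(self):
        self.day = time.strftime("%Y-%m-%d")
        self.count = 0

    def should_escalate(self, token_logprobs: list[float],
                        retrieval_hits: int, consent_token: str | None) -> bool:
        today = time.strftime("%Y-%m-%d")
        if today != self.day:                 # reset the daily meter
            self.day, self.count = today, 0
        if consent_token is None or self.count >= DAILY_ESCALATION_CAP:
            return False                      # privacy gate and cost cap are hard stops
        mean_lp = sum(token_logprobs) / max(len(token_logprobs), 1)
        if mean_lp < CONFIDENCE_FLOOR or retrieval_hits == 0:
            self.count += 1                   # low local confidence or empty retrieval
            return True
        return False
```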
200. How would you architect on-device or edge RAG when the corpus must stay local (e.g., field engineering manuals)?
Index size. Compress embeddings, apply product quantization, and shard by manual ID; a keyword + small-vector hybrid may fit in RAM where deep retrieval models do not.
Update path. USB sideload, local mesh, or nightly Wi-Fi sync packs from corporate MDM.
Security. Encrypt the index at rest; wipe on a device-retire command.
Quality. Smaller models need tighter chunking and UI that shows retrieved passage titles for trust.
Fallback. Read-only mode with cached answers if the index is corrupt; better than silent nonsense.
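Sketch. A toy int8-quantized index showing the RAM trade; a production build would use real product quantization (e.g., a faiss PQ index) instead of this brute-force scan, and the class name is invented.

```python
import numpy as np

class Int8Index:
    def __init__(self, embeddings: np.ndarray, doc_ids: list[str]):
        # Per-vector scale so int8 codes reconstruct close to the fp32 original.
        self.scale = np.abs(embeddings).max(axis=1, keepdims=True) / 127.0
        self.codes = np.round(embeddings / self.scale).astype(np.int8)  # 4x smaller than fp32
        self.doc_ids = doc_ids

    def search(self, query: np.ndarray, k: int = 5) -> list[tuple[str, float]]:
        approx = self.codes.astype(np.float32) * self.scale      # dequantize
        approx /= np.linalg.norm(approx, axis=1, keepdims=True)  # cosine normalize
        scores = approx @ (query / np.linalg.norm(query))
        top = np.argsort(-scores)[:k]
        return [(self.doc_ids[i], float(scores[i])) for i in top]
```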
201. How would you design regional edge servers for LLM inference (vs hyperscaler core) to reduce latency while controlling cost?
Placement. PoPs near user clusters; warm smallest model that meets SLA; burst overflow still hits core.
Consistency. Same gateway schema, auth, and eval harness as core—drift creates ‘works in Virginia only’ nightmares.
Data gravity. Edge caches embeddings + KV for hot tenants; cold path pulls object storage lazily.
Ops. Immutable images, config as code, and synthetic probes from each PoP to core dependencies.
Finance. Compare reserved edge GPU vs extra RTT savings; not every app wins.
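Sketch. Illustrative PoP selection with burst overflow to core; the field names and the probe-fed RTT numbers are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Pop:
    name: str
    rtt_ms: float     # synthetic-probe RTT from the client's region
    queue_depth: int  # current in-flight requests
    max_queue: int

def route(pops: list[Pop], sla_ms: float) -> str:
    # Lowest-RTT PoP that meets the SLA and has queue headroom wins;
    # burst overflow falls through to the hyperscaler core.
    fits = [p for p in pops if p.rtt_ms <= sla_ms and p.queue_depth < p.max_queue]
    return min(fits, key=lambda p: p.rtt_ms).name if fits else "core"
```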
202. How would you protect proprietary on-device model weights from extraction or redistribution?
Obfuscation is weak. Assume motivated reversers—combine encryption, server attestation for unlock, and legal TOU.
Hardware hooks. TEE-backed key unwrap where available; degrade gracefully on emulators for anti-piracy SKUs.
Watermarking. Optional inference watermarks to trace leaks forensically.
Licensing. Time-bomb leases for trial models; remote kill switch for fraud accounts.
Realism. State that 100% protection is impossible; the goal is to raise the cost of theft above the value of the weights.
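Sketch. A toy time-bomb lease check for the licensing point above; the token format is invented, and a real deployment would use asymmetric signatures plus a TEE-held unwrap key rather than a shared HMAC secret.

```python
import base64, hashlib, hmac, json, time

def verify_lease(token: str, shared_key: bytes) -> dict:
    # Token = base64(payload) + "." + base64(hmac), issued by the license server.
    payload_b64, sig_b64 = token.split(".")
    payload = base64.urlsafe_b64decode(payload_b64)
    expected = hmac.new(shared_key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, base64.urlsafe_b64decode(sig_b64)):
        raise PermissionError("lease signature invalid; refuse to unwrap weight key")
    lease = json.loads(payload)
    if lease["expires_at"] < time.time():
        raise PermissionError("lease expired; re-attest with the server")
    return lease  # e.g. carries the wrapped weight-decryption key
```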
203. How would you design UX and technical flows for downloading multi-gigabyte language models to consumer devices?
Progressive enablement. Ship app shell first; download tiers ‘basic / balanced / max quality’ with clear size labels.
Network awareness. Wi-Fi only by default; low-data mode respects cellular caps.
Resume & verify. Chunked downloads with checksum per chunk; visible retry affordances.
Storage checks. Preflight disk headroom; offer cleanup of old models.
Trust. Show the publisher signature hash; educate users on why the large download is what buys the on-device privacy option.
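Sketch. Resume-and-verify in a dozen lines; the per-chunk hash manifest and the 64 MiB chunk size are assumptions about the delivery format.

```python
import hashlib, os
import requests

CHUNK = 64 * 1024 * 1024  # 64 MiB; assumed manifest chunking

def download_model(url: str, dest: str, chunk_hashes: list[str]) -> None:
    done = 0
    if os.path.exists(dest):
        done = os.path.getsize(dest) // CHUNK
        with open(dest, "r+b") as f:
            f.truncate(done * CHUNK)          # drop any partial trailing chunk
    with open(dest, "ab") as f:
        for i in range(done, len(chunk_hashes)):
            start = i * CHUNK
            r = requests.get(url, timeout=120,
                             headers={"Range": f"bytes={start}-{start + CHUNK - 1}"})
            r.raise_for_status()
            if hashlib.sha256(r.content).hexdigest() != chunk_hashes[i]:
                raise IOError(f"chunk {i} failed checksum; retry keeps earlier progress")
            f.write(r.content)
```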
204. How would you debug production issues in on-device LLM pipelines (crashes, wrong outputs) with limited server visibility?
Structured breadcrumbs. Non-PII event codes (model id, stage, error class) uploaded only with consent.
Replay packs. Encrypted export the user can email to support; think iOS sysdiagnose, but scoped.
Shadow metrics. On-device histograms, aggregated with differential privacy where allowed.
Lab repro. Device farm with matching SoC + OS combo; symbolicated native traces.
Feature flags. Verbose local logging builds for internal dogfood only.
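Sketch. An allowlist-only breadcrumb buffer; the field names mirror the event codes above, and everything else is invented.

```python
import collections, json, time

ALLOWED_FIELDS = {"model_id", "stage", "error_class"}  # fixed codes, no free-form strings

class Breadcrumbs:
    def __init__(self, capacity: int = 256):
        self.buf = collections.deque(maxlen=capacity)  # ring buffer; oldest entries drop off
        self.user_consented = False

    def record(self, **fields):
        if set(fields) - ALLOWED_FIELDS:
            raise ValueError("non-allowlisted field; refuse rather than risk PII")
        self.buf.append({"ts": int(time.time()), **fields})

    def export(self) -> str | None:
        # The upload path checks consent at export time, not at record time.
        return json.dumps(list(self.buf)) if self.user_consented else None
```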
205. How would you adapt LLM serving for air-gapped, classified, or export-controlled environments?
Supply chain. Offline transfer of signed artifacts via sneakernet; attest hashes on ingest.
Hardware. Approved accelerators list; no phone-home telemetry—design observability that stays on-prem.
Models. Verify licenses permit such deployment; some weights are ECCN-gated, so legal review is part of the architecture.
Updates. Quarterly patch bundles with air-gap test harness before production flip.
Fallbacks. When even a local LLM is disallowed, provide deterministic rule engines for mission-critical paths.
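Sketch. Ingest-side hash attestation for a sneakernet bundle; the manifest format is hypothetical, and a real pipeline would also verify a signature over the manifest itself.

```python
import hashlib, json, pathlib

def attest_bundle(bundle_dir: str, manifest_path: str) -> None:
    # Manifest approved out-of-band: {"artifact filename": "expected sha256"}.
    manifest = json.loads(pathlib.Path(manifest_path).read_text())
    for name, expected in manifest.items():
        digest = hashlib.sha256((pathlib.Path(bundle_dir) / name).read_bytes()).hexdigest()
        if digest != expected:
            raise RuntimeError(f"{name}: hash mismatch; reject the entire bundle")
    print(f"attested {len(manifest)} artifacts; safe to ingest")
```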