
Edge & on-device LLM

Fifteen staff-depth scenarios on moving inference closer to users: cloud vs edge vs device tradeoffs; quantization and packaging realism; native runtime isolation; WebGPU/WASM ceilings; thermal and battery governance; signed OTA model delivery; local-first privacy architecture; offline sync and idempotent replay; confidence-based cloud escalation; RAM-conscious on-device RAG; PoP edge servers; weight protection and licensing; multi-GB download UX; production debug with minimal exfil; and air-gapped or export-controlled deployments.

Interview stance. Edge LLM work is systems + physics + policy: thermals, download bytes, browser sandboxes, and export law matter as much as perplexity. Demonstrate hybrid routing with explicit privacy gates—not ‘run 7B in the mobile keynote.’

191. How would you decide between on-device, edge-server, and cloud LLM inference for a product feature?

Decision axes. Privacy (does raw text ever leave the device?), latency SLO, offline necessity, model size ceiling, update cadence, and who pays for GPUs—reason along these axes, not from ‘tiny model good’ platitudes.

On-device wins. Always-on voice, field technicians without connectivity, and strict data-sovereignty screens where even TLS to your VPC is politically blocked.

Cloud wins. Frontier reasoning, gigantic contexts, rapid prompt iteration, and centralized safety filters you can patch hourly.

Hybrid norm. Classify intent locally; escalate only hard subtasks—requires honest fallbacks when cloud unreachable.

Interview clarity. Cite measured battery/thermals and p95 latency; panels debunk vibes quickly.

Split inference

flowchart TB
  D[Device small model] -->|easy| A[Answer local]
  D -->|hard / policy| C[Cloud large model]
  C --> A
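
A minimal sketch of the split above, in Python for brevity. classify_intent() is a hypothetical on-device classifier and the length heuristic is a stand-in; a real router would also consult the privacy mode before any uplink.

from dataclasses import dataclass

@dataclass
class Route:
    path: str    # "local" or "cloud"
    reason: str

def classify_intent(prompt: str) -> str:
    # Stand-in for a tiny on-device classifier.
    return "hard" if len(prompt) > 500 else "easy"

def route(prompt: str, cloud_allowed: bool, cloud_reachable: bool) -> Route:
    if classify_intent(prompt) == "easy":
        return Route("local", "intent classified easy")
    if cloud_allowed and cloud_reachable:
        return Route("cloud", "hard subtask escalated")
    # Honest fallback: degrade locally rather than fail silently.
    return Route("local", "cloud unreachable; degraded local answer")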

192. How would you choose quantization formats and model sizes for mobile or browser deployment?

Hardware truth. INT4 on NPU vs FP16 on GPU vs WASM float paths—profiling beats paper FLOPs; some kernels are faster at INT8 if vendor-optimized.

Quality gates. Per-task golden sets for summarization, extraction, and safety refusals; track perplexity only as a secondary smoke test.

Packaging. GGUF, ONNX, TFLite, and Core ML each imply a different runtime and team—pick one primary stack per OS family.

Download economics. Delta updates, compressed artifacts, and CDN resumable downloads; cap install size, or uninstall churn spikes.

Rollback. Keep two model revisions on device with atomic swap if quality regresses OTA.
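
One way to make the quality gates concrete, as a sketch: per-task score deltas against the FP16 baseline decide whether a quantized candidate ships. score() is a hypothetical stand-in for the golden-set harness, and the budgets are illustrative.

# Gate a quantized candidate against per-task golden sets (sketch).
MAX_DROP = {"summarize": 0.02, "extract": 0.01, "safety_refusals": 0.0}

def score(model_id: str, task: str) -> float:
    # Stub: run the task's golden set and return its metric in [0, 1].
    return 0.90

def gate(candidate: str, baseline: str) -> bool:
    for task, budget in MAX_DROP.items():
        drop = score(baseline, task) - score(candidate, task)
        if drop > budget:
            print(f"FAIL {task}: drop {drop:.3f} > budget {budget}")
            return False
    return True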

193. How would you architect model serving on iOS, Android, and desktop (native runtimes vs embedded engines)?

Isolation. Inference in bounded worker processes or NPU contexts; never block UI thread—streaming tokens still need cooperative scheduling.

Memory. mmap weights with page-in warmup, and evict aggressively when the app backgrounds; handle low-memory kills gracefully.

Accelerators. Abstract vendor SDKs behind an internal interface so Pixel vs iPhone vs Snapdragon differs only in backends.

CI. Emulator/device farm smoke tests for numerical parity (tight tolerances) and crash telemetry.

Updates. Feature-flag new runtime versions; staged rollout by OS version and chip family.
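
The backend seam from the Accelerators point, sketched in Python for brevity (the production seam would live in C++/Swift/Kotlin); class and function names are illustrative.

from typing import Iterator, Protocol

class InferenceBackend(Protocol):
    def generate(self, prompt: str, max_tokens: int) -> Iterator[str]: ...

class CpuBackend:
    def generate(self, prompt: str, max_tokens: int) -> Iterator[str]:
        for i in range(max_tokens):
            yield f"tok{i}"  # stub token stream

def pick_backend(has_npu: bool, has_gpu: bool) -> InferenceBackend:
    # Vendor SDKs (Core ML, NNAPI, QNN, ...) hide behind the one Protocol,
    # so Pixel vs iPhone vs Snapdragon differ only in which class is returned.
    return CpuBackend()  # sketch: always CPU here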

194. How would you design a browser-based LLM experience using WebGPU / WASM—and what are the hard limits?

Sandbox reality. No arbitrary file access, tight memory caps, user gesture requirements, and throttled background tabs—plan for intermittent compute.

Asset delivery. Split weights into range-requestable chunks; service worker caches with LRU eviction; signature verification before execution.

Fingerprinting ethics. Local models reduce server logging, but WebGPU probes can still fingerprint devices—coordinate with the privacy policy.

Fallback. Stream to cloud when WebGPU missing or model too large; disclose which path ran.

Testing. Matrix Chrome/Safari/Firefox; Safari lacks features others have—tier capabilities explicitly.
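
Capability tiering made explicit, as platform-neutral logic (in the page this would run as JS probing navigator.gpu); tier names and thresholds are illustrative.

def capability_tier(has_webgpu: bool, has_f16: bool, mem_budget_mb: int) -> str:
    # One explicit tier decision instead of feature-sniffing scattered app-wide.
    if has_webgpu and has_f16 and mem_budget_mb >= 4096:
        return "full-local"      # larger quantized model in-browser
    if has_webgpu and mem_budget_mb >= 1500:
        return "small-local"     # ~1B-class quantized model
    return "cloud-fallback"      # disclose to the user which path ran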

195. How would you manage thermal throttling, battery drain, and fan noise for continuous on-device LLM use?

Budgets. Max tokens per minute, duty cycle with cool-down periods, and user-visible ‘performance mode’ toggles.

Scheduling. Defer noninteractive summarization until charging + Wi-Fi unless user overrides.

Hardware signals. Subscribe to OS thermal state APIs; degrade proactively instead of hard crashes.

Quality trade. Shorter context windows under stress beat surprise shutdowns mid-sentence.

Telemetry. Opt-in, anonymized heat/battery metrics feed the hardware team and model pickers.
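
A duty-cycle governor in miniature; thermal_state is whatever the OS reports (iOS exposes ProcessInfo.thermalState, for example) and the per-state token budgets are illustrative.

import time

BUDGET_TPM = {"nominal": 600, "fair": 300, "serious": 100, "critical": 0}

class TokenGovernor:
    def __init__(self):
        self.window_start = time.monotonic()
        self.spent = 0

    def allow(self, n_tokens: int, thermal_state: str) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 60:          # rolling 1-minute window
            self.window_start, self.spent = now, 0
        budget = BUDGET_TPM.get(thermal_state, 0)  # degrade proactively
        if self.spent + n_tokens > budget:
            return False                           # caller defers or shortens context
        self.spent += n_tokens
        return True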

196. How would you ship over-the-air model updates safely to millions of client devices?

Integrity. Signatures + hash checkpoints; block MITM swaps; support offline verification on first boot after download.

Resilience. Resumable HTTP; peer-assisted or P2P delivery only if the legal team approves.

Staged rollout. Canary percent by device segment; automatic rollback telemetry on crash or quality SLO breach.

Storage hygiene. GC old weights; user setting to reclaim GB; enterprise policy to pin versions.

Compliance. Geofenced downloads when export rules apply to model weights themselves.
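
Integrity plus rollback in one sketch: pin the downloaded bytes to a hash from the signed manifest (signature verification assumed upstream), then swap atomically while keeping one previous revision. Paths are illustrative.

import hashlib, os, shutil

def sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def install(new_blob: str, expected_hash: str, live_dir: str = "model/live",
            prev_dir: str = "model/prev"):
    # The signed manifest carries expected_hash; this pins bytes to manifest.
    if sha256(new_blob) != expected_hash:
        raise ValueError("hash mismatch; refusing install")
    if os.path.exists(prev_dir):
        shutil.rmtree(prev_dir)                 # GC the oldest revision
    if os.path.exists(live_dir):
        os.replace(live_dir, prev_dir)          # keep one rollback target
    os.makedirs(live_dir, exist_ok=True)
    shutil.move(new_blob, os.path.join(live_dir, "weights.bin"))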

197. How would you design a ‘local-first’ LLM product where sensitive prompts never leave the device by default?

Trust boundaries. Clear UX states: local-only lock icon vs cloud enhance; never silently escalate.

Networking. If optional telemetry exists, scrub before send; default deny outbound for regulated modes.

Key material. Protect fine-tunes or PEFT blobs in secure storage; biometrics gate export.

Support. Explain why logs are limited—train CS on safe diagnostics without raw prompts.

Audit. Third-party pen tests on ‘offline’ claims; marketing language reviewed by legal.
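
Default-deny as code rather than convention: every outbound call passes one chokepoint that enforces the mode and scrubs payloads. scrub()'s allow-list and the mode names are illustrative.

class EgressBlocked(Exception): ...

def scrub(payload: dict) -> dict:
    # Hypothetical redactor: drop raw prompt text, keep event codes only.
    return {k: v for k, v in payload.items()
            if k in {"event", "model_id", "error_class"}}

def send(payload: dict, mode: str, user_consented: bool):
    if mode == "regulated":
        raise EgressBlocked("outbound disabled in regulated mode")
    if mode == "local_only" and not user_consented:
        raise EgressBlocked("local-only mode; no silent escalation")
    upload(scrub(payload))  # single chokepoint for all egress

def upload(payload: dict):
    ...  # stub transport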

198. How would you handle offline sessions that must reconcile with server-side RAG or user accounts when connectivity returns?

Queue. Durable outbox of tasks with vector clocks or simple monotonic ids; idempotent replay to avoid double billing.

Conflict policy. Last-writer-wins vs prompting the user when cloud state has diverged—especially for shared notebooks.

RAG catch-up. On reconnect, refresh critical indexes or show staleness banner until incremental sync completes.

Security. Bind queued payloads to device attestation where enterprise requires.

UX. Transparent ‘syncing knowledge…’ progress; partial answers labeled provisional.
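
The outbox in miniature: monotonic row ids double as idempotency keys, so replay after reconnect cannot double-bill as long as the server dedupes on the key. sqlite stands in for durable storage, and the device id is a placeholder.

import sqlite3

db = sqlite3.connect("outbox.db")
db.execute("CREATE TABLE IF NOT EXISTS outbox "
           "(id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, acked INTEGER DEFAULT 0)")

def enqueue(payload: str):
    db.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))
    db.commit()

def replay(post):  # post(idempotency_key, payload) -> bool; server dedupes on key
    rows = db.execute("SELECT id, payload FROM outbox WHERE acked = 0 ORDER BY id").fetchall()
    for row_id, payload in rows:
        if post(f"device-123:{row_id}", payload):    # stable key survives retries
            db.execute("UPDATE outbox SET acked = 1 WHERE id = ?", (row_id,))
            db.commit()
        else:
            break  # stop on first failure; order preserved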

199. How would you implement adaptive escalation from a tiny on-device model to a cloud model when local confidence is low?

Signals. Internal certainty proxies, retrieval emptiness, or classifier on task difficulty—avoid shipping raw logits if policy forbids.

Privacy gate. Explicit user consent token per escalation; redact or summarize before uplink when possible.

Cost. Cap daily escalations per tier; abuse-resistant metering.

Latency. Speculatively start cloud path when local TTFT slow—careful with duplicate spend.

Eval. Measure escalation rate vs quality uplift; tune thresholds from data not intuition.
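
The five points in one sketch; confidence() is a hypothetical certainty proxy, the 0.7 threshold and daily cap are placeholders to be tuned from eval data, and the uplink is summarized before leaving the device.

ESCALATION_CAP = 20          # per user tier, per day (illustrative)

def confidence(local_answer: str) -> float:
    # Stub certainty proxy: margin, retrieval hit rate, or difficulty classifier.
    return 0.4

def answer(prompt: str, used_today: int, consent_token: str | None) -> str:
    local = run_local(prompt)
    if confidence(local) >= 0.7:
        return local
    if consent_token is None or used_today >= ESCALATION_CAP:
        return local + "\n[low confidence; cloud assist unavailable]"
    return run_cloud(summarize_for_uplink(prompt), consent_token)

def run_local(p): return "local draft"          # stub
def run_cloud(p, tok): return "cloud answer"    # stub
def summarize_for_uplink(p): return p[:200]     # redact/summarize before uplink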

200. How would you architect on-device or edge RAG when the corpus must stay local (e.g., field engineering manuals)?

Index size. Compressed embeddings, product quantization, and sharding by manual id; a keyword + small-vector hybrid may fit RAM where deep models do not.

Update path. USB sideload, local mesh, or nightly Wi-Fi sync packs from corporate MDM.

Security. Encrypt index at rest; wipe on device retire command.

Quality. Smaller models need tighter chunking and UI that shows retrieved passage titles for trust.

Fallback. Read-only mode with cached answers if index corrupt—better than silent nonsense.
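
A RAM-conscious index sketch: int8-quantized embeddings searched with plain numpy; embed() stands in for a small on-device encoder, and product quantization would shrink the footprint further.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Stub for a small on-device encoder returning a unit-norm vector.
    v = np.random.rand(256).astype(np.float32)
    return v / np.linalg.norm(v)

class Int8Index:
    def __init__(self, vectors: np.ndarray):           # float32 [n, d], unit norm
        self.scale = np.abs(vectors).max()
        self.q = np.round(vectors / self.scale * 127).astype(np.int8)  # 4x smaller

    def search(self, query: str, k: int = 5) -> np.ndarray:
        qv = np.round(embed(query) / self.scale * 127).astype(np.int32)
        scores = self.q.astype(np.int32) @ qv          # int dot product
        return np.argsort(scores)[::-1][:k]            # top-k passage ids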

201. How would you design regional edge servers for LLM inference (vs hyperscaler core) to reduce latency while controlling cost?

Placement. PoPs near user clusters; warm smallest model that meets SLA; burst overflow still hits core.

Consistency. Same gateway schema, auth, and eval harness as core—drift creates ‘works in Virginia only’ nightmares.

Data gravity. Edge caches embeddings + KV for hot tenants; cold path pulls object storage lazily.

Ops. Immutable images, config as code, and synthetic probes from each PoP to core dependencies.

Finance. Compare reserved edge-GPU cost against the RTT savings; not every app wins.
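
Overflow-to-core as explicit policy: route to the lowest-RTT PoP only while its warm pool has headroom. PoP names, latencies, and the queue threshold are illustrative.

POPS = {"fra": {"rtt_ms": 12, "queue": 3, "cap": 8},
        "ams": {"rtt_ms": 19, "queue": 8, "cap": 8}}

def pick_target(max_queue_frac: float = 0.8) -> str:
    # Prefer the lowest-RTT PoP with warm headroom; burst overflow hits core.
    viable = [(p["rtt_ms"], name) for name, p in POPS.items()
              if p["queue"] / p["cap"] < max_queue_frac]
    return min(viable)[1] if viable else "core"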

202. How would you protect proprietary on-device model weights from extraction or redistribution?

Obfuscation is weak. Assume motivated reversers—combine encryption, server attestation for unlock, and legal TOU.

Hardware hooks. TEE-backed key unwrap where available; degrade gracefully on emulators for anti-piracy SKUs.

Watermarking. Optional inference watermarks to trace leaks forensically.

Licensing. Time-bomb leases for trial models; remote kill switch for fraud accounts.

Realism. State that 100% protection is impossible; the goal is to raise the cost of theft above the value of the weights.
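
Envelope encryption with attestation-gated unlock, sketched with the cryptography package; request_key_unwrap() is a hypothetical call to your attestation service, and a TEE would hold the unwrapped key where hardware allows.

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def request_key_unwrap(attestation_evidence: bytes) -> bytes:
    # Hypothetical: server verifies device attestation, returns a 32-byte key.
    return b"\x00" * 32

def load_weights(blob: bytes, nonce: bytes, evidence: bytes) -> bytes:
    key = request_key_unwrap(evidence)             # no valid attestation, no key
    return AESGCM(key).decrypt(nonce, blob, None)  # weights live only in memory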

203. How would you design UX and technical flows for downloading multi-gigabyte language models to consumer devices?

Progressive enablement. Ship app shell first; download tiers ‘basic / balanced / max quality’ with clear size labels.

Network awareness. Wi-Fi only by default; low-data mode respects cellular caps.

Resume & verify. Chunked downloads with checksum per chunk; visible retry affordances.

Storage checks. Preflight disk headroom; offer cleanup of old models.

Trust. Show the publisher signature hash; educate users that the large download is what buys the privacy option.
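
Resume and verify in one loop: Range requests per chunk, checksum against the manifest before appending, so the on-disk size always encodes progress. The manifest shape is illustrative and requests stands in for the platform downloader.

import hashlib, os, requests

def fetch(url: str, out: str, chunk_hashes: list[str], chunk_size: int = 1 << 24):
    # Only whole, verified chunks are ever written, so this resume point is exact.
    done = os.path.getsize(out) // chunk_size if os.path.exists(out) else 0
    with open(out, "ab") as f:
        for i in range(done, len(chunk_hashes)):
            start = i * chunk_size
            r = requests.get(url, timeout=30,
                             headers={"Range": f"bytes={start}-{start + chunk_size - 1}"})
            r.raise_for_status()
            if hashlib.sha256(r.content).hexdigest() != chunk_hashes[i]:
                raise ValueError(f"chunk {i} checksum mismatch; will retry")
            f.write(r.content)                     # append only after verify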

204. How would you debug production issues in on-device LLM pipelines (crashes, wrong outputs) with limited server visibility?

Structured breadcrumbs. Non-PII event codes (model id, stage, error class) uploaded only with consent.

Replay packs. Encrypted export the user can email support—think iOS sysdiagnose but scoped.

Shadow metrics. On-device histograms, aggregated with differential privacy when allowed.

Lab repro. Device farm with matching SoC + OS combo; symbolicated native traces.

Feature flags. Verbose local logging builds for internal dogfood only.
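
Breadcrumbs as code: a fixed event vocabulary with no free-form strings, buffered locally and released only behind consent. Field and event names are illustrative.

import json, time

ALLOWED_EVENTS = {"model_load", "gen_start", "gen_done", "oom_kill", "decode_err"}
buffer: list[dict] = []

def breadcrumb(event: str, model_id: str, error_class: str = ""):
    assert event in ALLOWED_EVENTS            # no free-form text, ever
    buffer.append({"t": int(time.time()), "event": event,
                   "model": model_id, "err": error_class})

def flush(user_consented: bool) -> str | None:
    if not user_consented:
        return None                           # breadcrumbs stay on device
    payload, buffer[:] = json.dumps(buffer), []
    return payload                            # handed to the uploader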

205. How would you adapt LLM serving for air-gapped, classified, or export-controlled environments?

Supply chain. Offline transfer of signed artifacts via sneakernet; attest hashes on ingest.

Hardware. Approved accelerators list; no phone-home telemetry—design observability that stays on-prem.

Models. Verify licenses permit such deployment; some weights are ECCN-gated—legal is part of architecture.

Updates. Quarterly patch bundles with air-gap test harness before production flip.

Fallbacks. When even local LLM disallowed, provide deterministic rule engines for mission-critical paths.

Recap — this section

Q · Takeaway
191 · Privacy/latency/offline/budget matrix; hybrid intent routing; measure thermals + p95; eschew cargo-cult.
192 · Profile real hardware; task goldens over PPL; one runtime stack per platform; OTA delta + dual-version rollback.
193 · Off UI thread; mmap + OOM handling; backend abstraction; device CI matrix; staged runtime rollout.
194 · WebGPU/WASM constraints; cacheable signed shards; cloud fallback transparency; cross-browser capability tiers.
195 · Token budgets + duty cycles; defer batch work; thermal-API-aware degrade; opt-in perf telemetry.
196 · Signed resumable bundles; staged OTA; auto-rollback signals; storage GC; export/geo rules.
197 · Explicit UX modes; minimal telemetry; hardware-backed secrets; honest support playbook; pen-tested claims.
198 · Idempotent outbox; conflict UX; post-connect index refresh; optional attestation; honest sync states.
199 · Confidence + policy gates; consentful uplink; tiered escalation caps; race-aware scheduling; threshold tuning.
200 · RAM-conscious indexes; MDM-managed sync packs; encrypted corpora; citation-forward UX; safe degrade.
201 · PoP warm pools + core overflow; config parity; hot-cache strategy; PoP probes; TCO realism.
202 · Encrypted artifacts + attestation; TEE where possible; forensic watermarks; leased licenses; honest threat model.
203 · Tiered model SKUs; Wi-Fi defaults; resumable verified chunks; disk preflight; transparency copy.
204 · Privacy-preserving breadcrumbs; user-initiated replay bundles; SoC-matched labs; internal verbose builds.
205 · Offline signed supply chain; no phone-home; export-class legal review; quarterly air-gap patches; non-LLM fallbacks.
