
Cost optimization

Fifteen expanded economics scenarios—same structure, staff-level detail: FinOps governance with reconciled bills and RFC gates; tokenizer preflight and billed-vs-estimate learning; multi-signal model routing and honest self-host TCO; calibrated triage classifiers; compression across RAG, tools, and dialog with golden validation; two-tier caching with tenant-safe invalidation; per-feature chargeback with customer-facing dashboards; alerting with contextual auto-mitigation; API↔self-host ROI beyond GPU sticker price; freemium capability matrices; embedding churn control; lazy-evaluation waterfalls with telemetry; percentile-based launch models; long-input UX that preserves trust; and shared GPU clusters with fairness and autoscale discipline.

Interview stance. Cost optimization is economic engineering: connect tokens to SKUs, margin, and customer promises. Always narrate estimation drift, peak versus average traffic, and that agent loops compound spend super-linearly unless capped.

86. How would you design a cost governance framework for an LLM application used across multiple teams?

Single source of truth. Every inference call flows through a gateway that stamps team, product_surface, customer_tier, model, feature_flag. Raw provider bills are reconciled nightly—finance refuses “we think eng used it.”

Budget envelopes. Hard monthly caps for experimental teams, soft alerts at 70/90%, auto-throttle non-critical workloads (batch summarization pauses before customer chat). Executives get variance narratives, not only invoices.
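
A minimal sketch of the envelope check at the gateway; the team caps, thresholds, and non-critical workload list below are illustrative assumptions, not a prescribed schema.

# Hypothetical budget-envelope gate (all names and numbers are illustrative).
MONTHLY_CAP_USD = {"experiments": 2_000, "support-bot": 10_000}
NON_CRITICAL = {"batch_summarization"}          # throttled before customer chat

def admit(team: str, workload: str, spent_usd: float, call_cost_usd: float) -> str:
    cap = MONTHLY_CAP_USD.get(team)
    if cap is None:
        return "allow"                          # no envelope configured for team
    projected = spent_usd + call_cost_usd
    if projected >= cap:
        # hard cap: non-critical work stops; critical work continues but pages
        return "deny" if workload in NON_CRITICAL else "allow_and_page"
    if projected >= 0.9 * cap:
        return "allow_alert_90"                 # soft alert, start throttling batch
    if projected >= 0.7 * cap:
        return "allow_alert_70"                 # early variance narrative to execs
    return "allow"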

Approval workflow. Shipping agents, huge context windows, or always-on shadow traffic requires a lightweight RFC with estimated token growth; that stops one PM from multiplying spend 10× on launch Friday.

Chargeback culture. Optional internal prices per 1k tokens so product teams feel the tradeoffs when choosing mini vs frontier; include embedding + reranker + eval jobs in TCO dashboards.

Governance metrics. Track cost per successful task, waste from retries, cache hit rate, and % spend on customers who never pay—panels like economic thinking, not gatekeeping.

87. How would you implement token counting and budgeting before sending a request to an LLM API?

Fast local estimates. Run the same tokenizer family as the provider (verify revision!) on assembled prompts, tools, and historical message rails. Reserve headroom for tool-return expansion you know is coming.

Budget actions. If over budget: summarize the retrieval pack, truncate lowest-ranked chunks, drop decorative system prose, or refuse with actionable UX (‘upload exceeds limit—pick sections’). Never rely on the provider's 400 for the happy path.
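
A sketch of the preflight-and-truncate step, assuming tiktoken ships the encoding your model revision actually uses (verify, per the note above); the headroom constant and chunk ordering are assumptions.

# Preflight token fit: estimate locally, keep highest-ranked chunks, leave
# headroom for tool-return expansion. Runnable with the tiktoken package.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")       # verify against provider revision
TOOL_HEADROOM = 1_500                           # assumed reserve for tool returns

def fit_prompt(system: str, ranked_chunks: list[str], budget: int) -> list[str]:
    used = len(enc.encode(system)) + TOOL_HEADROOM
    kept = []
    for chunk in ranked_chunks:                 # highest-ranked first
        cost = len(enc.encode(chunk))
        if used + cost > budget:
            break                               # truncate lowest-ranked chunks
        kept.append(chunk)
        used += cost
    return kept                                 # caller summarizes the rest or refuses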

Async true-up. Compare estimates vs billed tokens per request id; learn systematic drift (tool JSON bloat) and feed corrections back into heuristics weekly.

Streaming math. For partial generation caps, track emitted tokens vs remaining—not only input—so marathon answers do not bankrupt sessions mid-stream.

Testing. Golden fixtures with multilingual strings and emoji since tokenizer edge cases become finance incidents.

88. How would you decide when to use GPT-4o vs GPT-4o-mini vs a self-hosted model, based on query complexity?

Signals. Combine lexical features, retrieval density, presence of tools/agents, historical failure patterns, customer tier, and regulatory constraints (health/finance dictates a fixed route map).

Policy table. Default cheap for FAQ and extraction with tight schemas; escalate to frontier for multi-hop reasoning, adversarial users, or low-confidence classifiers.
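
One way the policy table can look in code; every signal name and threshold here is an assumption to be tuned, not a standard.

# Illustrative routing policy: default cheap, escalate on specific signals.
def route(signals: dict) -> str:
    if signals.get("regulated_domain"):         # health/finance: fixed route map
        return "frontier"
    if signals.get("multi_hop") or signals.get("adversarial_user"):
        return "frontier"
    if signals.get("triage_confidence", 1.0) < 0.6:
        return "frontier"                       # low-confidence classifier escalates
    if signals.get("steady_batch"):             # high-QPS, latency-forgiving
        return "self_hosted"
    return "mini"                               # FAQ and tight-schema extraction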

Self-host economics. Favor steady, high-QPS, latency-forgiving workloads with dedicated GPU ops; include ML engineers, incident hours, spare capacity, and model refresh tax—not only $/token.

Quality guardrails. Shadow or periodic canary prompts comparing tiers; auto-rollback routes that hemorrhage quality for marginal savings.

Interview note. Mention carbon/region constraints and data residency as first-class routing dimensions, not afterthoughts.

89. How would you use a small model to pre-classify queries and route only complex ones to a larger model?

Classifier options. Traditional ML on embeddings, small LLM triage prompts, or hybrid: cheap model proposes label + confidence; margin-based escalation when two heads disagree.

Calibration. Tune thresholds on a time-sliced eval so seasonality (tax season, release week) does not mis-route; monitor false cheap-routes as quality incidents.
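
A minimal sketch of margin-based escalation between the two heads; tau and margin are placeholders for thresholds tuned on the time-sliced eval described above.

# Escalate when the cheap heads disagree, are unconfident, or sit on a thin margin.
def triage(p_ml: dict[str, float], p_llm: dict[str, float],
           tau: float = 0.75, margin: float = 0.20) -> str:
    label_ml = max(p_ml, key=p_ml.get)
    label_llm = max(p_llm, key=p_llm.get)
    if label_ml != label_llm:
        return "escalate"                       # the two heads disagree
    top2 = sorted(p_ml.values(), reverse=True)[:2]
    if p_ml[label_ml] < tau or (top2[0] - top2[-1]) < margin:
        return "escalate"                       # low confidence or thin margin
    return label_ml                             # confident cheap route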

User transparency. Optionally show ‘thinking’ labels internally; avoid promising premium brains when the user paid for the economy tier.

Continuous learning. Feed misroutes into relabeling queues; refresh classifier when product taxonomy evolves.

Business metrics. Report monthly $ saved vs acceptable quality delta; kill experiments that save pennies but spike refunds.

90. How would you design a prompt compression system that reduces input tokens without losing meaning?

Retrieval side. MMR or relevance caps; summarize clusters of chunks with a small model instructed to preserve citations; strip markup noise while keeping headings for anchors.

Tool side. Shrink JSON to minimal schemas; project wide tables to the columns tools actually need; paginate huge tool payloads with continuation tokens.
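
A sketch of schema-aware slimming for tool payloads; the per-tool column config and page size are assumptions.

# Project wide records onto the columns the model actually needs, and paginate.
NEEDED_COLUMNS = {"crm_lookup": ("name", "plan", "renewal_date")}  # assumed config

def slim(tool_name: str, rows: list[dict], page_size: int = 20) -> dict:
    cols = NEEDED_COLUMNS.get(tool_name)
    projected = [{k: r[k] for k in cols if k in r} for r in rows] if cols else rows
    return {
        "rows": projected[:page_size],
        # continuation token so huge payloads page instead of inlining
        "next_offset": page_size if len(projected) > page_size else None,
    }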

Conversation side. Rolling summaries with explicit ‘open constraints’ bullets so long chats do not lose safety rules when history compresses.

Validation. Offline golden suites + online A/B comparing task success, not BLEU—compression can look fluent yet delete negation.

Guardrails. Never compress legal disclaimers or tool-use schemas without human sign-off; put them on a do-not-touch list.

91. How would you implement response caching for deterministic queries to avoid redundant LLM calls?

Exact tier. Canonicalize prompts (Unicode normalize, strip volatile timestamps where safe), hash with model + template + tool schema versions. Store compressed value + safety metadata.
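
A minimal sketch of the exact-tier key using only the standard library; the separator and field set are assumptions, and the tenant salt anticipates the abuse note below. The point is that every meaning-changing version is part of the hash.

import hashlib, unicodedata

def cache_key(prompt: str, model: str, template_v: str,
              tool_schema_v: str, tenant_salt: str) -> str:
    canon = unicodedata.normalize("NFC", prompt).strip()   # strip volatile parts upstream
    material = "\x1f".join([canon, model, template_v, tool_schema_v, tenant_salt])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()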

Semantic tier. For FAQs, embed question variants; nearest-neighbor hit below distance threshold returns cached answer—must still run policy filters for tenant-specific secrets.

Invalidation. TTL by domain (news vs HR policy), event-driven bust when KB version bumps, proactive purge when prompts change—caching wrong answers is expensive brand damage.

Abuse. Per-tenant cache salt to prevent cross-customer leakage; rate limit cache priming endpoints.

Accounting. Attribute cache savings in FinOps dashboards so PMs see leverage from good design.

92. How would you track per-user and per-feature LLM cost in a SaaS product?

Instrumentation. OpenTelemetry spans include user_hash, org_id, feature_key, request_id, token counts; forbid bare CLI scripts that bypass the gateway in prod.
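
A sketch with the OpenTelemetry Python API; the span and attribute names mirror the tags above, and the response shape is an assumption about your client.

from opentelemetry import trace

tracer = trace.get_tracer("llm.gateway")

def call_llm(user_hash, org_id, feature_key, request_id, do_call):
    with tracer.start_as_current_span("llm.request") as span:
        span.set_attribute("user_hash", user_hash)
        span.set_attribute("org_id", org_id)
        span.set_attribute("feature_key", feature_key)
        span.set_attribute("request_id", request_id)
        resp = do_call()                        # assumed to expose token usage
        span.set_attribute("tokens.input", resp["usage"]["prompt_tokens"])
        span.set_attribute("tokens.output", resp["usage"]["completion_tokens"])
        return resp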

Rollups. Stream to the warehouse (hourly for ops, nightly for finance) with a slowly changing dimension for plan tiers so historical reports stay honest after upgrades.

Customer visibility. Usage pages show tokens, approximate dollars, and drivers (‘agents +42% WoW’)—transparency reduces escalations and helps upsell responsibly.

Anomaly detection. Alert on single-tenant spikes (API key leak, runaway loop) and auto-suspend suspicious workloads.

Privacy. Aggregate or hash identifiers where GDPR limits fine-grained retention; still preserve enough resolution for abuse response.

93. How would you design a cost alerting system that triggers when LLM spend exceeds a threshold?

Alert shapes. Derivative alarms on hourly spend (catch bots), weekly envelopes vs forecast (catch drift), monthly strategic review with narrative. Do not rely on monthly bills alone; they are too slow for LLM spikes.
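
One shape of the derivative alarm; the trailing window and multipliers are illustrative and should be calibrated per the feedback-loop note below.

from statistics import median

def spend_alarm(hourly_usd: list[float]) -> str | None:
    if len(hourly_usd) < 25:
        return None                             # not enough history yet
    baseline = median(hourly_usd[-25:-1])       # trailing 24h, current hour excluded
    current = hourly_usd[-1]
    if baseline > 0 and current >= 4.0 * baseline:
        return "page"                           # bot or runaway loop, catch in hours
    if baseline > 0 and current >= 2.0 * baseline:
        return "ticket"                         # drift, review against forecast
    return None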

Context-rich pages. Each alert shows top dimensions: model, tenant, feature, new deployment correlation. Link to traces, not only graphs.

Auto-mitigation. Safe toggles: temporarily force mini models for marketing surfaces, disable batch re-embeds, shorten default max tokens—never silent user harm; banner honest degradation.

Runbooks. Pre-approved playbooks per alert type so on-call isn’t inventing finance policy at 3 AM.

Feedback loop. Post-incident, reconcile predicted vs actual cost model; alerts are worthless if thresholds always false-positive.

Example. Marketing chatbot suddenly 4× baseline triggers PagerDuty + automatic switch to mini model for 24h while investigating.

94. How would you evaluate the ROI of switching from OpenAI API to a self-hosted open-source model?

Total cost ledger. GPU capex/lease, power, colo, SRE+ML engineer FTE, on-call burden, vendor support, security reviews, model refresh cadence, and opportunity cost of roadmap delay.

Quality & capability debt. Map feature parity: tool calling quirks, context length, safety filters, multilingual, latency tails. Quantify revenue risk if conversions drop 1%.

Traffic shape. Steady, forecastable QPS wins self-host; bursty viral products often cheaper on API elasticity until baseline grows.
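
Back-of-envelope break-even under loudly made-up prices; the fixed-cost line is the whole point, since it carries people and refresh tax, not just GPUs.

API_PER_M_TOK = 5.00                  # assumed blended API price per 1M tokens
SELF_HOST_FIXED_MONTHLY = 45_000      # GPUs + SRE/ML FTE share + on-call + refresh tax
SELF_HOST_PER_M_TOK = 0.60            # power + marginal serving cost

def cheaper_to_self_host(monthly_m_tokens: float) -> bool:
    api = API_PER_M_TOK * monthly_m_tokens
    self_host = SELF_HOST_FIXED_MONTHLY + SELF_HOST_PER_M_TOK * monthly_m_tokens
    return self_host < api

# Break-even: 45_000 / (5.00 - 0.60) ≈ 10_227 M tokens/month. Bursty traffic
# must be provisioned for peak, which pushes the fixed line higher.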

Validation path. Shadow traffic, A/B revenue tests, staged cutover with automatic revert hooks—not a big bang spreadsheet.

Risk framing. Mention insurance-style value: API hedges your talent shortage; self-host hedges vendor pricing—truth wins interviews.

95. How would you design a tiered service model (free vs paid) where free users get smaller/cheaper models?

Product map. Tie tiers to model class, max tokens, concurrent requests, agent depth, connector count, and data retention—not vague ‘premium AI’ labels.
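
What the tier matrix can look like as config; every number here is an illustrative placeholder.

TIERS = {
    "free": {"model": "mini",     "max_tokens": 1_000, "concurrent": 1,
             "agent_depth": 0, "connectors": 0, "retention_days": 7},
    "pro":  {"model": "frontier", "max_tokens": 8_000, "concurrent": 4,
             "agent_depth": 3, "connectors": 5, "retention_days": 90},
}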

Quota UX. Friendly counters (‘12 summaries left this week’) beat hard errors; create upsell moments when a user hits the compression wall.

Abuse resistance. Device checks, CAPTCHA on spikes, mandatory email verification—free tiers attract scraping rings.

Fairness. Ensure the free tier is still safe: same red-team baselines, no weaker content filters that create headline risk.

Analytics. Monitor conversion funnels vs support load; tune generosity so acquisition math stays sane.

96. How would you reduce embedding costs for a RAG system that re-embeds frequently updated documents?

Stable chunks. Content-hash chunks; skip re-embed when bytes unchanged even if parent doc version ticked for metadata-only edits.
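
A minimal hash gate, standard library only; embed() and the stored-hash store are assumed interfaces.

import hashlib

def maybe_reembed(chunk_id: str, chunk_bytes: bytes,
                  stored_hash: dict, embed) -> bool:
    digest = hashlib.sha256(chunk_bytes).hexdigest()
    if stored_hash.get(chunk_id) == digest:
        return False                            # metadata-only edit: no embed spend
    embed(chunk_id, chunk_bytes)                # micro-batched upstream, not per keystroke
    stored_hash[chunk_id] = digest
    return True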

Incremental strategies. Matryoshka or shortened dimensions for secondary indexes when quality holds; quantize stored vectors if recall tests pass.

Batched pipelines. Accumulate edits for micro-batching to GPU services instead of per-keystroke API calls from editors.

Cache & dedupe. Cross-tenant dedupe only where policy permits; otherwise keep isolation.

Trigger discipline. Debounce noisy autosaves; ingest on commit or publication events, not every draft delta.

97. How would you implement lazy evaluation — only calling the LLM if cached or rule-based responses are insufficient?

Waterfall design. Deterministic FAQ/regex → retrieval-with-template → tiny model → frontier model. Instrument drop-off rates at each gate to prove ROI.
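
A sketch of the waterfall, with each gate as a callable that answers or returns None to escalate; gate names and the counter are illustrative instrumentation.

from collections import Counter

ANSWERED_AT = Counter()                         # prove ROI per gate

def answer(query: str, gates: list) -> str:
    # gates ordered cheap-to-expensive, e.g. [faq_rules, retrieval_template,
    # tiny_model, frontier_model]; the last gate must always answer.
    for gate in gates:
        result = gate(query)                    # None means escalate
        if result is not None:
            ANSWERED_AT[gate.__name__] += 1
            return result
    raise RuntimeError("final gate must always return an answer")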

Confidence gating. Only escalate when template fill misses required slots or semantic similarity to curated answers falls below threshold.

Guardrails stay upstream. Safety/classification blocks still run even when skipping LLM to avoid cheap toxic path.

Fallback honesty. If cheap path uncertain, ask clarifying question before burning big model tokens.

Anti-pattern. Do not chain three LLM calls ‘just in case’—each hop should have a crisp decision function.

98. How would you design a cost model to estimate LLM spend for a new product before launch?

Driver tree. MAU × sessions × prompts/session × (p50/p95/p99 token histogram) × model price table × regional multiplier + embedding amortization + eval jobs + agent branching factor.
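
The driver tree as a worked example with made-up inputs; blending p50/p95/p99 with fixed weights is one simplification among several.

MAU, SESSIONS, PROMPTS = 50_000, 6, 4
TOK_IN  = 0.8 * 1_200 + 0.15 * 4_000 + 0.05 * 12_000    # p50/p95/p99 blend = 2,160
TOK_OUT = 0.8 * 300   + 0.15 * 900   + 0.05 * 2_500     # = 500
PRICE_IN, PRICE_OUT = 2.50 / 1e6, 10.00 / 1e6           # assumed $/token
AGENT_BRANCH = 1.3                                      # loops compound spend

calls = MAU * SESSIONS * PROMPTS * AGENT_BRANCH
monthly = calls * (TOK_IN * PRICE_IN + TOK_OUT * PRICE_OUT)
print(f"{monthly:,.0f} USD/month")   # ≈ 16,224 before embeddings and eval jobs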

Scenario analysis. Model viral (3× sessions) and failure (2× retries) cases; include the support bots humans escalate to.

Instrumentation first. Commit to logging every beta call with token counts so week-one actuals recalibrate spreadsheet immediately.

Stakeholder alignment. Pair engineering estimates with finance assumptions on FX, tax, and committed spend discounts.

Honesty. State unknowns (tool adoption curve) and plan iteration—execs prefer ranges with error bars over false precision.

99. How would you handle token limit errors gracefully when a user submits an extremely long document?

Pre-flight checks. Client-side size hints, server-side tokenizer precount before queueing expensive retrieval—fail fast with friendly copy.

Remediation flows. Offer guided summarization (‘pick sections’), chapter-by-chapter Q&A, or a background job that emails results when long jobs are unacceptable synchronously.

Error shaping. Translate provider errors into next steps; log raw codes only server-side.

Power users. API responses include structured MAX_INPUT_TOKENS and remaining budget for integrators.
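
A sketch of the structured limit response, again assuming tiktoken for the precount; the field names follow the paragraph above and the remediation list mirrors the flows described earlier.

import tiktoken

enc = tiktoken.get_encoding("o200k_base")
MAX_INPUT_TOKENS = 128_000                      # assumed model limit

def preflight(document: str, remaining_budget: int) -> dict | None:
    n = len(enc.encode(document))
    if n <= min(MAX_INPUT_TOKENS, remaining_budget):
        return None                             # proceed to retrieval and queueing
    return {                                    # friendly copy rendered client-side
        "error": "input_too_large",
        "input_tokens": n,
        "MAX_INPUT_TOKENS": MAX_INPUT_TOKENS,
        "remaining_budget": remaining_budget,
        "next_steps": ["pick_sections", "chapter_qa", "background_job"],
    }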

Product tie-in. Long-doc ingestion might be a paid feature where you fund higher-context models; say that aloud if relevant.

100. How would you architect a shared LLM inference cluster to maximize GPU utilization across multiple applications?

Scheduler-centric design. Central queue (vLLM/TGI/etc.) with priority classes, preemption policies for interactive vs batch, and dynamic batching within SLA tails.

Tenant fairness. Concurrency tokens + weighted fair queueing prevent one chatty microservice from hogging VRAM; burst allowances for revenue-critical paths.
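
Concurrency tokens in miniature; per-tenant limits are illustrative, and real schedulers add weighted fair queueing and burst credit on top.

import asyncio

LIMITS = {"app_a": 8, "app_b": 4, "batch": 2}   # assumed fair-share caps
_sems = {t: asyncio.Semaphore(n) for t, n in LIMITS.items()}

async def submit(tenant: str, infer):
    async with _sems[tenant]:                   # blocks when the tenant's share is spent
        return await infer()                    # scheduler batches under the hood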

Multi-tenant models. Shared base weights with LoRA adapters loaded dynamically where vendor supports; pin hot adapters, evict cold.

Autoscale signals. Scale on queue age p95, not only CPU; watch KV-cache pressure and clock throttling.

Operational maturity. Capacity game-days, regular burn-in, and chaos drills; GPUs fail creatively, and panels like pragmatic SRE detail over buzzwords.

Shared inference quotas

flowchart TB
  A1[App A] --> SCH[Scheduler]
  A2[App B] --> SCH
  SCH --> GPU[GPU pool]
            

Recap — this section

Q · Takeaway
86 · Metered gateway + reconciled bills; tiered budgets with auto-throttle; RFC for spend step-functions; internal unit economics; waste KPIs.
87 · Tokenizer-accurate estimates; proactive truncation/summarization; billed vs estimated drift feedback; output-side caps; fixture tests.
88 · Multi-signal routing table; self-host TCO with people cost; quality canaries; residency/policy axes.
89 · Calibrated triage w/ escalation bands; monitor misroutes; label flywheel; savings vs quality trade sheet.
90 · Multi-frontier compression (RAG/tools/dialog); schema-aware tool slimming; loss-aware goldens; protected system text.
91 · Versioned canonical keys; optional semantic FAQ cache; tenant-isolated invalidation; abuse controls; savings visibility.
92 · Mandatory tracing tags; warehouse rollups; customer transparency; anomaly automation; privacy-aware grain.
93 · Derivative + budget envelopes; dimensional drill-down; safe auto-mitigation; playbooks; calibrate thresholds.
94 · Full economic + ops ledger; capability gap analysis; traffic fit; gated migration with revert; strategic hedge framing.
95 · Explicit tier capability matrix; legible quotas; anti-abuse controls; consistent safety; conversion analytics.
96 · Hash-gated re-embed; dimensionality tricks after eval; batch/debounce; policy-aware dedupe; publish-time triggers.
97 · Measured policy waterfall; confidence-based escalation; safety not skipped; clarifying Q before spend; avoid redundant hops.
98 · Percentile token driver tree; stress scenarios; beta telemetry; finance/engineering joint review; explicit uncertainty.
99 · Precount + proactive UX; chunked/summary workflows; structured API limits; monetization clarity.
100 · Central batching scheduler; fair-share quotas; LoRA hot-swap strategy; queue-age autoscale; hardware chaos readiness.
