Interview ready · Theory · Applied

Applied LLM products & agents

You will prepare for applied LLM and agent interviews the way strong candidates do: articulate architectures (prompts, retrieval, tools), spell out evaluation and observability, and reason about latency, cost, and safety as first-class constraints. Long-form system-design drills ship in the next module; this page makes your vocabulary and tradeoffs automatic.

PROMPT ENGINEERING

1. What is prompt engineering? Why is it important when working with LLMs?

Answer. Prompt engineering is shaping inputs (instructions, formats, examples, constraints) to steer a frozen model toward reliable behavior. It matters because it’s your fastest lever before fine-tuning: you trade $0 marginal training for careful eval and versioning. Impressive candidates mention prompt+tool contracts, refusal patterns, regression suites, and that prompts are code-like artifacts needing review and CI—not vibes.

2. What is zero-shot prompting? When does it work and when does it fail?

Answer. Framing: instruction sensitivity, tool contracts, and eval-backed iteration—not clever wording alone. “What is zero-shot prompting? When does it work and when does it fail?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

3. What is few-shot prompting? How does the number of examples affect output quality?

Answer. Framing: instruction sensitivity, tool contracts, and eval-backed iteration—not clever wording alone. “What is few-shot prompting? How does the number of examples affect output quality?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

4. What is chain-of-thought (CoT) prompting? How does it improve reasoning?

Answer. Framing: instruction sensitivity, tool contracts, and eval-backed iteration—not clever wording alone. “What is chain-of-thought (CoT) prompting? How does it improve reasoning?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

5. What is zero-shot chain-of-thought prompting?

Answer. Framing: instruction sensitivity, tool contracts, and eval-backed iteration—not clever wording alone. “What is zero-shot chain-of-thought prompting?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

6. What is self-consistency prompting?

Answer. Framing: instruction sensitivity, tool contracts, and eval-backed iteration—not clever wording alone. “What is self-consistency prompting?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

7. What is Tree-of-Thought (ToT) prompting?

Answer. Framing: instruction sensitivity, tool contracts, and eval-backed iteration—not clever wording alone. “What is Tree-of-Thought (ToT) prompting?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

8. What is ReAct prompting? How does it combine reasoning and acting?

Answer. Framing: instruction sensitivity, tool contracts, and eval-backed iteration—not clever wording alone. “What is ReAct prompting? How does it combine reasoning and acting?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

9. What is a system prompt? How does it differ from a user prompt?

Answer. Framing: instruction sensitivity, tool contracts, and eval-backed iteration—not clever wording alone. “What is a system prompt? How does it differ from a user prompt?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

10. What is prompt injection? How do you defend against it?

Answer. Prompt injection is untrusted text in the model’s context overriding developer instructions. Defense in depth: least-privilege tools, structured tool schemas, allowlists, content isolation, output validators, human gates for risky actions, and logging. Prompt engineering alone will not secure agents—the threat model is the whole tool surface.

11. What is prompt leaking?

Answer. Framing: instruction sensitivity, tool contracts, and eval-backed iteration—not clever wording alone. “What is prompt leaking?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

12. What is the difference between a hard prompt and a soft prompt?

Answer. Framing: instruction sensitivity, tool contracts, and eval-backed iteration—not clever wording alone. I map “What is the difference between a hard prompt and a soft prompt?” to tradeoffs teams debate in prod for PROMPT ENGINEERING—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

13. What is prefix tuning? How is it different from fine-tuning?

Answer. Framing: instruction sensitivity, tool contracts, and eval-backed iteration—not clever wording alone. “What is prefix tuning? How is it different from fine-tuning?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

14. What is prompt chaining? When would you use it?

Answer. Framing: instruction sensitivity, tool contracts, and eval-backed iteration—not clever wording alone. “What is prompt chaining? When would you use it?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

15. What are delimiters in prompts and why do they matter?

Answer. Framing: instruction sensitivity, tool contracts, and eval-backed iteration—not clever wording alone. I bucket the list, tie each bucket to an operational concern (reliability, eval, safety), and avoid encyclopedic tone. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

16. What is role prompting?

Answer. Framing: instruction sensitivity, tool contracts, and eval-backed iteration—not clever wording alone. “What is role prompting?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

17. What is meta-prompting?

Answer. Framing: instruction sensitivity, tool contracts, and eval-backed iteration—not clever wording alone. “What is meta-prompting?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

18. How do you handle long prompts that exceed the context window?

Answer. Framing: instruction sensitivity, tool contracts, and eval-backed iteration—not clever wording alone. I explain the data/control path for “How do you handle long prompts that exceed the context window?”: what’s cached, streamed, retried, or batched; where p99 blows up. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

19. What is output formatting in prompts? (JSON mode, structured outputs)

Answer. Framing: instruction sensitivity, tool contracts, and eval-backed iteration—not clever wording alone. “What is output formatting in prompts? (JSON mode, structured outputs)” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

20. How do you evaluate the quality of a prompt?

Answer. Framing: instruction sensitivity, tool contracts, and eval-backed iteration—not clever wording alone. I explain the data/control path for “How do you evaluate the quality of a prompt?”: what’s cached, streamed, retried, or batched; where p99 blows up. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

RETRIEVAL-AUGMENTED GENERATION (RAG)

21. What is Retrieval-Augmented Generation (RAG)? Why was it introduced?

Answer. RAG retrieves evidence per query, packs it into context, and generates an answer—grounding fluency in external knowledge. It’s the default pattern for moving targets, citation requirements, or knowledge too large to memorize in weights. The hard part is retrieval quality and evaluation (faithfulness, recall of relevant passages)—not the LLM call in isolation.

22. What are the key components of a RAG pipeline?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. I bucket the list, tie each bucket to an operational concern (reliability, eval, safety), and avoid encyclopedic tone. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

23. What is the difference between RAG and fine-tuning for domain adaptation?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. I map “What is the difference between RAG and fine-tuning for domain adaptation?” to tradeoffs teams debate in prod for RETRIEVAL-AUGMENTED GENERATION (RAG)—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

24. What is naive RAG vs advanced RAG vs modular RAG?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. I map “What is naive RAG vs advanced RAG vs modular RAG?” to tradeoffs teams debate in prod for RETRIEVAL-AUGMENTED GENERATION (RAG)—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

25. What is a chunk in RAG? How do you decide chunk size?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. “What is a chunk in RAG? How do you decide chunk size?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

26. What is the overlap strategy in chunking? Why does it matter?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. “What is the overlap strategy in chunking? Why does it matter?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

27. What are different chunking strategies? (fixed-size, semantic, recursive, document-based)

28. What is an embedding model? How is it used in RAG?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. “What is an embedding model? How is it used in RAG?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

29. What is semantic search? How is it different from keyword search?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. “What is semantic search? How is it different from keyword search?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

30. What is hybrid search? What is BM25?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. “What is hybrid search? What is BM25?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

31. What is a reranker? How does it improve RAG quality?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. “What is a reranker? How does it improve RAG quality?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

32. What is the difference between dense and sparse retrieval?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. I map “What is the difference between dense and sparse retrieval?” to tradeoffs teams debate in prod for RETRIEVAL-AUGMENTED GENERATION (RAG)—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

33. What is HyDE (Hypothetical Document Embeddings)?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. “What is HyDE (Hypothetical Document Embeddings)?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

34. What is query expansion and query rewriting in RAG?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. “What is query expansion and query rewriting in RAG?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

35. What is multi-query retrieval?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. “What is multi-query retrieval?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

36. What is self-RAG?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. “What is self-RAG?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

37. What is CRAG (Corrective RAG)?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. “What is CRAG (Corrective RAG)?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

38. What is the context stuffing problem in RAG?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. “What is the context stuffing problem in RAG?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

39. What is the "lost in the middle" problem in RAG?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. “What is the "lost in the middle" problem in RAG?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

40. How do you evaluate a RAG pipeline? What metrics are used?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. I explain the data/control path for “How do you evaluate a RAG pipeline? What metrics are used?”: what’s cached, streamed, retried, or batched; where p99 blows up. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

41. What is RAGAs? What metrics does it measure?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. “What is RAGAs? What metrics does it measure?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

42. What is faithfulness in RAG evaluation?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. “What is faithfulness in RAG evaluation?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

43. What is answer relevance vs context relevance in RAG?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. I map “What is answer relevance vs context relevance in RAG?” to tradeoffs teams debate in prod for RETRIEVAL-AUGMENTED GENERATION (RAG)—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

44. What is GraphRAG? How does it differ from standard RAG?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. “What is GraphRAG? How does it differ from standard RAG?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

45. What is Agentic RAG?

Answer. Framing: retrieval precision/recall as the product; generation is only as honest as the evidence set. “What is Agentic RAG?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

VECTOR DATABASES & EMBEDDINGS

46. What is a vector database? How does it differ from a traditional database?

Answer. A vector database stores embeddings + metadata and serves ANN queries under filters—latency/SLO-focused. Traditional OLTP/warehouse rows aren’t built for million-point cosine neighborhoods at millisecond scales. I compare FAISS-in-postgres DIY vs managed Pinecone/Qdrant/Weaviate tradeoffs: hybrid search, multi-tenant isolation, backup/replication, and operational UI.

47. What is Approximate Nearest Neighbor (ANN) search?

Answer. Framing: ANN tradeoffs (recall vs latency), embedding drift, tenancy, and hybrid retrieval. “What is Approximate Nearest Neighbor (ANN) search?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

48. What is the HNSW (Hierarchical Navigable Small World) index?

Answer. Framing: ANN tradeoffs (recall vs latency), embedding drift, tenancy, and hybrid retrieval. “What is the HNSW (Hierarchical Navigable Small World) index?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

49. What is the IVF (Inverted File Index) in vector search?

Answer. Framing: ANN tradeoffs (recall vs latency), embedding drift, tenancy, and hybrid retrieval. “What is the IVF (Inverted File Index) in vector search?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

50. What is the difference between FAISS, Pinecone, Weaviate, Qdrant, Chroma, and Milvus?

Answer. Framing: ANN tradeoffs (recall vs latency), embedding drift, tenancy, and hybrid retrieval. I map “What is the difference between FAISS, Pinecone, Weaviate, Qdrant, Chroma, and Milvus?” to tradeoffs teams debate in prod for VECTOR DATABASES & EMBEDDINGS—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

51. How do you choose the right vector database for production?

Answer. Framing: ANN tradeoffs (recall vs latency), embedding drift, tenancy, and hybrid retrieval. I explain the data/control path for “How do you choose the right vector database for production?”: what’s cached, streamed, retried, or batched; where p99 blows up. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

52. What is cosine similarity? When do you use it over dot product or Euclidean distance?

Answer. Framing: ANN tradeoffs (recall vs latency), embedding drift, tenancy, and hybrid retrieval. “What is cosine similarity? When do you use it over dot product or Euclidean distance?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

53. What is the difference between bi-encoder and cross-encoder models?

Answer. Framing: ANN tradeoffs (recall vs latency), embedding drift, tenancy, and hybrid retrieval. I map “What is the difference between bi-encoder and cross-encoder models?” to tradeoffs teams debate in prod for VECTOR DATABASES & EMBEDDINGS—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

54. What is sentence-transformers? How are they used?

Answer. Framing: ANN tradeoffs (recall vs latency), embedding drift, tenancy, and hybrid retrieval. “What is sentence-transformers? How are they used?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

55. What is embedding drift? How do you handle it in production?

Answer. Framing: ANN tradeoffs (recall vs latency), embedding drift, tenancy, and hybrid retrieval. “What is embedding drift? How do you handle it in production?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

56. What is metadata filtering in vector search?

Answer. Framing: ANN tradeoffs (recall vs latency), embedding drift, tenancy, and hybrid retrieval. “What is metadata filtering in vector search?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

57. How do you handle multi-tenant vector storage?

Answer. Framing: ANN tradeoffs (recall vs latency), embedding drift, tenancy, and hybrid retrieval. I explain the data/control path for “How do you handle multi-tenant vector storage?”: what’s cached, streamed, retried, or batched; where p99 blows up. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

58. What is a knowledge graph? How can it augment vector search?

Answer. Framing: ANN tradeoffs (recall vs latency), embedding drift, tenancy, and hybrid retrieval. “What is a knowledge graph? How can it augment vector search?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

59. What is the role of embedding dimensionality in retrieval performance?

Answer. Framing: ANN tradeoffs (recall vs latency), embedding drift, tenancy, and hybrid retrieval. “What is the role of embedding dimensionality in retrieval performance?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

60. How do you re-index embeddings when the embedding model changes?

Answer. Framing: ANN tradeoffs (recall vs latency), embedding drift, tenancy, and hybrid retrieval. I explain the data/control path for “How do you re-index embeddings when the embedding model changes?”: what’s cached, streamed, retried, or batched; where p99 blows up. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

LLM APIs, INFERENCE & DEPLOYMENT

61. What is the difference between completion API and chat completion API?

Answer. Framing: tokens=dollars, streaming UX, batching, quantization, and SLO-aware routing. I map “What is the difference between completion API and chat completion API?” to tradeoffs teams debate in prod for LLM APIs, INFERENCE & DEPLOYMENT—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

62. What are tokens in LLM APIs? How is cost typically calculated?

Answer. Framing: tokens=dollars, streaming UX, batching, quantization, and SLO-aware routing. I bucket the list, tie each bucket to an operational concern (reliability, eval, safety), and avoid encyclopedic tone. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

63. What is the difference between input tokens and output tokens?

Answer. Framing: tokens=dollars, streaming UX, batching, quantization, and SLO-aware routing. I map “What is the difference between input tokens and output tokens?” to tradeoffs teams debate in prod for LLM APIs, INFERENCE & DEPLOYMENT—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

64. What is streaming in LLM APIs? How does Server-Sent Events (SSE) work?

Answer. Framing: tokens=dollars, streaming UX, batching, quantization, and SLO-aware routing. “What is streaming in LLM APIs? How does Server-Sent Events (SSE) work?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

65. What are the main inference parameters? (temperature, top-p, top-k, max_tokens, stop sequences)

66. What is structured output / JSON mode in LLM APIs?

Answer. Framing: tokens=dollars, streaming UX, batching, quantization, and SLO-aware routing. “What is structured output / JSON mode in LLM APIs?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

67. What is function calling / tool use in LLM APIs?

Answer. Framing: tokens=dollars, streaming UX, batching, quantization, and SLO-aware routing. “What is function calling / tool use in LLM APIs?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

68. What is the difference between OpenAI, Anthropic, Gemini, and open-source model APIs?

Answer. Framing: tokens=dollars, streaming UX, batching, quantization, and SLO-aware routing. I map “What is the difference between OpenAI, Anthropic, Gemini, and open-source model APIs?” to tradeoffs teams debate in prod for LLM APIs, INFERENCE & DEPLOYMENT—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

69. What is model context protocol (MCP)?

Answer. MCP standardizes how hosts discover and call tools/resources with schematized contracts—reducing bespoke HTTP glue and tightening security boundaries. I pitch it as an integration layer for multiple models/agents sharing the same capability surface with typed I/O instead of scraping prompts.

70. What is vLLM? How does it improve LLM inference throughput?

Answer. Framing: tokens=dollars, streaming UX, batching, quantization, and SLO-aware routing. “What is vLLM? How does it improve LLM inference throughput?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

71. What is PagedAttention in vLLM?

Answer. Framing: tokens=dollars, streaming UX, batching, quantization, and SLO-aware routing. “What is PagedAttention in vLLM?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

72. What is speculative decoding?

Answer. Framing: tokens=dollars, streaming UX, batching, quantization, and SLO-aware routing. “What is speculative decoding?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

73. What is continuous batching in LLM serving?

Answer. Framing: tokens=dollars, streaming UX, batching, quantization, and SLO-aware routing. “What is continuous batching in LLM serving?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

74. What is KV cache in Transformer inference? Why is it important?

Answer. The KV cache stores prior token keys/values during autoregressive decoding so attention over the prefix isn’t recomputed. It cuts per-step compute but grows memory with context—why long contexts need PagedAttention/quantized KV and batching strategies. Mention this when asked about p99 latency vs throughput.

75. What is the difference between greedy decoding and sampling?

Answer. Framing: tokens=dollars, streaming UX, batching, quantization, and SLO-aware routing. I map “What is the difference between greedy decoding and sampling?” to tradeoffs teams debate in prod for LLM APIs, INFERENCE & DEPLOYMENT—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

76. What is latency vs throughput tradeoff in LLM serving?

Answer. Framing: tokens=dollars, streaming UX, batching, quantization, and SLO-aware routing. I map “What is latency vs throughput tradeoff in LLM serving?” to tradeoffs teams debate in prod for LLM APIs, INFERENCE & DEPLOYMENT—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

77. What is quantization-aware inference? (GPTQ, AWQ, GGUF)

Answer. Framing: tokens=dollars, streaming UX, batching, quantization, and SLO-aware routing. “What is quantization-aware inference? (GPTQ, AWQ, GGUF)” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

78. What is Ollama? What use case does it serve?

Answer. Framing: tokens=dollars, streaming UX, batching, quantization, and SLO-aware routing. “What is Ollama? What use case does it serve?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

79. What is LM Studio?

Answer. Framing: tokens=dollars, streaming UX, batching, quantization, and SLO-aware routing. “What is LM Studio?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

80. What is the difference between self-hosted and API-based LLM deployment?

Answer. Framing: tokens=dollars, streaming UX, batching, quantization, and SLO-aware routing. I map “What is the difference between self-hosted and API-based LLM deployment?” to tradeoffs teams debate in prod for LLM APIs, INFERENCE & DEPLOYMENT—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

AI AGENTS & AGENTIC SYSTEMS

81. What is an AI agent? How is it different from a simple LLM call?

Answer. An agent is a loop: observe → plan/reason → act with tools → update memory. A plain LLM call is a one-shot function; agents orchestrate multi-step policies with side effects and recovery. Interviewers listen for timeouts, idempotency, partial observability, and escalation—not just clever prompts.

82. What is the ReAct (Reasoning + Acting) framework for agents?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. “What is the ReAct (Reasoning + Acting) framework for agents?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

83. What is the difference between a single-agent and multi-agent system?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. I map “What is the difference between a single-agent and multi-agent system?” to tradeoffs teams debate in prod for AI AGENTS & AGENTIC SYSTEMS—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

84. What are tools in the context of LLM agents?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. I bucket the list, tie each bucket to an operational concern (reliability, eval, safety), and avoid encyclopedic tone. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

85. What is tool use / function calling in agents?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. “What is tool use / function calling in agents?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

86. What is the agent loop (observe → think → act)?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. “What is the agent loop (observe → think → act)?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

87. What is planning in AI agents? What are common planning strategies?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. “What is planning in AI agents? What are common planning strategies?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

88. What is task decomposition in agents?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. “What is task decomposition in agents?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

89. What is a supervisor agent pattern?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. “What is a supervisor agent pattern?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

90. What is a hierarchical multi-agent architecture?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. “What is a hierarchical multi-agent architecture?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

91. What is an autonomous agent vs a supervised agent?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. I map “What is an autonomous agent vs a supervised agent?” to tradeoffs teams debate in prod for AI AGENTS & AGENTIC SYSTEMS—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

92. What is the difference between a stateful and stateless agent?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. I map “What is the difference between a stateful and stateless agent?” to tradeoffs teams debate in prod for AI AGENTS & AGENTIC SYSTEMS—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

93. What is memory in AI agents? What are the types of agent memory?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. “What is memory in AI agents? What are the types of agent memory?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

94. What is short-term memory vs long-term memory in agents?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. I map “What is short-term memory vs long-term memory in agents?” to tradeoffs teams debate in prod for AI AGENTS & AGENTIC SYSTEMS—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

95. What is episodic memory in agents?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. “What is episodic memory in agents?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

96. What is semantic memory in agents?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. “What is semantic memory in agents?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

97. What is procedural memory in agents?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. “What is procedural memory in agents?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

98. What is a scratchpad in agent reasoning?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. “What is a scratchpad in agent reasoning?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

99. What is the role of a working memory in agentic systems?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. “What is the role of a working memory in agentic systems?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

100. What are the common failure modes of AI agents?

101. What is agent hallucination? How do you mitigate it?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. “What is agent hallucination? How do you mitigate it?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

102. What is an agent's action space?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. “What is an agent's action space?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

103. What is the difference between a reactive agent and a deliberative agent?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. I map “What is the difference between a reactive agent and a deliberative agent?” to tradeoffs teams debate in prod for AI AGENTS & AGENTIC SYSTEMS—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

104. What is a code execution agent?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. “What is a code execution agent?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

105. What is a browser-use agent?

Answer. Framing: control loops, tool safety, partial observability, and recovery—not anthropomorphism. “What is a browser-use agent?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

ORCHESTRATION FRAMEWORKS

106. What is LangChain? What problems does it solve?

Answer. Framing: composability, state machines vs bags of chains, observability hooks. “What is LangChain? What problems does it solve?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

107. What is LangGraph? How does it differ from LangChain?

Answer. Framing: composability, state machines vs bags of chains, observability hooks. “What is LangGraph? How does it differ from LangChain?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

108. What is the concept of a chain in LangChain?

Answer. Framing: composability, state machines vs bags of chains, observability hooks. “What is the concept of a chain in LangChain?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

109. What is a LangChain agent? What are its components?

Answer. Framing: composability, state machines vs bags of chains, observability hooks. “What is a LangChain agent? What are its components?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

110. What is LlamaIndex? What is it best used for?

Answer. Framing: composability, state machines vs bags of chains, observability hooks. “What is LlamaIndex? What is it best used for?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

111. What is a query engine in LlamaIndex?

Answer. Framing: composability, state machines vs bags of chains, observability hooks. “What is a query engine in LlamaIndex?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

112. What is LlamaIndex's node parser?

Answer. Framing: composability, state machines vs bags of chains, observability hooks. “What is LlamaIndex's node parser?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

113. What is Haystack? How does it compare to LangChain?

Answer. Framing: composability, state machines vs bags of chains, observability hooks. “What is Haystack? How does it compare to LangChain?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

114. What is CrewAI? What is it used for?

Answer. Framing: composability, state machines vs bags of chains, observability hooks. “What is CrewAI? What is it used for?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

115. What is AutoGen (Microsoft)? What is its architecture?

Answer. Framing: composability, state machines vs bags of chains, observability hooks. “What is AutoGen (Microsoft)? What is its architecture?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

116. What is the difference between CrewAI and AutoGen?

Answer. Framing: composability, state machines vs bags of chains, observability hooks. I map “What is the difference between CrewAI and AutoGen?” to tradeoffs teams debate in prod for ORCHESTRATION FRAMEWORKS—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

117. What is a DAG (Directed Acyclic Graph) in agent orchestration?

Answer. Framing: composability, state machines vs bags of chains, observability hooks. “What is a DAG (Directed Acyclic Graph) in agent orchestration?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

118. What is LangSmith? What does it help with?

Answer. Framing: composability, state machines vs bags of chains, observability hooks. “What is LangSmith? What does it help with?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

119. What is the role of callbacks in LLM frameworks?

Answer. Framing: composability, state machines vs bags of chains, observability hooks. “What is the role of callbacks in LLM frameworks?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

120. What is the difference between synchronous and asynchronous LLM calls in agents?

Answer. Framing: composability, state machines vs bags of chains, observability hooks. I map “What is the difference between synchronous and asynchronous LLM calls in agents?” to tradeoffs teams debate in prod for ORCHESTRATION FRAMEWORKS—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

LLM EVALUATION & OBSERVABILITY

121. How do you evaluate an LLM-powered application?

Answer. I evaluate in layers: offline gold sets with task metrics (faithfulness for RAG, tool success for agents), LLM-as-judge with bias controls plus human audits, then online gated experiments on business KPIs with safety/latency guardrails. The strong answer admits proxy metrics lie and ties release decisions to risk, not leaderboard scores.

122. What is LLM-as-a-judge evaluation?

Answer. Framing: align the metric with costs—who pays for FP vs FN—and avoid vanity aggregates. “What is LLM-as-a-judge evaluation?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

123. What are the risks of using an LLM to evaluate another LLM?

Answer. Framing: align the metric with costs—who pays for FP vs FN—and avoid vanity aggregates. I bucket the list, tie each bucket to an operational concern (reliability, eval, safety), and avoid encyclopedic tone. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

124. What is MT-Bench?

Answer. Framing: align the metric with costs—who pays for FP vs FN—and avoid vanity aggregates. “What is MT-Bench?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

125. What is HELM (Holistic Evaluation of Language Models)?

Answer. Framing: align the metric with costs—who pays for FP vs FN—and avoid vanity aggregates. “What is HELM (Holistic Evaluation of Language Models)?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

126. What is BIG-Bench?

Answer. Framing: align the metric with costs—who pays for FP vs FN—and avoid vanity aggregates. “What is BIG-Bench?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

127. What is the MMLU benchmark?

Answer. Framing: align the metric with costs—who pays for FP vs FN—and avoid vanity aggregates. “What is the MMLU benchmark?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

128. What is TruthfulQA?

Answer. Framing: align the metric with costs—who pays for FP vs FN—and avoid vanity aggregates. “What is TruthfulQA?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

129. What is HumanEval? What does it measure?

Answer. Framing: align the metric with costs—who pays for FP vs FN—and avoid vanity aggregates. “What is HumanEval? What does it measure?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

130. What are the key metrics for evaluating RAG: faithfulness, answer relevance, context precision, context recall?

131. What is tracing in LLM observability?

Answer. Framing: align the metric with costs—who pays for FP vs FN—and avoid vanity aggregates. “What is tracing in LLM observability?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

132. What is span and trace in the context of LLM pipelines?

Answer. Framing: align the metric with costs—who pays for FP vs FN—and avoid vanity aggregates. “What is span and trace in the context of LLM pipelines?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

133. What is LangSmith used for in observability?

Answer. Framing: align the metric with costs—who pays for FP vs FN—and avoid vanity aggregates. “What is LangSmith used for in observability?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

134. What is Arize Phoenix?

Answer. Framing: align the metric with costs—who pays for FP vs FN—and avoid vanity aggregates. “What is Arize Phoenix?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

135. What is the role of feedback loops in LLM product improvement?

Answer. Framing: align the metric with costs—who pays for FP vs FN—and avoid vanity aggregates. “What is the role of feedback loops in LLM product improvement?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

136. How do you detect and monitor prompt injection in production?

Answer. Framing: align the metric with costs—who pays for FP vs FN—and avoid vanity aggregates. I explain the data/control path for “How do you detect and monitor prompt injection in production?”: what’s cached, streamed, retried, or batched; where p99 blows up. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

137. What is latency monitoring for LLM APIs?

Answer. Framing: align the metric with costs—who pays for FP vs FN—and avoid vanity aggregates. “What is latency monitoring for LLM APIs?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

138. What is cost tracking in LLM applications?

Answer. Framing: align the metric with costs—who pays for FP vs FN—and avoid vanity aggregates. “What is cost tracking in LLM applications?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

139. What is token usage monitoring?

Answer. Framing: align the metric with costs—who pays for FP vs FN—and avoid vanity aggregates. “What is token usage monitoring?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

140. How do you A/B test prompt changes in production?

Answer. Framing: align the metric with costs—who pays for FP vs FN—and avoid vanity aggregates. I explain the data/control path for “How do you A/B test prompt changes in production?”: what’s cached, streamed, retried, or batched; where p99 blows up. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

LLMOps & PRODUCTIONIZING LLM APPS

141. What is LLMOps? How is it different from MLOps?

Answer. LLMOps is MLOps for nondeterministic LLM stacks: versioning prompts/programs/retriever configs, eval harnesses for outputs, tracing, caching, routing among models, cost accounting, red-team loops, and fast rollback. Treat prompts and retrieval indices as release artifacts with tests, not config sprawl.

142. What is prompt versioning? Why is it necessary in production?

Answer. Framing: pretraining objective, adaptation, decoding, and product-grade evaluation—not parroting model sizes. “What is prompt versioning? Why is it necessary in production?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

143. What is a prompt registry?

Answer. Framing: pretraining objective, adaptation, decoding, and product-grade evaluation—not parroting model sizes. “What is a prompt registry?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

144. What is caching in LLM applications? What is semantic caching?

Answer. Framing: pretraining objective, adaptation, decoding, and product-grade evaluation—not parroting model sizes. “What is caching in LLM applications? What is semantic caching?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

145. What is GPTCache?

Answer. Framing: pretraining objective, adaptation, decoding, and product-grade evaluation—not parroting model sizes. “What is GPTCache?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

146. What are the components of an LLM production stack?

Answer. Framing: pretraining objective, adaptation, decoding, and product-grade evaluation—not parroting model sizes. I bucket the list, tie each bucket to an operational concern (reliability, eval, safety), and avoid encyclopedic tone. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

147. How do you handle rate limits from LLM APIs?

Answer. Framing: pretraining objective, adaptation, decoding, and product-grade evaluation—not parroting model sizes. I explain the data/control path for “How do you handle rate limits from LLM APIs?”: what’s cached, streamed, retried, or batched; where p99 blows up. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

148. What is a fallback strategy when an LLM API is unavailable?

Answer. Framing: pretraining objective, adaptation, decoding, and product-grade evaluation—not parroting model sizes. “What is a fallback strategy when an LLM API is unavailable?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

149. What is a model router? When would you use multiple LLM backends?

Answer. Framing: pretraining objective, adaptation, decoding, and product-grade evaluation—not parroting model sizes. “What is a model router? When would you use multiple LLM backends?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

150. What is the difference between synchronous and streaming LLM responses in a web app?

Answer. Framing: pretraining objective, adaptation, decoding, and product-grade evaluation—not parroting model sizes. I map “What is the difference between synchronous and streaming LLM responses in a web app?” to tradeoffs teams debate in prod for LLMOps & PRODUCTIONIZING LLM APPS—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

151. How do you manage secrets (API keys) in LLM applications?

Answer. Framing: pretraining objective, adaptation, decoding, and product-grade evaluation—not parroting model sizes. I explain the data/control path for “How do you manage secrets (API keys) in LLM applications?”: what’s cached, streamed, retried, or batched; where p99 blows up. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

152. What is context management in long-running LLM conversations?

Answer. Framing: pretraining objective, adaptation, decoding, and product-grade evaluation—not parroting model sizes. “What is context management in long-running LLM conversations?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

153. What is conversation summarization and why is it used in production?

Answer. Framing: pretraining objective, adaptation, decoding, and product-grade evaluation—not parroting model sizes. “What is conversation summarization and why is it used in production?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

154. What is the role of a message queue in an agentic pipeline?

Answer. Framing: pretraining objective, adaptation, decoding, and product-grade evaluation—not parroting model sizes. “What is the role of a message queue in an agentic pipeline?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

155. How do you handle retries and exponential backoff with LLM APIs?

Answer. Framing: pretraining objective, adaptation, decoding, and product-grade evaluation—not parroting model sizes. I explain the data/control path for “How do you handle retries and exponential backoff with LLM APIs?”: what’s cached, streamed, retried, or batched; where p99 blows up. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

156. What is a guardrail in LLM applications?

Answer. Framing: pretraining objective, adaptation, decoding, and product-grade evaluation—not parroting model sizes. “What is a guardrail in LLM applications?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

157. What is NeMo Guardrails?

Answer. Framing: pretraining objective, adaptation, decoding, and product-grade evaluation—not parroting model sizes. “What is NeMo Guardrails?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

158. What is Guardrails AI?

Answer. Framing: pretraining objective, adaptation, decoding, and product-grade evaluation—not parroting model sizes. “What is Guardrails AI?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

159. How do you prevent your LLM app from generating harmful content?

Answer. Framing: pretraining objective, adaptation, decoding, and product-grade evaluation—not parroting model sizes. I explain the data/control path for “How do you prevent your LLM app from generating harmful content?”: what’s cached, streamed, retried, or batched; where p99 blows up. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

160. What is PII detection and masking in LLM pipelines?

Answer. Framing: pretraining objective, adaptation, decoding, and product-grade evaluation—not parroting model sizes. “What is PII detection and masking in LLM pipelines?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

FINE-TUNING LLMs FOR PRODUCTS

161. When should you fine-tune an LLM vs use RAG vs use prompt engineering?

Answer. Prompt/in-context first when behavior fits the base model and latency/cost favor zero training; RAG when fresh or citeable knowledge dominates; fine-tune when you need consistent style/format, low-latency specialization, or proprietary behavior hard to elicit via prompts—and you can pay data + CI costs. Often combine: SFT for tone, RAG for facts.

162. What is supervised fine-tuning (SFT) of an LLM?

Answer. Framing: data hygiene, forgetting, eval deltas vs base, and deployment merge strategy. “What is supervised fine-tuning (SFT) of an LLM?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

163. What data format is used for SFT? (instruction-response pairs, chat format)

Answer. Framing: data hygiene, forgetting, eval deltas vs base, and deployment merge strategy. I answer “What data format is used for SFT? (instruction-response pairs, chat format)” with definition + production example + how I’d validate it before rollout. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

164. What is LoRA? How does it make fine-tuning affordable?

Answer. Framing: data hygiene, forgetting, eval deltas vs base, and deployment merge strategy. “What is LoRA? How does it make fine-tuning affordable?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

165. What is QLoRA? How does it further reduce memory requirements?

Answer. Framing: data hygiene, forgetting, eval deltas vs base, and deployment merge strategy. “What is QLoRA? How does it further reduce memory requirements?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

166. What is PEFT (Parameter-Efficient Fine-Tuning)?

Answer. Framing: data hygiene, forgetting, eval deltas vs base, and deployment merge strategy. “What is PEFT (Parameter-Efficient Fine-Tuning)?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

167. What is adapter tuning?

Answer. Framing: data hygiene, forgetting, eval deltas vs base, and deployment merge strategy. “What is adapter tuning?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

168. What is prefix tuning?

Answer. Framing: data hygiene, forgetting, eval deltas vs base, and deployment merge strategy. “What is prefix tuning?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

169. What is prompt tuning?

Answer. Framing: data hygiene, forgetting, eval deltas vs base, and deployment merge strategy. “What is prompt tuning?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

170. What are the risks of fine-tuning (catastrophic forgetting, overfitting)?

Answer. Framing: data hygiene, forgetting, eval deltas vs base, and deployment merge strategy. I bucket the list, tie each bucket to an operational concern (reliability, eval, safety), and avoid encyclopedic tone. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

171. How do you create a fine-tuning dataset for an LLM?

Answer. Framing: data hygiene, forgetting, eval deltas vs base, and deployment merge strategy. I explain the data/control path for “How do you create a fine-tuning dataset for an LLM?”: what’s cached, streamed, retried, or batched; where p99 blows up. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

172. What is DPO (Direct Preference Optimization)? How does it differ from RLHF?

Answer. Framing: data hygiene, forgetting, eval deltas vs base, and deployment merge strategy. “What is DPO (Direct Preference Optimization)? How does it differ from RLHF?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

173. What is ORPO (Odds Ratio Preference Optimization)?

Answer. Framing: data hygiene, forgetting, eval deltas vs base, and deployment merge strategy. “What is ORPO (Odds Ratio Preference Optimization)?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

174. What is the role of synthetic data in fine-tuning LLMs?

Answer. Framing: data hygiene, forgetting, eval deltas vs base, and deployment merge strategy. “What is the role of synthetic data in fine-tuning LLMs?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

175. How do you evaluate a fine-tuned model vs the base model?

Answer. Framing: data hygiene, forgetting, eval deltas vs base, and deployment merge strategy. I map “How do you evaluate a fine-tuned model vs the base model?” to tradeoffs teams debate in prod for FINE-TUNING LLMs FOR PRODUCTS—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

176. What is unsloth? What makes it faster for fine-tuning?

Answer. Framing: data hygiene, forgetting, eval deltas vs base, and deployment merge strategy. “What is unsloth? What makes it faster for fine-tuning?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

177. What is Axolotl?

Answer. Framing: data hygiene, forgetting, eval deltas vs base, and deployment merge strategy. “What is Axolotl?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

178. What is Hugging Face TRL library?

Answer. Framing: data hygiene, forgetting, eval deltas vs base, and deployment merge strategy. “What is Hugging Face TRL library?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

179. What is the difference between full fine-tuning and LoRA fine-tuning in terms of memory?

Answer. Framing: data hygiene, forgetting, eval deltas vs base, and deployment merge strategy. I map “What is the difference between full fine-tuning and LoRA fine-tuning in terms of memory?” to tradeoffs teams debate in prod for FINE-TUNING LLMs FOR PRODUCTS—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

180. What is merging fine-tuned adapters with a base model?

Answer. Framing: data hygiene, forgetting, eval deltas vs base, and deployment merge strategy. “What is merging fine-tuned adapters with a base model?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

MULTIMODAL AI & SPECIALIZED LLMs

181. What is a multimodal LLM? Give examples.

Answer. Framing: alignment of modalities in representation space, moderation across media, cost. “What is a multimodal LLM? Give examples.” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

182. What is vision-language model (VLM)? How does it combine image and text?

Answer. Framing: alignment of modalities in representation space, moderation across media, cost. “What is vision-language model (VLM)? How does it combine image and text?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

183. What is CLIP? How is it trained?

Answer. Framing: alignment of modalities in representation space, moderation across media, cost. “What is CLIP? How is it trained?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

184. What is GPT-4V / GPT-4o? What can it do beyond text?

Answer. Framing: alignment of modalities in representation space, moderation across media, cost. “What is GPT-4V / GPT-4o? What can it do beyond text?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

185. What is image captioning? How do LLMs approach it?

Answer. Framing: alignment of modalities in representation space, moderation across media, cost. “What is image captioning? How do LLMs approach it?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

186. What is visual question answering (VQA)?

Answer. Framing: alignment of modalities in representation space, moderation across media, cost. “What is visual question answering (VQA)?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

187. What is a text-to-image model? How does it relate to LLMs?

Answer. Framing: alignment of modalities in representation space, moderation across media, cost. “What is a text-to-image model? How does it relate to LLMs?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

188. What is a text-to-speech (TTS) model?

Answer. Framing: alignment of modalities in representation space, moderation across media, cost. “What is a text-to-speech (TTS) model?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

189. What is a speech-to-text (STT) model? (e.g., Whisper)

Answer. Framing: alignment of modalities in representation space, moderation across media, cost. “What is a speech-to-text (STT) model? (e.g., Whisper)” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

190. What is a code LLM? Give examples (Codex, CodeLlama, DeepSeek Coder).

Answer. Framing: alignment of modalities in representation space, moderation across media, cost. “What is a code LLM? Give examples (Codex, CodeLlama, DeepSeek Coder).” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

191. What is a domain-specific LLM? Give examples in medicine, law, and finance.

Answer. Framing: alignment of modalities in representation space, moderation across media, cost. “What is a domain-specific LLM? Give examples in medicine, law, and finance.” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

192. What is a small language model (SLM)? When is it preferred over a large LLM?

Answer. Framing: alignment of modalities in representation space, moderation across media, cost. “What is a small language model (SLM)? When is it preferred over a large LLM?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

193. What is on-device LLM inference? What are its constraints?

Answer. Framing: alignment of modalities in representation space, moderation across media, cost. “What is on-device LLM inference? What are its constraints?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

194. What is the difference between a chat model and an instruct model?

Answer. Framing: alignment of modalities in representation space, moderation across media, cost. I map “What is the difference between a chat model and an instruct model?” to tradeoffs teams debate in prod for MULTIMODAL AI & SPECIALIZED LLMs—not just API names. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

195. What is tool-augmented generation?

Answer. Framing: alignment of modalities in representation space, moderation across media, cost. “What is tool-augmented generation?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

AI SAFETY, ALIGNMENT & RESPONSIBLE LLM USE

196. What is AI alignment? Why is it critical for LLM products?

Answer. Alignment means building models whose behavior matches human intent and values under distribution shift and adversarial use—not only maximizing likelihood. For products: refusal policies, preference training, safety classifiers, tool constraints, and monitoring. I connect misalignment to business risk: brand, compliance, and incident response time.

197. What is Constitutional AI (Anthropic)?

Answer. Framing: misuse, jailbreaks, oversight, policy layering—tie to org risk appetite. “What is Constitutional AI (Anthropic)?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

198. What is RLHF? How does it align LLM outputs with human preferences?

Answer. Framing: misuse, jailbreaks, oversight, policy layering—tie to org risk appetite. “What is RLHF? How does it align LLM outputs with human preferences?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

199. What is reward hacking in RLHF?

Answer. Framing: misuse, jailbreaks, oversight, policy layering—tie to org risk appetite. “What is reward hacking in RLHF?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

200. What is a jailbreak? How do you protect against it?

Answer. Framing: misuse, jailbreaks, oversight, policy layering—tie to org risk appetite. “What is a jailbreak? How do you protect against it?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

201. What is prompt injection in the context of LLM agents?

Answer. Framing: misuse, jailbreaks, oversight, policy layering—tie to org risk appetite. “What is prompt injection in the context of LLM agents?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

202. What is indirect prompt injection?

Answer. Framing: misuse, jailbreaks, oversight, policy layering—tie to org risk appetite. “What is indirect prompt injection?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

203. What are content moderation APIs and when should you use them?

Answer. Framing: misuse, jailbreaks, oversight, policy layering—tie to org risk appetite. I bucket the list, tie each bucket to an operational concern (reliability, eval, safety), and avoid encyclopedic tone. I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

204. What is the OpenAI moderation endpoint?

Answer. Framing: misuse, jailbreaks, oversight, policy layering—tie to org risk appetite. “What is the OpenAI moderation endpoint?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

205. What is toxicity detection in LLM applications?

Answer. Framing: misuse, jailbreaks, oversight, policy layering—tie to org risk appetite. “What is toxicity detection in LLM applications?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

206. What are the risks of LLMs in agentic settings (irreversible actions, scope creep)?

207. What is the principle of least privilege for AI agents?

Answer. Framing: misuse, jailbreaks, oversight, policy layering—tie to org risk appetite. “What is the principle of least privilege for AI agents?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

208. What is a human-in-the-loop (HITL) system for agents?

Answer. Framing: misuse, jailbreaks, oversight, policy layering—tie to org risk appetite. “What is a human-in-the-loop (HITL) system for agents?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

209. What is output validation in LLM pipelines?

Answer. Framing: misuse, jailbreaks, oversight, policy layering—tie to org risk appetite. “What is output validation in LLM pipelines?” gets a tight product definition, a concrete example in an LLM stack, and the sharp edge teams hit (cost, security, or evaluation). I keep the ops lens: latency, tokens/cost, eval regression, and guardrails—applied interviews reward shipping judgment.

210. What are the OWASP Top 10 for LLMs?

Answer. The OWASP Top 10 for LLMs names classes like prompt injection, insecure output handling, training pipeline poisoning, sensitive data leakage, excessive agency, overreliance, and supply-chain issues. It’s a checklist for threat modeling: pair with least privilege, schema validation, egress controls, and eval-based gates before tools execute.