sharpbyte.dev
← Learning hub
Path B · applied-llm

Ship with models

APIs, tokenization and sampling, structured outputs and tools, hybrid RAG, LangGraph, agents and MCP, evaluation with LangSmith, and shipping patterns—including when to reach for fine-tuning or local models (Ollama). Bridge callouts point back to Path A where useful.

B0

Minimum viable Python

A small, practical slice of Python so you can call models, move data, and keep secrets safe. Each topic below has a plain explanation, a short example you can run, and how it connects to AI work.

Names, values, and types

Why this matters

A computer program works by remembering values under names (variables). Numbers do math; text stays in quotes; true/false values follow rules in if checks. Errors often come from mixing types (for example, treating a number like text without converting). Naming things clearly saves hours when scripts grow.

Example code
names_and_math.py
# You can rerun this file anytime; Python reads top to bottom.

model_name = "gpt-example"
max_tokens = 512
temperature = 0.2

# f-strings glue text and values in one readable line.
prompt = f"Answer in under {max_tokens} tokens using {model_name}."
print(prompt)

cost_per_1k = 0.01
approx_cost = (max_tokens / 1000) * cost_per_1k
print("Rough cost hint:", round(approx_cost, 4))
How this shows up in AI

Model names, token limits, temperature, and cost estimates are all plain variables in integration scripts. Clear names make it obvious what you are sending to an API and what you are billing for.

Lists and dictionaries

Why this matters

A list is an ordered collection (great for chat history or batches). A dictionary maps keys to values (great for settings, JSON-like records, and tool parameters). Most AI APIs expect nested dictionaries and lists, so comfort here is non-negotiable.

Example code
payload_shape.py
# A chat-style message list you will see in many LLM APIs.
messages = [
    {"role": "system", "content": "You are a careful assistant."},
    {"role": "user", "content": "Summarize this in one line."},
]

# Settings often live in a dict you pass around or load from a file.
request_body = {
    "model": "example-model",
    "messages": messages,
    "temperature": 0.2,
}

print(len(messages), "messages so far")
print(request_body["model"])
How this shows up in AI

Prompts, tool arguments, retrieval hits, and model responses are almost always dictionaries and lists. RAG pipelines pass chunks as lists; agents pass tool calls as structured dicts. Learning this shape makes API docs readable.

Functions: one job, reusable

Why this matters

A function packages a recipe: inputs, steps, outputs. Duplicated code hides bugs—when you fix a retry or a header format, you want one place to change it. Small functions also make notebooks and scripts easier to test.

Example code
helpers.py
# Type hints after colons are optional but help readers and editors.

def build_user_message(text: str) -> dict:
    return {"role": "user", "content": text.strip()}


def estimate_input_cost(chars: int, price_per_1k_chars: float) -> float:
    # Toy estimate only — real pricing uses tokens, not raw characters.
    return (chars / 1000) * price_per_1k_chars


msg = build_user_message("  Hello!  ")
print(msg)
print(estimate_input_cost(4000, 0.002))
How this shows up in AI

You will wrap “format the prompt,” “attach citations,” “normalize tool output,” and “compute token counts” in functions. Shared helpers keep LangChain-style chains and quick scripts consistent.

Loops and decisions

Why this matters

if/else chooses what to do; for repeats work over a list. Real pipelines process many files, users, or chunks—loops are how you scale a script without copy-paste. Guard clauses (early “if something is wrong, stop”) prevent bad API calls.

Example code
batch_guards.py
# Pretend these are document chunks you will embed or send to a model.
chunks = ["Short note", "", "Another paragraph"]
max_len = 20

clean = []
for piece in chunks:
    text = piece.strip()
    if not text:
        continue  # skip empties
    if len(text) > max_len:
        text = text[:max_len] + "…"
    clean.append(text)

print(clean)
How this shows up in AI

Ingestion loops over files; RAG loops over chunks; agents loop over tool steps. You use if to skip PII, cap length, route to a cheaper model, or stop when an answer is good enough.

JSON: the shape APIs speak

Why this matters

JSON is text that encodes lists and dictionaries in a standard way. LLM APIs send JSON over the wire; structured outputs (tool calls, JSON mode) return JSON strings you must parse. Python’s json module turns JSON text into live objects—and back again for saving logs.

Example code
json_roundtrip.py
import json

raw = '{"tool": "search", "args": {"q": "latest rust release"}}'

data = json.loads(raw)
print(data["tool"], data["args"]["q"])

data["args"]["top_k"] = 3
printed = json.dumps(data, indent=2, sort_keys=True)
print(printed)
How this shows up in AI

Parsing function-call payloads, validating schema-like responses, caching retrieval results, and writing trace files all use json.loads and json.dumps. Agents and RAG tooling exchange JSON constantly.

HTTP requests and secrets from the environment

Why this matters

Talking to hosted models means HTTP: your script sends a request; the server returns a status code and body. Secrets (API keys) must not live inside source files where they leak in git screenshots. Putting keys in environment variables is a simple baseline; never print keys into logs.

Example code
safe_ping.py
# pip install httpx — similar to requests but supports modern async patterns too.
import os

import httpx


def main() -> None:
    api_key = os.environ.get("EXAMPLE_API_KEY")
    if not api_key:
        raise SystemExit("Set EXAMPLE_API_KEY before running.")

    headers = {"Authorization": f"Bearer {api_key}"}
    # Example only — replace URL with your provider's health/metadata endpoint.
    resp = httpx.get("https://httpbin.org/headers", headers=headers, timeout=10.0)
    print("status", resp.status_code)


if __name__ == "__main__":
    main()
How this shows up in AI

Every OpenAI-compatible or bespoke model endpoint follows the same rhythm: headers with auth, JSON body with messages/tools, inspect status codes, parse JSON replies. Guards around missing keys stop silent failures in CI/CD.

Paths and logs you can trust

Why this matters

File paths look simple until you jump between laptops, containers, or Windows versus Mac/Linux. The pathlib module builds paths safely. Structured logging timestamps lines and separates details from user-facing prints—critical when an LLM latency spike needs diagnosis.

Example code
paths_logs.py
import logging
from pathlib import Path

# logging.basicConfig is fine for tutorials; prod uses richer handlers.
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

root = Path(__file__).resolve().parent
data_dir = root / "data"
data_dir.mkdir(exist_ok=True)

prompt_path = data_dir / "system_prompt.txt"
prompt_path.write_text("You cite sources.", encoding="utf-8")

logging.info("wrote_prompt path=%s bytes=%s", prompt_path, prompt_path.stat().st_size)
How this shows up in AI

Prompt templates, retrieval indexes, cached embeddings, evaluation sets, and trace dumps all touch the filesystem. Logs that include latency, provider, model, and correlation IDs mirror what observability stacks expect in production LLM apps (Path B — Observability previews that layer).

B1

Models as products

Hosted models behave like billed APIs: limits, tiers, prompts, and safety policies change how you ship. These topics connect day-to-day code to invoices and uptime.

How LLM APIs work

Why this matters

You rarely run the neural net on your laptop in production—you call an HTTP API that hides weights, hardware, patching, and scale. Responses come back as status codes and bodies (almost always JSON). If you know the flow—auth header, JSON payload, parsing the assistant message—you can integrate any vendor that follows similar patterns. Most products also expose streaming (tokens arrive as they’re generated), tool calling (model asks to run a named function), and rate limits (how many requests per minute you may send).

Example code
chat_once.py
# pip install httpx — replace URL + model with your provider’s docs.
import json
import os

import httpx

API_URL = "https://api.example.com/v1/chat/completions"  # placeholder


def call_chat(user_text: str) -> dict:
    payload = {
        "model": "example-small",
        "messages": [
            {"role": "system", "content": "Be concise."},
            {"role": "user", "content": user_text},
        ],
        "temperature": 0,
    }
    headers = {
        "Authorization": f"Bearer {os.environ.get('API_KEY','')}",
        "Content-Type": "application/json",
    }
    resp = httpx.post(API_URL, headers=headers, json=payload, timeout=60.0)
    resp.raise_for_status()
    data = resp.json()
    msg = data["choices"][0]["message"]["content"]
    return {"reply": msg, "raw": data}


print(json.dumps(call_chat("What is an API?"), indent=2)[: 800])
How this shows up in AI

Streaming keeps chat UIs responsive: you render partial tokens instead of waiting for the whole answer. Tools turn the model into an orchestrator (search CRM, query SQL)—your code executes the tool and sends results back as the next user message. Reliable apps add retries with backoff for 429/5xx and clamp timeouts so hung calls do not block your service.

Tokens & pricing

Why this matters

Vendors bill and cap usage in tokens, not in words or characters. A token might be part of a word, a whole word, or punctuation—the exact split depends on the model’s tokenizer. Under the hood, byte-pair encoders (BPE) repeatedly merge frequent character pairs until the vocabulary holds tens of thousands of sub-word units, so rare strings still map to known pieces. Special tokens mark chat roles, end-of-sequence, and vendor-specific control segments—always apply the provider’s chat template so those markers line up with what the base model expects. Inputs and outputs usually have separate prices; long prompts cost more before the model says anything. Each model also has a context window: a maximum number of tokens the model can “see” in one turn (prompt plus answer combined). Passing that limit returns an error unless you trim or summarize. Code and identifiers tokenize more densely than prose; measure real prompts with tiktoken (OpenAI-compatible models) or the provider’s counter instead of guessing from word count.

Example code
estimate_cost.py
# Rough classroom math — prod should use the provider’s tokenizer or official counter.


def chars_to_ballpark_tokens(text: str) -> int:
    # Very loose rule of thumb for English-like text (~4 chars per token).
    stripped = text.strip()
    return max(1, len(stripped) // 4)


def dollars_for_call(input_tokens: int, output_tokens: int, price_in: float, price_out: float) -> float:
    # price_* = USD per 1 million tokens from the pricing page.
    return (input_tokens * price_in / 1_000_000) + (output_tokens * price_out / 1_000_000)


prompt = "Summarize this policy in three bullets…" + "x" * 2000
answer = "• Point one\n• Point two\n• Point three"

in_tok = chars_to_ballpark_tokens(prompt)
out_tok = chars_to_ballpark_tokens(answer)
bill = dollars_for_call(in_tok, out_tok, price_in=0.15, price_out=0.60)
print("ballpark tokens", in_tok, out_tok, "USD ~", round(bill, 6))
How this shows up in AI

Product teams set budgets per user or per feature using token math. Engineering checks that RAG context, tool results, and chat history still fit the window. Batch or off-peak APIs often trade latency for lower price when you preprocess large backlogs. Always confirm numbers against the provider’s live pricing page—rates change.

Context, temperature & sampling

Why this matters

The context window is everything in one pass: system prompt, tools, retrieved chunks, history, and the answer so far. Larger windows help, but research shows “lost in the middle”—models attend more to the start and end of long context. For RAG, put the sharpest evidence first and last in the stuffed context, not buried mid-prompt. Temperature scales randomness in token sampling: near 0 for deterministic JSON, evals, and regression tests; ~0.7 for chatty UX; high values for brainstorming only. top_p (nucleus sampling) keeps the smallest set of tokens whose cumulative probability exceeds p—nuanced overlap with temperature, so avoid cranking both up together. max_tokens caps generation length; you pay for actual completions, not the cap. stop sequences cut off the assistant when it would otherwise role-play the next user turn or leak markdown fences—useful for clean extraction.

Example code
tiktoken_guard.py
# pip install tiktoken — exact counts for OpenAI-style models
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
text = open("system_prompt.txt", encoding="utf-8").read()
n = len(enc.encode(text))
print("system prompt tokens", n)

# Guard before calling the API (pseudo budget)
MAX_IN = 120_000
if n > MAX_IN:
    raise ValueError("system prompt exceeds planned budget")
How this shows up in AI

Every feature flag on “creativity” is really a sampling policy change. Structured pipelines keep temperature at 0, separate free-form reasoning from a second “extract to schema” call when quality demands it.

Cost & efficiency

Why this matters

Model spend can grow quietly: long prompts, chatty agents, and repeated system instructions on every call add up. Efficiency is not “being cheap”—it is shipping the same quality with fewer tokens and fewer round-trips. Common levers: cache stable answers, compress prompts, route easy questions to smaller models, and avoid sending the same document chunks twice in one session.

Example code
routing_and_cache.py
# Tiny pattern: cheap path for “easy” questions, cache for repeated exact prompts.

from functools import lru_cache


@lru_cache(maxsize=256)
def fake_model_call(prompt: str, tier: str) -> str:
    # Replace with real HTTP calls—tier might be "small" vs "large".
    return f"[{tier}] answer for: {prompt[:40]}…"


def route_prompt(user_text: str) -> str:
    text = user_text.lower()
    easy = any(k in text for k in ("define", "what is", "hello"))
    tier = "small" if len(user_text) < 240 and easy else "large"
    return fake_model_call(user_text.strip(), tier)


for q in ("What is an API?", "What is an API?", "Long analysis with lots of context…" * 5):
    print(route_prompt(q))
How this shows up in AI

Observability dashboards often plot dollars per successful task, not tokens alone. Retrieval-heavy apps pay for embedding calls plus the chat call—remember both. Compression (summaries of prior turns instead of raw chat logs) preserves quality while cutting input tokens when done carefully with eval hooks.

Prompting for reliability

Why this matters

The API does not guess your product rules—you spell them out in prompts and optional safety/policy layers. Treat the stack as a protocol: system messages carry durable configuration (role, constraints, output contract); user messages carry the task; assistant turns are what the model has already said and may include tool results. Structure long inputs with clear delimiters—XML-style tags like <context>/<question> help the model treat retrieved text as data, not instructions. Some APIs let you prefill the assistant turn (for example starting with {) to bias toward JSON. Models are stateless on the wire: you resend history every call, so trim with a window, summary, or deque as chats grow.

Zero-shot relies on instructions alone; few-shot threads example input/output pairs to lock format and tone—quality beats quantity, and dynamically retrieved examples (similar queries from a bank) often beat a static list. Chain-of-thought asks for intermediate reasoning so later tokens condition on those steps; pairing scratchpads (<thinking>…</thinking>) with a clean user-facing answer keeps UX tidy. Self-consistency samples multiple answers then votes—expensive but useful for borderline decisions. For agents, the ReAct loop (thought → tool call → observation) is the pattern behind most frameworks; understand it in plain code before leaning on abstractions.

Avoid vague instructions, overstuffed system prompts (>~500 tokens of rules often erodes adherence), purely negative directions (“don’t…”), ambiguous pronouns, and multi-intent single calls—chain small tasks in code instead. When JSON fails, combine schema APIs, Pydantic validation, and bounded retries rather than hoping prose instructions alone.

Example code
structured_prompt.py
# Build prompts as data structures you can reuse, log, and A/B test.
import json


def support_messages(ticket: str) -> list[dict]:
    schema_note = (
        "Return JSON with keys: summary (string), severity (low|med|high), "
        "next_action (string). No extra keys."
    )
    return [
        {
            "role": "system",
            "content": (
                "You triage internal support tickets. If information is missing, set severity to low "
                "and explain what is missing in next_action."
            ),
        },
        {"role": "user", "content": schema_note + "\n\nTicket:\n" + ticket},
    ]


sample = "VPN drops every day at 4pm for team B; no error codes captured."
payload = support_messages(sample)
print(json.dumps(payload, indent=2))
How this shows up in AI

Pair JSON mode or schema-constrained decoding with parsers, retries, and offline rubrics; add a small “judge” pass when quality gates dollars or compliance.

Structured outputs & schemas

Why this matters

Raw completions are hard to parse like an untyped API payload. Prefer generation-time constraints (JSON Schema / structured output modes) over “please return JSON” prose alone. When those APIs exist, they stop the model from emitting commentary or invalid field names before tokens leave the decoder. Elsewhere, pair Pydantic (or JSON Schema) with validation and short repair prompts. Use temperature 0 for extraction. For long chain-of-thought tasks, let the model reason in one call, then run a second, cheap call that extracts fields into a schema—structure and reasoning stay decoupled.

Example code
typed_parse_stub.py
# Conceptual: provider-native parse + Pydantic model
from pydantic import BaseModel, Field
from typing import Literal

class Triage(BaseModel):
    severity: Literal["low", "med", "high"]
    summary: str = Field(max_length=280)
    needs_human: bool

# Many SDKs accept `response_format=Triage` and return `.parsed`
How this shows up in AI

LangChain’s with_structured_output(MyModel) wraps the same idea for LCEL graphs—debug failures by knowing whether the model or the parser regressed.

Function calling & tool choice

Why this matters

Function calling is the portable agent primitive: the model emits a structured call; your runtime executes; results return as tool messages. Tool descriptions drive routing—say when to use and when not to use a tool. Use tool_choice to force a specific tool, require any tool, or leave the model in auto mode. Models may issue parallel tool calls; execute with asyncio.gather and return each result with the matching tool_call_id.

Example code
tool_loop_contract.py
# Contract shape many hosts accept (names vary slightly by vendor)
tool_meta = {
    "type": "function",
    "function": {
        "name": "lookup_policy",
        "description": (
            "Search internal policy docs. Use for product and compliance questions; "
            "do not use for chit-chat."
        ),
        "parameters": {
            "type": "object",
            "properties": {"q": {"type": "string"}},
            "required": ["q"],
        },
    },
}
How this shows up in AI

The same loop powers chat-with-data copilots and LangGraph ToolNode—execution stays in your codebase, not inside the weights.

Prompt injection & untrusted context

Why this matters

Any text you did not author—user input, web pages, uploaded PDFs, retrieved chunks—can try to override the system policy (“ignore previous instructions…”). Indirect injection via RAG is especially sneaky: a malicious document in the index hijacks generation after retrieval. Defend with labeled delimiters, least-privilege tool sets, structured outputs for side effects, human approval before destructive tools, rate limits, and logging. Third-party guardrail libraries can help, but architecture (no secrets in prompts, allowlists for tools) matters more than wording tricks alone.

How this shows up in AI

Treat untrusted content as data, not as instructions—state that quoted material cannot change policy—and keep adversarial strings in eval CI when stakes are high.

B2

RAG

Retrieval-Augmented Generation means: fetch trustworthy snippets from your own store, paste them beside the question, then ask the model to answer using that context. Layer in conversational rephrasing, hybrid BM25+vector retrieval, reranking, and advanced chunk strategies when accuracy plateaus.

When to RAG vs long context vs fine-tune

Why this matters

RAG fits when facts change often, when answers must cite internal docs, or when stuffing the whole corpus into one prompt is impossible or too expensive. Very long contexts help when documents are modest in size—but cost, latency, and “needle in a haystack” errors still bite. Fine-tuning teaches tone or specialized behavior; it is a poor substitute for up-to-date knowledge unless you continually retrain. Many teams combine paths: cheap RAG baseline, selective fine-tuning, occasional long-context dumps for audits.

Example code
route_by_need.py
# Policy sketch inside your server — numeric thresholds are placeholders you tune.

from dataclasses import dataclass


@dataclass
class Question:
    text: str
    needs_internal_kb: bool
    doc_tokens_estimate: int  # rough guess before retrieval
    corpus_changes_daily: bool


def pick_strategy(q: Question) -> str:
    if q.corpus_changes_daily and q.needs_internal_kb:
        return "rag"
    if q.doc_tokens_estimate < 6000 and not q.needs_internal_kb:
        return "long_context"
    return "rag_plus_evaluate_finetune_for_style"


print(pick_strategy(Question("Q3 rebate policy", True, 9000, True)))
How this shows up in AI

Product calls here are compliance (only approved sources), freshness (pricing pages), latency budgets, and the cost of re-ingesting embeddings whenever docs change—each knob affects whether RAG stays the right backbone.

Chunking strategies

Why this matters

Models see chunks, not whole PDFs—splitting decides whether the right sentences land together. Too-large chunks bury the answer in noise for embedding search; tiny chunks lose surrounding context (“this section” loses which product it meant). Overlap (repeat a few sentences between neighbors) lowers the chance an answer sits on a boundary. Structured docs deserve structure-aware splitting: headings first, tables as units, fenced code blocks kept intact where possible.

Example code
chunk_overlap.py
# Character-based splitter for demos — production often uses tokenizer counts + doc structure.

from typing import Iterator


def chunk_words(text: str, size: int, overlap: int) -> Iterator[tuple[int, str]]:
    words = text.split()
    step = max(1, size - overlap)
    for start in range(0, len(words), step):
        piece = words[start : start + size]
        if not piece:
            break
        yield start, " ".join(piece)


md = ("# Returns\nEligible within 30 days. # Shipping\nCarrier delays apply.\n" * 3)
for idx, (start, chunk) in enumerate(chunk_words(md, size=12, overlap=3)):
    print(idx, "offset", start, repr(chunk[:60]), "…")
How this shows up in AI

Chunk IDs become retrieval keys—you log which chunk/version/collection produced an answer when users challenge output in support. Re-chunking a corpus usually forces re-embedding budget planning.

Embeddings & vector search

Why this matters

An embedding model turns text into fixed-length vectors so “similar meaning” tends to land near neighbors in math space. At query time you embed the question, fetch the nearest stored chunks (often k between 5 and 50), then hand those snippets to the chat model. Metadata filters (team, locale, ACL) shrink the search universe so private docs never surface for outsiders. Keyword search still matters—combine with vectors (hybrid) when users type SKUs or rare tokens embeddings blur together.

Example code
cosine_neighbors.py
# Toy vector search — real stacks use Pinecone, pgvector, FAISS, etc.
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-9)


def top_k(query_vec: list[float], corpus: list[tuple[str, list[float]]], k: int = 2):
    scored = [(cosine(query_vec, vec), title) for title, vec in corpus]
    scored.sort(reverse=True)
    return scored[:k]


docs = [
    ("refund policy", [1.0, 0.0, 0.1]),
    ("shipping delays", [0.2, 0.9, 0.0]),
]
q = [0.95, 0.05, 0.05]
print(top_k(q, docs))
How this shows up in AI

Index choice (HNSW, IVF) trades recall vs CPU; batch embedding jobs run on schedules; query-time filters enforce auth the same way your REST API would—never trust the model to hide data the retriever should not fetch.

RAG quality

Why this matters

Bad RAG looks smart but lies with confidence. You measure retrieval (did the right chunk appear in the top set?) and generation (did the model stay grounded?). Rerankers rescore a wider candidate list with a heavier model to push the best paragraph to the top. Citations force the assistant to point at chunk IDs or quotes users can verify. When nothing matches, the product should say “not in the knowledge base” instead of inventing—an explicit abstain path.

Example code
recall_spot_check.py
# Recall@k for a labeled query→gold_chunk_id dataset (minimal illustration).

def recall_at_k(rows: list[tuple[str, str, list[str]]], k: int) -> float:
    hits = 0
    for _, gold_id, retrieved_ids in rows:
        top = retrieved_ids[:k]
        hits += int(gold_id in top)
    return hits / len(rows) if rows else 0.0


labeled = [
    ("q1", "chunk_12", ["chunk_07", "chunk_12", "chunk_03"]),
    ("q2", "chunk_99", ["chunk_99", "chunk_01", "chunk_02"]),
]
print("Recall@" + str(2), recall_at_k(labeled, 2))
How this shows up in AI

Human graders spot-check brittle domains; dashboards track regression when embeddings or chunks change. Quality work is iterative: tighten prompts, widen k, rerank, or fix chunking—not only swap LLMs.

RAG architecture

Why this matters

Shipping RAG spans ingestion (crawl/upload → clean → chunk → embed → index), serving (retrieve + prompt + guardrails), and governance (ACLs, versioning, auditing). Broken ingestion poisons answers quietly; staleness hurts trust but refetch spikes cost. Auth-aware retrieval means the embedding index stores tenant metadata and filters queries the same way your database would.

Example code
ingest_outline.py
# State machine vibe for ingestion jobs — each step emits logs/metrics.
from typing import Literal

Stage = Literal["fetched", "parsed", "chunked", "embedded", "indexed"]


def advance(current: Stage) -> Stage:
    order = {"fetched": "parsed", "parsed": "chunked", "chunked": "embedded", "embedded": "indexed"}
    return order[current]


doc = {"id": "wiki/returns", "version": 17, "acl_role": "public", "stage": "fetched"}
while doc["stage"] != "indexed":
    doc["stage"] = advance(doc["stage"])
print(doc)
How this shows up in AI

Incident response replays retrieval decisions (which chunk, which model version); blue/green index swaps prove safe rollout; feature flags isolate new chunking rules—your architecture should make those switches boring and observable.

Conversational RAG & query rephrasing

Why this matters

Follow-up questions (“How long does it take?”) lack the subject from earlier turns, so naive vector search matches the wrong neighborhood. Fix it with a standalone question step: an LLM call that rewrites the latest user message using chat history into a self-contained query, then embed/search against that text. LangChain’s create_history_aware_retriever + create_retrieval_chain encode the same two-stage pattern—rephrase for retrieval, then answer with citations.

How this shows up in AI

Pair with session stores (Redis, Postgres) or RunnableWithMessageHistory so tenant-scoped session IDs keep memory bounded.

Hybrid search — BM25 + vectors

Why this matters

Dense embeddings excel at paraphrase; BM25 nails SKU-like tokens, error codes, and names with little semantic overlap. Hybrid retrieval runs both rankers and merges lists—reciprocal rank fusion (RRF) is a robust merge without fragile score normalization. In LangChain, EnsembleRetriever with weighted retrievers approximates the pattern; hosted vector DBs (for example Qdrant) may offer server-side hybrid for speed.

How this shows up in AI

Expect measurable recall gains on enterprise corpora; tune weights on a labeled eval set rather than by intuition alone.

Reranking, HyDE & parent–child chunks

Why this matters

Two-stage retrieve → rerank: pull 15–30 candidates with fast bi-encoders, then score query–passage pairs with a cross-encoder or hosted rerank API—precision jumps because the scorer sees both sides jointly. HyDE asks the LLM for a short hypothetical answer, embeds that text, and searches with it—helpful when queries are tiny and docs are formal; skip when hallucinated hypotheticals could derail retrieval. Parent–child chunking indexes small child chunks for recall but feeds parents to the LLM for context; multi-vector retrievers can also index summaries or synthetic questions per chunk for better recall diversity.

How this shows up in AI

These techniques pair with metrics like RAGAS (faithfulness, answer relevance) in CI—prove value on your own golden pairs before stacking complexity.

B3

LangChain & LangGraph

Frameworks package common LLM patterns—LCEL pipes, templates with history slots, tool calls, and graphs with branches, checkpoints, and loops. Use them when repetition outgrows bespoke scripts; escape to plain code when the framework fights your design.

LangChain building blocks

Why this matters

LCEL treats every step as a Runnable with the same surface (invoke, batch, stream, async twins). The pipe operator composes prompt | llm | parser like Unix streams. RunnableParallel fans out—for example RAG that needs context from a retriever and question from a passthrough—while RunnablePassthrough forwards raw inputs. RunnableLambda wraps ordinary functions so bespoke transforms stay composable. Use .stream()/.astream() to tunnel tokens through the whole stack without one-off wiring.

Example code
runnable_shapes.py
# Conceptual shape without imports — mirrors LCEL-style composition.


def prompt(user: str) -> str:
    return f"Answer in bullets under 80 words:\n{user}"


def fake_llm(text: str) -> str:
    return '{"bullets":["one","two"]}'  # pretend JSON reply


def pipeline(user: str) -> str:
    return fake_llm(prompt(user))


print(pipeline("What is RAG?"))
How this shows up in AI

Production apps chain retrievers + rerankers + LLM + guardrails; frameworks attach callbacks for logging token usage and tool calls—pair that with your observability stack so each step is a named span.

LangGraph

Why this matters

Some flows are not straight lines: escalate to a human, loop until a constraint passes, retry with a different prompt, or route by language. LangGraph models steps as a graph with explicit state (TypedDict plus reducers like add_messages); nodes read state, do work, emit updates. Compile with MemorySaver for dev, PostgresSaver (or SQLite) for durable threads keyed by thread_id. interrupt_before gates dangerous nodes; recursion_limit caps cycles; the Send API fans out parallel workers for map-reduce style workloads. Cycles are allowed for iterative reasoning but need hard bounds so spend stays predictable.

Example code
graph_state_stub.py
# Minimal state bag + router — LangGraph adds persistence and real graph APIs on top.
from typing import TypedDict


class ChatState(TypedDict):
    question: str
    retrieved: list[str]
    verdict: str


def retrieve(state: ChatState) -> ChatState:
    state["retrieved"] = ["chunk_a", "chunk_b"]
    return state


def answer_or_retry(state: ChatState) -> str:
    ok = len(state["retrieved"]) >= 1
    return "generate" if ok else "retrieve_again"


s: ChatState = {"question": "pricing?", "retrieved": [], "verdict": ""}
s = retrieve(s)
print("router →", answer_or_retry(s))
How this shows up in AI

Human-in-the-loop review for refunds, legal approvals, or moderation hits map cleanly to interrupts—then the graph resumes with the reviewer’s note as new state.

Orchestration patterns

Why this matters

Supervisor: one model delegates subtasks to specialists (router-style). Map-reduce: fan-out summaries over many documents, then collapse into one answer with strict token caps. Structured extraction: first pass classifies, second pass pulls JSON fields—cheaper than one giant prompt when documents are long. Choose patterns that match latency budgets and observable failure points.

Example code
map_reduce_words.py
# Toy map-reduce text compression — production uses real chunk summarization + cite maps.


def summarize_piece(text: str) -> str:
    return "summary:" + text[:40] + "…"


docs = ["long doc A …" * 10, "long doc B …" * 10]
partials = [summarize_piece(d) for d in docs]
final = "MERGE: " + " | ".join(partials)
print(final)
How this shows up in AI

Customer-facing copilots often mix these patterns: classify intent → branch to RAG vs transactional tool → merge with brand voice—tracing each branch ID makes support tickets debuggable.

Prompt templates & session memory

Why this matters

ChatPromptTemplate keeps system/human prompts versioned; MessagesPlaceholder injects rolling history without string surgery. LLMs remain stateless at the HTTP layer—your service stores chat turns, optionally in Redis/SQL. Wrap chains with RunnableWithMessageHistory or trim with token-aware utilities so sessions never overflow pricing or context windows. Summary memory (compress older turns) is just another LLM hop you schedule when budgets bite.

How this shows up in AI

Multi-tenant SaaS should namespace session IDs (tenant:user:session) and TTL ephemeral stores—mirrors patterns you already use for web sessions.

B4

Agentic AI

Agents decide which tools to call and when to stop—via classic executors or LangGraph—with room for MCP-backed integrations and multi-agent routing. That flexibility costs tokens, latency, and safety surface—design budgets like you would for any external API loop.

Agents vs fixed workflows

Why this matters

A fixed workflow is scripted: if step A then step B—predictable, easy to test. An agent chooses actions from a menu until it believes it is done—dynamic control flow instead of a designer-fixed chain. Use agents when the path varies a lot; stick to workflows when correctness and auditability beat autonomy. Watch for failure modes: infinite loops, hallucinated tool names, goal drift, and context bloat from oversized observations; mitigate with schemas, allowlists, and truncation. Every agent needs stop rules: maximum steps, maximum wall time, maximum spend, and escape hatches when tools fail repeatedly.

Example code
bounded_loop.py
# Guard rails around a toy tool loop — swap tool executor for real HTTP/SQL with auth.


def tool_search(query: str) -> str:
    return "doc:Refund within 30 days" if "refund" in query.lower() else "no_hit"


def agent_like_turn(user: str, max_steps: int = 3) -> str:
    steps = 0
    context = []
    q = user
    while steps < max_steps:
        hit = tool_search(q)
        context.append(hit)
        if hit != "no_hit":
            return "\n".join(context)
        q += " (please broaden query)"
        steps += 1
    return "gave_up_after_budget"


print(agent_like_turn("refund policy"))
How this shows up in AI

Customer automation often blends both: deterministic checkout for purchases, lightweight agent only for ambiguous support tickets—with the agent never allowed to finalize money movement without human approval.

Tool design

Why this matters

Tools are contracts: JSON arguments in, predictable JSON-ish results out—plus HTTP status semantics if remote. Good tools are idempotent when possible (same args → same effect), return actionable errors (“row locked, retry in 2s”), and avoid surprise side effects the model cannot see. Describe each tool to the model with tight parameter schemas so junk calls fail fast in validation, not in production data.

Example code
tool_schema.py
# Example tool definition you pass to many OpenAI-style tool slots.

lookup_customer = {
    "type": "function",
    "function": {
        "name": "lookup_customer",
        "description": "Fetch CRM record by email; read-only.",
        "parameters": {
            "type": "object",
            "properties": {
                "email": {"type": "string", "format": "email"},
            },
            "required": ["email"],
            "additionalProperties": False,
        },
    },
}

print(lookup_customer["function"]["name"])
How this shows up in AI

Allowlists at the gateway block tools in dev/stage; rate limits per tool protect shared databases; redact PII from tool logs while keeping enough context to debug misfires.

LangChain executors & streaming steps

Why this matters

AgentExecutor wraps a policy (classic ReAct text parsing or modern tool-calling) and runs the observe → act loop with limits: max_iterations, wall-clock caps, and parsing-error recovery. Prefer create_tool_calling_agent when the model supports native function calls—fewer brittle string parsers. Stream UX via astream_events(..., version="v2") to surface “tool started / tool finished” to clients. When flows need checkpoints, guards per node, or subgraphs, migrate the same tools into LangGraph instead of growing executor kludge.

How this shows up in AI

Understanding the executor loop explains what LangGraph replaced: explicit graph state plus persistence instead of an opaque while-loop.

Model Context Protocol (MCP)

Why this matters

MCP standardizes how tools, resources, and prompts surface to agents—stdio or HTTP/SSE transports, one server many clients. Build internal MCP servers for CRM, warehouses, or ticketing; connect them through LangChain adapters so LangGraph agents auto-discover tools instead of hand-written wrappers for every API.

How this shows up in AI

Think “USB-C for enterprise integrations”: share a MCP server across chat clients, IDEs, and batch jobs with consistent auth and auditing.

Planning & control

Why this matters

ReAct-style loops interleave reasoning text with tool calls; plan–execute–verify writes a checklist first to reduce thrash. After any tool failure, surface the exception to the model in a normalized message and optionally switch to safer tools. Control means you own the supervisor policy: escalate, retry with backoff, or refuse when confidence drops.

Example code
plan_then_act.py
# Two-phase mental model mirrored in prompts you send to models.

def plan_prompt(task: str) -> str:
    return (
        "List 3 concrete steps before using tools."
        + " After each step, mark done/partial."
        + f"\nTask: {task}"
    )


def act_prompt(task: str, plan: str) -> str:
    return f"Execute this plan with tools only when needed.\nPlan:\n{plan}\nTask:\n{task}"


task = "Ship replacement for order 7781 if policy allows"
plan = "1) fetch order 2) check policy 3) create ticket"
print(plan_prompt(task))
print(act_prompt(task, plan))
How this shows up in AI

Recovery templates (“tool X failed: network — ask user to retry”) keep loops from spiraling; verification can be a smaller model that checks JSON output against business rules before commit.

Memory for agents

Why this matters

Short-term memory is the recent chat window and scratchpad summaries. Long-term memory stores stable facts (preferences, project IDs) across sessions—usually in a database you control, not only in the model weights. Any memory path is a PII risk: scrub before write, encrypt at rest, and let users delete their history. Entity-centric stores help the agent remember “this user prefers metric units” without replaying full transcripts every call.

Example code
memory_store.py
# Tiny JSON-on-disk stand-in for a real vector + SQL profile store.
import json
from pathlib import Path


def remember(user_id: str, key: str, value: str, base: Path) -> None:
    path = base / f"{user_id}.json"
    data = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    data[key] = value
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")


root = Path("/tmp/agent_mem_demo")
root.mkdir(exist_ok=True)
remember("u42", "timezone", "Asia/Kolkata", root)
print((root / "u42.json").read_text(encoding="utf-8"))
How this shows up in AI

Session summarization policies (what to keep vs drop) should be versioned with your prompt—otherwise old memories contradict new product rules after a policy change.

Multi-agent (light)

Why this matters

Multiple roles (researcher, critic, writer) can improve quality but multiply failure modes: conflicting instructions, duplicate tool calls, and ping-pong latency. Common patterns: a supervisor router that delegates to specialists; a swarm where peers hand off via a dedicated tool; or hierarchical stacks for very large missions. Higher-level kits like CrewAI encode role/goal/backstory templates—fast to prototype, but LangGraph wins when you need durable checkpoints and custom branching. Keep handoffs explicit: one agent outputs a structured object the next must parse; cap fan-out width so many workers do not hammer the same APIs. Shared memory needs access control so one tenant never reads another’s scratchpad.

Example code
handoff_payload.py
# Contract between two agent roles — JSON keeps boundaries crisp.
import json

handoff = {
    "from_role": "research",
    "to_role": "writer",
    "facts": ["SLA = 99.9%", "data region = EU"],
    "citations": ["doc/ops#uptime"],
    "open_questions": [],
}
print(json.dumps(handoff, indent=2))
How this shows up in AI

Start with a single supervisor before true peer agents—most products win more from better tools and evals than from cast size.

B5

Operating LLM apps

Shipping is the easy part—staying healthy means evaluations that regress prompts before users notice, traces you can read, guardrails that fail closed, and drills when models silently drift. This block is about running systems customers trust.

Observability

Why this matters

Traces stitch one user request across retrievals, model calls, rerankers, and tools. Spans are the substeps with start time, status, and tags (model name, token counts, cache hit). Logging policy matters: never log raw prompts with secrets or unmasked PII. Pair metrics (p95 latency, error rate, dollars per success) with qualitative review when quality regresses after a prompt or model change without a code deploy.

Example code
trace_event.py
# Structured log line compatible with many trace collectors (pseudo OpenTelemetry-ish fields).
import json
import time


def emit_span(name: str, trace_id: str, attrs: dict) -> None:
    evt = {
        "severity": "INFO",
        "name": name,
        "trace_id": trace_id,
        "ts": time.time(),
        "attrs": attrs,
    }
    print(json.dumps(evt))


emit_span(
    "llm.chat",
    trace_id="4f2c…",
    attrs={"model": "example-pro", "latency_ms": 842, "prompt_hash": "sha256:…", "finish_reason": "stop"},
)
How this shows up in AI

Alerts on rising tool error rates catch broken integrations before users do; SLO dashboards separate first-token latency from full completion latency for streaming apps.

Evaluation, datasets & LangSmith

Why this matters

LLM apps degrade silently—new prompts, embeddings, or model versions can tank quality without stack traces. Build a golden dataset (dozens to hundreds of real prompts with reference answers) and run unit evals on retrievers, integration evals on RAG stacks, and periodic end-to-end suites on releases. Combine cheap checks (JSON schema pass rate, regex, precision@k) with LLM-as-judge scoring using a strong model at temperature 0. Frameworks like RAGAS summarize faithfulness and context relevance for retrieval pipelines.

LangSmith (and similar APM-for-LLM products) gives trace trees for every chain/graph run, hosts versioned datasets, and automates experiment comparisons when you change prompts or models—set LANGCHAIN_TRACING_V2=true, project keys, and tag spans with tenant_id/deploy_id so you can slice production failures.

How this shows up in AI

Gate merges when offline metrics slip beyond a threshold; sample live traffic into audit queues to grow the golden set from real incidents.

Guardrails

Why this matters

Guardrails are policy layers around the stack: block disallowed topics, validate JSON before side effects, constrain tool calls to an allowlist, and run moderation on user input and model output. They should fail closed on uncertainty for risky actions (payments, data export) and degrade gracefully for low-risk chat. Add cost controls for agents—per-run token budgets, circuit breakers on flapping tools, and sanitation on lengths/known injection phrases. Tests should include adversarial strings and known jailbreak patterns relevant to your audience.

Example code
allowlist_tools.py
# Fail closed if the model names a tool outside the session allowlist.

def sanitize_tool_call(name: str, allow: set[str]) -> str:
    if name not in allow:
        raise PermissionError(f"tool {name} not enabled for this session")
    return name


session_allow = {"search_kb", "create_ticket"}
print(sanitize_tool_call("search_kb", session_allow))
How this shows up in AI

Offline eval sets trigger guard regressions in CI when prompts change; online shadow mode runs new filters without impacting customers until metrics look safe.

Deployment & async ingestion patterns

Why this matters

AI APIs need the same discipline as other microservices: multi-stage Docker images that cache embeddings weights, secrets from a manager—not baked into layers—and split /health/live vs /health/ready so load balancers wait for vector stores to warm. Long LLM streams mean higher idle timeouts on gateways (often 60s defaults break SSE). Event buses shine for async ingestion (raw doc topic → embed consumer → index notification) and for kicking off agents without blocking HTTP—Kafka backpressure even helps throttle spend.

How this shows up in AI

Tag cloud spend per feature, mirror structured request logs with retrieval scores, and rehearse rollbacks for embedding model swaps—they change geometry in vector stores.

Capstone — copilot with RAG + LangGraph + tool

Why this matters

The capstone forces you to connect the syllabus: ingestion + retrieval, a branching graph or workflow, tools with real safeguards, and observability you would not be embarrassed to show in an incident review. Ship a thin vertical slice—not a dissertation—then iterate on retrieval quality before chasing novelty features.

Example code
capstone_checklist.py
# Executable checklist you can tick while building the capstone project.

checks = [
    "ingest + version docs",
    "retrieval + citations in prompt",
    "graph or explicit router with max steps",
    '>=1 tool with schema validation',
    "trace ids across LLM + tools",
    "basic guard on tool allowlist",
]
for i, c in enumerate(checks, start=1):
    print(f"{i:02d}. [ ] {c}")
How this shows up in AI

Treat the capstone like a portfolio piece: README with threat model, cost estimate per 1k questions, and a short eval table (baseline vs your system on ten real tasks). Full brief: capstone brief.

B6

Multimodal usage

Models can consume images, audio, or PDF bytes—not only typed text. The engineering work shifts to ingestion, resizing, redacting sensitive pixels, and choosing between OCR pipelines versus native vision APIs.

Using VLMs & files

Why this matters

A VLM (vision-language model) reasons about pixels and text together—useful for UI screenshots, diagrams, or photos of handwritten notes. Uploaded PDFs may be handled as images per page, extracted text, or hybrid. OCR shines on clean scans and spreadsheets; native vision can outperform OCR on charts or photos but costs more. Pick a modality router: send clear text transcripts to the cheaper text model, and route only ambiguous or visual pages to a VLM.

Example code
content_part_stub.py
# Typical chat payload shape mixing text plus an image reference (names vary by vendor).


def user_message_multimodal(question: str, image_ref: str) -> dict:
    return {
        "role": "user",
        "content": [
            {"type": "input_text", "text": question},
            {"type": "input_image", "image_url": image_ref},
        ],
    }


print(user_message_multimodal("What error does this dialog show?", "https://cdn.example.com/ui.png"))
How this shows up in AI

Preprocess images (max resolution, strip EXIF that might leak GPS), virus-scan uploads, and review whether internal screenshots may be sent to external APIs under your compliance rules.

Stretch — multimodal agent

Why this matters

A screenshot-driven agent closes the loop from “see UI” → “describe state” → “query runbooks” → “maybe call tool” with citations. The difficulty is rarely the demo—it is reliable evaluation (does blur or cropping kill accuracy?) and safety (screenshots leak tokens, names, HIPAA). Budget human review paths when this flow touches money or privilege changes.

Example code
multimodal_pipeline.py
# Linear pipeline placeholders — swap each step with your real VLM/RAG/tool connectors.

def describe_ui(image_ref: str) -> str:
    return "[VLM caption] Fatal error FK-442 on Save"


def rag_answer(caption: str) -> str:
    return "[RAG] FK-442: retry after cache purge — support/442"


def maybe_tool(caption: str) -> str | None:
    return '{"tool":"flush_cache"}' if "FK-442" in caption else None


cap = describe_ui("uploads/snap.png")
ans = rag_answer(cap)
print(cap, "-->", ans, maybe_tool(cap))
How this shows up in AI

Pair this stretch with Path A multimodal reading for encoder tradeoffs; keep an audit log tying image IDs to retrieval chunks and tool calls for post-incident review.

B7

Open weights, fine-tuning & local serving

When API bills pinch or you need air-gapped deployments, LoRA-tuned open models and local runners complement hosted RAG—after you have evals proving when the tradeoff is worth it.

Prompting vs RAG vs fine-tuning

Why this matters

Start with prompting and RAG: prompts steer format and tone; RAG supplies facts that change faster than training cycles. Fine-tuning teaches style, narrow reasoning habits, or fits a smaller model to a task—it is a weak lever for fresh factual knowledge compared with retrieval. Reach for it when you need persistent behavior without giant system prompts, latency/cost reduction on a focused task, or proprietary phrasing that refuses to stick with few-shot demos alone.

LoRA / QLoRA essentials

Why this matters

LoRA freezes base weights and trains tiny low-rank adapters on attention projections—far less VRAM and storage than full fine-tunes. QLoRA quantizes the base model (often 4-bit) so consumer or single-GPU servers can host adaptation. Expect to invest in curated instruction/response pairs, monitoring eval loss for overfitting, and merging adapters for deployment when you want zero PEFT dependencies at inference.

How this shows up in AI

Pair adapter training with the same golden-task metrics you use for RAG so you do not chase loss curves that mislead on real customer queries.

Ollama & OpenAI-compatible local APIs

Why this matters

Ollama packages pull/run flows for open models and exposes an OpenAI-compatible HTTP surface—swap base URLs in LangChain (ChatOllama, OllamaEmbeddings) to run dev, CI, or on-prem workloads without cloud tokens. Modelfile lets you bake system instructions or point at merged weights from QLoRA runs.

How this shows up in AI

Great for dry-runs and sensitive data residency; still invest in the same observability and safety layers—local models remain prompt-injection surfaces.