How LLM APIs work
Why this matters
You rarely run the neural net on your laptop in production—you call an HTTP API that hides weights,
hardware, patching, and scale. Responses come back as status codes and bodies (almost always JSON). If you know
the flow—auth header, JSON payload, parsing the assistant message—you can integrate any vendor that follows similar
patterns. Most products also expose streaming (tokens arrive as they’re generated), tool calling (model asks
to run a named function), and rate limits (how many requests per minute you may send).
Example code
# pip install httpx — replace URL + model with your provider’s docs.
import json
import os
import httpx
API_URL = "https://api.example.com/v1/chat/completions" # placeholder
def call_chat(user_text: str) -> dict:
payload = {
"model": "example-small",
"messages": [
{"role": "system", "content": "Be concise."},
{"role": "user", "content": user_text},
],
"temperature": 0,
}
headers = {
"Authorization": f"Bearer {os.environ.get('API_KEY','')}",
"Content-Type": "application/json",
}
resp = httpx.post(API_URL, headers=headers, json=payload, timeout=60.0)
resp.raise_for_status()
data = resp.json()
msg = data["choices"][0]["message"]["content"]
return {"reply": msg, "raw": data}
print(json.dumps(call_chat("What is an API?"), indent=2)[: 800])
How this shows up in AI
Streaming keeps chat UIs responsive: you render partial tokens instead of waiting for the whole answer. Tools turn
the model into an orchestrator (search CRM, query SQL)—your code executes the tool and sends results back as the
next user message. Reliable apps add retries with backoff for 429/5xx and clamp timeouts so hung calls do not
block your service.
Tokens & pricing
Why this matters
Vendors bill and cap usage in tokens, not in words or characters. A token might be part of a word, a whole word,
or punctuation—the exact split depends on the model’s tokenizer. Under the hood, byte-pair encoders (BPE) repeatedly
merge frequent character pairs until the vocabulary holds tens of thousands of sub-word units, so rare strings still map to
known pieces. Special tokens mark chat roles, end-of-sequence, and vendor-specific control segments—always apply the
provider’s chat template so those markers line up with what the base model expects. Inputs and outputs usually have
separate prices; long prompts cost more before the model says anything. Each model also has a context window:
a maximum number of tokens the model can “see” in one turn (prompt plus answer combined). Passing that limit returns an error
unless you trim or summarize. Code and identifiers tokenize more densely than prose; measure real prompts with
tiktoken (OpenAI-compatible models) or the provider’s counter instead of guessing from word count.
Example code
# Rough classroom math — prod should use the provider’s tokenizer or official counter.
def chars_to_ballpark_tokens(text: str) -> int:
# Very loose rule of thumb for English-like text (~4 chars per token).
stripped = text.strip()
return max(1, len(stripped) // 4)
def dollars_for_call(input_tokens: int, output_tokens: int, price_in: float, price_out: float) -> float:
# price_* = USD per 1 million tokens from the pricing page.
return (input_tokens * price_in / 1_000_000) + (output_tokens * price_out / 1_000_000)
prompt = "Summarize this policy in three bullets…" + "x" * 2000
answer = "• Point one\n• Point two\n• Point three"
in_tok = chars_to_ballpark_tokens(prompt)
out_tok = chars_to_ballpark_tokens(answer)
bill = dollars_for_call(in_tok, out_tok, price_in=0.15, price_out=0.60)
print("ballpark tokens", in_tok, out_tok, "USD ~", round(bill, 6))
How this shows up in AI
Product teams set budgets per user or per feature using token math. Engineering checks that RAG context, tool
results, and chat history still fit the window. Batch or off-peak APIs often trade latency for lower price when
you preprocess large backlogs. Always confirm numbers against the provider’s live pricing page—rates change.
Context, temperature & sampling
Why this matters
The context window is everything in one pass: system prompt, tools, retrieved chunks, history, and the answer so far.
Larger windows help, but research shows “lost in the middle”—models attend more to the start and end of long context.
For RAG, put the sharpest evidence first and last in the stuffed context, not buried mid-prompt.
Temperature scales randomness in token sampling: near 0 for deterministic JSON, evals, and regression tests; ~0.7 for chatty
UX; high values for brainstorming only.
top_p (nucleus sampling) keeps the smallest set of tokens whose cumulative probability exceeds p—nuanced overlap with temperature, so avoid cranking both up together.
max_tokens caps generation length; you pay for actual completions, not the cap.
stop sequences cut off the assistant when it would otherwise role-play the next user turn or leak markdown fences—useful for clean extraction.
Example code
# pip install tiktoken — exact counts for OpenAI-style models
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
text = open("system_prompt.txt", encoding="utf-8").read()
n = len(enc.encode(text))
print("system prompt tokens", n)
# Guard before calling the API (pseudo budget)
MAX_IN = 120_000
if n > MAX_IN:
raise ValueError("system prompt exceeds planned budget")
How this shows up in AI
Every feature flag on “creativity” is really a sampling policy change. Structured pipelines keep temperature at 0, separate free-form reasoning
from a second “extract to schema” call when quality demands it.
Cost & efficiency
Why this matters
Model spend can grow quietly: long prompts, chatty agents, and repeated system instructions on every call add up.
Efficiency is not “being cheap”—it is shipping the same quality with fewer tokens and fewer round-trips. Common
levers: cache stable answers, compress prompts, route easy questions to smaller models, and avoid sending the same
document chunks twice in one session.
Example code
# Tiny pattern: cheap path for “easy” questions, cache for repeated exact prompts.
from functools import lru_cache
@lru_cache(maxsize=256)
def fake_model_call(prompt: str, tier: str) -> str:
# Replace with real HTTP calls—tier might be "small" vs "large".
return f"[{tier}] answer for: {prompt[:40]}…"
def route_prompt(user_text: str) -> str:
text = user_text.lower()
easy = any(k in text for k in ("define", "what is", "hello"))
tier = "small" if len(user_text) < 240 and easy else "large"
return fake_model_call(user_text.strip(), tier)
for q in ("What is an API?", "What is an API?", "Long analysis with lots of context…" * 5):
print(route_prompt(q))
How this shows up in AI
Observability dashboards often plot dollars per successful task, not tokens alone. Retrieval-heavy apps pay for
embedding calls plus the chat call—remember both. Compression (summaries of prior turns instead of raw chat logs)
preserves quality while cutting input tokens when done carefully with eval hooks.
Prompting for reliability
Why this matters
The API does not guess your product rules—you spell them out in prompts and optional safety/policy layers. Treat the
stack as a protocol: system messages carry durable configuration (role, constraints, output contract);
user messages carry the task; assistant turns are what the model has already said and may include tool results.
Structure long inputs with clear delimiters—XML-style tags like <context>/<question> help the model treat retrieved text as data, not instructions.
Some APIs let you prefill the assistant turn (for example starting with {) to bias toward JSON. Models are stateless on the wire: you resend history every call, so trim with a window, summary, or deque as chats grow.
Zero-shot relies on instructions alone; few-shot threads example input/output pairs to lock format and tone—quality beats quantity, and dynamically retrieved examples (similar queries from a bank) often beat a static list.
Chain-of-thought asks for intermediate reasoning so later tokens condition on those steps; pairing scratchpads (<thinking>…</thinking>) with a clean user-facing answer keeps UX tidy.
Self-consistency samples multiple answers then votes—expensive but useful for borderline decisions. For agents, the
ReAct loop (thought → tool call → observation) is the pattern behind most frameworks; understand it in plain code before leaning on abstractions.
Avoid vague instructions, overstuffed system prompts (>~500 tokens of rules often erodes adherence), purely negative directions (“don’t…”), ambiguous pronouns, and multi-intent single calls—chain small tasks in code instead.
When JSON fails, combine schema APIs, Pydantic validation, and bounded retries rather than hoping prose instructions alone.
Example code
# Build prompts as data structures you can reuse, log, and A/B test.
import json
def support_messages(ticket: str) -> list[dict]:
schema_note = (
"Return JSON with keys: summary (string), severity (low|med|high), "
"next_action (string). No extra keys."
)
return [
{
"role": "system",
"content": (
"You triage internal support tickets. If information is missing, set severity to low "
"and explain what is missing in next_action."
),
},
{"role": "user", "content": schema_note + "\n\nTicket:\n" + ticket},
]
sample = "VPN drops every day at 4pm for team B; no error codes captured."
payload = support_messages(sample)
print(json.dumps(payload, indent=2))
How this shows up in AI
Pair JSON mode or schema-constrained decoding with parsers, retries, and offline rubrics; add a small “judge” pass when quality gates dollars or compliance.
Structured outputs & schemas
Why this matters
Raw completions are hard to parse like an untyped API payload. Prefer generation-time constraints (JSON Schema /
structured output modes) over “please return JSON” prose alone. When those APIs exist, they stop the model from emitting
commentary or invalid field names before tokens leave the decoder. Elsewhere, pair Pydantic (or JSON Schema) with validation
and short repair prompts. Use temperature 0 for extraction. For long chain-of-thought tasks, let the model reason in one
call, then run a second, cheap call that extracts fields into a schema—structure and reasoning stay decoupled.
Example code
# Conceptual: provider-native parse + Pydantic model
from pydantic import BaseModel, Field
from typing import Literal
class Triage(BaseModel):
severity: Literal["low", "med", "high"]
summary: str = Field(max_length=280)
needs_human: bool
# Many SDKs accept `response_format=Triage` and return `.parsed`
How this shows up in AI
LangChain’s with_structured_output(MyModel) wraps the same idea for LCEL graphs—debug failures by knowing whether the model or the parser regressed.
Function calling & tool choice
Why this matters
Function calling is the portable agent primitive: the model emits a structured call; your runtime executes; results return as tool messages. Tool descriptions drive routing—say when to use and when not to use a tool. Use tool_choice to force a specific
tool, require any tool, or leave the model in auto mode. Models may issue parallel tool calls; execute with asyncio.gather and return each result with the matching tool_call_id.
Example code
# Contract shape many hosts accept (names vary slightly by vendor)
tool_meta = {
"type": "function",
"function": {
"name": "lookup_policy",
"description": (
"Search internal policy docs. Use for product and compliance questions; "
"do not use for chit-chat."
),
"parameters": {
"type": "object",
"properties": {"q": {"type": "string"}},
"required": ["q"],
},
},
}
How this shows up in AI
The same loop powers chat-with-data copilots and LangGraph ToolNode—execution stays in your codebase, not inside the weights.
Prompt injection & untrusted context
Why this matters
Any text you did not author—user input, web pages, uploaded PDFs, retrieved chunks—can try to override the system policy (“ignore previous instructions…”). Indirect injection via RAG is especially sneaky: a malicious document in the index hijacks generation after retrieval. Defend with labeled delimiters, least-privilege tool sets, structured outputs for side effects, human approval before destructive tools, rate limits, and logging. Third-party guardrail libraries can help, but architecture (no secrets in prompts, allowlists for tools) matters more than wording tricks alone.
How this shows up in AI
Treat untrusted content as data, not as instructions—state that quoted material cannot change policy—and keep adversarial strings in eval CI when stakes are high.