Text as data
Why this matters
Models ingest discrete token IDs produced by deterministic tokenizers, not raw Unicode characters. Vocabulary design trades off granularity (how aggressively subwords are merged), out-of-vocabulary robustness, and sequence-length blowup. Sparse TF-IDF bags summarize documents when deep models exceed the compute budget; dense embedding layers learn co-occurrence structure end to end inside neural stacks.
Example code
TEXT = "shipment delayed shipment refund"

def word_ids(text: str):
    """Map whitespace-split words to integer IDs, reserving 0 for <unk>."""
    words = text.lower().split()
    vocab = {"<unk>": 0}
    for w in words:
        if w not in vocab:
            vocab[w] = len(vocab)
    ids = [vocab.get(tok, 0) for tok in words]
    return ids, vocab

ids, vocab = word_ids(TEXT)
print("ids", ids, "vocab_entries", len(vocab))
How this shows up in AI
Hallucinations often correlate with brittle tokenizations of numbers or SKUs; instrument per-sample token counts when debugging perplexity regressions tied to multilingual content, as sketched below.
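A minimal sketch of that instrumentation, reusing the toy `word_ids` splitter from the example above as a stand-in for a real subword tokenizer. The sample strings are invented, and a real BPE vocab would fragment the SKU into far more pieces than whitespace splitting shows.

# Count tokens per sample to spot strings that tokenize badly.
samples = ["refund requested", "order 8471-XJ29-0038 delayed"]
for s in samples:
    sample_ids, _ = word_ids(s)
    print(len(sample_ids), "tokens:", s)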
Pretraining objectives
Why this matters
Next-token prediction (causal LM) trains weights to forecast future symbols; GPT-style chat models inherit that curriculum. Masked LM (BERT-style) reconstructs withheld tokens using bidirectional context. Instruction tuning layers assistant behavior atop the base weights: the same backbone can feel wildly different after supervised fine-tuning on curated prompt/response pairs.
Example code
# Cross-entropy toy for predicting one vocab index from logits.
import math

def softmax_vec(z):
    # Subtract the max for numerical stability before exponentiating.
    m = max(z)
    ex = [math.exp(v - m) for v in z]
    s = sum(ex)
    return [w / s for w in ex]

def nll_ce(logits, target_idx: int) -> float:
    # Negative log-likelihood of the target index under softmax(logits).
    probs = softmax_vec(logits)
    return -math.log(probs[target_idx] + 1e-12)

logits = [2.2, 1.8, 5.9, 0.0]
print("loss_masked_token", round(nll_ce(logits, 2), 3))
How this shows up in AI
Scaling laws quantify how predictably loss decreases with compute, data, and parameters, which is why frontier labs hoard GPUs and build meticulous cleaning pipelines before reaching for architecture hacks.
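A toy curve of the Kaplan/Chinchilla shape, loss(C) = a * C**(-b) + L_inf; the constants below are invented for illustration, not fitted to any real model family.

# Invented constants; the smooth, predictable decline is the point.
a, b, L_inf = 10.0, 0.05, 1.7
for compute in [1e18, 1e20, 1e22]:
    loss = a * compute ** (-b) + L_inf
    print(f"compute={compute:.0e} FLOPs -> predicted loss {loss:.3f}")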
Fine-tuning
Why this matters
A full fine-tune updates every weight: powerful, yet heavy for billion-parameter models. Parameter-efficient methods such as
LoRA attach low-rank adapters to linear layers, perturbing the frozen weights W with a small low-rank product
AB⊤ instead of rewriting the full matrix. Matching data formatting (conversation templates, masking rules) avoids silent train/serve skew.
Example code
import numpy as np

rng = np.random.default_rng(7)
W = rng.normal(scale=0.1, size=(8, 32))   # frozen base weight
A = rng.normal(scale=0.02, size=(8, 4))   # low-rank factor, r = 4
B = rng.normal(scale=0.02, size=(32, 4))  # low-rank factor, r = 4
W_eff = W + A @ B.T                       # adapted weight, same shape as W
x = rng.normal(size=(64, 8))              # a batch of 64 activations
h = x @ W_eff                             # forward pass through the adapted layer
print("delta_norm", np.linalg.norm(A @ B.T), "h_shape", h.shape)
How this shows up in AI
Specialized assistants (finance, codegen, multilingual support) routinely stack LoRA adapters on a shared base checkpoint while keeping the serving infrastructure frozen for multi-tenant hosting.
Hugging Face stack
Why this matters
The ecosystem stitches together tokenizer binaries, model config JSON, pretrained weights (safetensors/.bin), Trainer scaffolding, and Accelerate primitives for device placement.
`datasets` streams parquet/JSON at scale instead of forcing per-project loaders; inference knobs (`pipeline`, TorchScript, ONNX, bitsandbytes quantization) underpin many production gateways.
Example code
# pip install transformers accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hf-internal-testing/tiny-random-gpt2"  # tiny test checkpoint, fast to download
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
prompt = "Hi"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=16, do_sample=False)  # greedy decoding
print(tok.decode(out[0]))
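For comparison, the same tiny checkpoint through the `pipeline` entry point mentioned above; a minimal sketch, not production settings.

from transformers import pipeline

gen = pipeline("text-generation", model="hf-internal-testing/tiny-random-gpt2")
out = gen("Hi", max_new_tokens=16, do_sample=False)  # greedy, matching generate() above
print(out[0]["generated_text"])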
How this shows up in AI
Most open-weight LLMs ship first as HF repos; knowing Trainer callbacks, checkpoint layouts, and tokenizer quirks is how Path A grads graduate into Path B infra owners.
Evaluation & robustness
Why this matters
Perplexity summarizes open-ended language-modeling prowess but hides failure modes (“confident hallucinations”). Task suites (classification, QA, grounding) quantify targeted skills;
curated stress tests catch adversarial synonyms, negation swaps, and multilingual drift. Robustness work blends metrics with qualitative failure archaeology: clusters of slips reveal missing data regimes.
Example code
import math

token_negative_log_probs = [2.11, 3.97, 2.83, 1.65, 2.21]  # per-token NLL in nats
avg_nll = sum(token_negative_log_probs) / len(token_negative_log_probs)
ppl = math.exp(avg_nll)  # exp() because the NLLs are in nats, not bits
print("perplexity ~", round(ppl, 2))
How this shows up in AI
Production LLM dashboards mix automatic metrics with human pairwise preferences; evaluation design is inseparable from product risk appetite.
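A hedged sketch of the human-preference half of such a dashboard: aggregating A-vs-B votes into win rates. The vote records are invented, and real systems add rater IDs, tie handling, and confidence intervals.

# Each record: (model_a, model_b, winner); "tie" votes are excluded from the win rate.
votes = [("A", "B", "A"), ("A", "B", "B"), ("A", "B", "A"),
         ("A", "B", "A"), ("A", "B", "tie")]
wins = {"A": 0, "B": 0}
for _, _, winner in votes:
    if winner in wins:
        wins[winner] += 1
decisive = sum(wins.values())
print({model: round(w / decisive, 2) for model, w in wins.items()})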