Path A · foundations-ml

ML, DL, NLP, LLM depth

A0 is the Python-for-AI spine: syntax through async, NumPy and Pandas, generators, standard-library tooling, and production-safe configuration—the detailed checklist behind a typical multi-topic fast-track, expressed topic-by-topic without a rigid weekly calendar. A1–A5 deepen probability and classical ML, deep learning, Transformers and NLP stacks, production ML, and multimodal systems—not only calling APIs. For cross-track links see bridge topics.

A0

Python for ML & LLM engineering

The Python spine before modeling: syntax and collections, functions and decorators, OOP patterns, environments and imports, pathlib plus JSON/JSONL corpora, errors and retries, typing with Pydantic, async IO for APIs, NumPy and Pandas for tensors and tables, generators for streaming, logging-heavy stdlib habits, and typed configuration with secrets hygiene—then A1 picks up probability and classical ML.

Python for ML code

Why this matters

Training scripts are miniature products—pinned deps avoid drift, portable paths unblock CI, structured logging survives on-call paging, and one smoke test detects silent tensor-shape bugs teammates might miss.

Example code
train_hygiene.py
# Seeds + portable paths + JSONL metrics—the spine of reproducible labs.
import json
import logging
import random
from pathlib import Path

import numpy as np


def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)


def main() -> None:
    logging.basicConfig(level=logging.INFO)
    root = Path(__file__).resolve().parent
    out = root / "runs/demo"
    out.mkdir(parents=True, exist_ok=True)
    set_seed()
    row = {"loss": 0.31, "step": 120}
    (out / "metrics.jsonl").write_text(json.dumps(row) + "\n", encoding="utf-8")
    logging.info("wrote_metric path=%s", out / "metrics.jsonl")


if __name__ == "__main__":
    main()
How this shows up in AI

Fine-tunes, RLHF-ish preference jobs, offline eval shards, or batch distillations all reuse this hygiene—swap the execution backend (hosted HTTP APIs vs local GPUs) atop the same layout.
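The "one smoke test" mentioned above can be a single shape assertion run before any long job. A minimal sketch, with a hypothetical filename and a stand-in forward pass (none of this comes from the checklist itself):

shape_smoke_test.py
# Fail fast on tensor-shape bugs before burning GPU hours.
import numpy as np


def forward_stub(batch: np.ndarray) -> np.ndarray:
    # Stand-in for a real model call: one 3-way logit row per input row.
    weights = np.zeros((batch.shape[1], 3))
    return batch @ weights


def test_output_shape() -> None:
    logits = forward_stub(np.zeros((4, 16)))
    assert logits.shape == (4, 3), f"unexpected logits shape {logits.shape}"


if __name__ == "__main__":
    test_output_shape()
    print("smoke_test_passed")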

NumPy

Why this matters

Deep learning boils down to tensor algebra. Vectorization moves work into fast kernels; broadcasting aligns shapes without painful copies—think adding per-class biases across an entire minibatch in one shot. Comfort with reductions and axes maps straight to softmax, pooling, normalization, and the gradient norms you read in training logs.

Example code
numpy_shapes.py
import numpy as np

scores = np.array([[1., 2., 3.], [1., 1., 10.]])  # batch × vocab slice
bias = np.array([0.1, -0.2, 0.05])
print("per-row sums", (scores + bias).sum(axis=1))

W = np.array([[0.4], [1.2]])
X = np.array([[1., 0.], [2., -1.]])
print("linear", X @ W)
How this shows up in AI

Attention logits, softmax normalizers, cosine similarity for embeddings, and rolling token statistics in analysis notebooks—all read like NumPy broadcasts once the habit sticks.
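Two of the patterns named above, sketched minimally in NumPy (the filename and array values are illustrative): a row-wise softmax via an axis reduction, and cosine similarity between two embedding vectors.

softmax_cosine.py
import numpy as np

logits = np.array([[1., 2., 3.], [2., 2., 2.]])
# Row-wise softmax: subtract the per-row max, exponentiate, normalize along axis=1.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
print("row_sums", probs.sum(axis=1))

a = np.array([0.3, -1.2, 0.7])
b = np.array([0.1, -0.9, 1.1])
print("cosine_similarity", float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))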

Pandas

Why this matters

Labeled data usually lands as spreadsheets with missing cells, merges, and categorical codes—you must sanitize it before building tensors. The lethal mistake is split leakage (the same user/session appearing in both train and validation), which turns metrics into fantasies—design splits with explicit grouping keys from day zero.

Example code
group_splits.py
# pip install pandas — keep entire groups on one fold.
import pandas as pd

vals = pd.Series(range(12))
df = pd.DataFrame({
    "user": ["A"] * 6 + ["B"] * 6,
    "feat": vals,
    "label": [1, 0] * 6,
})

df["split"] = df["user"].map({"A": "train", "B": "val"})
print(df.groupby("split")["label"].mean())
How this shows up in AI

Dialogue corpora grouped by mailbox, SKU-level storefront tables, toxicity audit sheets, telemetry exports—anything with repeated entity IDs needs the same group-aware discipline before RAG ingestion or evaluator slices.
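scikit-learn also ships group-aware splitters that generalize the manual mapping above; a minimal sketch with GroupShuffleSplit (the users, sizes, and filename are placeholders):

group_shuffle_split.py
# pip install scikit-learn pandas; hold out whole users, never individual rows.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "user": ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3,
    "label": [0, 1, 0] * 4,
})

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(splitter.split(df, groups=df["user"]))
print("train_users", sorted(df.loc[train_idx, "user"].unique()))
print("val_users", sorted(df.loc[val_idx, "user"].unique()))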

A1

Math and classical ML

Probabilistic thinking plus classical baselines clarify whether deep models earn their complexity—or only memorize noise.

Probability & statistics

Why this matters

Loss curves are noisy random processes; hypotheses need confidence—not vibes. Sampling variance explains why rerun A beats rerun B without any code change; confusion matrices decompose failures into skewed recalls; calibration asks whether a “70% probable” headline is honest over thousands of forecasts.

Example code
simulate_and_ci.py
import numpy as np


def bernoulli_draws(p: float, trials: int, repeats: int = 8000) -> tuple[float, float]:
    rng = np.random.default_rng(0)
    ratios = rng.binomial(trials, p, repeats) / trials
    return float(ratios.mean()), float(ratios.std())


mean, sd = bernoulli_draws(0.55, trials=200)
print(f"empirical_hit_rate_mu={mean:.3f} sigma={sd:.3f}")
How this shows up in AI

Held-out perplexity deltas, pairwise human preference tests between model revisions, abstention threshold tuning for classifiers—all lean on the same Monte Carlo instincts.
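A concrete version of that instinct for pairwise preference tests: put a normal-approximation interval around the observed win rate before declaring a winner. A minimal sketch (the counts are invented):

win_rate_ci.py
import math

wins, total = 118, 200          # invented counts from a pairwise preference run
p_hat = wins / total
se = math.sqrt(p_hat * (1 - p_hat) / total)
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"win_rate={p_hat:.3f} ci95=({low:.3f}, {high:.3f})")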

Optimization

Why this matters

Training minimizes a surrogate loss over parameters using noisy gradients pulled from minibatches—too large a learning rate blows up, too tiny crawls forever. Bias–variance tradeoffs explain under/overfitting arcs; diagnosing which regime you sit in directs data collection versus architecture tweaks versus regularization knobs.

Example code
gradient_descent_toy.py
# Fit y ≈ w·x via vanilla GD — mirrors NN inner loops stripped bare.

def loss(w: float, xs, ys) -> float:
    pred = [w * x for x in xs]
    return sum((p - y) ** 2 for p, y in zip(pred, ys)) / len(xs)


def grad(w: float, xs, ys) -> float:
    preds = [w * x for x in xs]
    return sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)


xs, ys = [1., 2., 3.], [2.1, 3.9, 6.2]
w = 0.
for step in range(200):
    w -= 0.05 * grad(w, xs, ys)
print("w_hat", round(w, 3), "loss", round(loss(w, xs, ys), 4))
How this shows up in AI

Adam schedules, warmup, gradient clipping during LLM fits, cosine LR restarts—all dress the same descent skeleton you just practiced. Loss spikes you debug in PyTorch originate here.
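Warmup plus cosine decay is easy to sketch in plain Python; the step counts and peak learning rate below are arbitrary, not a recommendation.

cosine_warmup_lr.py
import math


def lr_at(step: int, warmup: int = 100, total: int = 1000, peak: float = 3e-4) -> float:
    # Linear warmup to the peak LR, then cosine decay back to zero.
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return peak * 0.5 * (1 + math.cos(math.pi * progress))


for s in (0, 50, 100, 500, 1000):
    print(s, round(lr_at(s), 6))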

Classical ML

Why this matters

Logistic regression, gradient boosting, and random forests routinely beat sprawling neural nets when data is scarce or structured. Knowing pipelines (scaling → model) and stratified folds keeps benchmarking honest—they are sanity checks plus strong production baselines before multimillion-parameter experiments.

Example code
baseline_pipeline.py
# pip install scikit-learn
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


X = [[0.1, 10], [-0.2, 8], [0.4, 12], [-0.05, 9]]
y = [0, 1, 1, 0]

model = Pipeline(
    steps=[
        ("scale", ColumnTransformer([
            ("num", StandardScaler(), [0, 1]),
        ], remainder="passthrough")),
        ("clf", LogisticRegression(max_iter=200)),
    ]
)

scores = cross_val_score(model, X, y, cv=2)
print("cv_acc", scores.mean())
How this shows up in AI

Hybrid stacks pair classical rankers/boosters atop dense retrieval features—the deep model handles language, sklearn handles calibrated gating budgets.
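One way that calibrated gating looks in code: keep the cheap classifier's confident predictions and route only the uncertain rows onward. A minimal sketch on synthetic data (the threshold band, dataset, and filename are illustrative):

confidence_gate.py
# pip install scikit-learn numpy
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)[:, 1]
uncertain = (probs > 0.35) & (probs < 0.65)   # only these go to the expensive model
print("rows_routed_onward", int(uncertain.sum()), "of", len(X))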

A2

Deep learning

Composable differentiable blocks, autograd-powered loops, convolutional priors on grids, recurrence on sequences—all stepping stones to Transformers on Path A (and multimodal VLMs later).

Neural network mechanics

Why this matters

Networks stack affine maps + nonlinearities (activations) so the resulting bends can approximate rich functions given enough width, depth, and backprop-fed data. Training parallelizes via batched tensors on GPUs where matrix ops dominate; mixed precision swaps float32 for float16 to gain throughput, with careful loss scaling—all production LLM infra inherits this mental stack.

Example code
two_layer_numpy.py
import numpy as np


def relu(x):
    return np.maximum(0, x)


def forward(x, W1, b1, W2, b2):
    z1 = relu(x @ W1 + b1)
    return z1 @ W2 + b2  # linear head for regression demo


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
W1 = rng.normal(scale=0.1, size=(3, 16))
b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(16, 2))
b2 = np.zeros(2)
print(forward(x, W1, b1, W2, b2).shape)
How this shows up in AI

Every transformer block repeats “linear projections + softmax nonlinearity + residual”; MLP tails inside GPT-style stacks are glorified dense blocks from this blueprint.

PyTorch training loop

Why this matters

Implementing loops yourself crystallizes epochs, minibatches, zeroing gradients before .backward(), checkpoints, and gradients turning into weight updates—even if later you lean on Trainer classes. The muscle memory for reading loss logs and gradient norms originates here, before Hugging Face abstractions blur the edges.

Example code
tiny_fit.py
# pip install torch — toy regression on Gaussian blobs.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


torch.manual_seed(0)
xs = torch.randn(512, 10)
ys = xs[:, [2]] + 0.1 * xs[:, [7]] + torch.randn(512, 1) * 0.2
loader = DataLoader(TensorDataset(xs, ys), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
crit = nn.MSELoss()

for epoch in range(3):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = crit(model(xb), yb)
        loss.backward()
        optimizer.step()
print("final_loss", float(loss.detach()))
How this shows up in AI

LoRA fine-tuning, preference optimization, distillation—all wrap the identical loop semantics with extra loss terms plus frozen parameter masks.
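Those frozen parameter masks are literally requires_grad flags plus an optimizer that only sees the trainable subset; a minimal sketch on the same toy model shape (which layer to freeze is an arbitrary choice here):

freeze_backbone.py
# pip install torch; freeze the first layer, train only the rest.
import torch
from torch import nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
for p in model[0].parameters():   # freeze the first linear layer
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
print("trainable_tensors", len(trainable), "of", len(list(model.parameters())))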

Computer vision primer

Why this matters

Convolutions share weights across spatial locations respecting translation structure; pooling trades resolution for receptive field width or invariance. Transfer learning hot-starts convolutional towers learned on gigantic photo corpora—a pattern mirrored later when “image patching” feeds vision Transformers powering VLMs on Path B/A5.

Example code
tiny_conv.py
# pip install torch
import torch
from torch import nn

x = torch.randn(2, 3, 32, 32)  # N,C,H,W
stem = nn.Conv2d(3, 16, kernel_size=3, padding=1)
pool = nn.AdaptiveAvgPool2d((1, 1))
feat = torch.flatten(pool(stem(x)), start_dim=1)
print("feature_shape", feat.shape)
How this shows up in AI

Modern VLMs turn images into sequences of patches—understanding convolution + pooling primes you for why patch embeddings work and why resolution inflates tokens.

Sequences & RNN era

Why this matters

Recurrence threads a hidden memory across time steps; seq2seq maps input sequences to outputs for translation or tagging. Sequential dependencies create bottleneck pressure—long horizons forget details; serial depth made massive parallel training painful. Attention removed that bottleneck, yet RNN intuition (state carried forward) survives in streaming agents and chunked inference stories.

Example code
vanilla_rnn_step.py
import numpy as np


def tanh(x):
    return np.tanh(x)


def rnn_cell(h, x, W_h, W_x, b):
    return tanh(h @ W_h + x @ W_x + b)


rng = np.random.default_rng(1)
h = rng.normal(scale=0.1, size=32)
W_h = rng.normal(scale=0.05, size=(32, 32))
W_x = rng.normal(scale=0.05, size=(8, 32))
b = np.zeros(32)
for tok in range(6):
    vec = rng.normal(size=8)
    h = rnn_cell(h, vec, W_h, W_x, b)
print("final_hidden_norm", float(np.linalg.norm(h)))
How this shows up in AI

Today’s chat models seldom ship raw RNNs, but appreciating vanishing gradients and temporal bottlenecks helps you see why attention and fully parallel blocks became the default.
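Vanishing gradients also have a quick numerical demonstration: backpropagating through many recurrent steps multiplies the signal by a small matrix each time, so its norm decays roughly exponentially. A minimal sketch (the 0.05 weight scale and 20 steps are arbitrary, and the tanh derivative that would shrink things further is ignored):

vanishing_gradient_demo.py
import numpy as np

rng = np.random.default_rng(0)
grad = np.ones(32)
for _ in range(20):                  # 20 "time steps" of backprop through time
    W = rng.normal(scale=0.05, size=(32, 32))
    grad = W.T @ grad                # chain rule through one recurrent step
print("grad_norm_after_20_steps", float(np.linalg.norm(grad)))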

A3

NLP & Transformers

Tokens become IDs, attention aligns positions, pretrained objectives shape priors—you now own the internals behind modern LLMs and the HF toolchain that ships them daily.

Text as data

Why this matters

Models ingest discrete token IDs produced by deterministic tokenizers—not raw Unicode characters. Vocabs trade off granularity (subword merges) versus out-of-vocab robustness versus sequence length ballooning. Sparse TF-IDF bags summarize documents when deep models overshoot budgets; dense embedding layers learn co-occurrence structure end-to-end in neural stacks.

Example code
token_vocab.py
TEXT = ("shipment delayed shipment refund")


def word_ids(text: str):
    words = text.lower().split()
    vocab = {"<unk>": 0}
    for w in words:
        if w not in vocab:
            vocab[w] = len(vocab)
    ids = [vocab.get(tok, 0) for tok in words]
    return ids, vocab


ids, vocab = word_ids(TEXT)
print("ids", ids, "vocab_entries", len(vocab))
How this shows up in AI

Hallucinations often correlate with brittle tokenizations of numbers or SKUs—instrument token counts when debugging perplexity regressions tied to multilingual content.

Transformer architecture

Why this matters

Self-attention mixes every position with weighted views of others in one parallelizable step—capturing dependencies without recurrence. Decoder-only GPT-style models mask future tokens; encoder–decoder pairs still matter for constrained generation (translation). Positional encodings tell the layers where each token sits because attention itself is permutation equivariant absent that hint.

Example code
softmax_attention_toy.py
import numpy as np


def softmax_rows(x):
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)


Q = np.array([[1., 0.], [0., 2.]])  # seq × d
K = np.array([[1., 1.], [2., 0.]])
V = np.array([[0.5, 2.], [1., -1.]])
scores = (Q @ K.T) / np.sqrt(K.shape[1])
attn = softmax_rows(scores)
out = attn @ V
print("attn", np.round(attn, 2), "\nvalues", np.round(out, 2))
How this shows up in AI

Inference engines optimize these matmul-heavy blocks with fused kernels, KV caches, speculative decoding—all engineering built atop the attention primitives you traced.
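The decoder-only masking mentioned above bolts onto the same toy: set scores above the diagonal to a large negative value before the softmax so each position attends only to itself and earlier tokens. A minimal sketch with random Q/K (the shapes are chosen arbitrarily):

causal_mask_toy.py
import numpy as np


def softmax_rows(x):
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
seq, d = 4, 2
Q, K = rng.normal(size=(seq, d)), rng.normal(size=(seq, d))
scores = (Q @ K.T) / np.sqrt(d)
mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)   # True above the diagonal
attn = softmax_rows(np.where(mask, -1e9, scores))      # future positions get ~0 weight
print("row0_future_weights", np.round(attn[0, 1:], 3))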

Pretraining objectives

Why this matters

Next-token prediction (causal LM) aligns weights with forecasting future symbols—GPT-style chats inherit that curriculum. Masked LM (BERT-style) reconstructs withheld tokens leveraging bidirectional context. Instruction tuning layers polite assistant behavior atop base weights—the same backbone can feel wildly different after curated prompt/response supervised finetuning.

Example code
mlm_loss_stub.py
# Cross-entropy toy for predicting one vocab index from logits.
import math


def softmax_vec(z):
    m = max(z)
    ex = [math.exp(v - m) for v in z]
    s = sum(ex)
    return [w / s for w in ex]


def nll_ce(logits, target_idx: int) -> float:
    probs = softmax_vec(logits)
    return -math.log(probs[target_idx] + 1e-12)


logits = [2.2, 1.8, 5.9, 0.]
print("loss_masked_token", round(nll_ce(logits, 2), 3))
How this shows up in AI

Scaling laws quantify how predictable loss decreases with compute/data/parameters—why frontier labs hoard GPUs and meticulous cleaning pipelines before architecture hacks.
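That "predictable loss" claim is usually written as a power law, e.g. loss(N) ≈ L_inf + (N0 / N)^alpha in parameter count N. The constants below are invented purely to show the shape of such a curve, not fitted values.

scaling_curve_toy.py
# Toy power-law loss curve; the constants are invented, only the monotone shape matters.
L_INF, N0, ALPHA = 1.8, 1e14, 0.08


def loss_at(params: float) -> float:
    return L_INF + (N0 / params) ** ALPHA


for n in (1e8, 1e9, 1e10, 1e11):
    print(f"params={n:.0e} predicted_loss={loss_at(n):.3f}")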

Fine-tuning

Why this matters

A full fine-tune updates every weight—powerful yet heavy for billion-parameter beasts. Parameter-efficient tweaks such as LoRA attach low-rank adapters to linear layers, effectively perturbing the frozen weight matrix W with a small low-rank product AB instead of rewriting the full matrix. Matching data formatting (conversation templates, masking rules) avoids silent train/serve skew.

Example code
lora_matmul_numpy.py
import numpy as np

rng = np.random.default_rng(7)
W = rng.normal(scale=0.1, size=(8, 32))
A = rng.normal(scale=0.02, size=(8, 4))
B = rng.normal(scale=0.02, size=(32, 4))
W_eff = W + A @ B.T
x = rng.normal(size=(64, 8))
print("delta_norm", float(np.linalg.norm(A @ B.T)), "adapted_shape", (x @ W_eff).shape)
How this shows up in AI

Specialized assistants (finance, codegen, multilingual support) routinely stack LoRA on shared base checkpoints while keeping infra frozen for multi-tenant hosting.

Hugging Face stack

Why this matters

The ecosystem stitches together tokenizer binaries, model config JSON, pretrained weights (safetensors/.bin), Trainer scaffolding, and Accelerate primitives for device placement. The datasets library streams parquet/json at scale so you do not rewrite loaders per project—inference knobs (`pipeline`, TorchScript, ONNX, bitsandbytes quantization) underpin many production gateways.

Example code
hf_quickstart.py
# pip install transformers accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer


model_id = "hf-internal-testing/tiny-random-gpt2"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
prompt = "Hi"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tok.decode(out[0]))
How this shows up in AI

Most open-weight LLMs ship first as HF repos—knowing Trainer callbacks, checkpoints, tokenizer quirks is how Path A grads graduate into Path B infra owners.

Evaluation & robustness

Why this matters

Perplexity summarizes open-ended language modeling prowess but hides failure modes (“confident hallucinations”). Task suites (classification, QA, grounding) quantify targeted skills; curated stress tests catch adversarial synonyms, negation swaps, multilingual drift. Robustness blends metrics with qualitative failure archaeology—clusters of slips reveal missing data regimes.

Example code
perplexity_from_nll.py
import math

token_negative_log_probs = [2.11, 3.97, 2.83, 1.65, 2.21]  # per-token NLL in nats
avg_nll = sum(token_negative_log_probs) / len(token_negative_log_probs)
ppl = math.exp(avg_nll)
print("perplexity ~", round(ppl, 2))
How this shows up in AI

Production LLM dashboards mix automatic metrics plus human pairwise preferences—evaluation design is inseparable from product risk appetite.

A4

Toward production ML (model-centric)

Making models cheap to run, observable in production, and guarded against harmful failure cases—before or alongside LLM product layers.

Efficient inference

Why this matters

Quantization stores weights/activations with fewer bits—trading accuracy for memory and matmul throughput once kernels match. Batching amortizes kernel-launch overhead but raises tail latency if you wait to fill buckets. Autoregressive transformers cache past KV tensors so each new token can reuse previously computed keys and values instead of rebuilding the entire attention matrix from scratch every step.

Example code
quantize_roundtrip.py
import numpy as np

w = np.linspace(-1., 1., num=20).astype(np.float32)


def fake_int8(arr, scale):
    q = np.clip(np.round(arr / scale), -127, 127).astype(np.int8)
    dq = q.astype(np.float32) * scale
    return q, dq


q, dq = fake_int8(w, scale=0.02)
print("max_abs_error", float(np.max(np.abs(w - dq))))
How this shows up in AI

LLM serving stacks mix GPU batching, KV-cache paging, speculative decoding, and quantization recipes (GPTQ, AWQ)—this section is the vocab for those design reviews.
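A back-of-envelope for why KV caches dominate memory at long contexts: every layer stores a key and a value vector per head per past token. The model dimensions below are stand-ins, not any particular checkpoint.

kv_cache_budget.py
# KV-cache bytes = layers x 2 (K and V) x seq_len x heads x head_dim x bytes per element x batch.
layers, heads, head_dim = 32, 32, 128
seq_len, batch = 8192, 4
bytes_per = 2  # fp16

kv_bytes = layers * 2 * seq_len * heads * head_dim * bytes_per * batch
print(f"kv_cache_gib={kv_bytes / 2**30:.1f}")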

Serving basics

Why this matters

Serving means loading weights + runtime (PyTorch, ONNX, TensorRT, vLLM…) behind an API with autoscaling replicas. Latency dominates interactive chat; throughput wins batch ETL jobs—you negotiate both with platform folks using percentiles, concurrency, and hardware topology.

Example code
queueing_stub.py
# Tiny back-of-envelope: queue wait as a function of load (utilization).

def avg_wait_ms(arrival_ms: float, infer_ms: float) -> float:
    # M/M/1-ish toy: wait grows nonlinearly and blows up near full load; real systems need traces.
    load = infer_ms / arrival_ms
    if load >= 1:
        return float("inf")
    return infer_ms * load / (1 - load)


print("approx_wait", avg_wait_ms(arrival_ms=40, infer_ms=30))
How this shows up in AI

Streamed LLM tokens change how you measure SLA (time-to-first-byte vs completion), but autoscaling + queue depth math still governs meltdowns on traffic spikes.

Safety & alignment (overview)

Why this matters

Models inherit dataset biases, exhibit toxicity, and face jailbreak attempts that bypass polite refusals—especially when connected to tools amplifying side effects. Responsible teams pair quantitative red-teaming with human oversight for high-risk sectors (health, finance). Alignment is ongoing operations, not one training run.

Example code
policy_gate.py
# Pseudocode policy layer before model output leaves your VPC.

FORBIDDEN = {"classified payload", "credential dump"}


def sanitize(text: str) -> str:
    lower = text.lower()
    if any(bad in lower for bad in FORBIDDEN):
        raise PermissionError("blocked_policy_category")
    return text


print(sanitize("Here is tomorrow's weather"))
How this shows up in AI

Path B guardrails stack concrete controls (allowlists, moderation APIs) atop this threat model—reuse the same incident runbooks when regressions appear.

A5

Multimodal & capstone

Align modalities in shared embedding spaces or fuse tokens early, finish with a reproducible tuning project documenting behavior and boundaries.

Multimodal alignment

Why this matters

Dual-encoder models such as CLIP push images and captions into one vector space scored by cosine similarity—great retrieval without autoregressive text. Newer fused VLMs project image patches straight into the same Transformer stream as textual tokens—enabling conversational reasoning but demanding more compute and alignment data. Understanding both explains when to search vs caption vs tool-call in multimodal pipelines.

Example code
clip_similarity_stub.py
import numpy as np


def l2_normalize(vec):
    n = np.linalg.norm(vec, axis=-1, keepdims=True) + 1e-9
    return vec / n


rng = np.random.default_rng(3)
img_vec = rng.normal(size=(1, 8))
txt_vec = rng.normal(size=(1, 8))
similarity = (l2_normalize(img_vec) * l2_normalize(txt_vec)).sum()
print("cos_dot", float(similarity))
How this shows up in AI

Product teams remix CLIP-ish retrievers for marketing asset search while shipping fused VLMs for chat-style photo analysis—pricing and latency diverge massively between the shapes.

Vision-language models

Why this matters

Image pixels become a sequence of flattened patches (possibly after CNN stem) that receive positional embeddings like words. Larger native resolution ⇒ more patches ⇒ longer contexts and hotter GPUs—balancing crop sizes vs fidelity is architectural, not an afterthought.

Example code
patch_flatten.py
import numpy as np

rng = np.random.default_rng(4)
image = rng.integers(0, 255, size=(224, 224, 3))
patch = 16
tiles = []
for i in range(0, 224, patch):
    for j in range(0, 224, patch):
        tiles.append(image[i : i + patch, j : j + patch].reshape(-1))
stack = np.stack(tiles, axis=0)
print("num_patch_tokens", stack.shape[0])
How this shows up in AI

Multimodal eval harnesses cite patch coverage and OCR ground truth—infra teams watch token-per-image budgets beside text-only SLAs.

Capstone — train, evaluate, model card

Why this matters

You demonstrate end-to-end ownership: scrape/curate a dataset, tune (full or LoRA) with reproducible configs, benchmark against sane baselines, and publish a succinct model card covering scope, pitfalls, fairness notes, deployment limits. This is proof you graduate from consumer of APIs to builder of responsibly documented models.

Example code
model_card_outline.py
# Skeleton you can paste into README + expand.
MODEL_CARD = {
    "name": "<tuning-run>",
    "base_checkpoint": "…",
    "data_sources": [],
    "intended_users": [],
    "metrics": {},
    "limitations": [],
    "ethical_review": "pending/supplied",
}


for key, val in MODEL_CARD.items():
    print(key, ":", val)
How this shows up in AI

Hiring loops and auditors increasingly expect parity between marketing claims and written cards—invest as much diligence here as plotting loss curves. Official brief: capstone spec.

Optional next on Path B: multimodal stretch.