Three pillars, one trace ID. Correlating prompts, retrievals, tool calls, and streaming chunks requires a single request_id or OpenTelemetry trace root shared by logs, metrics, and traces. Spans should name the semantic step (retrieve.hybrid, rerank.cross_encoder, llm.completion), not only “HTTP POST.”
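As a minimal sketch of the idea, the snippet below uses only the Python standard library to show named spans that all carry one trace ID. The `Trace` class, the `req-123` id, and the attribute names are hypothetical stand-ins; a production system would more likely use an OpenTelemetry SDK tracer than hand-rolled records.

```python
import time
import uuid
from contextlib import contextmanager


class Trace:
    """Collects span records that all share one trace id (illustrative stand-in for a tracer)."""

    def __init__(self, request_id=None):
        # Assumption: the inbound request_id doubles as the trace id when one is supplied.
        self.trace_id = request_id or uuid.uuid4().hex
        self.spans = []

    @contextmanager
    def span(self, name, **attributes):
        record = {"trace_id": self.trace_id, "name": name, "attributes": dict(attributes)}
        start = time.perf_counter()
        try:
            yield record
        finally:
            # Record the duration even if the wrapped step raises.
            record["duration_ms"] = (time.perf_counter() - start) * 1000.0
            self.spans.append(record)


trace = Trace(request_id="req-123")
with trace.span("retrieve.hybrid", top_k=20) as s:
    s["attributes"]["candidates"] = 20  # attach results discovered inside the step
with trace.span("rerank.cross_encoder"):
    pass
with trace.span("llm.completion", model="example-model"):
    pass
```

Each record names the semantic step, so a slow `rerank.cross_encoder` span is visible as such instead of hiding behind a generic HTTP span.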
Redaction by default. Log payloads with field-level redaction or hashes; store full text only in secured, TTL’d buckets when needed for replay. Security teams audit what leaves the VPC, so redact before logs are exported.
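One way field-level redaction might look, assuming payloads are nested dictionaries. The field names in `SENSITIVE_FIELDS` and the `salt` value are hypothetical; a real deployment would load the salt from a secret store and decide its own field list.

```python
import hashlib

# Hypothetical field names; adjust to the payload schema actually in use.
SENSITIVE_FIELDS = {"prompt", "completion", "user_email"}


def redact(payload, sensitive=SENSITIVE_FIELDS, salt="rotate-me"):
    """Return a copy that is safe to log: sensitive values become a salted hash plus length."""
    safe = {}
    for key, value in payload.items():
        if key in sensitive:
            text = str(value)
            digest = hashlib.sha256((salt + text).encode("utf-8")).hexdigest()
            # A truncated hash still lets identical inputs be correlated across log lines.
            safe[key] = {"sha256": digest[:16], "chars": len(text)}
        elif isinstance(value, dict):
            safe[key] = redact(value, sensitive, salt)
        else:
            safe[key] = value
    return safe


event = {
    "request_id": "req-123",
    "prompt": "please summarise my medical history",
    "meta": {"user_email": "a@example.com", "tenant_tier": "pro"},
}
safe_event = redact(event)
```

The hash keeps identical prompts joinable for debugging without writing the text itself to the log stream.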
Metrics that matter. Track time to first token (TTFT), inter-token latency percentiles, retrieval candidate count, rerank drop rate, validator reject rate, cost per request, and tokens in/out by model version. Generic CPU metrics miss user-visible pain.
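The two streaming metrics can be derived from chunk arrival timestamps alone. The sketch below assumes timestamps in seconds from a monotonic clock; the function name and the choice of p50/p95 are illustrative, not prescribed by the text.

```python
import statistics


def stream_latency_metrics(request_sent_at, chunk_times):
    """Compute TTFT and inter-token latency percentiles from chunk arrival times (seconds)."""
    if not chunk_times:
        return {"ttft_s": None, "itl_p50_s": None, "itl_p95_s": None}
    # TTFT: delay between sending the request and receiving the first chunk.
    ttft = chunk_times[0] - request_sent_at
    # Inter-token latency: gap between each pair of consecutive chunks.
    gaps = [later - earlier for earlier, later in zip(chunk_times, chunk_times[1:])]
    if len(gaps) < 2:
        # Too few gaps for percentiles; fall back to the single gap, if any.
        p50 = p95 = gaps[0] if gaps else None
    else:
        cuts = statistics.quantiles(gaps, n=100, method="inclusive")  # 99 cut points
        p50, p95 = cuts[49], cuts[94]
    return {"ttft_s": ttft, "itl_p50_s": p50, "itl_p95_s": p95}


metrics = stream_latency_metrics(0.0, [0.40, 0.45, 0.50, 0.55, 0.80])
```

In this sample the median gap stays near 50 ms while the 95th percentile is pulled up by the final 250 ms stall, which is the kind of user-visible pause an average would hide.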
Structured events. Emit JSON lines for grounding decisions, citation ids, and prompt template ids so that warehouse joins, not grep, power weekly quality reviews.
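A small sketch of what emitting such events could look like. The event names, the `qa_v7` template id, and the citation id format are made up for illustration; the `io.StringIO` sink stands in for a file or log shipper.

```python
import io
import json
import time


def emit_event(stream, event_type, request_id, **fields):
    """Write one JSON object per line so a warehouse loader can ingest it directly."""
    record = {"ts": time.time(), "event": event_type, "request_id": request_id, **fields}
    # Compact separators and sorted keys keep lines small and diff-friendly.
    stream.write(json.dumps(record, sort_keys=True, separators=(",", ":")) + "\n")
    return record


sink = io.StringIO()  # stand-in for a log file or shipper
emit_event(
    sink,
    "grounding_decision",
    "req-123",
    prompt_template_id="qa_v7",
    citation_ids=["doc-12#3", "doc-40#1"],
    grounded=True,
)
emit_event(sink, "validator_result", "req-123", prompt_template_id="qa_v7", rejected=False)
lines = sink.getvalue().splitlines()
```

Because every line carries `request_id` and `prompt_template_id`, a weekly review can join events to traces and group by template without parsing free text.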
Product linkage. Tag spans with feature, tenant_tier, and experiment_id so PMs can slice regressions without filing ‘mystery infra’ tickets.
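To illustrate why the tags matter, the sketch below groups finished span records by one product tag and computes a validator reject rate per group. The record shape, the `rejected` attribute, and the tag values are assumptions made for the example; in practice this slicing would usually happen in a tracing backend or warehouse query.

```python
from collections import defaultdict


def reject_rate_by(spans, tag):
    """Group span records by one product tag and compute the reject rate per group."""
    totals = defaultdict(int)
    rejects = defaultdict(int)
    for span in spans:
        attrs = span["attributes"]
        key = attrs.get(tag, "untagged")  # untagged spans stay visible instead of vanishing
        totals[key] += 1
        if attrs.get("rejected"):
            rejects[key] += 1
    return {key: rejects[key] / totals[key] for key in totals}


# Hypothetical validate spans carrying the three product tags.
spans = [
    {"name": "validate", "attributes": {"feature": "search_answers", "tenant_tier": "free", "experiment_id": "exp-a", "rejected": True}},
    {"name": "validate", "attributes": {"feature": "search_answers", "tenant_tier": "pro", "experiment_id": "exp-a", "rejected": False}},
    {"name": "validate", "attributes": {"feature": "search_answers", "tenant_tier": "pro", "experiment_id": "exp-b", "rejected": False}},
    {"name": "validate", "attributes": {"feature": "search_answers", "tenant_tier": "free", "experiment_id": "exp-b", "rejected": False}},
]
by_experiment = reject_rate_by(spans, "experiment_id")
by_tier = reject_rate_by(spans, "tenant_tier")
```

With the tags in place, a regression shows up as "exp-a rejects more often" and points at an experiment, not at infrastructure.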
Trace shape
flowchart LR
R[Request] --> T[Trace]
T --> RET[Retrieve span]
T --> RN[Rerank span]
T --> L[LLM span]
T --> V[Validate span]