Core Architecture & Internals
Every Kafka outage traced to “mystery lag” or “lost messages” eventually lands here: how the commit log is laid out on disk, how partitions map to brokers, and what the broker actually does when a produce or fetch request arrives. This is the foundation chapter—go deepest here before tuning producers or consumers.
The commit log
Kafka’s central abstraction is brutally simple: an append-only, ordered, immutable sequence of records. Everything else—topics, partitions, consumer groups—is naming and sharding layered on top of that log.
What a commit log is
A commit log (write-ahead log, transaction log, journal) records events in the order they arrive. New records are always appended at the end. Existing bytes are never updated in place—corrections are new records (or tombstones under compaction). Consumers track a position (offset) in this sequence.
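The append-and-track-position model above can be sketched in a few lines. This is a toy in-memory illustration, not how the broker stores data; the `CommitLog` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CommitLog:
    """Toy in-memory append-only log; a real partition log lives in segment files."""
    records: list = field(default_factory=list)

    def append(self, record: bytes) -> int:
        """Append at the end and return the offset assigned to the record."""
        self.records.append(record)
        return len(self.records) - 1

    def read(self, offset: int, max_records: int = 100) -> list:
        """Return (offset, record) pairs starting at the given position."""
        end = min(offset + max_records, len(self.records))
        return [(i, self.records[i]) for i in range(offset, end)]


log = CommitLog()
first = log.append(b"order-created")
second = log.append(b"order-paid")
# A correction is a new record, never an in-place update.
third = log.append(b"order-refunded")

position = 0                     # consumer-owned pointer into the log
batch = log.read(position)
position = batch[-1][0] + 1      # next offset to read
```

The log itself keeps no per-consumer state; each reader only remembers the next offset it wants.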
Why append-only beats random writes
On HDD, sequential writes approach full disk bandwidth; random writes seek and stall. On SSD, random writes trigger write amplification—the FTL erases and rewrites larger blocks than your 1 KB message. Kafka batches many small records into large segment appends, keeping the device in sequential mode.
When a producer sends a batch, the partition leader appends bytes to the active .log segment file, updates the sparse offset index, and only then acknowledges—depending on your acks setting and ISR state.
How Kafka maps the log to disk
Each partition is a directory on the broker filesystem:
/var/kafka/data/orders-2/              ← topic orders, partition 2
├── 00000000000000000000.log           ← oldest segment (rolled, closed): record batches
├── 00000000000000000000.index         ← sparse offset → file position
├── 00000000000000000000.timeindex     ← timestamp → offset
├── 00000000000000123456.log           ← active segment; the file name is its base offset
├── 00000000000000123456.index
└── 00000000000000123456.timeindex
Log segment rolling
Active segments grow until log.segment.bytes (default 1 GB) or log.roll.ms / log.roll.hours triggers a roll. A new empty segment becomes active; the old one is eligible for retention deletion or compaction.
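As a rough sketch of the naming and rolling rules above: segment files take their name from the base offset, zero-padded to 20 digits, and a roll happens on size or age. The function names are hypothetical and the roll check is simplified (the broker also applies jitter and other triggers).

```python
SEGMENT_BYTES = 1 << 30             # log.segment.bytes default (1 GiB)
ROLL_MS = 168 * 60 * 60 * 1000      # log.roll.hours default, in milliseconds


def segment_file_name(base_offset: int, suffix: str = ".log") -> str:
    """Segment files are named after their base offset, zero-padded to 20 digits."""
    return f"{base_offset:020d}{suffix}"


def should_roll(active_bytes: int, incoming_bytes: int, active_age_ms: int) -> bool:
    """Simplified check: the append would exceed the size limit, or the segment is too old."""
    return active_bytes + incoming_bytes > SEGMENT_BYTES or active_age_ms >= ROLL_MS


print(segment_file_name(123456))    # 00000000000000123456.log
```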
Offset as position—not a message ID
Offset 42 identifies the 43rd record ever appended to this partition (offsets are 0-indexed). It is not a UUID, not a timestamp, and it means nothing outside its own topic-partition. Replay means moving that pointer backward. Offsets are never reassigned: retention and compaction remove records but leave the survivors' offsets unchanged, so a compacted log can have gaps in its offset sequence.
| Property | Default | Production | Impact |
|---|---|---|---|
| log.segment.bytes | 1 GB | 512 MB – 1 GB | Smaller segments = more files, faster retention deletes; larger = fewer file handles |
| log.roll.hours | 168 (7 days) | 24–168 | Time-based roll even if the segment is not full; helps retention boundary alignment |
| log.retention.hours | 168 | Per-topic (24–720+) | How long delete-policy data survives on disk |
On brokers with many partitions, prefer fewer large segments over many tiny segments—each segment is three files (log, index, timeindex) and consumes file descriptors.
Topics & partitions
A topic is a named log. A partition is that log split into a shard—Kafka’s unit of parallelism and ordering. Think of partitions as lanes on a highway: more lanes move more traffic, but each lane preserves order.
Topic as a named log
Producers publish to a topic name (orders, user-events). The cluster maps the topic to N partitions spread across brokers. Consumers subscribe to the topic; the group coordinator assigns specific partitions to each member.
Partition: parallelism AND ordering
Ordering is guaranteed only within one partition. If userId=42 must always be processed in order, all events for that user must land in the same partition, typically via key-based partitioning: partition = (murmur2(keyBytes) & 0x7fffffff) % numPartitions, where the mask keeps the hash non-negative.
Partition count decision
- Throughput target ÷ per-partition throughput (~10–100 MB/s depending on hardware)
- Consumer parallelism — max consumers in a group = partition count
- Ordering scope — more partitions = finer parallelism but weaker global order
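You can increase a topic's partition count but never reduce it. Plan the count at topic creation: adding partitions later changes the key-to-partition mapping for new records while existing records stay where they are, so per-key ordering breaks across the change unless you migrate carefully.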
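The sizing arithmetic above can be written down as a small helper. This is a sketch under stated assumptions: the function name and the workload numbers are hypothetical, and per-partition throughput should come from your own benchmarks.

```python
import math


def partition_count(target_mb_s: float, per_partition_mb_s: float, max_consumers: int) -> int:
    """Take the larger of the throughput-driven and parallelism-driven counts."""
    by_throughput = math.ceil(target_mb_s / per_partition_mb_s)
    return max(by_throughput, max_consumers, 1)


# Hypothetical workload: 300 MB/s target, 25 MB/s measured per partition, 8 consumers.
print(partition_count(300, 25, 8))   # 12
# Low-throughput topic where consumer parallelism dominates.
print(partition_count(50, 25, 8))    # 8
```

Because the count can only grow, it is usually cheaper to round this number up than to repartition later.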
Partition assignment to brokers
Each partition has one leader broker, which handles all writes and, by default, all reads, plus RF−1 follower replicas. The controller assigns leaders to balance load. Producers always talk to the partition leader; consumers do too unless follower fetching (KIP-392, Kafka 2.4+) is configured. Followers replicate by fetching from the leader.
flowchart TB
subgraph topic["Topic: orders (3 partitions, RF=3)"]
P0["Partition 0\nLeader: Broker 1"]
P1["Partition 1\nLeader: Broker 2"]
P2["Partition 2\nLeader: Broker 3"]
end
B1[Broker 1]
B2[Broker 2]
B3[Broker 3]
P0 --> B1
P0 -.replica.-> B2
P0 -.replica.-> B3
P1 --> B2
P1 -.replica.-> B3
P1 -.replica.-> B1
P2 --> B3
P2 -.replica.-> B1
P2 -.replica.-> B2
Hot partition problem
Bad key choice—countryCode="US" for 70% of traffic, or null keys with sticky partitioning that still skews—concentrates load on one partition. One hot partition caps throughput for the whole topic and creates single-consumer bottlenecks.
“How do you scale a hot partition?” — honest answers: fix the key (salt/high-cardinality), split into multiple topics, or accept that ordering for that key serializes processing. You cannot add partitions and magically rebalance existing key distribution.
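For keyed records, partition = (murmur2(keyBytes) & 0x7fffffff) % numPartitions, the same formula the Java client's DefaultPartitioner uses. The mask clears the sign bit so the result is never negative.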
High-cardinality keys (user ID, order ID) spread load but preserve per-entity order. Low-cardinality keys (country, status) maximize ordering scope but invite hot partitions—choose deliberately.
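To make the skew concrete, here is a small simulation of key-to-partition mapping. The `murmur2` function is a Python port written from memory of the Java client's `Utils.murmur2`, so treat it as an approximation and check it against the real client before relying on exact partition numbers; the key sets and helper names are hypothetical.

```python
from collections import Counter


def murmur2(data: bytes) -> int:
    """32-bit MurmurHash2 (assumed port of the Java client's Utils.murmur2)."""
    m, r, mask = 0x5BD1E995, 24, 0xFFFFFFFF
    length = len(data)
    h = (0x9747B28C ^ length) & mask
    for i in range(0, length - length % 4, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & mask
        k ^= k >> r
        k = (k * m) & mask
        h = ((h * m) & mask) ^ k
    tail = data[length - length % 4:]
    if tail:
        # Remaining 1-3 bytes are mixed in as a little-endian integer.
        h ^= int.from_bytes(tail, "little")
        h = (h * m) & mask
    h ^= h >> 13
    h = (h * m) & mask
    h ^= h >> 15
    return h


def partition_for(key: str, num_partitions: int) -> int:
    """Mask the sign bit, then take the remainder by the partition count."""
    return (murmur2(key.encode("utf-8")) & 0x7FFFFFFF) % num_partitions


def load_by_partition(keys, num_partitions: int) -> Counter:
    """Count how many records land on each partition."""
    return Counter(partition_for(k, num_partitions) for k in keys)


user_keys = [f"user-{i}" for i in range(10_000)]                   # high cardinality
country_keys = ["US"] * 7_000 + ["DE"] * 2_000 + ["JP"] * 1_000    # low cardinality

spread = load_by_partition(user_keys, 6)
skewed = load_by_partition(country_keys, 6)
print("user IDs:", sorted(spread.items()))
print("countries:", sorted(skewed.items()))
```

With three distinct keys, at most three of the six partitions ever receive data, and whichever one holds "US" takes at least 70% of the records.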
Broker internals
A Kafka broker is a JVM process that owns partition logs, serves client requests, and participates in cluster metadata. Understanding leader/follower roles and ISR mechanics is prerequisite to any durability discussion.
Broker roles
- Controller — one broker per cluster (KRaft: controller quorum); manages partition leader election, topic creation, ISR changes
- Leader — per partition; handles all produce and fetch requests for that partition
- Follower — replicates the leader’s log; can be elected leader if in ISR
ZooKeeper mode (legacy) vs KRaft
| Aspect | ZooKeeper (pre-4.0) | KRaft (production-ready since 3.3; the only mode in 4.0) |
|---|---|---|
| Metadata store | External ZK ensemble (3–5 nodes) | Internal __cluster_metadata topic |
| Controller failover | ZK watches + controller election | Raft quorum—faster, fewer edge cases |
| Ops burden | Two systems to patch, monitor, backup | Kafka-only operational surface |
| Partition ceiling | ~200K metadata entries practical limit | Millions of partitions (tested at LinkedIn scale) |
ISR (In-Sync Replicas)
ISR is the set of replicas caught up with the leader—followers whose lag is within replica.lag.time.max.ms (default 30s). Only ISR members are eligible for leader election (when unclean.leader.election.enable=false, the production default).
Replication flow
- Producer sends batch to partition leader
- Leader appends to its local log and advances its log end offset (LEO)
- Followers fetch from leader and append to their logs
- Leader advances the high watermark (HW) to the lowest LEO across all ISR members; with acks=all, the producer is acknowledged once the HW passes its batch
- Consumers can read only up to the HW, whichever replica serves them; follower replication fetches read past it, up to the leader's LEO
Leader log:     [0][1][2][3][4][5]   LEO=6
Follower log:   [0][1][2][3]         LEO=4 (lagging)
High Watermark: HW=4 (min offset replicated to ALL ISR)
If follower catches up to 6 → HW advances to 6
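The high watermark rule in the diagram reduces to a minimum over the in-sync replicas. A minimal sketch, with hypothetical replica names:

```python
def high_watermark(leo_by_replica: dict, isr: set) -> int:
    """HW is the smallest log end offset among the in-sync replicas."""
    return min(leo_by_replica[replica] for replica in isr)


leo = {"leader": 6, "follower-1": 4}
isr = {"leader", "follower-1"}

hw = high_watermark(leo, isr)                # follower lags, so HW stays at 4
visible_offsets = list(range(hw))            # consumers may read offsets 0..3

leo["follower-1"] = 6                        # follower catches up
hw_after = high_watermark(leo, isr)          # HW advances to 6

# If the lagging follower is dropped from the ISR instead, only the leader counts.
hw_isr_shrunk = high_watermark({"leader": 6, "follower-1": 4}, {"leader"})
```

The last line shows why ISR membership matters for durability: once the ISR shrinks to the leader alone, the HW advances on a single copy of the data.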
min.insync.replicas — the durability knob
When acks=all, the leader checks |ISR| ≥ min.insync.replicas before acknowledging. If ISR shrinks below this threshold, produce requests fail with NOT_ENOUGH_REPLICAS—availability traded for durability.
# Broker / topic config
default.replication.factor: 3
min.insync.replicas: 2
unclean.leader.election.enable: false
# Producer (Spring Kafka)
spring.kafka.producer.acks: all
spring.kafka.producer.retries: 2147483647
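# Spring Boot has no dedicated idempotence property; pass the raw client setting through
spring.kafka.producer.properties.enable.idempotence: true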
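ISR shrink under load: slow followers drop out of the ISR; with min.insync.replicas=2 and only one replica in sync, acks=all produce requests are rejected with NOT_ENOUGH_REPLICAS and the producer retries until delivery.timeout.ms expires. Monitor UnderReplicatedPartitions, UnderMinIsrPartitionCount, and follower fetch latency.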
Unclean leader election — unclean.leader.election.enable=true restores availability by electing an out-of-sync replica, but you lose unreplicated data. Production: always false.
Production triad: replication.factor=3 + min.insync.replicas=2 + acks=all. Survives one broker loss without write outage or data loss.
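The triad's arithmetic can be checked with a simplified model of the leader-side decision. Both functions are hypothetical, and the model assumes every surviving replica is in the ISR.

```python
def produce_outcome(acks: str, isr_size: int, min_insync_replicas: int) -> str:
    """Simplified leader-side check for a produce request."""
    if acks == "all" and isr_size < min_insync_replicas:
        return "NOT_ENOUGH_REPLICAS"
    return "ACCEPTED"


def survives(replication_factor: int, min_insync_replicas: int, brokers_lost: int) -> bool:
    """True if acks=all writes still succeed after losing that many replicas."""
    remaining = replication_factor - brokers_lost
    return produce_outcome("all", remaining, min_insync_replicas) == "ACCEPTED"


print(survives(3, 2, 1))   # True: RF=3, min ISR=2 tolerates one broker loss
print(survives(3, 2, 2))   # False: a second loss blocks acks=all writes
print(survives(3, 3, 1))   # False: min ISR equal to RF leaves no headroom
```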
Storage internals
Kafka’s performance secret is not a clever JVM heap structure—it is letting the operating system’s page cache do the heavy lifting, then bypassing userspace on the read path with zero-copy.
Log segment anatomy
| File | Purpose | Lookup |
|---|---|---|
| .log | Sequential message bytes (batch format) | Read sequentially from file position |
| .index | Sparse map: offset → byte position in .log | Binary search, then scan forward |
| .timeindex | Sparse map: timestamp → offset | Time-based offset lookup (offsetsForTimes) |
Index lookup: finding offset N
Indexes store entries every index.interval.bytes (default 4 KB of messages)—not per message. To read offset 1,000,000: binary search the index for the nearest entry ≤ 1,000,000, seek the .log file to that position, scan forward until the target offset.
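The lookup described above is a binary search followed by a short linear scan. The sketch below models it with an in-memory list standing in for the .log file; the fixed 100-byte record size and the one-entry-per-10-records spacing are illustrative assumptions, not Kafka's on-disk format.

```python
from bisect import bisect_right

# Toy segment: (offset, payload) records in file order, each 100 bytes "on disk".
RECORD_SIZE = 100
records = [(offset, f"msg-{offset}") for offset in range(1000, 1050)]

# Sparse index: one (offset, byte_position) entry every 10 records.
index = [(off, i * RECORD_SIZE) for i, (off, _) in enumerate(records) if i % 10 == 0]


def lookup(target_offset: int):
    """Binary search the sparse index, then scan forward from that position."""
    offsets = [off for off, _ in index]
    slot = bisect_right(offsets, target_offset) - 1    # nearest entry <= target
    if slot < 0:
        return None                                    # target precedes this segment
    position = index[slot][1]
    scanned = 0
    for off, payload in records[position // RECORD_SIZE:]:
        scanned += 1
        if off >= target_offset:
            return off, payload, scanned
    return None                                        # target is past the segment end


print(lookup(1037))   # (1037, 'msg-1037', 8)
```

The scan touched 8 records instead of 38, which is the trade the sparse index makes: a small index that fits in memory in exchange for a bounded forward scan.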
Page cache — deliberately not JVM heap
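Kafka brokers run with a modest heap (4–6 GB) and rely on the OS page cache for hot data. Recently written and frequently read segments stay in RAM outside the JVM, so they add no GC pressure and the small heap keeps GC pauses short. Implication: the cache survives a broker process restart, but after a host reboot, or when consumers read far behind the tail, reads come from disk until the cache warms up.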
On fetch, the broker reads from page cache (if warm) and uses sendfile() to push bytes to the socket. The JVM never deserializes message payloads for plain replication or consumer fetch; it only parses metadata for metrics and quotas. With TLS enabled the broker must encrypt in userspace, so the sendfile fast path does not apply.
Zero-copy transfer
Traditional path: Disk → kernel buffer → JVM heap → kernel socket buffer → NIC (4 copies)
Kafka sendfile path: Disk → page cache → socket buffer → NIC (2 copies, no userspace)
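The same system call is reachable from most languages. As an illustration only (the broker does this through Java NIO, not Python), the sketch below sends a file over a local socket pair with `socket.sendfile`, which uses `os.sendfile` where the platform supports it and falls back to ordinary `send` otherwise. The 4 KB payload is a stand-in for a slice of a .log segment.

```python
import os
import socket
import tempfile

payload = b"x" * 4096   # stand-in for a slice of a .log segment

with tempfile.NamedTemporaryFile(delete=False) as segment:
    segment.write(payload)
    path = segment.name

sender, receiver = socket.socketpair()
try:
    with open(path, "rb") as f:
        # The kernel moves bytes from the page cache to the socket;
        # the payload is never read into a Python bytes object on this side.
        sent = sender.sendfile(f)
    sender.close()                       # signal end of stream to the receiver
    received = b""
    while chunk := receiver.recv(65536):
        received += chunk
finally:
    sender.close()
    receiver.close()
    os.unlink(path)

print(sent, len(received))
```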
Log compaction vs retention
- Retention (delete): log.cleanup.policy=delete; remove segments older than retention.ms or exceeding retention.bytes
- Compaction: log.cleanup.policy=compact; keep the latest value per key; tombstones are removed after delete.retention.ms
- Combined: compact,delete; compact keyed state, and also delete whole segments past the time/size retention limit, including latest values that are older than the limit
Compaction internals
A log cleaner thread pool scans “dirty” segments (ratio controlled by min.cleanable.dirty.ratio). It builds a map of key → latest offset, rewrites clean segments, and swaps them in. max.compaction.lag.ms bounds how long a key can live before compaction must process it.
Log compaction misconceptions — compaction is asynchronous and laggy under write pressure. Consumers cannot assume a compacted topic has only one record per key at all times; they may see superseded values until the cleaner runs.
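A single cleaner pass can be sketched as "build the key → latest offset map, then keep only the records that map points to". This is a simplified model with hypothetical data: it ignores the active segment (which is never compacted) and keeps tombstones, which a real cleaner drops after delete.retention.ms.

```python
def compact(segment):
    """Keep only the latest record per key; survivors keep their original offsets."""
    latest = {}
    for offset, key, _value in segment:
        latest[key] = offset                 # later offsets overwrite earlier ones
    return [rec for rec in segment if latest[rec[1]] == rec[0]]


dirty = [
    (0, "user-1", "a@example.com"),
    (1, "user-2", "b@example.com"),
    (2, "user-1", "c@example.com"),          # supersedes offset 0
    (3, "user-3", "d@example.com"),
    (4, "user-2", None),                     # tombstone: supersedes offset 1
]

clean = compact(dirty)
print([offset for offset, _, _ in clean])    # [2, 3, 4]
```

Until a pass like this runs, a consumer reading `dirty` sees both values for user-1, which is why readers of compacted topics must handle superseded records themselves.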
Kafka Streams backs its state stores with compacted changelog topics, and Kafka Connect keeps connector configs in one. Confluent Schema Registry stores schemas in the compacted _schemas topic.
Network layer
Kafka speaks a custom binary protocol over TCP—not HTTP, not AMQP. Understanding request types and the reactor thread model explains broker CPU profiles and producer batching behavior.
Request types (selected)
| API | Purpose | Who sends |
|---|---|---|
| Produce | Append record batches to partition leaders | Producer |
| Fetch | Read records from leader (consumers & followers) | Consumer, follower replica |
| Metadata | Topic/partition → broker leader mapping | All clients (cached) |
| FindCoordinator | Locate consumer group coordinator broker | Consumer |
| OffsetCommit / OffsetFetch | Consumer group offset management | Consumer |
Reactor pattern on the broker
flowchart LR
NIC[TCP connections]
NT[num.network.threads\naccept + parse]
RQ[Request queue]
IO[num.io.threads\ndisk + replication]
NIC --> NT --> RQ --> IO
IO --> RQ --> NT --> NIC
Network threads read bytes, parse request headers, and enqueue. I/O threads execute produce/fetch against disk and replication. Saturated network threads → NetworkProcessorAvgIdlePercent drops; saturated I/O threads → RequestHandlerAvgIdlePercent drops.
Producer-side batch accumulation
Before bytes hit the wire, the producer holds records in the RecordAccumulator until batch.size fills or linger.ms expires—artificial delay that trades latency for throughput.
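The two flush triggers can be modelled with a toy single-partition accumulator. The class below is hypothetical and far simpler than the client's RecordAccumulator; time is passed in explicitly so the run is deterministic.

```python
class BatchAccumulator:
    """Toy accumulator: flush when the batch is full or has lingered long enough."""

    def __init__(self, batch_size: int, linger_ms: int):
        self.batch_size = batch_size
        self.linger_ms = linger_ms
        self.batch = []
        self.batch_bytes = 0
        self.first_append_ms = None
        self.sent = []                       # list of (reason, records)

    def append(self, record: bytes, now_ms: int) -> None:
        # If the record does not fit, ship the current batch first.
        if self.batch and self.batch_bytes + len(record) > self.batch_size:
            self._flush("batch.size")
        if not self.batch:
            self.first_append_ms = now_ms
        self.batch.append(record)
        self.batch_bytes += len(record)

    def poll(self, now_ms: int) -> None:
        """Called periodically by the sender loop."""
        if self.batch and now_ms - self.first_append_ms >= self.linger_ms:
            self._flush("linger.ms")

    def _flush(self, reason: str) -> None:
        self.sent.append((reason, self.batch))
        self.batch, self.batch_bytes, self.first_append_ms = [], 0, None


acc = BatchAccumulator(batch_size=300, linger_ms=10)
for t in range(4):
    acc.append(b"x" * 100, now_ms=t)         # the 4th record overflows the batch
acc.poll(now_ms=5)                            # too early, nothing is sent
acc.poll(now_ms=13)                           # the lone record has lingered 10 ms

print([(reason, len(records)) for reason, records in acc.sent])
# [('batch.size', 3), ('linger.ms', 1)]
```

Raising linger.ms lets more records share a batch at the cost of added latency for the first record in each batch.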
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
replica.socket.receive.buffer.bytes=65536
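Rule of thumb: num.io.threads ≈ 2 × number of data disks; num.network.threads = 3–8 per broker depending on connection count. Increase the relevant pool when its idle-percent metric (NetworkProcessorAvgIdlePercent or RequestHandlerAvgIdlePercent) stays below roughly 0.2–0.3, i.e. 20–30% idle.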
KRaft mode deep dive
KRaft (Kafka Raft) replaces ZooKeeper with an internal Raft consensus layer. Kafka 4.0 removed ZK entirely—every new cluster you build should be KRaft-native.
Raft consensus in brief
A small set of controller quorum nodes (typically 3 or 5) elect a leader via Raft. The leader accepts metadata changes (create topic, partition reassignment, ACL updates), appends them to its log, and replicates to followers. A change is committed when a quorum acknowledges it.
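The quorum arithmetic is short enough to write out. This is a simplified sketch with hypothetical function names: real Raft adds term checks before an entry counts as committed.

```python
def majority(voters: int) -> int:
    """Smallest number of voters that forms a quorum."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a quorum is still reachable."""
    return voters - majority(voters)


def committed_offset(end_offsets: list) -> int:
    """Highest offset stored on a majority of voters (leader included)."""
    ranked = sorted(end_offsets, reverse=True)
    return ranked[majority(len(ranked)) - 1]


print(tolerated_failures(3), tolerated_failures(5))   # 1 2
print(committed_offset([10, 9, 7]))                   # 9
print(committed_offset([10, 10, 8, 5, 5]))            # 8
```

An even-sized quorum buys nothing: four voters tolerate one failure, the same as three.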
KRaft roles
- Controller — participates in metadata quorum (dedicated or combined)
- Broker — serves produce/fetch; reads metadata from committed log
- Combined — single process acts as broker + controller (dev/small clusters)
__cluster_metadata topic
Cluster state—broker registrations, topic configs, partition assignments, feature flags—is stored as records in the internal __cluster_metadata topic, replicated via Raft (not standard Kafka partition replication). Brokers consume this log to learn who leads each partition.
flowchart TB
Admin[kafka-topics.sh / Admin API]
CL[Controller Leader\nRaft quorum]
CF[Controller Followers]
CM[("__cluster_metadata")]
B1[Broker 1]
B2[Broker 2]
Admin --> CL
CL --> CM
CL --> CF
CM --> B1
CM --> B2
Migration from ZooKeeper
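Kafka 3.x provides ZooKeeper-to-KRaft migration tooling (production-ready from 3.6): provision a KRaft controller quorum, let it copy the ZK metadata and run in dual-write mode, roll the brokers over to KRaft, verify, then finalize the cutover. Because 4.0 has no ZooKeeper support, the migration must be completed on a 3.x release before upgrading. Plan maintenance windows, since controller failover behavior differs during migration. Greenfield: skip ZK entirely.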
Performance improvements
- Faster controlled shutdown — metadata updates propagate without ZK session timeouts
- Faster partition leadership changes — fewer watch-notification delays
- Higher metadata scalability — millions of partitions vs ZK practical limits
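# Combined mode, single node (dev only)
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@localhost:9093
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
# Required in KRaft mode: names the listener used by the controller quorum
controller.listener.names=CONTROLLER
log.dirs=/var/kafka/data
metadata.log.dir=/var/kafka/metadata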
Production: separate controller and broker roles on different nodes—controller quorum should not compete with heavy fetch/produce I/O. Use 3 controllers for most clusters; 5 only when controller fault tolerance must survive two simultaneous losses.
LinkedIn validated KRaft at millions of partitions before GA. Confluent Cloud runs KRaft-backed clusters by default—customers never provisioned ZooKeeper.