Core Architecture & Internals

The commit log

Kafka’s central abstraction is brutally simple: an append-only, ordered, immutable sequence of records. Everything else—topics, partitions, consumer groups—is naming and sharding layered on top of that log.

What a commit log is

A commit log (write-ahead log, transaction log, journal) records events in the order they arrive. New records are always appended at the end. Existing bytes are never updated in place—corrections are new records (or tombstones under compaction). Consumers track a position (offset) in this sequence.

Why append-only beats random writes

On HDD, sequential writes approach full disk bandwidth; random writes seek and stall. On SSD, random writes trigger write amplification—the FTL erases and rewrites larger blocks than your 1 KB message. Kafka batches many small records into large segment appends, keeping the device in sequential mode.

🔬 Under the Hood

When a producer sends a batch, the partition leader appends bytes to the active .log segment file, updates the sparse offset index, and only then acknowledges—depending on your acks setting and ISR state.

How Kafka maps the log to disk

Each partition is a directory on the broker filesystem:

/var/kafka/data/orders-2/          ← topic orders, partition 2
├── 00000000000000000000.log      ← segment: message bytes
├── 00000000000000000000.index    ← sparse offset → file position
├── 00000000000000000000.timeindex ← timestamp → offset
├── 00000000000000123456.log      ← rolled segment (previous active)
├── 00000000000000123456.index
└── 00000000000000123456.timeindex

Log segment rolling

Active segments grow until log.segment.bytes (default 1 GB) or log.roll.ms / log.roll.hours triggers a roll. A new empty segment becomes active; the old one is eligible for retention deletion or compaction.

Offset as position—not a message ID

Offset 42 means “the 43rd record in this partition log” (0-indexed). It is not a UUID, not a timestamp, and not stable across topics. Replay means resetting that pointer backward; compaction may change which record sits at a given offset over time for keyed topics.

Property	Default	Production	Impact
`log.segment.bytes`	1 GB	512 MB – 1 GB	Smaller segments = more files, faster retention deletes; larger = fewer file handles
`log.roll.hours`	168 (7 days)	24–168	Time-based roll even if segment not full—helps retention boundary alignment
`log.retention.hours`	168	Per-topic (24–720+)	How long delete-policy data survives on disk

💡 Pro Tip

On brokers with many partitions, prefer fewer large segments over many tiny segments—each segment is three files (log, index, timeindex) and consumes file descriptors.

Topics & partitions

A topic is a named log. A partition is that log split into a shard—Kafka’s unit of parallelism and ordering. Think of partitions as lanes on a highway: more lanes move more traffic, but each lane preserves order.

Topic as a named log

Producers publish to a topic name (orders, user-events). The cluster maps the topic to N partitions spread across brokers. Consumers subscribe to the topic; the group coordinator assigns specific partitions to each member.

Partition: parallelism AND ordering

Ordering is guaranteed only within one partition. If userId=42 must always be processed in order, all events for that user must land in the same partition—typically via key-based partitioning: partition = murmur2(key) % numPartitions.

Partition count decision

Throughput target ÷ per-partition throughput (~10–100 MB/s depending on hardware)
Consumer parallelism — max consumers in a group = partition count
Ordering scope — more partitions = finer parallelism but weaker global order

⚠️ Pitfall

You cannot reduce partition count—only increase. Plan partition count at topic creation; changing later requires careful migration and does not shrink existing data layout.

Partition assignment to brokers

Each partition has one leader broker (handles all reads and writes) and follower replicas (RF−1). The controller assigns leaders to balance load. Clients always talk to the leader for a partition; followers replicate by fetching from the leader.

flowchart TB
  subgraph topic["Topic: orders (3 partitions, RF=3)"]
    P0["Partition 0\nLeader: Broker 1"]
    P1["Partition 1\nLeader: Broker 2"]
    P2["Partition 2\nLeader: Broker 3"]
  end
  B1[Broker 1]
  B2[Broker 2]
  B3[Broker 3]
  P0 --> B1
  P0 -.replica.-> B2
  P0 -.replica.-> B3
  P1 --> B2
  P1 -.replica.-> B3
  P1 -.replica.-> B1
  P2 --> B3
  P2 -.replica.-> B1
  P2 -.replica.-> B2

Hot partition problem

Bad key choice—countryCode="US" for 70% of traffic, or null keys with sticky partitioning that still skews—concentrates load on one partition. One hot partition caps throughput for the whole topic and creates single-consumer bottlenecks.

🎯 Interview Tip

“How do you scale a hot partition?” — honest answers: fix the key (salt/high-cardinality), split into multiple topics, or accept that ordering for that key serializes processing. You cannot add partitions and magically rebalance existing key distribution.

Interactive partition visualizer

See how record keys map to partitions using Kafka’s real murmur2 hash. Toggle hot-key mode to watch one lane absorb all traffic.

Partitions: 6 Records Hot key (all records same key)

partition = (murmur2(key) & 0x7fffffff) % numPartitions — same as DefaultPartitioner

⚖️ Trade-off

High-cardinality keys (user ID, order ID) spread load but preserve per-entity order. Low-cardinality keys (country, status) maximize ordering scope but invite hot partitions—choose deliberately.

Broker internals

A Kafka broker is a JVM process that owns partition logs, serves client requests, and participates in cluster metadata. Understanding leader/follower roles and ISR mechanics is prerequisite to any durability discussion.

Broker roles

Controller — one broker per cluster (KRaft: controller quorum); manages partition leader election, topic creation, ISR changes
Leader — per partition; handles all produce and fetch requests for that partition
Follower — replicates the leader’s log; can be elected leader if in ISR

ZooKeeper mode (legacy) vs KRaft

Aspect	ZooKeeper (pre-4.0)	KRaft (Kafka 3.3+ GA, 4.0 default)
Metadata store	External ZK ensemble (3–5 nodes)	Internal __cluster_metadata topic
Controller failover	ZK watches + controller election	Raft quorum—faster, fewer edge cases
Ops burden	Two systems to patch, monitor, backup	Kafka-only operational surface
Partition ceiling	~200K metadata entries practical limit	Millions of partitions (tested at LinkedIn scale)

ISR (In-Sync Replicas)

ISR is the set of replicas caught up with the leader—followers whose lag is within replica.lag.time.max.ms (default 30s). Only ISR members are eligible for leader election (when unclean.leader.election.enable=false, the production default).

Replication flow

Producer sends batch to partition leader
Leader appends to local log, updates in-memory high watermark candidate
Followers fetch from leader and append to their logs
Leader advances high watermark (HW) to the offset replicated by all ISR members (for acks=all)
Consumers only read up to HW on followers; leader may expose up to LEO (log end offset)

  Leader log:     [0][1][2][3][4][5]  LEO=6
  Follower log:   [0][1][2][3]         LEO=4  (lagging)
  High Watermark:              ^ HW=4 (min offset replicated to ALL ISR)

  If follower catches up to 6 → HW advances to 6

min.insync.replicas — the durability knob

When acks=all, the leader checks |ISR| ≥ min.insync.replicas before acknowledging. If ISR shrinks below this threshold, produce requests fail with NOT_ENOUGH_REPLICAS—availability traded for durability.

# Broker / topic config
default.replication.factor: 3
min.insync.replicas: 2
unclean.leader.election.enable: false

# Producer (Spring Kafka)
spring.kafka.producer.acks: all
spring.kafka.producer.retries: 2147483647
spring.kafka.producer.enable-idempotence: true

⚠️ Pitfall

ISR shrink under load — slow followers drop out of ISR; with min.insync.replicas=2 and only one replica in sync, producers with acks=all stall. Monitor UnderReplicatedPartitions and follower fetch latency.

⚖️ Trade-off

Unclean leader election — unclean.leader.election.enable=true restores availability by electing an out-of-sync replica, but you lose unreplicated data. Production: always false.

⚙️ Config

Production triad: replication.factor=3 + min.insync.replicas=2 + acks=all. Survives one broker loss without write outage or data loss.

Storage internals

Kafka’s performance secret is not a clever JVM heap structure—it is letting the operating system’s page cache do the heavy lifting, then bypassing userspace on the read path with zero-copy.

Log segment anatomy

File	Purpose	Lookup
`.log`	Sequential message bytes (batch format)	Read sequentially from file position
`.index`	Sparse map: offset → byte position in .log	Binary search, then scan forward
`.timeindex`	Sparse map: timestamp → offset	Time-based offset lookup (offsetsForTimes)

Index lookup: finding offset N

Indexes store entries every index.interval.bytes (default 4 KB of messages)—not per message. To read offset 1,000,000: binary search the index for the nearest entry ≤ 1,000,000, seek the .log file to that position, scan forward until the target offset.

Page cache — deliberately not JVM heap

Kafka brokers run with modest heap (4–6 GB) and rely on OS page cache for hot data. Recently written and frequently read segments stay in RAM without GC pressure. Implication: GC pauses don’t stall I/O, but broker restarts are slow—cold cache must warm from disk.

🔬 Under the Hood

On fetch, the broker reads from page cache (if warm) and uses sendfile() to push bytes to the socket. The JVM never deserializes message payloads for plain replication or consumer fetch—only metadata parsing for metrics and quotas.

Zero-copy transfer

  Traditional path:
  Disk → kernel buffer → JVM heap → kernel socket buffer → NIC  (4 copies)

  Kafka sendfile path:
  Disk → page cache → socket buffer → NIC  (2 copies, no userspace)

Log compaction vs retention

Retention (delete) — log.cleanup.policy=delete; remove segments older than retention.ms or exceeding retention.bytes
Compaction — log.cleanup.policy=compact; keep latest value per key, tombstone deletes after delete.retention.ms
Combined — compact,delete; compact keyed state, then apply time/size retention on non-latest records

Compaction internals

A log cleaner thread pool scans “dirty” segments (ratio controlled by min.cleanable.dirty.ratio). It builds a map of key → latest offset, rewrites clean segments, and swaps them in. max.compaction.lag.ms bounds how long a key can live before compaction must process it.

⚠️ Pitfall

Log compaction misconceptions — compaction is asynchronous and laggy under write pressure. Consumers cannot assume a compacted topic has only one record per key at all times; they may see superseded values until the cleaner runs.

📦 Real World

LinkedIn uses compacted topics for config and state changelog backing Kafka Streams. Confluent Schema Registry stores schemas in the compacted _schemas topic.

Network layer

Kafka speaks a custom binary protocol over TCP—not HTTP, not AMQP. Understanding request types and the reactor thread model explains broker CPU profiles and producer batching behavior.

Request types (selected)

API key	Request	Who sends
Produce	Append record batches to partition leaders	Producer
Fetch	Read records from leader (consumers & followers)	Consumer, follower replica
Metadata	Topic/partition → broker leader mapping	All clients (cached)
FindCoordinator	Locate consumer group coordinator broker	Consumer
OffsetCommit / Fetch	Consumer group offset management	Consumer

Reactor pattern on the broker

flowchart LR
  NIC[TCP connections]
  NT[num.network.threads\naccept + parse]
  RQ[Request queue]
  IO[num.io.threads\ndisk + replication]
  NIC --> NT --> RQ --> IO
  IO --> RQ --> NT --> NIC

Network threads read bytes, parse request headers, and enqueue. I/O threads execute produce/fetch against disk and replication. Saturated network threads → NetworkProcessorAvgIdlePercent drops; saturated I/O threads → RequestHandlerAvgIdlePercent drops.

Producer-side batch accumulation

Before bytes hit the wire, the producer holds records in the RecordAccumulator until batch.size fills or linger.ms expires—artificial delay that trades latency for throughput.

num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
replica.socket.receive.buffer.bytes=65536

⚙️ Config

Rule of thumb: num.io.threads ≈ 2 × number of data disks; num.network.threads = 3–8 per broker depending on connection count. Increase when idle percent metrics drop below 0.2 / 0.3.

KRaft mode deep dive

KRaft (Kafka Raft) replaces ZooKeeper with an internal Raft consensus layer. Kafka 4.0 removed ZK entirely—every new cluster you build should be KRaft-native.

Raft consensus in brief

A small set of controller quorum nodes (typically 3 or 5) elect a leader via Raft. The leader accepts metadata changes (create topic, partition reassignment, ACL updates), appends them to its log, and replicates to followers. A change is committed when a quorum acknowledges it.

KRaft roles

Controller — participates in metadata quorum (dedicated or combined)
Broker — serves produce/fetch; reads metadata from committed log
Combined — single process acts as broker + controller (dev/small clusters)

__cluster_metadata topic

Cluster state—broker registrations, topic configs, partition assignments, feature flags—is stored as records in the internal __cluster_metadata topic, replicated via Raft (not standard Kafka partition replication). Brokers consume this log to learn who leads each partition.

flowchart TB
  Admin[kafka-topics.sh / Admin API]
  CL[Controller Leader\nRaft quorum]
  CF[Controller Followers]
  CM[("__cluster_metadata")]
  B1[Broker 1]
  B2[Broker 2]
  Admin --> CL
  CL --> CM
  CL --> CF
  CM --> B1
  CM --> B2

Migration from ZooKeeper

Kafka 3.x provides dual-write migration tooling: export ZK metadata, bootstrap KRaft quorum, verify, cut over. Plan maintenance windows—controller failover behavior differs during migration. Greenfield: skip ZK entirely.

Performance improvements

Faster controlled shutdown — metadata updates propagate without ZK session timeouts
Faster partition leadership changes — fewer watch-notification delays
Higher metadata scalability — millions of partitions vs ZK practical limits

process.roles=broker,controller
node.id=1
controller.quorum.voters=1@localhost:9093
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
log.dirs=/var/kafka/data
metadata.log.dir=/var/kafka/metadata

💡 Pro Tip

Production: separate controller and broker roles on different nodes—controller quorum should not compete with heavy fetch/produce I/O. Use 3 controllers for most clusters; 5 only when controller fault tolerance must survive two simultaneous losses.

📦 Real World

LinkedIn validated KRaft at millions of partitions before GA. Confluent Cloud runs KRaft-backed clusters by default—customers never provisioned ZooKeeper.