How Apache Kafka works at scale

Kafka is often introduced as a “message queue.” In production it behaves more like a distributed commit log: producers append immutable records to partitioned topics; brokers replicate them for durability; consumer groups read at their own pace and track offsets.

We work through the design in order—requirements first, numbers second, architecture third, client APIs last—using Apache Kafka as the reference, with notes where managed services (MSK, Confluent Cloud) wrap the same primitives.

What you should be able to do after reading:

Separate the three loops—produce, replicate, consume—and name what each guarantees.
List functional and non-functional requirements and map them to acks, ISR, and consumer groups.
Walk one record from producer batch → partition log → follower ack → consumer fetch → offset commit.
Explain partition keys, consumer rebalancing, retention vs compaction, and at-least-once vs exactly-once.
Read the technical section: topic config, producer/consumer settings, and admin operations.

Step 0 — How we will work through the problem

Ordered thinking beats memorizing a broker diagram. Use this sequence when you adopt or design around Kafka:

Clarify scope. Event bus only, or stream processing (Kafka Streams/Flink)? Ordering per key? Global ordering?
Write requirements. Functional = publish, subscribe, retain. Non-functional = throughput, durability, latency, retention.
Do napkin math. MB/s per partition, partition count, retention disk—so you do not run three brokers into a wall.
Draw three loops before picking topic names.
Tell one story—one event, one partition, one consumer group—then failure cases (broker down, slow consumer, rebalance).

flowchart LR
  subgraph produce [Produce loop]
    P[Producer] --> BR[Broker leader]
  end
  subgraph replicate [Replicate loop]
    BR --> F1[Follower]
    BR --> F2[Follower]
    F1 --> ISR[ISR ack]
    F2 --> ISR
  end
  subgraph consume [Consume loop]
    ISR --> CG[Consumer group]
    CG --> OFF[Offset commit]
  end

Step 1 — Functional requirements (what teams need from Kafka)

Capability	Requirement	Why it shapes design
Publish	Many producers append to named topics	Topic = logical channel; partitions = parallelism
Subscribe	Many consumers read the same topic independently	Consumer groups + offsets per group
Order	Total order within one partition	Partition key routes related events together
Retain	Configurable time/size retention	Disk planning; replay for new consumers
Replay	Reset offset and re-read history	Backfill, recovery, new downstream service
Compact	Keep latest value per key (changelog topics)	Compaction vs delete retention
Connect	Integrate DBs and queues (Kafka Connect)	Separate ops surface; still same log model
Process	Stateful stream joins (Streams/Flink)	Uses partitions + changelog topics internally

Functional details worth stating clearly

Kafka is pull-based. Consumers fetch when ready; slow readers do not push back pressure into producers directly—they fall behind (lag).

Messages are immutable. You do not update offset 42; you append offset 43 with a correction event if the business model allows.

Out of scope today (say it aloud). Building Kafka from scratch, or using it as a primary database without compaction discipline—park those debates.

Step 2 — Non-functional requirements (engineering promises)

Category	Target (typical)	How we meet it	If we miss it
Throughput	MB/s–GB/s per cluster	Enough partitions, batching, compression	Backpressure upstream
Latency — produce	ms–tens of ms p99	`acks=1` vs `all`, linger.ms tuning	Downstream stale data
Durability	Survive broker loss without loss (when configured)	Replication factor ≥ 3, `min.insync.replicas`	Permanent gap in audit trail
Availability	Cluster stays writable on leader fail	ISR promotion, controller failover	Publish outage
Ordering	Per-partition FIFO	Single consumer per partition in group	Inventory count wrong
Retention	Days to weeks (events) or forever (compact keys)	`retention.ms`, `cleanup.policy`	Disk full; legal hold issues
Isolation	ACLs per topic/principal	SASL/SSL, ACL authorizer	Cross-team data leaks

Key idea: Durability and latency trade off through acks and replication. Pick explicitly per topic—not one global default for the whole company.

Step 3 — Napkin math (partitions, disk, and lag)

Suppose 20 KB average record × 50k records/s → ~1 GB/s ingress (before replication multiply by replication factor).
One partition on one broker is often sized around tens of MB/s practical throughput—rule of thumb partitions ≥ max desired parallel consumers in a group.
Retention: 1 TB/day per cluster with RF=3 → 3 TB/day disk on the fleet before compaction.
Consumer lag = high watermark − committed offset; lag of millions means hours of catch-up at current fetch rate.

Step 4 — Architecture: brokers, topics, and the three loops

A cluster has many brokers. A topic is split into partitions, each an append-only log file sequence on disk. One broker is the leader per partition; followers replicate. ZooKeeper (older) or KRaft (modern) stores cluster metadata and elects leaders. Producers talk to partition leaders; consumers discover leaders via metadata.

flowchart TB
  subgraph cluster [Kafka cluster]
    B1[Broker 1]
    B2[Broker 2]
    B3[Broker 3]
    CTL[Controller KRaft]
  end
  subgraph topic [Topic orders events]
    P0[Partition 0 leader on B1]
    P1[Partition 1 leader on B2]
    P2[Partition 2 leader on B3]
  end
  PR[Producers] --> P0
  PR --> P1
  CG[Consumer group A] --> P0
  CG --> P1
  CG --> P2
  CTL --> B1
  CTL --> B2
  CTL --> B3

Step 5 — Walk one record end to end

Producer serializes key order_id=8812, value JSON; partitioner maps key → partition 2.
Batch accumulates records for linger.ms; sends Produce request to leader of partition 2.
Leader appends to local log segment; assigns offset 184_502_991.
Followers fetch from leader and write to their replicas; when all ISR members ack, leader responds per acks=all.
Consumer in group fulfillment polls, receives batch from partition 2; processes order.
Commit — after successful side effect, consumer commits offset 184_502_991 to internal topic __consumer_offsets.
Failure — if consumer crashes before commit, another member in the group replays from last committed offset (at-least-once).

sequenceDiagram
  participant P as Producer
  participant L as Leader P2
  participant F as Followers
  participant C as Consumer
  P->>L: Produce batch
  L->>F: replicate
  F-->>L: caught up
  L-->>P: ack offset
  C->>L: Fetch
  L-->>C: records
  C->>L: Commit offset

Step 6 — Topics, partitions, and keys

Topic — namespace (e.g. payments.captured). Partition — unit of parallelism and ordering. Key — optional; default partitioner hash(key) % numPartitions keeps same key on same partition.

More partitions → higher throughput ceiling, more files, longer rebalance times.
Hot partition — skewed keys (one mega-merchant) bottleneck one broker; salt keys or dedicated topic.
No key — round-robin spread; order not preserved across records.

# Create topic (CLI illustrative)
kafka-topics.sh --create \
  --topic payments.captured \
  --partitions 24 \
  --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config retention.ms=604800000

Step 7 — Producers: batching, compression, and acks

Setting	Effect
`acks=0`	Fire-and-forget; fastest; may lose data
`acks=1`	Leader wrote; follower loss can lose recent data
`acks=all`	Wait for ISR; durable when `min.insync.replicas` met
`enable.idempotence=true`	PID + sequence; dedupe retries per partition
`compression.type=lz4\|zstd`	Less network/disk; CPU on broker/producer
`linger.ms` + `batch.size`	Higher throughput, higher latency

Producers retry on transient errors; without idempotence, retries can create duplicates—consumers must dedupe or use idempotent producer + transactional writes.

Step 8 — Broker storage: segments, indexes, and retention

Each partition log is a series of segments (files). Active segment accepts writes; older segments are sealed. Sparse indexes map offset → file position for fast fetch. Time and size retention delete whole segments when policy triggers.

Log compaction (for changelog topics) keeps latest record per key; tombstones with null value delete keys after delete.retention.ms.

Step 9 — Replication, ISR, and leader election

Replication factor (RF) — copies across brokers. ISR (in-sync replicas) — followers within lag bound of leader. Only ISR members count toward acks=all. If leader dies, controller picks new leader from ISR.

replica.lag.time.max.ms — follower falls out of ISR if too slow.
unclean.leader.election.enable=false — avoid promoting out-of-sync replica (prevents data loss visibility).
Rack awareness — spread replicas across AZs (broker.rack).

Sanity check: RF=3 and min.insync.replicas=2 means you can lose one broker and still accept durable writes; losing two ISR members blocks producers with acks=all.

Step 10 — Consumer groups, fetch, and rebalancing

A consumer group shares work: each partition assigned to at most one consumer in the group at a time. Different groups read the same topic independently (fan-out).

Rebalance — on member join/leave, partitions reassigned (range, round-robin, sticky, cooperative sticky protocols). Rebalance pauses consumption briefly—keep sessions stable with max.poll.interval.ms and process records quickly.

# Pseudo: consumer subscribes
consumer.subscribe(["payments.captured"])
while running:
    records = consumer.poll(timeout=1s)
    for r in records:
        process(r)
    consumer.commit_sync()  # or commit_async

Step 11 — Offsets and delivery semantics

Semantics	How	Risk
At-most-once	Commit offset before process	Lost messages on crash
At-least-once	Process then commit (default)	Duplicates on retry
Exactly-once	Transactions + idempotent producer + read-process-write in one transaction	Complexity, throughput cost

Offsets stored in __consumer_offsets (compact topic). Auto commit is easy but can commit before processing finishes—prefer manual commit after side effects.

Step 12 — When to use compaction, headers, and schemas

Compaction — config snapshots, account state, Kafka Streams changelog topics.
Headers — metadata (trace id, content-type) without parsing value.
Schema Registry — Avro/Protobuf/JSON Schema; consumers evolve with compatibility rules (backward/forward).

Step 13 — Kafka Connect and stream processing (adjacent loops)

Kafka Connect — distributed connectors (source: DB CDC → topic; sink: topic → warehouse). Kafka Streams / Flink — stateful processing; stores state in local RocksDB + changelog compacted topics for recovery. Both assume you already sized topics and monitoring lag.

Step 14 — Operations: scaling, monitoring, and upgrades

Add brokers — rebalance partition leaders for disk/CPU balance (cruise control or manual).
Add partitions — increases parallelism; does not split existing keys mid-history.
Metrics — under-replicated partitions, ISR shrink rate, request latency, disk usage, consumer lag per group.
Upgrades — rolling broker restarts; check inter-broker protocol and client compatibility matrix.

Step 15 — Technical layer: APIs and configuration

Surface	What it does
Producer API	`send(ProducerRecord)`, callbacks on metadata/ack
Consumer API	`poll`, `commitSync`, `seek` for replay
Admin API	Create topics, describe configs, alter ISR (dangerous)
Wire protocol	Binary RPC over TCP; request types: ApiVersions, Metadata, Produce, Fetch, OffsetCommit

Producer properties (illustrative):

bootstrap.servers=broker1:9092,broker2:9092
client.id=orders-service
acks=all
enable.idempotence=true
compression.type=zstd
linger.ms=20
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer
schema.registry.url=https://schema.example.com

Consumer properties (illustrative):

group.id=fulfillment-v3
enable.auto.commit=false
isolation.level=read_committed   # when using transactions
max.poll.records=500
max.poll.interval.ms=300000

Step 16 — Reliability patterns and failure modes

Failure modes

Broker disk full — stop writes; expand disk; drop retention emergency.
Hot partition — one broker hosts leaders for all hot keys; fix keying or add partitions + rebalance.
Poison message — consumer throws; DLQ topic with same key for manual fix; do not block whole partition forever.
Zombie consumer — exceeds max.poll.interval.ms; kicked from group; duplicate processing if commits were odd.

Design patterns

Outbox pattern — DB txn writes row + outbox event; connector publishes to Kafka.
Saga / event choreography — topics per domain event; idempotent consumers.
Dead-letter queue — orders.dlq with original headers + error reason.

Step 17 — Goals → knobs (quick reference)

Goal	Knob
Maximum throughput	More partitions, batching, compression, `acks=1` where safe
Maximum durability	`acks=all`, RF=3, `min.insync.replicas=2`, disable unclean election
Low publish latency	Smaller batches, `linger.ms=0`, fewer replicas acked (trade durability)
Ordered processing per entity	Stable partition key; one consumer per partition in group
Replayable history	Long retention or compacted changelog; document offset reset runbooks

Step 18 — Close the loop (what to practice)

On a whiteboard: three loops, one topic with three partitions, two consumer groups reading independently.

Out loud: difference between acks=1 and acks=all; what ISR means; how at-least-once duplicates happen.

With the technical section: trace a keyed produce to partition log offset and a manual offset commit after processing.

The one line to remember

Kafka is a replicated, partitioned append-only log with pull-based consumers. Producers fill partitions in order; brokers replicate for durability; consumer groups track how far each app has read—design around partitions, ISR, and offsets, not “a queue with a delete button.”