How Apache Kafka works at scale
Kafka is often introduced as a “message queue.” In production it behaves more like a distributed commit log: producers append immutable records to partitioned topics; brokers replicate them for durability; consumer groups read at their own pace and track offsets.
We work through the design in order—requirements first, numbers second, architecture third, client APIs last—using Apache Kafka as the reference, with notes where managed services (MSK, Confluent Cloud) wrap the same primitives.
What you should be able to do after reading:
- Separate the three loops—produce, replicate, consume—and name what each guarantees.
- List functional and non-functional requirements and map them to
acks, ISR, and consumer groups. - Walk one record from producer batch → partition log → follower ack → consumer fetch → offset commit.
- Explain partition keys, consumer rebalancing, retention vs compaction, and at-least-once vs exactly-once.
- Read the technical section: topic config, producer/consumer settings, and admin operations.
Step 0 — How we will work through the problem
Ordered thinking beats memorizing a broker diagram. Use this sequence when you adopt or design around Kafka:
- Clarify scope. Event bus only, or stream processing (Kafka Streams/Flink)? Ordering per key? Global ordering?
- Write requirements. Functional = publish, subscribe, retain. Non-functional = throughput, durability, latency, retention.
- Do napkin math. MB/s per partition, partition count, retention disk—so you do not run three brokers into a wall.
- Draw three loops before picking topic names.
- Tell one story—one event, one partition, one consumer group—then failure cases (broker down, slow consumer, rebalance).
flowchart LR
subgraph produce [Produce loop]
P[Producer] --> BR[Broker leader]
end
subgraph replicate [Replicate loop]
BR --> F1[Follower]
BR --> F2[Follower]
F1 --> ISR[ISR ack]
F2 --> ISR
end
subgraph consume [Consume loop]
ISR --> CG[Consumer group]
CG --> OFF[Offset commit]
end
Step 1 — Functional requirements (what teams need from Kafka)
| Capability | Requirement | Why it shapes design |
|---|---|---|
| Publish | Many producers append to named topics | Topic = logical channel; partitions = parallelism |
| Subscribe | Many consumers read the same topic independently | Consumer groups + offsets per group |
| Order | Total order within one partition | Partition key routes related events together |
| Retain | Configurable time/size retention | Disk planning; replay for new consumers |
| Replay | Reset offset and re-read history | Backfill, recovery, new downstream service |
| Compact | Keep latest value per key (changelog topics) | Compaction vs delete retention |
| Connect | Integrate DBs and queues (Kafka Connect) | Separate ops surface; still same log model |
| Process | Stateful stream joins (Streams/Flink) | Uses partitions + changelog topics internally |
Functional details worth stating clearly
Kafka is pull-based. Consumers fetch when ready; slow readers do not push back pressure into producers directly—they fall behind (lag).
Messages are immutable. You do not update offset 42; you append offset 43 with a correction event if the business model allows.
Out of scope today (say it aloud). Building Kafka from scratch, or using it as a primary database without compaction discipline—park those debates.
Step 2 — Non-functional requirements (engineering promises)
| Category | Target (typical) | How we meet it | If we miss it |
|---|---|---|---|
| Throughput | MB/s–GB/s per cluster | Enough partitions, batching, compression | Backpressure upstream |
| Latency — produce | ms–tens of ms p99 | acks=1 vs all, linger.ms tuning | Downstream stale data |
| Durability | Survive broker loss without loss (when configured) | Replication factor ≥ 3, min.insync.replicas | Permanent gap in audit trail |
| Availability | Cluster stays writable on leader fail | ISR promotion, controller failover | Publish outage |
| Ordering | Per-partition FIFO | Single consumer per partition in group | Inventory count wrong |
| Retention | Days to weeks (events) or forever (compact keys) | retention.ms, cleanup.policy | Disk full; legal hold issues |
| Isolation | ACLs per topic/principal | SASL/SSL, ACL authorizer | Cross-team data leaks |
Key idea: Durability and latency trade off through acks and replication. Pick explicitly per topic—not one global default for the whole company.
Step 3 — Napkin math (partitions, disk, and lag)
- Suppose 20 KB average record × 50k records/s → ~1 GB/s ingress (before replication multiply by replication factor).
- One partition on one broker is often sized around tens of MB/s practical throughput—rule of thumb partitions ≥ max desired parallel consumers in a group.
- Retention: 1 TB/day per cluster with RF=3 → 3 TB/day disk on the fleet before compaction.
- Consumer lag = high watermark − committed offset; lag of millions means hours of catch-up at current fetch rate.
Step 4 — Architecture: brokers, topics, and the three loops
A cluster has many brokers. A topic is split into partitions, each an append-only log file sequence on disk. One broker is the leader per partition; followers replicate. ZooKeeper (older) or KRaft (modern) stores cluster metadata and elects leaders. Producers talk to partition leaders; consumers discover leaders via metadata.
flowchart TB
subgraph cluster [Kafka cluster]
B1[Broker 1]
B2[Broker 2]
B3[Broker 3]
CTL[Controller KRaft]
end
subgraph topic [Topic orders events]
P0[Partition 0 leader on B1]
P1[Partition 1 leader on B2]
P2[Partition 2 leader on B3]
end
PR[Producers] --> P0
PR --> P1
CG[Consumer group A] --> P0
CG --> P1
CG --> P2
CTL --> B1
CTL --> B2
CTL --> B3
Step 5 — Walk one record end to end
- Producer serializes key
order_id=8812, value JSON; partitioner maps key → partition 2. - Batch accumulates records for
linger.ms; sendsProducerequest to leader of partition 2. - Leader appends to local log segment; assigns offset
184_502_991. - Followers fetch from leader and write to their replicas; when all ISR members ack, leader responds per
acks=all. - Consumer in group
fulfillmentpolls, receives batch from partition 2; processes order. - Commit — after successful side effect, consumer commits offset
184_502_991to internal topic__consumer_offsets. - Failure — if consumer crashes before commit, another member in the group replays from last committed offset (at-least-once).
sequenceDiagram
participant P as Producer
participant L as Leader P2
participant F as Followers
participant C as Consumer
P->>L: Produce batch
L->>F: replicate
F-->>L: caught up
L-->>P: ack offset
C->>L: Fetch
L-->>C: records
C->>L: Commit offset
Step 6 — Topics, partitions, and keys
Topic — namespace (e.g. payments.captured). Partition — unit of parallelism and ordering.
Key — optional; default partitioner hash(key) % numPartitions keeps same key on same partition.
- More partitions → higher throughput ceiling, more files, longer rebalance times.
- Hot partition — skewed keys (one mega-merchant) bottleneck one broker; salt keys or dedicated topic.
- No key — round-robin spread; order not preserved across records.
# Create topic (CLI illustrative) kafka-topics.sh --create \ --topic payments.captured \ --partitions 24 \ --replication-factor 3 \ --config min.insync.replicas=2 \ --config retention.ms=604800000
Step 7 — Producers: batching, compression, and acks
| Setting | Effect |
|---|---|
acks=0 | Fire-and-forget; fastest; may lose data |
acks=1 | Leader wrote; follower loss can lose recent data |
acks=all | Wait for ISR; durable when min.insync.replicas met |
enable.idempotence=true | PID + sequence; dedupe retries per partition |
compression.type=lz4|zstd | Less network/disk; CPU on broker/producer |
linger.ms + batch.size | Higher throughput, higher latency |
Producers retry on transient errors; without idempotence, retries can create duplicates—consumers must dedupe or use idempotent producer + transactional writes.
Step 8 — Broker storage: segments, indexes, and retention
Each partition log is a series of segments (files). Active segment accepts writes; older segments are sealed. Sparse indexes map offset → file position for fast fetch. Time and size retention delete whole segments when policy triggers.
Log compaction (for changelog topics) keeps latest record per key; tombstones with null value delete keys after delete.retention.ms.
Step 9 — Replication, ISR, and leader election
Replication factor (RF) — copies across brokers. ISR (in-sync replicas) — followers within lag bound of leader.
Only ISR members count toward acks=all. If leader dies, controller picks new leader from ISR.
replica.lag.time.max.ms— follower falls out of ISR if too slow.unclean.leader.election.enable=false— avoid promoting out-of-sync replica (prevents data loss visibility).- Rack awareness — spread replicas across AZs (
broker.rack).
Sanity check: RF=3 and min.insync.replicas=2 means you can lose one broker and still accept durable writes; losing two ISR members blocks producers with acks=all.
Step 10 — Consumer groups, fetch, and rebalancing
A consumer group shares work: each partition assigned to at most one consumer in the group at a time. Different groups read the same topic independently (fan-out).
Rebalance — on member join/leave, partitions reassigned (range, round-robin, sticky, cooperative sticky protocols).
Rebalance pauses consumption briefly—keep sessions stable with max.poll.interval.ms and process records quickly.
# Pseudo: consumer subscribes
consumer.subscribe(["payments.captured"])
while running:
records = consumer.poll(timeout=1s)
for r in records:
process(r)
consumer.commit_sync() # or commit_async
Step 11 — Offsets and delivery semantics
| Semantics | How | Risk |
|---|---|---|
| At-most-once | Commit offset before process | Lost messages on crash |
| At-least-once | Process then commit (default) | Duplicates on retry |
| Exactly-once | Transactions + idempotent producer + read-process-write in one transaction | Complexity, throughput cost |
Offsets stored in __consumer_offsets (compact topic). Auto commit is easy but can commit before processing finishes—prefer manual commit after side effects.
Step 12 — When to use compaction, headers, and schemas
- Compaction — config snapshots, account state, Kafka Streams changelog topics.
- Headers — metadata (trace id, content-type) without parsing value.
- Schema Registry — Avro/Protobuf/JSON Schema; consumers evolve with compatibility rules (backward/forward).
Step 13 — Kafka Connect and stream processing (adjacent loops)
Kafka Connect — distributed connectors (source: DB CDC → topic; sink: topic → warehouse). Kafka Streams / Flink — stateful processing; stores state in local RocksDB + changelog compacted topics for recovery. Both assume you already sized topics and monitoring lag.
Step 14 — Operations: scaling, monitoring, and upgrades
- Add brokers — rebalance partition leaders for disk/CPU balance (cruise control or manual).
- Add partitions — increases parallelism; does not split existing keys mid-history.
- Metrics — under-replicated partitions, ISR shrink rate, request latency, disk usage, consumer lag per group.
- Upgrades — rolling broker restarts; check inter-broker protocol and client compatibility matrix.
Step 15 — Technical layer: APIs and configuration
| Surface | What it does |
|---|---|
| Producer API | send(ProducerRecord), callbacks on metadata/ack |
| Consumer API | poll, commitSync, seek for replay |
| Admin API | Create topics, describe configs, alter ISR (dangerous) |
| Wire protocol | Binary RPC over TCP; request types: ApiVersions, Metadata, Produce, Fetch, OffsetCommit |
Producer properties (illustrative):
bootstrap.servers=broker1:9092,broker2:9092 client.id=orders-service acks=all enable.idempotence=true compression.type=zstd linger.ms=20 key.serializer=org.apache.kafka.common.serialization.StringSerializer value.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer schema.registry.url=https://schema.example.com
Consumer properties (illustrative):
group.id=fulfillment-v3 enable.auto.commit=false isolation.level=read_committed # when using transactions max.poll.records=500 max.poll.interval.ms=300000
Step 16 — Reliability patterns and failure modes
Failure modes
- Broker disk full — stop writes; expand disk; drop retention emergency.
- Hot partition — one broker hosts leaders for all hot keys; fix keying or add partitions + rebalance.
- Poison message — consumer throws; DLQ topic with same key for manual fix; do not block whole partition forever.
- Zombie consumer — exceeds
max.poll.interval.ms; kicked from group; duplicate processing if commits were odd.
Design patterns
- Outbox pattern — DB txn writes row + outbox event; connector publishes to Kafka.
- Saga / event choreography — topics per domain event; idempotent consumers.
- Dead-letter queue —
orders.dlqwith original headers + error reason.
Step 17 — Goals → knobs (quick reference)
| Goal | Knob |
|---|---|
| Maximum throughput | More partitions, batching, compression, acks=1 where safe |
| Maximum durability | acks=all, RF=3, min.insync.replicas=2, disable unclean election |
| Low publish latency | Smaller batches, linger.ms=0, fewer replicas acked (trade durability) |
| Ordered processing per entity | Stable partition key; one consumer per partition in group |
| Replayable history | Long retention or compacted changelog; document offset reset runbooks |
Step 18 — Close the loop (what to practice)
On a whiteboard: three loops, one topic with three partitions, two consumer groups reading independently.
Out loud: difference between acks=1 and acks=all; what ISR means; how at-least-once duplicates happen.
With the technical section: trace a keyed produce to partition log offset and a manual offset commit after processing.
The one line to remember
Kafka is a replicated, partitioned append-only log with pull-based consumers. Producers fill partitions in order; brokers replicate for durability; consumer groups track how far each app has read—design around partitions, ISR, and offsets, not “a queue with a delete button.”