sharpbyte.dev

How Apache Kafka works at scale

Kafka is often introduced as a “message queue.” In production it behaves more like a distributed commit log: producers append immutable records to partitioned topics; brokers replicate them for durability; consumer groups read at their own pace and track offsets.

We work through the design in order—requirements first, numbers second, architecture third, client APIs last—using Apache Kafka as the reference, with notes where managed services (MSK, Confluent Cloud) wrap the same primitives.

What you should be able to do after reading:

Step 0 — How we will work through the problem

Ordered thinking beats memorizing a broker diagram. Use this sequence when you adopt or design around Kafka:

  1. Clarify scope. Event bus only, or stream processing (Kafka Streams/Flink)? Ordering per key? Global ordering?
  2. Write requirements. Functional = publish, subscribe, retain. Non-functional = throughput, durability, latency, retention.
  3. Do napkin math. MB/s per partition, partition count, retention disk—so you do not run three brokers into a wall.
  4. Draw three loops before picking topic names.
  5. Tell one story—one event, one partition, one consumer group—then failure cases (broker down, slow consumer, rebalance).
flowchart LR
  subgraph produce [Produce loop]
    P[Producer] --> BR[Broker leader]
  end
  subgraph replicate [Replicate loop]
    BR --> F1[Follower]
    BR --> F2[Follower]
    F1 --> ISR[ISR ack]
    F2 --> ISR
  end
  subgraph consume [Consume loop]
    ISR --> CG[Consumer group]
    CG --> OFF[Offset commit]
  end
    

Step 1 — Functional requirements (what teams need from Kafka)

CapabilityRequirementWhy it shapes design
PublishMany producers append to named topicsTopic = logical channel; partitions = parallelism
SubscribeMany consumers read the same topic independentlyConsumer groups + offsets per group
OrderTotal order within one partitionPartition key routes related events together
RetainConfigurable time/size retentionDisk planning; replay for new consumers
ReplayReset offset and re-read historyBackfill, recovery, new downstream service
CompactKeep latest value per key (changelog topics)Compaction vs delete retention
ConnectIntegrate DBs and queues (Kafka Connect)Separate ops surface; still same log model
ProcessStateful stream joins (Streams/Flink)Uses partitions + changelog topics internally

Functional details worth stating clearly

Kafka is pull-based. Consumers fetch when ready; slow readers do not push back pressure into producers directly—they fall behind (lag).

Messages are immutable. You do not update offset 42; you append offset 43 with a correction event if the business model allows.

Out of scope today (say it aloud). Building Kafka from scratch, or using it as a primary database without compaction discipline—park those debates.

Step 2 — Non-functional requirements (engineering promises)

CategoryTarget (typical)How we meet itIf we miss it
ThroughputMB/s–GB/s per clusterEnough partitions, batching, compressionBackpressure upstream
Latency — producems–tens of ms p99acks=1 vs all, linger.ms tuningDownstream stale data
DurabilitySurvive broker loss without loss (when configured)Replication factor ≥ 3, min.insync.replicasPermanent gap in audit trail
AvailabilityCluster stays writable on leader failISR promotion, controller failoverPublish outage
OrderingPer-partition FIFOSingle consumer per partition in groupInventory count wrong
RetentionDays to weeks (events) or forever (compact keys)retention.ms, cleanup.policyDisk full; legal hold issues
IsolationACLs per topic/principalSASL/SSL, ACL authorizerCross-team data leaks

Key idea: Durability and latency trade off through acks and replication. Pick explicitly per topic—not one global default for the whole company.

Step 3 — Napkin math (partitions, disk, and lag)

Step 4 — Architecture: brokers, topics, and the three loops

A cluster has many brokers. A topic is split into partitions, each an append-only log file sequence on disk. One broker is the leader per partition; followers replicate. ZooKeeper (older) or KRaft (modern) stores cluster metadata and elects leaders. Producers talk to partition leaders; consumers discover leaders via metadata.

flowchart TB
  subgraph cluster [Kafka cluster]
    B1[Broker 1]
    B2[Broker 2]
    B3[Broker 3]
    CTL[Controller KRaft]
  end
  subgraph topic [Topic orders events]
    P0[Partition 0 leader on B1]
    P1[Partition 1 leader on B2]
    P2[Partition 2 leader on B3]
  end
  PR[Producers] --> P0
  PR --> P1
  CG[Consumer group A] --> P0
  CG --> P1
  CG --> P2
  CTL --> B1
  CTL --> B2
  CTL --> B3
    

Step 5 — Walk one record end to end

  1. Producer serializes key order_id=8812, value JSON; partitioner maps key → partition 2.
  2. Batch accumulates records for linger.ms; sends Produce request to leader of partition 2.
  3. Leader appends to local log segment; assigns offset 184_502_991.
  4. Followers fetch from leader and write to their replicas; when all ISR members ack, leader responds per acks=all.
  5. Consumer in group fulfillment polls, receives batch from partition 2; processes order.
  6. Commit — after successful side effect, consumer commits offset 184_502_991 to internal topic __consumer_offsets.
  7. Failure — if consumer crashes before commit, another member in the group replays from last committed offset (at-least-once).
sequenceDiagram
  participant P as Producer
  participant L as Leader P2
  participant F as Followers
  participant C as Consumer
  P->>L: Produce batch
  L->>F: replicate
  F-->>L: caught up
  L-->>P: ack offset
  C->>L: Fetch
  L-->>C: records
  C->>L: Commit offset
    

Step 6 — Topics, partitions, and keys

Topic — namespace (e.g. payments.captured). Partition — unit of parallelism and ordering. Key — optional; default partitioner hash(key) % numPartitions keeps same key on same partition.

# Create topic (CLI illustrative)
kafka-topics.sh --create \
  --topic payments.captured \
  --partitions 24 \
  --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config retention.ms=604800000

Step 7 — Producers: batching, compression, and acks

SettingEffect
acks=0Fire-and-forget; fastest; may lose data
acks=1Leader wrote; follower loss can lose recent data
acks=allWait for ISR; durable when min.insync.replicas met
enable.idempotence=truePID + sequence; dedupe retries per partition
compression.type=lz4|zstdLess network/disk; CPU on broker/producer
linger.ms + batch.sizeHigher throughput, higher latency

Producers retry on transient errors; without idempotence, retries can create duplicates—consumers must dedupe or use idempotent producer + transactional writes.

Step 8 — Broker storage: segments, indexes, and retention

Each partition log is a series of segments (files). Active segment accepts writes; older segments are sealed. Sparse indexes map offset → file position for fast fetch. Time and size retention delete whole segments when policy triggers.

Log compaction (for changelog topics) keeps latest record per key; tombstones with null value delete keys after delete.retention.ms.

Step 9 — Replication, ISR, and leader election

Replication factor (RF) — copies across brokers. ISR (in-sync replicas) — followers within lag bound of leader. Only ISR members count toward acks=all. If leader dies, controller picks new leader from ISR.

Sanity check: RF=3 and min.insync.replicas=2 means you can lose one broker and still accept durable writes; losing two ISR members blocks producers with acks=all.

Step 10 — Consumer groups, fetch, and rebalancing

A consumer group shares work: each partition assigned to at most one consumer in the group at a time. Different groups read the same topic independently (fan-out).

Rebalance — on member join/leave, partitions reassigned (range, round-robin, sticky, cooperative sticky protocols). Rebalance pauses consumption briefly—keep sessions stable with max.poll.interval.ms and process records quickly.

# Pseudo: consumer subscribes
consumer.subscribe(["payments.captured"])
while running:
    records = consumer.poll(timeout=1s)
    for r in records:
        process(r)
    consumer.commit_sync()  # or commit_async

Step 11 — Offsets and delivery semantics

SemanticsHowRisk
At-most-onceCommit offset before processLost messages on crash
At-least-onceProcess then commit (default)Duplicates on retry
Exactly-onceTransactions + idempotent producer + read-process-write in one transactionComplexity, throughput cost

Offsets stored in __consumer_offsets (compact topic). Auto commit is easy but can commit before processing finishes—prefer manual commit after side effects.

Step 12 — When to use compaction, headers, and schemas

Step 13 — Kafka Connect and stream processing (adjacent loops)

Kafka Connect — distributed connectors (source: DB CDC → topic; sink: topic → warehouse). Kafka Streams / Flink — stateful processing; stores state in local RocksDB + changelog compacted topics for recovery. Both assume you already sized topics and monitoring lag.

Step 14 — Operations: scaling, monitoring, and upgrades

Step 15 — Technical layer: APIs and configuration

SurfaceWhat it does
Producer APIsend(ProducerRecord), callbacks on metadata/ack
Consumer APIpoll, commitSync, seek for replay
Admin APICreate topics, describe configs, alter ISR (dangerous)
Wire protocolBinary RPC over TCP; request types: ApiVersions, Metadata, Produce, Fetch, OffsetCommit

Producer properties (illustrative):

bootstrap.servers=broker1:9092,broker2:9092
client.id=orders-service
acks=all
enable.idempotence=true
compression.type=zstd
linger.ms=20
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer
schema.registry.url=https://schema.example.com

Consumer properties (illustrative):

group.id=fulfillment-v3
enable.auto.commit=false
isolation.level=read_committed   # when using transactions
max.poll.records=500
max.poll.interval.ms=300000

Step 16 — Reliability patterns and failure modes

Failure modes

Design patterns

Step 17 — Goals → knobs (quick reference)

GoalKnob
Maximum throughputMore partitions, batching, compression, acks=1 where safe
Maximum durabilityacks=all, RF=3, min.insync.replicas=2, disable unclean election
Low publish latencySmaller batches, linger.ms=0, fewer replicas acked (trade durability)
Ordered processing per entityStable partition key; one consumer per partition in group
Replayable historyLong retention or compacted changelog; document offset reset runbooks

Step 18 — Close the loop (what to practice)

On a whiteboard: three loops, one topic with three partitions, two consumer groups reading independently.

Out loud: difference between acks=1 and acks=all; what ISR means; how at-least-once duplicates happen.

With the technical section: trace a keyed produce to partition log offset and a manual offset commit after processing.

The one line to remember

Kafka is a replicated, partitioned append-only log with pull-based consumers. Producers fill partitions in order; brokers replicate for durability; consumer groups track how far each app has read—design around partitions, ISR, and offsets, not “a queue with a delete button.”