Kafka Core

Learn Apache Kafka from commit log to production clusters

Internals-first guides for developers, data engineers, and platform teams—what happens on disk, on the wire, and in the JVM when you ship event-driven systems at scale.

What is Kafka? The LinkedIn origin story

In 2010, LinkedIn’s data pipeline was a tangle of point-to-point integrations: every new consumer meant wiring another bespoke feed from every producer. Jay Kreps, Neha Narkhede, and Jun Rao built Kafka to solve one problem: decouple producers from consumers with a durable, replayable log that many applications could read independently. LinkedIn open-sourced it in early 2011; it entered the Apache Incubator later that year and graduated to a top-level Apache project in 2012.

The name comes from the author Franz Kafka. Kreps has said he picked it because Kafka is a system optimized for writing and he liked the author’s work. Today Kafka is an event backbone at LinkedIn, Uber, Netflix, and many banks and retailers.

The core insight: Kafka is a distributed commit log, not a message queue

Traditional message brokers delete messages after delivery. Kafka retains them in an append-only log. Think of a newspaper printing press: editions roll off the press in order; subscribers pick up today’s paper (or yesterday’s, if they want a replay). The press doesn’t track who read page 3—it just keeps printing.

Append-only — producers write to the end of the log; no random updates or deletes (except compaction policies).
Ordered — ordering is guaranteed within a partition, not globally across a topic.
Immutable history — consumers choose where to read (offset); the broker doesn’t “pop” messages off a queue.
Replayable — reset offsets and reprocess the same events for new services, bug fixes, or analytics backfills.
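
The four properties above can be sketched with a toy in-memory log. This is only a minimal illustration, not how a broker stores data: the PartitionLog class, the event names, and the "analytics" and "audit" readers are all hypothetical.

```python
class PartitionLog:
    """Toy in-memory stand-in for a single partition: an append-only list."""

    def __init__(self):
        self._records = []

    def append(self, value):
        """Append to the end of the log and return the record's offset."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records starting at offset; nothing is removed."""
        return self._records[offset:offset + max_records]


log = PartitionLog()
for event in ["order-created", "order-paid", "order-shipped"]:
    log.append(event)

# Two independent readers keep their own position in the same log.
positions = {"analytics": 0, "audit": 0}
analytics_batch = log.read(positions["analytics"])
positions["analytics"] += len(analytics_batch)

# Replay: "analytics" rewinds to offset 0 and sees the same events again.
positions["analytics"] = 0
replayed = log.read(positions["analytics"])
```

Reading never mutates the log, so the "audit" reader is unaffected by anything "analytics" does, including the rewind.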

🔬 Under the Hood

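Each partition is a directory of segment files (.log, .index, .timeindex) on broker disk. An offset is a sequential record number that the partition leader assigns on append, not a byte position; the .index file maps offsets to byte positions in the .log file, and .timeindex maps timestamps to offsets.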
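
To make the role of the .index file concrete, here is a hedged sketch of a sparse-index lookup. It assumes the index holds sorted (offset, byte position) pairs for only some records; the INDEX values and the locate helper are invented for illustration, and real index files are binary.

```python
import bisect

# Hypothetical sparse index for one segment: (offset, byte_position) pairs,
# sorted by offset. This models only the lookup, not the on-disk format.
INDEX = [(0, 0), (40, 4096), (85, 8192), (130, 12288)]


def locate(index, target_offset):
    """Return the byte position to start scanning from for target_offset.

    Finds the last index entry whose offset is <= target_offset; a reader
    would then scan forward in the .log file from that position.
    """
    offsets = [entry[0] for entry in index]
    i = bisect.bisect_right(offsets, target_offset) - 1
    if i < 0:
        raise KeyError(f"offset {target_offset} is before this segment")
    return index[i][1]
```

For example, locate(INDEX, 41) lands on the entry for offset 40, so the scan for record 41 would begin at byte 4096 rather than at the start of the segment.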

🎯 Interview Tip

When asked “Kafka vs RabbitMQ,” lead with retention and replay, not throughput alone. Queues optimize for task distribution; logs optimize for event history and fan-out.

Why Kafka beats traditional message brokers at scale

Kafka’s design trades broker-side complexity for sequential I/O, zero-copy networking, and consumer-driven flow control. These aren’t marketing bullets—they explain why a single broker can sustain hundreds of MB/s per disk.

Sequential disk I/O

Append-only writes and reads hit sequential bandwidth on HDD and SSD. Random-access queues thrash disks; Kafka batches into large sequential segments.
Zero-copy transfer

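sendfile() moves data from page cache to the NIC without copying through userspace, which matters for multi-GB/s egress on commodity hardware. This fast path applies to plaintext connections; with TLS the broker must encrypt data in user space first.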
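
As a rough illustration of the zero-copy idea, the sketch below uses Python's socket.sendfile(), which delegates to the operating system's sendfile call where it is available. Kafka's brokers are Java and reach the same system call through FileChannel.transferTo; the temp file, payload, and helper name here are made up for the demo.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path, sock):
    """Send a whole file over a connected socket and return bytes sent.

    socket.sendfile() uses os.sendfile() where the platform supports it and
    falls back to a plain read/send loop otherwise.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)


# Demo over a local socket pair; the temp file stands in for a log segment.
payload = b"record-bytes " * 256
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    segment_path = tmp.name

sender, receiver = socket.socketpair()
try:
    sent = send_file_zero_copy(segment_path, sender)
    sender.close()
    received = b""
    while chunk := receiver.recv(65536):
        received += chunk
finally:
    receiver.close()
    os.unlink(segment_path)
```

The application code never reads the file contents into its own buffer on the fast path; it only hands the kernel a file and a socket.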
Consumers pull, brokers don’t push

Consumers fetch at their own pace with poll(). Slow consumers don’t back-pressure the broker into dropping or spilling messages—they just lag.
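
Lag is simple arithmetic on offsets. A minimal sketch, assuming you already have the log end offset and the group's committed offset for each partition; the consumer_lag helper and the snapshot numbers are hypothetical.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: records appended but not yet committed by the group.

    Both arguments map partition number -> offset. A partition with no
    committed offset is treated as starting from 0 (an assumption here).
    """
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


# Hypothetical snapshot for a three-partition topic.
log_end = {0: 1500, 1: 1480, 2: 1510}
committed = {0: 1500, 1: 1200}
lag = consumer_lag(log_end, committed)
```

Here partition 0 is caught up, partition 1 is 280 records behind, and partition 2 has never been committed by this group.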
Retention vs delete-on-consume

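Messages live for as long as the topic’s retention allows: seven days by default, indefinitely if retention limits are disabled, or as the latest value per key on compacted topics. New consumer groups start from the earliest or latest offset without rewiring producers.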
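
Compaction can be pictured as "keep the newest record per key". The sketch below is a simplified model under stated assumptions: records are (offset, key, value) tuples, a value of None stands for a tombstone, and tombstoned keys are dropped immediately, whereas a real broker keeps tombstones for a configurable period (delete.retention.ms) before removing them.

```python
def compact(records):
    """Keep only the newest record per key, preserving offsets and log order.

    records is a list of (offset, key, value) in offset order; value None
    marks a tombstone. Surviving records keep their original offsets.
    """
    latest = {}
    for offset, key, value in records:
        latest[key] = (offset, key, value)
    survivors = [r for r in latest.values() if r[2] is not None]
    return sorted(survivors)


# Hypothetical user-profile topic keyed by user name.
before = [
    (0, "alice", "v1"),
    (1, "bob", "v1"),
    (2, "alice", "v2"),
    (3, "bob", None),  # tombstone: delete bob
    (4, "carol", "v1"),
]
after = compact(before)
```

Offsets in the compacted log are no longer contiguous, which is expected: consumers simply skip the gaps.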
Partitions as highway lanes

Parallelism scales with partition count. Each lane (partition) preserves order; more lanes = more throughput and more consumer instances in a group.
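
Key-based routing boils down to hashing the key and taking it modulo the partition count. The sketch below uses zlib.crc32 purely as a stand-in hash; the Java client's default partitioner uses murmur2, so real partition numbers will differ. The choose_partition helper and the customer keys are hypothetical.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition: hash(key) mod partition count.

    zlib.crc32 is a stand-in hash for illustration only.
    """
    return zlib.crc32(key) % num_partitions


# The same key always lands in the same partition, so its events stay ordered.
p_first = choose_partition(b"customer-42", 6)
p_again = choose_partition(b"customer-42", 6)

# Changing the partition count can move a key to a different partition.
moved = [
    k for k in (f"customer-{i}".encode() for i in range(100))
    if choose_partition(k, 6) != choose_partition(k, 8)
]
```

The moved list shows why adding partitions to a keyed topic is not free: some keys start landing in a different lane than their earlier events.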

Dimension	Traditional queue (RabbitMQ, SQS)	Kafka commit log
Message lifecycle	Deleted after ack	Retained by time/size/policy
Consumer model	Broker pushes (RabbitMQ) or consumers poll (SQS); consumers compete for messages	Consumers pull by offset; each group reads independently
Replay	Not supported on classic queues (dead-lettering only)	Reset offsets, re-read history
Fan-out	Exchange + multiple queues	Multiple consumer groups, one topic
Ordering	Per queue (SQS: FIFO queues only)	Per partition (key-based routing)

⚖️ Trade-off

Kafka optimizes for throughput and durability of history, not sub-millisecond task queues or complex routing DSLs. Use a queue for work distribution; use Kafka for event streams you may need to re-read.

Kafka use cases map

One log, many consumers—each use case exploits retention, ordering per key, or replay differently.

Stream 1

Event streaming

Real-time pipelines: clicks, orders, sensor readings flowing between microservices.
Stream 2

Change data capture

Debezium reads DB transaction logs → Kafka topics mirror row-level changes.
Stream 3

Log aggregation

App and server logs centralized; Elasticsearch or S3 sinks consume the same topic.
Stream 4

Activity tracking

LinkedIn’s original use case—profile views, feed impressions, ad events at billions/day.
Stream 5

Metrics pipeline

High-cardinality telemetry buffered and routed to time-series stores.
Stream 6

Event sourcing

Domain events as source of truth; projections rebuild read models from the log.
Stream 7

Stream processing

Kafka Streams / Flink / ksqlDB derive aggregates, joins, and alerts in flight.
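
The event sourcing case above is essentially a fold over the log. A minimal sketch, assuming events are dicts with hypothetical "account", "type", and "amount" fields and only two made-up event types, "deposited" and "withdrawn":

```python
def project_balances(events):
    """Fold an ordered event stream into a read model: account -> balance."""
    balances = {}
    for event in events:
        account, kind, amount = event["account"], event["type"], event["amount"]
        delta = amount if kind == "deposited" else -amount
        balances[account] = balances.get(account, 0) + delta
    return balances


# Hypothetical events as they might be read back from offset 0.
events = [
    {"account": "acc-1", "type": "deposited", "amount": 100},
    {"account": "acc-2", "type": "deposited", "amount": 50},
    {"account": "acc-1", "type": "withdrawn", "amount": 30},
]
balances = project_balances(events)
```

Because the log is retained, the same function can rebuild the read model from scratch after a bug fix or for a brand-new service.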

📦 Real World

Uber routes trillions of messages/day through Kafka for dispatch, pricing, and fraud. Netflix uses Kafka for studio workflow events and real-time recommendations. Confluent (founded by Kafka’s creators) runs managed Kafka for thousands of enterprises.

Kafka ecosystem overview

Apache Kafka is the core broker and client APIs. Confluent and the community built the surrounding platform, so know what ships in open source and what requires Confluent Platform or Confluent Cloud.

Core platform

Apache Kafka — brokers, producers, consumers, Connect API, Streams API
KRaft mode — built-in Raft metadata quorum (the only mode since Kafka 4.0, which removed ZooKeeper)
Java clients — official producer/consumer/admin APIs

Stream processing

Kafka Streams — JVM library, embedded in your app, RocksDB state stores
ksqlDB — SQL interface over streams (Confluent)
Apache Flink — external cluster processing with Kafka connectors

Integration & schema

Kafka Connect — framework for source/sink connectors (JDBC, S3, Debezium)
Schema Registry — Avro/Protobuf/JSON Schema with compatibility rules (Confluent)
Debezium — CDC connectors for PostgreSQL, MySQL, MongoDB, Oracle

Multi-cluster & ops

MirrorMaker 2 — cross-cluster replication (Kafka Connect based)
Cruise Control — automated partition rebalancing (LinkedIn)
Burrow — consumer lag monitoring with SLA evaluation

Confluent vs Apache

Apache Kafka — broker, clients, Connect, Streams (Apache 2.0)
Confluent Platform — adds Schema Registry, ksqlDB, Replicator, Control Center
Confluent Cloud — fully managed Kafka service with serverless-style scaling

Spring integration

Spring Kafka — @KafkaListener, KafkaTemplate, error handlers
Spring Cloud Stream — binder abstraction over Kafka (and others)
spring-kafka-test — EmbeddedKafka, Testcontainers patterns

💡 Pro Tip

You can run production Kafka with only Apache artifacts. Add Confluent Schema Registry when multiple teams share topics and you need enforced schema evolution—not on day one of a greenfield prototype.

Kafka timeline

From LinkedIn internal project to the default event backbone of the cloud-native era—and the multi-year migration away from ZooKeeper.

2011
LinkedIn open-sources Kafka

First public release. Solves the “every consumer needs a custom pipeline” problem with a shared commit log.
2012
Apache top-level project

Kafka graduates from the Apache Incubator, which it entered in 2011. ZooKeeper manages cluster metadata and controller election.
2016
Kafka Streams

Stream processing as a library, first shipped in Kafka 0.10: no separate cluster, changelog-backed state stores.
2017
Exactly-once semantics

Idempotent producer + transactions + read_committed consumers ship in Kafka 0.11.
2019
KIP-500 proposed

KIP-500 proposes replacing ZooKeeper with a built-in Raft quorum for metadata (KRaft). Fully managed offerings such as Confluent Cloud grow in parallel.
2021
KRaft preview

Kafka 2.8 ships KRaft as early access, not yet intended for production. The internal __cluster_metadata log stores broker/topic state.
2022
KRaft GA

Kafka 3.3 marks KRaft production-ready for new clusters; faster failover, simpler ops, no external ZK ensemble to babysit.
2025
Kafka 4.0 — ZooKeeper removed

Kafka 4.0 (March 2025) drops ZK support. Clusters run KRaft only; existing ZooKeeper-based estates must migrate to KRaft on a 3.x release before upgrading.

Quick architecture: Producers → Topics → Consumers

The mental model every chapter builds on. Records land in partition logs on brokers; consumer groups track their own offsets independently.

Producers

Serialize → partition (key hash) → batch → send to leader broker

Topic / Partitions

Append-only logs on brokers; RF=3 replicas; ISR tracks in-sync followers

Consumer groups

poll() → fetch → process → commit offset to __consumer_offsets
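
The poll, process, commit order matters for delivery guarantees. Below is a small simulation, not client code: a Python list stands in for one partition, and the run_consumer helper and its crash_after knob are hypothetical. It shows that committing after processing gives at-least-once behaviour when a consumer fails between the two steps.

```python
def run_consumer(log, committed_offset, process, crash_after=None):
    """Simulate poll -> process -> commit over a list standing in for a partition.

    Returns the last committed offset. crash_after (hypothetical knob) makes
    the loop stop after processing that many records, before committing the
    last one.
    """
    position = committed_offset
    processed = 0
    while position < len(log):
        record = log[position]          # poll / fetch
        process(record)                 # process
        processed += 1
        if crash_after is not None and processed >= crash_after:
            return committed_offset     # crashed before the commit
        position += 1
        committed_offset = position     # commit the next offset to read
    return committed_offset


partition = ["a", "b", "c"]
seen = []
after_crash = run_consumer(partition, 0, seen.append, crash_after=2)
after_restart = run_consumer(partition, after_crash, seen.append)
```

Record "b" is processed twice: once before the simulated crash and again after the restart, because its offset was never committed. Downstream processing therefore needs to tolerate duplicates.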

flowchart LR
  P1[Producer A]
  P2[Producer B]
  B1[Broker 1\nLeader P0]
  B2[Broker 2\nLeader P1]
  B3[Broker 3\nLeader P2]
  CG1[Consumer Group\nAnalytics]
  CG2[Consumer Group\nAudit]
  P1 -->|orders-0| B1
  P2 -->|orders-1| B2
  P2 -->|orders-2| B3
  B1 --> CG1
  B2 --> CG1
  B3 --> CG1
  B1 --> CG2
  B2 --> CG2
  B3 --> CG2

⚠️ Pitfall

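More consumers than partitions means idle consumers. Partitions are the unit of parallelism and ordering, so plan partition count up front: you can add partitions later, but that changes which partition a key maps to, and you cannot reduce the count without recreating the topic.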
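
The idle-consumer effect follows directly from each partition having exactly one owner within a group. The sketch below uses a simple round-robin spread as a stand-in; it is not any specific Kafka assignment strategy, and the assign_partitions helper and consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers round-robin; one owner per partition.

    A simplified stand-in for a group assignor, for illustration only.
    """
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


# Three partitions, four consumers: one consumer gets nothing to read.
oversubscribed = assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4"])
idle = [c for c, owned in oversubscribed.items() if not owned]

# Six partitions, two consumers: each consumer owns three partitions.
balanced = assign_partitions([0, 1, 2, 3, 4, 5], ["c1", "c2"])
```

Adding a fifth or sixth consumer to the first group would only lengthen the idle list; throughput for the group is capped by the partition count.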

Explore the guide — all sections

Twelve deep-dive chapters plus cheat sheets. Recommended path: Architecture → Producers → Consumers → Reliability, then Streams, Connect, and operations as your role requires.

Learning path: Architecture · Producers · Consumers · Reliability · Patterns

Developer

Producers, consumers, consumer groups, offset management, serialization, Spring Kafka, error handling, and testing.

Senior / Data Engineer

Kafka Streams, windowing, schema evolution, exactly-once, CDC pipelines, and multi-cluster patterns.

Platform / Infra

Broker tuning, topic admin, monitoring, capacity planning, security, DR, and KRaft operations.

Core Architecture & Internals

Producer Internals

Consumers & Consumer Groups

Delivery Guarantees

Kafka Streams

Kafka Connect & CDC

Schema Registry

Operations & Administration

Architecture Patterns

Spring Kafka Integration

Performance Tuning

Cheat Sheets