Kafka Core

Learn Apache Kafka from commit log to production clusters

Internals-first guides for developers, data engineers, and platform teams—what happens on disk, on the wire, and in the JVM when you ship event-driven systems at scale.

What is Kafka? The LinkedIn origin story

In 2010, LinkedIn’s data pipeline was a tangle of point-to-point integrations: every new consumer meant wiring another bespoke feed from every producer. Jay Kreps, Neha Narkhede, and Jun Rao built Kafka to decouple producers from consumers with a durable, replayable log that many applications could read independently. LinkedIn open-sourced it in early 2011; it entered the Apache Incubator later that year and graduated to a top-level Apache project in 2012.

The name comes from Franz Kafka, the author. Kreps has explained that because Kafka is a system optimized for writing, naming it after a writer made sense, and that he liked the author’s work. Today Kafka is a common event backbone at LinkedIn, Uber, Netflix, and many banks and retailers.

The core insight: Kafka is a distributed commit log, not a message queue

Traditional message brokers delete messages after delivery. Kafka retains them in an append-only log. Think of a newspaper printing press: editions roll off the press in order; subscribers pick up today’s paper (or yesterday’s, if they want a replay). The press doesn’t track who read page 3—it just keeps printing.
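
The printing-press analogy can be sketched in a few lines. This is a toy in-memory model, not Kafka's storage engine: `PartitionLog`, the reader names, and the sample values are all made up for illustration. It shows only that reads never delete anything and that each reader owns its position.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Toy in-memory stand-in for one partition's append-only log."""
    records: list = field(default_factory=list)

    def append(self, value):
        """Append a record and return the offset assigned to it."""
        self.records.append(value)
        return len(self.records) - 1

    def read(self, offset, max_records=10):
        """Return records starting at `offset`; reading never deletes anything."""
        return self.records[offset:offset + max_records]


log = PartitionLog()
for edition in ["mon", "tue", "wed"]:
    log.append(edition)

# Each reader keeps its own position; the log does not track them.
positions = {"analytics": 0, "audit": 0}

analytics_batch = log.read(positions["analytics"])
positions["analytics"] += len(analytics_batch)

# "audit" reads later and still sees everything, because nothing was removed.
audit_batch = log.read(positions["audit"])
positions["audit"] += len(audit_batch)

# Replay: rewind a reader to an earlier offset and read again.
positions["analytics"] = 1
replayed = log.read(positions["analytics"])
```

A queue would have handed each record to one reader and dropped it; here both readers see the full history, and rewinding a position is all a replay takes.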

🔬 Under the Hood

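Each partition is a directory of segment files (.log, .index, .timeindex) on broker disk. An offset is a sequential record number that the partition leader assigns on append, not a byte position; the .index file maps offsets to byte positions in the .log file so a fetch can seek close to the record it needs.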
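
As a rough illustration of what the .index file is for: it lets the broker turn a record offset into a byte position in the .log file without scanning the segment from the start. The sketch below assumes a sparse index with made-up entries held in a Python list; real index files are binary and their entry spacing depends on broker configuration.

```python
import bisect

# Hypothetical sparse index for one segment: (offset, byte_position) pairs,
# sorted by offset. Only some records get an entry.
INDEX = [(0, 0), (40, 4096), (85, 8192), (130, 12288)]
INDEX_OFFSETS = [entry[0] for entry in INDEX]


def scan_start(target_offset):
    """Return the byte position to start scanning from for `target_offset`.

    Picks the last index entry whose offset is <= target_offset; a reader
    would then scan forward in the .log file until it reaches the target.
    """
    i = bisect.bisect_right(INDEX_OFFSETS, target_offset) - 1
    if i < 0:
        raise ValueError("offset precedes this segment")
    return INDEX[i][1]
```

Looking up offset 84 lands on the entry for offset 40, so the scan starts at byte 4096 rather than at the beginning of the file.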

🎯 Interview Tip

When asked “Kafka vs RabbitMQ,” lead with retention and replay, not throughput alone. Queues optimize for task distribution; logs optimize for event history and fan-out.

Why Kafka beats traditional message brokers at scale

Kafka’s design trades broker-side complexity for sequential I/O, zero-copy networking, and consumer-driven flow control. These aren’t marketing bullets—they explain why a single broker can sustain hundreds of MB/s per disk.

| Dimension | Traditional queue (RabbitMQ, SQS) | Kafka commit log |
| --- | --- | --- |
| Message lifecycle | Deleted after ack | Retained by time/size/policy |
| Consumer model | Broker pushes / competes for messages | Consumer pulls via offset |
| Replay | Not supported (dead letter only) | Reset offsets, re-read history |
| Fan-out | Exchange + multiple queues | Multiple consumer groups, one topic |
| Ordering | Per queue | Per partition (key-based routing) |

⚖️ Trade-off

Kafka optimizes for throughput and durability of history, not sub-millisecond task queues or complex routing DSLs. Use a queue for work distribution; use Kafka for event streams you may need to re-read.
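
The "retained by time/size/policy" row in the comparison above can be made concrete with a small sketch of time-based retention. It assumes retention works on whole closed segments, judged by the newest record timestamp in each; the `Segment` class and the numbers are hypothetical, and size-based retention and compaction are left out.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int
    max_timestamp_ms: int  # newest record timestamp in the segment
    active: bool = False   # the segment currently being appended to


def expired_segments(segments, now_ms, retention_ms):
    """Return closed segments whose newest record is older than the retention window."""
    return [
        s for s in segments
        if not s.active and now_ms - s.max_timestamp_ms > retention_ms
    ]


DAY_MS = 24 * 60 * 60 * 1000
segments = [
    Segment(base_offset=0, max_timestamp_ms=0),
    Segment(base_offset=1000, max_timestamp_ms=5 * DAY_MS),
    Segment(base_offset=2000, max_timestamp_ms=9 * DAY_MS, active=True),
]
to_delete = expired_segments(segments, now_ms=10 * DAY_MS, retention_ms=7 * DAY_MS)
```

Nothing in the check asks whether any consumer has read the data: age decides, not acknowledgement.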

Kafka use cases map

One log, many consumers—each use case exploits retention, ordering per key, or replay differently.

  1. Event streaming

    Real-time pipelines: clicks, orders, sensor readings flowing between microservices.

  2. Change data capture

    Debezium reads DB transaction logs → Kafka topics mirror row-level changes.

  3. Log aggregation

    App and server logs centralized; Elasticsearch or S3 sinks consume the same topic.

  4. Activity tracking

    LinkedIn’s original use case: profile views, feed impressions, ad events at billions/day.

  5. Metrics pipeline

    High-cardinality telemetry buffered and routed to time-series stores.

  6. Event sourcing

    Domain events as source of truth; projections rebuild read models from the log.

  7. Stream processing

    Kafka Streams / Flink / ksqlDB derive aggregates, joins, and alerts in flight.
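
To make the event-sourcing entry above concrete, here is a minimal sketch of rebuilding a read model by replaying events in order. The event names, the `apply` and `rebuild` helpers, and the account example are hypothetical; a real projection would read these records from a topic rather than from a list.

```python
# Hypothetical domain events for one account, in log order.
EVENTS = [
    {"type": "AccountOpened", "account": "a1"},
    {"type": "Deposited", "account": "a1", "amount": 100},
    {"type": "Withdrew", "account": "a1", "amount": 30},
    {"type": "Deposited", "account": "a1", "amount": 5},
]


def apply(balances, event):
    """Fold one event into the read model (account -> balance)."""
    account = event["account"]
    if event["type"] == "AccountOpened":
        balances[account] = 0
    elif event["type"] == "Deposited":
        balances[account] += event["amount"]
    elif event["type"] == "Withdrew":
        balances[account] -= event["amount"]
    return balances


def rebuild(events):
    """Rebuild the projection from scratch by replaying every event in order."""
    balances = {}
    for event in events:
        apply(balances, event)
    return balances
```

Because the log keeps the history, a corrupted or redesigned read model can be discarded and rebuilt by running `rebuild` over the same events again.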

📦 Real World

Uber routes trillions of messages/day through Kafka for dispatch, pricing, and fraud. Netflix uses Kafka for studio workflow events and real-time recommendations. Confluent (founded by Kafka’s creators) runs managed Kafka for thousands of enterprises.

Kafka ecosystem overview

Apache Kafka is the core broker and client APIs. Confluent and the community built the surrounding platform, so know what ships in open source and what requires Confluent Platform or Cloud.

Core platform

  • Apache Kafka — brokers, producers, consumers, Connect API, Streams API
  • KRaft mode — built-in Raft metadata quorum (default Kafka 4.0, ZooKeeper removed)
  • Java clients — official producer/consumer/admin APIs

Stream processing

  • Kafka Streams — JVM library, embedded in your app, RocksDB state stores
  • ksqlDB — SQL interface over streams (Confluent)
  • Apache Flink — external cluster processing with Kafka connectors

Integration & schema

  • Kafka Connect — framework for source/sink connectors (JDBC, S3, Debezium)
  • Schema Registry — Avro/Protobuf/JSON Schema with compatibility rules (Confluent)
  • Debezium — CDC connectors for PostgreSQL, MySQL, MongoDB, Oracle

Multi-cluster & ops

  • MirrorMaker 2 — cross-cluster replication (Kafka Connect based)
  • Cruise Control — automated partition rebalancing (LinkedIn)
  • Burrow — consumer lag monitoring with SLA evaluation

Confluent vs Apache

  • Apache Kafka — broker, clients, Connect, Streams (Apache 2.0)
  • Confluent Platform — adds Schema Registry, ksqlDB, Replicator, Control Center
  • Confluent Cloud — fully managed Kafka (2019), serverless scaling

Spring integration

  • Spring Kafka — @KafkaListener, KafkaTemplate, error handlers
  • Spring Cloud Stream — binder abstraction over Kafka (and others)
  • spring-kafka-test — EmbeddedKafka, Testcontainers patterns

💡 Pro Tip

You can run production Kafka with only Apache artifacts. Add Confluent Schema Registry when multiple teams share topics and you need enforced schema evolution—not on day one of a greenfield prototype.

Kafka timeline (through Kafka 4.0, KRaft)

From LinkedIn internal project to the default event backbone of the cloud-native era—and the multi-year migration away from ZooKeeper.

  1. 2011

    LinkedIn open-sources Kafka

    First public release. Solves the “every consumer needs a custom pipeline” problem with a shared commit log.

  2. 2012

    Apache top-level project

    Kafka graduates from the Apache Incubator, which it entered in 2011; ZooKeeper manages cluster metadata, controller election, and ACLs.

  3. 2016

    Kafka Streams

    Ships in Kafka 0.10. Stream processing as a library: no separate cluster, changelog-backed state stores.

  4. 2017

    Exactly-once semantics

    Idempotent producer + transactions + read_committed consumers ship in Kafka 0.11.

  5. 2019

    Confluent Cloud GA

    Fully managed Kafka; KIP-500 KRaft mode development accelerates—Raft replaces ZooKeeper for metadata.

  6. 2021

    KRaft preview

    Kafka 2.8+ supports KRaft in preview. __cluster_metadata topic stores broker/topic state.

  7. 2022

    KRaft production-ready

    Kafka 3.3 marks KRaft production-ready for new clusters; faster failover, simpler ops, no external ZK ensemble to babysit.

  8. 2025

    Kafka 4.0 — ZooKeeper removed

    ZK support dropped. New clusters run KRaft only. Existing ZooKeeper-based clusters must migrate to KRaft on a 3.x bridge release before upgrading.
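
The idempotent producer named in the 2017 entry rests on a simple idea: the broker remembers the last sequence number it appended for each producer and ignores a retried batch it has already written. The sketch below is a toy model of that idea only. `PartitionWithDedup` is hypothetical, tracks one record per sequence number, and leaves out producer epochs and transactions.

```python
class PartitionWithDedup:
    """Toy partition that drops retried writes, keyed by producer id and sequence."""

    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> highest sequence number appended

    def append(self, producer_id, seq, value):
        """Append unless (producer_id, seq) was already written; return True if appended."""
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return False  # duplicate from a retry: acknowledge, do not append
        if seq != last + 1:
            raise ValueError("out-of-order sequence")  # a gap means a lost write
        self.log.append(value)
        self.last_seq[producer_id] = seq
        return True


p = PartitionWithDedup()
p.append(producer_id=7, seq=0, value="order-created")
p.append(producer_id=7, seq=1, value="order-paid")
retry = p.append(producer_id=7, seq=1, value="order-paid")  # network retry of the same write
```

The retry is acknowledged but not written a second time, so a producer can resend after a timeout without creating duplicates in the log.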

Quick architecture: Producers → Topics → Consumers

The mental model every chapter builds on. Records land in partition logs on brokers; consumer groups track their own offsets independently.

Producers

Serialize → partition (key hash) → batch → send to leader broker
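
The partition and batch steps can be sketched as follows. This is an illustration under stated assumptions: `zlib.crc32` stands in for the murmur2 hash that Kafka's Java client applies to key bytes, so real partition numbers will differ, and `choose_partition` and `batch_by_partition` are made-up helper names.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition with a stable hash.

    zlib.crc32 is a stand-in; Kafka's Java client uses murmur2 on the key bytes.
    """
    return zlib.crc32(key) % num_partitions


def batch_by_partition(records, num_partitions):
    """Group (key, value) records into per-partition batches, preserving send order."""
    batches = {}
    for key, value in records:
        batches.setdefault(choose_partition(key, num_partitions), []).append(value)
    return batches


records = [
    (b"order-42", "created"),
    (b"order-7", "created"),
    (b"order-42", "paid"),
    (b"order-42", "shipped"),
]
batches = batch_by_partition(records, num_partitions=3)
```

Every record keyed `order-42` lands in the same batch in send order, which is what gives per-key ordering.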

Topic / Partitions

Append-only logs on brokers; RF=3 replicas; ISR tracks in-sync followers
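
Two consequences of the ISR are worth seeing in code. The high watermark, the point up to which consumers may read, is the lowest log end offset among in-sync replicas. And a producer using acks=all is accepted only while the ISR is at least `min.insync.replicas` in size. The function names and broker numbers below are hypothetical; this is a sketch of the rules, not broker code.

```python
def high_watermark(log_end_offsets, isr):
    """Offset boundary replicated to every in-sync replica.

    log_end_offsets: broker id -> next offset that replica would write.
    isr: set of broker ids currently in sync (leader included).
    """
    return min(log_end_offsets[b] for b in isr)


def accepts_acks_all(isr, min_insync_replicas):
    """With acks=all, a write is accepted only if enough replicas are in sync."""
    return len(isr) >= min_insync_replicas


leo = {1: 120, 2: 118, 3: 90}  # broker 3 has fallen behind
isr = {1, 2}                   # so it has been dropped from the ISR
hw = high_watermark(leo, isr)
```

With broker 3 out of the ISR, the high watermark follows broker 2 rather than waiting for the lagging replica.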

Consumer groups

poll() → fetch → process → commit offset to __consumer_offsets
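
The order of "process" and "commit" decides the delivery guarantee. The sketch below simulates a consumer that commits after processing and crashes between the two steps. `run_consumer` and its `crash_after` parameter are hypothetical, and the list stands in for a partition; no Kafka client is involved.

```python
def run_consumer(log, committed, process, crash_after=None):
    """Process records from the committed offset, committing after each one.

    `crash_after` simulates a crash right after processing the record at that
    offset but before its commit. Returns the committed offset.
    """
    offset = committed
    while offset < len(log):
        process(log[offset])
        if crash_after is not None and offset == crash_after:
            return committed      # crash: the commit for this record never happened
        offset += 1
        committed = offset        # a committed offset means "next offset to read"
    return committed


log = ["e0", "e1", "e2"]
seen = []

committed = run_consumer(log, 0, seen.append, crash_after=1)  # dies after processing e1
committed = run_consumer(log, committed, seen.append)         # restart from last commit
```

After the restart `e1` has been processed twice: committing after processing gives at-least-once delivery, so handlers should be idempotent.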

flowchart LR
  P1[Producer A]
  P2[Producer B]
  B1[Broker 1\nLeader P0]
  B2[Broker 2\nLeader P1]
  B3[Broker 3\nLeader P2]
  CG1[Consumer Group\nAnalytics]
  CG2[Consumer Group\nAudit]
  P1 -->|orders-0| B1
  P2 -->|orders-1| B2
  P2 -->|orders-2| B3
  B1 --> CG1
  B2 --> CG1
  B3 --> CG1
  B1 --> CG2
  B2 --> CG2
  B3 --> CG2

⚠️ Pitfall

More consumers in a group than partitions means idle consumers. Partitions are the unit of parallelism and ordering, so plan the partition count up front: you cannot reduce it later, and increasing it changes which partition a given key maps to.
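
A quick way to see the idle-consumer effect is a simplified assignment. The round-robin `assign` function below is a made-up stand-in; Kafka's built-in assignors (range, round-robin, sticky) differ in detail but share the rule that each partition has exactly one owner per group.

```python
def assign(partitions, consumers):
    """Round-robin partitions over consumers; each partition gets exactly one owner."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


partitions = ["orders-0", "orders-1", "orders-2"]
assignment = assign(partitions, ["c1", "c2", "c3", "c4", "c5"])
idle = [c for c, owned in assignment.items() if not owned]
```

With three partitions and five consumers, two consumers own nothing and only serve as standbys.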

Explore the guide — all sections

Twelve deep-dive chapters plus cheat sheets. Recommended path: Architecture → Producers → Consumers → Reliability, then Streams, Connect, and operations as your role requires.

Learning path: Architecture · Producers · Consumers · Reliability · Patterns

Developer

Producers, consumers, consumer groups, offset management, serialization, Spring Kafka, error handling, and testing.

Senior / Data Engineer

Kafka Streams, windowing, schema evolution, exactly-once, CDC pipelines, and multi-cluster patterns.

Platform / Infra

Broker tuning, topic admin, monitoring, capacity planning, security, DR, and KRaft operations.