Kafka Architecture Patterns

Event sourcing with Kafka

Event sourcing stores state as a sequence of immutable facts—not the latest row in a table. Kafka’s append-only, partitioned log is a natural event store when you accept replay semantics and operational trade-offs.

Kafka as event store

Append-only: new events are produced; history is never updated in place
Ordered per key: partition by aggregate ID (orderId) preserves causal order for that entity
Replayable: reset consumer offset or spin new projection group to rebuild from any point
Durable: RF + acks=all + retention policy defines how far back you can time-travel

flowchart LR
  CMD[Command handler]
  ES[(Event store topic\norders-events)]
  P1[Projection: SQL read model]
  P2[Projection: search index]
  P3[Projection: analytics]

  CMD -->|append event| ES
  ES --> P1
  ES --> P2
  ES --> P3

Projections

A projection is a consumer that folds events into a queryable read model—PostgreSQL table, Elasticsearch index, Redis cache. Each projection maintains its own consumer group and offset. Crash recovery = resume from last committed offset. Full rebuild = new group + read from earliest.

void apply(OrderEvent event) {
  switch (event.getType()) {
    case ORDER_PLACED -> orderViewRepo.insert(event.toOrderView());
    case ORDER_SHIPPED -> orderViewRepo.updateStatus(event.getOrderId(), SHIPPED);
    case ORDER_CANCELLED -> orderViewRepo.markCancelled(event.getOrderId());
  }
}

Snapshotting

Replaying millions of events per aggregate on cold start is slow. Snapshots persist periodic materialized state plus the offset/event sequence they represent. On startup: load snapshot → replay only events after snapshot offset.

Every N events (or time interval), write snapshot to object store or compacted topic
Snapshot record: { aggregateId, version, state, timestamp }
Handler loads snapshot at version V, applies events V+1…latest

Compacted topics as current state

cleanup.policy=compact retains the latest record per key—useful for “current aggregate state” alongside the full event log. Pattern: event topic (delete retention) for audit + compacted topic for fast key lookup of latest state. Compaction is asynchronous—do not treat it as synchronous read-your-writes.

Approach	Kafka event log	Dedicated store (EventStoreDB)
Optimistic concurrency	App-level version in event payload	Built-in expected version
Stream processing	Native Kafka Streams/Flink	Requires bridge
Ops surface	One platform if Kafka already central	Additional cluster to operate
Event semantics	At-least-once default; EOS with transactions	Purpose-built stream semantics

⚖️ Trade-off

Kafka excels as event store when the org already standardizes on it and projections run in Streams/Flink. EventStoreDB wins when you need first-class aggregate streams, metadata, and subscription APIs without building them on Connect/Streams.

CQRS + Kafka

Command Query Responsibility Segregation splits the write path (commands, invariants, events) from the read path (optimized queries). Kafka carries the immutable event stream between them.

flowchart TB
  API[HTTP / gRPC API]
  CH[Command handler\nwrite model + DB]
  ET[(events topic)]
  RM1[Read model: Order list DB]
  RM2[Read model: Customer 360]
  QAPI[Query API]

  API -->|command| CH
  CH -->|transactional write + event| ET
  ET --> RM1
  ET --> RM2
  RM1 --> QAPI
  RM2 --> QAPI

Write side

API receives command (PlaceOrder, CancelOrder)
Command handler validates invariants against write model (often normalized OLTP schema)
On success: persist state + emit domain event to Kafka (ideally via outbox—see below)
Write DB optimized for consistency, not dashboard queries

Read side

Independent consumers subscribe to the same event stream and maintain denormalized read models—wide tables, materialized views, search documents. Query API reads only from read stores, never from the event log directly in user-facing paths.

Multiple read models from one stream

One orders-events topic feeds: operational SQL view, BI warehouse sink, customer notification service, fraud scoring. Each uses a different consumer group—see Fan-Out pattern. Schema evolution via Registry keeps all projections compatible.

Eventual consistency window

After a successful command, read models lag by milliseconds to seconds. UX must acknowledge this:

Read-your-writes: route query to write DB for same session, or poll until read model catches up (version token)
Optimistic UI: show pending state until projection confirms
Idempotent projections: replay-safe so lag spikes do not corrupt read models

💡 Pro Tip

Include eventId and aggregateVersion in every event. Query APIs can expose lastProcessedVersion so clients know when read model is fresh.

🎯 Interview Tip

“CQRS with Kafka?” — Commands hit write service + DB; events go to Kafka; N consumers build read models. Not mandatory to use Kafka on command ingress—only as the event backbone. Consistency is eventual unless you add synchronous read-your-writes hacks.

Outbox pattern

The dual-write problem: updating PostgreSQL and publishing to Kafka in one business operation without a distributed transaction. The outbox makes the database the single source of truth for “what must be published.”

Dual write problem

Naive approach fails in both directions:

DB commit succeeds → Kafka publish fails → downstream never learns of change
Kafka publish succeeds → DB rolls back → ghost events in the bus

sequenceDiagram
  participant S as Order service
  participant DB as PostgreSQL
  participant CDC as Debezium
  participant K as Kafka

  S->>DB: BEGIN
  S->>DB: INSERT orders ...
  S->>DB: INSERT outbox (event payload)
  S->>DB: COMMIT
  CDC->>DB: read WAL
  CDC->>K: publish domain event
  Note over S,K: Single atomic DB transaction — no direct Kafka write

Outbox table design

CREATE TABLE outbox (
  id            UUID PRIMARY KEY,
  aggregate_id  VARCHAR(64) NOT NULL,
  event_type    VARCHAR(128) NOT NULL,
  payload       JSONB NOT NULL,
  idempotency_key VARCHAR(128) NOT NULL UNIQUE,
  created_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Same transaction as business write:
INSERT INTO orders (...) VALUES (...);
INSERT INTO outbox (id, aggregate_id, event_type, payload, idempotency_key)
VALUES (gen_random_uuid(), 'ord-42', 'OrderPlaced', '{"orderId":"ord-42",...}', 'ord-42-OrderPlaced-v1');

Debezium CDC → Kafka

Debezium captures INSERT on outbox table from WAL—no polling, no application thread publishing after commit. Use Outbox Event Router SMT to route by event_type to domain topics and unwrap payload.

{
  "transforms": "outbox",
  "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
  "transforms.outbox.route.topic.replacement": "${routedByValue}",
  "transforms.outbox.table.field.event.key": "aggregate_id",
  "transforms.outbox.table.field.event.payload": "payload",
  "transforms.outbox.table.field.event.type": "event_type",
  "transforms.outbox.table.field.event.id": "id",
  "transforms.outbox.route.by.field": "event_type"
}

Idempotency key

Outbox delivery is at-least-once (CDC retry, Kafka retry). Downstream consumers dedupe on idempotency_key or eventId—unique constraint in application store or Redis TTL cache.

Delivery guarantee

Guaranteed: if business transaction commits, event will eventually appear on Kafka (barring catastrophic WAL loss). Not guaranteed: exactly-once end-to-end without idempotent consumers. Alternative relay: polling outbox table in same service (transactional outbox relay)—higher app complexity, no CDC lag.

📦 Real World

Prune published outbox rows via scheduled job or separate compaction process—CDC replays are not an excuse to let the table grow forever. Some teams move to published_at column and archive after 7 days.

Saga orchestration via Kafka

Long-running business transactions span services without 2PC. A saga orchestrator emits commands, listens for replies, and triggers compensating events on failure—often implemented as a Kafka Streams stateful topology.

sequenceDiagram
  participant O as Orchestrator\n(Kafka Streams)
  participant I as Inventory svc
  participant P as Payment svc
  participant S as Shipping svc

  O->>I: ReserveStock command
  I-->>O: StockReserved reply
  O->>P: ChargePayment command
  P-->>O: PaymentFailed reply
  O->>I: ReleaseStock compensate
  O->>S: CancelShipment compensate

Orchestrator as Kafka Streams topology

Stateful processor maintains saga instance per correlationId in a KeyValueStore: current step, completed steps, timeout deadline. On reply topic event, transition state machine; emit next command or compensation.

enum SagaStep { STARTED, STOCK_RESERVED, PAYMENT_DONE, SHIPPED, COMPENSATING, FAILED }

record SagaState(
    String correlationId,
    SagaStep step,
    Instant startedAt,
    Map<String, String> context
) {}

Command and reply topics

Topic	Direction	Content
`inventory-commands`	Orchestrator → Inventory	ReserveStock, ReleaseStock
`inventory-replies`	Inventory → Orchestrator	StockReserved, StockFailed
`payment-commands`	Orchestrator → Payment	ChargePayment, RefundPayment
`payment-replies`	Payment → Orchestrator	PaymentCaptured, PaymentFailed

Compensating transactions

Compensations are new events, not rollbacks—ReleaseStock, RefundPayment. Each must be idempotent (safe if orchestrator retries). Design compensations to be semantically valid even if forward step partially applied.

Correlation ID propagation

Every command, reply, and compensation carries the same correlationId (and often sagaId) in headers and payload. Services log and trace on this ID—OpenTelemetry baggage should mirror Kafka headers.

correlationId=550e8400-e29b-41d4-a716-446655440000
sagaId=order-saga-ord-42
replyTopic=saga-orchestrator-replies
messageType=StockReserved

Dead letter for stuck sagas

Timeout processor scans state store for sagas past deadline → emit to saga-timeouts or saga.DLT. Human workflow or automated remediation replays/compensates. Never silently drop intermediate states—operators need visibility.

⚠️ Pitfall

Choreography (no central orchestrator) reduces coupling but makes compensation graphs hard to reason about. Orchestration centralizes logic but is a single point of failure—run Streams app with standby tasks and monitor rocksDB state size.

Dead letter queue (DLQ) pattern

Poison messages must not block the entire partition forever. Route failures to a DLT with rich metadata, fix the bug, replay—while healthy records keep flowing.

When to use

Deserialization failures (schema mismatch after deploy)
Business rule violations that will never succeed on retry (invalid country code)
Downstream timeout exhausted—record needs human triage
Not for transient DB blips—use retry with backoff first

DLT naming convention

Standard: {original-topic}.DLT — e.g. orders.DLT. Spring Kafka default follows this. Some orgs add .DLQ or environment prefix prod.orders.DLT.

What to store in DLT

Preserve enough context to debug and replay without guessing:

Original record: key, value, headers (or base64 if binary)
Exception: class name, message, stack trace (truncated)
Source metadata: original topic, partition, offset, timestamp
Consumer context: group id, processing attempt count

{
  "originalTopic": "orders",
  "originalPartition": 3,
  "originalOffset": 918273,
  "originalTimestamp": 1717200000456,
  "consumerGroup": "order-processor",
  "errorClass": "com.acme.ValidationException",
  "errorMessage": "Unknown currency: XYZ",
  "stackTrace": "...",
  "payload": { "orderId": "ord-99", "currency": "XYZ" }
}

Reprocessing from DLT

Fix code or data contract
Deploy fixed consumer
Run one-off reprocessor consumer on orders.DLT → republish to orders or process inline with idempotency
Verify metrics: main topic lag stable, DLT count not growing

Spring Kafka DeadLetterPublishingRecoverer

@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<String, Order> template) {
  DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(
      template,
      (record, ex) -> new TopicPartition(record.topic() + ".DLT", record.partition())
  );
  recoverer.setHeadersFunction((record, ex) -> {
    RecordHeaders headers = new RecordHeaders();
    headers.add("dlt-original-offset", longToBytes(record.offset()));
    headers.add("dlt-exception-message", ex.getMessage().getBytes(StandardCharsets.UTF_8));
    return headers;
  });

  DefaultErrorHandler handler = new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3));
  handler.addNotRetryableExceptions(ValidationException.class);
  return handler;
}

⚙️ Config

DLT topics need their own retention and ACLs. Alert on orders.DLT message rate > 0 sustained—silent DLT growth is production debt.

Fan-out pattern

One event stream, many independent consumers—each consumer group tracks its own offset and processing speed. This is Kafka’s killer feature vs competing consumer destruction on traditional queues.

flowchart TB
  T[(orders topic)]
  G1[Group: order-processor\noperational DB]
  G2[Group: audit-archiver\nS3 immutable log]
  G3[Group: search-indexer\nElasticsearch]
  G4[Group: analytics-etl\nwarehouse]

  T --> G1
  T --> G2
  T --> G3
  T --> G4

Key property: adding a new consumer group does not steal messages from existing groups. Audit can be 24 hours behind while order processing stays real-time—no coupling.

Use cases on same Order events

Consumer group	Purpose	Lag tolerance
`order-processor`	Fulfillment write path	Seconds
`audit-archiver`	Compliance immutable store	Minutes–hours
`search-indexer`	Customer order search UI	Seconds–minutes
`analytics-etl`	BI warehouse facts	Hours

🔬 Under the Hood

Each group commits offsets to __consumer_offsets under unique (group, topic, partition) keys. Fan-out multiplies consumer connections and fetch traffic—size brokers for aggregate fan-out bytes-out, not single group.

Event-driven microservices

Services communicate through domain events on Kafka—not shared databases. Each service owns its data, publishes facts about its bounded context, and subscribes to others’ events for integration.

Topic ownership

One producing service per domain topic — Order Service owns orders.events
Many consumers — Inventory, Billing, Notification read-only subscribe
No foreign writes — consumers never produce to another team’s topic; they emit their own domain events
ACLs enforce: only User:order-service has Write on orders.events

flowchart LR
  OS[Order Service\nowns orders.events]
  IS[Inventory Service\nowns inventory.events]
  NS[Notification Service\nowns notifications.events]

  OS -->|OrderPlaced| IS
  OS -->|OrderPlaced| NS
  IS -->|StockReserved| OS

Schema Registry as contract

Avro subjects per topic (orders.events-value) with FULL_TRANSITIVE compatibility. Producer CI registers schema; consumers codegen from same artifact. Breaking changes blocked at publish time—not discovered in staging consumers.

Backward compatibility at publish time

Serializer calls registry compatibility check before write. Deploy order: register compatible schema → deploy consumers (new reader) → deploy producer (new writer). Event-driven coupling is schema coupling—govern it like an API gateway.

📦 Real World

Platform teams provide a “topic request” workflow: owning team, retention tier, schema subject, allowed consumers, PII classification. Prevents 400 orphan topics and ACL sprawl.

Kafka as database (anti-pattern analysis)

“We’ll skip PostgreSQL and just use Kafka” fails predictably. Kafka is a log, not a query engine— know what it can substitute for and what it cannot.

Capability	Kafka	RDBMS / NoSQL
Durable append log	Core strength	Transaction log internally, not exposed
Key-based compaction (latest per key)	Compacted topics	Primary key upsert
Replay / time-travel	Offset reset, new consumer groups	Binlog CDC only
Random access by key	O(n) scan or KTable materialization	Indexed lookup
Secondary indexes	Not native	B-tree, GIN, etc.
Complex queries (JOIN, aggregate ad hoc)	Streams/Flink jobs, not interactive SQL on broker	SQL / query DSL
ACID across arbitrary keys/topics	Transactions within Kafka boundary	Multi-row ACID

When to combine Kafka + DB

Kafka: event stream, integration bus, log of record for domain events. Database: queryable current state, indexes, reporting, user-facing reads. Sync via projections (CQRS) or CDC (outbox). This is the dominant production architecture—not either/or.

KTable + GlobalKTable for in-stream lookup

Kafka Streams KTable materializes changelog topic into key-value state—O(1) lookup inside a topology. GlobalKTable replicates entire compacted topic to every instance for join enrichment (product catalog, FX rates). Still not a replacement for PostgreSQL—state is partition-scoped or fully replicated with memory cost.

KTable<String, Product> products = builder.table("products-compacted");
KStream<String, Order> orders = builder.stream("orders");

orders.join(products, (order, product) -> order.enrichWith(product));

⚠️ Pitfall

Using compacted topics as “database” without monitoring compaction lag leads to stale reads. Compaction is background work—queryable state belongs in a store built for reads.

Multi-cluster patterns

Single-cluster Kafka hits regional latency, blast radius, and compliance boundaries. Multi-cluster topologies replicate events across clusters—each pattern trades complexity for availability and locality.

Pattern	Topology	Complexity
Active-passive	Primary cluster; DR standby receives MirrorMaker replication	Lower—failover is directional
Active-active	Two+ clusters, bidirectional replication	Highest—conflict resolution required
Hub-and-spoke	Regional clusters → central aggregation hub	Medium—one-way to hub, analytics/global view

Active-active

Producers write locally for latency; MirrorMaker replicates both directions. Same topic name in both regions— consumers must handle duplicate events and ordering conflicts. Strategies: leader region per aggregate, version vectors, last-write-wins (dangerous), CRDT-style merges.

Active-passive DR

Production traffic on cluster A; cluster B receives async replication. Failover: stop MM2 A→B or reverse direction, redirect DNS/bootstrap servers, accept seconds–minutes RPO. Consumers restart on B from translated offsets. Run failover drills quarterly—offset translation breaks if never tested.

Hub-and-spoke

EU, US, APAC clusters produce locally; MM2 replicates selected topics to global hub for analytics, ML feature store, compliance archive. Spokes do not replicate hub traffic back—avoids active-active conflicts.

MirrorMaker 2 (MM2)

Kafka Connect-based multi-cluster replication. Built-in connectors: MirrorSourceConnector, MirrorCheckpointConnector (offset sync), MirrorHeartbeatConnector.

Offset translation: maps source offset to target offset for consumer failover
Topic remapping: source.topic.replication.factor, prefix target.alias.topic if needed
Sync groups: emit.checkpoints.enabled for consumer offset migration

clusters = primary, dr
primary.bootstrap.servers = kafka-primary:9092
dr.bootstrap.servers = kafka-dr:9092
primary->dr.enabled = true
primary->dr.topics = orders.*, inventory.*
replication.policy.class = org.apache.kafka.connect.mirror.IdentityReplicationPolicy
sync.topic.acls.enabled = false
emit.checkpoints.enabled = true

Confluent Replicator vs MirrorMaker 2

Tool	Notes
MirrorMaker 2	Apache / Confluent open-source; Connect-based; standard for self-managed
Confluent Replicator	Commercial; auto topic creation, advanced monitoring, Confluent Cloud cluster linking; enterprise support
Cluster Linking	Confluent-native active-passive / stretch cluster semantics on CP/CC

⚖️ Trade-off

Multi-cluster doubles operational surface: ACL sync, schema registry per cluster vs centralized, clock skew in active-active. Default to active-passive DR until a product requirement truly needs bidirectional writes.

Request-reply over Kafka

Sometimes a service needs a response—not fire-and-forget. Kafka supports RPC-style patterns using reply topics, correlation IDs, and temporary consumer partitions—accept higher latency and timeout complexity vs HTTP/gRPC.

sequenceDiagram
  participant C as Client service
  participant RT as requests topic
  participant W as Worker service
  participant R as reply topic

  C->>RT: request + correlationId + replyTopic header
  W->>RT: consume request
  W->>R: reply + same correlationId
  R->>C: correlate and complete Future

Reply topic strategies

Strategy	Mechanism	Trade-off
Per-instance reply topic	`replies.order-service.instance-7`	No cross-talk; many topics to manage
Shared reply topic	Single `replies` + correlationId filter	Fewer topics; client filters by ID

Standard headers: KafkaHeaders.REPLY_TOPIC, KafkaHeaders.CORRELATION_ID, optional REPLY_PARTITION for stickiness.

Correlation ID

Client generates UUID per request; stores in ConcurrentHashMap<CorrelationKey, CompletableFuture>. Reply handler matches incoming correlation ID, completes future, removes entry. IDs must be unique per in-flight request.

Timeout handling

Wrap CompletableFuture with timeout (e.g. 30s). On timeout: complete exceptionally, optionally publish to DLT, do not leak map entries (always remove in finally). Worker should still process orphan requests idempotently.

@Bean
public ReplyingKafkaTemplate<String, InventoryRequest, InventoryResponse> replyingTemplate(
    ProducerFactory<String, InventoryRequest> pf,
    ConcurrentMessageListenerContainer<String, InventoryResponse> repliesContainer) {
  return new ReplyingKafkaTemplate<>(pf, repliesContainer);
}

// Send and wait
RequestReplyFuture<String, InventoryRequest, InventoryResponse> future =
    replyingTemplate.sendAndReceive(
        ProducerRecord<>("inventory-requests", request),
        Duration.ofSeconds(30));

InventoryResponse response = future.get(30, TimeUnit.SECONDS);

@KafkaListener(topics = "inventory-requests")
@SendTo  // uses reply topic from request header
public InventoryResponse handle(
    @Payload InventoryRequest request,
    @Header(KafkaHeaders.CORRELATION_ID) byte[] correlationId) {
  return inventoryService.check(request);
}

💡 Pro Tip

Prefer async events for decoupling. Use request-reply only for genuine query needs (inventory check, fraud score) where HTTP/gRPC is unavailable or you standardize all traffic on Kafka. Latency floor is two broker round-trips plus consumer poll interval.

🎯 Interview Tip

“RPC over Kafka?” — Request topic + REPLY_TOPIC header + correlation ID; client blocks on future with timeout. Not true sync—poll interval and rebalance can add jitter. gRPC for sync, Kafka for facts.