Kafka Architecture Patterns
Kafka is not just a message bus—it is the spine for event sourcing, CQRS read models, transactional outbox, saga orchestration, and multi-region replication. These patterns separate production systems from “we put JSON on a topic and hoped.”
Event sourcing with Kafka
Event sourcing stores state as a sequence of immutable facts—not the latest row in a table. Kafka’s append-only, partitioned log is a natural event store when you accept replay semantics and operational trade-offs.
Kafka as event store
- Append-only: new events are produced; history is never updated in place
- Ordered per key: partition by aggregate ID (orderId) preserves causal order for that entity
- Replayable: reset consumer offset or spin new projection group to rebuild from any point
- Durable: RF + acks=all + retention policy defines how far back you can time-travel
flowchart LR CMD[Command handler] ES[(Event store topic\norders-events)] P1[Projection: SQL read model] P2[Projection: search index] P3[Projection: analytics] CMD -->|append event| ES ES --> P1 ES --> P2 ES --> P3
Projections
A projection is a consumer that folds events into a queryable read model—PostgreSQL table, Elasticsearch index, Redis cache. Each projection maintains its own consumer group and offset. Crash recovery = resume from last committed offset. Full rebuild = new group + read from earliest.
void apply(OrderEvent event) {
switch (event.getType()) {
case ORDER_PLACED -> orderViewRepo.insert(event.toOrderView());
case ORDER_SHIPPED -> orderViewRepo.updateStatus(event.getOrderId(), SHIPPED);
case ORDER_CANCELLED -> orderViewRepo.markCancelled(event.getOrderId());
}
}
Snapshotting
Replaying millions of events per aggregate on cold start is slow. Snapshots persist periodic materialized state plus the offset/event sequence they represent. On startup: load snapshot → replay only events after snapshot offset.
- Every N events (or time interval), write snapshot to object store or compacted topic
- Snapshot record: { aggregateId, version, state, timestamp }
- Handler loads snapshot at version V, applies events V+1…latest
Compacted topics as current state
cleanup.policy=compact retains the latest record per key—useful for “current aggregate state” alongside the full event log. Pattern: event topic (delete retention) for audit + compacted topic for fast key lookup of latest state. Compaction is asynchronous—do not treat it as synchronous read-your-writes.
| Approach | Kafka event log | Dedicated store (EventStoreDB) |
|---|---|---|
| Optimistic concurrency | App-level version in event payload | Built-in expected version |
| Stream processing | Native Kafka Streams/Flink | Requires bridge |
| Ops surface | One platform if Kafka already central | Additional cluster to operate |
| Event semantics | At-least-once default; EOS with transactions | Purpose-built stream semantics |
Kafka excels as event store when the org already standardizes on it and projections run in Streams/Flink. EventStoreDB wins when you need first-class aggregate streams, metadata, and subscription APIs without building them on Connect/Streams.
CQRS + Kafka
Command Query Responsibility Segregation splits the write path (commands, invariants, events) from the read path (optimized queries). Kafka carries the immutable event stream between them.
flowchart TB API[HTTP / gRPC API] CH[Command handler\nwrite model + DB] ET[(events topic)] RM1[Read model: Order list DB] RM2[Read model: Customer 360] QAPI[Query API] API -->|command| CH CH -->|transactional write + event| ET ET --> RM1 ET --> RM2 RM1 --> QAPI RM2 --> QAPI
Write side
- API receives command (PlaceOrder, CancelOrder)
- Command handler validates invariants against write model (often normalized OLTP schema)
- On success: persist state + emit domain event to Kafka (ideally via outbox—see below)
- Write DB optimized for consistency, not dashboard queries
Read side
Independent consumers subscribe to the same event stream and maintain denormalized read models—wide tables, materialized views, search documents. Query API reads only from read stores, never from the event log directly in user-facing paths.
Multiple read models from one stream
One orders-events topic feeds: operational SQL view, BI warehouse sink, customer notification service, fraud scoring. Each uses a different consumer group—see Fan-Out pattern. Schema evolution via Registry keeps all projections compatible.
Eventual consistency window
After a successful command, read models lag by milliseconds to seconds. UX must acknowledge this:
- Read-your-writes: route query to write DB for same session, or poll until read model catches up (version token)
- Optimistic UI: show pending state until projection confirms
- Idempotent projections: replay-safe so lag spikes do not corrupt read models
Include eventId and aggregateVersion in every event. Query APIs can expose lastProcessedVersion so clients know when read model is fresh.
“CQRS with Kafka?” — Commands hit write service + DB; events go to Kafka; N consumers build read models. Not mandatory to use Kafka on command ingress—only as the event backbone. Consistency is eventual unless you add synchronous read-your-writes hacks.
Outbox pattern
The dual-write problem: updating PostgreSQL and publishing to Kafka in one business operation without a distributed transaction. The outbox makes the database the single source of truth for “what must be published.”
Dual write problem
Naive approach fails in both directions:
- DB commit succeeds → Kafka publish fails → downstream never learns of change
- Kafka publish succeeds → DB rolls back → ghost events in the bus
sequenceDiagram participant S as Order service participant DB as PostgreSQL participant CDC as Debezium participant K as Kafka S->>DB: BEGIN S->>DB: INSERT orders ... S->>DB: INSERT outbox (event payload) S->>DB: COMMIT CDC->>DB: read WAL CDC->>K: publish domain event Note over S,K: Single atomic DB transaction — no direct Kafka write
Outbox table design
CREATE TABLE outbox (
id UUID PRIMARY KEY,
aggregate_id VARCHAR(64) NOT NULL,
event_type VARCHAR(128) NOT NULL,
payload JSONB NOT NULL,
idempotency_key VARCHAR(128) NOT NULL UNIQUE,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Same transaction as business write:
INSERT INTO orders (...) VALUES (...);
INSERT INTO outbox (id, aggregate_id, event_type, payload, idempotency_key)
VALUES (gen_random_uuid(), 'ord-42', 'OrderPlaced', '{"orderId":"ord-42",...}', 'ord-42-OrderPlaced-v1');
Debezium CDC → Kafka
Debezium captures INSERT on outbox table from WAL—no polling, no application thread publishing after commit. Use Outbox Event Router SMT to route by event_type to domain topics and unwrap payload.
{
"transforms": "outbox",
"transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
"transforms.outbox.route.topic.replacement": "${routedByValue}",
"transforms.outbox.table.field.event.key": "aggregate_id",
"transforms.outbox.table.field.event.payload": "payload",
"transforms.outbox.table.field.event.type": "event_type",
"transforms.outbox.table.field.event.id": "id",
"transforms.outbox.route.by.field": "event_type"
}
Idempotency key
Outbox delivery is at-least-once (CDC retry, Kafka retry). Downstream consumers dedupe on idempotency_key or eventId—unique constraint in application store or Redis TTL cache.
Delivery guarantee
Guaranteed: if business transaction commits, event will eventually appear on Kafka (barring catastrophic WAL loss). Not guaranteed: exactly-once end-to-end without idempotent consumers. Alternative relay: polling outbox table in same service (transactional outbox relay)—higher app complexity, no CDC lag.
Prune published outbox rows via scheduled job or separate compaction process—CDC replays are not an excuse to let the table grow forever. Some teams move to published_at column and archive after 7 days.
Saga orchestration via Kafka
Long-running business transactions span services without 2PC. A saga orchestrator emits commands, listens for replies, and triggers compensating events on failure—often implemented as a Kafka Streams stateful topology.
sequenceDiagram participant O as Orchestrator\n(Kafka Streams) participant I as Inventory svc participant P as Payment svc participant S as Shipping svc O->>I: ReserveStock command I-->>O: StockReserved reply O->>P: ChargePayment command P-->>O: PaymentFailed reply O->>I: ReleaseStock compensate O->>S: CancelShipment compensate
Orchestrator as Kafka Streams topology
Stateful processor maintains saga instance per correlationId in a KeyValueStore: current step, completed steps, timeout deadline. On reply topic event, transition state machine; emit next command or compensation.
enum SagaStep { STARTED, STOCK_RESERVED, PAYMENT_DONE, SHIPPED, COMPENSATING, FAILED }
record SagaState(
String correlationId,
SagaStep step,
Instant startedAt,
Map<String, String> context
) {}
Command and reply topics
| Topic | Direction | Content |
|---|---|---|
inventory-commands | Orchestrator → Inventory | ReserveStock, ReleaseStock |
inventory-replies | Inventory → Orchestrator | StockReserved, StockFailed |
payment-commands | Orchestrator → Payment | ChargePayment, RefundPayment |
payment-replies | Payment → Orchestrator | PaymentCaptured, PaymentFailed |
Compensating transactions
Compensations are new events, not rollbacks—ReleaseStock, RefundPayment. Each must be idempotent (safe if orchestrator retries). Design compensations to be semantically valid even if forward step partially applied.
Correlation ID propagation
Every command, reply, and compensation carries the same correlationId (and often sagaId) in headers and payload. Services log and trace on this ID—OpenTelemetry baggage should mirror Kafka headers.
correlationId=550e8400-e29b-41d4-a716-446655440000
sagaId=order-saga-ord-42
replyTopic=saga-orchestrator-replies
messageType=StockReserved
Dead letter for stuck sagas
Timeout processor scans state store for sagas past deadline → emit to saga-timeouts or saga.DLT. Human workflow or automated remediation replays/compensates. Never silently drop intermediate states—operators need visibility.
Choreography (no central orchestrator) reduces coupling but makes compensation graphs hard to reason about. Orchestration centralizes logic but is a single point of failure—run Streams app with standby tasks and monitor rocksDB state size.
Dead letter queue (DLQ) pattern
Poison messages must not block the entire partition forever. Route failures to a DLT with rich metadata, fix the bug, replay—while healthy records keep flowing.
When to use
- Deserialization failures (schema mismatch after deploy)
- Business rule violations that will never succeed on retry (invalid country code)
- Downstream timeout exhausted—record needs human triage
- Not for transient DB blips—use retry with backoff first
DLT naming convention
Standard: {original-topic}.DLT — e.g. orders.DLT. Spring Kafka default follows this. Some orgs add .DLQ or environment prefix prod.orders.DLT.
What to store in DLT
Preserve enough context to debug and replay without guessing:
- Original record: key, value, headers (or base64 if binary)
- Exception: class name, message, stack trace (truncated)
- Source metadata: original topic, partition, offset, timestamp
- Consumer context: group id, processing attempt count
{
"originalTopic": "orders",
"originalPartition": 3,
"originalOffset": 918273,
"originalTimestamp": 1717200000456,
"consumerGroup": "order-processor",
"errorClass": "com.acme.ValidationException",
"errorMessage": "Unknown currency: XYZ",
"stackTrace": "...",
"payload": { "orderId": "ord-99", "currency": "XYZ" }
}
Reprocessing from DLT
- Fix code or data contract
- Deploy fixed consumer
- Run one-off reprocessor consumer on orders.DLT → republish to orders or process inline with idempotency
- Verify metrics: main topic lag stable, DLT count not growing
Spring Kafka DeadLetterPublishingRecoverer
@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<String, Order> template) {
DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(
template,
(record, ex) -> new TopicPartition(record.topic() + ".DLT", record.partition())
);
recoverer.setHeadersFunction((record, ex) -> {
RecordHeaders headers = new RecordHeaders();
headers.add("dlt-original-offset", longToBytes(record.offset()));
headers.add("dlt-exception-message", ex.getMessage().getBytes(StandardCharsets.UTF_8));
return headers;
});
DefaultErrorHandler handler = new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3));
handler.addNotRetryableExceptions(ValidationException.class);
return handler;
}
DLT topics need their own retention and ACLs. Alert on orders.DLT message rate > 0 sustained—silent DLT growth is production debt.
Fan-out pattern
One event stream, many independent consumers—each consumer group tracks its own offset and processing speed. This is Kafka’s killer feature vs competing consumer destruction on traditional queues.
flowchart TB T[(orders topic)] G1[Group: order-processor\noperational DB] G2[Group: audit-archiver\nS3 immutable log] G3[Group: search-indexer\nElasticsearch] G4[Group: analytics-etl\nwarehouse] T --> G1 T --> G2 T --> G3 T --> G4
Key property: adding a new consumer group does not steal messages from existing groups. Audit can be 24 hours behind while order processing stays real-time—no coupling.
Use cases on same Order events
| Consumer group | Purpose | Lag tolerance |
|---|---|---|
order-processor | Fulfillment write path | Seconds |
audit-archiver | Compliance immutable store | Minutes–hours |
search-indexer | Customer order search UI | Seconds–minutes |
analytics-etl | BI warehouse facts | Hours |
Each group commits offsets to __consumer_offsets under unique (group, topic, partition) keys. Fan-out multiplies consumer connections and fetch traffic—size brokers for aggregate fan-out bytes-out, not single group.
Event-driven microservices
Services communicate through domain events on Kafka—not shared databases. Each service owns its data, publishes facts about its bounded context, and subscribes to others’ events for integration.
Topic ownership
- One producing service per domain topic — Order Service owns orders.events
- Many consumers — Inventory, Billing, Notification read-only subscribe
- No foreign writes — consumers never produce to another team’s topic; they emit their own domain events
- ACLs enforce: only User:order-service has Write on orders.events
flowchart LR OS[Order Service\nowns orders.events] IS[Inventory Service\nowns inventory.events] NS[Notification Service\nowns notifications.events] OS -->|OrderPlaced| IS OS -->|OrderPlaced| NS IS -->|StockReserved| OS
Schema Registry as contract
Avro subjects per topic (orders.events-value) with FULL_TRANSITIVE compatibility. Producer CI registers schema; consumers codegen from same artifact. Breaking changes blocked at publish time—not discovered in staging consumers.
Backward compatibility at publish time
Serializer calls registry compatibility check before write. Deploy order: register compatible schema → deploy consumers (new reader) → deploy producer (new writer). Event-driven coupling is schema coupling—govern it like an API gateway.
Platform teams provide a “topic request” workflow: owning team, retention tier, schema subject, allowed consumers, PII classification. Prevents 400 orphan topics and ACL sprawl.
Kafka as database (anti-pattern analysis)
“We’ll skip PostgreSQL and just use Kafka” fails predictably. Kafka is a log, not a query engine— know what it can substitute for and what it cannot.
| Capability | Kafka | RDBMS / NoSQL |
|---|---|---|
| Durable append log | Core strength | Transaction log internally, not exposed |
| Key-based compaction (latest per key) | Compacted topics | Primary key upsert |
| Replay / time-travel | Offset reset, new consumer groups | Binlog CDC only |
| Random access by key | O(n) scan or KTable materialization | Indexed lookup |
| Secondary indexes | Not native | B-tree, GIN, etc. |
| Complex queries (JOIN, aggregate ad hoc) | Streams/Flink jobs, not interactive SQL on broker | SQL / query DSL |
| ACID across arbitrary keys/topics | Transactions within Kafka boundary | Multi-row ACID |
When to combine Kafka + DB
Kafka: event stream, integration bus, log of record for domain events. Database: queryable current state, indexes, reporting, user-facing reads. Sync via projections (CQRS) or CDC (outbox). This is the dominant production architecture—not either/or.
KTable + GlobalKTable for in-stream lookup
Kafka Streams KTable materializes changelog topic into key-value state—O(1) lookup inside a topology. GlobalKTable replicates entire compacted topic to every instance for join enrichment (product catalog, FX rates). Still not a replacement for PostgreSQL—state is partition-scoped or fully replicated with memory cost.
KTable<String, Product> products = builder.table("products-compacted");
KStream<String, Order> orders = builder.stream("orders");
orders.join(products, (order, product) -> order.enrichWith(product));
Using compacted topics as “database” without monitoring compaction lag leads to stale reads. Compaction is background work—queryable state belongs in a store built for reads.
Multi-cluster patterns
Single-cluster Kafka hits regional latency, blast radius, and compliance boundaries. Multi-cluster topologies replicate events across clusters—each pattern trades complexity for availability and locality.
| Pattern | Topology | Complexity |
|---|---|---|
| Active-passive | Primary cluster; DR standby receives MirrorMaker replication | Lower—failover is directional |
| Active-active | Two+ clusters, bidirectional replication | Highest—conflict resolution required |
| Hub-and-spoke | Regional clusters → central aggregation hub | Medium—one-way to hub, analytics/global view |
Active-active
Producers write locally for latency; MirrorMaker replicates both directions. Same topic name in both regions— consumers must handle duplicate events and ordering conflicts. Strategies: leader region per aggregate, version vectors, last-write-wins (dangerous), CRDT-style merges.
Active-passive DR
Production traffic on cluster A; cluster B receives async replication. Failover: stop MM2 A→B or reverse direction, redirect DNS/bootstrap servers, accept seconds–minutes RPO. Consumers restart on B from translated offsets. Run failover drills quarterly—offset translation breaks if never tested.
Hub-and-spoke
EU, US, APAC clusters produce locally; MM2 replicates selected topics to global hub for analytics, ML feature store, compliance archive. Spokes do not replicate hub traffic back—avoids active-active conflicts.
MirrorMaker 2 (MM2)
Kafka Connect-based multi-cluster replication. Built-in connectors: MirrorSourceConnector, MirrorCheckpointConnector (offset sync), MirrorHeartbeatConnector.
- Offset translation: maps source offset to target offset for consumer failover
- Topic remapping: source.topic.replication.factor, prefix target.alias.topic if needed
- Sync groups: emit.checkpoints.enabled for consumer offset migration
clusters = primary, dr
primary.bootstrap.servers = kafka-primary:9092
dr.bootstrap.servers = kafka-dr:9092
primary->dr.enabled = true
primary->dr.topics = orders.*, inventory.*
replication.policy.class = org.apache.kafka.connect.mirror.IdentityReplicationPolicy
sync.topic.acls.enabled = false
emit.checkpoints.enabled = true
Confluent Replicator vs MirrorMaker 2
| Tool | Notes |
|---|---|
| MirrorMaker 2 | Apache / Confluent open-source; Connect-based; standard for self-managed |
| Confluent Replicator | Commercial; auto topic creation, advanced monitoring, Confluent Cloud cluster linking; enterprise support |
| Cluster Linking | Confluent-native active-passive / stretch cluster semantics on CP/CC |
Multi-cluster doubles operational surface: ACL sync, schema registry per cluster vs centralized, clock skew in active-active. Default to active-passive DR until a product requirement truly needs bidirectional writes.
Request-reply over Kafka
Sometimes a service needs a response—not fire-and-forget. Kafka supports RPC-style patterns using reply topics, correlation IDs, and temporary consumer partitions—accept higher latency and timeout complexity vs HTTP/gRPC.
sequenceDiagram participant C as Client service participant RT as requests topic participant W as Worker service participant R as reply topic C->>RT: request + correlationId + replyTopic header W->>RT: consume request W->>R: reply + same correlationId R->>C: correlate and complete Future
Reply topic strategies
| Strategy | Mechanism | Trade-off |
|---|---|---|
| Per-instance reply topic | replies.order-service.instance-7 |
No cross-talk; many topics to manage |
| Shared reply topic | Single replies + correlationId filter |
Fewer topics; client filters by ID |
Standard headers: KafkaHeaders.REPLY_TOPIC, KafkaHeaders.CORRELATION_ID, optional REPLY_PARTITION for stickiness.
Correlation ID
Client generates UUID per request; stores in ConcurrentHashMap<CorrelationKey, CompletableFuture>. Reply handler matches incoming correlation ID, completes future, removes entry. IDs must be unique per in-flight request.
Timeout handling
Wrap CompletableFuture with timeout (e.g. 30s). On timeout: complete exceptionally, optionally publish to DLT, do not leak map entries (always remove in finally). Worker should still process orphan requests idempotently.
@Bean
public ReplyingKafkaTemplate<String, InventoryRequest, InventoryResponse> replyingTemplate(
ProducerFactory<String, InventoryRequest> pf,
ConcurrentMessageListenerContainer<String, InventoryResponse> repliesContainer) {
return new ReplyingKafkaTemplate<>(pf, repliesContainer);
}
// Send and wait
RequestReplyFuture<String, InventoryRequest, InventoryResponse> future =
replyingTemplate.sendAndReceive(
ProducerRecord<>("inventory-requests", request),
Duration.ofSeconds(30));
InventoryResponse response = future.get(30, TimeUnit.SECONDS);
@KafkaListener(topics = "inventory-requests")
@SendTo // uses reply topic from request header
public InventoryResponse handle(
@Payload InventoryRequest request,
@Header(KafkaHeaders.CORRELATION_ID) byte[] correlationId) {
return inventoryService.check(request);
}
Prefer async events for decoupling. Use request-reply only for genuine query needs (inventory check, fraud score) where HTTP/gRPC is unavailable or you standardize all traffic on Kafka. Latency floor is two broker round-trips plus consumer poll interval.
“RPC over Kafka?” — Request topic + REPLY_TOPIC header + correlation ID; client blocks on future with timeout. Not true sync—poll interval and rebalance can add jitter. gRPC for sync, Kafka for facts.