Kafka consumer lag grows for hours after a spike
Scenario
After a traffic spike or deploy, consumer lag climbs and stays high for hours. Downstream emails, inventory updates, or analytics are delayed. Business asks “when will we catch up?” You need a triage order: is the consumer stuck, slow, under-scaled, or rebalancing—and whether catch-up rate > produce rate.
After reading, you should be able to:
- Read lag per partition and consumer group and spot hot partitions.
- Follow an ordered triage checklist from health → throughput → config.
- Scale consumers correctly (≤ partition count) and fix slow handlers.
- Handle poison messages, rebalance storms, and max.poll.interval.ms violations.
Why — lag is produce rate minus consume rate
Consumer lag is how far behind the consumer group’s committed offset is from the log end (per partition).
Lag grows when messages arrive faster than you process them, or when processing stops
(crashed consumer, rebalance loop, poison message retry, GC pause exceeding max.poll.interval.ms).
Catch-up time is roughly the lag divided by the net drain rate (consume rate minus produce rate). If consumers only match the produce rate, lag holds steady and never clears.
Lag is not always one problem
| Pattern | Likely cause |
|---|---|
| All partitions lag similarly | Under-scaled consumers or globally slow handler |
| One partition lagging | Hot key, skewed partition, one stuck consumer |
| Lag flat / not decreasing | Consumers not running or not committing offsets |
| Spiky rebalance | Deploy, session timeout, slow poll loop |
| Lag after spike only | Capacity planning; temporary scale-out needed |
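To make the first two rows of the table concrete, here is a minimal sketch that labels a per-partition lag snapshot as uniform or skewed. The class name, the `skewFactor` knob, and the median-based rule are all assumptions chosen for illustration, not something a Kafka tool provides. Detecting the "flat" and "rebalance" rows would need a time series, which this sketch does not cover.

```java
import java.util.Map;

/** Hypothetical helper: labels a per-partition lag snapshot as uniform or skewed. */
final class LagPattern {
    enum Kind { NO_LAG, UNIFORM, HOT_PARTITION }

    /**
     * skewFactor is an assumed tuning knob: the snapshot counts as skewed
     * when the largest lag exceeds skewFactor times the median lag.
     */
    static Kind classify(Map<Integer, Long> lagByPartition, double skewFactor) {
        long[] lags = lagByPartition.values().stream()
                .mapToLong(Long::longValue).sorted().toArray();
        if (lags.length == 0 || lags[lags.length - 1] == 0) {
            return Kind.NO_LAG;
        }
        long median = lags[lags.length / 2];
        long max = lags[lags.length - 1];
        // max(median, 1) keeps a single lagging partition visible when the rest are at zero.
        return max > skewFactor * Math.max(median, 1) ? Kind.HOT_PARTITION : Kind.UNIFORM;
    }

    public static void main(String[] args) {
        System.out.println(classify(Map.of(0, 1000L, 1, 1100L, 2, 900L), 5.0));
        System.out.println(classify(Map.of(0, 50L, 1, 40L, 2, 90_000L), 5.0));
    }
}
```

A uniform result points toward scaling or a globally slow handler; a skewed one points toward the key or a single stuck consumer.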
What — triage order (do in sequence)
- 1. Confirm lag and scope. Run:

        kafka-consumer-groups.sh --bootstrap-server $BROKER \
          --group my-group --describe

  Note LAG per partition, and CURRENT-OFFSET vs LOG-END-OFFSET. Metrics: kafka_consumergroup_lag, Burrow, Prometheus Kafka exporter.
- 2. Are consumers alive? Consumer group member count matches expected pods; check logs for crash loops, OOM kills, and OutOfMemoryError (see the OOM guide).
- 3. Rebalancing? Frequent “Revoking partitions” / “Assignment” logs; fix session timeout, use static membership or cooperative rebalancing; stabilize deploys.
- 4. Processing time per message: check traces/logs for slow DB queries and external HTTP timeouts; take a thread dump if a synchronous handler blocks.
- 5. max.poll.interval.ms exceeded? The consumer is kicked from the group if poll() is not called in time (long handler between polls). Fix: smaller max.poll.records, async processing, or raise the interval with care.
- 6. Poison message? One partition stuck; same offset retried. Skip to DLQ after N failures; keep the handler idempotent.
- 7. Producer rate still elevated? Check incoming bytes/sec on the topic; lag shrinks only if consume > produce.
- 8. Consumer count vs partitions: max useful consumers = partition count; extra instances idle.
- 9. Hot partition / key skew: one partition has most of the lag; revisit the partition key (hot key).
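The poison-message step ("skip to DLQ after N failures") can be sketched as a small failure counter keyed by partition and offset. This is an illustrative, in-memory sketch only: `PoisonTracker`, its method names, and the `partition@offset` key are hypothetical, and the actual publish to a dead-letter topic is left out. Spring Kafka ships its own error-handling and dead-letter support, which would normally replace hand-written tracking like this.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical tracker: decides when a repeatedly failing record should go to a DLQ. */
final class PoisonTracker {
    enum Action { RETRY, SEND_TO_DLQ }

    private final int maxAttempts;
    private final Map<String, Integer> failures = new HashMap<>();

    PoisonTracker(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /** Call after the handler fails for the record at (partition, offset). */
    Action onFailure(int partition, long offset) {
        String key = partition + "@" + offset;
        int attempts = failures.merge(key, 1, Integer::sum);
        if (attempts >= maxAttempts) {
            failures.remove(key); // stop tracking once the record is routed to the DLQ
            return Action.SEND_TO_DLQ;
        }
        return Action.RETRY;
    }

    /** Call after a successful handle so stale counters do not accumulate. */
    void onSuccess(int partition, long offset) {
        failures.remove(partition + "@" + offset);
    }

    public static void main(String[] args) {
        PoisonTracker tracker = new PoisonTracker(3);
        for (int i = 0; i < 3; i++) {
            System.out.println(tracker.onFailure(0, 42L));
        }
    }
}
```

Once the tracker returns SEND_TO_DLQ, the caller would publish the record to the dead-letter topic and commit past the offset so the partition can move on.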
Catch-up math (rough)
    time_to_clear ≈ total_lag_messages / (consumers × msgs_per_second_per_consumer)
    # If produce_rate > consume_rate, lag never clears: scale out or speed up the handler
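The rough formula can be extended to account for messages that keep arriving while you drain. The sketch below subtracts the produce rate from the aggregate consume rate and returns no estimate when the net rate is not positive. The class and parameter names are hypothetical, and the rates are assumed to be steady averages in messages per second.

```java
import java.time.Duration;
import java.util.Optional;

/** Hypothetical estimator for how long a consumer group needs to clear its lag. */
final class CatchUpEstimator {
    /**
     * Returns the estimated time to clear totalLag, or empty when consumers
     * do not outpace producers (lag would never clear).
     */
    static Optional<Duration> timeToClear(long totalLag, int consumers,
                                          double msgsPerSecPerConsumer,
                                          double produceMsgsPerSec) {
        double netRate = consumers * msgsPerSecPerConsumer - produceMsgsPerSec;
        if (netRate <= 0) {
            return Optional.empty();
        }
        long seconds = (long) Math.ceil(totalLag / netRate);
        return Optional.of(Duration.ofSeconds(seconds));
    }

    public static void main(String[] args) {
        // 3.6M messages behind, 6 consumers at 200 msg/s each, producers at 700 msg/s.
        System.out.println(timeToClear(3_600_000L, 6, 200.0, 700.0));
        // Producers outpace consumers: no finite estimate.
        System.out.println(timeToClear(1_000L, 2, 100.0, 300.0));
    }
}
```

With a produce rate of zero this reduces to the formula above; with a non-zero rate it shows why a modest scale-out can still take hours to drain a large backlog.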
How — recover and prevent
Immediate mitigation
- Scale consumer deployment to min(partitions, affordable limit).
- Temporarily reduce producer rate or disable non-critical publishers if possible.
- Move poison messages to DLQ manually if identified.
- Increase DB pool / fix downstream only if handler is DB-bound.
Handler and config (Spring Kafka / Java)
    # Faster poll loop: process quickly or async
    max.poll.records: 100          # lower if each record is heavy
    max.poll.interval.ms: 300000   # must exceed worst-case batch process time
    session.timeout.ms: 45000
    fetch.min.bytes: 1
    enable.auto.commit: false      # commit after successful process + idempotency
    # Concurrency: listener concurrency ≤ partitions on that topic
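The constraint that max.poll.interval.ms must exceed the worst-case batch time can be checked with simple arithmetic. The sketch below uses the standard Kafka consumer property names with the values from the settings above; `PollBudget` and the `worstCaseMsPerRecord` input are hypothetical, and the check assumes records in a batch are processed sequentially.

```java
import java.util.Properties;

/** Hypothetical sanity check: does a full batch fit inside max.poll.interval.ms? */
final class PollBudget {
    /** Consumer properties mirroring the settings above. */
    static Properties consumerProps() {
        Properties p = new Properties();
        p.setProperty("max.poll.records", "100");
        p.setProperty("max.poll.interval.ms", "300000");
        p.setProperty("session.timeout.ms", "45000");
        p.setProperty("fetch.min.bytes", "1");
        p.setProperty("enable.auto.commit", "false");
        return p;
    }

    /** worstCaseMsPerRecord is an assumed measurement, for example p99 handler latency. */
    static boolean fitsPollInterval(Properties p, long worstCaseMsPerRecord) {
        long records = Long.parseLong(p.getProperty("max.poll.records"));
        long intervalMs = Long.parseLong(p.getProperty("max.poll.interval.ms"));
        return records * worstCaseMsPerRecord < intervalMs;
    }

    public static void main(String[] args) {
        Properties props = consumerProps();
        System.out.println(fitsPollInterval(props, 500));   // 100 records x 500 ms each
        System.out.println(fitsPollInterval(props, 4_000)); // 100 records x 4 s each
    }
}
```

If the check fails, lowering max.poll.records is usually a safer first move than raising the interval, since a longer interval also delays detection of a genuinely stuck consumer.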
Scale partitions (planned change)
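Increasing partitions raises the parallelism ceiling, but it also changes which partition a given key maps to, so per-key ordering is not preserved across the change. Only do it once the key strategy is understood. Partition count cannot be decreased without recreating the topic.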
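A small sketch shows why adding partitions can move keys. It uses hash-modulo-partition-count assignment with `String.hashCode()` as a stand-in; Kafka's default partitioner hashes the serialized key bytes with a different hash function, so the specific partition numbers here are illustrative only. `KeyRemap` and its methods are hypothetical names.

```java
/** Hypothetical demo: hash-mod partitioning moves some keys when the partition count grows. */
final class KeyRemap {
    /** Stand-in partitioner: String.hashCode() modulo the partition count (an assumption). */
    static int partitionFor(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    /** True when the key lands on a different partition after the resize. */
    static boolean moves(String key, int partitionsBefore, int partitionsAfter) {
        return partitionFor(key, partitionsBefore) != partitionFor(key, partitionsAfter);
    }

    public static void main(String[] args) {
        for (String key : new String[] {"a", "ab"}) {
            System.out.println(key + ": " + partitionFor(key, 6) + " -> " + partitionFor(key, 12));
        }
    }
}
```

Records for a moved key that were written before the resize stay on the old partition, while new ones go to the new partition, so two consumers may process the same key concurrently until the old backlog drains.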
Prevention
- Alert on lag > threshold for 10+ minutes; SLO on processing delay.
- Load test consumer at 2× peak produce rate.
- DLQ + metrics on failure rate; idempotent consumers.
- Right-size max.poll.records vs handler duration; avoid long sync work in listener.
- Deploy: rolling with cooperative-sticky assignor to reduce rebalance pause.
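The "lag > threshold for 10+ minutes" alert above amounts to a small piece of state: remember when lag first crossed the threshold and fire only once it has stayed above it for the whole window. In practice a monitoring system's own alert rules would express this; the sketch below is a hypothetical in-process version, and `SustainedLagAlert` and its parameters are illustrative names.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical alert state: fires only when lag stays above a threshold for a full window. */
final class SustainedLagAlert {
    private final long threshold;
    private final Duration window;
    private Instant breachedSince; // null while lag is at or below the threshold

    SustainedLagAlert(long threshold, Duration window) {
        this.threshold = threshold;
        this.window = window;
    }

    /** Feed one lag sample; returns true when the alert should be firing. */
    boolean observe(Instant now, long lag) {
        if (lag <= threshold) {
            breachedSince = null; // recovery resets the timer
            return false;
        }
        if (breachedSince == null) {
            breachedSince = now;
        }
        return Duration.between(breachedSince, now).compareTo(window) >= 0;
    }

    public static void main(String[] args) {
        SustainedLagAlert alert = new SustainedLagAlert(10_000, Duration.ofMinutes(10));
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        System.out.println(alert.observe(t0, 50_000));
        System.out.println(alert.observe(t0.plusSeconds(600), 55_000));
    }
}
```

Requiring a sustained breach keeps short spikes that drain on their own from paging anyone.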
Verify recovery
- Lag metric trending down steadily.
- End-to-end delay (produce timestamp → processed) within SLO.
- No rebalance storm in logs during steady state.
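"Trending down steadily" can be checked mechanically on a series of lag samples taken at a fixed interval. The sketch below requires every sample to drop by at least a minimum amount from the previous one; `LagTrend` and the `minDrop` parameter are hypothetical, and a production check would likely tolerate some noise rather than demand a strict decrease.

```java
/** Hypothetical recovery check over evenly spaced lag samples. */
final class LagTrend {
    /** True when each sample is lower than the previous one by at least minDrop. */
    static boolean drainingSteadily(long[] lagSamples, long minDrop) {
        if (lagSamples.length < 2) {
            return false; // not enough data to call it a trend
        }
        for (int i = 1; i < lagSamples.length; i++) {
            if (lagSamples[i - 1] - lagSamples[i] < minDrop) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(drainingSteadily(new long[] {9000, 8000, 6900, 6000}, 500));
        System.out.println(drainingSteadily(new long[] {9000, 8900, 9100}, 0));
    }
}
```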
Interview one-liner
“I check lag per partition and whether consumers are alive and committing, then rebalance noise, handler slowness and max.poll.interval, poison messages, and whether we have enough consumers for partition count—scale consumers or speed the handler until consume rate beats produce rate.”