Kafka consumer lag grows for hours after a spike

Scenario

After a traffic spike or deploy, consumer lag climbs and stays high for hours. Downstream emails, inventory updates, or analytics are delayed. Business asks “when will we catch up?” You need a triage order: is the consumer stuck, slow, under-scaled, or rebalancing—and whether catch-up rate > produce rate.

After reading, you should be able to:

Read lag per partition and consumer group and spot hot partitions.
Follow a ordered triage checklist from health → throughput → config.
Scale consumers correctly (≤ partition count) and fix slow handlers.
Handle poison messages, rebalance storms, and max.poll.interval violations.

Why — lag is produce rate minus consume rate

Consumer lag is how far behind the consumer group’s committed offset is from the log end (per partition). Lag grows when messages arrive faster than you process them, or when processing stops (crashed consumer, rebalance loop, poison message retry, GC pause exceeding max.poll.interval.ms). Catching up takes time proportional to lag × processing_time / consumer_throughput.

Lag is not always one problem

Pattern	Likely cause
All partitions lag similarly	Under-scaled consumers or globally slow handler
One partition lagging	Hot key, skewed partition, one stuck consumer
Lag flat / not decreasing	Consumers not running or not committing offsets
Spiky rebalance	Deploy, session timeout, slow poll loop
Lag after spike only	Capacity planning; temporary need scale-out

What — triage order (do in sequence)

1. Confirm lag and scope
```
kafka-consumer-groups.sh --bootstrap-server $BROKER \
  --group my-group --describe
```
Note LAG per partition, CURRENT-OFFSET vs LOG-END-OFFSET. Metrics: kafka_consumergroup_lag, Burrow, Prometheus Kafka exporter.
2. Are consumers alive? — consumer group members count matches expected pods; logs for crash loop, OOM, OutOfMemoryError — OOM guide.
3. Rebalancing? — frequent “Revoking partitions” / “Assignment” logs; fix session timeout, static membership, cooperative rebalance; stabilize deploys.
4. Processing time per message — traces/logs: DB slow — DB slow, external HTTP — timeouts; thread dump if synchronous handler blocks.
5. max.poll.interval.ms exceeded? — consumer kicked from group if poll() not called in time (long handler between polls). Fix: smaller max.poll.records, async processing, or raise interval with care.
6. Poison message? — one partition stuck; same offset retried; skip to DLQ after N failures; idempotent handler.
7. Producer rate still elevated? — incoming bytes/sec on topic; lag shrinks only if consume > produce.
8. Consumer count vs partitions — max useful consumers = partition count; extra instances idle.
9. Hot partition / key skew — one partition has most lag; revisit partition key — hot key.

Catch-up math (rough)

time_to_clear ≈ total_lag_messages / (consumers × msgs_per_second_per_consumer)
# If produce_rate > consume_rate, lag never clears — scale or speed up handler

How — recover and prevent

Immediate mitigation

Scale consumer deployment to min(partitions, affordable limit).
Temporarily reduce producer rate or disable non-critical publishers if possible.
Move poison messages to DLQ manually if identified.
Increase DB pool / fix downstream only if handler is DB-bound.

Handler and config (Spring Kafka / Java)

# Faster poll loop — process quickly or async
max.poll.records: 100          # lower if each record is heavy
max.poll.interval.ms: 300000  # must exceed worst-case batch process time
session.timeout.ms: 45000
fetch.min.bytes: 1
enable.auto.commit: false     # commit after successful process + idempotency

# Concurrency: listener concurrency ≤ partitions on that topic

Scale partitions (planned change)

Increasing partitions increases parallelism ceiling but does not reorder per-key semantics—only do with key strategy understood. Cannot decrease partition count easily.

Prevention

Alert on lag > threshold for 10+ minutes; SLO on processing delay.
Load test consumer at 2× peak produce rate.
DLQ + metrics on failure rate; idempotent consumers.
Right-size max.poll.records vs handler duration; avoid long sync work in listener.
Deploy: rolling with cooperative-sticky assignor to reduce rebalance pause.

Verify recovery

Lag metric trending down steadily.
End-to-end delay (produce timestamp → processed) within SLO.
No rebalance storm in logs during steady state.

Interview one-liner

“I check lag per partition and whether consumers are alive and committing, then rebalance noise, handler slowness and max.poll.interval, poison messages, and whether we have enough consumers for partition count—scale consumers or speed the handler until consume rate beats produce rate.”