Kafka consumer lag grows for hours after a spike

Scenario

After a traffic spike or deploy, consumer lag climbs and stays high for hours. Downstream emails, inventory updates, or analytics are delayed. Business asks “when will we catch up?” You need a triage order: is the consumer stuck, slow, under-scaled, or rebalancing—and whether catch-up rate > produce rate.

After reading, you should be able to:

Why — lag is produce rate minus consume rate

Consumer lag is how far behind the consumer group’s committed offset is from the log end (per partition). Lag grows when messages arrive faster than you process them, or when processing stops (crashed consumer, rebalance loop, poison message retry, GC pause exceeding max.poll.interval.ms). Catching up takes time proportional to lag × processing_time / consumer_throughput.

Lag is not always one problem

PatternLikely cause
All partitions lag similarlyUnder-scaled consumers or globally slow handler
One partition laggingHot key, skewed partition, one stuck consumer
Lag flat / not decreasingConsumers not running or not committing offsets
Spiky rebalanceDeploy, session timeout, slow poll loop
Lag after spike onlyCapacity planning; temporary need scale-out

What — triage order (do in sequence)

  1. 1. Confirm lag and scope
    kafka-consumer-groups.sh --bootstrap-server $BROKER \
      --group my-group --describe
    Note LAG per partition, CURRENT-OFFSET vs LOG-END-OFFSET. Metrics: kafka_consumergroup_lag, Burrow, Prometheus Kafka exporter.
  2. 2. Are consumers alive? — consumer group members count matches expected pods; logs for crash loop, OOM, OutOfMemoryErrorOOM guide.
  3. 3. Rebalancing? — frequent “Revoking partitions” / “Assignment” logs; fix session timeout, static membership, cooperative rebalance; stabilize deploys.
  4. 4. Processing time per message — traces/logs: DB slow — DB slow, external HTTP — timeouts; thread dump if synchronous handler blocks.
  5. 5. max.poll.interval.ms exceeded? — consumer kicked from group if poll() not called in time (long handler between polls). Fix: smaller max.poll.records, async processing, or raise interval with care.
  6. 6. Poison message? — one partition stuck; same offset retried; skip to DLQ after N failures; idempotent handler.
  7. 7. Producer rate still elevated? — incoming bytes/sec on topic; lag shrinks only if consume > produce.
  8. 8. Consumer count vs partitions — max useful consumers = partition count; extra instances idle.
  9. 9. Hot partition / key skew — one partition has most lag; revisit partition key — hot key.

Catch-up math (rough)

time_to_clear ≈ total_lag_messages / (consumers × msgs_per_second_per_consumer)
# If produce_rate > consume_rate, lag never clears — scale or speed up handler

How — recover and prevent

Immediate mitigation

  1. Scale consumer deployment to min(partitions, affordable limit).
  2. Temporarily reduce producer rate or disable non-critical publishers if possible.
  3. Move poison messages to DLQ manually if identified.
  4. Increase DB pool / fix downstream only if handler is DB-bound.

Handler and config (Spring Kafka / Java)

# Faster poll loop — process quickly or async
max.poll.records: 100          # lower if each record is heavy
max.poll.interval.ms: 300000  # must exceed worst-case batch process time
session.timeout.ms: 45000
fetch.min.bytes: 1
enable.auto.commit: false     # commit after successful process + idempotency

# Concurrency: listener concurrency ≤ partitions on that topic

Scale partitions (planned change)

Increasing partitions increases parallelism ceiling but does not reorder per-key semantics—only do with key strategy understood. Cannot decrease partition count easily.

Prevention

Verify recovery

  1. Lag metric trending down steadily.
  2. End-to-end delay (produce timestamp → processed) within SLO.
  3. No rebalance storm in logs during steady state.

Interview one-liner

“I check lag per partition and whether consumers are alive and committing, then rebalance noise, handler slowness and max.poll.interval, poison messages, and whether we have enough consumers for partition count—scale consumers or speed the handler until consume rate beats produce rate.”

Related scenarios