Kafka Performance Tuning

Throughput and latency pull in opposite directions—fat batches and lz4 compression maximize MB/s; linger.ms=0 and fetch.min.bytes=1 minimize p99. Measure with official perf tools, tune one layer at a time, and never optimize without a baseline.

Producer throughput tuning

Producer throughput is batching + compression + async pipelining. The producer accumulates records per partition, compresses the batch, and sends in fewer requests. Blocking on every send destroys all of that.

Benchmark baseline: kafka-producer-perf-test.sh

Establish MB/s and records/s before changing configs. Run from a host with network proximity to brokers:

bash — producer perf test
kafka-producer-perf-test.sh \
  --topic perf-test \
  --num-records 5000000 \
  --record-size 1024 \
  --throughput -1 \
  --producer-props \
    bootstrap.servers=kafka:9092 \
    acks=1 \
    batch.size=131072 \
    linger.ms=20 \
    compression.type=lz4 \
    buffer.memory=134217728

# Output: records/sec, MB/sec, avg latency, max latency, 50th/95th/99th/999th percentiles

Vary one knob per run. Save results with the topic partition count and broker version noted; perf tests on empty clusters overstate real-world numbers.

Throughput-oriented producer config

| Property | Throughput value | Why |
|---|---|---|
| batch.size | 131072 (128 KB) | Larger batches per partition → fewer requests |
| linger.ms | 20 | Wait up to 20 ms to fill a batch; trades latency for batch size |
| compression.type | lz4 | Low CPU cost, good ratio; consider zstd when there is CPU headroom and traffic crosses a WAN |
| buffer.memory | 134217728 (128 MB) | More room to buffer unsent batches before send() blocks for up to max.block.ms |
| acks | 1 | Leader ack only; faster than all, but not durability-safe |

Properties — high-throughput producer
batch.size=131072
linger.ms=20
compression.type=lz4
buffer.memory=134217728
acks=1
retries=3
max.in.flight.requests.per.connection=5
⚠️ Pitfall

acks=1 with high throughput is for metrics and load tests, not billing or inventory. Production durability paths use acks=all and accept a lower MB/s ceiling.

Async send — never block in a loop

producer.send(record).get() in a tight loop serializes on broker round-trip time: only one request is in flight at a time, so throughput collapses. Send asynchronously, handle failures in the send callback, and call flush() at intervals when you need a completion point.

Java — bad vs good
// BAD — synchronous, ~hundreds msg/s
for (Event e : events) {
  producer.send(new ProducerRecord<>("topic", e.getKey(), e)).get();
}

// GOOD: async pipeline. send() returns immediately; the callback surfaces failures.
for (Event e : events) {
  producer.send(new ProducerRecord<>("topic", e.getKey(), e), (metadata, exception) -> {
    if (exception != null) {
      exception.printStackTrace(); // replace with real error handling
    }
  });
}
producer.flush(); // block once until all buffered sends have completed
// In a streaming loop, call producer.flush() every N records instead.
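
To see why the blocking loop tops out at a few hundred messages per second, a back-of-the-envelope model helps: with one request in flight, the ceiling is roughly one record per round trip. The sketch below is only an illustration. `SyncSendCeiling`, its method, and the 2 ms round trip are hypothetical, and real numbers depend on broker load and the acks setting.

```java
// Rough model (an assumption, not a measurement): a loop that blocks on
// send(...).get() keeps one request in flight, so it completes at most one
// record per producer-to-broker round trip.
final class SyncSendCeiling {

    /** Upper bound on records/s when every send waits for its ack. */
    static double maxRecordsPerSecond(double roundTripMillis) {
        if (roundTripMillis <= 0) {
            throw new IllegalArgumentException("roundTripMillis must be positive");
        }
        return 1000.0 / roundTripMillis;
    }

    public static void main(String[] args) {
        // Hypothetical 2 ms round trip, including broker append time.
        System.out.println(maxRecordsPerSecond(2.0) + " records/s at most");
    }
}
```

The async version is not bound by this ceiling because many batches are in flight while earlier ones are still being acknowledged.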

Partition count and producer parallelism

The producer maintains a separate batch per partition. More partitions mean more batches that can be filled and sent in parallel, up to broker and network limits. Expect diminishing returns past ~100–200 partitions per topic without more brokers. Align with target consumer parallelism (see Partition guidelines).

flowchart LR
  P[Producer]
  B0[Batch partition 0]
  B1[Batch partition 1]
  B2[Batch partition 2]
  BR[Broker leaders]

  P --> B0
  P --> B1
  P --> B2
  B0 --> BR
  B1 --> BR
  B2 --> BR
💡 Pro Tip

Watch the batch-size-avg JMX metric: if it is far below batch.size, increase linger.ms or the record rate per partition. Near-empty batches mean you are latency-tuned, not throughput-tuned.
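
A crude estimate shows when a batch can actually fill before linger.ms expires. The sketch below assumes a steady record rate, a fixed record size, and no compression. The class name and the 1,000 records/s input are hypothetical, and the model ignores that the producer may send earlier when the sender thread is idle.

```java
// Estimates how full a per-partition batch gets before linger.ms expires.
// Simplifying assumptions: steady record rate, fixed record size, no compression.
final class BatchFillEstimate {

    /** Bytes accumulated in one partition's batch during a linger window, capped at batch.size. */
    static long bytesPerBatch(long recordsPerSecPerPartition, int recordBytes,
                              int lingerMs, int batchSizeBytes) {
        long accumulated = recordsPerSecPerPartition * recordBytes * lingerMs / 1000;
        return Math.min(accumulated, batchSizeBytes);
    }

    /** Fill ratio in [0, 1]; compare it with batch-size-avg / batch.size from JMX. */
    static double fillRatio(long recordsPerSecPerPartition, int recordBytes,
                            int lingerMs, int batchSizeBytes) {
        long bytes = bytesPerBatch(recordsPerSecPerPartition, recordBytes, lingerMs, batchSizeBytes);
        return (double) bytes / batchSizeBytes;
    }

    public static void main(String[] args) {
        // 1 KB records, linger.ms=20, batch.size=131072 as in the config above;
        // the per-partition rate of 1,000 records/s is a hypothetical input.
        System.out.println(fillRatio(1_000, 1024, 20, 131_072));
    }
}
```

Under these assumptions a partition needs about 6,400 records/s of 1 KB records to fill a 128 KB batch within 20 ms; below that rate, a larger batch.size buys nothing.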

Consumer throughput tuning

Consumers are fetch-bound and process-bound. Tune fetch size to reduce empty polls, increase max.poll.records, and never run slow I/O inside the poll thread longer than max.poll.interval.ms.

Benchmark: kafka-consumer-perf-test.sh

bash — consumer perf test
# First, load topic with producer perf test, then:
kafka-consumer-perf-test.sh \
  --bootstrap-server kafka:9092 \
  --topic perf-test \
  --messages 5000000 \
  --consumer.config consumer-perf.properties

# consumer-perf.properties:
# fetch.min.bytes=100000
# fetch.max.wait.ms=500
# max.poll.records=1000

Throughput-oriented consumer config

| Property | Throughput value | Effect |
|---|---|---|
| fetch.min.bytes | 100000 (~100 KB) | Broker waits until at least this much data is available per fetch |
| fetch.max.wait.ms | 500 | Max wait to accumulate fetch.min.bytes |
| max.poll.records | 1000 | More records per poll loop iteration |
| fetch.max.bytes | 52428800 (50 MB) | Upper bound per fetch response |
| max.partition.fetch.bytes | 1048576 (1 MB) | Per-partition cap within a fetch |

Properties — high-throughput consumer
fetch.min.bytes=100000
fetch.max.wait.ms=500
max.poll.records=1000
max.poll.interval.ms=300000
session.timeout.ms=45000
enable.auto.commit=false

Parallel processing

Two levers—often combined:

  • Increase partition count — more buckets for consumer group to assign
  • Increase concurrency — Spring concurrency or multiple consumer instances in same group

Effective parallelism = min(partitions, consumer_threads). Adding 20 threads to 6 partitions wastes 14 threads.
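
The rule above is simple enough to check in code before scaling a deployment. This is a minimal sketch with hypothetical class and method names; it counts consumer threads within a single consumer group.

```java
// Effective parallelism = min(partitions, consumer threads) within one group.
final class ConsumerParallelism {

    /** Number of consumer threads that can actually own a partition. */
    static int effective(int partitions, int consumerThreads) {
        return Math.min(partitions, consumerThreads);
    }

    /** Threads left without any partition assignment. */
    static int idleThreads(int partitions, int consumerThreads) {
        return consumerThreads - effective(partitions, consumerThreads);
    }

    public static void main(String[] args) {
        // The example from the text: 20 threads on a 6-partition topic.
        System.out.println(effective(6, 20) + " effective, " + idleThreads(6, 20) + " idle");
    }
}
```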

Avoid slow work in the poll loop

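If processing between two poll() calls exceeds max.poll.interval.ms, the consumer is considered failed and leaves the group; its partitions are reassigned, and repeated timeouts turn into a rebalance storm. Patterns for heavy handlers: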

  1. Delegate to thread pool — poll thread hands off records; commit after worker completes (careful with ordering)
  2. pause / resume — pause partition while backlog processes, resume when caught up
  3. Batch + bulk insert — DB bulk write per poll batch instead of per-record round trips
Java — pause while processing backlog
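// KafkaConsumer is not thread-safe, so a worker thread must not call consumer.resume().
// Assumes Spring Kafka 2.7+, manual ack mode, and an injected KafkaListenerEndpointRegistry
// field named registry; pausePartition/resumePartition can be called from any thread.
@KafkaListener(id = "heavy-listener", topics = "events")
public void onMessage(ConsumerRecord<String, Event> record, Acknowledgment ack) {
  TopicPartition tp = new TopicPartition(record.topic(), record.partition());
  MessageListenerContainer container = registry.getListenerContainer("heavy-listener");
  container.pausePartition(tp); // applied by the container on its consumer thread
  workerPool.submit(() -> {
    try {
      heavyProcess(record.value());
      ack.acknowledge();
    } finally {
      container.resumePartition(tp);
    }
  });
}
// Records already returned by the current poll are still delivered, so keep
// max.poll.records small (or track in-flight work per partition) to preserve ordering.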
🎯 Interview Tip

“Consumer slow, rebalance loop?” — Check max.poll.interval.ms vs processing time. Fix: faster processing, smaller max.poll.records, async workers with ordered commit, or static membership to reduce churn cost.
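
The check in the tip above can be made concrete with simple arithmetic: worst-case batch time is max.poll.records times per-record processing time. The sketch below is illustrative only. The class name, the 400 ms per-record cost, and the 50% safety margin are hypothetical, and it assumes records are processed one at a time on the poll thread.

```java
// Checks whether one poll batch can be processed within max.poll.interval.ms.
// Assumes sequential processing on the poll thread.
final class PollBudget {

    /** Worst-case time to work through one full poll batch. */
    static long worstCaseBatchMillis(int maxPollRecords, long perRecordMillis) {
        return (long) maxPollRecords * perRecordMillis;
    }

    /** True when a full batch would overrun the poll interval. */
    static boolean exceedsPollInterval(int maxPollRecords, long perRecordMillis,
                                       long maxPollIntervalMs) {
        return worstCaseBatchMillis(maxPollRecords, perRecordMillis) > maxPollIntervalMs;
    }

    /** Largest max.poll.records that stays within safetyPercent of the interval. */
    static int safeMaxPollRecords(long perRecordMillis, long maxPollIntervalMs, int safetyPercent) {
        if (perRecordMillis <= 0) {
            throw new IllegalArgumentException("perRecordMillis must be positive");
        }
        long budget = maxPollIntervalMs * safetyPercent / 100;
        return (int) Math.max(1L, budget / perRecordMillis);
    }

    public static void main(String[] args) {
        // max.poll.records=1000 and max.poll.interval.ms=300000 from the config above;
        // 400 ms per record is a hypothetical slow handler.
        System.out.println(exceedsPollInterval(1000, 400, 300_000));
        System.out.println(safeMaxPollRecords(400, 300_000, 50));
    }
}
```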

Broker throughput tuning

Brokers append sequentially to on-disk logs and serve fetches with zero-copy sends to the network. Performance comes from the OS page cache, fast filesystems, and enough network threads, not from oversized JVM heaps.

Filesystem and mount options

  • XFS or ext4 — standard choices; XFS common on large sequential volumes
  • Separate disks — log dirs on dedicated NVMe/SSD; avoid sharing with OS or ZooKeeper (legacy)
  • noatime — disable access-time updates on log mount; reduces metadata writes
  • RAID10 / JBOD — JBOD with one log dir per disk often beats RAID5 for write latency
bash — fstab excerpt
/dev/nvme0n1  /var/kafka-logs  xfs  defaults,noatime,nodiratime  0  0

Network sysctl

Increase socket buffer defaults so high-throughput TCP streams do not stall on small kernel buffers:

bash — /etc/sysctl.d/kafka.conf
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 1048576 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 250000

Align with broker socket.send.buffer.bytes / socket.receive.buffer.bytes (often 100 KB–1 MB).
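
One common way to sanity-check socket buffer sizes is the bandwidth-delay product: a single TCP connection needs roughly link speed times round-trip time of buffer to keep the link busy. This rule of thumb is not part of the settings above, and the class name and example inputs below are hypothetical.

```java
// Bandwidth-delay product (BDP): bytes that must be in flight to keep a link full.
final class SocketBufferSizing {

    /** BDP in bytes: link speed (bits/s) / 8 * round-trip time (ms) / 1000. */
    static long bandwidthDelayProductBytes(long linkBitsPerSecond, double roundTripMillis) {
        return Math.round(linkBitsPerSecond / 8.0 * roundTripMillis / 1000.0);
    }

    public static void main(String[] args) {
        // Hypothetical 10 Gbps link with a 1 ms round trip.
        System.out.println(bandwidthDelayProductBytes(10_000_000_000L, 1.0) + " bytes");
    }
}
```

Under these example inputs the result stays below the 16 MB maximum configured above, so the kernel limits would not be the constraint for a single connection.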

VM and page cache

| sysctl | Recommended | Purpose |
|---|---|---|
| vm.swappiness | 1 | Strongly prefer RAM page cache over swap; swapping kills tail latency |
| vm.dirty_ratio | 80 | Allow large dirty page cache before sync pressure |
| vm.dirty_background_ratio | 5 | Background flush starts earlier; smoother writeback |

🔬 Under the Hood

Kafka brokers send data via sendfile from page cache when segments are hot—no user-space copy. That is why 64 GB RAM with 6 GB JVM beats 16 GB RAM with 12 GB JVM: cache holds recent segments for consumer fetch.
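
The RAM comparison above can be turned into a rough "how far behind can a consumer be and still read from memory" estimate. The sketch below is an approximation under stated assumptions: it treats all RAM outside the heap as page cache (ignoring OS and off-heap overhead), and the class name and the 100 MiB/s ingest rate are hypothetical.

```java
// Approximates how many seconds of recent writes fit in the OS page cache.
final class PageCacheWindow {

    /** RAM left for page cache after the JVM heap (ignores OS and off-heap overhead). */
    static long cacheGiB(long ramGiB, long heapGiB) {
        return Math.max(0L, ramGiB - heapGiB);
    }

    /** Seconds of writes the cache can hold at a steady ingest rate in MiB/s. */
    static double secondsCached(long ramGiB, long heapGiB, double ingestMiBPerSec) {
        return cacheGiB(ramGiB, heapGiB) * 1024.0 / ingestMiBPerSec;
    }

    public static void main(String[] args) {
        // The two brokers from the text, at a hypothetical 100 MiB/s ingest per broker.
        System.out.println(secondsCached(64, 6, 100.0));
        System.out.println(secondsCached(16, 12, 100.0));
    }
}
```

With these inputs the large-RAM broker keeps roughly ten minutes of data hot, while the small one keeps well under a minute, so even slightly lagging consumers hit disk.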

JVM heap

4–6 GB max heap; set -Xms equal to -Xmx to avoid resize pauses. Remaining RAM for OS cache. G1GC with low pause target.

bash — KAFKA_HEAP_OPTS
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC \
  -XX:MaxGCPauseMillis=20 \
  -XX:InitiatingHeapOccupancyPercent=35 \
  -XX:+DisableExplicitGC"

Broker thread pools

| Property | Guideline |
|---|---|
| num.io.threads | 2 × number of data disks (e.g. 8 disks → 16 threads) |
| num.network.threads | 3–8 baseline; raise if NetworkProcessorAvgIdlePercent < 0.3 |
| num.replica.fetchers | 1–4 per broker; increase for cross-AZ replication bandwidth |

Properties — broker throughput excerpt
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
replica.fetch.max.bytes=1048576
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
📦 Real World

Cloud brokers: use instance types with high network bandwidth (e.g. 25 Gbps+) and local NVMe. EBS-only volumes cap throughput before Kafka config does—upgrade disk tier before tuning num.io.threads.

Latency tuning

Low-latency Kafka minimizes waiting: no linger batching, no fetch batching, no compression CPU, and replication paths that do not buffer. Throughput configs are the enemy of p99 here.

| Knob | Throughput | Low latency |
|---|---|---|
| linger.ms | 20 | 0 |
| acks | 1 | 1 (or all if durability required) |
| compression.type | lz4 | none |
| fetch.min.bytes | 100000 | 1 |
| fetch.max.wait.ms | 500 | 0 |

Producer — low latency

Properties — low-latency producer
linger.ms=0
batch.size=16384
compression.type=none
acks=1
buffer.memory=67108864

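With linger.ms=0 the producer sends as soon as the sender thread is free, so batches are usually small (records that arrive while a request is in flight still batch together). For sub-10 ms paths, co-locate producers with brokers (same AZ). acks=all adds a replica round trip: required for durability, but it costs latency.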

Consumer — low latency

Properties — low-latency consumer
fetch.min.bytes=1
fetch.max.wait.ms=0
max.poll.records=100

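With fetch.min.bytes=1 (the default), the broker answers a fetch as soon as any data is available. fetch.max.wait.ms=0 additionally makes fetches return immediately when there is no data, which means more requests and higher CPU on both consumer and broker; measure whether it buys any latency over fetch.min.bytes=1 alone.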

Broker — replication fetch

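  • num.replica.fetchers: more parallel follower fetch threads
  • replica.fetch.min.bytes=1: followers do not wait for large fetch batches
  • Keep ISR healthy: with acks=all the leader waits for every in-sync replica, and consumers read only up to the high watermark, so a slow follower that is still in the ISR delays both acks and consumer visibility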

Avoid on latency-critical paths

  • Compression — CPU cycles add milliseconds; use only if network is the proven bottleneck
  • Large batches / high linger — intentional delay to fill batches
  • TLS without hardware acceleration — software AES/GCM at very high RPS adds up; use AES-NI capable instances
  • Oversized max.poll.records — long processing before next fetch
flowchart TB
  subgraph budget["Typical p99 budget (same AZ)"]
    L[linger 0–5ms]
    N[Network RTT 1–3ms]
    B[Broker append 1–5ms]
    R[Replication if acks=all 2–10ms]
    C[Consumer poll 1–50ms]
  end
  L --> N --> B --> R --> C
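
Summing the stages in the budget diagram gives a quick best-case and worst-case envelope. The sketch below just adds the ranges from the diagram; the class name is hypothetical, and adding per-stage worst cases is pessimistic because real p99 values do not add linearly.

```java
// Sums the per-stage ranges from the p99 budget diagram (milliseconds).
final class LatencyBudget {

    // Each stage: {best case ms, worst case ms}.
    static final int[][] STAGES = {
        {0, 5},   // linger
        {1, 3},   // network RTT
        {1, 5},   // broker append
        {2, 10},  // replication, only with acks=all
        {1, 50},  // consumer poll
    };

    /** Sum of one column: 0 = best case, 1 = worst case. */
    static int total(int[][] stages, int column) {
        int sum = 0;
        for (int[] stage : stages) {
            sum += stage[column];
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(total(STAGES, 0) + " ms to " + total(STAGES, 1) + " ms end to end");
    }
}
```

The consumer poll stage dominates the upper bound, which is why fetch settings matter as much as producer settings on latency-critical paths.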
⚖️ Trade-off

You cannot max throughput and min latency with one config profile. Run separate producer factories (or clusters) for telemetry firehose vs payment command topics.
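
One way to keep the two profiles apart is to build them from separate factory methods so they cannot drift into each other. This is a minimal sketch using only `java.util.Properties`. The class and method names are hypothetical, the values are copied from the two producer listings in this document, and serializers plus the actual `KafkaProducer` construction are left out.

```java
import java.util.Properties;

// Two producer config profiles, one per workload type.
// Values come from the throughput and low-latency listings in this document.
final class ProducerProfiles {

    /** Telemetry firehose: large batches, lz4, 20 ms linger. */
    static Properties throughput(String bootstrapServers) {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", bootstrapServers);
        p.setProperty("acks", "1");
        p.setProperty("batch.size", "131072");
        p.setProperty("linger.ms", "20");
        p.setProperty("compression.type", "lz4");
        p.setProperty("buffer.memory", "134217728");
        return p;
    }

    /** Latency-critical path: no linger, no compression. */
    static Properties lowLatency(String bootstrapServers) {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", bootstrapServers);
        p.setProperty("acks", "1"); // switch to "all" where durability is required
        p.setProperty("batch.size", "16384");
        p.setProperty("linger.ms", "0");
        p.setProperty("compression.type", "none");
        p.setProperty("buffer.memory", "67108864");
        return p;
    }

    public static void main(String[] args) {
        System.out.println(throughput("kafka:9092"));
        System.out.println(lowLatency("kafka:9092"));
    }
}
```

Each `Properties` object would then be passed to its own producer instance after adding key and value serializers.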

Partition count guidelines

Partitions are the unit of parallelism—and operational weight. Too few starves consumers; too many slows rebalance and bloats metadata. Size from measured throughput, not guesses.

Too few partitions

  • Single consumer thread per topic caps throughput
  • Hot partition on skewed key distribution
  • Cannot scale consumer fleet past partition count

Too many partitions

  • More leader elections and metadata on controller
  • Slower consumer group rebalance (more assignments)
  • More file handles and segment files per broker (~2–4 per partition minimum)
  • Producer batches spread thin—smaller batches if key space is sparse

Sizing formula

partitions ≈ target_throughput ÷ single_partition_throughput

single_partition_throughput (rule of thumb):
  HDD brokers:     ~10–30 MB/s
  SSD brokers:     ~30–80 MB/s
  NVMe + 10 Gbps:  ~50–100+ MB/s  (measure with perf test!)

Example:
  Target 600 MB/s ingest, measured 40 MB/s per partition → 600 ÷ 40 = 15 → use 18–24 with headroom
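
The sizing formula can be sketched as a small helper. The class and method names are hypothetical, the headroom percentage is an input you choose, and the per-partition figure should come from your own perf test rather than the rule-of-thumb ranges.

```java
// partitions ≈ target throughput ÷ measured single-partition throughput, plus headroom.
final class PartitionSizing {

    /** Minimum partitions, rounded up. */
    static int basePartitions(long targetMBps, long perPartitionMBps) {
        if (targetMBps <= 0 || perPartitionMBps <= 0) {
            throw new IllegalArgumentException("throughput values must be positive");
        }
        return (int) ((targetMBps + perPartitionMBps - 1) / perPartitionMBps);
    }

    /** Partitions with extra headroom, e.g. headroomPercent=20 adds 20%. */
    static int withHeadroom(long targetMBps, long perPartitionMBps, int headroomPercent) {
        if (targetMBps <= 0 || perPartitionMBps <= 0 || headroomPercent < 0) {
            throw new IllegalArgumentException("inputs must be positive, headroom non-negative");
        }
        long numerator = targetMBps * (100 + headroomPercent);
        long denominator = perPartitionMBps * 100;
        return (int) ((numerator + denominator - 1) / denominator);
    }

    public static void main(String[] args) {
        // The example from the text: 600 MB/s target, 40 MB/s measured per partition.
        System.out.println(basePartitions(600, 40));
        System.out.println(withHeadroom(600, 40, 20) + " to " + withHeadroom(600, 40, 60));
    }
}
```

Integer arithmetic with explicit round-up avoids floating-point surprises when the division is not exact.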

Start with 12–24 for medium workloads; revisit when BytesInPerSec per broker or consumer lag growth indicates saturation. See Topic management—partitions cannot be reduced.

| Signal | Action |
|---|---|
| Consumer lag grows, CPU low | Add partitions + consumers (or fix slow handler) |
| One partition lag spikes | Key skew; salt keys or use a custom partitioner |
| Rebalance > 30s routinely | Reduce partitions per broker cluster-wide |
| Broker > 2000 partitions | Add brokers; avoid new high-partition topics |

flowchart LR
  Few[Too few\nconsumer bound]
  Sweet[Sweet spot\n12-48 typical topic]
  Many[Too many\nrebalance + metadata cost]

  Few -->|add partitions| Sweet
  Sweet -->|uncapped growth| Many
⚙️ Config

Document partition count rationale in topic catalog: target MB/s, key cardinality, max consumer instances, measured perf test date. Future you will not remember why orders has 48 partitions.

🎯 Interview Tip

“How many partitions?” Never a universal number. Formula: throughput target ÷ per-partition measured throughput; cap by broker partition budget (~2k–4k/broker). Increasing partitions does not move existing keyed data, but new records for a key may hash to a different partition, so per-key ordering can break across the change.
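
The effect of growing a topic on keyed records can be illustrated with a hash-mod partitioner. The sketch below is a stand-in, not Kafka's implementation: Kafka's default partitioner hashes the serialized key with murmur2, while this example takes a precomputed integer hash. The class and method names are hypothetical.

```java
// Shows how hash-mod partition assignment changes when the partition count grows.
// Stand-in model: takes an integer key hash instead of running murmur2 on key bytes.
final class KeyRemapping {

    /** Partition for a key hash under a simple hash-mod scheme. */
    static int partitionFor(int keyHash, int numPartitions) {
        return Math.floorMod(keyHash, numPartitions);
    }

    /** Number of keys whose partition differs between the old and new counts. */
    static int movedKeys(int[] keyHashes, int oldCount, int newCount) {
        int moved = 0;
        for (int hash : keyHashes) {
            if (partitionFor(hash, oldCount) != partitionFor(hash, newCount)) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        int[] hashes = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
        // Growing from 6 to 12 partitions.
        System.out.println(movedKeys(hashes, 6, 12) + " of " + hashes.length + " keys move");
    }
}
```

Records written before the change stay where they are, so a key that moves has history on one partition and new records on another.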