Kafka Performance Tuning

Producer throughput tuning

Producer throughput is batching + compression + async pipelining. The producer accumulates records per partition, compresses the batch, and sends in fewer requests. Blocking on every send destroys all of that.

Benchmark baseline: kafka-producer-perf-test.sh

Establish MB/s and records/s before changing configs. Run from a host with network proximity to brokers:

kafka-producer-perf-test.sh \
  --topic perf-test \
  --num-records 5000000 \
  --record-size 1024 \
  --throughput -1 \
  --producer-props \
    bootstrap.servers=kafka:9092 \
    acks=1 \
    batch.size=131072 \
    linger.ms=20 \
    compression.type=lz4 \
    buffer.memory=134217728

# Output: records/sec, MB/sec, avg latency, max latency, 50th/95th/99th/999th percentiles

Vary one knob per run. Save results with topic partition count and broker version noted— perf tests on empty clusters overstate real-world numbers.

Throughput-oriented producer config

Property	Throughput value	Why
`batch.size`	131072 (128 KB)	Larger batches per partition → fewer requests
`linger.ms`	20	Wait up to 20ms to fill batch — trades latency for size
`compression.type`	lz4	Fast CPU, good ratio; zstd if CPU headroom and WAN
`buffer.memory`	134217728 (128 MB)	More in-flight batch buffer before `max.block.ms` blocks
`acks`	1	Leader ack only — faster than `all`; not durability-safe

batch.size=131072
linger.ms=20
compression.type=lz4
buffer.memory=134217728
acks=1
retries=3
max.in.flight.requests.per.connection=5

⚠️ Pitfall

acks=1 + high throughput is for metrics and load tests—not billing or inventory. Production durability paths use acks=all and accept lower ceiling MB/s.

Async send — never block in a loop

producer.send(record).get() in a tight loop serializes on broker RTT—throughput collapses to ~1 request at a time. Fire async sends and optionally flush at interval or use callback batching.

// BAD — synchronous, ~hundreds msg/s
for (Event e : events) {
  producer.send(new ProducerRecord<>("topic", e.getKey(), e)).get();
}

// GOOD — async pipeline, flush periodically
List<CompletableFuture<RecordMetadata>> futures = new ArrayList<>();
for (Event e : events) {
  futures.add(producer.send(new ProducerRecord<>("topic", e.getKey(), e))
      .thenApply(r -> r.recordMetadata()));
}
CompletableFuture.allOf(futures.toArray(CompletableFuture[]::new)).join();
// or: producer.flush() every N records in streaming loop

Partition count and producer parallelism

Producer maintains a separate batch per partition. More partitions = more parallel in-flight requests to brokers— up to broker and network limits. Diminishing returns past ~100–200 partitions per topic without more brokers. Align with target consumer parallelism (see Partition guidelines).

flowchart LR
  P[Producer]
  B0[Batch partition 0]
  B1[Batch partition 1]
  B2[Batch partition 2]
  BR[Broker leaders]

  P --> B0
  P --> B1
  P --> B2
  B0 --> BR
  B1 --> BR
  B2 --> BR

💡 Pro Tip

Watch batch-size-avg JMX metric—if far below batch.size, increase linger.ms or record rate. Empty batches mean you are latency-tuned, not throughput-tuned.

Consumer throughput tuning

Consumers are fetch-bound and process-bound. Tune fetch size to reduce empty polls, increase max.poll.records, and never run slow I/O inside the poll thread longer than max.poll.interval.ms.

Benchmark: kafka-consumer-perf-test.sh

# First, load topic with producer perf test, then:
kafka-consumer-perf-test.sh \
  --bootstrap-server kafka:9092 \
  --topic perf-test \
  --messages 5000000 \
  --threads 1 \
  --consumer.config consumer-perf.properties

# consumer-perf.properties:
# fetch.min.bytes=100000
# fetch.max.wait.ms=500
# max.poll.records=1000

Throughput-oriented consumer config

Property	Throughput value	Effect
`fetch.min.bytes`	100000 (~100 KB)	Broker waits until at least this much data per fetch
`fetch.max.wait.ms`	500	Max wait to accumulate fetch.min.bytes
`max.poll.records`	1000	More records per poll loop iteration
`fetch.max.bytes`	52428800 (50 MB)	Upper bound per fetch response
`max.partition.fetch.bytes`	1048576 (1 MB)	Per-partition cap within fetch

fetch.min.bytes=100000
fetch.max.wait.ms=500
max.poll.records=1000
max.poll.interval.ms=300000
session.timeout.ms=45000
enable.auto.commit=false

Parallel processing

Two levers—often combined:

Increase partition count — more buckets for consumer group to assign
Increase concurrency — Spring concurrency or multiple consumer instances in same group

Effective parallelism = min(partitions, consumer_threads). Adding 20 threads to 6 partitions wastes 14 threads.

Avoid slow work in the poll loop

If processing exceeds max.poll.interval.ms, coordinator revokes partitions—rebalance storm. Patterns for heavy handlers:

Delegate to thread pool — poll thread hands off records; commit after worker completes (careful with ordering)
pause / resume — pause partition while backlog processes, resume when caught up
Batch + bulk insert — DB bulk write per poll batch instead of per-record round trips

@KafkaListener(id = "heavy-listener", topics = "events")
public void onMessage(ConsumerRecord<String, Event> record,
    Consumer<String, Event> consumer,
    Acknowledgment ack) {
  TopicPartition tp = new TopicPartition(record.topic(), record.partition());
  consumer.pause(List.of(tp));
  workerPool.submit(() -> {
    try {
      heavyProcess(record.value());
      ack.acknowledge();
    } finally {
      consumer.resume(List.of(tp));
    }
  });
}

🎯 Interview Tip

“Consumer slow, rebalance loop?” — Check max.poll.interval.ms vs processing time. Fix: faster processing, smaller max.poll.records, async workers with ordered commit, or static membership to reduce churn cost.

Broker throughput tuning

Brokers are sequential-write log servers on disk and zero-copy send on network. Performance comes from OS page cache, fast filesystems, enough network threads—and not oversized JVM heaps.

Filesystem and mount options

XFS or ext4 — standard choices; XFS common on large sequential volumes
Separate disks — log dirs on dedicated NVMe/SSD; avoid sharing with OS or ZooKeeper (legacy)
noatime — disable access-time updates on log mount; reduces metadata writes
RAID10 / JBOD — JBOD with one log dir per disk often beats RAID5 for write latency

/dev/nvme0n1  /var/kafka-logs  xfs  defaults,noatime,nodiratime  0  0

Network sysctl

Increase socket buffer defaults so high-throughput TCP streams do not stall on small kernel buffers:

net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 1048576 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 250000

Align with broker socket.send.buffer.bytes / socket.receive.buffer.bytes (often 100 KB–1 MB).

VM and page cache

sysctl	Recommended	Purpose
`vm.swappiness`	1	Strongly prefer RAM page cache over swap — swapping kills tail latency
`vm.dirty_ratio`	80	Allow large dirty page cache before sync pressure
`vm.dirty_background_ratio`	5	Background flush starts earlier — smoother writeback

🔬 Under the Hood

Kafka brokers send data via sendfile from page cache when segments are hot—no user-space copy. That is why 64 GB RAM with 6 GB JVM beats 16 GB RAM with 12 GB JVM: cache holds recent segments for consumer fetch.

JVM heap

4–6 GB max heap; set -Xms equal to -Xmx to avoid resize pauses. Remaining RAM for OS cache. G1GC with low pause target.

export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC \
  -XX:MaxGCPauseMillis=20 \
  -XX:InitiatingHeapOccupancyPercent=35 \
  -XX:+DisableExplicitGC"

Broker thread pools

Property	Guideline
`num.io.threads`	2 × number of data disks (e.g. 8 disks → 16 threads)
`num.network.threads`	3–8 baseline; raise if `NetworkProcessorAvgIdlePercent` < 0.3
`num.replica.fetchers`	1–4 per broker; increase for cross-AZ replication bandwidth

num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
replica.fetch.max.bytes=1048576
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000

📦 Real World

Cloud brokers: use instance types with high network bandwidth (e.g. 25 Gbps+) and local NVMe. EBS-only volumes cap throughput before Kafka config does—upgrade disk tier before tuning num.io.threads.

Latency tuning

Low-latency Kafka minimizes waiting: no linger batching, no fetch batching, no compression CPU, and replication paths that do not buffer. Throughput configs are the enemy of p99 here.

Knob	Throughput	Low latency
linger.ms	20	0
acks	1	1 (or all if durability required)
compression.type	lz4	none
fetch.min.bytes	100000	1
fetch.max.wait.ms	500	0

Producer — low latency

linger.ms=0
batch.size=16384
compression.type=none
acks=1
buffer.memory=67108864

Each record ships immediately (per partition batch may still be small). For sub-10ms paths, co-locate producers with brokers (same AZ). acks=all adds replica round-trip—required for durability, costs latency.

Consumer — low latency

fetch.min.bytes=1
fetch.max.wait.ms=0
max.poll.records=100

Consumer returns as soon as any data is available—more polls, higher CPU, lower end-to-end delay.

Broker — replication fetch

num.replica.fetchers — more parallel follower fetch threads
replica.fetch.min.bytes=1 — followers do not wait for large fetch batches
Keep ISR healthy — out-of-sync replicas force leader-only reads or delayed commits

Avoid on latency-critical paths

Compression — CPU cycles add milliseconds; use only if network is the proven bottleneck
Large batches / high linger — intentional delay to fill batches
TLS without hardware acceleration — software AES/GCM at very high RPS adds up; use AES-NI capable instances
Oversized max.poll.records — long processing before next fetch

flowchart TB
  subgraph budget["Typical p99 budget (same AZ)"]
    L[linger 0–5ms]
    N[Network RTT 1–3ms]
    B[Broker append 1–5ms]
    R[Replication if acks=all 2–10ms]
    C[Consumer poll 1–50ms]
  end
  L --> N --> B --> R --> C

⚖️ Trade-off

You cannot max throughput and min latency with one config profile. Run separate producer factories (or clusters) for telemetry firehose vs payment command topics.

Partition count guidelines

Partitions are the unit of parallelism—and operational weight. Too few starves consumers; too many slows rebalance and bloats metadata. Size from measured throughput, not guesses.

Too few partitions

Single consumer thread per topic caps throughput
Hot partition on skewed key distribution
Cannot scale consumer fleet past partition count

Too many partitions

More leader elections and metadata on controller
Slower consumer group rebalance (more assignments)
More file handles and segment files per broker (~2–4 per partition minimum)
Producer batches spread thin—smaller batches if key space is sparse

Sizing formula

partitions ≈ target_throughput ÷ single_partition_throughput

single_partition_throughput (rule of thumb):
  HDD brokers:     ~10–30 MB/s
  SSD brokers:     ~30–80 MB/s
  NVMe + 10 Gbps:  ~50–100+ MB/s  (measure with perf test!)

Example:
  Target 600 MB/s ingest, measured 40 MB/s per partition → 600 ÷ 40 = 15 → use 18–24 with headroom

Start with 12–24 for medium workloads; revisit when BytesInPerSec per broker or consumer lag growth indicates saturation. See Topic management—partitions cannot be reduced.

Signal	Action
Consumer lag grows, CPU low	Add partitions + consumers (or fix slow handler)
One partition lag spikes	Key skew—salt keys or custom partitioner
Rebalance > 30s routinely	Reduce partitions per broker cluster-wide
Broker > 2000 partitions	Add brokers; avoid new high-partition topics

flowchart LR
  Few[Too few\nconsumer bound]
  Sweet[Sweet spot\n12-48 typical topic]
  Many[Too many\nrebalance + metadata cost]

  Few -->|add partitions| Sweet
  Sweet -->|uncapped growth| Many

⚙️ Config

Document partition count rationale in topic catalog: target MB/s, key cardinality, max consumer instances, measured perf test date. Future you will not remember why orders has 48 partitions.

🎯 Interview Tip

“How many partitions?” — Never a universal number. Formula: throughput target ÷ per-partition measured throughput; cap by broker partition budget (~2k–4k/broker). Increasing partitions does not reshuffle existing keyed data.