Kafka Performance Tuning
Throughput and latency pull in opposite directions—fat batches and lz4 compression maximize MB/s; linger.ms=0 and fetch.min.bytes=1 minimize p99. Measure with official perf tools, tune one layer at a time, and never optimize without a baseline.
Producer throughput tuning
Producer throughput is batching + compression + async pipelining. The producer accumulates records per partition, compresses the batch, and sends in fewer requests. Blocking on every send destroys all of that.
Benchmark baseline: kafka-producer-perf-test.sh
Establish MB/s and records/s before changing configs. Run from a host with network proximity to brokers:
kafka-producer-perf-test.sh \
--topic perf-test \
--num-records 5000000 \
--record-size 1024 \
--throughput -1 \
--producer-props \
bootstrap.servers=kafka:9092 \
acks=1 \
batch.size=131072 \
linger.ms=20 \
compression.type=lz4 \
buffer.memory=134217728
# Output: records/sec, MB/sec, avg latency, max latency, 50th/95th/99th/999th percentiles
Vary one knob per run. Save results with topic partition count and broker version noted— perf tests on empty clusters overstate real-world numbers.
Throughput-oriented producer config
| Property | Throughput value | Why |
|---|---|---|
batch.size |
131072 (128 KB) | Larger batches per partition → fewer requests |
linger.ms |
20 | Wait up to 20ms to fill batch — trades latency for size |
compression.type |
lz4 | Fast CPU, good ratio; zstd if CPU headroom and WAN |
buffer.memory |
134217728 (128 MB) | More in-flight batch buffer before max.block.ms blocks |
acks |
1 | Leader ack only — faster than all; not durability-safe |
batch.size=131072
linger.ms=20
compression.type=lz4
buffer.memory=134217728
acks=1
retries=3
max.in.flight.requests.per.connection=5
acks=1 + high throughput is for metrics and load tests—not billing or inventory. Production durability paths use acks=all and accept lower ceiling MB/s.
Async send — never block in a loop
producer.send(record).get() in a tight loop serializes on broker RTT—throughput collapses to ~1 request at a time. Fire async sends and optionally flush at interval or use callback batching.
// BAD — synchronous, ~hundreds msg/s
for (Event e : events) {
producer.send(new ProducerRecord<>("topic", e.getKey(), e)).get();
}
// GOOD — async pipeline, flush periodically
List<CompletableFuture<RecordMetadata>> futures = new ArrayList<>();
for (Event e : events) {
futures.add(producer.send(new ProducerRecord<>("topic", e.getKey(), e))
.thenApply(r -> r.recordMetadata()));
}
CompletableFuture.allOf(futures.toArray(CompletableFuture[]::new)).join();
// or: producer.flush() every N records in streaming loop
Partition count and producer parallelism
Producer maintains a separate batch per partition. More partitions = more parallel in-flight requests to brokers— up to broker and network limits. Diminishing returns past ~100–200 partitions per topic without more brokers. Align with target consumer parallelism (see Partition guidelines).
flowchart LR P[Producer] B0[Batch partition 0] B1[Batch partition 1] B2[Batch partition 2] BR[Broker leaders] P --> B0 P --> B1 P --> B2 B0 --> BR B1 --> BR B2 --> BR
Watch batch-size-avg JMX metric—if far below batch.size, increase linger.ms or record rate. Empty batches mean you are latency-tuned, not throughput-tuned.
Consumer throughput tuning
Consumers are fetch-bound and process-bound. Tune fetch size to reduce empty polls, increase max.poll.records, and never run slow I/O inside the poll thread longer than max.poll.interval.ms.
Benchmark: kafka-consumer-perf-test.sh
# First, load topic with producer perf test, then:
kafka-consumer-perf-test.sh \
--bootstrap-server kafka:9092 \
--topic perf-test \
--messages 5000000 \
--threads 1 \
--consumer.config consumer-perf.properties
# consumer-perf.properties:
# fetch.min.bytes=100000
# fetch.max.wait.ms=500
# max.poll.records=1000
Throughput-oriented consumer config
| Property | Throughput value | Effect |
|---|---|---|
fetch.min.bytes |
100000 (~100 KB) | Broker waits until at least this much data per fetch |
fetch.max.wait.ms |
500 | Max wait to accumulate fetch.min.bytes |
max.poll.records |
1000 | More records per poll loop iteration |
fetch.max.bytes |
52428800 (50 MB) | Upper bound per fetch response |
max.partition.fetch.bytes |
1048576 (1 MB) | Per-partition cap within fetch |
fetch.min.bytes=100000
fetch.max.wait.ms=500
max.poll.records=1000
max.poll.interval.ms=300000
session.timeout.ms=45000
enable.auto.commit=false
Parallel processing
Two levers—often combined:
- Increase partition count — more buckets for consumer group to assign
- Increase concurrency — Spring concurrency or multiple consumer instances in same group
Effective parallelism = min(partitions, consumer_threads). Adding 20 threads to 6 partitions wastes 14 threads.
Avoid slow work in the poll loop
If processing exceeds max.poll.interval.ms, coordinator revokes partitions—rebalance storm. Patterns for heavy handlers:
- Delegate to thread pool — poll thread hands off records; commit after worker completes (careful with ordering)
- pause / resume — pause partition while backlog processes, resume when caught up
- Batch + bulk insert — DB bulk write per poll batch instead of per-record round trips
@KafkaListener(id = "heavy-listener", topics = "events")
public void onMessage(ConsumerRecord<String, Event> record,
Consumer<String, Event> consumer,
Acknowledgment ack) {
TopicPartition tp = new TopicPartition(record.topic(), record.partition());
consumer.pause(List.of(tp));
workerPool.submit(() -> {
try {
heavyProcess(record.value());
ack.acknowledge();
} finally {
consumer.resume(List.of(tp));
}
});
}
“Consumer slow, rebalance loop?” — Check max.poll.interval.ms vs processing time. Fix: faster processing, smaller max.poll.records, async workers with ordered commit, or static membership to reduce churn cost.
Broker throughput tuning
Brokers are sequential-write log servers on disk and zero-copy send on network. Performance comes from OS page cache, fast filesystems, enough network threads—and not oversized JVM heaps.
Filesystem and mount options
- XFS or ext4 — standard choices; XFS common on large sequential volumes
- Separate disks — log dirs on dedicated NVMe/SSD; avoid sharing with OS or ZooKeeper (legacy)
- noatime — disable access-time updates on log mount; reduces metadata writes
- RAID10 / JBOD — JBOD with one log dir per disk often beats RAID5 for write latency
/dev/nvme0n1 /var/kafka-logs xfs defaults,noatime,nodiratime 0 0
Network sysctl
Increase socket buffer defaults so high-throughput TCP streams do not stall on small kernel buffers:
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 1048576 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 250000
Align with broker socket.send.buffer.bytes / socket.receive.buffer.bytes (often 100 KB–1 MB).
VM and page cache
| sysctl | Recommended | Purpose |
|---|---|---|
vm.swappiness |
1 | Strongly prefer RAM page cache over swap — swapping kills tail latency |
vm.dirty_ratio |
80 | Allow large dirty page cache before sync pressure |
vm.dirty_background_ratio |
5 | Background flush starts earlier — smoother writeback |
Kafka brokers send data via sendfile from page cache when segments are hot—no user-space copy. That is why 64 GB RAM with 6 GB JVM beats 16 GB RAM with 12 GB JVM: cache holds recent segments for consumer fetch.
JVM heap
4–6 GB max heap; set -Xms equal to -Xmx to avoid resize pauses. Remaining RAM for OS cache. G1GC with low pause target.
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC \
-XX:MaxGCPauseMillis=20 \
-XX:InitiatingHeapOccupancyPercent=35 \
-XX:+DisableExplicitGC"
Broker thread pools
| Property | Guideline |
|---|---|
num.io.threads |
2 × number of data disks (e.g. 8 disks → 16 threads) |
num.network.threads |
3–8 baseline; raise if NetworkProcessorAvgIdlePercent < 0.3 |
num.replica.fetchers |
1–4 per broker; increase for cross-AZ replication bandwidth |
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
replica.fetch.max.bytes=1048576
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
Cloud brokers: use instance types with high network bandwidth (e.g. 25 Gbps+) and local NVMe. EBS-only volumes cap throughput before Kafka config does—upgrade disk tier before tuning num.io.threads.
Latency tuning
Low-latency Kafka minimizes waiting: no linger batching, no fetch batching, no compression CPU, and replication paths that do not buffer. Throughput configs are the enemy of p99 here.
| Knob | Throughput | Low latency |
|---|---|---|
| linger.ms | 20 | 0 |
| acks | 1 | 1 (or all if durability required) |
| compression.type | lz4 | none |
| fetch.min.bytes | 100000 | 1 |
| fetch.max.wait.ms | 500 | 0 |
Producer — low latency
linger.ms=0
batch.size=16384
compression.type=none
acks=1
buffer.memory=67108864
Each record ships immediately (per partition batch may still be small). For sub-10ms paths, co-locate producers with brokers (same AZ). acks=all adds replica round-trip—required for durability, costs latency.
Consumer — low latency
fetch.min.bytes=1
fetch.max.wait.ms=0
max.poll.records=100
Consumer returns as soon as any data is available—more polls, higher CPU, lower end-to-end delay.
Broker — replication fetch
- num.replica.fetchers — more parallel follower fetch threads
- replica.fetch.min.bytes=1 — followers do not wait for large fetch batches
- Keep ISR healthy — out-of-sync replicas force leader-only reads or delayed commits
Avoid on latency-critical paths
- Compression — CPU cycles add milliseconds; use only if network is the proven bottleneck
- Large batches / high linger — intentional delay to fill batches
- TLS without hardware acceleration — software AES/GCM at very high RPS adds up; use AES-NI capable instances
- Oversized max.poll.records — long processing before next fetch
flowchart TB
subgraph budget["Typical p99 budget (same AZ)"]
L[linger 0–5ms]
N[Network RTT 1–3ms]
B[Broker append 1–5ms]
R[Replication if acks=all 2–10ms]
C[Consumer poll 1–50ms]
end
L --> N --> B --> R --> C
You cannot max throughput and min latency with one config profile. Run separate producer factories (or clusters) for telemetry firehose vs payment command topics.
Partition count guidelines
Partitions are the unit of parallelism—and operational weight. Too few starves consumers; too many slows rebalance and bloats metadata. Size from measured throughput, not guesses.
Too few partitions
- Single consumer thread per topic caps throughput
- Hot partition on skewed key distribution
- Cannot scale consumer fleet past partition count
Too many partitions
- More leader elections and metadata on controller
- Slower consumer group rebalance (more assignments)
- More file handles and segment files per broker (~2–4 per partition minimum)
- Producer batches spread thin—smaller batches if key space is sparse
Sizing formula
partitions ≈ target_throughput ÷ single_partition_throughput single_partition_throughput (rule of thumb): HDD brokers: ~10–30 MB/s SSD brokers: ~30–80 MB/s NVMe + 10 Gbps: ~50–100+ MB/s (measure with perf test!) Example: Target 600 MB/s ingest, measured 40 MB/s per partition → 600 ÷ 40 = 15 → use 18–24 with headroom
Start with 12–24 for medium workloads; revisit when BytesInPerSec per broker or consumer lag growth indicates saturation. See Topic management—partitions cannot be reduced.
| Signal | Action |
|---|---|
| Consumer lag grows, CPU low | Add partitions + consumers (or fix slow handler) |
| One partition lag spikes | Key skew—salt keys or custom partitioner |
| Rebalance > 30s routinely | Reduce partitions per broker cluster-wide |
| Broker > 2000 partitions | Add brokers; avoid new high-partition topics |
flowchart LR Few[Too few\nconsumer bound] Sweet[Sweet spot\n12-48 typical topic] Many[Too many\nrebalance + metadata cost] Few -->|add partitions| Sweet Sweet -->|uncapped growth| Many
Document partition count rationale in topic catalog: target MB/s, key cardinality, max consumer instances, measured perf test date. Future you will not remember why orders has 48 partitions.
“How many partitions?” — Never a universal number. Formula: throughput target ÷ per-partition measured throughput; cap by broker partition budget (~2k–4k/broker). Increasing partitions does not reshuffle existing keyed data.