Kafka Operations & Administration
Production Kafka is not “set and forget.” You size partitions once (carefully), watch lag and under-replicated partitions, throttle reassignment during broker surgery, and lock down auth with SASL + ACLs. This chapter is the platform engineer’s playbook.
Topic management
Topics are the unit of parallelism and retention policy. Most admin work flows through kafka-topics.sh (or the Admin API in automation). Get partition count and RF right at creation— several knobs cannot be safely reduced later.
kafka-topics.sh essentials
# Create
kafka-topics.sh --bootstrap-server kafka:9092 \
--create --topic orders \
--partitions 24 --replication-factor 3 \
--config min.insync.replicas=2 \
--config retention.ms=604800000
# Describe (configs + partition leaders)
kafka-topics.sh --bootstrap-server kafka:9092 --describe --topic orders
# Alter partitions (increase only) or configs
kafka-topics.sh --bootstrap-server kafka:9092 \
--alter --topic orders --partitions 36
kafka-configs.sh --bootstrap-server kafka:9092 \
--entity-type topics --entity-name orders \
--alter --add-config retention.ms=1209600000
# Delete (disabled by default: delete.topic.enable=true on broker)
kafka-topics.sh --bootstrap-server kafka:9092 --delete --topic orders-staging
Partition count
Partitions bound maximum consumer parallelism (one consumer per partition per group) and spread write load across brokers. You can increase partitions; you cannot decrease without recreating the topic and migrating data.
Rule of thumb:
partitions ≈ target_throughput ÷ per_partition_throughput Example: Target: 120 MB/s ingest Per partition sustainable: ~5 MB/s (measure on your hardware) → 120 ÷ 5 = 24 partitions (round up, add headroom)
- Start higher than you think for high-throughput topics—adding partitions later does not reorder existing keys
- Keyed producers: more partitions = better key spread; too many = metadata overhead and slow rebalances
- Align with downstream task count (Flink parallelism, Connect tasks.max)
Increasing partitions on a keyed topic does not rebalance existing keys—only new keys hash to new partitions. Plan partition count before heavy production traffic.
Replication factor
Always ≥ 3 in production. RF=3 tolerates one broker loss with min.insync.replicas=2 and acks=all. RF=2 is acceptable only in dev or cost-constrained DR tiers with explicit risk acceptance.
Compaction vs retention
| Policy | log.cleanup.policy |
Use case |
|---|---|---|
| Time/size retention | delete (default) |
Event streams, logs, telemetry — keep N days or N bytes |
| Compaction | compact |
Changelog topics — latest value per key (KTable, Connect offsets) |
| Both | compact,delete |
Compacted state with TTL ceiling on stale keys |
Retention and segment configs
| Config | Scope | Meaning |
|---|---|---|
retention.ms |
Topic / broker default | Delete segments older than this (delete policy) |
retention.bytes |
Topic / broker | Max size per partition; -1 = unlimited |
segment.bytes |
Topic / broker | Roll new log segment at this size (default 1 GB) |
segment.ms |
Topic / broker | Roll segment by time even if size not reached |
min.compaction.lag.ms |
Compacted topics | Minimum age before key eligible for compaction |
max.compaction.lag.ms |
Compacted topics | Force compact stale keys after this age |
min.insync.replicas |
Topic override | Min replicas that must ack for acks=all to succeed |
kafka-topics.sh --bootstrap-server kafka:9092 \
--create --topic customer-profiles \
--partitions 12 --replication-factor 3 \
--config cleanup.policy=compact \
--config min.compaction.lag.ms=3600000 \
--config segment.bytes=268435456 \
--config min.insync.replicas=2
Per-topic min.insync.replicas=2 with cluster default.replication.factor=3 is the standard durability pair. RF=3 and min ISR=1 allows writes during single-replica partitions—avoid unless you understand the window.
Consumer group management
Consumer groups are Kafka’s scaling and delivery contract. Operators inspect lag, reset offsets for replays, and delete stale groups—always knowing resets are destructive to “processed” semantics unless consumers are idempotent.
kafka-consumer-groups.sh
# List all groups
kafka-consumer-groups.sh --bootstrap-server kafka:9092 --list
# Describe members, assignments, lag
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
--describe --group order-processor
# Delete empty group (no active members)
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
--delete --group order-processor-stale
--describe output columns:
- CURRENT-OFFSET — last committed offset per partition
- LOG-END-OFFSET — high watermark on broker
- LAG — LOG-END − CURRENT (records behind)
- CONSUMER-ID / HOST — member owning partition
Reset offsets
Group must be stopped (no active members) or use --execute with dry-run first. Always use --dry-run before --execute.
| Flag | Effect | Typical use |
|---|---|---|
--to-earliest |
Rewind to beginning of log | Full replay after bug fix |
--to-latest |
Skip to end — drop backlog | Non-critical telemetry; skip poison backlog |
--to-datetime |
Reset to offset at timestamp | Replay from incident time (ISO-8601) |
--to-offset |
Explicit offset per partition | Surgical recovery |
--by-duration |
Rewind N duration from now | “Reprocess last 2 hours” |
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
--group order-processor \
--topic orders \
--reset-offsets \
--to-datetime 2026-06-01T14:30:00.000 \
--dry-run
# If output looks correct:
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
--group order-processor \
--topic orders \
--reset-offsets \
--to-datetime 2026-06-01T14:30:00.000 \
--execute
When to reset
- Replay after processing bug — fix code, reset to incident timestamp, reprocess with idempotent sinks
- Bootstrap new service — new consumer group reads from earliest or specific datetime for historical backfill
- Skip poison backlog — to-latest after DLT quarantine (last resort)
Offset reset + non-idempotent DB writes = duplicate rows. Coordinate with downstream owners. Prefer resetting only affected partitions with --to-offset when scope is known.
Lag monitoring
--describe is the CLI lag view. For alerting, export records-lag-max via JMX/Prometheus or use Burrow (see Monitoring section). Sustained lag growth means consumers cannot keep pace—investigate processing time, partition count, or broker I/O.
“How do you replay Kafka events?” — Stop consumers → dry-run offset reset → execute → restart with idempotent handlers. Alternative: new consumer group with auto.offset.reset=earliest leaves original group untouched.
Broker configuration
Brokers are I/O-bound log servers—not memory-heavy application servers. Tune thread pools and replication, leave RAM for the OS page cache, and never enable unclean leader election in production.
Key broker defaults
| Property | Typical prod value | Notes |
|---|---|---|
num.partitions |
12–24 | Default when auto-create enabled; prefer explicit topic creation |
default.replication.factor |
3 | Auto-created topics inherit this |
log.retention.hours |
168 (7d) or use log.retention.ms |
Broker-wide default; override per topic |
log.segment.bytes |
1073741824 (1 GB) | Smaller segments = more files, faster deletion |
log.cleanup.policy |
delete | compact for changelog topics only |
min.insync.replicas |
2 | Cluster default; critical topics override |
delete.topic.enable |
true (staging), guarded in prod | Prevent accidental topic deletion via RBAC |
Network tuning
| Property | Purpose | Guidance |
|---|---|---|
num.network.threads |
Accept requests, parse protocol | Start with 8; increase if NetworkProcessor idle low |
num.io.threads |
Disk I/O request handling | Often 2× network threads on SSD/NVMe |
socket.send.buffer.bytes |
TCP send buffer | 100–1024 KB; match high-throughput NICs |
socket.receive.buffer.bytes |
TCP receive buffer | Same order as send buffer |
Replication tuning
- replica.lag.time.max.ms — follower out of ISR if no fetch within this window (default 30s)
- replica.fetch.max.bytes — max bytes per follower fetch; raise for large messages (align with message.max.bytes)
- num.replica.fetchers — parallel fetchers per broker; 1–4 typical
Unclean leader election
unclean.leader.election.enable=false in production. When true, a non-ISR replica can become leader after full ISR loss—data loss for un-replicated writes. With false, partition goes offline until ISR replica returns—availability hit, but no silent truncation.
Unclean election trades consistency for uptime. Financial and audit workloads keep it disabled. Some telemetry clusters enable it with acceptance of gap risk—document the decision.
JVM heap and GC
Kafka brokers rely on the OS page cache for hot reads—not a giant heap. ~6 GB heap is the common ceiling; more heap steals RAM from cache and can hurt throughput.
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC \
-XX:MaxGCPauseMillis=20 \
-XX:InitiatingHeapOccupancyPercent=35 \
-XX:+ExplicitGCInvokesConcurrent \
-Djava.awt.headless=true"
Broker heap holds metadata, request buffers, and replication structures—not full log segments. Segments live on disk and in page cache. That is why 64 GB RAM machines run 6 GB heap and leave 50+ GB to cache.
num.partitions=12
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false
log.retention.hours=168
log.segment.bytes=1073741824
num.network.threads=8
num.io.threads=16
replica.lag.time.max.ms=30000
replica.fetch.max.bytes=1048576
auto.create.topics.enable=false
Partition rebalancing
Partitions stick to brokers until you move them. Reassignment fixes hot brokers, replaces failed hardware, and balances leadership—but saturates network if unthrottled.
kafka-reassign-partitions.sh
Workflow: generate JSON plan → execute → verify → (optional) throttle → remove throttle.
# Generate plan: move all partitions off broker 3
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
--broker-list 3 --generate
# Save proposed assignment to reassignment.json, edit if needed, then:
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
--reassignment-json-file reassignment.json --execute
# Monitor progress
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
--reassignment-json-file reassignment.json --verify
When reassignment is needed
- Broker added — existing partitions never auto-migrate; rebalance to use new capacity
- Broker decommission — evacuate all partitions before shutdown
- Uneven distribution — one broker holds disproportionate leaders or disk usage
- Rack awareness fix — after correcting broker.rack assignments
Throttle reassignment
Unthrottled replication can peg network and spike produce latency. Set per-broker throttle before execute:
kafka-configs.sh --bootstrap-server kafka:9092 \
--alter --add-config "follower.replication.throttled.replicas=0:0:0-1000,1:0:0-1000" \
--entity-type brokers --entity-name 0
kafka-configs.sh --bootstrap-server kafka:9092 \
--alter --add-config "leader.replication.throttled.replicas=0:0:0-1000" \
--entity-type brokers --entity-name 1
# Set rate (bytes/sec) — also replica.alter.log.dirs.io.max.bytes.per.second
kafka-configs.sh --bootstrap-server kafka:9092 \
--alter --add-config "leader.replication.throttled.rate=52428800" \
--entity-type brokers --entity-name 0
# After --verify completes: remove throttle configs
flowchart LR G[--generate plan] T[Apply throttle] E[--execute] V[--verify complete] R[Remove throttle] G --> T --> E --> V --> R
Cruise Control
LinkedIn’s Cruise Control automates partition rebalancing from broker metrics (CPU, disk, network, leader count). Goals: rack-aware distribution, even load, broker capacity limits. Supports anomaly detection and self-healing proposals— the operational upgrade from manual JSON plans at scale.
Teams with 50+ brokers rarely hand-edit reassignment JSON. Cruise Control (or Confluent Auto Data Balancing) runs maintenance windows: detect hot broker → propose moves → throttle → execute → validate lag and URP metrics.
Monitoring & observability
Kafka exposes rich JMX metrics. Production stacks scrape via Prometheus JMX Exporter, dashboard in Grafana, and alert on lag and under-replication—not CPU alone.
flowchart LR B[Kafka brokers\nJMX :9999] P[Prometheus JMX Exporter\n/metrics] PR[Prometheus] G[Grafana dashboards] A[Alertmanager / PagerDuty] BU[Burrow\nlag SLA] B --> P --> PR --> G PR --> A B --> BU --> A
Broker metrics (critical)
| Metric | Healthy target | Meaning |
|---|---|---|
UnderReplicatedPartitions |
0 | Partitions with fewer ISR replicas than RF—durability risk |
ActiveControllerCount |
exactly 1 cluster-wide | 0 or >1 = cluster metadata crisis |
OfflinePartitionsCount |
0 | No leader available—writes/reads blocked |
RequestHandlerAvgIdlePercent |
> 0.2 | < 0.2 (80% busy) = request thread saturation |
NetworkProcessorAvgIdlePercent |
> 0.3 | < 0.3 = network thread bottleneck |
BytesInPerSec / BytesOutPerSec |
Baseline + headroom | Capacity planning; spot hot topics |
LogFlushRateAndTimeMs |
Stable, low p99 | Disk fsync pressure; spikes with slow storage |
Producer metrics
| Metric | Alert threshold | Notes |
|---|---|---|
record-error-rate |
0 | Any sustained error = auth, serialization, or broker reject |
record-retry-rate |
Low; spike = broker or network stress | Correlate with broker URP / leader election |
batch-size-avg |
Approach batch.size |
Small batches = throughput left on table |
compression-rate-avg |
Stable ratio | lz4/zstd CPU vs bytes saved |
request-latency-avg |
SLO-bound p99 | Includes broker + network RTT |
Consumer metrics
| Metric | Priority | Notes |
|---|---|---|
records-lag-max |
#1 consumer metric | Max lag across partitions—alert on growth rate |
fetch-rate |
High | Drops = stalled consumer or broker issue |
bytes-consumed-rate |
Medium | Throughput vs producer bytes-in |
commit-latency-avg |
Medium | High = coordinator load or large sync commits |
JMX and Prometheus
Kafka brokers expose MBeans under kafka.server, kafka.network, etc. Run JMX Exporter as a Java agent or sidecar; it maps MBean patterns to Prometheus gauges/counters on /metrics.
rules:
- pattern: kafka.server<>Value
name: kafka_under_replicated_partitions
type: GAUGE
- pattern: kafka.controller<>Value
name: kafka_active_controller_count
type: GAUGE
Grafana dashboards
Confluent publishes official Kafka Grafana dashboards (broker overview, topic, consumer lag). Import by dashboard ID or JSON; wire Prometheus datasource. Panel highlights: URP, offline partitions, bytes in/out, request handler idle.
Burrow
LinkedIn Burrow evaluates consumer lag trends—not just absolute lag—and emits status (OK, WARN, ERR, STOP). Better than static “lag > 10000” alerts that false-positive during deploys. Pair with SLA definitions per consumer group.
Page on URP > 0 for 5 minutes and offline partitions > 0 before paging on CPU. Kafka brokers run hot on page cache by design—CPU alone misleads.
“What do you monitor first on Kafka?” — Under-replicated partitions, active controller count, offline partitions, then consumer records-lag-max growth rate. Broker request handler idle explains produce latency without guessing.
Kafka security
Security is layered: encrypt wire traffic, authenticate principals, authorize per-resource operations. Kafka does not encrypt data at rest natively—that is filesystem or cloud KMS territory.
flowchart TB C[Client] TLS[SSL/TLS\nencrypt + broker identity] SASL[SASL authentication\nwho are you] ACL[ACL authorization\nwhat can you do] B[Broker] C --> TLS --> SASL --> ACL --> B
Authentication mechanisms
| Mechanism | How it works | When |
|---|---|---|
| SSL/TLS | Encrypt in transit; mTLS verifies client cert | Baseline for all production clusters |
| SASL/PLAIN | Username/password in JAAS config | Only over SSL; simple internal clusters |
| SASL/SCRAM | Hashed credentials in KRaft/ZK | Default for managed internal auth |
| SASL/GSSAPI (Kerberos) | Ticket-based enterprise SSO | Large enterprises with AD/Kerberos |
| SASL/OAUTHBEARER | OAuth2 JWT tokens | Cloud IdP, zero-trust, short-lived creds |
listeners=SASL_SSL://0.0.0.0:9093
advertised.listeners=SASL_SSL://broker1.example.com:9093
security.inter.broker.protocol=SASL_SSL
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
sasl.enabled.mechanisms=SCRAM-SHA-512
ssl.keystore.location=/var/ssl/kafka.server.keystore.jks
ssl.keystore.password=${KEYSTORE_PASSWORD}
ssl.key.password=${KEY_PASSWORD}
authorizer.class.name=org.apache.kafka.metadata.authorizer.StandardAuthorizer
super.users=User:admin
Authorization — ACLs
KRaft clusters use StandardAuthorizer (or legacy AclAuthorizer). ACLs bind principal → operation → resource. Deny rules override allow.
| Resource type | Examples |
|---|---|
Topic | orders, prefix* |
Group | order-processor |
TransactionalId | inventory-txn |
Cluster | Idempotent produce, describe cluster |
Operations: Read, Write, Create, Delete, Alter, Describe, ClusterAction, IdempotentWrite, …
# Producer: write to orders topic
kafka-acls.sh --bootstrap-server kafka:9093 --command-config admin.properties \
--add --allow-principal User:order-service \
--operation Write --topic orders
# Consumer: read topic + join group
kafka-acls.sh --bootstrap-server kafka:9093 --command-config admin.properties \
--add --allow-principal User:order-processor \
--operation Read --topic orders
kafka-acls.sh --bootstrap-server kafka:9093 --command-config admin.properties \
--add --allow-principal User:order-processor \
--operation Read --group order-processor
# List ACLs for topic
kafka-acls.sh --bootstrap-server kafka:9093 --command-config admin.properties \
--list --topic orders
Encryption at rest
Not built into Kafka. Options: LUKS/dm-crypt on broker volumes, cloud EBS encryption (AWS KMS, GCP CMEK), or encrypted filesystems. Inter-broker and client-broker TLS covers in-flight; at-rest protects stolen disks.
Audit logging
Open-source Kafka has limited built-in audit trails. Confluent Platform adds audit logs (authentication, ACL changes, admin actions). Alternatives: broker access logs, Kubernetes audit, SIEM ingestion of ACL CLI changes via GitOps PR history.
SASL/PLAIN without SSL sends credentials in cleartext. SCRAM over SASL_SSL is the minimum bar for production. Rotate SCRAM credentials via kafka-configs.sh --alter --add-config SCRAM... on broker user store.
Capacity planning
Size disk and network from peak ingest and fan-out, not averages. CPU is rarely the first bottleneck; partition count per broker has a hard operational ceiling.
Storage
disk ≈ retention_seconds × peak_bytes_in_per_sec × replication_factor × 1.2 Example: retention = 7 days = 604,800 s peak ingest = 50 MB/s RF = 3 → 604800 × 50 × 3 × 1.2 ≈ 108 TB raw cluster (spread across brokers) Add: compacted topics, mirror-maker duplication, segment overhead
The 1.2 factor covers compaction lag, segment rounding, and headroom. Monitor actual disk per broker; alert at 70% sustained—reassignment needs free space.
Network
peak_egress ≈ peak_bytes_in × (replication_factor - 1 + consumer_fan_out) Example: 50 MB/s in, RF=3, 2 consumer clusters reading full stream replication traffic: 50 × (3-1) = 100 MB/s internal consumer traffic: 50 × 2 = 100 MB/s → broker NIC planning ~200+ MB/s sustained per ingress broker (not burst-only)
Cross-AZ replication doubles cloud egress bills—prefer rack-aware same-AZ replicas when AZ failure tolerance allows, or accept cost for true AZ isolation.
CPU
Typically not the first bottleneck except with heavy compression (zstd level high), SSL everywhere, or excessive metric cardinality scraping. Profile before buying larger CPUs—add brokers for I/O parallelism first.
Partition ceiling per broker
Operational limit: ~2,000–4,000 partitions per broker (cluster-dependent). Beyond this:
- Controller metadata grows—slower leader election and topic creation
- Rebalance and recovery times stretch to hours
- Each partition = file handles and metadata memory
| Deployment size | Guidance |
|---|---|
| Small / MVP | ~1 partition per broker per high-throughput topic; RF=3 on 3 brokers |
| Growing | Add brokers before adding partitions; keep < 2000 partitions/broker |
| Large | Cruise Control capacity goals; dedicated controller/quorum nodes (KRaft) |
flowchart LR D[Disk I/O\npage cache] N[Network\nRF + fan-out] M[Metadata\npartitions / broker] C[CPU\ncompression / SSL] D -->|usually first| N --> M --> C
Capacity review checklist: peak BytesInPerSec, retention per tier, RF, consumer fan-out, partitions/broker count, disk % used, URP during peak, rebalance duration last quarter.
Run annual “peak × 2” drills: if Black Friday doubles ingest, pre-create partitions and verify broker disk at 2× retention formula. Confluent Cloud auto-scales storage; self-hosted teams add brokers in maintenance windows before marketing events.