Kafka Operations & Administration

Production Kafka is not “set and forget.” You size partitions once (carefully), watch lag and under-replicated partitions, throttle reassignment during broker surgery, and lock down auth with SASL + ACLs. This chapter is the platform engineer’s playbook.

senior platform Kafka 3.x / KRaft

Topic management

Topics are the unit of parallelism and retention policy. Most admin work flows through kafka-topics.sh (or the Admin API in automation). Get partition count and RF right at creation— several knobs cannot be safely reduced later.

kafka-topics.sh essentials

bash — topic lifecycle
# Create
kafka-topics.sh --bootstrap-server kafka:9092 \
  --create --topic orders \
  --partitions 24 --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config retention.ms=604800000

# Describe (configs + partition leaders)
kafka-topics.sh --bootstrap-server kafka:9092 --describe --topic orders

# Alter partitions (increase only) or configs
kafka-topics.sh --bootstrap-server kafka:9092 \
  --alter --topic orders --partitions 36

kafka-configs.sh --bootstrap-server kafka:9092 \
  --entity-type topics --entity-name orders \
  --alter --add-config retention.ms=1209600000

# Delete (disabled by default: delete.topic.enable=true on broker)
kafka-topics.sh --bootstrap-server kafka:9092 --delete --topic orders-staging

Partition count

Partitions bound maximum consumer parallelism (one consumer per partition per group) and spread write load across brokers. You can increase partitions; you cannot decrease without recreating the topic and migrating data.

Rule of thumb:

partitions ≈ target_throughput ÷ per_partition_throughput

Example:
  Target: 120 MB/s ingest
  Per partition sustainable: ~5 MB/s (measure on your hardware)
  → 120 ÷ 5 = 24 partitions (round up, add headroom)
  • Start higher than you think for high-throughput topics—adding partitions later does not reorder existing keys
  • Keyed producers: more partitions = better key spread; too many = metadata overhead and slow rebalances
  • Align with downstream task count (Flink parallelism, Connect tasks.max)
⚠️ Pitfall

Increasing partitions on a keyed topic does not rebalance existing keys—only new keys hash to new partitions. Plan partition count before heavy production traffic.

Replication factor

Always ≥ 3 in production. RF=3 tolerates one broker loss with min.insync.replicas=2 and acks=all. RF=2 is acceptable only in dev or cost-constrained DR tiers with explicit risk acceptance.

Compaction vs retention

Policy log.cleanup.policy Use case
Time/size retention delete (default) Event streams, logs, telemetry — keep N days or N bytes
Compaction compact Changelog topics — latest value per key (KTable, Connect offsets)
Both compact,delete Compacted state with TTL ceiling on stale keys

Retention and segment configs

Config Scope Meaning
retention.ms Topic / broker default Delete segments older than this (delete policy)
retention.bytes Topic / broker Max size per partition; -1 = unlimited
segment.bytes Topic / broker Roll new log segment at this size (default 1 GB)
segment.ms Topic / broker Roll segment by time even if size not reached
min.compaction.lag.ms Compacted topics Minimum age before key eligible for compaction
max.compaction.lag.ms Compacted topics Force compact stale keys after this age
min.insync.replicas Topic override Min replicas that must ack for acks=all to succeed
bash — compacted changelog topic
kafka-topics.sh --bootstrap-server kafka:9092 \
  --create --topic customer-profiles \
  --partitions 12 --replication-factor 3 \
  --config cleanup.policy=compact \
  --config min.compaction.lag.ms=3600000 \
  --config segment.bytes=268435456 \
  --config min.insync.replicas=2
⚙️ Config

Per-topic min.insync.replicas=2 with cluster default.replication.factor=3 is the standard durability pair. RF=3 and min ISR=1 allows writes during single-replica partitions—avoid unless you understand the window.

Consumer group management

Consumer groups are Kafka’s scaling and delivery contract. Operators inspect lag, reset offsets for replays, and delete stale groups—always knowing resets are destructive to “processed” semantics unless consumers are idempotent.

kafka-consumer-groups.sh

bash — group operations
# List all groups
kafka-consumer-groups.sh --bootstrap-server kafka:9092 --list

# Describe members, assignments, lag
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group order-processor

# Delete empty group (no active members)
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --delete --group order-processor-stale

--describe output columns:

  • CURRENT-OFFSET — last committed offset per partition
  • LOG-END-OFFSET — high watermark on broker
  • LAGLOG-END − CURRENT (records behind)
  • CONSUMER-ID / HOST — member owning partition

Reset offsets

Group must be stopped (no active members) or use --execute with dry-run first. Always use --dry-run before --execute.

Flag Effect Typical use
--to-earliest Rewind to beginning of log Full replay after bug fix
--to-latest Skip to end — drop backlog Non-critical telemetry; skip poison backlog
--to-datetime Reset to offset at timestamp Replay from incident time (ISO-8601)
--to-offset Explicit offset per partition Surgical recovery
--by-duration Rewind N duration from now “Reprocess last 2 hours”
bash — reset to datetime (dry-run then execute)
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --group order-processor \
  --topic orders \
  --reset-offsets \
  --to-datetime 2026-06-01T14:30:00.000 \
  --dry-run

# If output looks correct:
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --group order-processor \
  --topic orders \
  --reset-offsets \
  --to-datetime 2026-06-01T14:30:00.000 \
  --execute

When to reset

  • Replay after processing bug — fix code, reset to incident timestamp, reprocess with idempotent sinks
  • Bootstrap new service — new consumer group reads from earliest or specific datetime for historical backfill
  • Skip poison backlogto-latest after DLT quarantine (last resort)
⚠️ Pitfall

Offset reset + non-idempotent DB writes = duplicate rows. Coordinate with downstream owners. Prefer resetting only affected partitions with --to-offset when scope is known.

Lag monitoring

--describe is the CLI lag view. For alerting, export records-lag-max via JMX/Prometheus or use Burrow (see Monitoring section). Sustained lag growth means consumers cannot keep pace—investigate processing time, partition count, or broker I/O.

🎯 Interview Tip

“How do you replay Kafka events?” — Stop consumers → dry-run offset reset → execute → restart with idempotent handlers. Alternative: new consumer group with auto.offset.reset=earliest leaves original group untouched.

Broker configuration

Brokers are I/O-bound log servers—not memory-heavy application servers. Tune thread pools and replication, leave RAM for the OS page cache, and never enable unclean leader election in production.

Key broker defaults

Property Typical prod value Notes
num.partitions 12–24 Default when auto-create enabled; prefer explicit topic creation
default.replication.factor 3 Auto-created topics inherit this
log.retention.hours 168 (7d) or use log.retention.ms Broker-wide default; override per topic
log.segment.bytes 1073741824 (1 GB) Smaller segments = more files, faster deletion
log.cleanup.policy delete compact for changelog topics only
min.insync.replicas 2 Cluster default; critical topics override
delete.topic.enable true (staging), guarded in prod Prevent accidental topic deletion via RBAC

Network tuning

Property Purpose Guidance
num.network.threads Accept requests, parse protocol Start with 8; increase if NetworkProcessor idle low
num.io.threads Disk I/O request handling Often 2× network threads on SSD/NVMe
socket.send.buffer.bytes TCP send buffer 100–1024 KB; match high-throughput NICs
socket.receive.buffer.bytes TCP receive buffer Same order as send buffer

Replication tuning

  • replica.lag.time.max.ms — follower out of ISR if no fetch within this window (default 30s)
  • replica.fetch.max.bytes — max bytes per follower fetch; raise for large messages (align with message.max.bytes)
  • num.replica.fetchers — parallel fetchers per broker; 1–4 typical

Unclean leader election

unclean.leader.election.enable=false in production. When true, a non-ISR replica can become leader after full ISR loss—data loss for un-replicated writes. With false, partition goes offline until ISR replica returns—availability hit, but no silent truncation.

⚖️ Trade-off

Unclean election trades consistency for uptime. Financial and audit workloads keep it disabled. Some telemetry clusters enable it with acceptance of gap risk—document the decision.

JVM heap and GC

Kafka brokers rely on the OS page cache for hot reads—not a giant heap. ~6 GB heap is the common ceiling; more heap steals RAM from cache and can hurt throughput.

bash — KAFKA_HEAP_OPTS (broker)
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC \
  -XX:MaxGCPauseMillis=20 \
  -XX:InitiatingHeapOccupancyPercent=35 \
  -XX:+ExplicitGCInvokesConcurrent \
  -Djava.awt.headless=true"
🔬 Under the Hood

Broker heap holds metadata, request buffers, and replication structures—not full log segments. Segments live on disk and in page cache. That is why 64 GB RAM machines run 6 GB heap and leave 50+ GB to cache.

Properties — server.properties excerpt
num.partitions=12
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false
log.retention.hours=168
log.segment.bytes=1073741824
num.network.threads=8
num.io.threads=16
replica.lag.time.max.ms=30000
replica.fetch.max.bytes=1048576
auto.create.topics.enable=false

Partition rebalancing

Partitions stick to brokers until you move them. Reassignment fixes hot brokers, replaces failed hardware, and balances leadership—but saturates network if unthrottled.

kafka-reassign-partitions.sh

Workflow: generate JSON plan → execute → verify → (optional) throttle → remove throttle.

bash — reassignment workflow
# Generate plan: move all partitions off broker 3
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
  --broker-list 3 --generate

# Save proposed assignment to reassignment.json, edit if needed, then:
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
  --reassignment-json-file reassignment.json --execute

# Monitor progress
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
  --reassignment-json-file reassignment.json --verify

When reassignment is needed

  • Broker added — existing partitions never auto-migrate; rebalance to use new capacity
  • Broker decommission — evacuate all partitions before shutdown
  • Uneven distribution — one broker holds disproportionate leaders or disk usage
  • Rack awareness fix — after correcting broker.rack assignments

Throttle reassignment

Unthrottled replication can peg network and spike produce latency. Set per-broker throttle before execute:

bash — throttle during reassignment
kafka-configs.sh --bootstrap-server kafka:9092 \
  --alter --add-config "follower.replication.throttled.replicas=0:0:0-1000,1:0:0-1000" \
  --entity-type brokers --entity-name 0

kafka-configs.sh --bootstrap-server kafka:9092 \
  --alter --add-config "leader.replication.throttled.replicas=0:0:0-1000" \
  --entity-type brokers --entity-name 1

# Set rate (bytes/sec) — also replica.alter.log.dirs.io.max.bytes.per.second
kafka-configs.sh --bootstrap-server kafka:9092 \
  --alter --add-config "leader.replication.throttled.rate=52428800" \
  --entity-type brokers --entity-name 0

# After --verify completes: remove throttle configs
flowchart LR
  G[--generate plan]
  T[Apply throttle]
  E[--execute]
  V[--verify complete]
  R[Remove throttle]

  G --> T --> E --> V --> R

Cruise Control

LinkedIn’s Cruise Control automates partition rebalancing from broker metrics (CPU, disk, network, leader count). Goals: rack-aware distribution, even load, broker capacity limits. Supports anomaly detection and self-healing proposals— the operational upgrade from manual JSON plans at scale.

📦 Real World

Teams with 50+ brokers rarely hand-edit reassignment JSON. Cruise Control (or Confluent Auto Data Balancing) runs maintenance windows: detect hot broker → propose moves → throttle → execute → validate lag and URP metrics.

Monitoring & observability

Kafka exposes rich JMX metrics. Production stacks scrape via Prometheus JMX Exporter, dashboard in Grafana, and alert on lag and under-replication—not CPU alone.

flowchart LR
  B[Kafka brokers\nJMX :9999]
  P[Prometheus JMX Exporter\n/metrics]
  PR[Prometheus]
  G[Grafana dashboards]
  A[Alertmanager / PagerDuty]
  BU[Burrow\nlag SLA]

  B --> P --> PR --> G
  PR --> A
  B --> BU --> A

Broker metrics (critical)

Metric Healthy target Meaning
UnderReplicatedPartitions 0 Partitions with fewer ISR replicas than RF—durability risk
ActiveControllerCount exactly 1 cluster-wide 0 or >1 = cluster metadata crisis
OfflinePartitionsCount 0 No leader available—writes/reads blocked
RequestHandlerAvgIdlePercent > 0.2 < 0.2 (80% busy) = request thread saturation
NetworkProcessorAvgIdlePercent > 0.3 < 0.3 = network thread bottleneck
BytesInPerSec / BytesOutPerSec Baseline + headroom Capacity planning; spot hot topics
LogFlushRateAndTimeMs Stable, low p99 Disk fsync pressure; spikes with slow storage

Producer metrics

Metric Alert threshold Notes
record-error-rate 0 Any sustained error = auth, serialization, or broker reject
record-retry-rate Low; spike = broker or network stress Correlate with broker URP / leader election
batch-size-avg Approach batch.size Small batches = throughput left on table
compression-rate-avg Stable ratio lz4/zstd CPU vs bytes saved
request-latency-avg SLO-bound p99 Includes broker + network RTT

Consumer metrics

Metric Priority Notes
records-lag-max #1 consumer metric Max lag across partitions—alert on growth rate
fetch-rate High Drops = stalled consumer or broker issue
bytes-consumed-rate Medium Throughput vs producer bytes-in
commit-latency-avg Medium High = coordinator load or large sync commits

JMX and Prometheus

Kafka brokers expose MBeans under kafka.server, kafka.network, etc. Run JMX Exporter as a Java agent or sidecar; it maps MBean patterns to Prometheus gauges/counters on /metrics.

yaml — JMX exporter rule snippet
rules:
  - pattern: kafka.server<>Value
    name: kafka_under_replicated_partitions
    type: GAUGE
  - pattern: kafka.controller<>Value
    name: kafka_active_controller_count
    type: GAUGE

Grafana dashboards

Confluent publishes official Kafka Grafana dashboards (broker overview, topic, consumer lag). Import by dashboard ID or JSON; wire Prometheus datasource. Panel highlights: URP, offline partitions, bytes in/out, request handler idle.

Burrow

LinkedIn Burrow evaluates consumer lag trends—not just absolute lag—and emits status (OK, WARN, ERR, STOP). Better than static “lag > 10000” alerts that false-positive during deploys. Pair with SLA definitions per consumer group.

💡 Pro Tip

Page on URP > 0 for 5 minutes and offline partitions > 0 before paging on CPU. Kafka brokers run hot on page cache by design—CPU alone misleads.

🎯 Interview Tip

“What do you monitor first on Kafka?” — Under-replicated partitions, active controller count, offline partitions, then consumer records-lag-max growth rate. Broker request handler idle explains produce latency without guessing.

Kafka security

Security is layered: encrypt wire traffic, authenticate principals, authorize per-resource operations. Kafka does not encrypt data at rest natively—that is filesystem or cloud KMS territory.

flowchart TB
  C[Client]
  TLS[SSL/TLS\nencrypt + broker identity]
  SASL[SASL authentication\nwho are you]
  ACL[ACL authorization\nwhat can you do]
  B[Broker]

  C --> TLS --> SASL --> ACL --> B

Authentication mechanisms

Mechanism How it works When
SSL/TLS Encrypt in transit; mTLS verifies client cert Baseline for all production clusters
SASL/PLAIN Username/password in JAAS config Only over SSL; simple internal clusters
SASL/SCRAM Hashed credentials in KRaft/ZK Default for managed internal auth
SASL/GSSAPI (Kerberos) Ticket-based enterprise SSO Large enterprises with AD/Kerberos
SASL/OAUTHBEARER OAuth2 JWT tokens Cloud IdP, zero-trust, short-lived creds
Properties — SASL_SSL listener
listeners=SASL_SSL://0.0.0.0:9093
advertised.listeners=SASL_SSL://broker1.example.com:9093
security.inter.broker.protocol=SASL_SSL
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
sasl.enabled.mechanisms=SCRAM-SHA-512
ssl.keystore.location=/var/ssl/kafka.server.keystore.jks
ssl.keystore.password=${KEYSTORE_PASSWORD}
ssl.key.password=${KEY_PASSWORD}
authorizer.class.name=org.apache.kafka.metadata.authorizer.StandardAuthorizer
super.users=User:admin

Authorization — ACLs

KRaft clusters use StandardAuthorizer (or legacy AclAuthorizer). ACLs bind principaloperationresource. Deny rules override allow.

Resource type Examples
Topicorders, prefix*
Grouporder-processor
TransactionalIdinventory-txn
ClusterIdempotent produce, describe cluster

Operations: Read, Write, Create, Delete, Alter, Describe, ClusterAction, IdempotentWrite, …

bash — kafka-acls.sh examples
# Producer: write to orders topic
kafka-acls.sh --bootstrap-server kafka:9093 --command-config admin.properties \
  --add --allow-principal User:order-service \
  --operation Write --topic orders

# Consumer: read topic + join group
kafka-acls.sh --bootstrap-server kafka:9093 --command-config admin.properties \
  --add --allow-principal User:order-processor \
  --operation Read --topic orders

kafka-acls.sh --bootstrap-server kafka:9093 --command-config admin.properties \
  --add --allow-principal User:order-processor \
  --operation Read --group order-processor

# List ACLs for topic
kafka-acls.sh --bootstrap-server kafka:9093 --command-config admin.properties \
  --list --topic orders

Encryption at rest

Not built into Kafka. Options: LUKS/dm-crypt on broker volumes, cloud EBS encryption (AWS KMS, GCP CMEK), or encrypted filesystems. Inter-broker and client-broker TLS covers in-flight; at-rest protects stolen disks.

Audit logging

Open-source Kafka has limited built-in audit trails. Confluent Platform adds audit logs (authentication, ACL changes, admin actions). Alternatives: broker access logs, Kubernetes audit, SIEM ingestion of ACL CLI changes via GitOps PR history.

⚠️ Pitfall

SASL/PLAIN without SSL sends credentials in cleartext. SCRAM over SASL_SSL is the minimum bar for production. Rotate SCRAM credentials via kafka-configs.sh --alter --add-config SCRAM... on broker user store.

Capacity planning

Size disk and network from peak ingest and fan-out, not averages. CPU is rarely the first bottleneck; partition count per broker has a hard operational ceiling.

Storage

disk ≈ retention_seconds × peak_bytes_in_per_sec × replication_factor × 1.2

Example:
  retention = 7 days = 604,800 s
  peak ingest = 50 MB/s
  RF = 3
  → 604800 × 50 × 3 × 1.2 ≈ 108 TB raw cluster (spread across brokers)

Add: compacted topics, mirror-maker duplication, segment overhead

The 1.2 factor covers compaction lag, segment rounding, and headroom. Monitor actual disk per broker; alert at 70% sustained—reassignment needs free space.

Network

peak_egress ≈ peak_bytes_in × (replication_factor - 1 + consumer_fan_out)

Example:
  50 MB/s in, RF=3, 2 consumer clusters reading full stream
  replication traffic: 50 × (3-1) = 100 MB/s internal
  consumer traffic: 50 × 2 = 100 MB/s
  → broker NIC planning ~200+ MB/s sustained per ingress broker (not burst-only)

Cross-AZ replication doubles cloud egress bills—prefer rack-aware same-AZ replicas when AZ failure tolerance allows, or accept cost for true AZ isolation.

CPU

Typically not the first bottleneck except with heavy compression (zstd level high), SSL everywhere, or excessive metric cardinality scraping. Profile before buying larger CPUs—add brokers for I/O parallelism first.

Partition ceiling per broker

Operational limit: ~2,000–4,000 partitions per broker (cluster-dependent). Beyond this:

  • Controller metadata grows—slower leader election and topic creation
  • Rebalance and recovery times stretch to hours
  • Each partition = file handles and metadata memory
Deployment size Guidance
Small / MVP ~1 partition per broker per high-throughput topic; RF=3 on 3 brokers
Growing Add brokers before adding partitions; keep < 2000 partitions/broker
Large Cruise Control capacity goals; dedicated controller/quorum nodes (KRaft)
flowchart LR
  D[Disk I/O\npage cache]
  N[Network\nRF + fan-out]
  M[Metadata\npartitions / broker]
  C[CPU\ncompression / SSL]

  D -->|usually first| N --> M --> C
⚙️ Config

Capacity review checklist: peak BytesInPerSec, retention per tier, RF, consumer fan-out, partitions/broker count, disk % used, URP during peak, rebalance duration last quarter.

📦 Real World

Run annual “peak × 2” drills: if Black Friday doubles ingest, pre-create partitions and verify broker disk at 2× retention formula. Confluent Cloud auto-scales storage; self-hosted teams add brokers in maintenance windows before marketing events.