Kafka Operations & Administration

Topic management

Topics are the unit of parallelism and retention policy. Most admin work flows through kafka-topics.sh (or the Admin API in automation). Get partition count and RF right at creation— several knobs cannot be safely reduced later.

kafka-topics.sh essentials

# Create
kafka-topics.sh --bootstrap-server kafka:9092 \
  --create --topic orders \
  --partitions 24 --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config retention.ms=604800000

# Describe (configs + partition leaders)
kafka-topics.sh --bootstrap-server kafka:9092 --describe --topic orders

# Alter partitions (increase only) or configs
kafka-topics.sh --bootstrap-server kafka:9092 \
  --alter --topic orders --partitions 36

kafka-configs.sh --bootstrap-server kafka:9092 \
  --entity-type topics --entity-name orders \
  --alter --add-config retention.ms=1209600000

# Delete (disabled by default: delete.topic.enable=true on broker)
kafka-topics.sh --bootstrap-server kafka:9092 --delete --topic orders-staging

Partition count

Partitions bound maximum consumer parallelism (one consumer per partition per group) and spread write load across brokers. You can increase partitions; you cannot decrease without recreating the topic and migrating data.

Rule of thumb:

partitions ≈ target_throughput ÷ per_partition_throughput

Example:
  Target: 120 MB/s ingest
  Per partition sustainable: ~5 MB/s (measure on your hardware)
  → 120 ÷ 5 = 24 partitions (round up, add headroom)

Start higher than you think for high-throughput topics—adding partitions later does not reorder existing keys
Keyed producers: more partitions = better key spread; too many = metadata overhead and slow rebalances
Align with downstream task count (Flink parallelism, Connect tasks.max)

⚠️ Pitfall

Increasing partitions on a keyed topic does not rebalance existing keys—only new keys hash to new partitions. Plan partition count before heavy production traffic.

Replication factor

Always ≥ 3 in production. RF=3 tolerates one broker loss with min.insync.replicas=2 and acks=all. RF=2 is acceptable only in dev or cost-constrained DR tiers with explicit risk acceptance.

Compaction vs retention

Policy	`log.cleanup.policy`	Use case
Time/size retention	`delete` (default)	Event streams, logs, telemetry — keep N days or N bytes
Compaction	`compact`	Changelog topics — latest value per key (KTable, Connect offsets)
Both	`compact,delete`	Compacted state with TTL ceiling on stale keys

Retention and segment configs

Config	Scope	Meaning
`retention.ms`	Topic / broker default	Delete segments older than this (delete policy)
`retention.bytes`	Topic / broker	Max size per partition; -1 = unlimited
`segment.bytes`	Topic / broker	Roll new log segment at this size (default 1 GB)
`segment.ms`	Topic / broker	Roll segment by time even if size not reached
`min.compaction.lag.ms`	Compacted topics	Minimum age before key eligible for compaction
`max.compaction.lag.ms`	Compacted topics	Force compact stale keys after this age
`min.insync.replicas`	Topic override	Min replicas that must ack for `acks=all` to succeed

kafka-topics.sh --bootstrap-server kafka:9092 \
  --create --topic customer-profiles \
  --partitions 12 --replication-factor 3 \
  --config cleanup.policy=compact \
  --config min.compaction.lag.ms=3600000 \
  --config segment.bytes=268435456 \
  --config min.insync.replicas=2

⚙️ Config

Per-topic min.insync.replicas=2 with cluster default.replication.factor=3 is the standard durability pair. RF=3 and min ISR=1 allows writes during single-replica partitions—avoid unless you understand the window.

Consumer group management

Consumer groups are Kafka’s scaling and delivery contract. Operators inspect lag, reset offsets for replays, and delete stale groups—always knowing resets are destructive to “processed” semantics unless consumers are idempotent.

kafka-consumer-groups.sh

# List all groups
kafka-consumer-groups.sh --bootstrap-server kafka:9092 --list

# Describe members, assignments, lag
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group order-processor

# Delete empty group (no active members)
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --delete --group order-processor-stale

--describe output columns:

CURRENT-OFFSET — last committed offset per partition
LOG-END-OFFSET — high watermark on broker
LAG — LOG-END − CURRENT (records behind)
CONSUMER-ID / HOST — member owning partition

Reset offsets

Group must be stopped (no active members) or use --execute with dry-run first. Always use --dry-run before --execute.

Flag	Effect	Typical use
`--to-earliest`	Rewind to beginning of log	Full replay after bug fix
`--to-latest`	Skip to end — drop backlog	Non-critical telemetry; skip poison backlog
`--to-datetime`	Reset to offset at timestamp	Replay from incident time (ISO-8601)
`--to-offset`	Explicit offset per partition	Surgical recovery
`--by-duration`	Rewind N duration from now	“Reprocess last 2 hours”

kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --group order-processor \
  --topic orders \
  --reset-offsets \
  --to-datetime 2026-06-01T14:30:00.000 \
  --dry-run

# If output looks correct:
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --group order-processor \
  --topic orders \
  --reset-offsets \
  --to-datetime 2026-06-01T14:30:00.000 \
  --execute

When to reset

Replay after processing bug — fix code, reset to incident timestamp, reprocess with idempotent sinks
Bootstrap new service — new consumer group reads from earliest or specific datetime for historical backfill
Skip poison backlog — to-latest after DLT quarantine (last resort)

⚠️ Pitfall

Offset reset + non-idempotent DB writes = duplicate rows. Coordinate with downstream owners. Prefer resetting only affected partitions with --to-offset when scope is known.

Lag monitoring

--describe is the CLI lag view. For alerting, export records-lag-max via JMX/Prometheus or use Burrow (see Monitoring section). Sustained lag growth means consumers cannot keep pace—investigate processing time, partition count, or broker I/O.

🎯 Interview Tip

“How do you replay Kafka events?” — Stop consumers → dry-run offset reset → execute → restart with idempotent handlers. Alternative: new consumer group with auto.offset.reset=earliest leaves original group untouched.

Broker configuration

Brokers are I/O-bound log servers—not memory-heavy application servers. Tune thread pools and replication, leave RAM for the OS page cache, and never enable unclean leader election in production.

Key broker defaults

Property	Typical prod value	Notes
`num.partitions`	12–24	Default when auto-create enabled; prefer explicit topic creation
`default.replication.factor`	3	Auto-created topics inherit this
`log.retention.hours`	168 (7d) or use `log.retention.ms`	Broker-wide default; override per topic
`log.segment.bytes`	1073741824 (1 GB)	Smaller segments = more files, faster deletion
`log.cleanup.policy`	delete	compact for changelog topics only
`min.insync.replicas`	2	Cluster default; critical topics override
`delete.topic.enable`	true (staging), guarded in prod	Prevent accidental topic deletion via RBAC

Network tuning

Property	Purpose	Guidance
`num.network.threads`	Accept requests, parse protocol	Start with 8; increase if NetworkProcessor idle low
`num.io.threads`	Disk I/O request handling	Often 2× network threads on SSD/NVMe
`socket.send.buffer.bytes`	TCP send buffer	100–1024 KB; match high-throughput NICs
`socket.receive.buffer.bytes`	TCP receive buffer	Same order as send buffer

Replication tuning

replica.lag.time.max.ms — follower out of ISR if no fetch within this window (default 30s)
replica.fetch.max.bytes — max bytes per follower fetch; raise for large messages (align with message.max.bytes)
num.replica.fetchers — parallel fetchers per broker; 1–4 typical

Unclean leader election

unclean.leader.election.enable=false in production. When true, a non-ISR replica can become leader after full ISR loss—data loss for un-replicated writes. With false, partition goes offline until ISR replica returns—availability hit, but no silent truncation.

⚖️ Trade-off

Unclean election trades consistency for uptime. Financial and audit workloads keep it disabled. Some telemetry clusters enable it with acceptance of gap risk—document the decision.

JVM heap and GC

Kafka brokers rely on the OS page cache for hot reads—not a giant heap. ~6 GB heap is the common ceiling; more heap steals RAM from cache and can hurt throughput.

export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC \
  -XX:MaxGCPauseMillis=20 \
  -XX:InitiatingHeapOccupancyPercent=35 \
  -XX:+ExplicitGCInvokesConcurrent \
  -Djava.awt.headless=true"

🔬 Under the Hood

Broker heap holds metadata, request buffers, and replication structures—not full log segments. Segments live on disk and in page cache. That is why 64 GB RAM machines run 6 GB heap and leave 50+ GB to cache.

num.partitions=12
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false
log.retention.hours=168
log.segment.bytes=1073741824
num.network.threads=8
num.io.threads=16
replica.lag.time.max.ms=30000
replica.fetch.max.bytes=1048576
auto.create.topics.enable=false

Partition rebalancing

Partitions stick to brokers until you move them. Reassignment fixes hot brokers, replaces failed hardware, and balances leadership—but saturates network if unthrottled.

kafka-reassign-partitions.sh

Workflow: generate JSON plan → execute → verify → (optional) throttle → remove throttle.

# Generate plan: move all partitions off broker 3
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
  --broker-list 3 --generate

# Save proposed assignment to reassignment.json, edit if needed, then:
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
  --reassignment-json-file reassignment.json --execute

# Monitor progress
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
  --reassignment-json-file reassignment.json --verify

When reassignment is needed

Broker added — existing partitions never auto-migrate; rebalance to use new capacity
Broker decommission — evacuate all partitions before shutdown
Uneven distribution — one broker holds disproportionate leaders or disk usage
Rack awareness fix — after correcting broker.rack assignments

Throttle reassignment

Unthrottled replication can peg network and spike produce latency. Set per-broker throttle before execute:

kafka-configs.sh --bootstrap-server kafka:9092 \
  --alter --add-config "follower.replication.throttled.replicas=0:0:0-1000,1:0:0-1000" \
  --entity-type brokers --entity-name 0

kafka-configs.sh --bootstrap-server kafka:9092 \
  --alter --add-config "leader.replication.throttled.replicas=0:0:0-1000" \
  --entity-type brokers --entity-name 1

# Set rate (bytes/sec) — also replica.alter.log.dirs.io.max.bytes.per.second
kafka-configs.sh --bootstrap-server kafka:9092 \
  --alter --add-config "leader.replication.throttled.rate=52428800" \
  --entity-type brokers --entity-name 0

# After --verify completes: remove throttle configs

flowchart LR
  G[--generate plan]
  T[Apply throttle]
  E[--execute]
  V[--verify complete]
  R[Remove throttle]

  G --> T --> E --> V --> R

Cruise Control

LinkedIn’s Cruise Control automates partition rebalancing from broker metrics (CPU, disk, network, leader count). Goals: rack-aware distribution, even load, broker capacity limits. Supports anomaly detection and self-healing proposals— the operational upgrade from manual JSON plans at scale.

📦 Real World

Teams with 50+ brokers rarely hand-edit reassignment JSON. Cruise Control (or Confluent Auto Data Balancing) runs maintenance windows: detect hot broker → propose moves → throttle → execute → validate lag and URP metrics.

Monitoring & observability

Kafka exposes rich JMX metrics. Production stacks scrape via Prometheus JMX Exporter, dashboard in Grafana, and alert on lag and under-replication—not CPU alone.

flowchart LR
  B[Kafka brokers\nJMX :9999]
  P[Prometheus JMX Exporter\n/metrics]
  PR[Prometheus]
  G[Grafana dashboards]
  A[Alertmanager / PagerDuty]
  BU[Burrow\nlag SLA]

  B --> P --> PR --> G
  PR --> A
  B --> BU --> A

Broker metrics (critical)

Metric	Healthy target	Meaning
`UnderReplicatedPartitions`	0	Partitions with fewer ISR replicas than RF—durability risk
`ActiveControllerCount`	exactly 1 cluster-wide	0 or >1 = cluster metadata crisis
`OfflinePartitionsCount`	0	No leader available—writes/reads blocked
`RequestHandlerAvgIdlePercent`	> 0.2	< 0.2 (80% busy) = request thread saturation
`NetworkProcessorAvgIdlePercent`	> 0.3	< 0.3 = network thread bottleneck
`BytesInPerSec / BytesOutPerSec`	Baseline + headroom	Capacity planning; spot hot topics
`LogFlushRateAndTimeMs`	Stable, low p99	Disk fsync pressure; spikes with slow storage

Producer metrics

Metric	Alert threshold	Notes
`record-error-rate`	0	Any sustained error = auth, serialization, or broker reject
`record-retry-rate`	Low; spike = broker or network stress	Correlate with broker URP / leader election
`batch-size-avg`	Approach `batch.size`	Small batches = throughput left on table
`compression-rate-avg`	Stable ratio	lz4/zstd CPU vs bytes saved
`request-latency-avg`	SLO-bound p99	Includes broker + network RTT

Consumer metrics

Metric	Priority	Notes
`records-lag-max`	#1 consumer metric	Max lag across partitions—alert on growth rate
`fetch-rate`	High	Drops = stalled consumer or broker issue
`bytes-consumed-rate`	Medium	Throughput vs producer bytes-in
`commit-latency-avg`	Medium	High = coordinator load or large sync commits

JMX and Prometheus

Kafka brokers expose MBeans under kafka.server, kafka.network, etc. Run JMX Exporter as a Java agent or sidecar; it maps MBean patterns to Prometheus gauges/counters on /metrics.

rules:
  - pattern: kafka.server<>Value
    name: kafka_under_replicated_partitions
    type: GAUGE
  - pattern: kafka.controller<>Value
    name: kafka_active_controller_count
    type: GAUGE

Grafana dashboards

Confluent publishes official Kafka Grafana dashboards (broker overview, topic, consumer lag). Import by dashboard ID or JSON; wire Prometheus datasource. Panel highlights: URP, offline partitions, bytes in/out, request handler idle.

Burrow

LinkedIn Burrow evaluates consumer lag trends—not just absolute lag—and emits status (OK, WARN, ERR, STOP). Better than static “lag > 10000” alerts that false-positive during deploys. Pair with SLA definitions per consumer group.

💡 Pro Tip

Page on URP > 0 for 5 minutes and offline partitions > 0 before paging on CPU. Kafka brokers run hot on page cache by design—CPU alone misleads.

🎯 Interview Tip

“What do you monitor first on Kafka?” — Under-replicated partitions, active controller count, offline partitions, then consumer records-lag-max growth rate. Broker request handler idle explains produce latency without guessing.

Kafka security

Security is layered: encrypt wire traffic, authenticate principals, authorize per-resource operations. Kafka does not encrypt data at rest natively—that is filesystem or cloud KMS territory.

flowchart TB
  C[Client]
  TLS[SSL/TLS\nencrypt + broker identity]
  SASL[SASL authentication\nwho are you]
  ACL[ACL authorization\nwhat can you do]
  B[Broker]

  C --> TLS --> SASL --> ACL --> B

Authentication mechanisms

Mechanism	How it works	When
SSL/TLS	Encrypt in transit; mTLS verifies client cert	Baseline for all production clusters
SASL/PLAIN	Username/password in JAAS config	Only over SSL; simple internal clusters
SASL/SCRAM	Hashed credentials in KRaft/ZK	Default for managed internal auth
SASL/GSSAPI (Kerberos)	Ticket-based enterprise SSO	Large enterprises with AD/Kerberos
SASL/OAUTHBEARER	OAuth2 JWT tokens	Cloud IdP, zero-trust, short-lived creds

listeners=SASL_SSL://0.0.0.0:9093
advertised.listeners=SASL_SSL://broker1.example.com:9093
security.inter.broker.protocol=SASL_SSL
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
sasl.enabled.mechanisms=SCRAM-SHA-512
ssl.keystore.location=/var/ssl/kafka.server.keystore.jks
ssl.keystore.password=${KEYSTORE_PASSWORD}
ssl.key.password=${KEY_PASSWORD}
authorizer.class.name=org.apache.kafka.metadata.authorizer.StandardAuthorizer
super.users=User:admin

Authorization — ACLs

KRaft clusters use StandardAuthorizer (or legacy AclAuthorizer). ACLs bind principal → operation → resource. Deny rules override allow.

Resource type	Examples
`Topic`	`orders`, `prefix*`
`Group`	`order-processor`
`TransactionalId`	`inventory-txn`
`Cluster`	Idempotent produce, describe cluster

Operations: Read, Write, Create, Delete, Alter, Describe, ClusterAction, IdempotentWrite, …

# Producer: write to orders topic
kafka-acls.sh --bootstrap-server kafka:9093 --command-config admin.properties \
  --add --allow-principal User:order-service \
  --operation Write --topic orders

# Consumer: read topic + join group
kafka-acls.sh --bootstrap-server kafka:9093 --command-config admin.properties \
  --add --allow-principal User:order-processor \
  --operation Read --topic orders

kafka-acls.sh --bootstrap-server kafka:9093 --command-config admin.properties \
  --add --allow-principal User:order-processor \
  --operation Read --group order-processor

# List ACLs for topic
kafka-acls.sh --bootstrap-server kafka:9093 --command-config admin.properties \
  --list --topic orders

Encryption at rest

Not built into Kafka. Options: LUKS/dm-crypt on broker volumes, cloud EBS encryption (AWS KMS, GCP CMEK), or encrypted filesystems. Inter-broker and client-broker TLS covers in-flight; at-rest protects stolen disks.

Audit logging

Open-source Kafka has limited built-in audit trails. Confluent Platform adds audit logs (authentication, ACL changes, admin actions). Alternatives: broker access logs, Kubernetes audit, SIEM ingestion of ACL CLI changes via GitOps PR history.

⚠️ Pitfall

SASL/PLAIN without SSL sends credentials in cleartext. SCRAM over SASL_SSL is the minimum bar for production. Rotate SCRAM credentials via kafka-configs.sh --alter --add-config SCRAM... on broker user store.

Capacity planning

Size disk and network from peak ingest and fan-out, not averages. CPU is rarely the first bottleneck; partition count per broker has a hard operational ceiling.

Storage

disk ≈ retention_seconds × peak_bytes_in_per_sec × replication_factor × 1.2

Example:
  retention = 7 days = 604,800 s
  peak ingest = 50 MB/s
  RF = 3
  → 604800 × 50 × 3 × 1.2 ≈ 108 TB raw cluster (spread across brokers)

Add: compacted topics, mirror-maker duplication, segment overhead

The 1.2 factor covers compaction lag, segment rounding, and headroom. Monitor actual disk per broker; alert at 70% sustained—reassignment needs free space.

Network

peak_egress ≈ peak_bytes_in × (replication_factor - 1 + consumer_fan_out)

Example:
  50 MB/s in, RF=3, 2 consumer clusters reading full stream
  replication traffic: 50 × (3-1) = 100 MB/s internal
  consumer traffic: 50 × 2 = 100 MB/s
  → broker NIC planning ~200+ MB/s sustained per ingress broker (not burst-only)

Cross-AZ replication doubles cloud egress bills—prefer rack-aware same-AZ replicas when AZ failure tolerance allows, or accept cost for true AZ isolation.

CPU

Typically not the first bottleneck except with heavy compression (zstd level high), SSL everywhere, or excessive metric cardinality scraping. Profile before buying larger CPUs—add brokers for I/O parallelism first.

Partition ceiling per broker

Operational limit: ~2,000–4,000 partitions per broker (cluster-dependent). Beyond this:

Controller metadata grows—slower leader election and topic creation
Rebalance and recovery times stretch to hours
Each partition = file handles and metadata memory

Deployment size	Guidance
Small / MVP	~1 partition per broker per high-throughput topic; RF=3 on 3 brokers
Growing	Add brokers before adding partitions; keep < 2000 partitions/broker
Large	Cruise Control capacity goals; dedicated controller/quorum nodes (KRaft)

flowchart LR
  D[Disk I/O\npage cache]
  N[Network\nRF + fan-out]
  M[Metadata\npartitions / broker]
  C[CPU\ncompression / SSL]

  D -->|usually first| N --> M --> C

⚙️ Config

Capacity review checklist: peak BytesInPerSec, retention per tier, RF, consumer fan-out, partitions/broker count, disk % used, URP during peak, rebalance duration last quarter.

📦 Real World

Run annual “peak × 2” drills: if Black Friday doubles ingest, pre-create partitions and verify broker disk at 2× retention formula. Confluent Cloud auto-scales storage; self-hosted teams add brokers in maintenance windows before marketing events.