Quick Definition
Apache Kafka most commonly refers to a distributed streaming platform used for building real-time data pipelines and streaming applications.
Analogy: Kafka is like a high-throughput courier service that reliably delivers ordered batches of messages from many senders to many receivers, with a durable record of every package.
Formal definition: Kafka is a partitioned, replicated, append-only commit log service that provides durable storage, real-time publish/subscribe, and stream processing hooks.
Other meanings:
- Kafka (literature): Franz Kafka, a 20th-century novelist.
- Kafka (software ecosystem): The broader set of client libraries, connectors, and managed services around Apache Kafka.
- Kafka (proprietary offerings): Managed Kafka-like services by cloud vendors.
What is Kafka?
What it is / what it is NOT
- What it is: Kafka is a distributed, durable, append-only log storage and messaging system optimized for throughput, fan-out, and retention-based replay.
- What it is NOT: It is not a traditional relational database, not a low-latency single-record key-value store, and not a long-term archival store by default.
Key properties and constraints
- Partitioned logs for scalability.
- Replication for durability and availability.
- Producers publish records keyed for partitioning; consumers read in offset order.
- Exactly-once semantics available with caveats and configuration.
- Pull-based consumption model (consumer controls pace).
- Retention configured per topic (time or size).
- Strong ordering guarantees are per-partition only.
- Cross-datacenter replication is supported but introduces latency and complexity.
Where it fits in modern cloud/SRE workflows
- Event backbone for microservices, analytics, streaming ML features, and audit trails.
- Centralized telemetry bus for observability and security events.
- Integral to platform engineering patterns on Kubernetes and managed cloud.
- SRE concerns: broker availability, replication correctness, controller stability, storage IO, and consumer group lag.
Diagram description (text-only)
- Producers write messages to topics.
- A topic is split into partitions distributed across brokers.
- Brokers replicate partitions to followers.
- ZooKeeper (legacy deployments) or a KRaft controller quorum manages cluster metadata; the active controller coordinates leader elections.
- Consumers join consumer groups and fetch messages; offsets are committed.
- Connectors ingress and egress data; stream processors transform messages.
Kafka in one sentence
Kafka is a durable, distributed commit log that decouples producers and consumers to enable scalable, fault-tolerant streaming and event-driven systems.
Kafka vs related terms
| ID | Term | How it differs from Kafka | Common confusion |
|---|---|---|---|
| T1 | RabbitMQ | Broker with queue semantics and push delivery | People expect same ordering |
| T2 | Kinesis | Managed stream service by a cloud vendor | Assumed identical API |
| T3 | Pulsar | Multi-layered ledger with topic-level isolation | Compared only by latency |
| T4 | Redis Streams | In-memory stream with optional persistence | Assumed same durability |
| T5 | Database WAL | Storage log for a single DB | Thought of as pubsub substitute |
Row Details
- T1: RabbitMQ emphasizes routing and broker-side ack; Kafka emphasizes high-throughput log and replay.
- T2: Kinesis commonly has shard limits and vendor-managed scaling differences.
- T3: Pulsar separates serving and storage; offers geo-replication differently.
- T4: Redis Streams trade durability and disk throughput for low-latency memory ops.
- T5: DB WALs are internal and not designed for multi-subscriber streaming semantics.
Why does Kafka matter?
Business impact (revenue, trust, risk)
- Enables real-time analytics that can increase revenue by enabling immediate personalization and quicker fraud detection.
- Builds reliable audit trails and immutable event stores that increase regulatory trust and reduce compliance risk.
- Centralized event backbone reduces integration complexity, lowering operational risk and time-to-market for new features.
Engineering impact (incident reduction, velocity)
- Decouples services to reduce cascading failures and enable independent scaling.
- Stream processing enables incremental data processing patterns that cut batch windows and free engineers from brittle ETL.
- Standardized event model increases developer velocity for cross-team integrations.
SRE framing (SLIs/SLOs/error budgets/toil/on-call)
- SLIs commonly: broker availability, consumer lag, end-to-end publish latency, record loss rate.
- SLOs should reflect business-critical topics differently than telemetry topics.
- Error budgets guide planned schema changes or risky upgrades.
- Toil reduction: automate broker scaling, rolling upgrades, and connector restarts.
3–5 realistic “what breaks in production” examples
- Sudden consumer group backlog growth due to downstream SLA regression, causing increased retention usage and disk pressure.
- Under-replicated partitions after a broker failure and slow rebalancing leading to vulnerable data windows.
- Misconfigured retention causing unexpected data loss for feature replays.
- Network partitions causing leader flapping and repeated leader elections raising CPU and latency.
- Large single-key hotspots causing skewed partition traffic and local IO saturation.
Where is Kafka used?
| ID | Layer/Area | How Kafka appears | Typical telemetry | Common tools |
|---|---|---|---|---|
| L1 | Edge / Ingest | Edge gateways publish telemetry streams | Publish latency, retries | Connectors, Kafka Connect |
| L2 | Service Mesh / Backend | Event-driven integration between services | Consumer lag, processing rate | Client libraries, schema registry |
| L3 | Application | Event sourcing and async workflows | Throughput, error rates | Streams, state stores |
| L4 | Data / Analytics | Real-time pipelines for analytics | End-to-end latency | Stream processors, connectors |
| L5 | Cloud infra | Managed Kafka or K8s operators | Cluster health, broker metrics | Managed services, operators |
| L6 | Ops / Security | SIEM ingestion, audit logs | Log volume, retention compliance | Connectors, IAM integrations |
Row Details
- L1: Edge collects sensor or user telemetry; often uses local buffering.
- L2: Backend services use Kafka for decoupled message exchange and async tasks.
- L3: Applications implement event sourcing or CQRS with Kafka as the source of truth.
- L4: Data engineering consumes topics for enrichment and analytics pipelines.
- L5: Platform teams run Kafka on Kubernetes or use managed Kafka offerings.
- L6: Security and compliance teams ingest events for alerting and audits.
When should you use Kafka?
When it’s necessary
- You need durable, replayable event storage for multiple independent consumers.
- High-throughput ingestion from many producers with fan-out to many consumers.
- Event-driven business logic where ordering per key and retention-based replay matters.
When it’s optional
- Modest message volume with few consumers and strict low-latency requirements where simpler brokers suffice.
- Pure request/response workflows better handled by RPC or HTTP.
When NOT to use / overuse it
- Small teams with low message rates and simple queue semantics.
- Transactional single-record state updates, which are better served by a database (with change-data-capture if needed) when replay isn't required.
- As a long-term archival store without lifecycle management.
Decision checklist
- If you need durable replay AND multiple consumers -> use Kafka.
- If you need single consumer, low latency, and simple ordering -> consider a lightweight queue.
- If you need strict transactional DB-level updates and no replay -> use DB-based solution.
Maturity ladder
- Beginner: Use managed Kafka; topics with basic retention; one consumer group per logical worker.
- Intermediate: Use Kafka Connect, schema registry, consumer group balancing, and monitoring.
- Advanced: Multi-datacenter replication, Kafka Streams or ksqlDB, transactional producers, capacity plans.
Example decisions
- Small team: Use a managed cloud Kafka with Connect and a schema registry to avoid ops burden.
- Large enterprise: Run Kafka on Kubernetes with an operator, custom monitoring, multi-cluster replication, and strict SLOs.
How does Kafka work?
Components and workflow
- Brokers: Stateful server processes that store topic partitions on disk and handle client requests.
- Controller: The broker acting as controller coordinates leader elections and partition assignments.
- ZooKeeper / KRaft controller: Metadata management; newer versions use the built-in Raft quorum (KRaft) instead of ZooKeeper.
- Topics and Partitions: Topics are split into partitions, each an ordered, append-only sequence of records.
- Producers: Append records to topic partitions; may include keys for partition selection.
- Consumers: Pull records from partitions; manage offsets and commit positions.
- Consumer Groups: Coordinate work division; each partition is assigned to exactly one consumer within a group.
- Replication: Each partition has leader and followers; ISR tracks in-sync replicas.
- Connectors: Kafka Connect sinks and sources for system integration.
- Stream processors: Kafka Streams or ksqlDB for stateful transformations.
Data flow and lifecycle
- Producer serializes message and sends to broker.
- Broker appends message to partition log and assigns offset.
- Replication sends message to followers; commit depends on acks.
- Consumers poll broker for messages starting from committed offsets.
- Consumers process and commit offsets once safe.
- Retention eviction removes old log segments per policy.
Edge cases and failure modes
- Producer retries causing duplicates unless idempotence is enabled.
- Consumer rebalances causing duplicate processing within windows.
- Slow follower leads to out-of-sync replicas and potential unavailability.
- Time-based retention truncates needed data if misconfigured.
Short practical examples (pseudocode)
- Produce with idempotence and acks=all: enable idempotence, send, check produce futures.
- Consumer commit pattern: poll, process batch, commit offsets synchronously when safe.
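A minimal sketch of both patterns, assuming the confluent-kafka Python client; the broker address, topic name, and group id are placeholder assumptions:

```python
# Minimal sketch with the confluent-kafka Python client (pip install confluent-kafka).
# Broker address, topic name, and group id below are illustrative assumptions.
from confluent_kafka import Producer, Consumer

BOOTSTRAP = "localhost:9092"  # assumption: replace with your brokers

# Produce with idempotence and acks=all.
producer = Producer({
    "bootstrap.servers": BOOTSTRAP,
    "enable.idempotence": True,  # broker deduplicates producer retries
    "acks": "all",               # wait for all in-sync replicas before acknowledging
})

def on_delivery(err, msg):
    # Surface failed produces instead of losing them silently.
    if err is not None:
        print(f"delivery failed for key={msg.key()}: {err}")

producer.produce("orders", key=b"user-42", value=b'{"item":"book"}', on_delivery=on_delivery)
producer.flush(10)  # block until outstanding records are delivered or time out

# Consume: poll, process each record, then commit offsets synchronously.
consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "order-processor",
    "enable.auto.commit": False,     # commit manually, only after processing succeeds
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

def process(value: bytes) -> None:
    print(f"processing {value!r}")   # placeholder for real processing logic

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"consume error: {msg.error()}")
            continue
        process(msg.value())
        consumer.commit(asynchronous=False)  # synchronous commit once processing is safe
finally:
    consumer.close()
```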
Typical architecture patterns for Kafka
- Event Backbone: All services publish significant events to Kafka; use topic per domain. Use when many consumers need same events.
- CQRS/Event Sourcing: Commands update write model and emit events to Kafka as the source of truth. Use when auditability and replay matter.
- Stream Enrichment Pipeline: Raw events ingested, enriched by stream processors, written to analytics topics. Use for real-time feature engineering.
- Log Aggregation / Telemetry Bus: Agents forward logs/metrics into Kafka for downstream processing. Use for centralized observability ingestion.
- Connector-based ETL: Use Kafka Connect to move data between systems. Use when needing scalable reliably-managed connectors.
Failure modes & mitigation
| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|---|---|---|---|---|---|
| F1 | Broker crash | Partition leader flips often | JVM OOM or disk full | Restart, increase memory, fix disk | Broker restarts, leader change rate |
| F2 | High consumer lag | Growing backlog per topic | Slow consumers or GC pauses | Scale consumers, optimize processing | Consumer lag metric per partition |
| F3 | Under-replicated partitions | ISR less than replication factor | Network or slow follower | Fix network, add capacity | Under-replicated partition count |
| F4 | GC pause spikes | Increased end-to-end latency | Poor heap tuning | Tune JVM, use G1 or reduce heap | JVM pause time, request latency |
| F5 | Disk saturation | Broker IO high and timeouts | Retention too long or large messages | Shorten retention, enable tiered storage or compaction | Disk usage, io wait |
| F6 | Schema mismatch | Consumer deserialization errors | Incompatible schema change | Backward/forward compatible schemas | Deserialization error rate |
| F7 | Network partition | Producers/consumers erroring | Network flaps or firewall | Improve networking, multi-AZ design | Request failures and client errors |
Row Details
- F2: Consumer lag often due to downstream DB writes slowing consumption or improper backpressure handling.
- F6: Schema changes without registry coordination cause consumers to fail; enforce compatibility.
Key Concepts, Keywords & Terminology for Kafka
- Topic — Named stream of records — Logical grouping of messages — Pitfall: overly broad topics cause schema chaos.
- Partition — Ordered shard of a topic — Enables parallelism — Pitfall: too few partitions limit throughput.
- Broker — Kafka server process — Stores partitions and serves clients — Pitfall: single broker hosting hot partitions.
- Controller — Broker that coordinates cluster metadata — Manages leader elections — Pitfall: controller flapping causes instability.
- ISR — In-Sync Replicas — Replicas eligible to be leader — Pitfall: small ISR increases risk on failure.
- Leader — Partition replica accepting writes — Handles reads/writes — Pitfall: single leader hotspot.
- Follower — Replica that replicates from leader — Provides redundancy — Pitfall: slow followers causing URP.
- Offset — Sequential position in a partition — Consumer position marker — Pitfall: manual offset resets cause reprocessing.
- Consumer group — Set of consumers sharing topic partitions — Enables scaling — Pitfall: misaligned group IDs cause duplicate consumption.
- Consumer lag — Difference between latest offset and committed offset — Signal of backlog — Pitfall: ignoring lag masks processing issues.
- Producer — Client that writes records — Can be synchronous or async — Pitfall: not using idempotence leads to duplicates.
- Acks — Durability setting for producers — Controls commit semantics — Pitfall: acks=1 may lose data on leader crash.
- Idempotent producer — Ensures no duplicate writes on retries — Reduces duplicates — Pitfall: needs proper client configuration.
- Transactions — Atomic multi-topic writes — Ensures exactly-once across topics — Pitfall: complexity and higher latency.
- Retention — Policy to keep data by time/size — Enables replay — Pitfall: too short retention prevents reprocessing.
- Compaction — Keep latest message per key — Useful for changelogs — Pitfall: not suitable for event logs needing history.
- Log segment — Disk file containing records — Unit of disk cleanup — Pitfall: small segment size increases overhead.
- Kafka Connect — Framework for connectors — Simplifies integrations — Pitfall: connector misconfig can cause data loss.
- Connector — Source or sink plugin — Moves data to/from Kafka — Pitfall: plugin quality varies widely.
- Schema Registry — Stores message schemas — Enforces compatibility — Pitfall: optional use leads to incompatible consumers.
- SerDe — Serializer/Deserializer — Converts objects to bytes — Pitfall: mismatched SerDe between producer and consumer.
- Consumer rebalancing — Reassignment of partitions among consumers — Necessary for elasticity — Pitfall: frequent rebalances cause processing swings.
- Sticky assignor — Partition assignor to reduce churn — Lower rebalance impact — Pitfall: not ideal for highly dynamic groups.
- Leader election — Process selecting partition leader — Maintains availability — Pitfall: frequent elections indicate instability.
- MirrorMaker — Replication tool for cross-cluster copy — Supports DR and migration — Pitfall: latency and failure considerations.
- Tiered storage — Offload older segments to cheaper storage — Lowers cost — Pitfall: retrieval latency varies by vendor.
- Kafka Streams — Library for stream processing — In-process transforms and state stores — Pitfall: state store sizing and changelog retention.
- ksqlDB — SQL layer for Kafka streams — Fast prototyping of transforms — Pitfall: complex joins can be resource heavy.
- Exactly-once semantics — Guarantees no duplicate visible results — Important for financial systems — Pitfall: requires transactions and careful consumer logic.
- At-least-once — Default delivery model without dedupe — Simpler but duplicates possible — Pitfall: downstream idempotency required.
- Under-replicated partition — Partition with fewer in-sync replicas — Risk of data loss — Pitfall: prolonged URP indicates health issues.
- Broker rack awareness — Placement across failure domains — Reduces correlated failures — Pitfall: misconfigured rack IDs cause poor balancing.
- Controller quorum — Metadata consensus group — Stability critical — Pitfall: small quorum can be vulnerable.
- Log compaction delay — Time until compaction runs — Affects storage reclaim — Pitfall: expecting immediate compaction causes surprises.
- Retention.bytes — Size-based retention — Controls storage footprint — Pitfall: small sizes delete needed data.
- Message key — Used to determine partition — For ordering by key — Pitfall: heavy skew on single key creates hotspot.
- Compression — Reduces network and disk usage — Use gzip/snappy/lz4/zstd — Pitfall: CPU cost for compression.
- Consumer offset commit — Persisting progress — Manual or auto commit — Pitfall: committing before processing risks data loss; committing after processing risks duplicates.
- Broker metrics — JMX metrics for operations — Essential for SREs — Pitfall: misinterpreting metrics without context.
- Quotas — Broker/client resource limits — Prevent noisy neighbors — Pitfall: mis-set quotas can throttle healthy producers.
- Security (SASL, TLS, ACLs) — Authentication and encryption — Controls access — Pitfall: overly permissive ACLs leak data.
- Authorization — Topic and group permissions — Restricts actions — Pitfall: missing ACLs break consumers.
- Producer retries — Retransmission behavior — Reduces data loss on transient errors — Pitfall: high retries without idempotence duplicates.
- Consumer offset reset policy — earliest/latest — Controls start position on missing offset — Pitfall: default latest hides data.
- KRaft — Kafka Raft metadata mode — Removes external ZooKeeper — Pitfall: newer and evolving operational practices.
How to Measure Kafka (Metrics, SLIs, SLOs)
| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|---|---|---|---|---|---|
| M1 | Broker availability | Cluster online percentage | Proportion of healthy brokers | 99.9% for critical | Flapping masks instability |
| M2 | Partition under-replicated | Durability risk count | Count of URP partitions | 0 for production | Short spikes may be transient |
| M3 | Consumer lag | Processing backlog per partition | Offset distance per partition | Low steady state per topic | Some lag is expected at peaks |
| M4 | End-to-end publish latency | Time from produce to consumer visible | Time delta traced across produce and consume | <500ms for real-time | Depends on replication and network |
| M5 | Produce error rate | Failed publishes percent | Failed produce calls / total | <0.1% | Bursts after maintenance possible |
| M6 | Request queue time | Broker request waiting time | Broker request queue metric | Low relative to processing | CPU saturation increases this |
| M7 | Disk utilization | Storage pressure per broker | Disk usage percent | <70% operational | Retention changes shift utilization |
| M8 | JVM pause time | GC impact on latency | JVM pause metrics | Minimal pauses | Large heaps increase pause risk |
| M9 | Connect task failures | Connector stability | Failed tasks count | 0 for critical connectors | Bad connector configs cause churn |
| M10 | Schema violations | Consumer deserialization errors | Deserialization error counts | 0 for compatibility | Rolling deploys may cause temporary errors |
Row Details
- M4: End-to-end latency is commonly measured by inserting trace IDs or timestamps in records and measuring produce-to-consume time (see the sketch below).
- M7: Disk utilization target varies by retention and trailing window; include headroom for spikes.
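A minimal sketch of the M4 approach, assuming the confluent-kafka Python client and a dedicated probe topic; in production you would export the delta as a histogram rather than print it, and account for clock skew between producer and consumer hosts:

```python
# Sketch: stamp a send timestamp into record headers and compute the delta on consume.
import time
from confluent_kafka import Producer, Consumer

BOOTSTRAP = "localhost:9092"   # assumption
TOPIC = "latency-probe"        # assumption: dedicated probe topic

producer = Producer({"bootstrap.servers": BOOTSTRAP, "acks": "all"})
producer.produce(TOPIC, value=b"probe",
                 headers=[("sent_ns", str(time.time_ns()).encode())])
producer.flush(10)

consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "latency-probe-reader",
    "auto.offset.reset": "earliest",
})
consumer.subscribe([TOPIC])

msg = consumer.poll(timeout=10.0)
if msg is not None and not msg.error():
    headers = dict(msg.headers() or [])
    sent_ns = int(headers["sent_ns"])
    latency_ms = (time.time_ns() - sent_ns) / 1e6
    print(f"end-to-end latency: {latency_ms:.1f} ms")  # export as a histogram in practice
consumer.close()
```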
Best tools to measure Kafka
Tool — Prometheus + JMX Exporter
- What it measures for Kafka: Broker metrics, JVM metrics, request latencies, ISR counts.
- Best-fit environment: Self-managed Kafka, Kubernetes.
- Setup outline:
- Export JMX from brokers.
- Scrape with Prometheus.
- Record rules for derived SLIs.
- Use Pushgateway for ephemeral jobs.
- Strengths:
- Flexible queries, alerting rules.
- Works well with Grafana.
- Limitations:
- Maintenance overhead for collectors.
- High-cardinality metrics need care.
Tool — Grafana
- What it measures for Kafka: Visualization and dashboarding of Prometheus metrics.
- Best-fit environment: Any environment with metric storage.
- Setup outline:
- Connect to Prometheus or other metric sources.
- Build executive and ops dashboards.
- Configure alerting via Grafana or external alert manager.
- Strengths:
- Powerful visualizations.
- Panel sharing and templating.
- Limitations:
- Not a metric store.
- Alerting limited compared to dedicated systems.
Tool — Confluent Control Center / Managed UI
- What it measures for Kafka: Cluster health, topic throughput, connect status.
- Best-fit environment: Confluent Platform or managed service.
- Setup outline:
- Enable telemetry in Confluent.
- Configure connectors and alert thresholds.
- Use built-in dashboards for SLOs.
- Strengths:
- Purpose-built Kafka insights.
- Simplifies operational tasks.
- Limitations:
- May be proprietary.
- Limited customization compared to OSS stacks.
Tool — OpenTelemetry + Tracing
- What it measures for Kafka: End-to-end publish and consume latency with traces.
- Best-fit environment: Distributed applications needing distributed tracing.
- Setup outline:
- Instrument producers and consumers with trace context.
- Export traces to tracing backend.
- Correlate with metrics.
- Strengths:
- Detailed visibility across services.
- Limitations:
- Requires app instrumentation and schema for trace context.
Tool — Managed cloud metrics (Cloud provider)
- What it measures for Kafka: Broker health, throughput, quotas for managed offering.
- Best-fit environment: Managed Kafka services.
- Setup outline:
- Enable provider metrics.
- Integrate with provider alerting.
- Map provider metrics to SLIs.
- Strengths:
- Low ops overhead.
- Provider-specific insights.
- Limitations:
- Varies by provider and metrics exposed.
Recommended dashboards & alerts for Kafka
Executive dashboard
- Panels: Cluster availability, top-topic throughput, error rate, storage utilization, end-to-end latency.
- Why: High-level view for business and platform owners to see health and capacity.
On-call dashboard
- Panels: Under-replicated partitions, consumer group lag per critical topic, broker CPU/memory/disk, recent leader changes, connector task failures.
- Why: Fast triage for SREs to identify impact and scope.
Debug dashboard
- Panels: Per-broker request queue times, JVM GC pauses, network IO, partition leader distribution, disk IO per partition.
- Why: Root-cause investigation and performance tuning.
Alerting guidance
- Page vs ticket:
- Page for URP count > 0 for >5 minutes on critical topics, major broker down, or consumer lag exceeding SLO thresholds for critical pipelines.
- Ticket for non-urgent connector failures or warning-level disk usage.
- Burn-rate guidance:
- Use error budget burn rates to throttle risky operations like schema migrations or cluster upgrades (a worked burn-rate sketch follows below).
- Noise reduction tactics:
- Group alerts by cluster and topic.
- Suppress alerts during planned maintenance windows.
- Deduplicate repeated alerts using alertmanager grouping and inhibition.
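A worked burn-rate sketch in Python; the SLO target and paging thresholds are illustrative assumptions to adapt to your own alerting policy:

```python
# Sketch: error-budget burn rate for a Kafka SLI (e.g., produce success rate).
# The SLO target and thresholds below are illustrative assumptions.

def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Burn rate = observed error ratio / allowed error ratio.
    1.0 means the budget is consumed exactly at the sustainable pace."""
    if total_events == 0:
        return 0.0
    observed_error_ratio = bad_events / total_events
    allowed_error_ratio = 1.0 - slo_target
    return observed_error_ratio / allowed_error_ratio

# Example: 99.9% produce-success SLO; 120 failures out of 50,000 produces in the window.
rate = burn_rate(bad_events=120, total_events=50_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")  # ~2.4x: budget burning faster than sustainable

# A common pattern (assumption, adapt to your policy): page on fast burn, ticket on slow burn.
if rate >= 14.4:    # e.g. short-window fast-burn threshold
    print("page on-call")
elif rate >= 3.0:   # e.g. multi-hour slow-burn threshold
    print("open a ticket")
```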
Implementation Guide (Step-by-step)
1) Prerequisites
- Inventory of topics, retention needs, throughput estimates, and SLAs.
- Capacity plan for brokers: disk, network, CPU.
- Security plan: TLS, authentication, ACLs.
- Choice: managed vs self-managed vs operator.
2) Instrumentation plan
- Export JMX metrics from brokers.
- Instrument producers/consumers for tracing.
- Configure Schema Registry for schemas.
3) Data collection
- Deploy Connect for sources/sinks.
- Configure topic partitions and replication factors.
- Set retention and compaction policies per topic.
4) SLO design
- Define critical topics and assign SLOs for end-to-end latency and durability.
- Create error budgets and escalation policies.
5) Dashboards
- Build exec, on-call, and debug dashboards from metrics.
- Include topology and mapping of topics to owners.
6) Alerts & routing
- Define thresholds for URP, lag, and broker availability.
- Configure routing rules to on-call teams and escalation channels.
7) Runbooks & automation
- Create runbooks for root causes like disk full, broker restart, and rebalancing.
- Automate routine tasks: rolling upgrades, partition reassignment, connector restarts.
8) Validation (load/chaos/game days)
- Load test to target throughput and retention scenarios.
- Run chaos tests for broker failure and network partitions.
- Conduct game days to exercise runbooks.
9) Continuous improvement
- Postmortems for incidents.
- Regular capacity reviews and tuning.
- Schema evolution reviews.
Pre-production checklist
- Topics defined with partition and replication counts.
- Schema registry integrated and compatibility checks.
- Monitoring and alerts configured.
- Access controls and TLS working.
- Disaster recovery plan and backups defined.
Production readiness checklist
- Baseline performance metrics established.
- Consumer lag SLIs validated under load.
- Runbooks published and on-call assigned.
- Capacity headroom for spikes.
- Disaster recovery tested with MirrorMaker or replication.
Incident checklist specific to Kafka
- Check controller and broker availability.
- Verify under-replicated partitions and ISR sizes (a quick check is sketched after this checklist).
- Confirm disk usage and GC metrics.
- Inspect consumer group lag for key topics.
- Execute known runbook steps: restart broker, force leader election, scale consumers.
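A quick under-replicated-partition spot check, sketched with the confluent-kafka AdminClient (broker address is an assumption); dashboards remain the primary source, but this helps if metrics are unavailable mid-incident:

```python
# Sketch: list partitions whose in-sync replica set is smaller than the assigned replica set.
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumption
metadata = admin.list_topics(timeout=10)

for topic_name, topic in metadata.topics.items():
    for pid, partition in topic.partitions.items():
        # A partition is under-replicated when fewer replicas are in sync than assigned.
        if len(partition.isrs) < len(partition.replicas):
            print(f"URP: {topic_name}-{pid} leader={partition.leader} "
                  f"replicas={partition.replicas} isr={partition.isrs}")
```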
Example for Kubernetes
- Deploy Kafka via operator with PVCs and pod anti-affinity.
- Verify StatefulSet stability, PV reclaim policies, and node resource limits.
- Good: stable leader distribution and persistent volumes mounted.
Example for managed cloud service
- Configure cluster sizing and topic configs via provider console or API.
- Verify provider metrics and alerts map to your SLIs.
- Good: provider-managed upgrades have minimal downtime and visible maintenance windows.
Use Cases of Kafka
- Real-time personalization
  - Context: E-commerce site personalizing product recommendations.
  - Problem: Need low-latency user event processing for sessions.
  - Why Kafka helps: High-throughput ingestion and multiple consumer apps for feature building.
  - What to measure: End-to-end latency, event loss rate, consumer lag.
  - Typical tools: Kafka Streams, schema registry, feature store.
- Fraud detection pipeline
  - Context: Financial transactions require near-real-time scoring.
  - Problem: Aggregating events quickly to detect anomalies.
  - Why Kafka helps: Durable replay and stream processing for models.
  - What to measure: Detection latency, false positive rate, throughput.
  - Typical tools: Kafka Streams, ksqlDB, model inference services.
- Audit trail and compliance
  - Context: Regulatory requirement to store an immutable record.
  - Problem: Multiple services create events that must be auditable.
  - Why Kafka helps: Append-only log and retention settings for replay.
  - What to measure: Retention adherence, integrity checks, schema versioning.
  - Typical tools: Kafka topics with compaction off, schema registry.
- Log aggregation and observability
  - Context: Centralized logs from many microservices.
  - Problem: Collecting and processing logs at scale.
  - Why Kafka helps: Durable ingestion and backpressure handling.
  - What to measure: Log ingestion latency, connector failures.
  - Typical tools: Filebeat/Fluentd -> Kafka Connect -> ELK.
- Change Data Capture (CDC)
  - Context: Syncing DB changes to multiple systems.
  - Problem: Multiple consumers need consistent DB updates.
  - Why Kafka helps: Debezium pushes DB changes into Kafka for replay.
  - What to measure: CDC lag, schema changes, data correctness.
  - Typical tools: Debezium, Kafka Connect, sink connectors.
- Metrics pipeline
  - Context: High-cardinality telemetry streaming to analytics.
  - Problem: Scale and ingest spikes.
  - Why Kafka helps: Buffering, scaling consumers for analytics.
  - What to measure: Ingestion rate, partition skew, retention costs.
  - Typical tools: Telegraf/Prometheus remote write, Kafka Connect.
- Event-driven microservices
  - Context: Decoupled services communicate via events.
  - Problem: Tight coupling via RPC leads to fragility.
  - Why Kafka helps: Loose coupling via topics; replayability.
  - What to measure: Service coupling metrics, latency, error rates.
  - Typical tools: Client libraries, schema registry.
- Streaming ETL and enrichment
  - Context: Enrich incoming events with reference data.
  - Problem: Batch ETL has stale data and long delays.
  - Why Kafka helps: Stream processing maintains up-to-date state stores.
  - What to measure: Enrichment latency, correctness, state store size.
  - Typical tools: Kafka Streams, ksqlDB.
- IoT telemetry ingestion
  - Context: Massive device telemetry with intermittent connectivity.
  - Problem: Buffering and burst handling.
  - Why Kafka helps: Durable ingestion and replay when downstream is recovered.
  - What to measure: Publish retries, retention, ingestion spikes.
  - Typical tools: Edge buffers, Connectors, tiered storage.
- Feature store materialization
  - Context: ML features assembled in real time.
  - Problem: Keep feature values fresh and consistent.
  - Why Kafka helps: Streaming updates into online and offline stores.
  - What to measure: Feature staleness, throughput, correctness.
  - Typical tools: Kafka Streams, state stores, materialization connectors.
Scenario Examples (Realistic, End-to-End)
Scenario #1 — Kubernetes: Stateful event pipeline for personalization
Context: E-commerce platform running on Kubernetes with microservices.
Goal: Build real-time personalization features with low latency and replayability.
Why Kafka matters here: Decouples user event producers from offline and online consumers; supports replay for model retraining.
Architecture / workflow: Producers in pods send events to Kafka operator cluster; Kafka Streams app enriches events; online feature store consumes enriched events.
Step-by-step implementation:
- Deploy Kafka via operator with 3+ brokers and rack awareness.
- Configure topics per domain with adequate partitions.
- Deploy schema registry and enforce compatibility.
- Deploy Kafka Streams app in Kubernetes with liveness/readiness probes.
- Hook online store sink via Connect.
What to measure: Consumer lag, broker CPU/disk, event latency, schema errors.
Tools to use and why: Kubernetes operator for lifecycle, Prometheus for metrics, Grafana dashboards.
Common pitfalls: Insufficient partitions; state store sizing; pod eviction without graceful drain.
Validation: Load test with synthetic events, verify feature freshness under load.
Outcome: Real-time personalization with replay for model improvement.
Scenario #2 — Serverless/Managed-PaaS: Ingest analytics via managed Kafka
Context: SaaS using managed Kafka service and serverless functions for processing.
Goal: Scale ingestion without operating Kafka.
Why Kafka matters here: Managed durability and auto-scaling of brokers reduce ops.
Architecture / workflow: Serverless producers write events to managed cluster; managed Connect sinks to warehouse.
Step-by-step implementation:
- Provision managed Kafka cluster with topic plans.
- Configure IAM and TLS for serverless functions.
- Deploy consumer lambdas triggered by event streams.
- Use managed Connect to sink to analytics.
What to measure: Provider metrics for latency, function failures, sink success rates.
Tools to use and why: Managed Kafka UI for health, tracing for end-to-end latency.
Common pitfalls: Hidden provider quotas; cold-start latency in serverless.
Validation: Spike tests and verify no data loss during autoscale.
Outcome: Low-ops ingestion pipeline with scalable consumers.
Scenario #3 — Incident-response/postmortem: URP after broker failure
Context: Production cluster experiences sudden broker crash and many URP.
Goal: Restore replication and identify root cause.
Why Kafka matters here: URP increases risk of data loss and affects availability.
Architecture / workflow: SREs triage controller state, identify broken disks.
Step-by-step implementation:
- Page SRE and check controller and broker logs.
- Inspect under-replicated partition metrics and identify affected brokers.
- Restart crashed broker and monitor replication.
- Reassign partitions away from problematic nodes if disk damaged.
- Postmortem root-cause analysis and corrective actions.
What to measure: URP counts, leader changes, disk IO metrics.
Tools to use and why: Broker logs, Prometheus, Grafana, and storage metrics.
Common pitfalls: Restarting without remediation causing repeated crashes.
Validation: Ensure URP returns to zero and replication factor preserved.
Outcome: Restored durability and planned remediation for disk capacity.
Scenario #4 — Cost/performance trade-off: Retention vs tiered storage
Context: Large telemetry retention causing high storage costs.
Goal: Lower costs while preserving recent data for fast access.
Why Kafka matters here: Tiered storage can move older segments to cheaper object storage.
Architecture / workflow: Configure tiered storage for older segments and keep hot window on local disks.
Step-by-step implementation:
- Identify hot window and retention needs.
- Enable tiered storage on brokers and set offload thresholds.
- Test retrieval latency for cold segments.
- Adjust retention.bytes and retention.ms per topic.
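A sketch of the per-topic retention adjustment, assuming the confluent-kafka AdminClient and an example topic name; tiered-storage offload settings themselves are vendor- and version-specific and not shown here:

```python
# Sketch: adjust per-topic retention as part of a tiered-storage rollout.
# Topic name and values are illustrative assumptions.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumption

# Keep a 3-day hot window on local disk; size cap as a safety net.
# Note: alter_configs replaces the full config set for the resource, so unspecified
# overrides revert to defaults; newer clients also offer incremental_alter_configs.
resource = ConfigResource(
    ConfigResource.Type.TOPIC,
    "telemetry.raw",  # assumption: example topic
    set_config={
        "retention.ms": str(3 * 24 * 60 * 60 * 1000),
        "retention.bytes": str(500 * 1024**3),
    },
)

futures = admin.alter_configs([resource])
for res, fut in futures.items():
    fut.result()  # raises if the update failed
    print(f"updated {res}")
```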
What to measure: Cost per GB, retrieval latency for cold data, broker disk usage.
Tools to use and why: Cloud object storage, tiered storage metrics.
Common pitfalls: Unexpected retrieval latencies during bulk reprocessing.
Validation: Reprocess archived data and measure time and success.
Outcome: Reduced storage cost with acceptable retrieval performance.
Common Mistakes, Anti-patterns, and Troubleshooting
- Symptom: Consumer lag steadily grows -> Root cause: Slow downstream writes -> Fix: Scale consumers and batch writes; add backpressure.
- Symptom: Frequent leader elections -> Root cause: Flapping brokers or network instability -> Fix: Fix network, check broker health, increase session timeouts.
- Symptom: Data loss after retention -> Root cause: Misconfigured retention.ms -> Fix: Increase retention or implement tiered storage.
- Symptom: High CPU on brokers -> Root cause: Compression CPU cost or heavy GC -> Fix: Change compression codec or tune JVM.
- Symptom: Deserialization errors -> Root cause: Schema incompatibility -> Fix: Use schema registry, enforce compatibility.
- Symptom: Under-replicated partitions -> Root cause: Disk IO or slow followers -> Fix: Add brokers, rebalance partitions, fix IO.
- Symptom: Hot partition causing slow throughput -> Root cause: Key skew -> Fix: Repartition keys or use more partitions and custom partitioner.
- Symptom: Connector tasks repeatedly failing -> Root cause: Misconfigured connector or destination errors -> Fix: Fix connector config or destination permissions.
- Symptom: Excessive consumer rebalances -> Root cause: Short session timeouts or sticky assignment absent -> Fix: Increase session.timeout.ms and use the cooperative assignor.
- Symptom: Garbage collection spikes -> Root cause: Large heap and old GC algorithms -> Fix: Use G1 GC and tune heap sizes.
- Symptom: Unexpected duplicate messages -> Root cause: Producer retries without idempotence -> Fix: Enable idempotent producer and transactional writes if needed.
- Symptom: High disk utilization -> Root cause: Retention misestimate -> Fix: Adjust retention, enable compaction or tiered storage.
- Symptom: Missing metrics -> Root cause: JMX not exposed -> Fix: Configure JMX exporter and correct scrape targets.
- Symptom: TLS handshake failures -> Root cause: Expired certs or wrong truststore -> Fix: Rotate certs and verify keystore/truststore config.
- Symptom: ACL denials -> Root cause: Missing topic ACLs -> Fix: Add appropriate ACL entries for service principals.
- Symptom: Slow partition reassignment -> Root cause: High traffic during reassignment -> Fix: Throttle reassignment and schedule during low traffic.
- Symptom: Schema drift -> Root cause: No registry or incompatible changes -> Fix: Enforce compatibility, use schema registry.
- Symptom: Monitoring false positives -> Root cause: Thresholds not tuned -> Fix: Calibrate baselines and use anomaly detection.
- Symptom: High network egress cost -> Root cause: Cross-AZ or cross-region replication -> Fix: Adjust replication strategy and batch sizes.
- Symptom: Observability pitfall — Too many high-cardinality metrics -> Root cause: Tag explosion -> Fix: Reduce cardinality and aggregate metrics.
- Symptom: Observability pitfall — Missing correlation IDs -> Root cause: Producers not instrumented -> Fix: Add trace context to messages.
- Symptom: Observability pitfall — Dashboards full of noise -> Root cause: Unfiltered alerts -> Fix: Introduce suppression and grouping rules.
- Symptom: Observability pitfall — Metrics without SLIs -> Root cause: Lack of derived metrics -> Fix: Create SLIs (e.g., tail latency) from raw metrics.
- Symptom: Overpartitioning -> Root cause: Too many partitions for brokers -> Fix: Re-evaluate partition counts and increase brokers.
- Symptom: Underestimating retention cost -> Root cause: No lifecycle policy -> Fix: Implement tiered storage and lifecycle policies.
Best Practices & Operating Model
Ownership and on-call
- Platform team owns cluster operations; topic-level issues escalate to the owning domain teams.
- Topic owners responsible for schema, retention, and consumer behavior.
- On-call rotations should include runbooks for common broker and consumer incidents.
Runbooks vs playbooks
- Runbooks: Step-by-step procedures for specific failures.
- Playbooks: High-level decision trees for escalations and cross-team coordination.
Safe deployments (canary/rollback)
- Use rolling, broker-by-broker upgrades with health checks.
- Canary topic changes by deploying consumer in shadow mode first.
- Have automated rollback of schema changes if compatibility errors occur.
Toil reduction and automation
- Automate partition reassignment with throttles.
- Automate connector restart and dead-letter routing.
- Automate topic creation with guardrails (limits, naming, owners).
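A sketch of topic creation with guardrails using the confluent-kafka AdminClient; the naming rule, bounds, and defaults are assumptions to replace with your platform's policy:

```python
# Sketch: automated topic creation with enforced guardrails (naming, partitions, replication).
import re
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumption

def create_topic(name: str, partitions: int = 6, replication: int = 3) -> None:
    # Guardrails: enforce naming convention and sane bounds before touching the cluster.
    if not re.fullmatch(r"[a-z0-9-]+\.[a-z0-9-]+", name):
        raise ValueError("topic name must look like '<domain>.<event>'")
    if not 1 <= partitions <= 48:
        raise ValueError("partition count out of allowed range")
    if replication < 3:
        raise ValueError("replication factor below platform minimum")

    topic = NewTopic(
        name,
        num_partitions=partitions,
        replication_factor=replication,
        config={"retention.ms": str(7 * 24 * 60 * 60 * 1000), "min.insync.replicas": "2"},
    )
    admin.create_topics([topic])[name].result()  # raises if creation failed

create_topic("payments.order-created")
```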
Security basics
- Encrypt in transit with TLS.
- Authenticate clients with SASL or mTLS.
- Authorize with topic-level ACLs.
- Use schema registry with auth to prevent rogue producers.
Weekly/monthly routines
- Weekly: Review consumer lag trends, top producers, connector status.
- Monthly: Capacity review, retention and cost analysis, upgrade planning.
- Quarterly: Replay drills and disaster recovery tests.
What to review in postmortems related to Kafka
- Timeline of leader elections and URP.
- Metrics during incident and correlation with consumer lag.
- Configuration changes and schema evolutions.
- Action items for capacity, automation, or alert tuning.
What to automate first
- Health alerts for broker and URP remediation.
- Connector restart and dead-letter routing for sink failures.
- Topic creation with enforced defaults (partitions, replication).
- Consumer lag alerting and automated scaling for consumers.
Tooling & Integration Map for Kafka
| ID | Category | What it does | Key integrations | Notes |
|---|---|---|---|---|
| I1 | Monitoring | Collects broker and JVM metrics | Prometheus, Grafana | Essential for SREs |
| I2 | Tracing | End-to-end latency and correlation | OpenTelemetry | Instrument apps for traces |
| I3 | Connectors | Move data to/from Kafka | DBs, S3, Elasticsearch | Use vetted connectors |
| I4 | Schema Registry | Manage message schemas | Producers, consumers | Enforce compatibility |
| I5 | Stream Processing | Stateful transforms and joins | Kafka topics, state stores | Kafka Streams, ksqlDB |
| I6 | Operators | Manage Kafka on K8s | StatefulSets, PVCs | Use mature operators |
| I7 | Security | Authentication and ACL management | TLS, SASL, IAM | Centralize secrets handling |
| I8 | Tiered storage | Offload old segments to object store | S3, GCS, Azure Blob | Saves cost for large retention |
| I9 | Disaster Recovery | Cross-cluster replication | MirrorMaker, replication | Plan RPO/RTO policies |
| I10 | Managed Service | Vendor-hosted Kafka offering | Cloud IAM, VPC | Low ops overhead |
Row Details
- I3: Connectors vary in quality; prefer community and vendor-tested connectors.
- I6: Operators simplify lifecycle but review operator maturity and loss scenarios.
Frequently Asked Questions (FAQs)
What is the difference between Kafka and RabbitMQ?
Kafka is a distributed log optimized for throughput and replay, while RabbitMQ is a message broker focused on routing and queue semantics.
What is the difference between Kafka topics and partitions?
A topic is a logical stream; partitions are ordered shards of a topic providing parallelism and ordering per partition.
What is the difference between Kafka and Kinesis?
Kinesis is a managed streaming service with different shard and scaling semantics; operational behavior and APIs differ by provider.
How do I ensure no data loss in Kafka?
Use replication with acks=all, monitor under-replicated partitions, and configure durable retention and monitoring.
How do I measure end-to-end latency in Kafka?
Embed trace IDs when producing, instrument consumers to report consumption timestamps, and compute produce-to-consume deltas.
How do I handle schema evolution?
Use a schema registry and enforce backward/forward compatibility rules for producers and consumers.
How many partitions should I have per topic?
Depends on throughput and consumer parallelism; start with an estimate based on expected max throughput per broker and scale as needed.
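A rough sizing heuristic, sketched in Python; the per-partition throughput figures are assumptions you should replace with values measured on your own cluster:

```python
# Sketch: estimate partition count from throughput targets (a common heuristic, not an official formula).
import math

def estimate_partitions(target_mb_per_s: float,
                        per_partition_produce_mb_s: float,
                        per_partition_consume_mb_s: float,
                        headroom: float = 1.5) -> int:
    """Take the larger of the produce-side and consume-side requirements, then add headroom."""
    need_produce = target_mb_per_s / per_partition_produce_mb_s
    need_consume = target_mb_per_s / per_partition_consume_mb_s
    return math.ceil(max(need_produce, need_consume) * headroom)

# Example: 300 MB/s peak, measured ~25 MB/s per partition producing, ~15 MB/s consuming.
print(estimate_partitions(300, 25, 15))  # -> 30 partitions
```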
How do I secure Kafka in production?
Use TLS, SASL or mTLS for authentication, and topic-level ACLs for authorization; integrate with secret management.
How do I scale Kafka?
Add brokers and rebalance partitions; ensure network and disk capacity scale proportionally.
How do I avoid consumer duplicates?
Enable idempotent producers or use transactional producers plus consumer-side deduplication when necessary.
How do I handle cross-region replication?
Use tools like MirrorMaker or vendor replication; accept higher latency and configure careful conflict resolution policies.
How do I monitor consumer lag effectively?
Measure offset difference per partition over time and alert when lag sustained beyond SLO windows.
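A minimal sketch of per-partition lag computation with the confluent-kafka Python client (broker, group id, and topic are assumptions); in practice you would export these values to your metrics system rather than print them:

```python
# Sketch: lag = log end offset minus the group's committed offset, per partition.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumption
    "group.id": "order-processor",          # the group whose lag you want to inspect
    "enable.auto.commit": False,
})

topic = "orders"  # assumption
metadata = consumer.list_topics(topic, timeout=10)
partitions = [TopicPartition(topic, p) for p in metadata.topics[topic].partitions]

for committed in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(committed, timeout=10)  # log start / log end
    position = committed.offset if committed.offset >= 0 else low      # no commit yet -> start
    print(f"{topic}-{committed.partition}: lag={high - position}")

consumer.close()
```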
How do I choose between Kafka Streams and ksqlDB?
Kafka Streams is a library for embedded stream processing; ksqlDB provides SQL-like declarative streaming. Choose Streams for embedded apps and ksqlDB for rapid SQL-based transformations.
How do I handle schema registry outages?
Design connectors and consumers to fail gracefully and have a fallback mode; cache schemas in clients.
How do I manage topic lifecycle?
Automate creation with defaults, enforce retention and replication policies, and archive with tiered storage.
How do I debug serialization errors?
Check schema versions, SerDe configuration, and the content of failing messages with a consumer in debugging mode.
How do I reduce noisy alerts?
Use grouping, dedupe strategies, suppression windows, and tune thresholds to baseline traffic variability.
Conclusion
Kafka is a foundational streaming platform that provides durable, replayable logs and scalable fan-out for modern event-driven architectures. Proper use requires clear ownership, observability, schema governance, and operational automation.
Next 7 days plan
- Day 1: Inventory topics, owners, SLAs, and retention settings.
- Day 2: Ensure monitoring (Prometheus/JMX) and basic dashboards in place.
- Day 3: Integrate schema registry and audit schema usage.
- Day 4: Implement alerting for URP and consumer lag for critical topics.
- Day 5: Run a small load test on consumer groups and validate scaling.
- Day 6: Create runbooks for common incidents and assign on-call.
- Day 7: Plan a game day to exercise recovery and replay.
Appendix — Kafka Keyword Cluster (SEO)
Primary keywords
- Kafka
- Apache Kafka
- Kafka tutorial
- Kafka guide
- Kafka stream processing
- Kafka architecture
- Kafka topics
- Kafka partitions
- Kafka broker
- Kafka consumer
- Kafka producer
- Kafka Connect
- Kafka Streams
- ksqlDB
- Kafka schema registry
- Kafka replication
Related terminology
- consumer lag
- under-replicated partitions
- ISR in Kafka
- Kafka retention policy
- log compaction
- exactly-once semantics
- at-least-once delivery
- idempotent producer
- Kafka transactions
- Kafka security
- SASL TLS Kafka
- Kafka ACLs
- Kafka on Kubernetes
- Kafka operator
- KRaft mode
- tiered storage Kafka
- MirrorMaker replication
- Kafka monitoring
- Kafka metrics
- Prometheus Kafka metrics
- Kafka JMX exporter
- Kafka troubleshooting
- Kafka runbook
- Kafka SLO
- Kafka SLIs
- Kafka alerts
- Kafka best practices
- Kafka capacity planning
- Kafka partitioning strategy
- Kafka key skew
- Kafka compression
- Kafka JVM tuning
- Kafka GC pause
- Kafka disk usage
- Kafka retention.ms
- Kafka retention.bytes
- Kafka compaction delay
- Kafka connect sink
- Kafka connect source
- Debezium Kafka CDC
- Kafka event sourcing
- Kafka audit log
- Kafka observability
- Kafka end-to-end latency
- Kafka tracing
- OpenTelemetry Kafka
- Kafka cost optimization
- Kafka tiered storage S3
- Kafka message schema
- Avro Kafka
- Protobuf Kafka
- JSON Schema Kafka
- Kafka consumer group rebalance
- cooperative rebalance Kafka
- sticky assignor Kafka
- Kafka leader election
- Kafka controller
- Kafka cluster health
- Kafka broker restart
- Kafka partition reassignment
- Kafka rolling upgrade
- Kafka canary deployment
- Kafka connector failures
- Kafka dead letter queue
- Kafka backpressure handling
- Kafka resilience
- Kafka disaster recovery
- Kafka cross region replication
- Kafka managed service
- Confluent Kafka
- AWS MSK
- Azure Event Hubs Kafka
- Google Pub/Sub Kafka-compatible
- Kafka Streams state store
- Kafka materialized view
- Kafka feature store
- Kafka real-time analytics
- Kafka personalization
- Kafka fraud detection
- Kafka log aggregation
- Kafka telemetry ingestion
- Kafka metrics pipeline
- Kafka event-driven microservices
- Kafka ETL pipeline
- Kafka enrichment pipeline
- Kafka serverless integration
- Kafka lambdas
- Kafka connector best practices
- Kafka connector configuration
- Kafka security best practices
- Kafka production checklist
- Kafka preproduction checklist
- Kafka game day
- Kafka chaos testing
- Kafka postmortem
- Kafka incident response
- Kafka under-replicated partitions alert
- Kafka consumer lag alert
- Kafka broker availability SLI
- Kafka end-to-end SLI
- Kafka throughput planning
- Kafka partition count planning
- Kafka throughput tuning
- Kafka retention cost management
- Kafka storage optimization
- Kafka tiered storage benefits
- Kafka cold storage retrieval
- Kafka schema compatibility
- Kafka schema evolution strategy
- Kafka SerDe best practices
- Kafka serialization errors
- Kafka deserialization troubleshooting
- Kafka connector scaling
- Kafka consumer scaling strategies
- Kafka producer tuning
- Kafka producer acks
- Kafka producer retries
- Kafka client libraries
- Kafka Java client
- Kafka Python client
- Kafka Go client
- Kafka .NET client
- Kafka operators comparison
- Kafka operator production
- Kafka managed vs self-managed
- Kafka performance tradeoffs
- Kafka cost performance analysis
- Kafka retention planning
- Kafka stream processing patterns
- Kafka CQRS
- Kafka event sourcing patterns
- Kafka security compliance
- Kafka regulatory auditing
- Kafka schema registry integration
- Kafka data lineage
- Kafka metadata management
- Kafka KRaft benefits
- Kafka ZooKeeper migration
- Kafka metadata replication
- Kafka topic lifecycle management
- Kafka alert noise reduction
- Kafka alert deduplication
- Kafka SLIs for business owners
- Kafka dashboards for executives
- Kafka on-call dashboard
- Kafka debug dashboard
- Kafka observability best practices
