Running a Kafka cluster in production without monitoring is like flying an airplane blind. Because Kafka handles critical data pipelines, a silent failure (like a slow consumer causing data latency or an offline replica) can quickly impact your business operations.

To keep your cluster healthy, you must export metrics (typically using tools like the **Prometheus JMX Exporter** and **Grafana**) and set alerts on key health indicators. Let's review the most important metrics to watch.

[Figure: Essential Kafka dashboard monitoring metrics]
Real-World Analogy: The Car Dashboard

Imagine driving a car down the highway:

  • Under-Replicated Partitions is the check engine light. If it comes on (value > 0), something has already failed: you are running with fewer replicas than you configured, and you risk a breakdown if another one goes out.
  • Consumer Lag is the fuel gauge. A needle that keeps dropping toward empty is like a processing backlog that keeps growing: you can keep driving for a while, but you need to act soon or you will stall.
  • Active Controller Count is the steering wheel. There must be exactly 1 steering wheel in the car. Having 0 or 2 steering wheels is a recipe for disaster.

Three Critical Broker Metrics

1. UnderReplicatedPartitions (Alert if > 0)

Metric Name: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions

Counts the partitions led by this broker whose in-sync replica (ISR) set is smaller than the configured number of replicas. If this number rises above 0, at least one follower has crashed or cannot keep up with the leader, so those partitions have less redundancy than intended. **Set a high-priority alert on this metric.**
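As a rough illustration of the "alert if > 0" rule, the sketch below scans Prometheus text-format output for the under-replicated gauge. The exported metric name `kafka_server_replicamanager_underreplicatedpartitions` is an assumption: the real name depends on the rules in your JMX Exporter configuration. The sample text is made up, and the parser assumes sample lines carry no trailing timestamp.

```python
def parse_gauges(exposition_text, metric_name):
    """Return the value of every sample of `metric_name` in Prometheus text format."""
    values = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like "<name>{labels} <value>" or "<name> <value>".
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric_name:
            values.append(float(value))
    return values


# Assumed exported name; check your own JMX Exporter rules.
URP_METRIC = "kafka_server_replicamanager_underreplicatedpartitions"


def under_replicated_alert(exposition_text):
    """True when any sample reports more than zero under-replicated partitions."""
    return any(v > 0 for v in parse_gauges(exposition_text, URP_METRIC))


# Hypothetical scrape output from one broker.
sample = """\
# TYPE kafka_server_replicamanager_underreplicatedpartitions gauge
kafka_server_replicamanager_underreplicatedpartitions 3.0
kafka_server_replicamanager_leadercount 42.0
"""

print(under_replicated_alert(sample))  # True
```

In practice this check usually lives in a Prometheus alerting rule rather than in application code; the function only makes the condition explicit.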

2. ActiveControllerCount (Alert if != 1)

Metric Name: kafka.controller:type=KafkaController,name=ActiveControllerCount

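Each broker reports 1 if it is currently the active controller and 0 otherwise, so the sum across the cluster must be exactly 1. If the sum drops to 0, no broker is coordinating the cluster and leader elections cannot take place. If it stays above 1, two brokers both believe they are the controller.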
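The "exactly 1" rule applies to the cluster-wide sum, not to any single broker. A minimal sketch of that check, assuming you have already collected the gauge from every broker into a dictionary (the broker names and the status labels are hypothetical):

```python
def controller_status(active_controller_by_broker):
    """Classify the cluster from each broker's ActiveControllerCount (0 or 1)."""
    total = sum(active_controller_by_broker.values())
    if total == 1:
        return "ok"
    if total == 0:
        return "no-controller"
    return "multiple-controllers"


# Hypothetical readings from a three-broker cluster.
readings = {"broker-1": 0, "broker-2": 1, "broker-3": 0}
print(controller_status(readings))  # ok
```

A brief non-1 reading can occur while the controller role moves between brokers, so alerts typically require the condition to persist for a short time before firing.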

3. IsrShrinksPerSec / IsrExpandsPerSec

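Metric Name: kafka.server:type=ReplicaManager,name=IsrShrinksPerSec

Metric Name: kafka.server:type=ReplicaManager,name=IsrExpandsPerSec

IsrShrinksPerSec tracks how often replicas are removed from a partition's In-Sync Replicas (ISR) list, and IsrExpandsPerSec tracks how often they are added back once they have caught up. A burst of shrinks followed by matching expands is expected when a broker restarts; outside of such events, both rates should be 0. Frequent shrinks indicate network drops or broker performance issues.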
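One way to read the shrink and expand rates together is sketched below. It derives per-second rates from two readings of a cumulative event count, then labels the result. The labels and the `max_shrink_rate` threshold are illustrative assumptions, not Kafka terminology.

```python
def per_second_rate(count_before, count_after, interval_seconds):
    """Average events per second between two readings of a cumulative counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (count_after - count_before) / interval_seconds


def classify_isr_activity(shrink_rate, expand_rate, max_shrink_rate=0.0):
    """Label ISR churn; max_shrink_rate is an illustrative threshold."""
    if shrink_rate <= max_shrink_rate:
        return "stable"
    if expand_rate >= shrink_rate:
        return "flapping"   # replicas drop out but rejoin at least as fast
    return "shrinking"      # replicas drop out faster than they rejoin


# Hypothetical counter readings taken 60 seconds apart.
shrinks = per_second_rate(count_before=10, count_after=40, interval_seconds=60)
expands = per_second_rate(count_before=8, count_after=14, interval_seconds=60)
print(classify_isr_activity(shrinks, expands))  # shrinking
```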

Critical Client Metrics

1. records-lag-max (Consumer metric)

Reports the largest lag, in records, across the partitions assigned to this consumer, where lag is the gap between the latest record available in a partition and the consumer's current position. If lag grows continuously, your consumer is too slow to handle incoming traffic. Add consumer instances to the group (up to one per partition) and, once that limit is reached, add partitions.
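The arithmetic behind the lag figure can be sketched as follows: per-partition lag is the log end offset minus the consumer's offset, the reported value is the maximum over partitions, and "grows continuously" means each sample is larger than the last. The offsets and function names here are hypothetical examples.

```python
def partition_lags(log_end_offsets, consumer_offsets):
    """Lag in records for each (topic, partition), never below zero."""
    return {
        tp: max(0, end - consumer_offsets[tp])
        for tp, end in log_end_offsets.items()
    }


def max_lag(log_end_offsets, consumer_offsets):
    """Largest lag across all partitions, or 0 if there are none."""
    lags = partition_lags(log_end_offsets, consumer_offsets)
    return max(lags.values(), default=0)


def is_growing(lag_samples):
    """True when every sample is strictly larger than the one before it."""
    return len(lag_samples) >= 2 and all(
        later > earlier for earlier, later in zip(lag_samples, lag_samples[1:])
    )


# Hypothetical offsets for a two-partition topic.
ends = {("orders", 0): 1500, ("orders", 1): 900}
offsets = {("orders", 0): 1200, ("orders", 1): 900}

print(max_lag(ends, offsets))        # 300
print(is_growing([100, 250, 300]))   # True
print(is_growing([100, 90, 120]))    # False
```

Alerting on the trend rather than on a fixed threshold avoids false alarms from short traffic bursts that the consumer quickly works through.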

2. record-error-rate (Producer metric)

Tracks the average number of record sends per second that resulted in errors, for example because of schema validation issues, broker downtime, or authorization errors. In a healthy system, this value should be 0.
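Because the healthy value is 0, the alert condition is simply "any producer above zero". A small sketch, assuming the rate has already been collected per producer client ID (the client IDs and the `threshold` parameter are hypothetical):

```python
def producers_with_errors(error_rate_by_client, threshold=0.0):
    """Return the client IDs whose record-error-rate exceeds the threshold."""
    return sorted(
        client_id
        for client_id, rate in error_rate_by_client.items()
        if rate > threshold  # NaN compares as False, so missing data is skipped
    )


# Hypothetical readings from three producers.
rates = {"checkout-service": 0.0, "billing-service": 2.5, "audit-service": 0.1}
print(producers_with_errors(rates))  # ['audit-service', 'billing-service']
```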

Conclusion

Effective monitoring prevents minor glitches from turning into system outages. Set up Grafana dashboards to track **under-replicated partitions**, monitor **consumer lag** to prevent business delays, and alert on the active controller count and ISR shrinks to catch broker issues early.