TL;DR
Kafka marks a partition as failed when it notices issues with the replication process for that partition. It’s not uncommon to run into errors or warnings like Partition $topicPartition marked as failed (kafka.server.ReplicaFetcherThread). In this article, we’ll explain why Kafka marks partitions as failed and how to identify them. We’ll also discuss what actions you can take to address or prevent them.
Estimated reading time: 4 minutes
What are failed partitions in Kafka?
Whenever Kafka encounters issues such as network failures, hardware problems, or broker failures during replication, it marks the affected partitions as Failed. Kafka does this to ensure the data’s integrity and prevent inconsistencies.
Common Reasons
Some of the common reasons Kafka may mark partitions as failed are:
- Network Issues: Issues such as network failures, higher latency, intermittent connectivity issues, or network partitioning.
- Hardware issues: Issues in the underlying hardware of a broker may cause that broker to malfunction.
- Storage and Disk-related issues: Storage errors or disks running out of space can interrupt replication. Partitions marked as failed due to storage issues are eventually removed from the Failed Partitions set once the affected broker’s log directory is taken offline or the disk issues are resolved.
- Broker failures: Issues with the broker process, such as crashes, resource crunch, other blocking errors on the broker, etc.
- Configuration issues: You can tune and configure more than a hundred parameters supported by Kafka. Any misconfiguration in these parameters can cause replication issues.
- Other unexpected errors: Any other errors may prevent smooth replication for the partitions. You can monitor the Kafka broker’s logs to understand the underlying issues.
Monitoring and detecting Failed Partitions in Kafka
Fortunately, Kafka provides several key metrics and logs you can monitor to detect such partition failures. Let’s look at some of the most efficient ways to monitor for these issues:
1. The Kafka Server Logs
The Kafka broker’s logs are the most obvious place to monitor for issues. You should regularly monitor the broker’s logs and look for errors and warnings related to replication and partition failure errors.
Example:
tail -f /var/log/kafka/server.log | grep -i "marked as failed"
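As an illustration, here is a small sketch that extracts the affected topic-partitions from such log lines. The sample log file, its path, and the topic name are hypothetical; in production you would point the grep at your actual server.log:

```shell
# Hypothetical sample broker log (illustrative only); in production,
# read /var/log/kafka/server.log instead.
cat > /tmp/server.log.sample <<'EOF'
[2024-01-01 12:00:00,000] WARN [ReplicaFetcher replicaId=2] Partition orders-3 marked as failed (kafka.server.ReplicaFetcherThread)
[2024-01-01 12:00:01,000] INFO [Log partition=orders-3] Truncating to offset 42
EOF

# Pull out just the topic-partition names that were marked as failed
grep -i "marked as failed" /tmp/server.log.sample \
  | sed -E 's/.*Partition ([^ ]+) marked as failed.*/\1/' \
  | sort -u
```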
2. Metrics to monitor
Kafka offers several metrics that you can monitor. You should monitor these critical metrics to detect any replication issues or failed partitions:
- UnderReplicatedPartitions: Partitions where the number of in-sync replicas is less than the minimum configured for the broker or the topic. This indicates that some replicas may be out of sync or unavailable. If ignored, under-replicated partitions can lead to data loss should additional replicas fall behind and drop out of sync.
- ISR Shrinks and Expands: ISR (In-Sync Replica) shrinks happen when replicas of a partition fall out of sync with the leader, reducing the number of replicas in the ISR list and increasing the risk of data loss. Similarly, ISR expansion happens when out-of-sync replicas catch up with the leader and rejoin the ISR list. Kafka offers the IsrShrinksPerSec and IsrExpandsPerSec metrics, which you can use to monitor these changes.
- OfflinePartitionsCount: This metric tracks the number of partitions without an active leader. It can help you identify partitions that have become unavailable.
- ReplicationLag: The ReplicationLag metric tracks replicas that have fallen significantly behind the leader. This can help you identify unhealthy partitions.
- LogFlushTimeMs: If the LogFlushTimeMs metrics are increasing, it could indicate an underlying issue with the disk I/O or any other problems affecting the replication process.
- Network and Disk I/O: Monitor network and disk performance metrics to detect possible bottlenecks.
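Beyond JMX, some of these conditions can be checked directly from the command line with kafka-topics.sh. A minimal sketch, assuming a Kafka installation under /opt/kafka and a broker listening on localhost:9092 (both the install path and the address are assumptions; adjust them for your environment):

```shell
KAFKA_HOME=${KAFKA_HOME:-/opt/kafka}   # assumed install path

if [ -x "$KAFKA_HOME/bin/kafka-topics.sh" ]; then
  # Partitions whose ISR is smaller than the full replica set
  "$KAFKA_HOME/bin/kafka-topics.sh" --bootstrap-server localhost:9092 \
    --describe --under-replicated-partitions

  # Partitions that currently have no active leader
  "$KAFKA_HOME/bin/kafka-topics.sh" --bootstrap-server localhost:9092 \
    --describe --unavailable-partitions
else
  echo "kafka-topics.sh not found under $KAFKA_HOME/bin"
fi
```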
Actions to take to address Failed Partitions
When you encounter failed partitions in Kafka, here are some actions you can take to address them:
Identifying the cause
The first step is to identify why the partitions are marked as failed. You can check the brokers’ logs to determine the errors. Identify if the issue is related to the network, storage, hardware, configuration, or any other issue.
Try restarting the affected broker
If a broker is not functioning properly, a restart can often help fix it. More importantly, the broker’s startup logs can often provide valuable insights into what is causing the issue.
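As a sketch, assuming the broker runs as a systemd service named kafka (the unit name is an assumption and varies by installation):

```shell
# Restart the broker only if a "kafka" systemd unit exists on this host;
# the unit name is an assumption and may differ in your setup.
if systemctl list-units --type=service 2>/dev/null | grep -q kafka; then
  sudo systemctl restart kafka
  # Review the startup logs for clues about the underlying failure
  journalctl -u kafka --since "5 minutes ago" --no-pager | tail -n 50
else
  echo "no kafka systemd unit found on this host"
fi
```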
Rebalancing the cluster
If the issue persists and you need to take the affected broker offline, you can rebalance and redistribute its partitions to healthy brokers. Use the kafka-reassign-partitions.sh tool to perform this operation.
bin/kafka-reassign-partitions.sh --zookeeper zookeeper.socketdaddy.io:2181 --reassignment-json-file ${path_to_reassignment_file} --execute
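For reference, the file passed via --reassignment-json-file is a JSON document listing the target replica set for each partition. A hypothetical example that assigns two partitions of a topic named orders to brokers 1 and 2 (the topic name and broker IDs are illustrative):

```shell
# Write a hypothetical reassignment plan; topic and broker IDs are examples
cat > /tmp/reassignment.json <<'EOF'
{
  "version": 1,
  "partitions": [
    { "topic": "orders", "partition": 0, "replicas": [1, 2] },
    { "topic": "orders", "partition": 1, "replicas": [2, 1] }
  ]
}
EOF
```

After running the tool with --execute, re-run it with the same file and --verify to confirm the reassignment completed. Note that recent Kafka versions take --bootstrap-server instead of the deprecated --zookeeper flag.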
Read also: Reset Kafka Offsets For a Partition: Quick and Easy
Fine-tune broker configurations
Optimize broker configurations for your environment. Review settings like replica.fetch.max.bytes, num.replica.fetchers, and network configurations to improve replication performance.
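For example, a hypothetical server.properties fragment tuning replication (the values shown are illustrative starting points, not recommendations; test changes in a staging cluster first):

```properties
# Number of fetcher threads replicating from each source broker;
# raising this can increase replication parallelism
num.replica.fetchers=4

# Maximum bytes fetched per partition per replication request
replica.fetch.max.bytes=1048576

# How long a replica may lag behind the leader before it is
# dropped from the ISR
replica.lag.time.max.ms=30000
```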
Conclusion
Automated monitoring of, and recovery from, failed partitions is essential to administering a Kafka cluster. By closely monitoring your system for issues and addressing them proactively, you can ensure your cluster’s data integrity and availability.