Introduction
Apache Kafka has been one of the most popular technologies for messaging, streaming and event processing for a while now. However, especially in a microservices architecture, things can go wrong: downstream services can become unavailable, a breaking change can occur, or a power outage can happen. How can we ensure that these incidents don’t lead to data loss and that all messages are eventually processed? Let’s take a look at three possible ways.
Environment setup
For our environment setup we will use Docker. Docker is a container runtime. Containers are running instances of images: packages of code and dependencies in a single executable unit. This enables applications to run consistently from one environment to another. Furthermore, it is an easy way to spin up your required dependencies in one go without needing to install every dependency locally! Docker Compose simplifies this even further by letting you define all your containers and their configuration in a single file and deploy them with a single command.
Let’s first set up the required components: Zookeeper, Kafka and AKHQ. Zookeeper is responsible for things like managing the list of brokers, leader election for partitions, and sending notifications to Kafka when changes occur. Kafka is needed for producing and consuming our messages, and AKHQ is an open source tool that can be used to inspect the Kafka cluster. Since Kafka needs Zookeeper to run, Zookeeper has to be started first, which is why it is defined first in the docker-compose.yml.
To start all the containers in the docker-compose.yml, execute the following command in the directory where the file is located: docker-compose up -d
docker-compose.yml
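The full docker-compose.yml can be found in the code repository linked at the end of this article. A minimal sketch of what it could look like is shown below; the exact images, versions and listener setup are assumptions, but the connection name docker-kafka-server and the ports 8080 (AKHQ) and 9092 (Kafka) match what is used in the rest of this article.

```yaml
version: '3'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.3.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000

  kafka:
    image: confluentinc/cp-kafka:7.3.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"   # reachable from the host as localhost:9092
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  akhq:
    image: tchiotludo/akhq:latest
    depends_on:
      - kafka
    ports:
      - "8080:8080"   # AKHQ UI at localhost:8080
    environment:
      # "docker-kafka-server" is the connection name that shows up in the AKHQ URL
      AKHQ_CONFIGURATION: |
        akhq:
          connections:
            docker-kafka-server:
              properties:
                bootstrap.servers: "kafka:29092"
```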
Error handling with fixed backoff
Now that we have our environment up and running, let’s start with a simple retry mechanism: the fixed backoff. As the name suggests, a fixed backoff retries the message a fixed number of times, with a fixed delay in between. After that, the message is not retried again. Optionally, the message can be posted to a dead letter topic and processed later.
In the sample below, the message is retried three times and is sent to the dead letter topic after the retry count is exhausted.
application.yml
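The exact application.yml is in the code repository; a minimal sketch, assuming plain String keys and values and a broker on localhost:9092 (the group id is a made-up example), could look like this:

```yaml
spring:
  kafka:
    bootstrap-servers: localhost:9092
    consumer:
      group-id: error-handling-sample        # assumed group id
      auto-offset-reset: earliest
      key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
      value-deserializer: org.apache.kafka.common.serialization.StringDeserializer
    producer:
      key-serializer: org.apache.kafka.common.serialization.StringSerializer
      value-serializer: org.apache.kafka.common.serialization.StringSerializer
```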
Configuration
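The interesting part of the configuration class is the error handler bean, which combines a fixed backoff with a dead letter publishing recoverer. A sketch, assuming the class is called KafkaConfig (as mentioned later in this section) and a one-second interval between the three retries:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaOperations;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
public class KafkaConfig {

    @Bean
    public DefaultErrorHandler errorHandler(KafkaOperations<String, String> template) {
        // After the retries are exhausted, publish the failed record to the dead letter topic
        DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template);
        // Retry three times with a fixed one-second delay between attempts (values are assumptions)
        return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3L));
    }
}
```

With recent Spring Boot versions a CommonErrorHandler bean like this is picked up automatically by the auto-configured listener container factory; otherwise it has to be set on the factory explicitly.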
Listener
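The listener simply fails when it cannot process a record. Exactly how the sample project triggers the failure is an assumption here; the sketch below throws a plain Exception for null (or the literal "null") values, which produces the java.lang.Exception: null shown further down:

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class MyTopicListener {

    @KafkaListener(topics = "my-topic")
    public void listen(String message) throws Exception {
        // Simulate a processing failure; the error handler will retry and
        // eventually publish the record to the dead letter topic.
        if (message == null || message.equals("null")) {
            throw new Exception(message);
        }
        System.out.println("Received: " + message);
    }
}
```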
Let’s verify that this is the case. In the docker-compose.yml we have defined that AKHQ is served at localhost:8080 and that the Kafka broker is reachable on port 9092. So let’s start the application and open our browser at http://localhost:8080/ui/docker-kafka-server/topic
You should see the following page. Double-click on my-topic in the name column.
The following page shows up.
Click on "Produce to topic". In the value field, type null and click "Produce".
You will see the following exception three times in your stack trace:
Caused by: java.lang.Exception: null
Reopen http://localhost:8080/ui/docker-kafka-server/topic. After the retries were exhausted, Spring created the dead letter topic (because it did not exist yet) and published the failed message to it.
Note that the .DLT suffix is the default, but this can be overridden by configuring your own destination resolver. Consult the following Spring documentation on how to implement your custom destination resolver: https://docs.spring.io/spring-kafka/docs/current/reference/html/#dead-letters
The retry mechanism, as well as the publishing of the failed message to the dead letter topic, is configured in the error handler bean in the KafkaConfig class.
Error handling with exponential backoff
While the fixed backoff is better than endlessly retrying a message, it is still a relatively naive approach. It is quite likely that the downstream system has not recovered after three quick retries. Think of a service whose circuit breaker has opened and is not accepting traffic at the moment, or simply a deployment gone wrong that needs to be rolled back. We would therefore want to increase the delay between retries instead of retrying our failed messages in quick succession.
Enter the exponential backoff!
In general, the application.yml and the listener stay the same; only the error handler in the configuration is different. The error handler below keeps retrying for up to a minute, starting with an initial delay of three seconds and doubling the delay after each attempt. Effectively this leads to five retries, after which the message is sent to the dead letter topic, just as in the fixed backoff example.
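A sketch of what this error handler could look like, using the values from the description above (the actual bean definition lives in the repository linked at the end):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaOperations;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.ExponentialBackOff;

@Configuration
public class KafkaConfig {

    @Bean
    public DefaultErrorHandler errorHandler(KafkaOperations<String, String> template) {
        // Start with a 3 second delay and double it after every failed attempt
        ExponentialBackOff backOff = new ExponentialBackOff(3000L, 2.0);
        // Give up (and let the recoverer publish to the dead letter topic) after about a minute
        backOff.setMaxElapsedTime(60_000L);
        return new DefaultErrorHandler(new DeadLetterPublishingRecoverer(template), backOff);
    }
}
```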
Non-blocking retries
While the exponential backoff is an improvement, there is still a problem: if we have a very wide retry window, the rest of our messages are blocked until the failed message has been processed. Is it possible to have the above-mentioned exponential backoff without blocking the rest of our messages? It is, thanks to Spring Kafka’s RetryableTopic annotation.
application.yml
Configuration
Listener
Before we take a look at the RetryableTopic annotation, it is worth mentioning that the error handler bean in the configuration is no longer necessary. This is because both the retry logic and the dead letter strategy are handled by the RetryableTopic annotation.
The RetryableTopic annotation has a lot of options, and there are even more to configure. Take a look at the Spring documentation for the other available options: https://docs.spring.io/spring-kafka/docs/current/api/org/springframework/kafka/annotation/RetryableTopic.html
The attempts property controls the total number of processing attempts; each retry attempt after the initial delivery on the main topic gets its own retry topic. The backoff is similar to what we saw before, with a delay and a multiplier. The number of partitions determines into how many parts each topic is sliced. The topic suffix strategy determines the suffix of the retry topics; in this case we use the index value. The DLT strategy defines what should happen when processing from the dead letter topic itself fails, and lastly, exclude lists exceptions that should not be retried at all. In this example these are exceptions that occur during serialisation and deserialisation, but the list can be complemented with functional errors that are unlikely to be recoverable.
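Putting this together, the annotated listener could look roughly like the sketch below. The concrete values (number of attempts, delay, partition count, excluded exceptions) are assumptions based on the description above and on the six retry topics shown in the next step; the actual values are in the code repository.

```java
import org.apache.kafka.common.errors.SerializationException;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.annotation.RetryableTopic;
import org.springframework.kafka.retrytopic.DltStrategy;
import org.springframework.kafka.retrytopic.TopicSuffixingStrategy;
import org.springframework.kafka.support.serializer.DeserializationException;
import org.springframework.retry.annotation.Backoff;
import org.springframework.stereotype.Component;

@Component
public class MyTopicListener {

    @RetryableTopic(
            attempts = "7",                     // assumed: 7 attempts in total -> 6 retry topics
            backoff = @Backoff(delay = 3000, multiplier = 2.0),
            numPartitions = "3",                // assumed partition count for the retry topics
            topicSuffixingStrategy = TopicSuffixingStrategy.SUFFIX_WITH_INDEX_VALUE,
            dltStrategy = DltStrategy.FAIL_ON_ERROR,
            exclude = { SerializationException.class, DeserializationException.class })
    @KafkaListener(topics = "my-topic")
    public void listen(String message) throws Exception {
        // Same failure simulation as before: a null value cannot be processed
        if (message == null || message.equals("null")) {
            throw new Exception(message);
        }
        System.out.println("Received: " + message);
    }
}
```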
Now let’s start the application and go to AKHQ.
What is different in the topic list compared to the previous samples?
Spring Kafka just created six retry topics next to the main topic and the dead letter topic. On every retry attempt the message is put on the next retry topic, so that the main topic is not blocked and other messages can be processed. This is great, since errors can have a wide variety of causes, and it is entirely possible that some messages can be processed successfully while others cannot.
Post another null message on the main topic and you will see the message travelling from retry topic to retry topic and eventually arriving in the dead letter topic.
Conclusion
It should be clear that the RetryableTopic annotation provides the most robust retry logic. Keep in mind, though, that even though the main topic is not blocked, the retry topics can still get congested when a lot of messages come in. To solve this you could increase the partition count, so that the retried messages are divided over more partitions. Another thing to consider is that the Kafka environment is sometimes managed by an external team that maintains its configuration. For the RetryableTopic annotation to work, however, the application needs to be able to create the retry topics automatically. This is not always allowed: from a Kafka perspective, auto-creating topics is not considered good practice, because topic creation often needs to be tuned and auto-creation is generally seen as a way to avoid that. If you cannot get the required permissions, you probably want to use the ExponentialBackOff instead. In that case, do not block the main topic for too long: once the retries are exhausted, send the message to the dead letter topic. Optionally, you could define a @DltHandler method to try to reprocess the message from the dead letter topic.
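A minimal sketch of such a handler, which would live in the same class as the @RetryableTopic listener (the logging is just an example of what you could do there):

```java
import org.springframework.kafka.annotation.DltHandler;
import org.springframework.kafka.support.KafkaHeaders;
import org.springframework.messaging.handler.annotation.Header;

public class MyTopicListener {

    // ... the @RetryableTopic / @KafkaListener method shown earlier ...

    // Invoked for every record that ends up on the dead letter topic
    @DltHandler
    public void handleDlt(String message, @Header(KafkaHeaders.RECEIVED_TOPIC) String topic) {
        System.out.println("Received '" + message + "' on dead letter topic " + topic);
        // Here you could try to reprocess the message, alert someone, or store it for later inspection.
    }
}
```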
Code repository
https://github.com/sourcelabs-nl/spring-kafka-error-handling