Apache Kafka Storage Architecture | Kafka Streams Series -1
TL;DR | This is the first article of the Kafka Streams series which I’m going to cover Kafka and their APIs. The main focus is this series going to be the Kafka Streams. At the end of the series, I’m going to demonstrate Apache Kafka Streams which I wrote with Java. To talk about this article; covers the fundamentals of Apache Kafka and gives you an idea to imagine how Kafka works fast. With that, I will talk about its failover system.
I got confused when I decided to explain something about Apache Kafka. Because I had lots of stuff to show you all, so I wasn’t determined about a road map. Finally, I made up my mind. This series going to throw you inside the Apache Kafka. We’re going to understand the deep of Apache Kafka and end of the series we will implement it to understand their API’s such as Producer, Consumer, and Streams.
It’s significant to get to know Apache Kafka storage structure, cluster structure and work distribution structure to handle problems we will face. As far as I see, Kafka architectures are too complicated. But to solve Kafka problems is through to learn that are architectures. So, first we are going to start to come to know Kafka Storage Architecture.
Before we start
I assume that you know Apache Kafka fundamentals. This point isn’t convenient for you if you don’t have any idea about Apache Kafka. I can give some reference to those who need to realize the fundamentals of Apache Kafka.
This article gives you come core concepts about Apache Kafka : here.
And then, you should check this article to get fundamentals.
What are Kafka’s abilities based on?
The Kafka’s power come it’s distributed working, and it provides a horizontal scalable system. It’s allow us to share topic messages into hundreds or thousands partitions to thousands of server. That operations pretty easy if a topic come into your mind as a single database table. But, what if it isn’t? Of course, it’s not! It’s more complicated than a simple database’s table or a simple log file. So, it’s difficult to work distributed and share chunk of information between partitions and machines.
Before the details, let’s know the keywords
Topic
You are already know, Apache Kafka store the messages in topic. But I want to internalize for you, I’d like to show you a physical Kafka Topic which is the logical name group your messages. When we create a simple topic, it means we need a special structural log file to store our messages. And also we are waiting to that structural log files gives us some flexibility and scalability, because our messages size is extremely large. And you can see in the below, what makes the topic log files different from simple log files.
Kafka Topics stores millions message of data. Do you believe that able to store millions messages to read in a simple log file even if it has been indexed? Sure not, Kafka provides partitions for the topics.
Partitions
You may get the partition into your mind as a directory. It means, Kafka creates a separate directory for each topic partition. The needed partition count can be given as a parameter when creating a new topic. If you determined n partition count, Kafka going to create n directories for that topic.
We know that; Kafka wants to be safe side, high fault tolerance and reliable system. And that’s why Kafka caught our attention. To develop a high fault tolerance system we should have high available backup or second address. At this point, the keyword is replication factor.
Replication Factor
The comrade of partitions. Like partition, you can define the number of it when creating a new topic. I called it the comrade of partition because it’s responsible to determine the number of partition copy. Calculating is quite simple. Here is formula.
Number of Replicas(15) = Partitions(5) x Replication(3)
In this case, 3 replicas stores in 3 brokers. Each broker have 5 partitions. In total, we have 15 directories. That directories are called partition replica.
All directories are uses between available brokers on the processing time. Because Kafka works distributed and as I said before Kafka is also responsible to handle a high availability problem. The replication factor gives a power to handle that’s problem. But not just for high availability, it gives also more flexibility such as tolerance of the faults and at-least-once principle power.
Segment
Segments are log files that includes messages to store in each partition. There are more than one segment can exist in a partition. The messages write in the segments, and they are having been cleared when segment size arrive at maximum. The default segment size is 1 GB, and it can be configurable. Here is the view of segments included in broker.
Offset
Given a name which is 64 bit long to every segment. This value called offset.
The offset is a value that is increment by 1 for each message. While for the first message is like 0000000 and then the next value will be 0000001. When arrives at the end of the segment limit such as 923652372 offset value, then creates a new segment which is started 923652372 value, and it’s incremented by 1. This new segment get to name it a new offset, in this case 923652373.
Because of that, if you want to get a specific message in the broker you need to know topic name, partition number and offset number. Recall, Kafka works distributed.
Digging deeper into the Kafka Storage Architecture
For now, we covered the something keywords about the Apache Kafka. If everything is well for you, we can continue to cover how Kafka handle failover. The one of the Kafka mission is the significant to deal under high messages load and to be highly available at the same time to do that. To know it, we should dig the lower level of Kafka.
Leader and Follower Partitions
We may classify partitions in two groups. Leader and follower partitions. Follower partitions are consists of n copies of the leader partitions. As the name suggests, the follower partitions should be up to date with the leader it depends on. If I need to internalize, assume that we are to create a topic with the command in the below.
After executes this command, we get a topic which has 5 partitions which imagine like directories, and 3 replicas. That means, Kafka replicate each of 5 partitions in the 3 broker. And each partition has a leader partition and zero or more follower partitions. The number of follower partitions depends on the replication factor. I’m going to cover ISR list on the next article but for now I can say, Kafka provides a list in the internal which collect partition's status and availability check called ISR list. All of that, follower partitions, leader partitions and ISR list give a power high failover system to the Kafka.
Conclusion
We learned how Kafka logically organized. Apache Kafka keeps millions of messages in punch of log files. It’s known segments. These segments come together in the partitions which is the directory of brokers. These directories i.e. partitions, can be replicated on the more than one broker. This replication gives us fault-tolerance and highly available system.
Thanks for interested in and reading it. I’m going to talk about Kafka Cluster Architecture at the next article. See you at the next time.