Apache Kafka Storage Architecture | Kafka Streams Series - 1

TL;DR | This is the first article of the Kafka Streams series, in which I’m going to cover Kafka and its APIs. The main focus of the series is Kafka Streams, and at the end of it I’m going to demonstrate a Kafka Streams application I wrote in Java. This article covers the fundamentals of Apache Kafka storage, gives you an idea of why Kafka is fast, and talks about its failover mechanism.

When I decided to write about Apache Kafka, I wasn’t sure where to start. I had lots of material to show you, so at first I didn’t have a road map. Finally, I made up my mind: this series is going to take you deep inside Apache Kafka, and by the end of it we will write code against its APIs such as Producer, Consumer, and Streams.

It’s important to get to know Kafka’s storage architecture, cluster architecture, and work-distribution architecture in order to handle the problems we will face. At first glance these architectures look complicated, but the way to solve Kafka problems is to learn them. So, first we are going to get to know the Kafka storage architecture.

Before we start

This article gives you some core concepts about Apache Kafka: here.
Then you should check this article to get the fundamentals.

What are Kafka’s abilities based on?

Before the details, let’s get to know the key terms.

Topic

Initial topic files

A Kafka topic stores millions of messages. Would you be able to read millions of messages efficiently from a single log file, even if it were indexed? Surely not, and that is why Kafka splits each topic into partitions.
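To make this concrete, here is a minimal producer sketch in Java. The topic name "orders", the key "customer-42", and the broker address are just my placeholder assumptions for illustration; the point is that the default partitioner hashes the record key, so a topic’s messages get spread across its partitions instead of piling up in one log file.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class PartitionedProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The default partitioner hashes the record key, so messages with the
            // same key always land in the same partition of the hypothetical "orders" topic.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("orders", "customer-42", "order created");
            RecordMetadata metadata = producer.send(record).get();
            System.out.printf("Stored in partition %d at offset %d%n",
                    metadata.partition(), metadata.offset());
        }
    }
}
```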

Partitions

We know that Kafka aims to be a safe, highly fault-tolerant, and reliable system, and that’s why Kafka caught our attention. To build a highly fault-tolerant system we need a highly available backup, a second place to read from. At this point, the keyword is the replication factor.

Replication Factor

Number of replicas (15) = partitions (5) × replication factor (3)

In this case, the 3 replicas are stored on 3 brokers, and each broker holds 5 partitions. In total, we have 15 directories on disk; these directories are called partition replicas.
All of these directories are spread across the available brokers at runtime, because Kafka is a distributed system and, as I said before, Kafka is also responsible for solving the high-availability problem. The replication factor gives it the power to handle that problem. And not just high availability: it also brings more flexibility, such as fault tolerance and support for the at-least-once delivery guarantee.
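As a rough sketch, a topic like this could be created programmatically with the Java AdminClient. The topic name and broker address below are my own placeholder values; what matters is that 5 partitions with a replication factor of 3 produce the 15 partition replicas we just counted.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 5 partitions x replication factor 3 = 15 partition replicas,
            // spread across the brokers of the cluster.
            NewTopic topic = new NewTopic("orders", 5, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```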

Segment

Segments

Offset

Because of that, if you want to fetch a specific message from a broker, you need to know the topic name, the partition number, and the offset. Recall that Kafka is a distributed system.
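Here is a minimal consumer sketch showing that addressing scheme in Java, again assuming the hypothetical "orders" topic and a local broker: we assign one partition explicitly and seek to a specific offset before polling.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReadByOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "offset-demo");              // hypothetical group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Address a single message by topic name + partition number + offset.
            TopicPartition partition = new TopicPartition("orders", 2);
            consumer.assign(Collections.singletonList(partition));
            consumer.seek(partition, 1234L);

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}
```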

Digging deeper into the Kafka Storage Architecture

Leader and Follower Partitions

After creating such a topic (5 partitions, replication factor 3), we get a topic with 5 partitions, which you can imagine as directories, and 3 replicas. That means Kafka replicates each of the 5 partitions across 3 brokers. Each partition has one leader replica and zero or more follower replicas; the number of followers depends on the replication factor. I’m going to cover the ISR list in the next article, but for now I can say that Kafka keeps an internal list, called the in-sync replica (ISR) list, that tracks each partition’s status and availability. Together, leader partitions, follower partitions, and the ISR list give Kafka its strong failover capability.
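If you want to see the leader, the follower replicas, and the ISR list for each partition yourself, the Java AdminClient can describe the topic. This is only a sketch with my placeholder topic name and broker address, and it assumes a recent kafka-clients version that provides allTopicNames():

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class DescribeTopicReplicas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin
                    .describeTopics(Collections.singletonList("orders"))
                    .allTopicNames().get()
                    .get("orders");

            // For each partition: which broker leads it, which brokers hold
            // replicas, and which replicas are currently in sync (ISR).
            for (TopicPartitionInfo partition : description.partitions()) {
                System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                        partition.partition(),
                        partition.leader(),
                        partition.replicas(),
                        partition.isr());
            }
        }
    }
}
```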

Conclusion

Thanks for your interest and for reading. I’m going to talk about the Kafka cluster architecture in the next article. See you next time.

Software Engineer @Nesine.com www.kodadam.net
