A Brief Introduction to Kafka
Kafka was originally developed by LinkedIn. It is a distributed messaging system that supports partitioning and replication and is coordinated by ZooKeeper. Its biggest strength is that it can process large amounts of data in real time, which suits a wide range of scenarios: Hadoop-based batch processing systems, low-latency real-time systems, Storm/Spark streaming engines, web/nginx logs, access logs, message services, and so on.
Kafka is written in Scala and Java. LinkedIn open-sourced it and contributed it to the Apache Software Foundation in 2011, and it has since become a top-level open source project.
Kafka Terminology
Before we start, here are some terms you need to know.
- Producer: The creator of messages, and the entry point of messages into Kafka.
- Broker: A broker is a Kafka instance. A server can run one or more Kafka instances; for simplicity, let's assume that each broker corresponds to one server. Every broker in a Kafka cluster has a unique number, such as broker-0, broker-1, etc.
- Topic: A category of messages. Kafka data is stored in topics, and multiple topics can be created on each broker.
- Partition: A subdivision of a topic. Each topic can have multiple partitions. Partitions spread the load and improve Kafka's throughput. Within a topic, data is not duplicated across partitions, and on disk each partition is its own folder.
- Replication: Each partition can have multiple replicas, which serve as backups. When the main replica (the Leader) fails, one of the backups (a Follower) is elected as the new leader. The number of replicas cannot be greater than the number of brokers. A follower and its leader are always on different machines, and a machine stores at most one replica (the leader included) of a given partition.
- Message: The body of each sent message.
- Consumer: The reader of messages, and the exit point of messages from Kafka.
- Consumer Group: Multiple consumers can be combined into a consumer group. By design, the data in a given partition can be consumed by only one consumer within a consumer group. Consumers in the same group can consume data from different partitions of the same topic, which also improves Kafka's throughput.
- ZooKeeper: The Kafka cluster relies on ZooKeeper to store the cluster's metadata and to ensure the availability of the system.
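The replica rules from the Replication entry above can be illustrated with a small sketch. It assumes a simple round-robin placement; the function name `assign_replicas` is hypothetical, and Kafka's real assignment logic is more involved. The point is only that each replica of a partition lands on a different broker and that the replication factor cannot exceed the broker count.

```python
def assign_replicas(num_partitions, replication_factor, broker_ids):
    """Place each partition's replicas on distinct brokers (simplified round-robin).

    Hypothetical helper for illustration; not Kafka's actual algorithm.
    """
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for partition in range(num_partitions):
        # Take consecutive brokers, wrapping around; the first one acts as leader.
        assignment[partition] = [
            broker_ids[(partition + r) % len(broker_ids)]
            for r in range(replication_factor)
        ]
    return assignment


if __name__ == "__main__":
    for partition, replicas in assign_replicas(3, 2, [0, 1, 2]).items():
        print(f"partition {partition}: leader=broker-{replicas[0]}, followers={replicas[1:]}")
```

Because the replication factor is at most the number of brokers, the modulo step never picks the same broker twice for one partition.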
The producer uses push mode to publish data to the broker. Each message is appended to a partition and written to disk sequentially, so messages within the same partition are guaranteed to be in order.
As mentioned above, data is written to different partitions. So why does Kafka need partitions? Their main purposes are:
- Better scalability. Because a topic can have multiple partitions, we can cope with a growing amount of data by adding machines.
- Better concurrency. With the partition as the unit of reads and writes, multiple consumers can consume data at the same time, which improves the efficiency of message processing.
This is similar to Server Load Balancing (SLB): when we send a request, the load balancer may distribute the traffic across different servers. In Kafka, if a topic has multiple partitions, how does the producer know which partition to send the data to? Kafka follows a few rules:
- The producer can specify the target partition when writing. If a partition is specified, the message is written to that partition.
- If no partition is specified but the message has a key, the partition is chosen by hashing the key.
- If neither a partition nor a key is specified, a partition is chosen in round-robin fashion.
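The three rules above can be sketched as a small class. This is a minimal illustration, not the Kafka client's code: the class name `SimplePartitioner` is hypothetical, and `zlib.crc32` stands in for whatever hash function the real client uses.

```python
import itertools
import zlib


class SimplePartitioner:
    """Hypothetical partitioner that mirrors the three rules in the text."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = itertools.count()  # drives the round-robin fallback

    def choose(self, partition=None, key=None):
        if partition is not None:
            # Rule 1: an explicitly specified partition wins.
            return partition
        if key is not None:
            # Rule 2: hash the key (crc32 is an assumption made for this sketch).
            return zlib.crc32(key) % self.num_partitions
        # Rule 3: no partition and no key, so rotate through the partitions.
        return next(self._counter) % self.num_partitions


if __name__ == "__main__":
    p = SimplePartitioner(3)
    print(p.choose(partition=2))
    print(p.choose(key=b"user-42"))
    print([p.choose() for _ in range(4)])
```

Hashing the key means that all messages with the same key go to the same partition, so they keep their relative order.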
Not losing messages is the basic guarantee of any message queue middleware. How can the producer make sure that messages are not lost when writing to Kafka? The answer is the ACK (acknowledgement) mechanism. When the producer writes data, a parameter (acks) determines whether, and how strictly, Kafka must confirm that it has received the data. This parameter can be set to 0, 1, or all.
- 0 means that the producer does not wait for any response from the cluster after sending data, so there is no guarantee that the message was delivered. This is the least safe but the most efficient setting.
- 1 means that the producer can send the next message as soon as the leader responds. This only guarantees that the leader has received the data.
- all means that the producer sends the next message only after all in-sync followers have finished synchronizing from the leader. This guarantees that the leader has received the data and that all in-sync replicas have a copy. It is the safest but the least efficient setting.
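To make the three settings concrete, here is a minimal sketch of the decision each one implies. The function `is_acknowledged` and its arguments are hypothetical; the real broker and client logic (timeouts, retries, in-sync replica tracking) is much richer.

```python
def is_acknowledged(acks, leader_written, followers_synced):
    """Return True when the producer may treat a send as complete.

    acks             -- "0", "1" or "all" (integers 0 and 1 are accepted too)
    leader_written   -- whether the leader has stored the message
    followers_synced -- one boolean per in-sync follower
    """
    acks = str(acks)
    if acks == "0":
        # Fire and forget: the producer never waits for a response.
        return True
    if acks == "1":
        # Only the leader has to confirm the write.
        return leader_written
    if acks == "all":
        # The leader and every in-sync follower must have the message.
        return leader_written and all(followers_synced)
    raise ValueError(f"unsupported acks value: {acks!r}")


if __name__ == "__main__":
    for setting in ("0", "1", "all"):
        print(setting, is_acknowledged(setting, True, [True, False]))
```

With one lagging follower, only the all setting makes the producer wait, which is exactly the trade-off between safety and efficiency described above.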
Structure of Partition
Each topic can be divided into one or more partitions. If the topic feels abstract, the partition is more concrete: on the server, each partition is a folder. Each partition folder contains multiple groups of segment files, and each group consists of three files: a .index file, a .log file, and a .timeindex file (not available in earlier versions). The .log file is where the messages are actually stored, while the .index and .timeindex files are index files used to retrieve messages.
As shown in the figure above, this partition has three groups of segment files. Each log file has the same size, but the number of messages stored is not necessarily equal, because message sizes vary. Each file is named after the minimum offset in its segment. For example, 000.index covers messages with offsets 0~368795. Kafka uses segmentation plus indexing to make lookups efficient.
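The naming rule can be shown in a few lines. On disk, Kafka writes the base offset as a 20-digit zero-padded number (the text abbreviates this as 000); the helper name `segment_file_names` is hypothetical.

```python
def segment_file_names(base_offset):
    """Build the three file names of a segment from its minimum (base) offset."""
    stem = f"{base_offset:020d}"  # zero-padded to 20 digits
    return [f"{stem}.index", f"{stem}.log", f"{stem}.timeindex"]


if __name__ == "__main__":
    for base in (0, 368796):
        print(segment_file_names(base))
```

Because the names sort in offset order, the segment that contains a given offset can be located by comparing the offset against the file names alone.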
Structure of Message
As mentioned above, the log file is where messages are actually stored, and the producer writes messages to Kafka one by one. What does a message stored in the log look like? A message mainly includes the message body, message size, offset, compression type, and so on. The three key things we need to know are:
- Offset: An ordered 8-byte id that uniquely identifies the position of each message within the partition.
- Message size: A 4-byte field that describes the size of the message.
- Message body: The actual message data (possibly compressed). The space it occupies varies from message to message.
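As a rough illustration of these three fields, the sketch below packs an 8-byte offset and a 4-byte size in front of the body. It is a simplified layout assumed for this example, not Kafka's actual on-disk record format, which carries more fields; `encode_message` and `decode_message` are hypothetical names.

```python
import struct

# 8-byte signed offset followed by a 4-byte signed size, big-endian, no padding.
HEADER = struct.Struct(">qi")


def encode_message(offset, body):
    """Serialize one message as offset + size + body."""
    return HEADER.pack(offset, len(body)) + body


def decode_message(buf, position=0):
    """Read one message at position; return (offset, body, next_position)."""
    offset, size = HEADER.unpack_from(buf, position)
    start = position + HEADER.size
    return offset, buf[start:start + size], start + size


if __name__ == "__main__":
    log = encode_message(368800, b"hello") + encode_message(368801, b"kafka")
    position = 0
    while position < len(log):
        offset, body, position = decode_message(log, position)
        print(offset, body)
```

The size field is what lets a reader skip from one message to the next without parsing the body.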
Once a message is stored in the log file, consumers can consume it. When discussing the two modes of message queue communication, we covered the point-to-point mode and the publish-subscribe mode. Kafka uses a pull model: consumers actively pull messages from the Kafka cluster, and within a single consumer group it behaves like the point-to-point mode, because each message is handled by only one consumer in the group. Like producers, consumers talk to the partition leader when they pull messages.
Multiple consumers can form a consumer group, and each consumer group has a group id. Consumers in the same group can consume data from different partitions of the same topic, but no two consumers in a group consume data from the same partition.
The picture shows the case where the consumer group has fewer consumers than there are partitions. In that case some consumers handle several partitions, and they consume more slowly than consumers that handle only one partition. If the number of consumers in the group exceeds the number of partitions, will multiple consumers consume data from the same partition?
As mentioned above, this does not happen. The extra consumers are not assigned any partition and stay idle. Therefore, in practice, it is recommended that the number of consumers in a consumer group match the number of partitions.
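Both cases, fewer consumers than partitions and more consumers than partitions, can be seen in a small sketch. It assumes a plain round-robin assignment, and `assign_partitions` is a hypothetical helper; Kafka's real assignment strategies are configurable and more elaborate.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer of the group (round-robin)."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        # Each partition goes to one consumer only; consumers may get several.
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


if __name__ == "__main__":
    # Fewer consumers than partitions: one consumer handles two partitions.
    print(assign_partitions([0, 1, 2], ["c1", "c2"]))
    # More consumers than partitions: the extra consumer gets nothing.
    print(assign_partitions([0, 1], ["c1", "c2", "c3"]))
```

In the second call the third consumer ends up with an empty list, which is the idle consumer described above.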
In the section on saving data, we said that a partition is divided into multiple groups of segment files (.log, .index, and .timeindex), and that each stored message contains an offset, a message size, and a message body. Given segments and offsets, how do we use segment + offset to find a message? For example, how do we find the message whose offset is 368801? Let's take a look at the picture below:
- First, find the segment file that contains offset 368801 using binary search. Here it is the second segment file.
- Open the .index file of that segment, that is, the 368796.index file. Its starting offset is 368796, so the message with offset 368801 sits at 368796 + 5 = 368801, and the relative offset to look for is 5.
- The index file is a sparse index that maps relative offsets to the physical positions of the corresponding messages, so there may be no entry for relative offset 5. Binary search is used again to find the index entry with the largest relative offset that is less than or equal to the target, which here is the entry with relative offset 4.
- The entry with relative offset 4 says that the message is stored at physical position 256. Open the data file and scan sequentially from position 256 until the message with offset 368801 is found.
This mechanism relies on offsets being ordered, and combines segments, ordered offsets, a sparse index, binary search, and a sequential scan to find data efficiently. At this point, the consumer has the data it needs and can process it.
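The lookup steps above can be sketched with Python's `bisect` module. Everything in the demo data is hypothetical except the values taken from the text (segment 368796, relative offset 4 at physical position 256); the in-memory lists stand in for the real .index and .log files.

```python
import bisect


def find_segment(base_offsets, target):
    """Binary-search the sorted base offsets for the segment that holds target."""
    i = bisect.bisect_right(base_offsets, target) - 1
    if i < 0:
        raise KeyError(f"offset {target} is before the first segment")
    return base_offsets[i]


def find_position(sparse_index, relative_offset):
    """Return the physical position of the last index entry <= relative_offset.

    sparse_index is a sorted list of (relative_offset, physical_position) pairs.
    """
    keys = [rel for rel, _ in sparse_index]
    i = bisect.bisect_right(keys, relative_offset) - 1
    return sparse_index[i][1] if i >= 0 else 0


def scan_log(log, start_position, target):
    """Scan (position, offset, body) records in order, starting at start_position."""
    for position, offset, body in log:
        if position >= start_position and offset == target:
            return body
    return None


def lookup(base_offsets, indexes, logs, target):
    base = find_segment(base_offsets, target)                 # step 1: pick the segment
    position = find_position(indexes[base], target - base)    # steps 2-3: sparse index
    return scan_log(logs[base], position, target)             # step 4: sequential scan


if __name__ == "__main__":
    base_offsets = [0, 368796, 737337]  # third value is made up for the demo
    indexes = {368796: [(0, 0), (4, 256), (7, 448)]}
    # Ten made-up 64-byte messages, so relative offset 4 sits at position 256.
    logs = {368796: [(i * 64, 368796 + i, f"msg-{368796 + i}".encode()) for i in range(10)]}
    print(lookup(base_offsets, indexes, logs, 368801))
```

The sparse index keeps the index file small; the price is the short sequential scan at the end, which stays cheap because it starts close to the target.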