What is a Streaming Platform?
A streaming platform processes streams of records in the order they occur. It can store those streams of records in a fault-tolerant manner, and it can publish and subscribe to these records like a message queue.
Kafka runs as a cluster on one or more servers, and that cluster can span multiple data centers. The cluster stores streams of records in categories called topics. Each record consists of a key, a value, and a timestamp.
Where can we use Kafka?
Kafka is widely used in two broad classes of applications:
Real-time streaming data pipelines that reliably move data between systems or applications.
Real-time streaming applications that transform or react to streams of data.
Communication between clients and servers is done over a versioned TCP protocol that maintains backward compatibility with older versions. Kafka clients are available in many languages, with Java as the primary client.
Core APIs in Kafka
Producer API: Allows an application to publish a stream of records to one or more Kafka topics.
Consumer API: Allows an application to subscribe to one or more topics and process the stream of records produced to them.
Streams API: Allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics.
Connector API: Allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
What do we have in Kafka?
Topics in Kafka
A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it. For each topic, Kafka maintains a partitioned log.
Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
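The partition-and-offset model described above can be illustrated with a toy in-memory structure. This is a sketch for intuition only, not how the broker actually stores data on disk:

```python
class PartitionLog:
    """Toy model of a Kafka partition: an append-only list of records,
    where each record's offset is simply its index in the list."""

    def __init__(self):
        self.records = []

    def append(self, key, value):
        offset = len(self.records)          # the next sequential offset
        self.records.append((key, value))   # records are only ever appended
        return offset

    def read(self, offset):
        """An offset uniquely identifies a record within the partition."""
        return self.records[offset]

log = PartitionLog()
o0 = log.append("user-1", "login")
o1 = log.append("user-2", "purchase")
print(o0, o1)        # 0 1 -- offsets are assigned sequentially
print(log.read(0))   # ('user-1', 'login')
```

Because the log is immutable and append-only, a consumer's position is fully described by a single offset, which is what makes replay and resumption cheap.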
The Kafka cluster retains all published records for a configurable retention period, after which they are discarded to free up space. Kafka's performance is effectively constant with respect to data size, so retaining data for a long time is not a problem.
Distribution in Kafka
The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for a share of the partitions. Each partition has one server that acts as the leader and zero or more servers that act as followers. The leader handles all reads and writes for the partition while the followers replicate it. If the leader fails, one of the followers becomes the new leader.
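The failover behavior can be sketched as follows. This is a deliberately simplified illustration; real Kafka leader election is coordinated by the cluster controller and considers which replicas are in sync:

```python
def elect_leader(replicas, failed):
    """Toy leader election: if the current leader (the first replica)
    fails, promote the next surviving follower. A sketch only, not
    Kafka's actual controller logic."""
    alive = [r for r in replicas if r not in failed]
    return alive[0] if alive else None

replicas = ["broker-1", "broker-2", "broker-3"]        # broker-1 leads
print(elect_leader(replicas, failed=set()))            # broker-1
print(elect_leader(replicas, failed={"broker-1"}))     # broker-2 takes over
```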
Producers in Kafka
Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic.
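A common way to choose the partition is to hash the record's key, which keeps all records with the same key in the same partition and therefore in order. The sketch below shows the idea; Kafka's actual default partitioner uses a different hash (murmur2), and keyless records are spread across partitions instead:

```python
def choose_partition(key, num_partitions):
    """Assign a record to a partition by hashing its key. Records with
    the same key always land in the same partition, preserving per-key
    ordering."""
    return hash(key) % num_partitions

# The same key always maps to the same partition within a process:
p1 = choose_partition("user-42", 4)
p2 = choose_partition("user-42", 4)
assert p1 == p2
```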
Consumers in Kafka
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.
If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
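The two delivery behaviors can be sketched in a few lines. Note that this is only an illustration of the semantics; real Kafka consumers are assigned whole partitions rather than individual records:

```python
from itertools import cycle

def deliver(record, groups):
    """Deliver one record: within each group exactly one member receives
    it (load balancing), while every group receives its own copy
    (broadcast across groups)."""
    return {name: next(members) for name, members in groups.items()}

groups = {
    "billing":   cycle(["billing-0", "billing-1"]),  # two instances share the load
    "analytics": cycle(["analytics-0"]),             # one instance gets everything
}

print(deliver("order-1", groups))  # one consumer chosen from each group
print(deliver("order-2", groups))  # billing alternates between its instances
```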
Kafka as Messaging System
There are two traditional models in messaging: queuing, where messages line up in a queue and each is consumed once, and the publish-subscribe model, where messages are broadcast to all subscribers.
Advantages of Queue and Publish Subscribe Model
The queuing model allows us to divide the processing of data over multiple consumer instances, which lets us scale our processing.
Publish-subscribe allows us to broadcast data to multiple processes, but it cannot scale processing the way a queue can, since every message goes to every subscriber.
But queues aren’t multi-subscriber—once one process reads that data it’s gone.
The advantage of Kafka's consumer group model is that every topic supports both properties: it can scale processing and is also multi-subscriber, so there is no need to choose one or the other.
Kafka as Storage System
The Data in Kafka is written to disk and replicated for fault-tolerance. Kafka allows producers to wait on acknowledgement so that a write isn’t considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.
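The wait-for-replication guarantee can be sketched as a small state machine, in the spirit of the producer's `acks=all` setting. This is an illustration of the idea, not Kafka's actual replication protocol:

```python
class ReplicatedWrite:
    """Toy model of a fully acknowledged write: the write is considered
    complete only once every replica has persisted it."""

    def __init__(self, num_replicas):
        self.num_replicas = num_replicas
        self.acked_by = set()

    def replica_persisted(self, replica_id):
        self.acked_by.add(replica_id)

    def is_complete(self):
        return len(self.acked_by) == self.num_replicas

w = ReplicatedWrite(num_replicas=3)
w.replica_persisted("broker-1")
w.replica_persisted("broker-2")
print(w.is_complete())   # False: one replica has not persisted it yet
w.replica_persisted("broker-3")
print(w.is_complete())   # True: fully replicated, safe to acknowledge
```

Waiting for all replicas trades a little latency for the guarantee that the write survives even if the server it was written to fails.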
The disk structures Kafka uses scale well—Kafka will perform the same whether we have 50 KB or 50 TB of persistent data on the server.
Kafka for Stream Processing
In Kafka, a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.
For example, a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed off this data.
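The retail example above can be sketched as a simple generator that consumes sale events and emits reorder events. The record shapes here are hypothetical, chosen only to make the illustration concrete:

```python
def reorder_stream(sales, stock_threshold=5):
    """Toy stream processor: consume a stream of (item, qty_sold,
    qty_left) sale events and emit a reorder event whenever an item's
    remaining stock falls below the threshold."""
    for item, qty_sold, qty_left in sales:
        if qty_left < stock_threshold:
            yield ("reorder", item)   # output record for a reorder topic

sales = [("widget", 2, 10), ("gadget", 1, 3), ("widget", 7, 4)]
print(list(reorder_stream(sales)))  # [('reorder', 'gadget'), ('reorder', 'widget')]
```

A real implementation would read the input continuously from a sales topic and write its output to a reorder topic rather than iterating over a list.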
Kafka provides a fully integrated Streams API for building such applications, supporting non-trivial processing such as computing aggregations over streams or joining streams together.