As the technology industry changes, a new buzzword shows up every now and then. We heard a lot about Hadoop until it was overshadowed by Spark. But these two are not the only buzzwords that keep showing up day in and day out. A few more pass before our eyes. Kafka is one such technology, and it is being adopted at a rapid pace by many enterprises today.
What is Kafka?
Kafka is an Apache top-level project. Apache Kafka is an open-source streaming platform that provides a unified, high-throughput, low-latency foundation for handling real-time data feeds. It is a scalable, fault-tolerant, publish-subscribe messaging system that enables us to build distributed applications.
Apache Kafka also offers additional libraries for stream processing, as well as the ability to connect to external systems via Kafka Connect.
History of Apache Kafka
Originally developed at LinkedIn, Kafka was donated to the Apache Software Foundation (ASF) in 2011. The project graduated to become an ASF top-level project in 2012. However, a lot has changed since 2012, especially the design and robustness of the platform.
At LinkedIn, Kafka was used to build pipelines that move data from several different sources to several different destinations.
However, its popularity increased when Netflix provided insights into the amazing potential Apache Kafka had to offer, and this got everyone interested in replicating the results Netflix had achieved.
What Does Apache Kafka Offer?
Apache Kafka offers low-latency ingestion of large amounts of data, whether events, logs or transactions. It is a pub-sub system which:
- delivers messages in order;
- persists messages;
- scales horizontally;
- guarantees delivery "at least once".
If I were to rephrase this, I would say that Apache Kafka offers really high-throughput data ingestion and ensures that the data is partitioned, ordered, persisted, delivered "at least once" and re-readable.
Apache Kafka has publishers, queues and subscribers. This naming is used here to keep you in sync with traditional messaging systems; in the Kafka world the conventions are "producers, topics and consumers".
Let us look at the various aspects mentioned above in more detail.
The design of Kafka enables the platform to ingest messages at blistering speed.
- Ingestion rates in Kafka can exceed 100,000 messages per second.
- The data is ingested in a partitioned and ordered fashion.
Scalability can be achieved in Kafka at various levels.
- Multiple producers can write to the same topic.
- Topics can be partitioned.
- Consumers can be grouped to consume individual partitions.
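The interplay between keyed partitioning and consumer groups can be sketched as follows. This is a conceptual illustration only, not the Kafka client API; the function names `partition_for` and `assign_partitions` are made up for this example, though hashing the key modulo the partition count mirrors what Kafka's default partitioner does.

```python
# Conceptual sketch: keyed messages map to partitions, and a consumer
# group divides the partitions among its members.
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Route a keyed message to a partition by hashing its key, so all
    messages with the same key land, in order, in the same partition."""
    return zlib.crc32(key) % num_partitions

def assign_partitions(partitions, consumers):
    """Spread partitions across the consumers of a group so that each
    partition is read by exactly one consumer in that group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

keys = [b"user-42", b"user-7", b"user-42"]
placements = [partition_for(k, 3) for k in keys]
print(placements)  # the two "user-42" messages share a partition

print(assign_partitions([0, 1, 2], ["c1", "c2"]))
```

Note the trade-off this sketch illustrates: ordering is guaranteed only within a partition, which is why related messages share a key.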
Kafka has a distributed architecture, which means several nodes run together to serve the cluster.
- Topics inside Kafka are replicated.
- Users can choose the number of replicas for each topic to be safe in case of a node failure.
- A node failure in the cluster won't impact availability.
- Integration with ZooKeeper provides producers and consumers with accurate information about the cluster.
- Internally, each partition of a topic has its own leader which takes care of the writes.
- If a leader node fails, a new leader is elected.
Kafka offers data durability as well.
- Messages written to Kafka are persisted to disk.
- The persistence (retention) period can be configured.
- This ensures re-processing can be performed if required.
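As a sketch of how this retention is configured, the broker's `server.properties` exposes keys like the following (the values shown here are illustrative defaults, and individual topics can override retention with their own `retention.ms` setting):

```properties
# Keep messages for 7 days before they become eligible for deletion.
log.retention.hours=168

# Optionally also cap retention by size per partition (-1 = no limit).
log.retention.bytes=-1

# Roll to a new log segment at this size; only whole expired
# segments are ever deleted.
log.segment.bytes=1073741824
```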
An important concept in Apache Kafka is the "log". This is not an application log or a system log; it is a log of data. It imposes a loose structure on the data consumed by Kafka: a "log" is an ordered, append-only sequence of records. The data can be anything, because to Kafka it is just an array of bytes.
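The log abstraction described above can be sketched in a few lines. This is a toy in-memory model to make the idea concrete, not how Kafka stores data (Kafka's log is partitioned, replicated and kept on disk); the class and method names are invented for this illustration.

```python
# Minimal sketch of the "log": an ordered, append-only sequence of
# byte records, each addressed by its offset (position in the log).

class Log:
    def __init__(self):
        self._records = []  # append-only; records are never modified

    def append(self, record: bytes) -> int:
        """Append a record and return the offset it was written at."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset: int):
        """Replay every record from the given offset onward, in order."""
        return self._records[offset:]

log = Log()
for payload in (b"created", b"updated", b"deleted"):
    log.append(payload)

print(log.read_from(1))  # re-read from offset 1: [b'updated', b'deleted']
```

Because the log is append-only and addressed by offset, any reader can come back later and replay the same records, which is what makes re-processing possible.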
What Does Apache Kafka Not Offer?
Apache Kafka was designed to be fast and lightweight. Traditional messaging systems carried a lot of overhead which slowed down processing. Kafka is a lean solution which shaves off all the unnecessary hooks and switches.
So if you are comparing Apache Kafka with the traditional messaging systems, here is what Kafka does not offer.
- Kafka doesn't assign an identifier to each individual message. It has the notion of an "offset" inside the log which identifies messages by position.
- Consumers consume data from topics, but Kafka does not keep track of message consumption. Kafka does not know which consumer consumed which message from a topic; the consumer or consumer group has to keep track of its own consumption.
- There are no random reads from Kafka. A consumer specifies an offset for the topic, and Kafka serves the messages in order starting from that offset.
- Kafka does not offer the ability to delete individual messages. A message stays in the log until it expires (until the defined retention time passes).
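The second and third points above — the consumer tracking its own position and reading sequentially from an offset — can be sketched together. This is a conceptual model, not the Kafka client API; `consume_batch` and the variable names are invented for the example.

```python
# Sketch of consumer-side offset tracking: the broker never records what
# was consumed; the consumer remembers the next offset and reads from it.

def consume_batch(log, committed_offset, max_records):
    """Fetch up to max_records starting at committed_offset; return the
    batch plus the new offset the consumer should commit afterwards."""
    batch = log[committed_offset:committed_offset + max_records]
    return batch, committed_offset + len(batch)

topic_log = [b"m0", b"m1", b"m2", b"m3", b"m4"]
offset = 0
batch, offset = consume_batch(topic_log, offset, 2)  # reads m0, m1
batch, offset = consume_batch(topic_log, offset, 2)  # reads m2, m3
print(offset)  # 4

# If the consumer crashed before committing the new offset, it would
# re-read from the last committed one -- which is exactly where the
# "at least once" delivery guarantee comes from.
```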
Apache Kafka vs. Traditional Message Brokers
So how different is Apache Kafka from traditional message brokers based on AMQP, JMS or something else?
As mentioned earlier, Kafka is a lightweight messaging queue. Traditional messaging solutions offer additional functionality in the area of subscription (or consumption): a framework which ensures, through an acknowledgement mechanism, that messages are delivered to the consuming systems. Besides this, traditional messaging systems offer non-persistent messages, TTLs on messages and several different protocols (AMQP, MQTT, STOMP and a few others).
Apache Kafka does not offer the above-mentioned functionality.
Apache Kafka can be an excellent alternative to traditional message brokers if the use case demands exceptionally fast ingestion and an easily scalable solution. This means that if you want to ingest from a firehose of data in a reliable, efficient, scalable and durable manner, Apache Kafka is your answer.
To put it simply:
- Apache Kafka is an excellent solution for data ingestion at a blistering pace with consumers that are reliable in nature.
- Traditional message brokers are good for comparatively slower data ingestion with consumers that are unreliable in nature.
Apache Kafka – Use Cases
So the question which remains open is “Should I use Apache Kafka?”
The short answer is – it really depends on the use case. There are a few areas where Apache Kafka does not really fit.
However, a lot of use cases involve web-scale solutions, real-time data ingestion from several different sources, or sending data to multiple different destinations. The following use cases are a really good fit for Apache Kafka:
- Website Activity Tracking
- Log Aggregation
- Stream Processing
- Event Sourcing
- Commit Log
The Apache Kafka documentation discusses these use cases quite extensively.