Are you working on a project that handles a lot of data? Is it growing? Do you have problems transferring data? If so, you need to read this story.
There was a project I worked on called “Chapar”, which means “mailman” in Farsi. It was a multimedia project for banks that provided multiple media, such as SMS, email, fax, and IM, for clients. There was a separate module running for each medium.
For every transaction on a customer’s account, an event was generated and put into a message broker, which at that time was Apache ActiveMQ. A core module called “notifier” was responsible for receiving these events from the message broker, processing them, and deciding what to do with them: which kind of media should be used for a particular event, for example, or whether the client’s account was active for that medium. The busiest module was SMS. We had a lot of performance issues with the flood of SMS messages.
As the number of clients grew, more and more performance issues began to emerge. Sometimes we had 20 SMS processes running, but even that wasn’t enough: a huge volume of events remained on the broker. After a period of time under that huge load, ActiveMQ would simply stop working! We had to restart the broker, and one of the consequences was that we lost all the events and data still sitting on it! ActiveMQ does have a persistent mode that saves messages to avoid data loss, but its performance in persistent mode was a nightmare. Scary, huh? It gets scarier when there is no log of why your running instance suddenly stopped working! Especially when they call you at 7 in the morning to tell you that the system has shut down because of the “Happy Birthday” messages that a bank sent to its clients at 7! Running multiple ActiveMQ processes at scale is a very hard thing to do.
Another issue that really was a “royal pain” was ActiveMQ’s unpredictable behaviour! In some environments it stopped working, and in others it just slowed down for no apparent reason, so we lost our trust in it. ActiveMQ also uses a lot of CPU, so if your application and your ActiveMQ broker are running on the same server, that’s going to be a problem!
Hence we needed a solution to ActiveMQ’s data loss, unreliability, and poor performance. That’s when I got to meet Kafka.
Mr. Franz Kafka, a German-speaking Bohemian novelist and short-story writer, widely regarded as one of the major figures of 20th-century literature.
Oops! My bad.
I meant Apache Kafka…
What Is Apache Kafka?
Apache Kafka is a distributed system designed for streams. It is built to be fault-tolerant, high-throughput and horizontally scalable, and it allows geographic distribution of data streams and stream-processing applications.
Who uses Kafka?
In real-time scenarios, different systems need to communicate with each other, which is done using data pipelines. Today’s applications are not going to survive on a single server.
This is what a lot of enterprises look like. I admit it may be a little dramatic, but if you think about large companies, possibly your own, there are hundreds of applications that all need data to operate. Whether it be logs, database records, key-value pairs, binary objects, or messages, all of these applications are creating data at an incredible rate. Often that rate strains existing data stores and requires more stores to take on the load. When that happens, you have issues getting the data where it needs to be and enabling applications to find it. Furthermore, as businesses change, the variety of data increases, and the types of applications and data stores change as well. Now, this obviously doesn’t happen overnight, but it happens, and the result is a complex web of point-to-point data movements that is very hard to manage and work with.
So by using Kafka, your application will look something like this.
In this setup, Kafka acts as a kind of universal pipeline for data. Each system can feed into this central pipeline or be fed by it; applications or stream processors can tap into it to create new, derived streams, which in turn can be fed back into the various systems for serving.
As a central broker of data, Kafka enables disparate applications and data stores to engage with one another in a loosely coupled fashion by conveying data through messaging topics which Kafka manages at scale and reliably. Regardless of the system, the vendor, the language, or runtime, all can integrate into this data fabric, provided by none other than Apache Kafka.
Applications publish data as a stream of events while other applications pick up that stream and consume it when they want. Because all events are stored, applications can hook into this stream and consume as required—in batch, real-time or near-real-time. This means that you can truly decouple systems and enable proper agile development. Furthermore, a new system can subscribe to the stream and catch up with historic data up until the present before existing systems are properly decommissioned.
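The mechanism behind this is Kafka’s core abstraction: an append-only log per topic, with each consumer tracking its own read offset. The sketch below illustrates that model in plain Python; it is a toy illustration of the concept, not the real Kafka client API, and all class and method names are made up for the example.

```python
# Toy model of a Kafka-style append-only log. Events are stored and never
# removed when read; each consumer keeps its own offset, so a consumer that
# subscribes late can still replay the full history from offset 0.

class Topic:
    def __init__(self):
        self.log = []  # the append-only event log

    def publish(self, event):
        self.log.append(event)

    def read_from(self, offset):
        """Return every event at or after `offset` (batch or catch-up read)."""
        return self.log[offset:]


class Consumer:
    def __init__(self, topic, offset=0):
        self.topic = topic
        self.offset = offset  # a brand-new consumer starts at 0: full history

    def poll(self):
        events = self.topic.read_from(self.offset)
        self.offset += len(events)  # advance past everything just consumed
        return events


topic = Topic()
topic.publish({"account": "A1", "type": "deposit"})
topic.publish({"account": "A2", "type": "withdrawal"})

sms = Consumer(topic)   # an existing system, consuming as events arrive
print(sms.poll())       # receives both events published so far

topic.publish({"account": "A1", "type": "transfer"})
late = Consumer(topic)  # a new system subscribing after the fact
print(len(late.poll())) # 3: it catches up with all historic data
```

Note how the late consumer is completely decoupled from the producers: it never needed to exist when the events were published, which is exactly what lets you bring a new system up to date before decommissioning an old one.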
Why Kafka ?
Kafka is a streaming platform, and it has some great advantages over ActiveMQ:
- Scalability : Kafka can be scaled out on the fly, by adding nodes, without incurring any downtime. Moreover, message handling inside the Kafka cluster is fully transparent and seamless. It’s worth mentioning that scaling ActiveMQ and building clusters out of it was not a pleasant experience – it is much easier with Kafka.
- Durability : Because messages are replicated across brokers, you are not going to lose your data.
- High-throughput : Kafka is capable of handling high-velocity, high-volume data without a lot of hardware, supporting throughput of thousands of messages per second.
- Low Latency : It handles these messages with very low latency, in the range of milliseconds, as demanded by most new use cases.
- Flexibility : Compared to ActiveMQ, Kafka offers many more configuration options for producers and consumers.
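The durability point above largely comes down to configuration. As a sketch, a producer can be told to wait for acknowledgement from all in-sync replicas before considering a message delivered; the property names below are standard Kafka producer settings, while the broker addresses are placeholders:

```properties
# producer configuration (sketch; broker addresses are placeholders)
bootstrap.servers=broker1:9092,broker2:9092,broker3:9092
acks=all            # wait for all in-sync replicas to acknowledge each write
retries=2147483647  # retry transient failures instead of dropping events
```

Combined with a topic replication factor greater than one, this is what makes the “you are not going to lose your data” claim hold in practice.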
At first, Kafka’s architecture seems a little hard to understand, and we had to invest time to comprehend it. But once you master it, it is really easy to work with. It took us roughly two months to implement Kafka.
One thing I need to mention is that you should also get familiar with Zookeeper, which is a centralised service for maintaining configuration information, naming, providing distributed synchronisation, and providing group services. In order to run Kafka, you need to run Zookeeper and point your Kafka processes at it.
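In a standard Kafka distribution this means starting Zookeeper first (`bin/zookeeper-server-start.sh config/zookeeper.properties`) and then the broker (`bin/kafka-server-start.sh config/server.properties`); the broker finds Zookeeper through its configuration file. A minimal sketch of the relevant `server.properties` lines, assuming a single local Zookeeper instance on the default port:

```properties
# server.properties (sketch): introduce Zookeeper to the Kafka broker
broker.id=0                      # must be unique per broker in the cluster
zookeeper.connect=localhost:2181 # host:port of the Zookeeper ensemble
```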
One of the problems we had with Kafka is that there is still no good all-in-one web console for monitoring it, unlike ActiveMQ. With Kafka, we had to work with two web consoles that gave us complementary information, rather than having everything in one place as with ActiveMQ.
To sum it up, Kafka is unique because it combines messaging, storage and processing of events all in one platform.
After implementing Kafka in our project, we were quite happy with the results. With just one broker running, the results were much better than with ActiveMQ. Kafka also provides a lot of features that we didn’t even need to use. From all our results and experience of using Kafka in a big, fast-growing application, we can say that dealing with data in your application is no longer going to be a nightmare.
Tomcy John, Pankaj Misra, “Data Lake for Enterprises”