Kafka’s growth is exploding. More than one-third of all Fortune 500 companies use Kafka. It is used for real-time streams of data, to collect big data, and to do real-time analysis. Kafka is used with in-memory microservices to provide durability, and it can feed events to CEP (complex event processing) systems and IoT/IFTTT-style automation systems. This article explains what Kafka is and why it matters.


Kafka is a distributed streaming platform used to publish and subscribe to streams of records. It also provides fault-tolerant storage by replicating topic log partitions to multiple servers. Kafka is designed to let applications process records as they occur. It is fast and uses I/O efficiently by batching and compressing records. Kafka is also used to decouple data streams: it streams data into data lakes, applications, and real-time stream analytics systems.
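To make the publish-and-subscribe idea concrete, here is a minimal in-memory sketch of a topic split into append-only partition logs. It is a toy model, not Kafka's client API: the `ToyTopic` class and its methods are hypothetical names, and CRC32 stands in for whatever hash a real partitioner uses.

```python
import zlib


class ToyTopic:
    """In-memory stand-in for a topic: a fixed set of append-only partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # The same key always maps to the same partition, so records that
        # share a key keep their relative order.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        offset = len(self.partitions[index]) - 1
        return index, offset

    def read(self, partition, offset):
        # Subscribers pull records starting at an offset. Reading removes
        # nothing, so any number of subscribers can read the same records.
        return self.partitions[partition][offset:]


topic = ToyTopic(num_partitions=3)
first = topic.publish("sensor-7", "21.5")
second = topic.publish("sensor-7", "21.9")
```

Because reads are addressed by offset and do not consume records, a subscriber that falls behind or restarts can simply read again from an earlier offset.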


Kafka is usually used in real-time streaming data architectures to deliver real-time analytics. Because Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system, it is used in cases where JMS, RabbitMQ, and AMQP may not even be considered because of volume and responsiveness requirements. Kafka has higher throughput, reliability, and replication characteristics, which makes it suitable for things such as tracking service calls or tracking IoT sensor data, where a traditional MOM (message-oriented middleware) might not be considered.

Kafka can also work with Flume/Flafka, Spark Streaming, Storm, HBase, Flink, and Spark for real-time ingestion, analysis, and processing of streaming data. Kafka brokers support huge message streams for low-latency follow-up analysis in Hadoop or Spark. Kafka Streams can also be used for real-time analytics.


Kafka is operationally simple. It is easy to set up and use, and it is easy to figure out how Kafka works. However, the main reason Kafka is so popular is its excellent performance. It is stable, provides reliable durability, has a flexible publish-subscribe/queue model that scales well with any number of consumer groups, has robust replication, and gives producers tunable consistency guarantees. In addition, Kafka works well with systems that have data streams to process, and it enables those systems to aggregate, transform, and load data into other stores. But none of those characteristics would matter if Kafka were slow. The most important reason Kafka is popular is its exceptional performance.
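The combined publish-subscribe/queue behavior of consumer groups can be sketched in a few lines. The example below is a simplification under stated assumptions: the function `assign_partitions`, the group names, and the round-robin split are illustrative only, and real deployments can use other assignment strategies.

```python
def assign_partitions(num_partitions, consumers):
    """Spread partition numbers over one group's consumers round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Two hypothetical groups subscribed to the same four-partition topic.
groups = {
    "billing": ["billing-1", "billing-2"],
    "audit": ["audit-1"],
}
plan = {name: assign_partitions(4, members) for name, members in groups.items()}
```

Each group as a whole covers every partition, which gives publish-subscribe behavior across groups. Inside a group, each partition has exactly one owner, so the members share the work like consumers of a queue.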


Kafka is used for stream processing, website activity tracking, metrics collection and monitoring, log aggregation, real-time analytics, ingesting data into Hadoop, CQRS, replaying messages, error recovery, and as a guaranteed distributed commit log for in-memory computing.


Kafka relies heavily on the OS kernel to move data around quickly. It relies on the principle of zero copy. Kafka lets producers batch data records into chunks, and these batches can be seen end to end, from the producer to the file system to the consumer. Batching allows for more efficient data compression and reduces I/O latency. Kafka writes to an immutable commit log sequentially on disk, thus avoiding random disk access and slow disk seeks. It shards a topic log into hundreds of partitions spread across up to thousands of servers. This sharding allows Kafka to handle huge loads.
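The claim that batching makes compression more efficient can be checked with a small experiment. This sketch uses Python's standard `zlib` module as a stand-in codec and made-up sensor records; it does not use Kafka itself, and the record shape is an assumption for illustration.

```python
import json
import zlib

# 1,000 small, similar records, as a stream of sensor readings might produce.
records = [
    json.dumps({"sensor": f"sensor-{i % 10}", "reading": i, "unit": "celsius"}).encode("utf-8")
    for i in range(1000)
]

raw_size = sum(len(record) for record in records)

# Compressing each record alone: the codec never sees the repetition
# between records and pays its fixed overhead on every call.
one_by_one_size = sum(len(zlib.compress(record)) for record in records)

# Compressing one batch: field names and repeated values are shared.
batch = zlib.compress(b"\n".join(records))
batched_size = len(batch)

# The batch round-trips back to the original records.
restored = zlib.decompress(batch).split(b"\n")

print(raw_size, one_by_one_size, batched_size)
```

Running it prints the three sizes. The batched figure should be far smaller than the record-by-record figure, because the records repeat the same field names and similar values.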


Whether Kafka is the right choice always depends on your use case. Kafka fits a class of problems that many web-scale organizations and enterprises face, but just as the traditional message broker is not one size fits all, neither is Kafka. If you are looking to build a set of resilient data services and applications, Kafka can serve as the source of truth by collecting and keeping all of the “facts” or “events” for a system.

In the end, you have to consider the trade-offs and drawbacks. If you think you can benefit from having multiple publish/subscribe and queueing tools, Kafka might be worth considering.