Continuing with my HDInsight series, today I’ll be talking about Kafka. HDInsight Kafka will sound much like Storm, but as I get into the nuts and bolts you’ll see the differences. Kafka is an open source distributed streaming platform that can be used to build real-time data streaming pipelines and applications, with message broker functionality much like a message queue.
Some specific Kafka improvements with HDInsight:
- 99.9% uptime SLA from HDInsight
- You get 16-terabyte managed disks, which increases the scale and reduces the number of required nodes compared with traditional Kafka clusters, which would have a limit of 1 terabyte per node.
- Kafka takes a single-rack view of a cluster, but Azure is designed in two dimensions for update and fault domains. Microsoft therefore designed special tools to rebalance the partitions and replicas across those domains. When you scale out, you repartition your data so you can take advantage of the additional nodes; the same applies when you scale down.
- Kafka allows you to change the number of worker nodes for scaling up or down, depending on the workload, and this can be done through the portal, PowerShell, or any automation tool within Azure (see the resize sketch after this list).
- Direct integration with Azure Log Analytics. This looks at virtual machine level information such as disk and network metrics. The importance of this is that it rolls up into the Microsoft OMS suite for global log analytics, so when you’re looking at all your resources in Azure through OMS, you can see them at a high level and also drill in for more details.
- Zookeeper manages the state of the cluster, which helps with concurrency, resiliency and low-latency transactions, as well as the orchestration of the data through the nodes and clusters.
- Records are stored in topics, which are written by producers and read by consumers. Producers send records to Kafka brokers, and each worker node in the cluster is considered a broker. These brokers are what help the data move around inside the cluster (a minimal producer/consumer sketch follows this list).
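To make the producer/broker/consumer relationship concrete, here’s a minimal sketch using the open source kafka-python client. The broker addresses and topic name are placeholders, not anything specific to your HDInsight cluster:

```python
# Minimal producer/consumer sketch with kafka-python (pip install kafka-python).
# Broker addresses and the topic name below are hypothetical placeholders.
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["wn0-kafka:9092", "wn1-kafka:9092"]  # your worker-node brokers
TOPIC = "clickstream"                           # your topic

# Producer: sends records (key/value byte payloads) to the topic on the brokers.
producer = KafkaProducer(bootstrap_servers=BROKERS)
producer.send(TOPIC, key=b"user-42", value=b'{"action": "page_view"}')
producer.flush()  # make sure the record actually reaches the brokers

# Consumer: reads the records back from the same topic.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    auto_offset_reset="earliest",  # start from the beginning of the topic
    consumer_timeout_ms=5000,      # stop iterating if no new records arrive
)
for record in consumer:
    print(record.key, record.value)
```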
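And here’s the resize sketch mentioned above, showing how scaling the worker nodes could be automated from Python. This assumes the azure-mgmt-hdinsight SDK; the method and model names used (begin_resize, ClusterResizeParameters) are my assumptions and worth verifying against the current SDK docs:

```python
# Rough sketch: resize the worker node count of an HDInsight Kafka cluster.
# Assumes azure-identity and azure-mgmt-hdinsight are installed; names below
# marked as placeholders are not from the article.
from azure.identity import DefaultAzureCredential
from azure.mgmt.hdinsight import HDInsightManagementClient
from azure.mgmt.hdinsight.models import ClusterResizeParameters

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
RESOURCE_GROUP = "my-rg"                # placeholder
CLUSTER_NAME = "my-kafka-cluster"       # placeholder

client = HDInsightManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Scale the worker node role to 6 nodes; this is a long-running operation.
poller = client.clusters.begin_resize(
    RESOURCE_GROUP,
    CLUSTER_NAME,
    "workernode",
    ClusterResizeParameters(target_instance_count=6),
)
poller.result()  # block until the resize completes
```

Remember that after a resize you’d still run the partition rebalance step described earlier so the new (or remaining) nodes actually carry their share of the data.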
Again, Kafka and Storm sound relatively similar; here are some major differences:
- Storm was invented by Twitter; Kafka by LinkedIn. But both build on the Hadoop platform, and it’s open source, so each company could build its own iterations.
- Storm is meant more for real-time message processing; Kafka is for distributed messaging.
- Storm can take data from Kafka and other database systems and process it; Kafka takes in streams from sources like Facebook, Twitter and LinkedIn.
- Kafka is a message broker; Storm’s primary use is stream processing.
- In Storm there is no data storage; you can only stream data through it. Kafka stores the data on the file system. As those streams are processed, Storm can do it much faster, at a micro-batch level; Kafka works in small batches, larger than micro.
- As far as dependencies, Kafka requires Zookeeper for all its orchestration; Storm does not depend on anything external.
- Storm has latency in the milliseconds; with Kafka it depends on the source of the data, but it typically takes slightly less than 1-2 seconds. So with Kafka you’re keeping the data local, processing it, then pushing it somewhere else, whereas with Storm you’re processing the data in motion as you push it somewhere else.
Basically, these are two different ways to solve similar problems depending on the use case. It apparently worked better for LinkedIn to design it this way, as opposed to the way Twitter handles their data.
Either way, pretty cool technology. If you’d like to learn more about Kafka, Storm, HDInsight or Azure itself, we are your best resource. Click the link below or contact us—we’re here to help.