Real-Time analytics with Kafka and Spark Streaming

It’s Real-Time time

Lately, real-time processing has been gaining a lot of popularity. However, one thing to note is that the concept of real-time (RT) processing has been around for some time. Traditional enterprise software vendors have long provided tools for real-time processing, formerly known as "Complex Event Processing" (CEP) systems. This raises an obvious question: if real-time processing is not a new concept, why is it becoming popular only now?

I think the following three factors are making real-time processing popular:

  • Exponential growth in continuous data streams
  • Open source tools for reliable, high-throughput and low-latency event queuing and processing
  • The capability to run these tools on commodity hardware

Canonical stream processing architecture

The increasing demand for RT processing requires systems that enable it. A canonical RT/stream processing architecture is typically made up of the following components.

  • Data sources – the origin of data streams. This is the raw data, and hence the source of truth. A data source could be a sensor network, a mobile application, a web client, a server log, or even a device from the Internet of Things.
  • Message bus – a reliable, high-throughput, low-latency messaging system. Kafka and Flume are obvious choices; there are many other options as well, such as ActiveMQ and RabbitMQ. Kafka is definitely gaining a lot of popularity right now.
  • Stream processing system – a computation framework capable of performing computations on data streams. There are a few stream processing frameworks out there: Spark Streaming, Storm, Samza, Flink, etc. Spark Streaming is probably the most popular framework for stream processing right now.
  • NoSQL store – processed data is of no use if it does not serve end applications or users. End applications tend to do lots of fast reads and writes. HBase, Cassandra, and MongoDB are popular choices.
  • End applications – these are the applications that consume the processed data/result streams.
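To make the component roles concrete, here is a minimal, self-contained sketch of the canonical pipeline in plain Python. All names are illustrative: an in-memory queue stands in for the message bus (Kafka), a generator stands in for a data source (a sensor network), and a dict stands in for the NoSQL store.

```python
from collections import deque

# Data source: a stand-in for a stream of raw events (e.g. sensor readings).
def data_source():
    for reading in [21.5, 22.0, 35.2, 19.8]:
        yield {"sensor": "s1", "value": reading}

# Message bus: an in-memory queue standing in for Kafka.
bus = deque()

# Producer side: publish each raw event onto the bus.
for event in data_source():
    bus.append(event)

# Stream processing system: consume events, apply a computation
# (here, a simple threshold check), and write results to the store.
store = {}  # NoSQL store stand-in: key -> processed record
while bus:
    event = bus.popleft()
    event["alert"] = event["value"] > 30.0
    store[f'{event["sensor"]}-{event["value"]}'] = event

# End application: read processed results back out of the store.
alerts = [key for key, record in store.items() if record["alert"]]
```

In a real deployment each of these stand-ins is a separate distributed system running on its own cluster, and the hand-offs between them happen over the network rather than in one process.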

[Figure: Canonical stream processing architecture]

Even in this canonical form, the architecture involves many tools and components interacting with each other. It gets even more complicated when adapted to a complex use case, as many more components, and many more levels of interaction between those components, may be required.

See it in action

I wish there were a simple end-to-end application demonstrating how one would stitch together various open source tools to address a particular use case. With more and more organizations and people wanting to explore the potential of RT processing systems, such an example application would be really helpful for understanding the nuances of building one. Fortunately, I recently wrote an application that does exactly that. Pankh is an application that performs sentiment analysis on tweets.


The figure below shows the different building blocks of Pankh and how they are stitched together in the application.

[Figure: Building blocks of Pankh]

Building pieces of Pankh

Pankh uses twitter4j to pull tweets from Twitter, which are then streamed to Apache Kafka. Twitter credentials are required for twitter4j to be able to pull tweets; these can be specified in “twitter-kafka/conf/producer.conf”. The application can also pull only tweets that contain keywords the user is interested in, and these keywords can be specified in the same configuration file. Spark Streaming performs sentiment analysis on the stream of tweets it receives from Apache Kafka, which acts as the message bus for the application. Based on the analysis, each tweet is tagged with a positive or negative sentiment, and the result is then persisted into Apache HBase. The results can then be consumed by various applications; for demo purposes, one can easily use Hue and Hive as end applications.
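The sentiment-tagging step can be illustrated with a deliberately simplified sketch. Pankh's actual analysis logic is not shown in this post, so the word lists and the `tag_sentiment` function below are purely hypothetical: a pure-Python keyword lookup stands in for whatever classifier runs inside the Spark Streaming job.

```python
# Hypothetical keyword lists; a real classifier would be far richer.
POSITIVE = {"love", "great", "awesome", "happy"}
NEGATIVE = {"hate", "awful", "terrible", "sad"}

def tag_sentiment(tweet_text: str) -> str:
    """Tag a tweet as 'positive' or 'negative' by keyword counts.

    This mimics the per-tweet map step a Spark Streaming job would
    apply to each micro-batch before persisting results to HBase.
    """
    words = tweet_text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return "positive" if pos > neg else "negative"

# Rows as they might be written to the result store: (tweet, sentiment).
tweets = ["I love this great product", "I hate it, this is sad"]
rows = [(t, tag_sentiment(t)) for t in tweets]
```

In the real application, this function body would be replaced by the actual sentiment model, and each resulting row would become a put into an HBase table rather than a tuple in a list.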

This application has been tested on a CDH 5.4.0 cluster.


'Real-Time analytics with Kafka and Spark Streaming' has 1 comment

  1. September 20, 2015 @ 3:45 pm Jason Haas

    Thanks for the post! I am doing something similar for collecting twitter data. My architecture is like this:

    Twitter API data connections (Python code) –> Kafka –> Logstash –> Elasticsearch –> Kibana.

    This allows for real-time analytics on large amounts of real-time tweet data. I am also using Storm + Streamparse to do real-time distributed processing on top of any data collected from the tweets.

