Spark Streaming – A Simple Example

Stream data processing has become an inherent part of a modern data architecture built on top of Hadoop. Big Data applications need to act on data being ingested at a high rate and volume in real time. Sensor data, logs, and other events typically have the most value when they are analyzed at the time they are emitted – in real time.

For over six years Apache Storm has matured at hundreds of global companies into the preferred stream processing engine on top of Hadoop. More recently, Spark Streaming was published, built around Spark‘s Resilient Distributed Datasets (RDDs). Constructed on the same concepts as Spark, an in-memory batch compute engine, Spark Streaming offers the clear advantage of bringing batch and real-time processing closer together, as ideally the same code base can be leveraged for both.

Spark Streaming is therefore a so-called micro-batching framework that uses timed intervals. It uses so-called D-Streams (Discretized Streams), which structure computation as small sets of short, stateless, deterministic tasks. State is distributed and stored in fault-tolerant RDDs. A D-Stream can be built from various data sources such as Kafka, Flume, or HDFS, and it offers many of the same operations available on RDDs plus additional operations typical for time-based processing, such as sliding windows.

In this post we look at a fairly basic example of using Spark Streaming. We will listen to a server that emits text line by line.

Setting up the Project

A Spark or Scala project can be set up quickly by using Maven archetypes, as explained in detail here. Cask Data makes a Spark archetype available that can be used to quickly bootstrap a Spark project.

To use the Cask Data archetype:
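A sketch of generating the project; the archetype coordinates below are assumptions and should be taken from the linked Cask Data documentation:

    # archetype coordinates are placeholders, check the Cask Data docs for the real ones
    mvn archetype:generate \
      -DarchetypeGroupId=co.cask.cdap \
      -DarchetypeArtifactId=cdap-spark-archetype \
      -DarchetypeVersion=3.0.0 \
      -DgroupId=com.example.spark \
      -DartifactId=spark-streaming-example \
      -Dversion=1.0-SNAPSHOT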

Make sure to adjust the following in the pom to make use of Spark 1.3, which is the current stable release of Spark. Also include the spark-streaming_2.10 dependency for streaming.
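A minimal sketch of the relevant pom.xml entries; the spark.version property name is an assumption, and the exact entries in a generated project may differ:

    <properties>
      <!-- assumed property name; align with what the generated pom defines -->
      <spark.version>1.3.1</spark.version>
    </properties>

    <dependencies>
      <!-- spark-core is typically already present from the archetype; bump it to 1.3 -->
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>${spark.version}</version>
      </dependency>
      <!-- add the streaming module -->
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.10</artifactId>
        <version>${spark.version}</version>
      </dependency>
    </dependencies>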

NetworkWordCount

The NetworkWordCount example used here first sets up a local StreamingContext that uses a 1-second batch interval.
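A minimal sketch of that setup, modeled on Spark's bundled NetworkWordCount example (the application name and the local[2] master are assumptions for running it locally):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Use two local threads: one to receive the socket data, one to process it.
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")

    // A batch interval of 1 second cuts the stream into one-second micro-batches.
    val ssc = new StreamingContext(sparkConf, Seconds(1))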

With this context in place we can create a D-Stream that listens to the host and port we provide as command-line arguments to the application.
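A sketch of the stream creation, assuming host and port arrive as the first and second argument:

    import org.apache.spark.storage.StorageLevel

    // Connect to host:port and read the incoming data line by line.
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)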

With StorageLevel.MEMORY_AND_DISK_SER we allow Spark to persist the received data to disk as well. Go here to find out more about Spark storage levels and which one to choose.

By now we have a socket stream, established on a Spark streaming context on top of D-Streams, that reads lines of text every second. Next we want to transform this stream into a bag of words and aggregate it into a list of word counts. To transform the lines emitted by the stream into a set of words we split them up.
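A sketch of the split, following the usual word count pattern:

    // Split every line on whitespace and flatten the result into a stream of words.
    val words = lines.flatMap(_.split(" "))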

Our word counts are created by pairing each word with a count of one and reducing over the words of each RDD read in a batch interval. At the end we print out that result.
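In sketch form, again following the word count pattern from Spark's examples:

    // Pair every word with a count of 1 and sum the counts per word in each batch.
    val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

    // Print the first counts of every batch to the console.
    wordCounts.print()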

To actually run the application we finally start the context and wait for its termination.
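In sketch form, using the StreamingContext API:

    ssc.start()             // start the streaming computation
    ssc.awaitTermination()  // block until the computation is terminated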

Building and Running the Example

Spark applications can best be run locally against Spark as a standalone installation. Therefore one of the quickest ways to run the example is to build Spark locally. First we need to check out the source from GitHub. Examples of building Spark can be found here. Running it all together:
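A sketch, assuming Spark 1.3.1 and a plain Maven build without a specific Hadoop profile:

    git clone https://github.com/apache/spark.git
    cd spark
    git checkout v1.3.1
    # build Spark without running the tests
    mvn -DskipTests clean package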

Now we can submit our example.
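A sketch of the submit command; the class name, jar path, host, and port are placeholders for whatever your own project produces:

    ./bin/spark-submit \
      --class com.example.spark.NetworkWordCount \
      --master local[2] \
      /path/to/spark-streaming-example-1.0-SNAPSHOT.jar \
      localhost 9999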

Running the socket server for testing:
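Netcat can serve as the simple line-emitting server; every line typed into the terminal is sent to the connected stream:

    nc -lk 9999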

Further Readings

