Simple Spark Streaming & Kafka Example in a Zeppelin Notebook

Apache Zeppelin is a web-based, multi-purpose notebook for data discovery, prototyping, reporting, and visualization. With its Spark interpreter, Zeppelin can also be used for rapid prototyping of streaming applications as well as for streaming-based reports.

In this post we will walk through a simple example of creating a Spark Streaming application based on Apache Kafka.

Creating a Notebook

For our example we first need to create a new notebook, which we’ll name “Simple Spark Streaming Kafka Example”:

Naming the notebook:

Adding Dependencies to Spark Interpreter

For our Kafka example we rely on dependencies that are not necessarily included with the SparkContext created by the Zeppelin interpreter. Zeppelin allows you to import arbitrary packages available in any Maven repository. Any non-standard/non-public repository needs to be configured in Zeppelin first.

We can reach the interpreter settings through the notebook options as shown in the picture above. Here we can set the default interpreter for our notebook, but we can also enter the settings page for all interpreters:

Find the Spark interpreter in the list of available Zeppelin interpreters for editing:

Scroll to the bottom, where you should find the Dependencies section, in which you can add additional packages:

These are the packages needed, depending on your Kafka distribution as well as the Scala release you are using for Spark:

org.apache.spark:spark-streaming-kafka_2.10:1.6.2
org.apache.kafka:kafka_2.10:0.8.2.2
org.apache.kafka:kafka-clients:0.8.2.2

Save and restart the interpreter.
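
Alternatively, depending on your Zeppelin release, the packages can also be loaded at runtime through the %dep interpreter, in a paragraph executed before the first Spark paragraph. A minimal sketch for the versions listed above:

%dep
z.reset()
z.load("org.apache.spark:spark-streaming-kafka_2.10:1.6.2")
z.load("org.apache.kafka:kafka_2.10:0.8.2.2")
z.load("org.apache.kafka:kafka-clients:0.8.2.2")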

Preparing Kafka Topic

For this example we create a simple topic named “spark-test-topic” with just one partition:

$ cd /usr/hdp/current/kafka-broker/
$ bin/kafka-topics.sh --create \
> --topic spark-test-topic \
> --zookeeper node1.hdp:2181 \
> --partitions 1 \
> --replication-factor 1
Created topic "spark-test-topic".
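
To double-check the topic and its partition assignment, we can describe it with the same script:

$ bin/kafka-topics.sh --describe \
> --topic spark-test-topic \
> --zookeeper node1.hdp:2181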

In our example we will use a Spark Streaming app to read the key-value messages sent to this topic’s partition and split them by spaces, simply printing out the individual words of each interval.

For this we don’t need a dedicated producer but can simply reuse the existing console producer, which sends each entered line as one message to the topic:

$ bin/kafka-console-producer.sh --broker-list node1.hdp:6667 --topic spark-test-topic
word word hello
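
To sanity-check that the messages actually arrive, independently of Spark, you can attach the console consumer in a second terminal (Kafka 0.8 consumes via Zookeeper); it should echo the messages sent above:

$ bin/kafka-console-consumer.sh --zookeeper node1.hdp:2181 \
> --topic spark-test-topic --from-beginning
word word hello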

Simple Spark Streaming Application

For our example we need a couple of dependencies not already imported into the created SparkContext: the Kafka message decoders as well as KafkaUtils from Spark’s Kafka streaming package.

import _root_.kafka.serializer.DefaultDecoder
import _root_.kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._

We set the log level to ERROR to keep INFO messages from polluting the notebook output. The StreamingContext will run with a 5-second batch interval:

sc.setLogLevel("ERROR")  // prevent INFO logging from polluting the output

val ssc = new StreamingContext(sc, Seconds(5))    // create the StreamingContext with a 5-second batch interval

For our stream we need the Kafka configuration of the topic we subscribe to, which we hold in the kafkaConf map:

val kafkaConf = Map(
    "metadata.broker.list" -> "node1.hdp:6667",
    "zookeeper.connect" -> "node1.hdp:2181",
    "group.id" -> "kafka-streaming-example",
    "zookeeper.connection.timeout.ms" -> "1000"
)

Finally we create a DStream and flatMap the messages into individual words separated by spaces:

val lines = KafkaUtils.createStream[Array[Byte], String, DefaultDecoder, StringDecoder](
    ssc,
    kafkaConf,
    Map("spark-test-topic" -> 1),   // subscripe to topic and partition 1
    StorageLevel.MEMORY_ONLY
)

val words = lines.flatMap { case (_, message) => message.split(" ") }

words.print()
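
As a side note, the spark-streaming-kafka package also provides a receiver-less direct stream, which consumes straight from the brokers in metadata.broker.list instead of going through a Zookeeper-based receiver. A minimal sketch reusing the kafkaConf from above (with String keys instead of the byte-array keys used before):

// receiver-less variant: offsets are tracked by Spark instead of Zookeeper
val directLines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc,
    kafkaConf,
    Set("spark-test-topic")
)

val directWords = directLines.flatMap { case (_, message) => message.split(" ") }
directWords.print()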

The complete notebook:

The complete code example:

%spark
import _root_.kafka.serializer.DefaultDecoder
import _root_.kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._

// prevent INFO logging from polluting the output
sc.setLogLevel("ERROR")

// create the StreamingContext with a 5-second batch interval
val ssc = new StreamingContext(sc, Seconds(5))

val kafkaConf = Map(
    "metadata.broker.list" -> "node1.hdp:6667",
    "zookeeper.connect" -> "node1.hdp:2181",
    "group.id" -> "kafka-streaming-example",
    "zookeeper.connection.timeout.ms" -> "1000"
)

val lines = KafkaUtils.createStream[Array[Byte], String, DefaultDecoder, StringDecoder](
    ssc,
    kafkaConf,
    Map("spark-test-topic" -> 1),   // subscripe to topic and partition 1
    StorageLevel.MEMORY_ONLY
)

val words = lines.flatMap { case (_, message) => message.split(" ") }

words.print()

ssc.start()
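
If the paragraph returns immediately without printing any batches, it can help to replace the bare ssc.start() with a variant that blocks the paragraph for a bounded time before shutting the stream down. A minimal sketch, with an arbitrary 60-second timeout:

ssc.start()
ssc.awaitTerminationOrTimeout(60 * 1000)   // block the paragraph for up to 60 seconds
ssc.stop(stopSparkContext = false)         // stop the stream but keep the notebook's SparkContext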

Done.

9 thoughts on “Simple Spark Streaming & Kafka Example in a Zeppelin Notebook”

  1. Hi !

    Thank you very much for your guide, I find it very useful.
    Although I can run all the code, I think I am missing something else, because I cannot see the output of the script in my Zeppelin Notebook. In fact, I tried to run the same code in the spark-shell and it does not print any result either.

    First I thought it was due to communication issues; however, my Zeppelin (a docker container) can reach Spark, Kafka and Zookeeper (also other containers).
    My second thought is that it connects but does not get the data.
    Kafka works fine. I can use producers and consumers in other applications.

    Would you have any idea why is this happening?

    Find below my versions:
    Spark: 2.1.0
    Kafka: 0.9.0.1
    Zeppelin: 0.7
    Zookeeper: 3.4.9
    org.apache.spark:spark-streaming-kafka_2.11:2.1.0
    org.apache.kafka:kafka_2.11:0.9.0.1
    org.apache.kafka:kafka-clients:0.9.0.1

    Thank you very much !!

  2. Yeah, thank you for the wonderful tutorial!

    I also got the same problem: no output.

    I added ssc.awaitTermination() but that did not help.

    Any suggestion?

    Thanks

    John Lee

  3. The same issue for me as well. However, when I run the notebook there are no errors, it just hangs, although the topic has data streaming in and I can see that in the Kafka consumer console. Any help would be great.

  4. The same issue here. After adding ssc.awaitTermination() at the end, I can see a timestamp printed every 5 seconds, but obviously it doesn’t read messages from Kafka. With exactly the same code in the Eclipse Scala IDE, it runs and consumes messages from Kafka successfully. Weird. Any advice?

    1. You just need to write new words in the Kafka console producer, because the consumer only listens to the newest messages (those sent after its start).
      Try to write:
      word1 word2 word3
      and tell me if it works.

  5. thanks for your wonderful guide
    however I have the same problem as the others: when I submit the job with spark-submit it works, but with Zeppelin there is no output
