Simple Spark Streaming & Kafka Example in a Zeppelin Notebook

Apache Zeppelin is a web-based, multi-purpose notebook for data discovery, prototyping, reporting, and visualization. With its Spark interpreter, Zeppelin can also be used for rapid prototyping of streaming applications, in addition to streaming-based reports.

In this post we will walk through a simple example of creating a Spark Streaming application based on Apache Kafka. Continue reading “Simple Spark Streaming & Kafka Example in a Zeppelin Notebook”
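
As a taste of what such a prototype looks like, here is a minimal sketch of a single %pyspark paragraph that tails a Kafka topic. The topic name (events), consumer group, and ZooKeeper address (zkhost:2181) are hypothetical placeholders; KafkaUtils for Python requires Spark 1.3 or later, and Zeppelin's Spark interpreter already provides the SparkContext as sc.

```python
%pyspark
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # Python Kafka support: Spark 1.3+

# Zeppelin's Spark interpreter already provides the SparkContext as `sc`;
# build a streaming context with a 5-second batch interval on top of it.
ssc = StreamingContext(sc, 5)

# Hypothetical ZooKeeper quorum, consumer group, and topic.
stream = KafkaUtils.createStream(ssc, "zkhost:2181", "zeppelin-demo", {"events": 1})

# Print the first messages of each batch to the notebook output.
stream.map(lambda kv: kv[1]).pprint()

ssc.start()  # stop later from another paragraph with ssc.stop(stopSparkContext=False)
```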

Spark Streaming with Python

Streaming applications in Spark can be written in Scala, Java, and Python, giving developers the possibility to reuse existing code. An important note about Python with Spark in general is that its API lags behind the development of the other language APIs by several months. For Spark Streaming, only basic input sources are supported in Python: for now that means text file and text socket inputs, while sources like Flume and Kafka are not available (Kafka support arrives with Spark 1.3). A general fileStream is not supported either, just textFileStream. Continue reading “Spark Streaming with Python”
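
To make the supported sources concrete, here is a minimal sketch of a PySpark streaming word count over a text socket, one of the basic inputs that is available; host and port are placeholders (locally, nc -lk 9999 can serve as a test source).

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="PythonStreamingWordCount")
ssc = StreamingContext(sc, 10)  # 10-second batch interval

# Text socket input: one of the basic sources PySpark supports
# (feed it test data with: nc -lk 9999).
lines = ssc.socketTextStream("localhost", 9999)

counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```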

HDFS Spooling Directory with Spark

As Spark natively supports reading from any kind of Hadoop InputFormat, those data sources are also available to form DStreams for Spark Streaming applications. By using a simple HDFS file input format, an HDFS directory can be turned into a spooling directory for data ingestion.

Files newly added to that directory in an atomic way (a requirement) are picked up by the running streaming context for processing. The data could, for example, be processed and stored in an external database like HBase or Hive. Continue reading “HDFS Spooling Directory with Spark”
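
A minimal sketch of such a spooling setup in PySpark might look as follows; the HDFS path is a placeholder, and files must appear in the directory atomically, e.g. written to a temporary location first and then moved in with hdfs dfs -mv.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="HdfsSpoolingDirectory")
ssc = StreamingContext(sc, 30)  # 30-second batch interval

# textFileStream monitors the directory and turns every file that
# appears there atomically into part of the next batch's DStream.
lines = ssc.textFileStream("hdfs://namenode:8020/user/ingest/spool")

# From here the data could be transformed and written to an external
# store; printing the batch contents keeps the sketch self-contained.
lines.pprint()

ssc.start()
ssc.awaitTermination()
```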

Spark Streaming with Kafka & HBase Example

Even a simple example using Spark Streaming doesn’t quite feel complete without Kafka as the message hub; more and more use cases rely on Kafka for message transport. Taking a simple streaming example (Spark Streaming – A Simple Example, source at GitHub) together with a fictitious word-count use case, this post describes the different ways to add Kafka to a Spark Streaming application. It additionally describes how to write results from Spark directly to HBase using the TableOutputFormat. Continue reading “Spark Streaming with Kafka & HBase Example”
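
As a rough sketch of the overall shape (shown in Python, while the linked example source is in Scala), the following combines a receiver-based Kafka input with a word count whose results are written to HBase through the TableOutputFormat. The ZooKeeper quorum, topic, table, and column family names are placeholders; Kafka support in PySpark requires Spark 1.3 or later, and the key/value converters come from the Spark examples jar, which has to be supplied on the classpath (e.g. via --jars).

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # Python Kafka support: Spark 1.3+

sc = SparkContext(appName="KafkaToHBaseWordCount")
ssc = StreamingContext(sc, 10)

# Receiver-based Kafka stream of (key, message) pairs from a
# hypothetical "words" topic.
kafka_stream = KafkaUtils.createStream(
    ssc, "zkhost:2181", "wordcount-group", {"words": 1})

counts = (kafka_stream.map(lambda kv: kv[1])
                      .flatMap(lambda line: line.split(" "))
                      .map(lambda word: (word, 1))
                      .reduceByKey(lambda a, b: a + b))

# TableOutputFormat configuration for a hypothetical HBase table
# "wordcounts" with a column family "count".
hbase_conf = {
    "hbase.zookeeper.quorum": "zkhost",
    "hbase.mapred.outputtable": "wordcounts",
    "mapreduce.outputformat.class":
        "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
    "mapreduce.job.output.key.class":
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "mapreduce.job.output.value.class":
        "org.apache.hadoop.io.Writable",
}
# Converters shipped with the Spark examples jar turn Python objects
# into HBase Puts; that jar must be on the classpath.
key_conv = ("org.apache.spark.examples.pythonconverters."
            "StringToImmutableBytesWritableConverter")
value_conv = ("org.apache.spark.examples.pythonconverters."
              "StringListToPutConverter")

def write_to_hbase(rdd):
    # Each word count becomes (rowkey, [rowkey, family, qualifier, value]).
    (rdd.map(lambda wc: (wc[0], [wc[0], "count", "total", str(wc[1])]))
        .saveAsNewAPIHadoopDataset(conf=hbase_conf,
                                   keyConverter=key_conv,
                                   valueConverter=value_conv))

counts.foreachRDD(write_to_hbase)
ssc.start()
ssc.awaitTermination()
```

Going through the TableOutputFormat this way reuses Hadoop's output machinery instead of a dedicated HBase client library, which keeps the streaming job free of extra connection handling.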