HDFS Spooling Directory with Spark

Because Spark natively supports reading from any Hadoop InputFormat, these data sources can also be used to form DStreams in Spark Streaming applications. With a simple HDFS file input format, an HDFS directory can be turned into a spooling directory for data ingestion.

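Files newly added to that directory are picked up by the running streaming context for processing. They must be added atomically, for example by moving or renaming a fully written file into the directory, so that Spark never sees a partially written file. The data could then be processed and stored in an external system such as HBase or Hive.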
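One way to meet the atomicity requirement is to upload a file to a staging directory on the same HDFS first and then rename it into the spooling directory, since a rename within HDFS is atomic. The following Python sketch illustrates the idea under some assumptions: the helper names and the /spark_log_staging directory are hypothetical, and the hdfs command-line client is assumed to be on the PATH.

```python
import posixpath
import subprocess


def atomic_put_commands(local_file, spool_dir, staging_dir):
    """Build the two HDFS shell commands for an atomic hand-over.

    The file is first copied to staging_dir (outside the watched
    directory) and only then renamed into spool_dir.
    """
    name = posixpath.basename(local_file)
    staged = posixpath.join(staging_dir, name)
    final = posixpath.join(spool_dir, name)
    return [
        # Step 1: slow copy into a directory Spark is not watching.
        ["hdfs", "dfs", "-put", local_file, staged],
        # Step 2: rename into the spooling directory.
        ["hdfs", "dfs", "-mv", staged, final],
    ]


def atomic_put(local_file, spool_dir, staging_dir):
    """Run both commands, stopping at the first one that fails."""
    for cmd in atomic_put_commands(local_file, spool_dir, staging_dir):
        subprocess.run(cmd, check=True)
```

Calling atomic_put("/tmp/access.log", "/spark_log", "/spark_log_staging") would copy the file into the staging directory and then move it into /spark_log in a single step. The staging directory is kept outside the watched directory so that the stream never sees the half-written copy.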

Such a streaming application is quite simple to assemble, yet it can have a large impact on data ingestion approaches in Hadoop, because a typical landing zone could become obsolete.

Today such a landing zone often resides on an edge node close to the cluster, running tools like Ab Initio or DataStage that are only capable of offline processing and treat Hadoop as “yet another filesystem”. Spark offers an alternative: such ETL workflows can be executed in parallel on the cluster itself.

A simple spooling directory on HDFS needs little more than a streaming context with a text file stream on the target directory. A fairly simple example is an application that prints each line of any file copied to that HDFS directory to the screen.
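A minimal sketch of such an application using the Python API could look as follows. The application name, the 10 second batch interval and the helper names are assumptions made for illustration; pprint() is the Python counterpart of the print() output operation.

```python
import sys


def build_stream(ssc, directory):
    """Attach a text file stream to `directory` and print each batch."""
    # Every new file appearing in the directory becomes part of a batch,
    # with one stream element per line of text.
    lines = ssc.textFileStream(directory)
    # Print the first elements of every batch on the driver.
    lines.pprint()
    return lines


def main(directory, batch_seconds=10):
    # Imported here so the module can be loaded without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="HdfsSpoolingDirectory")
    ssc = StreamingContext(sc, batch_seconds)
    build_stream(ssc, directory)
    ssc.start()             # begin watching the directory
    ssc.awaitTermination()  # block until the job is stopped


# Only start streaming when a directory is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Saved as spool_dir.py (a hypothetical file name), it could be started with spark-submit spool_dir.py /spark_log.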

Once built, the application can be submitted to the cluster with the HDFS directory to watch as its argument, for example /spark_log.

A file uploaded to that directory is then printed line by line to the terminal as part of the Spark Streaming output. Note that the print operation shows only the first elements of each batch (ten by default), so not every line of a larger file appears.

Storing that data in an external database, for example Hive or HBase, can be achieved with the foreachRDD output operation, which hands the RDD of each batch to a user-supplied function.
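As a rough sketch in Python, foreachRDD could hand every non-empty batch to a function that appends its lines to a Hive table. The table name spark_log_lines, the single column named line and the helper names are assumptions, and the cluster is assumed to have Hive support configured for Spark.

```python
TABLE_NAME = "spark_log_lines"  # hypothetical Hive table


def write_to_hive(rdd, table=TABLE_NAME):
    """Append the lines of one batch to a Hive table (assumed setup)."""
    # Imported here so the module can be loaded without Spark installed.
    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    # One row per line, with a single string column called "line".
    df = spark.createDataFrame(rdd.map(lambda line: Row(line=line)))
    df.write.mode("append").saveAsTable(table)


def save_batch(time, rdd, writer=write_to_hive):
    """Called on the driver once per batch interval.

    Returns nothing; foreachRDD is used only for its side effects.
    """
    # Batches without new files are empty and can be skipped.
    if not rdd.isEmpty():
        writer(rdd)


def attach_sink(lines):
    """Register save_batch as the output operation of the stream."""
    lines.foreachRDD(save_batch)
```

Here attach_sink would be called on the stream returned by textFileStream before the streaming context is started. Keeping the writer as a parameter makes it possible to swap in a different sink, such as an HBase client, without touching the batch handling.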
