Spark Streaming with Python

Streaming applications in Spark can be written in Scala, Java and Python giving developers the possibility to reuse existing code. An important note about Python in general with Spark is that it lacks behind the development of the other APIs by several months. For Spark Streaming only basic input sources are supported. Sources like Flume and Kafka might not be supported. For now only text file and text socket inputs are supported (Kafka support is available with Spark 1.3). A general fileStream is not supported just textFileStream. Continue reading “Spark Streaming with Python” →

HiveSink for Flume

With the most recent release of HDP (v2.2.4) Hive Streaming is shipped as technical preview. It can for example be used with Storm to ingest streaming data collected from Kafka as demonstrated here. But it also still has some serious limitations and in case of Storm a major bug. Nevertheless Hive Streaming is likely to become the tool of choice when it comes to streamline data ingestion to Hadoop. So it is worth to explore already today.

Flume’s upcoming release 1.6 will contain a HiveSink capable of leveraging Hive Streaming for data ingestion. In the following post we will use it as a replacement for the HDFS sink used in a previous post here. Other then replacing the HDFS sink with a HiveSink none of the previous setup will change, except for Hive table schema which needs to be adjusted as part of the requirements that currently exist around Hive Streaming. So let’s get started by looking into these restrictions. Continue reading “HiveSink for Flume” →

HDFS Spooling Directory with Spark

As Spark natively supports reading from any kind of Hadoop InputFormat, those data sources are also available to form DStreams for Spark Streaming applications. By using a simple HDFS file input format a HDFS directory can be turned into a spooling directory for data ingestion.

Files newly added to that directory in an atomic way (required) would be picked up by the running streaming context for processing. The data could for example be processed and stored in an external database like HBase or Hive. Continue reading “HDFS Spooling Directory with Spark” →

Hadoop File Ingest and Hive

In the beginning of all Hadoop adventures is the task of ingesting data to HDFS preferably today being queried for analysis by Hive at any point in time. High chances are that most enterprise data today at the beginning of any Hadoop project resides inside of RDBMS systems. Sqoop is the tool of choice within the Hadoop ecosystem for these kind of data. It is also quite convenient to use with Hive directly.

As most business is inherently event driven and more and more electronic devices are being used to track this events, ingesting a stream of data to Hadoop is a common demand. A tool like Kafka would be used for data ingestion into Hadoop in such a scenario of stream processing.

None of the methods mentioned above consider the sheer amount of data stored in files today. Not to mention the files newly created day by day. While WebHDFS or direct HDFS sure are convenient method for file ingestion they often require direct access to the cluster or a huge landing zone also with direct access to HDFS. A continues data ingest is also not supported.

For such scenarios Apache Flume sure would be a good option. Flume is capable of dealing with various continues data sources. Sources can be piped together over several nodes through channels writing data into various sink. In this post we look at the possibility to define a local directory where files can be dropped off, while Flume monitors for new files in that directory to sink to HDFS. Continue reading “Hadoop File Ingest and Hive” →

Hadoop Credential API

In Hadoop 2.6 a fundamental feature was introduced that did not get much attention but will play an important role moving forward – the Credential API. Looking ahead the Credential Management Framework (CMF) will play an important role for the pluggable token authentication framework, column encryption in ORC files, or the transparent data encryption. But not only future components but applications build for Hadoop can benefit from it. Continue reading “Hadoop Credential API” →

Spark Streaming with Kafka & HBase Example

Even a simple example using Spark Streaming doesn’t quite feel complete without the use of Kafka as the message hub. More and more use cases rely on Kafka for message transportation. By taking a simple streaming example (Spark Streaming – A Simple Example source at GitHub) together with a fictive word count use case this post describes the different ways to add Kafka to a Spark Streaming application. Additionally this posts describes the possibility to write out results to HBase from Spark directly using the TableOutputFormat. Continue reading “Spark Streaming with Kafka & HBase Example” →

My Agenda for Hadoop Summit 2015 in Brussels

Next week starts Hadoop Summit 2015 in Brussels. A week packed full of interesting talks additionally aligned with diverse community events to connect with amazing people from the field it makes up a fun Hadoop event ahead. The “Birds of Feahter Sessions” and the “Hadoop Crash Course” are great opportunities for beginners. Continue reading “My Agenda for Hadoop Summit 2015 in Brussels” →

Apache Kafka: Queuing for Hadoop

Apache Kafka is a distributed system designed for streams that is often being categorized as a messaging system but provides a fundamentally different abstraction, although it serves a similar role. The key abstraction of Kafka to keep in mind is a structured commit log of events. With events being any kind of system, user, or machine emitted data. Kafka is built to:

Fault-tolerant
High throughput
Horizontally scalable
Allow geographically distributing data streams and processing.

A constantly growing number of data generated at today companies is event data. While there is an approach to combine machine generated data under the umbrella term of
Internet-of-Things (IoT) it is crucial to understand that business is inherently event driven.
A purchase, a customer claim or registration are just examples of such events. Business is interactive. When analyzing the data time matters. Most of this data has it’s highest value when analyzed close to or even in real-time.

Apache Kafka was created to solve two main problems that arise from the ever increasing demand for stream data processing. Designed for reliability Kafka is capable of scaling against the growing demand for events passing. Secondly Kafka can interact with various applications and platform for the same events, which helps to orchestrated today’s complex architectures providing a central message hub for each system. Today chances are that all that data will end up in Hadoop for further or even real time analyses making Kafka a queue to Hadoop. Continue reading “Apache Kafka: Queuing for Hadoop” →

Spark Streaming – A Simple Example

Streamline data processing has become an inherent part of a modern data architecture build on top of Hadoop. Big Data applications need to act on data being ingested at a high rate and volume in real time. Sensor data, logs and other events likely have the most value when being analyst at the time they are emited – in real time.

For over 6 years Apache Storm has matured at hundreds of global companies to the preferred stream processing engine on top of Hadoop. In a more recent approach Spark Streaming was published build around Spark‘s Resilient Distributed Datasets (RDD). Constructed with the same concepts as Spark, a in-memory batch compute enginge, Spark Streaming is offering the clear advantage of bringing batch and real time processing closer together, as ideally the same code base can be leveraged for both.

Hence Spark Streaming is a so called micro-batching framework that uses timed intervals. It uses so called D-Streams (Discretized Stream) that structure computation as small sets of short, stateless, and deterministic tasks. State is distributed and stored in fault-tolerant RDDs. A D-Stream can be build from various data sources as Kafka, Flume, or HDFS offering many of the same operations available for RDDs with additional operations typical for time operations such as sliding windows.

In this post we are looking at a fairly basic example of using Spark Streaming. We will listen to a server emitting line by line in this example. Continue reading “Spark Streaming – A Simple Example” →

Distcp between two HA Cluster

With HDFS High Availability two nodes can act as a NameNode to the system, but not at the same time. Only one of the nodes acts as the active NameNode at any point in time while the other node is in a standby state. The standby node only acts as a slave node preserving enough state to immediately take over when the active node dies. In that it differs from the before existing SecondaryNamenode which was not able to immediately take over.

From a client perspective most confusing is the fact how the active NameNode is discovered? How is HDFS High Availability configured? In this post we look at how for example to distribute data between two clusters in HA mode. Continue reading “Distcp between two HA Cluster” →