Broadcast Join with Spark

With a broadcast join, one side of the join is materialized and sent to all mappers. It is therefore considered a map-side join, which can bring a significant performance improvement by omitting the otherwise required sort-and-shuffle phase of a reduce step. In this post we discuss broadcast joins with both the Spark DataFrame and RDD APIs in Scala. Continue reading “Broadcast Join with Spark”
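As a minimal sketch of the DataFrame variant (the paths, table names, and join column are illustrative assumptions), the broadcast() hint from org.apache.spark.sql.functions marks the small side to be shipped to every executor:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// A minimal sketch of a map-side (broadcast) join with the DataFrame API.
// Paths, table names, and the join column are illustrative assumptions.
object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("BroadcastJoinSketch").getOrCreate()

    val orders    = spark.read.parquet("hdfs:///data/orders")    // large fact table (assumed path)
    val customers = spark.read.parquet("hdfs:///data/customers") // small dimension table (assumed path)

    // broadcast() hints that the small side should be materialized and shipped
    // to every executor, so the join runs map-side without shuffling the large table.
    val joined = orders.join(broadcast(customers), Seq("customer_id"))
    joined.show()

    spark.stop()
  }
}
```

With the RDD API a similar effect can be achieved by collecting the small side into a local map, distributing it with sc.broadcast, and performing the lookup inside a map operation.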

Connecting Livy to a Secured Kerberized HDP Cluster

Livy.io is a proxy service for Apache Spark that allows an existing remote SparkContext to be reused by different users. By sharing the same context, Livy provides an extended multi-tenant experience in which users can effectively share RDDs and YARN cluster resources.

In summary, Livy uses an RPC architecture to extend the created SparkContext with an RPC service. Through this extension the existing context can be controlled and shared remotely by other users. On top of this, Livy introduces authorization together with enhanced session management.

[Figure: Livy architecture]

Analytic applications like Zeppelin can use Livy to offer multi-tenant Spark access in a controlled manner.
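To make this concrete, here is a minimal sketch of a client talking to Livy’s REST API (the host, port, and session id are assumptions, and the Kerberos/SPNEGO authentication that a secured cluster requires is left out):

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import scala.io.Source

// Sketch of a plain-HTTP Livy client. Host, port, and session id are assumptions;
// a Kerberized setup would additionally require SPNEGO authentication.
object LivyClientSketch {
  val livyUrl = "http://livy-host.example.com:8998" // assumed Livy endpoint

  def post(path: String, json: String): String = {
    val conn = new URL(livyUrl + path).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(json.getBytes(StandardCharsets.UTF_8))
    val body = Source.fromInputStream(conn.getInputStream).mkString
    conn.disconnect()
    body
  }

  def main(args: Array[String]): Unit = {
    // Create an interactive Spark session managed by Livy
    println(post("/sessions", """{"kind": "spark"}"""))

    // Submit a statement to the remote, shared SparkContext
    // (assuming the session created above received id 0)
    println(post("/sessions/0/statements", """{"code": "sc.parallelize(1 to 100).count()"}"""))
  }
}
```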

This post discusses setting up Livy with a secured HDP cluster.

Continue reading “Connecting Livy to a Secured Kerberized HDP Cluster”

Custom MATLAB InputFormat for Apache Spark

Hadoop supports multiple file formats as input for MapReduce workflows, including programs executed with Apache Spark. Defining custom InputFormats is a common practice among Hadoop data engineers and is discussed here based on a publicly available data set.

The approach demonstrated in this post does not provide the means for a general MATLABInputFormat for Hadoop. That would require significant effort in finding a general-purpose mapping of MATLAB™’s file format and type system to those of HDFS. Continue reading “Custom MATLAB InputFormat for Apache Spark”
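As a rough sketch of the general pattern (the class name and paths are hypothetical, and the actual MATLAB parsing is stubbed out), a custom InputFormat reads each file as a whole and is handed to Spark via newAPIHadoopFile:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical skeleton of a custom InputFormat; the record reading logic is omitted.
class MatFileInputFormat extends FileInputFormat[Text, BytesWritable] {

  // MATLAB container files cannot be split at arbitrary byte offsets
  override protected def isSplitable(context: JobContext, filename: Path): Boolean = false

  override def createRecordReader(split: InputSplit,
                                  context: TaskAttemptContext): RecordReader[Text, BytesWritable] = {
    // A real implementation would return a RecordReader that parses the
    // MATLAB file format into key/value pairs; left out in this sketch.
    ???
  }
}

object MatInputFormatUsage {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MatInputFormatExample"))

    // The custom InputFormat is used like any other Hadoop InputFormat
    val matFiles = sc.newAPIHadoopFile[Text, BytesWritable, MatFileInputFormat]("hdfs:///data/matlab")
    println(matFiles.count())
  }
}
```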

Running PySpark with Conda Env

Controlling the environment of an application is vital for its functionality and stability. Especially in a distributed environment it is important for developers to have control over the versions of their dependencies. In such a scenario it is critical to ensure that potentially conflicting requirements of multiple applications do not interfere with each other.

That is why frameworks like YARN ensure that each application is executed in a self-contained environment – typically a Linux container or Docker container – that is controlled by the developer. In this post we show what this means for Python environments used by Spark. Continue reading “Running PySpark with Conda Env”

Running PySpark with Virtualenv

Controlling the environment of an application is vital for its functionality and stability. Especially in a distributed environment it is important for developers to have control over the versions of their dependencies. In such a scenario it is critical to ensure that potentially conflicting requirements of multiple applications do not interfere with each other.

That is why frameworks like YARN ensure that each application is executed in a self-contained environment – typically a Linux (Java) container or Docker container – that is controlled by the developer. In this post we show what this means for Python environments used by Spark. Continue reading “Running PySpark with Virtualenv”

HDFS Spooling Directory with Spark

As Spark natively supports reading from any kind of Hadoop InputFormat, those data sources are also available to form DStreams for Spark Streaming applications. By using a simple HDFS file input format, an HDFS directory can be turned into a spooling directory for data ingestion.

Files added to that directory in an atomic way (a requirement) are picked up by the running streaming context for processing. The data could, for example, be processed and stored in an external database like HBase or Hive. Continue reading “HDFS Spooling Directory with Spark”
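A minimal sketch of such a spooling directory (the HDFS path and batch interval are assumptions) uses textFileStream, which only picks up files that appear in the directory after the streaming context has started:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: treat an HDFS directory as a spooling directory. The path and batch
// interval are assumptions; files must be moved into the directory atomically.
object HdfsSpoolingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HdfsSpoolingDirectory")
    val ssc  = new StreamingContext(conf, Seconds(30))

    // New files appearing under this directory form the DStream
    val lines = ssc.textFileStream("hdfs:///data/spool/incoming")
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```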

Spark Streaming with Kafka & HBase Example

Even a simple example using Spark Streaming doesn’t quite feel complete without Kafka as the message hub. More and more use cases rely on Kafka for message transportation. Taking a simple streaming example (Spark Streaming – A Simple Example, source at GitHub) together with a fictitious word count use case, this post describes the different ways to add Kafka to a Spark Streaming application. Additionally, this post describes how results can be written to HBase from Spark directly using the TableOutputFormat. Continue reading “Spark Streaming with Kafka & HBase Example”
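As a hedged sketch of how the pieces might fit together (broker, topic, table, and column family names are assumptions; an HBase 1.x client with hbase-site.xml on the classpath is also assumed), a direct Kafka stream feeds a word count whose results are written to HBase through the TableOutputFormat:

```scala
import kafka.serializer.StringDecoder
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch only: broker list, topic, table, and column family are assumptions.
object KafkaHBaseWordCountSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("KafkaHBaseWordCount"), Seconds(10))

    // Direct (receiver-less) Kafka stream from the assumed topic "words"
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("words")).map(_._2)

    val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)

    // Write each micro-batch to the assumed HBase table "wordcounts" via TableOutputFormat
    counts.foreachRDD { rdd =>
      val hbaseConf = HBaseConfiguration.create() // picks up hbase-site.xml from the classpath
      hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "wordcounts")
      val job = Job.getInstance(hbaseConf)
      job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

      rdd.map { case (word, count) =>
        val put = new Put(Bytes.toBytes(word))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(count.toString))
        (new ImmutableBytesWritable, put)
      }.saveAsNewAPIHadoopDataset(job.getConfiguration)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```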

Spark Streaming – A Simple Example

Stream data processing has become an inherent part of a modern data architecture built on top of Hadoop. Big Data applications need to act on data being ingested at a high rate and volume in real time. Sensor data, logs, and other events likely have the most value when analyzed at the time they are emitted – in real time.

For over six years Apache Storm has matured at hundreds of global companies into the preferred stream processing engine on top of Hadoop. More recently, Spark Streaming was published, built around Spark’s Resilient Distributed Datasets (RDDs). Constructed on the same concepts as Spark, an in-memory batch compute engine, Spark Streaming offers the clear advantage of bringing batch and real-time processing closer together, as ideally the same code base can be leveraged for both.

Spark Streaming is thus a so-called micro-batching framework that operates on timed intervals. It uses D-Streams (Discretized Streams), which structure computation as small sets of short, stateless, and deterministic tasks. State is distributed and stored in fault-tolerant RDDs. A D-Stream can be built from various data sources such as Kafka, Flume, or HDFS, and it offers many of the same operations available for RDDs along with additional time-based operations such as sliding windows.

In this post we look at a fairly basic example of using Spark Streaming: we will listen to a server emitting text line by line. Continue reading “Spark Streaming – A Simple Example”
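A minimal sketch of such a streaming word count (host, port, and batch interval are assumptions; a local test server can be started with nc -lk 9999) looks like this:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: word count over lines received from a socket server.
// Host, port, and batch interval are assumptions.
object StreamingWordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount")
    val ssc  = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```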

10+ Resources For A Deep Dive Into Spark

Spark, initially an AMPLab project, is widely seen as the next top compute model for distributed processing. It builds strongly on the actor model provided through Akka. Some already argue it is going to replace everything there is – namely MapReduce. While this is hardly going to be the case, Spark will without a doubt become a core asset of modern data architectures. Here you’ll find a collection of 10+ resources for a deep dive into Spark:

  1. Spark: Cluster Computing with Working Sets
    One of the first publications about Spark from 2010.
  2. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
    A publication about one of the core concepts of Spark, the resilient distributed dataset (RDD).
  3. Disk-Locality in Datacenter Computing Considered Irrelevant
    Current distributed approaches are mainly centered around the concept of data (disk) locality; MapReduce in particular is based on this concept. The authors of this publication argue for a shift away from disk locality toward memory locality in today’s distributed environments.
  4. GraphX: A Resilient Distributed Graph System on Spark
    A very promising use case apart from ML is the use of Spark for large-scale graph analysis.
  5. Spark at Twitter – Seattle Spark Meetup, April 2014
    Twitter shares some of their viewpoints and the lessons they have learned.
  6. MLlib is a Spark implementation of some common machine learning algorithms
  7. Discretized Streams: Fault-Tolerant Streaming Computation at Scale
    Reactive Akka Streams
  8. Shark makes Hive faster and more powerful
  9. Running Spark on YARN
    YARN (Amazon book), as the operating system of tomorrow’s data architectures, was specifically designed to support different compute models such as Spark.
  10. Spark SQL unifies access to structured data
  11. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
  12. Spark Packages
  13. Managing Data Transfers in Computer Clusters with Orchestra
    (default broadcast mechanism)