Stream data processing has become an inherent part of a modern data architecture built on top of Hadoop. Big Data applications need to act on data that is ingested at a high rate and volume in real time. Sensor data, logs, and other events likely have the most value when they are analyzed at the time they are emitted, in real time.
For over 6 years Apache Storm has matured at hundreds of global companies into the preferred stream processing engine on top of Hadoop. More recently, Spark Streaming was published, built around Spark's Resilient Distributed Datasets (RDDs). Built on the same concepts as Spark, an in-memory batch compute engine, Spark Streaming offers the clear advantage of bringing batch and real-time processing closer together, as ideally the same code base can be used for both.
Spark Streaming is a so-called micro-batching framework that processes data in timed intervals. It uses D-Streams (Discretized Streams), which structure a computation as small sets of short, stateless, deterministic tasks. State is distributed and stored in fault-tolerant RDDs. A D-Stream can be built from various data sources such as Kafka, Flume, or HDFS, and offers many of the same operations available for RDDs plus operations typical for time-based processing, such as sliding windows.
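To make the micro-batching idea more concrete, here is a minimal sketch in plain Python. It does not use Spark at all; the function names (`discretize`, `count_words`, `windowed_counts`) and the sample events are hypothetical and chosen only for illustration. It groups timestamped lines into fixed intervals, runs a stateless word count on each batch, and merges the per-batch results over a sliding window.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical event shape: (arrival time in seconds, line of text).
Event = Tuple[float, str]


def discretize(events: Iterable[Event], batch_interval: float) -> List[List[str]]:
    """Group timestamped lines into consecutive fixed-length micro-batches."""
    batches: List[List[str]] = []
    for timestamp, line in sorted(events):
        index = int(timestamp // batch_interval)
        # Pad with empty batches for intervals in which nothing arrived.
        while len(batches) <= index:
            batches.append([])
        batches[index].append(line)
    return batches


def count_words(batch: List[str]) -> Counter:
    """Stateless, deterministic task applied to a single micro-batch."""
    return Counter(word for line in batch for word in line.split())


def windowed_counts(batches: List[List[str]], window_length: int, slide: int) -> List[Counter]:
    """Merge per-batch counts over a sliding window measured in batches."""
    per_batch = [count_words(batch) for batch in batches]
    results: List[Counter] = []
    for end in range(slide, len(per_batch) + 1, slide):
        start = max(0, end - window_length)
        total: Counter = Counter()
        for counts in per_batch[start:end]:
            total.update(counts)
        results.append(total)
    return results


if __name__ == "__main__":
    # Made-up sample input: four lines arriving over roughly four seconds.
    events = [(0.2, "a b"), (0.9, "a"), (1.5, "b c"), (3.1, "a")]
    batches = discretize(events, batch_interval=1.0)
    for window in windowed_counts(batches, window_length=2, slide=1):
        print(dict(window))
```

Because each batch is processed independently and deterministically, a lost result can be recomputed from its input batch. That is the property this sketch tries to mirror, not Spark's actual implementation.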
In this post we look at a fairly basic example of using Spark Streaming: we will listen to a server that emits text line by line.
Spark, initially an AMPLab project, is widely seen as the next leading compute model for distributed processing. It builds heavily on the actor model provided through Akka. Some already argue that it is going to replace everything there is, namely MapReduce. While this is hardly going to be the case, Spark will without doubt become a core asset of modern data architectures. Here you’ll find a collection of 10 resources suited to a deep dive into Spark:
- Spark: Cluster Computing with Working Sets
One of the first publications about Spark from 2010.
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
A publication about one of the core concepts of Spark, the resilient distributed dataset (RDD).
- Disk-Locality in Datacenter Computing Considered Irrelevant
Current distributed approaches are mainly centered around the concept of data (disk) locality. MapReduce in particular is based on this concept. The authors of this publication argue for a shift away from disk-locality to memory-locality in today’s distributed environments.
- GraphX: A Resilient Distributed Graph System on Spark
Apart from ML, a very promising use case for Spark is large-scale graph analysis.
- Spark at Twitter – Seattle Spark Meetup, April 2014
Twitter shares some of their viewpoints and the lessons they have learned.
- MLlib is a Spark implementation of some common machine learning algorithms
- Discretized Streams: Fault-Tolerant Streaming Computation at Scale
Reactive Akka Streams
- Shark makes Hive faster and more powerful
- Running Spark on YARN
YARN (Amazon book), as the operating system of tomorrow’s data architectures, was designed specifically to host different compute models such as Spark.
- Spark SQL unifies access to structured data
- BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
- Spark Packages
- Managing Data Transfers in Computer Clusters with Orchestra (default broadcast mechanism)