Spark, originally an AMPLab project, is widely seen as the next leading compute model for distributed processing. It builds heavily on the actor model provided by Akka. Some already argue that it is going to replace everything that came before, namely MapReduce. While that is unlikely, Spark will without doubt become a core asset of modern data architectures. Here you’ll find a collection of 10 resources suited for a deep dive into Spark:
- Spark: Cluster Computing with Working Sets
One of the first publications about Spark from 2010.
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
A publication about one of the core concepts of Spark, the resilient distributed dataset (RDD).
- Disk-Locality in Datacenter Computing Considered Irrelevant
Current distributed approaches are mainly centered around the concept of data (disk) locality, and MapReduce in particular is built on it. The authors of this publication argue for a shift away from disk-locality to memory-locality in today’s distributed environments.
- GraphX: A Resilient Distributed Graph System on Spark
Apart from ML, a very promising use case for Spark is large-scale graph analysis.
- Spark at Twitter – Seattle Spark Meetup, April 2014
Twitter shares some of their viewpoints and the lessons they have learned.
- MLlib is a Spark implementation of some common machine learning algorithms
- Discretized Streams: Fault-Tolerant Streaming Computation at Scale
Reactive Akka Streams
- Shark makes Hive faster and more powerful
- Running Spark on YARN
YARN (Amazon book), as the operating system of tomorrow’s data architectures, was designed specifically to support different compute models such as Spark.
- Spark SQL unifies access to structured data
- BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
- Spark Packages
- Managing Data Transfers in Computer Clusters with Orchestra (default broadcast mechanism)