10+ Resources For A Deep Dive Into Spark

Spark, initially an AMPLab project, is widely seen as the next leading compute model for distributed processing. It builds heavily on the actor model provided through Akka. Some already argue it is going to replace everything there is – namely MapReduce. While this is hardly going to be the case, Spark will without doubt become a core asset of modern data architectures. Here you’ll find a collection of more than 10 resources suited for a deep dive into Spark:

  1. Spark: Cluster Computing with Working Sets
    One of the first publications about Spark from 2010.
  2. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
    A publication about one of the core concepts of Spark, the resilient distributed dataset (RDD).
  3. Disk-Locality in Datacenter Computing Considered Irrelevant
    Current distributed approaches are mainly centered around the concept of data (disk) locality. MapReduce in particular is built on this concept. The authors of this publication argue for a shift away from disk locality toward memory locality in today’s distributed environments.
  4. GraphX: A Resilient Distributed Graph System on Spark
    A very promising use case apart from ML is the use of Spark for large-scale graph analysis.
  5. Spark at Twitter – Seattle Spark Meetup, April 2014
    Twitter shares some of their viewpoints and the lessons they have learned.
  6. MLlib is a Spark implementation of some common machine learning algorithms
  7. Discretized Streams: Fault-Tolerant Streaming Computation at Scale
    Reactive Akka Streams
  8. Shark makes Hive faster and more powerful
  9. Running Spark on YARN
    YARN (amazon book), as the operating system of tomorrow’s data architectures, was specifically designed to support different compute models such as Spark.
  10. Spark SQL unifies access to structured data
  11. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
  12. Spark Packages
  13. Managing Data Transfers in Computer Clusters with Orchestra
    (default broadcast mechanism)

Get Started with Hadoop – Now!!

Looking back, it is remarkable how mature Hadoop has become. Not only the maturity itself but also the pace at which it was reached is impressive. Early projects jumped right onto the Hadoop wagon with big but unclear expectations. What was great about those times was that it felt like a gold rush, and Hadoop’s simple and inherently scalable paradigm made sure this path was lined with successful projects. In his recent book, Arun Murthy identifies four phases Hadoop has gone through so far:

  • Phase 0: The Era of Ad Hoc Hadoop
  • Phase 1: Hadoop on Demand
  • Phase 2: Dawn of the Shared Cluster
  • Phase 3: Emergence of YARN
