How Apache Spark Works (Short Summary)
A concise overview of why Apache Spark was created, how RDDs enable in-memory processing for iterative and interactive workloads, and its key programming abstractions.
Why Spark Was Made
Before Spark existed, MapReduce was the standard processing engine on top of Hadoop.
See the original Apache Spark research paper.
The typical flow for MapReduce is:
- Read data from disk
- Apply some processing
- Dump intermediary data on disk
- Do some more processing
- Show final results
Now suppose the same processing has to be run repeatedly, changing some variable on each pass over the entire dataset. MapReduce starts over from disk every time: run the processing 100 times and it performs 100 full reads from disk, plus the extra disk writes and reads for intermediary data on each run.
Spark was built for exactly these use cases, where the same processing is applied repeatedly to a dataset with varying inputs.
Two Typical Use Cases Where Spark Shines
Iterative Jobs
Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty.
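The iterative pattern can be sketched in plain Scala (Spark's host language). This is a single-machine illustration with made-up data, not Spark API code: the dataset stays in memory across iterations while gradient descent tunes a single parameter. In Spark, the dataset would be a cached RDD, so each iteration reuses memory instead of re-reading from disk.

```scala
// Gradient descent over an in-memory dataset: fit w to minimize
// sum((x - w)^2), whose optimum is the mean of the data.
val data = Seq(1.0, 2.0, 3.0, 4.0) // loaded once, kept in memory
val lr   = 0.05                    // learning rate

var w = 0.0
for (_ <- 1 to 200) {
  // one pass over the in-memory data per iteration; no disk reads
  val grad = data.map(x => 2 * (w - x)).sum
  w -= lr * grad
}
// w converges to the mean of the data, 2.5
```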
Interactive Analysis
Hadoop is often used to perform ad-hoc exploratory queries on big datasets, through SQL interfaces such as Pig and Hive. Ideally, a user would be able to load a dataset of interest into memory across a number of machines and query it repeatedly.
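The interactive pattern can be sketched the same way with plain Scala collections (data and names are hypothetical): load the dataset once, then run several ad-hoc queries against the in-memory copy. With Spark, the in-memory copy would be a cached RDD partitioned across machines.

```scala
// Load once into memory, then query repeatedly without touching disk again.
val logs = Seq(
  "INFO start", "ERROR disk full", "WARN slow", "ERROR timeout"
) // imagine this was read from HDFS once and cached

// Each "query" is just another operation on the in-memory collection.
val errorCount = logs.count(_.startsWith("ERROR"))
val warnCount  = logs.count(_.startsWith("WARN"))
val timeouts   = logs.filter(_.contains("timeout"))
</test>
```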
RDDs: The Core Abstraction
To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs).
An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition.
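A toy single-machine sketch of lineage (all names hypothetical): a derived partition remembers the parent data and the transformation it came from, so a lost partition can be recomputed rather than restored from a replica.

```scala
// Lineage sketch: a derived partition = (parent partition, transformation).
// If the derived partition is lost, re-apply the transformation to the
// surviving parent partition to rebuild exactly that piece.
val parentPartition = Seq(1, 2, 3, 4, 5)
val transform: Seq[Int] => Seq[Int] = _.filter(_ % 2 == 1).map(_ * 10)

var derivedPartition = transform(parentPartition) // Seq(10, 30, 50)
derivedPartition = null                           // simulate losing the partition
derivedPartition = transform(parentPartition)     // rebuild from lineage
```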
Programming Model
Spark provides two main abstractions for parallel programming:
- Resilient distributed datasets
- Parallel operations on these datasets (invoked by passing a function to apply on a dataset)
These are based on typical functional programming concepts of map, flatMap, filter, etc.
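These operations behave like their counterparts on ordinary Scala collections, which is a handy way to reason about them. A local sketch (not cluster code):

```scala
val lines = Seq("to be", "or not", "to be")

val words   = lines.flatMap(_.split(" ")) // flatten lines into words
val lengths = words.map(_.length)         // transform each element
val shorts  = words.filter(_.length == 2) // keep matching elements
```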
In addition, Spark supports two restricted types of shared variables that can be used in functions running on the cluster:
- Broadcast variables — read-only variables shared across all workers
- Accumulators — variables that workers can only “add” to using an associative operation
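A single-machine analogy for the two shared-variable types (names and data are hypothetical; real Spark has its own broadcast and accumulator APIs): the broadcast value is read-only in every task, and tasks contribute to the accumulator only by adding with an associative operation.

```scala
// Broadcast analogy: a read-only lookup table every task can consult;
// in Spark it would be shipped once to each worker.
val broadcastStopWords = Set("the", "a", "of")

// Accumulator analogy: each "task" (one per partition) adds locally,
// and the partial sums are combined with an associative operation.
val partitions = Seq(
  Seq("the", "spark", "paper"),
  Seq("a", "summary", "of", "rdds")
)
val stopWordCount = partitions
  .map(part => part.count(w => broadcastStopWords.contains(w)))
  .sum
```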
Example Spark Code
```scala
// Count the lines containing "ERROR" in a log file stored in HDFS.
val file  = spark.textFile("hdfs://...")     // RDD backed by an HDFS file
val errs  = file.filter(_.contains("ERROR")) // keep only error lines
val ones  = errs.map(_ => 1)                 // map each line to the value 1
val count = ones.reduce(_ + _)               // sum the ones in parallel
```