How Apache Spark Works (Short Summary)
A concise overview of why Apache Spark was created, how RDDs enable in-memory processing for iterative and interactive workloads, and its key programming abstractions.
Why Spark Was Made
Before Spark existed, MapReduce was the standard processing engine on top of Hadoop.
See the original Apache Spark research paper.
The typical flow for MapReduce is:
- Read data from disk
- Apply some processing
- Dump intermediary data on disk
- Do some more processing
- Show final results
Now suppose the same processing has to be run repeatedly, changing some variable on each pass over the entire dataset. MapReduce starts over from disk every time: run the processing 100 times and it performs 100 full reads from disk, plus the extra disk writes and reads for intermediary data on each run.
Spark was built for exactly these use cases, where the same processing is applied repeatedly to a dataset with varying inputs.
Two Typical Use Cases Where Spark Shines
Iterative Jobs
Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty.
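The iterative pattern can be sketched in plain Scala (Spark's host language). This is a single-machine illustration with made-up data, not Spark API code: the dataset stays in memory across iterations while gradient descent tunes a single parameter. In Spark, the dataset would be a cached RDD, so each iteration reuses memory instead of re-reading from disk.

```scala
// Gradient descent over an in-memory dataset: fit w to minimize
// sum((x - w)^2), whose optimum is the mean of the data.
val data = Seq(1.0, 2.0, 3.0, 4.0) // loaded once, kept in memory
val lr   = 0.05                    // learning rate

var w = 0.0
for (_ <- 1 to 200) {
  // one pass over the in-memory data per iteration; no disk reads
  val grad = data.map(x => 2 * (w - x)).sum
  w -= lr * grad
}
// w converges to the mean of the data, 2.5
```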
Interactive Analysis
Hadoop is often used to perform ad-hoc exploratory queries on big datasets, through SQL interfaces such as Pig and Hive. Ideally, a user would be able to load a dataset of interest into memory across a number of machines and query it repeatedly.
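The interactive pattern can be sketched the same way with plain Scala collections (data and names are hypothetical): load the dataset once, then run several ad-hoc queries against the in-memory copy. With Spark, the in-memory copy would be a cached RDD partitioned across machines.

```scala
// Load once into memory, then query repeatedly without touching disk again.
val logs = Seq(
  "INFO start", "ERROR disk full", "WARN slow", "ERROR timeout"
) // imagine this was read from HDFS once and cached

// Each "query" is just another operation on the in-memory collection.
val errorCount = logs.count(_.startsWith("ERROR"))
val warnCount  = logs.count(_.startsWith("WARN"))
val timeouts   = logs.filter(_.contains("timeout"))
</test>
```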
RDDs: The Core Abstraction
To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs).
An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition.
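A toy single-machine sketch of lineage (all names hypothetical): a derived partition remembers the parent data and the transformation it came from, so a lost partition can be recomputed rather than restored from a replica.

```scala
// Lineage sketch: a derived partition = (parent partition, transformation).
// If the derived partition is lost, re-apply the transformation to the
// surviving parent partition to rebuild exactly that piece.
val parentPartition = Seq(1, 2, 3, 4, 5)
val transform: Seq[Int] => Seq[Int] = _.filter(_ % 2 == 1).map(_ * 10)

var derivedPartition = transform(parentPartition) // Seq(10, 30, 50)
derivedPartition = null                           // simulate losing the partition
derivedPartition = transform(parentPartition)     // rebuild from lineage
```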
Programming Model
Spark provides two main abstractions for parallel programming:
- Resilient distributed datasets
- Parallel operations on these datasets (invoked by passing a function to apply on a dataset)
These are based on typical functional programming concepts of map, flatMap, filter, etc.
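These operations behave like their counterparts on ordinary Scala collections, which is a handy way to reason about them. A local sketch (not cluster code):

```scala
val lines = Seq("to be", "or not", "to be")

val words   = lines.flatMap(_.split(" ")) // flatten lines into words
val lengths = words.map(_.length)         // transform each element
val shorts  = words.filter(_.length == 2) // keep matching elements
```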
In addition, Spark supports two restricted types of shared variables that can be used in functions running on the cluster:
- Broadcast variables — read-only variables shared across all workers
- Accumulators — variables that workers can only “add” to using an associative operation
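A single-machine analogy for the two shared-variable types (names and data are hypothetical; real Spark has its own broadcast and accumulator APIs): the broadcast value is read-only in every task, and tasks contribute to the accumulator only by adding with an associative operation.

```scala
// Broadcast analogy: a read-only lookup table every task can consult;
// in Spark it would be shipped once to each worker.
val broadcastStopWords = Set("the", "a", "of")

// Accumulator analogy: each "task" (one per partition) adds locally,
// and the partial sums are combined with an associative operation.
val partitions = Seq(
  Seq("the", "spark", "paper"),
  Seq("a", "summary", "of", "rdds")
)
val stopWordCount = partitions
  .map(part => part.count(w => broadcastStopWords.contains(w)))
  .sum
```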
Example Spark Code
```scala
// Count the lines containing "ERROR" in a log file stored in HDFS.
val file  = spark.textFile("hdfs://...")     // RDD backed by an HDFS file
val errs  = file.filter(_.contains("ERROR")) // keep only error lines
val ones  = errs.map(_ => 1)                 // map each line to the value 1
val count = ones.reduce(_ + _)               // sum the ones in parallel
```