Hadoop 2.3 Centralized Cache Feature Comparison to Spark RDD

A comparison of the new HDFS centralized cache management feature in Hadoop 2.3 with Spark RDDs, and why Spark still held the edge for in-memory processing.

28 February 2014

Hadoop 2.3 introduced two notable features:

  • Heterogeneous Storage Hierarchy in HDFS (HDFS-2832)
  • In-memory Cache for HDFS data via DataNodes (HDFS-4949)

This post focuses on the centralized cache management feature in HDFS.

How It Works

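The feature lets you pin a particular HDFS path (a file or a directory) in DataNode memory ahead of time, for example before your job starts. You do this by adding a cache directive to a cache pool; the NameNode then instructs the DataNodes that hold the relevant blocks to keep them in off-heap memory. Applications like Hive and Impala can then read that data directly from memory. Before this, the main read optimization was Short Circuit Reads (SCR), which lets SCR-aware applications read block files directly from local disk, bypassing the DataNode.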

Sample command:

hdfs cacheadmin -addDirective -path <path> -pool <pool-name> [-force] [-replication <replication>] [-ttl <time-to-live>]
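To make the optional flags concrete, here is a minimal Python sketch that assembles the command line above from its parts. The helper name add_directive_command, the path /warehouse/sales, and the pool name hive-pool are hypothetical and only for illustration; the flags themselves are the ones from the syntax line. The sketch only builds and prints the command, it does not run it.

```python
import shlex
from typing import List, Optional


def add_directive_command(
    path: str,
    pool: str,
    force: bool = False,
    replication: Optional[int] = None,
    ttl: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `hdfs cacheadmin -addDirective`."""
    # -path and -pool are the two required arguments in the syntax above.
    cmd = ["hdfs", "cacheadmin", "-addDirective", "-path", path, "-pool", pool]
    # The remaining flags are optional, so add them only when requested.
    if force:
        cmd.append("-force")
    if replication is not None:
        cmd += ["-replication", str(replication)]
    if ttl is not None:
        cmd += ["-ttl", ttl]
    return cmd


# Hypothetical path and pool name, for illustration only.
example = add_directive_command(
    "/warehouse/sales", "hive-pool", replication=2, ttl="1d"
)
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in example))
```

On a machine with the Hadoop client installed, the returned list could be passed to subprocess.run(cmd, check=True). The pool has to exist before a directive can be added to it; it is created with hdfs cacheadmin -addPool <pool-name>.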

Comparison with Spark RDDs

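Compared with this, Spark’s RDD model is still superior. An RDD records the lineage of the transformations that produced it, and intermediate results can themselves be kept in memory. This means Spark can write intermediate data to RAM instead of disk, and rebuild a lost in-memory partition from its lineage, which makes multi-stage jobs faster.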
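The lineage idea can be illustrated with a toy model in plain Python. This is not Spark's API or implementation: the class ToyDataset and its methods are hypothetical names, and it works on a single in-process list rather than distributed partitions. It only shows the principle that a derived dataset remembers how it was produced, so an in-memory copy can be dropped and recomputed on demand.

```python
class ToyDataset:
    """Toy stand-in for an RDD: remembers its parent and the function applied."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source   # base data, set only on the root dataset
        self._parent = parent   # lineage: the dataset this one was derived from
        self._fn = fn           # lineage: the per-element transformation
        self._cached = None     # optional in-memory copy of the result

    def map(self, fn):
        # Nothing is computed here; we only record the lineage.
        return ToyDataset(parent=self, fn=fn)

    def persist(self):
        # Materialize the result and keep it in memory.
        self._cached = self._compute()
        return self

    def evict(self):
        # Simulate losing the in-memory copy.
        self._cached = None

    def _compute(self):
        if self._parent is None:
            return list(self._source)
        return [self._fn(x) for x in self._parent.collect()]

    def collect(self):
        # Serve from memory when possible, otherwise replay the lineage.
        if self._cached is not None:
            return list(self._cached)
        return self._compute()


base = ToyDataset(source=[1, 2, 3])
squared = base.map(lambda x: x * x).persist()  # intermediate result held in memory
total = squared.map(lambda x: x + 1)

before = total.collect()   # reads the in-memory copy of `squared`
squared.evict()            # the cached intermediate data is lost
after = total.collect()    # `squared` is recomputed from `base` via its lineage
print(before, after)
```

Both calls to collect() return the same values; the second one simply pays the cost of recomputing the intermediate step.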

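The HDFS cache management feature only boosts read performance for data already stored in HDFS; it does nothing for the intermediate data written between processing stages. Hadoop would need further improvements before it could match Spark in overall processing speed.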

I am excited to see how downstream systems like Pig, Hive, and Impala will leverage this feature to process data faster. Things will keep getting better in the Hadoop ecosystem over the coming releases.