2013

7 posts

How Apache Spark Works (Short Summary)

A concise overview of why Apache Spark was created, how RDDs enable in-memory processing for iterative and interactive workloads, and what its key programming abstractions are.

8 Aug 2013 · 2 min read

Big Data · YARN

YARN vs Mesos

A comparison of YARN and Mesos as cluster resource managers, including their architectural differences, recent developments, and insights from the Google Omega paper.

3 Aug 2013 · 2 min read

Data Engineering · Big Data

Handle Schema Changes and Evolution in Hadoop

Approaches for handling schema evolution in Hadoop using Avro and ORC file formats, including a practical workflow for managing schema changes with Hive.

30 Mar 2013 · 2 min read

Java · Data Engineering

Chain Mapper Example

How to use the ChainMapper class in Hadoop to call multiple mappers in sequence, with a working example and key points about configuration and type compatibility.

16 Feb 2013 · 3 min read

Data Engineering · Big Data

Merging Small Files in Hadoop

The small files problem in Hadoop and five approaches to solve it: HDFSConcat, IdentityMapper/Reducer, FileUtil.copyMerge, Hadoop File Crush, and Hive concatenate.

26 Jan 2013 · 3 min read

HBase

How HBase Major Compaction Works

Understanding HBase major compaction -- how it differs from minor compaction, the configuration properties that control it, and the three methods that trigger it.

2 Jan 2013 · 2 min read

HBase

How HBase Minor Compaction Works

Understanding HBase minor compaction -- how files are selected for compaction using the ratio algorithm, with a worked example showing the selection logic.

2 Jan 2013 · 2 min read