How Apache Spark Works (Short Summary)
A concise overview of why Apache Spark was created, how RDDs enable in-memory processing for iterative and interactive workloads, and its key programming abstractions.
7 posts
A comparison of YARN and Mesos as cluster resource managers, including their architectural differences, recent developments, and insights from the Google Omega paper.
Approaches for handling schema evolution in Hadoop using Avro and ORC file formats, including a practical workflow for managing schema changes with Hive.
How to use the ChainMapper class in Hadoop to call multiple mappers in sequence, with a working example and key points about configuration and type compatibility.
The small files problem in Hadoop and five approaches to solve it: HDFSConcat, IdentityMapper/Reducer, FileUtil.copyMerge, Hadoop File Crush, and Hive concatenate.
Understanding HBase minor compaction: how files are selected for compaction using the ratio algorithm, with a worked example showing the selection logic.
Understanding HBase major compaction: how it differs from minor compaction, the configuration properties that control it, and the three methods that trigger it.