
Apache Oozie Essentials

Published book covering workflow scheduling and coordination for Hadoop ecosystems. A deep dive into building production data pipelines.

Big Data · Hadoop · Writing

Background

In 2015, Apache Oozie was the primary workflow scheduler for Hadoop ecosystems, but documentation was scattered and incomplete. Engineers building data pipelines on HDFS, Hive, and Pig spent more time debugging XML workflow definitions than writing business logic. There was no single resource that covered Oozie from fundamentals through production operations.

The Book

Apache Oozie Essentials (Packt Publishing) fills that gap. It covers:

  • Workflow fundamentals — actions, control flow, fork/join patterns, and parameterisation
  • Coordinator jobs — time- and data-triggered scheduling for recurring pipelines
  • Bundle jobs — managing sets of coordinators as a single deployable unit
  • Integration — Hive, Pig, Spark, Sqoop, and shell actions with real-world examples
  • Production operations — monitoring, SLA alerting, retry strategies, and troubleshooting

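To give a flavour of the fork/join pattern the fundamentals chapters cover: an Oozie workflow is an XML graph of actions connected by control nodes. The sketch below (names like `daily-etl` and `clean_clicks.hql` are illustrative, not from the book) forks two Hive actions to run in parallel, then joins before finishing:

```xml
<workflow-app name="daily-etl" xmlns="uri:oozie:workflow:0.5">
  <start to="fork-clean"/>

  <!-- Fork: both cleaning actions run in parallel -->
  <fork name="fork-clean">
    <path start="clean-clicks"/>
    <path start="clean-orders"/>
  </fork>

  <action name="clean-clicks">
    <hive xmlns="uri:oozie:hive-action:0.5">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>clean_clicks.hql</script>
    </hive>
    <ok to="join-clean"/>
    <error to="fail"/>
  </action>

  <action name="clean-orders">
    <hive xmlns="uri:oozie:hive-action:0.5">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>clean_orders.hql</script>
    </hive>
    <ok to="join-clean"/>
    <error to="fail"/>
  </action>

  <!-- Join: waits for all forked paths before continuing -->
  <join name="join-clean" to="end"/>

  <kill name="fail">
    <message>Failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

`${jobTracker}` and `${nameNode}` are resolved from a properties file at submission time, which is what the parameterisation material in the book is about.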
I wrote the book while working with large-scale Hadoop deployments at Telstra and Nokia, drawing directly on production patterns and failure modes.
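A typical production pattern from those deployments: a coordinator that runs the workflow daily, but only once the day's input data has landed (signalled by a `_SUCCESS` done-flag). A minimal sketch, with hypothetical dataset and path names:

```xml
<coordinator-app name="daily-etl-coord" frequency="${coord:days(1)}"
                 start="2015-01-01T00:00Z" end="2016-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <!-- One partition of click data is expected per day -->
    <dataset name="clicks" frequency="${coord:days(1)}"
             initial-instance="2015-01-01T00:00Z" timezone="UTC">
      <uri-template>${nameNode}/data/clicks/${YEAR}/${MONTH}/${DAY}</uri-template>
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <!-- The action is held until today's partition exists -->
    <data-in name="clicks-in" dataset="clicks">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${nameNode}/apps/daily-etl</app-path>
    </workflow>
  </action>
</coordinator-app>
```

Combining the time trigger (`frequency`) with the data trigger (`input-events`) is what keeps a pipeline from running against a half-written partition.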

Legacy

While the Hadoop ecosystem has largely been superseded by cloud-native solutions (Databricks, Airflow, dbt), the book represents an earlier chapter of my platform engineering career — building reliable data pipelines at scale, long before “data engineering” was a common job title.