Apache Oozie Essentials
Published book covering workflow scheduling and coordination for Hadoop ecosystems. A deep dive into building production data pipelines.
Background
In 2015, Apache Oozie was the primary workflow scheduler for Hadoop ecosystems, but documentation was scattered and incomplete. Engineers building data pipelines on HDFS, Hive, and Pig spent more time debugging XML workflow definitions than writing business logic. There was no single resource that covered Oozie from fundamentals through production operations.
The Book
Apache Oozie Essentials (Packt Publishing) fills that gap. It covers:
- Workflow fundamentals — actions, control flow, fork/join patterns, and parameterisation
- Coordinator jobs — time- and data-triggered scheduling for recurring pipelines
- Bundle jobs — managing sets of coordinators as a single deployable unit
- Integration — Hive, Pig, Spark, Sqoop, and shell actions with real-world examples
- Production operations — monitoring, SLA alerting, retry strategies, and troubleshooting
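The fork/join pattern from the fundamentals chapters is the workhorse for running independent actions in parallel. As a rough illustration (not an excerpt from the book — the action names, scripts, and properties here are hypothetical), a workflow that runs a Hive and a Pig action concurrently looks like this:

```xml
<workflow-app name="etl-demo" xmlns="uri:oozie:workflow:0.5">
    <start to="fork-node"/>

    <!-- Fork: both paths below run in parallel -->
    <fork name="fork-node">
        <path start="hive-node"/>
        <path start="pig-node"/>
    </fork>

    <action name="hive-node">
        <hive xmlns="uri:oozie:hive-action:0.5">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- hypothetical script name -->
            <script>aggregate.hql</script>
        </hive>
        <ok to="join-node"/>
        <error to="fail"/>
    </action>

    <action name="pig-node">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>cleanse.pig</script>
        </pig>
        <ok to="join-node"/>
        <error to="fail"/>
    </action>

    <!-- Join: waits for every forked path before continuing -->
    <join name="join-node" to="end"/>

    <kill name="fail">
        <message>Pipeline failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Parameters such as `${jobTracker}` and `${nameNode}` are resolved from a `job.properties` file at submission time, which is what the parameterisation chapters cover.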
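Coordinators add the scheduling layer on top of a workflow: a job can fire on a time frequency, on data availability, or both. A minimal sketch of a daily coordinator gated on an input dataset (again illustrative — the dataset path, dates, and app path are made up):

```xml
<coordinator-app name="daily-etl" frequency="${coord:days(1)}"
                 start="2015-01-01T00:00Z" end="2016-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <!-- One logical dataset instance per day, keyed by date in the URI -->
        <dataset name="logs" frequency="${coord:days(1)}"
                 initial-instance="2015-01-01T00:00Z" timezone="UTC">
            <uri-template>${nameNode}/data/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <!-- The materialized action waits until today's instance exists -->
        <data-in name="input" dataset="logs">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${nameNode}/apps/etl-demo</app-path>
        </workflow>
    </action>
</coordinator-app>
```

Bundle jobs then group several such coordinators so a whole pipeline family can be deployed, suspended, or killed as one unit.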
The book was written while I was working with large-scale Hadoop deployments at Telstra and Nokia, drawing directly from production patterns and failure modes.
Legacy
While the Hadoop ecosystem has largely been superseded by cloud-native solutions (Databricks, Airflow, dbt), the book represents an earlier chapter of my platform engineering career — building reliable data pipelines at scale, long before “data engineering” was a common job title.