Production Implementation of Machine Learning Models
How we built an end-to-end pipeline using Spark and H2O Sparkling Water to get machine learning models from source control into production with minimal friction.
Working in a large organisation presents a challenge: how do you actually run your code in production so it creates meaningful business value? Machine learning models and analytics often sit in source control (e.g., Git) for a long time before they are running and helping customers. In the old workflow, data scientists and analysts would build something, say an R model, and then have no straightforward way to test it in the live environment where customers actually are. This post describes the end-to-end framework we built to close that gap and push machine learning models into production.
We built a machine learning pipeline using Spark and H2O Sparkling Water, which provides clean, modular APIs for everything from data munging to training to scoring — all in a single codebase.
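As a rough sketch of what that single codebase looks like, here is a minimal Sparkling Water training job in Scala. The file path, column names, and parameter values are hypothetical, not our actual code; the point is that Spark handles the data munging and H2O handles the training in one program:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.h2o.H2OContext
import hex.tree.gbm.GBM
import hex.tree.gbm.GBMModel.GBMParameters

object TrainModel {
  def main(args: Array[String]): Unit = {
    // Standard Spark session; data munging happens in Spark DataFrames
    val spark = SparkSession.builder().appName("churn-model").getOrCreate()
    val hc = H2OContext.getOrCreate(spark)

    // Hypothetical input path and schema
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/churn/training.csv")

    // Hand the munged DataFrame to H2O for training
    val trainFrame = hc.asH2OFrame(df, "train")

    val params = new GBMParameters()
    params._train = trainFrame._key
    params._response_column = "churned" // hypothetical label column
    params._ntrees = 50

    // Train a gradient boosting model; the resulting binary model
    // is what gets packaged into the artefact for scoring
    val model = new GBM(params).trainModel().get()
    println(model._output)
  }
}
```

The same codebase can load the trained model and score new data, so there is no hand-off between a "research" implementation and a "production" one.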
Spark fundamentally changed the big data world: work that previously required a different tool for each task, such as batch ETL, SQL, streaming, and machine learning, is now unified in a single Swiss Army knife.
Our new machine learning pipeline works as follows:
- Data scientists and analysts working on a specific use case use Spark + Sparkling Water to create machine learning models.
- They commit their model code to Git.
- Jenkins builds the code and creates jar/RPM artefacts stored in Nexus.
- Deployment is automated via Chef.
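To make the flow concrete, the CI stage can be sketched as a declarative Jenkins pipeline. This is an illustrative configuration, not our actual setup; the build commands, stage names, and Nexus/Chef details are assumptions:

```groovy
// Hypothetical Jenkinsfile: build the model code, package it, publish to Nexus
pipeline {
    agent any
    stages {
        stage('Build') {
            steps { sh 'mvn clean package' } // compile the model code into a jar
        }
        stage('Package') {
            steps { sh 'mvn rpm:rpm' }      // wrap the jar in an RPM
        }
        stage('Publish') {
            steps { sh 'mvn deploy' }       // upload jar + RPM artefacts to Nexus
        }
    }
}
// Chef then converges the scoring hosts, pulling the latest RPM from Nexus
```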
The total time from creating a model to running it in production is now as short as the training time plus about five minutes.
Data scientists can push any new model to production without going through heavy bureaucracy, and they can A/B test new ideas. All they have to do is commit new code and follow the standard peer code review process.
An organisation that can quickly try out new things can fail quickly, learn quickly, and innovate quickly. This is the kind of culture we were trying to create — a data-driven culture of experimentation.
If you enjoyed this post, you may also be interested in my book Apache Oozie Essentials, a use-case-driven Oozie implementation guide with examples and exercises to take your big data learning to the next level.