Machine Learning · Infrastructure · Data Engineering

Machine learning for large organisations

A practical approach to building an organisational machine learning pipeline that supports multiple tools, PMML-based model deployment, CI/CD practices, and A/B testing.

30 May 2015

While doing machine learning and analytics in large organisations, we often have to cater to many different objectives to fulfil business needs. This post shares my learnings from working with large-scale organisations that have diverse, often siloed teams using similar technologies but with varying business goals.

The design of this organisational machine learning pipeline is based on the following goals:

Let users work with the tools they prefer
Provide a unified way of executing the models they create
Enable quick A/B testing of models
Follow continuous integration principles

I will cover both open-source and licensed tools that can be used to build this kind of environment.

At a high level, a typical data science flow is:

Analyse the data
Experiment and create a machine learning model
Export the model in PMML format
Execute the PMML model to generate scores in batch or real time

Two groups are involved: those who create PMML files and those who execute (operate) them. The roles overlap in one place: the people who build the models and produce the PMML artefacts are also responsible for running the validation pipelines (see Validation Pipelines below).

Analysis and Modelling

Analysts and data scientists use various tools (R, Spark, scikit-learn, and others), accessed via web front-ends or native clients. Most users prefer these three, so supporting them covers roughly 90% of the team.

PMML Export

Most modelling tools support export to PMML. The end artefact for the analytics team is the PMML file. See the references below for tooling details. The only tricky scenario is when your model is not supported by PMML, but that is rare. Most of the time, people work with regressions and other well-supported model types.
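To show what the exported artefact actually contains, here is a minimal sketch of a PMML regression file and how its coefficients can be read back. The model (y = 1.5 + 2.0 * x), the field names, and the helper functions are hypothetical, chosen only for illustration. The hand-rolled scoring is there to make the file's contents tangible and is not a substitute for a real scoring engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical PMML 4.2 document for y = 1.5 + 2.0 * x (illustrative values only).
PMML_TEXT = """
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="Illustrative linear regression"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="demo" functionName="regression">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
""".strip()

NS = {"pmml": "http://www.dmg.org/PMML-4_2"}


def read_regression(pmml_text):
    """Return (intercept, {field: coefficient}) from a simple regression PMML."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(pmml_text, row):
    """Evaluate the linear model by hand for one input row (a dict)."""
    intercept, coefficients = read_regression(pmml_text)
    return intercept + sum(coef * row[name] for name, coef in coefficients.items())


if __name__ == "__main__":
    print(read_regression(PMML_TEXT))
    print(score(PMML_TEXT, {"x": 2.0}))
```

The point of the format is visible here: everything needed to score a record (field definitions, intercept, coefficients) lives in the file, so the team that runs the model does not need the tool that trained it.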

Continuous Integration and A/B Testing

Each PMML file is committed to a Git repository, and the build for that repository produces a deployable artefact (an RPM). For each A/B test, the team pushes a new release of its models. Each team has its own project, so it can make changes and deploy on its own release cycle, with separate configurations for production and development environments.
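One check a CI build could run before packaging is a basic sanity test on every committed PMML file. The sketch below is an assumption about what such a gate might look like, not a description of any particular build setup; the function name check_pmml and the specific rules are hypothetical.

```python
import xml.etree.ElementTree as ET

# Top-level PMML children that are not models; anything else is treated as a model.
NON_MODEL_ELEMENTS = {
    "Header",
    "MiningBuildTask",
    "DataDictionary",
    "TransformationDictionary",
    "Extension",
}


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on namespaced tags."""
    return tag.rsplit("}", 1)[-1]


def check_pmml(text):
    """Return a list of problems found in a PMML document; empty means it passes."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    problems = []
    if local_name(root.tag) != "PMML":
        problems.append("root element is not PMML")
    if not root.get("version"):
        problems.append("missing version attribute")

    children = [local_name(child.tag) for child in root]
    if "DataDictionary" not in children:
        problems.append("missing DataDictionary")
    if not any(name not in NON_MODEL_ELEMENTS for name in children):
        problems.append("no model element found")
    return problems


if __name__ == "__main__":
    sample = (
        '<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">'
        "<Header/><DataDictionary/><RegressionModel/></PMML>"
    )
    print(check_pmml(sample))
```

A build script could call this on each file in the repository and fail the build when the returned list is not empty, so a broken export never reaches the RPM.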

Scoring Engine

The final step is executing the PMML file. At the time of writing, Spark could only produce PMML files, not run them. In the open-source world, JPMML is the scoring engine that runs PMML files. Its REST API lets teams trigger models on demand.
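To make the on-demand trigger concrete, here is a minimal sketch of how a client might build a scoring request. The base URL, the /model/<id> path, and the JSON body with "id" and "arguments" fields are assumptions modelled on an Openscoring-style REST service, and the function name build_scoring_request is hypothetical. Check your scoring engine's documentation for the real contract. The request is only constructed here, not sent.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the scoring service; replace with your own deployment.
BASE_URL = "http://localhost:8080/openscoring"


def build_scoring_request(model_id, record_id, arguments, base_url=BASE_URL):
    """Build (but do not send) a POST request that asks the engine to score one record."""
    url = f"{base_url}/model/{urllib.parse.quote(model_id, safe='')}"
    body = json.dumps({"id": record_id, "arguments": arguments}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_scoring_request(
        "iris-flower-naive-bayes",
        "record-001",
        {
            "Sepal.Length": 5.1,
            "Sepal.Width": 3.5,
            "Petal.Length": 1.4,
            "Petal.Width": 0.2,
        },
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # Sending it would be: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the client easy to test without a running scoring service.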

Teams also want to bring results back into their own systems, and this is where Spring XD comes into the picture. It allows anyone to write a few lines of configuration and handle the entire end-to-end workflow.

Sample stream fragment for running PMML in Spring XD (SomeSINK stands for whichever sink the team chooses):

analytic-pmml
    --location=/models/iris-flower-naive-bayes.pmml.xml
    --inputFieldMapping='sepalLength:Sepal.Length,sepalWidth:Sepal.Width,petalLength:Petal.Length,petalWidth:Petal.Width'
    --outputFieldMapping='Predicted_Species:predictedSpecies' | SomeSINK

Many teams want to push results to a database, others want them in HDFS, and others want them in RabbitMQ. All they change is the sink location and Spring XD does the rest.

With REST API integration into business unit applications, the modelling and data science pipeline becomes self-service. Business teams are empowered to build machine learning models, deploy them with CI practices, and run them whenever they want. The time to A/B test is minimal, and you can fail fast.

Validation Pipelines

Validation pipelines run on a small sample of the same dataset, using whichever tool the model's author chose. For example: generate a model using scikit-learn, export it as PMML, score the sample using scikit-learn, and compare the output with what the real scoring engine produces.
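The comparison step can be sketched as follows, assuming both sides produce a numeric score per record keyed by a record id. The function name compare_scores, the dictionary layout, and the tolerance values are hypothetical choices for illustration.

```python
import math


def compare_scores(reference, engine, rel_tol=1e-6, abs_tol=1e-9):
    """Return (record_id, reference_score, engine_score) for every disagreement.

    `reference` holds scores from the modelling tool, `engine` holds scores
    from the scoring engine. A record missing on either side is a mismatch.
    """
    mismatches = []
    for record_id in sorted(reference.keys() | engine.keys()):
        ref = reference.get(record_id)
        eng = engine.get(record_id)
        if ref is None or eng is None:
            mismatches.append((record_id, ref, eng))
        elif not math.isclose(ref, eng, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((record_id, ref, eng))
    return mismatches


if __name__ == "__main__":
    tool_scores = {"a": 0.25, "b": 0.75, "c": 0.10}
    engine_scores = {"a": 0.25, "b": 0.7501}
    for mismatch in compare_scores(tool_scores, engine_scores):
        print(mismatch)
```

A tolerance is used instead of exact equality because the modelling tool and the scoring engine may differ in floating-point details even when the exported model is correct. An empty result means the PMML export reproduces the original model on the sample.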

References


Disclosure: Ideas and analysis are my own. AI assisted with drafting and editing.