Note: This feature is part of the Custom Model Operator once you upload a Custom Model.

For the past few years, the data community has focused on one of the most important tasks in data science: building great ML models. As the industry evolves, new challenges arise, and attention is shifting toward putting ML models into production. Companies now want to use their models in actual business applications, getting business value out of them via data workflows. But this brings many challenges to solve, for example model drift.

What’s Drift?

When a machine learning model is deployed into production, the main concern of data scientists is the model's relevance over time. Is the model still capturing the pattern of new incoming data, and is it still performing as well as it did during its design phase?

Let's go over a simple example. Say a company that owns several fast-food restaurants builds a model to predict inventory: french fries, lettuce, and so on. The team is happy with the model's metrics and decides to put it into production. It works well for a few days, but then COVID-19 unexpectedly hits. The distribution of the incoming data changes compared to the training data. This is called data drift.

Monitoring model drift is a crucial step in production ML; in practice, however, it proves challenging for many reasons, one of which is the delay in retrieving labels for new data. Without ground-truth labels, drift detection techniques based on the model's accuracy are off the table.
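Without labels, you can still compare the distribution of incoming features against the training distribution. The sketch below uses a two-sample Kolmogorov-Smirnov test from SciPy; the synthetic data, variable names, and significance threshold are illustrative assumptions, not a prescribed setup.

```python
# Sketch: label-free data drift detection by comparing a feature's
# training distribution against recent production data with a
# two-sample Kolmogorov-Smirnov test. Data and threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=100, scale=10, size=5000)  # distribution at design time
prod_feature = rng.normal(loc=130, scale=15, size=1000)   # incoming data after a shift

stat, p_value = ks_2samp(train_feature, prod_feature)
DRIFT_P_THRESHOLD = 0.01  # illustrative significance level

if p_value < DRIFT_P_THRESHOLD:
    print(f"Drift detected (KS statistic={stat:.3f}, p={p_value:.3g})")
else:
    print("No significant drift detected")
```

Because the test only needs the raw feature values, it works even when ground-truth labels arrive with a long delay.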

There's plenty of documentation on how to reduce or eliminate drift, but few real tools or step-by-step examples that address it. The reason is that solving the problem requires building end-to-end workflows, and building them at scale is one of the biggest challenges in data science. I want to emphasize that putting ML into production is not just exposing a model through an API endpoint. It is an end-to-end workflow that runs from ingesting data all the way to sending the output to a specific business application.

Let's start with the main ways an end-to-end workflow can mitigate drift:

  1. Periodically Re-Fit

A good first-level intervention against drift is to periodically re-fit your model on more recent historical data.

An end-to-end pipeline solves this by ingesting new data and feeding it to the algorithm. For example, you might re-fit the model each week or each month with the data collected from the data source.

You can also make the re-fit conditional: set a drift check, and if the model is drifting, trigger a re-train.
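One common label-free drift score that can serve as the condition is the Population Stability Index (PSI). Below is a minimal sketch of a "check drift, then re-train" step; the PSI implementation, the 0.2 threshold (a common rule of thumb), and the print standing in for a real re-train job are all illustrative assumptions.

```python
# Sketch of a conditional re-train step: compute a drift score (PSI)
# between reference data and incoming data, and trigger a re-fit
# when it crosses a threshold. Data and threshold are illustrative.
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a new sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions to avoid division by zero and log(0).
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

PSI_THRESHOLD = 0.2  # rule of thumb: above 0.2 suggests significant drift

rng = np.random.default_rng(0)
reference = rng.normal(100, 10, 5000)  # data the model was designed on
incoming = rng.normal(125, 12, 2000)   # this week's production data

score = psi(reference, incoming)
if score > PSI_THRESHOLD:
    # In a real pipeline this branch would kick off the re-fit job.
    print(f"PSI={score:.2f} -> trigger model re-train")
else:
    print(f"PSI={score:.2f} -> model still healthy")
```

In a real workflow, the branch that prints here would launch the training pipeline on the freshly ingested data.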

  2. Weight Data

Some algorithms allow you to weight the importance of input data, so that recent observations count more than older ones when fitting.
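For instance, many scikit-learn estimators accept a `sample_weight` argument in `fit()`. The sketch below down-weights older samples with an exponential decay; the synthetic two-regime data and the half-life value are illustrative assumptions.

```python
# Sketch: weighting recent data more heavily when fitting, so the model
# tracks the current regime after drift. Decay scheme is illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 200
X = rng.normal(size=(n, 1))
# Old regime (first 150 samples): slope 2. Recent regime: slope 3.
y = np.where(np.arange(n) < 150, 2.0 * X[:, 0], 3.0 * X[:, 0])
y = y + rng.normal(0, 0.2, n)

# Exponential decay: the newest samples get the highest weight.
half_life = 30.0
age = (n - 1) - np.arange(n)  # 0 for the newest sample
weights = 0.5 ** (age / half_life)

unweighted = LinearRegression().fit(X, y)
weighted = LinearRegression().fit(X, y, sample_weight=weights)
print(f"unweighted slope: {unweighted.coef_[0]:.2f}")  # pulled toward the old regime
print(f"weighted slope:   {weighted.coef_[0]:.2f}")    # closer to the recent regime
```

The weighted fit recovers a slope near the recent regime's, while the unweighted fit stays anchored to the stale majority of the data.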