An Introduction to Data Drift
The world is currently dominated by artificial intelligence (AI). More businesses are adopting machine learning and now understand the importance of automating the machine learning pipeline, as mentioned in our previous article. After a model makes it to production, it remains at risk of losing relevance over time, and its performance is likely to eventually decline. One cause of declining model performance is a mismatch between the distribution of the training data and that of the production/serving data. This change is known as data drift, or covariate shift. There are other types of drift, such as concept drift, where the relationship between the model inputs and the target variable changes, but for now let’s focus on data drift and the effect of the model’s input data distribution shifting over time.
To simplify the concept of data drift, the main example used throughout this article is a credit scoring model (Model A) for loan applicants. The model predicts whether an applicant is likely to pay off a loan on time. The primary features that affect the final decision (acceptance/rejection) are age, salary, job title, applicant’s region, application date, and number of dependents. The model was originally designed for a financial provider to assess loan applicants who recently graduated college in Cairo. Applicants apply for a loan via the financial provider’s regularly updated app. The financial provider cannot validate a prediction until the loan lifetime ends, or until they’ve monitored the applicant’s payment behaviour over a period of time.
What causes data drift?
There is always a time gap between the collection of training data and the collection of serving data; thus, feature composition may change over time. For instance, Model A was originally trained on an age group ranging from 22 to 25. The financial provider then launched a new marketing campaign targeting an older age group, and the age distribution shifted. Besides external factors and seasonal changes, poor data quality is another culprit behind data drift. Consider a case where a new app update or a faulty data engineering pipeline changes the format of the application date, swapping the day and month. Without proper validation in place, the model will receive month values ranging from 1 to 31 instead of 1 to 12.
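The swapped day/month issue above is exactly the kind of fault a lightweight validation check can catch before bad values reach the model. Below is a minimal sketch of such a check; the function name and the sample values are hypothetical, and a production system would raise an alert rather than just return the offending rows.

```python
def validate_months(months):
    """Return the indices of out-of-range month values.

    A calendar month must lie in 1-12; values of 13-31 strongly
    suggest an upstream pipeline swapped the day and month fields.
    """
    return [i for i, m in enumerate(months) if not 1 <= m <= 12]

# Hypothetical serving batch: 25 and 31 cannot be months,
# so day and month were likely swapped for those applications.
serving_months = [3, 25, 7, 31]
bad = validate_months(serving_months)  # indices of suspect rows
```

A check like this would sit in the serving pipeline, in front of the model, so that malformed requests are flagged instead of silently scored.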
The cost of data drift
Ideally, we could monitor model performance by measuring the error rate of the model predictions against the ground truth. However, the ground truth for new predictions is not always available, and may take time to obtain.
Data drift should be detected as early as possible; otherwise, business decisions driven by model predictions may have an undesirable impact. Due to data drift, a trustworthy applicant might be rejected based on an outdated model score, or customer churn might rise after the wrong products are repeatedly recommended. In the case of Model A, monitoring the data showed an influx of loan applications from customers outside Cairo. This change in the region distribution may require a new model targeted specifically at this segment of customers, and it should be addressed quickly, before the new applicants lose interest. If data drift is monitored regularly, the domain expert or data scientist can recognize alarming changes in the data distribution without waiting for the ground truth of each prediction.
How do we detect data drift?
We begin by automating the process that identifies data drift. The distribution of the serving data, i.e., data received by the model in production, should be regularly compared to the training data distribution as a baseline.
Data drift can be detected using model-based methods, sequential analysis methods, and distribution-based methods. Below are some of these techniques:
Sequential analysis methods like the Drift Detection Method (DDM) detect a notable increase in the learning algorithm’s error rate within a period of time; such an increase suggests that the underlying distribution has changed and the current model needs to be updated.
Building a machine learning model to detect drift and scheduling it to run periodically, flagging any change in the data distribution.
Population Stability Index (PSI) compares the distribution of a prediction probability (or of a single feature) between the serving data set and the training data set.
Kullback–Leibler (KL) divergence computes a score measuring how much one probability distribution diverges from another. It can be applied to the training data distribution and the serving data distribution; if the KL divergence exceeds a certain threshold, the data may be drifting.
Jensen–Shannon divergence is a symmetrised and smoothed version of KL divergence; a threshold is set above which drift is indicated.
The Kolmogorov–Smirnov (KS) test is a non-parametric, distribution-free test; hence, it does not assume any underlying distribution. The KS test can be used to compare a sample with a reference probability distribution (one-sample KS test), or to compare two samples (two-sample KS test).
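To make the last three techniques concrete, here is a minimal sketch using NumPy and SciPy that compares Model A’s training-age distribution against a post-campaign serving distribution. The synthetic samples, bin count, and thresholds (PSI above 0.2, p-value below 0.05) are illustrative rules of thumb, not fixed standards.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_age = rng.normal(23.5, 1.0, 5000)  # recent graduates, roughly 22-25
serve_age = rng.normal(30.0, 4.0, 5000)  # older applicants after the campaign

# Two-sample KS test: distribution-free comparison of the raw samples.
ks_stat, p_value = ks_2samp(train_age, serve_age)

# Bin both samples on the training range to get discrete distributions.
bins = np.histogram_bin_edges(train_age, bins=10)
train_hist, _ = np.histogram(train_age, bins=bins)
serve_hist, _ = np.histogram(serve_age, bins=bins)
train_p = (train_hist + 1e-6) / (train_hist + 1e-6).sum()  # smooth empty bins
serve_p = (serve_hist + 1e-6) / (serve_hist + 1e-6).sum()

# PSI: sum over bins of (serve% - train%) * ln(serve% / train%).
psi = np.sum((serve_p - train_p) * np.log(serve_p / train_p))

# Jensen-Shannon distance with base 2 is bounded in [0, 1].
js = jensenshannon(train_p, serve_p, base=2)

drifted = psi > 0.2 or p_value < 0.05  # common rule-of-thumb thresholds
```

With a shift this pronounced, all three signals agree; in practice they are computed per feature on a schedule, and the thresholds are tuned on historical data.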
Not all drifts should be treated equally; primary features with higher importance should be given higher drift weight. If a primary feature that your model heavily depends on drifts, that would require more attention than a secondary feature drifting. Additionally, drift thresholds should be tested to avoid false drift alerts.
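One simple way to encode this weighting is to combine per-feature drift scores using the model’s feature importances, so that drift on a primary feature dominates the alerting signal. The PSI values, importances, and 0.1 threshold below are hypothetical.

```python
def weighted_drift_score(drift_scores, importances):
    """Combine per-feature drift scores into a single alerting signal,
    weighting each feature by its importance to the model."""
    total = sum(importances.values())
    return sum(drift_scores[f] * importances[f] / total for f in drift_scores)

# Hypothetical per-feature PSI values and importances for Model A.
psi_scores = {"age": 0.30, "salary": 0.05, "region": 0.25, "dependents": 0.02}
importance = {"age": 0.40, "salary": 0.35, "region": 0.15, "dependents": 0.10}

score = weighted_drift_score(psi_scores, importance)
alert = score > 0.1  # threshold should be tuned to avoid false alerts
```

Here the heavily weighted age feature drives most of the score, so its drift triggers an alert even though salary and dependents are stable.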
How should we handle data drift?
If the data drift is not significant, there is no need to retrain the model. If it is significant, several approaches can be considered.
In the case of Model A, a domain expert in loans might be consulted. Perhaps the applicants’ credit score in other institutions could be evaluated and taken as a feature. This change can only be detected by a domain expert and can later be handled either by adding coefficients to existing features or by rebuilding the model;
Instead of running the model on all applicants regardless of their age groups and regions, another model can be built for underperforming applicants from different age groups or regions;
The initial features might be insufficient on their own, and we might need to collect more data. In Model A’s example, a proxy between the salary and applicants’ requested loan amount might be added as a new feature;
The data scientist or the ML engineer might need to retrain or rebuild the model. If labelled serving data is available, i.e., the ground truth has been provided, it can be used to create a new, relevant training set;
Business logic and manual corrections might be applied to accommodate market constraints and new marketing campaigns.
Who should monitor data drift?
Poor data quality impacts the AI model and can cause revenue loss, as elaborated in one of our previous articles. Data scientists, ML engineers, business users, data engineers, and everyone contributing to data collection should therefore be aware of how that data is utilised, in order to ensure the data pipeline’s integrity. It is essential for these users to have an automated platform where they can visualise data drift and monitor its effect. With a data drift monitoring system in place, domain experts and business users can spot sudden trends or changes in customer base behaviour.
Konan: ML Deployments Made Easy!
As we’ve seen, data drift monitoring is a crucial component of the ML life-cycle and shouldn’t be an afterthought when deploying a model to production. Konan, our MLOps platform, incorporates out-of-the-box data drift monitoring for your deployed models by monitoring incoming prediction requests and firing alerts on features that exhibit data drift. You can also see exactly what caused the drift, e.g., a new column, a new value, a type change, or a high distance between training and serving data. Based on the exhibited data drift, you can choose to retrain your model on Konan and add the serving data to create your new training set.
Like what you hear? Head over to Konan’s page on our website to know more or sign up for a free trial.