The good, the bad, and the machine: data quality’s impact on machine learning
Updated: Dec 9, 2021
Artificial intelligence (AI) is being adopted by many businesses to reduce operational costs, with machine and deep learning models being developed to mimic human behaviours and decisions. Although machine learning (ML) and AI are often seen as shortcuts to the future, their success depends more on the data than on the models themselves; this is especially the case in our era of big data and hardware advancements.
Data is what allows modern enterprises to conquer, learn, adapt, and evolve. Within this context, the term ‘quality’ can be defined as the relevance of the data to the business’ model and purpose. However, this definition is slightly generic and too qualitative.
Quantifying the term is a must. ‘Data quality’ is often viewed as a two-dimensional term that deals with (i) data completeness and (ii) data accuracy. However, in the current AI world, the term is defined multidimensionally; accessibility, comparability, consistency, validity, timeliness, and uniqueness (to name a few) are fundamental elements contributing to our understanding of data quality.
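To make the idea of quantifying quality concrete, here is a minimal sketch of how two of these dimensions, completeness and uniqueness, could be scored. The function names, the sample records, and the choice to treat None and empty strings as missing are all assumptions made for illustration, not a standard definition.

```python
def completeness(records, fields):
    """Fraction of cells across the given fields that are not missing."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    # Assumption: None and the empty string both count as missing.
    filled = sum(
        1 for r in records for f in fields if r.get(f) not in (None, "")
    )
    return filled / total


def uniqueness(records, key_fields):
    """Fraction of records that carry a distinct key."""
    if not records:
        return 1.0
    keys = {tuple(r.get(f) for f in key_fields) for r in records}
    return len(keys) / len(records)


# Hypothetical sample data.
records = [
    {"id": 1, "email": "a@example.com", "city": "Cairo"},
    {"id": 2, "email": None, "city": "Giza"},
    {"id": 3, "email": "a@example.com", "city": ""},
    {"id": 4, "email": "d@example.com", "city": "Cairo"},
]

# 10 of the 12 cells are filled.
print(completeness(records, ["id", "email", "city"]))
# 3 distinct email keys over 4 records.
print(uniqueness(records, ["email"]))
```

Scores like these are only a starting point; the right thresholds depend on how the data will be used.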
The data quality gap
You may have come across this quote before:
“Don’t let the data slip through the cracks.”
Today, the capacity of data engineers and data scientists to absorb and analyse big data is lower than what’s required. This is a consequence of our tendency to prioritise data growth without taking the time to manually check that data’s quality, or veracity, the fifth V in the Five V’s of data (Fig. 1). Veracity plays the most vital role in ensuring data quality.
Data collectors and controllers themselves can contribute to poor quality data. This is mainly due to a lack of awareness of how this data will be utilised. Inaccurate measurement tools as well as human error can also negatively impact data quality. According to Harvard Business Review, “only 3% of companies’ data meets the basic quality standards.”
The cost of poor data quality on industry
A number of studies show the impact poor data quality has on revenue loss. A Royal Mail report identified a 6% revenue loss for the organisation as a consequence of poor customer data management. Loss is not restricted to revenue; time as a resource was also negatively impacted. These losses can be avoided if initial data collection and verification processes are carried out correctly.
Research conducted by Gartner placed organisations’ average revenue loss at $15 million per year as a result of poor quality data. Prior to this research being conducted, about 60% of these organisations weren’t even aware of the negative impact poor data management practices had on their businesses.
The impact of data quality on AI
The first step in any ML project is assessing the data’s quality to ensure its authenticity. ML modelling is sensitive to data by its nature; even a small error in the data can lead to large scale errors in the model’s output.
“If your data is bad, your ML tools are useless.” - Thomas Redman
While AI has boomed globally, challenges in maintaining data quality have limited that growth. Research has found that 87% of organisations adopting AI fail to successfully implement machine and deep learning models due to data quality issues.
“Garbage in...Garbage out”
In ML, the challenge of poor data quality presents in stages: a) if poor historical data is used to build predictive models, then b) those models will make decisions on new, unseen data based on what they learned from the poor quality data. As a result, data scientists lose valuable time working through the data in order to identify whether the model’s behaviour is accurate, and whether the new unseen data suffered from the same quality problems during the inference stage.
Even the largest and deepest models can be affected by poor data quality. Contradictions in the data can hinder learning internally, while externally the model can appear to be behaving normally. This is especially challenging with deep learning models, where these problems often go unnoticed until the model is actually deployed.
A dual perspective: the data journey
Let’s take a deeper look at the key aspects relating to data quality that can present themselves in modelling outputs. We’ll approach this from the perspective of big data and ML, both of which affect modelling behaviour and output (Gudivada et al., 2017).
The big data perspective
Missing data can be approached based on its randomness. Deleting the records with missing values can be a solution when the data is missing at random (MAR) or missing completely at random (MCAR). However, data that is missing not at random (MNAR) can induce strong bias, which has negative effects on the statistical power of the observations.
Imputation is mostly used in such cases, with mean imputation for numerical features and mode imputation for categorical features.
Expectation-Maximization and Markov Chain Monte Carlo simulation can also be used to deal with missing data.
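As a rough illustration of the simplest of these options, the sketch below applies mean imputation to a numerical feature and mode imputation to a categorical one. The impute helper and the sample columns are hypothetical, and None is assumed to be the missing-value marker; Expectation-Maximization and MCMC approaches are not shown.

```python
from statistics import mean, mode


def impute(values, strategy):
    """Replace None entries with the mean or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to learn a fill value from; return the column unchanged.
        return list(values)
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [fill if v is None else v for v in values]


# Hypothetical columns with missing entries.
ages = [30, None, 20, 40, None]
cities = ["Cairo", "Giza", None, "Cairo"]

print(impute(ages, "mean"))    # missing ages become the observed mean, 30
print(impute(cities, "mode"))  # missing city becomes the most common one
```

Both strategies shrink the variance of the imputed feature, so it is worth comparing model results with and without imputation before committing to one.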
Data registered by users can be negatively impacted by duplication. One type of duplication is partial duplication. An example could be the same address listed with several users, all belonging to the same household. In this case, duplicates would need to be eliminated, even if some of the features contain non-duplicate information. The model’s logic may require all entries to have unique data, making it very difficult to match the logic with the existing data.
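A minimal sketch of how partial duplicates like the shared-address case might be surfaced for review is shown below. The normalisation rule (lowercasing, dropping commas, collapsing whitespace) and the field names are assumptions for illustration; real address matching usually needs more careful rules.

```python
from collections import defaultdict


def normalise(address):
    """Lowercase, drop commas, and collapse repeated whitespace."""
    return " ".join(address.lower().replace(",", " ").split())


def group_partial_duplicates(records, key):
    """Group records sharing the same normalised value for `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[normalise(record[key])].append(record)
    # Keep only the values that appear on more than one record.
    return {value: rows for value, rows in groups.items() if len(rows) > 1}


# Hypothetical user records; two of them share a household address.
users = [
    {"name": "Mona", "address": "12 Nile St, Cairo"},
    {"name": "Omar", "address": "12 nile st  cairo"},
    {"name": "Sara", "address": "5 Tahrir Sq, Cairo"},
]

dupes = group_partial_duplicates(users, "address")
for address, rows in dupes.items():
    print(address, [r["name"] for r in rows])
```

Grouping first, rather than deleting straight away, leaves room to decide which record in each group to keep.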
Data heterogeneity - also known as data variety - is one of the five V’s of data. Data variety is not inherently considered a good or bad aspect of data, but needs to be investigated in each unique case in order to provide useful information about the variability in the data.
Data currency and consistency
The data’s current state of relevance should be analysed by comparing the data’s history to its current state. In doing so, the validity of the pattern is tested and stale data can be identified and purged.
Data consistency is also a key aspect of data quality. The integrity of the data format must be checked across all records, or against other data sources.
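The two checks above could be sketched as follows, assuming each record carries an "updated" field that is expected to be an ISO date (YYYY-MM-DD). The field name, the one-year staleness threshold, and the issue labels are hypothetical choices for illustration.

```python
import re
from datetime import date, timedelta

# Assumed convention: dates are stored as YYYY-MM-DD strings.
DATE_FORMAT = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def check_record(record, today, max_age_days=365):
    """Return a list of currency and consistency issues for one record."""
    issues = []
    raw = record.get("updated") or ""
    if not DATE_FORMAT.match(raw):
        issues.append("inconsistent date format")
        return issues
    try:
        updated = date.fromisoformat(raw)
    except ValueError:
        # Right shape, but not a real calendar date (e.g. month 13).
        issues.append("invalid date")
        return issues
    if today - updated > timedelta(days=max_age_days):
        issues.append("stale")
    return issues


today = date(2021, 12, 9)
for rec in [
    {"updated": "2021-11-01"},
    {"updated": "2019-01-15"},
    {"updated": "15/01/2021"},
    {"updated": "2021-13-40"},
]:
    print(rec["updated"], check_record(rec, today))
```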
The ML perspective
While the data might be in good form from a big data perspective, that alone does not ensure that the results expected from the modelling will be met. While there are various forms of data quality checks that can be explored prior to modelling, the following points are considered the most essential:
Bias and variance tradeoff
This is a common challenge in modelling. While it is often seen as just a modelling issue, the data’s statistical distribution itself can induce some biases or variances that models cannot handle.
Data sufficiency can be overlooked from the big data perspective, but remains a limiting factor to modelling quality. Techniques like cross-validation and bootstrapping can be adopted to overcome these issues.
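For readers unfamiliar with these two techniques, here is a small sketch of each: a k-fold split that lets every record serve as test data exactly once, and a percentile bootstrap that estimates the uncertainty of a mean from a small sample. The function names, the sample values, and the defaults (1,000 resamples, a 95% interval, a fixed seed) are assumptions for illustration.

```python
import random
from statistics import mean


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n) if i < start or i >= stop]
        yield train, test
        start = stop


def bootstrap_interval(data, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `data`."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        mean(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


for train, test in k_fold_indices(10, 3):
    print("train:", train, "test:", test)

# Hypothetical small sample of measurements.
sample = [2.1, 2.5, 1.9, 3.0, 2.7, 2.2, 2.8, 2.4]
print(bootstrap_interval(sample))
```

In practice the data would normally be shuffled before splitting; the contiguous folds here just keep the sketch easy to follow.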
Outliers are a crucial challenge when building an ML model: some data records lie very far in the hyper-space from the rest of the values in the dataset. Such challenges can be approached by applying outlier detection techniques such as an interquartile range (IQR) check, a Z-score check, or studying the skewness of the independent variables.
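The IQR and Z-score checks can be sketched in a few lines. The helper names and sample values are hypothetical, and the cut-offs used (1.5 times the IQR, a Z-score above 3) are common conventions rather than fixed rules.

```python
from statistics import mean, pstdev, quantiles


def iqr_outliers(values, k=1.5):
    """Values falling more than k * IQR outside the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical feature: twenty ordinary readings and one extreme value.
readings = [10, 11, 12, 13] * 5 + [95]

print(iqr_outliers(readings))
print(zscore_outliers(readings))
```

One caveat: in very small samples a single extreme value inflates the standard deviation so much that the Z-score check may never reach the threshold, so the IQR check tends to be the more robust of the two.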
Dimensionality mainly affects modelling that is dependent on distance measurements. In high-dimensional spaces, some ML models lose their ability to measure meaningful distances between points, as the observations start to appear equidistant from each other.
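This effect can be observed with a small simulation. The sketch below draws uniform random points and measures the relative contrast between the farthest and nearest neighbour of a query point; the function name, the number of points, and the uniform distribution are assumptions chosen for illustration.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of the distances from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


# The contrast typically shrinks as the number of dimensions grows,
# meaning the nearest and farthest points become hard to tell apart.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

When the contrast approaches zero, distance-based methods such as nearest-neighbour search have little signal left to work with, which is one reason dimensionality reduction is often applied first.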
Use cases from a computer vision perspective
A great example of the effect of poor data quality (here, a selection problem) on models can be illustrated by an experiment carried out by the University of Washington. ML researchers sought to create a classifier model that would learn to distinguish between images of huskies and wolves. The results were excellent, with the model displaying over 90% accuracy in identification. The researchers were ecstatic that, despite the two animals’ close resemblance, their model was able to distinguish between the two with such precision.
When running a LIME explainer, a “novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner” (Ribeiro et al., 2016), they discovered that the decision wasn’t actually based on the animals’ features. Instead, it depended mainly on the background. Wolf images usually have snowy backgrounds, while husky images usually have grass backgrounds. As it turned out, the model was actually distinguishing between snow and grass, not huskies and wolves.
Raw data and explanation of a bad model’s prediction in the “Husky vs Wolf ” task.
At Synapse Analytics, we’ve also encountered challenges relating to data quality. As was explored in our previous article on algorithmic biases, our product AzkaVision misclassified veiled women due to inaccurate training data. In another instance, the model encountered challenges in the detection and classification of car logos. We dedicated a lot of time to tuning our model, but the initial test sets all seemed to fail.
At that point, we took a step back and analysed the quality of the training data. We removed unclear images, added more data related to the distribution, and addressed incorrect annotations in order to boost our modelling prediction power.
The results were fascinating: we were approaching the challenge from the wrong angle. The problem was in the initial phases of the pipeline - the data quality - not the model itself. By addressing these issues, our detector and classifier model was able to pass all our test sets and identify car logos with great accuracy and speed.
Karpathy A. “A Recipe for Training Neural Networks”. Available from: http://karpathy.github.io/2019/04/25/recipe/.
Gudivada V, Apon A, and Ding J. “Data Quality Considerations for Big Data and Machine Learning: Going Beyond Data Cleaning and Transformations”. In: International Journal on Advances in Software 10.1 (2017), pp. 1-20. Available from: https://www.researchgate.net/publication/318432363_Data_Quality_Considerations_for_Big_Data_and_Machine_Learning_Going_Beyond_Data_Cleaning_and_Transformations.
Ribeiro MT, Singh S, and Guestrin C. “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier” (2016). Available from: https://arxiv.org/abs/1602.04938.