Credit Scoring Explained - Challenges & Pitfalls II
Welcome back to another entry in The Data Scientist’s Handbook! In the previous article, we talked about one of the issues a data scientist might encounter in credit scoring problems, namely imbalanced data. In this article, we’ll continue on the same note, discussing how to overcome common challenges and avoid pitfalls. Keep in mind that these issues are not exclusive to credit scoring problems, but it’s always good to have an application in mind.
The first couple of articles highlighted how much data quality dictates how good our model will end up being. Imbalanced data is only one of many data quality issues that affect a model’s performance, and some of them take more effort to spot and fix than others. Here’s a non-exhaustive list of common issues to watch out for in your data set.
1- Outliers
Data points that deviate far beyond the distribution of the other data points are called ‘outliers’, for example an age of 127 or a monthly salary of 2 billion. Outliers are challenging because there are many types of them, which may (or may not) affect your model in different ways. Outliers can be univariate or multivariate, depending on the dimensionality of the feature space being considered. They can also be categorized as point outliers, contextual outliers or collective outliers. Let’s take the example of weather temperature to illustrate the difference:
Point outliers: A temperature of 50°C in Finland or a temperature of -50°C in Egypt.
Contextual outliers: A temperature of 35°C in the winter. Note that 35°C might not be a global outlier, but given the “context” i.e., winter, 35°C is considered an outlier.
Collective outliers: A run of temperatures below 0°C in Egypt. Taken together, these readings may indicate a new phenomenon, i.e., colder winters.
Outliers can be detected with visual methods such as box plots or scatter plots, or with numerical methods such as the Z-score and the IQR score. You can find each of those explained in depth here.
To drop or not to drop, that is the question.
Case 1: When the outliers clearly result from measurement or data entry errors, like an age of 500 or a salary of -2000, the verdict is to drop.
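As a rough illustration of the two numerical methods mentioned above, here is a minimal sketch using only Python’s standard library. The sample ages and the thresholds (3 standard deviations, 1.5 times the IQR) are assumptions chosen for illustration; they are common conventions, not values taken from any particular data set.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mean) / std > threshold]

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up applicant ages, including the suspicious 127 from the text
ages = [23, 35, 41, 29, 52, 38, 47, 31, 26, 44, 33, 127]

print(zscore_outliers(ages))
print(iqr_outliers(ages))
```

Note that the Z-score relies on the mean and standard deviation, which are themselves pulled by extreme values, so on small samples the IQR rule is often the more robust of the two.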
Fig. 1: Outlier on the same line, doesn’t affect the regression line.
Fig. 2: Outlier pulls regression line towards it.
Consider the regression line (or the decision boundary, for a classification problem) in Fig. 1. The outlier at the edge does not affect the final output. To test that, you can remove the outlier(s) and check that the result remains the same. In this case, leaving the outlier in should be okay.
In Fig. 2, the model’s output is clearly affected by the outlier. If this is case 1 (the outlier is a data entry or measurement error), then drop the outlier. Otherwise, you may need to investigate why the outlier occurred and whether it changes any of the assumptions you made about the data. Outliers might convey significant information, which is why it is good practice to investigate them first. Sometimes data transformations can reduce the effect of outliers, as can using outlier-robust models.
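The “remove the outlier and compare” test described above can be sketched in a few lines. The points below are made up to mimic the two figures: one extreme point that lies on the same line as the rest of the data, and one that does not. The helper `fit_line` is a plain ordinary least squares fit written for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up points lying exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

base = fit_line(xs, ys)                      # fit without any outlier
on_line = fit_line(xs + [20], ys + [41])     # Fig. 1: extreme point on the same line
off_line = fit_line(xs + [20], ys + [5])     # Fig. 2: extreme point far off the line

print("without outlier:", base)
print("outlier on line:", on_line)
print("outlier off line:", off_line)
```

In the first case the slope and intercept stay the same with or without the extreme point, while in the second case a single point is enough to change the fitted line substantially.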
2- Free-text features:
One very common feature of real-world data is free-text columns. These are not limited to comment or opinion sections; they sometimes include addresses, regions and job titles as well, which are features that may be key to the loan officer’s decision. Humans are smart enough to read beyond the variation caused by typos and inconsistent capitalization to know that “Enginer”, “engIneer” and “Ingineer” all just mean Engineer. Our model might not be quite there yet, but with a bit of preprocessing, it can reach satisfactory results.
With the help of Natural Language Processing models, we can cluster similar words together and use the cluster label as a feature. Not only that, but NLP also allows us to group by semantics, so “Doctor” and “orthopedic surgeon” can end up in the same cluster.
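As a simplified illustration of the preprocessing idea, the sketch below maps noisy job titles onto a small canonical vocabulary using plain string similarity from Python’s `difflib`. This is not an NLP model: it only handles typos and capitalization, and it will not group titles by meaning. The vocabulary `CANONICAL_TITLES`, the helper `normalize_title` and the cutoff of 0.8 are assumptions made for this example.

```python
import difflib

# Hypothetical canonical vocabulary; in practice this would come from the data or a domain expert
CANONICAL_TITLES = ["engineer", "doctor", "teacher", "accountant"]

def normalize_title(raw, vocabulary=CANONICAL_TITLES, cutoff=0.8):
    """Map a free-text job title to its closest canonical form, if one is similar enough."""
    cleaned = raw.strip().lower()
    matches = difflib.get_close_matches(cleaned, vocabulary, n=1, cutoff=cutoff)
    # Fall back to the cleaned text when nothing in the vocabulary is close enough
    return matches[0] if matches else cleaned

for title in ["Enginer", "engIneer", "Ingineer", "orthopedic surgeon"]:
    print(title, "->", normalize_title(title))
```

Grouping “orthopedic surgeon” with “Doctor” requires semantic similarity (for example, clustering on word or sentence embeddings), which string matching alone cannot provide.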
3- Output Inconsistency
Another thing that may degrade the quality of our training data is output inconsistency given similar input features. This can be due to the loan officer’s bias, a data entry error, or other tie-breaking features that were removed in the data wrangling process. Output inconsistency affects data quality because these examples are useless to the model; it doesn’t learn anything new from them. In effect, the model sees the same input paired with opposite labels.
Nothing can be inferred from these examples, as they cancel each other out. If one class outnumbers the other among the duplicated inputs, the model will be biased towards the more frequent class when it sees the same input again. Before deciding whether to drop or keep these duplicated rows, the entries must be closely analyzed to determine whether the inconsistency is just stochasticity or whether other untracked or dropped features play a hand in the final decision.
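Finding such rows is a matter of grouping by the input features and checking whether more than one label appears. Here is a minimal sketch; the column names (`income`, `job`, `approved`) and the records are made up for illustration.

```python
from collections import Counter, defaultdict

def find_conflicts(rows, feature_keys, label_key):
    """Return each feature combination that appears with more than one label, with label counts."""
    labels_by_input = defaultdict(Counter)
    for row in rows:
        key = tuple(row[k] for k in feature_keys)
        labels_by_input[key][row[label_key]] += 1
    return {
        key: dict(counts)
        for key, counts in labels_by_input.items()
        if len(counts) > 1
    }

# Made-up loan applications; the first three share the same inputs but not the same outcome
applications = [
    {"income": 5000, "job": "engineer", "approved": 1},
    {"income": 5000, "job": "engineer", "approved": 0},
    {"income": 5000, "job": "engineer", "approved": 1},
    {"income": 3000, "job": "teacher", "approved": 0},
]

conflicts = find_conflicts(applications, ["income", "job"], "approved")
print(conflicts)
```

The label counts per input also show which class the model would lean towards if the rows were kept as they are.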
4- Target leakage
Now imagine spending days and weeks cleaning the data, adding engineered features, building the model and tuning its parameters, and then you finally get to see the result of your iterative efforts. You run the model on the validation set and, to your surprise, you get near-perfect performance. The model’s accuracy is beyond anything that was achieved before. Did you really crack it?
A true data scientist puts suspicion before celebration. Now how could this have happened? It’s almost as if the model is copying the output from another feature. You inspect the dataframe and find a column np_app_off, which indicates how many of the loan officers who reviewed the application approved it. That information should not be included in the training data, as it won’t be available at the time of prediction; the model is meant to replace the manual reviewing process. By including this feature, the model was able to deduce that an application with np_app_off>1 is always accepted. Another example of a feature that would cause target leakage is loan_status, which indicates the payment status of the loan. It can only exist after the loan is accepted, hence leaking the information that any record with a loan_status has been approved.
Features that only become available after the time of prediction should not be included in the training data. Leakage features can often be detected in the data exploration phase by analyzing statistical correlations with the target variable. If a feature has an unusually high correlation with the target, it should be investigated more closely. The above example also demonstrates how important domain knowledge is in understanding why a feature should or should not be excluded from the training data.
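The correlation screen described above can be sketched as follows. The feature values are made up, `monthly_income` is a hypothetical second column added for contrast, and the threshold of 0.8 is an arbitrary assumption; what counts as “suspiciously high” depends on the problem.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (std_x * std_y)

def flag_suspicious(features, target, threshold=0.8):
    """Return the features whose absolute correlation with the target reaches the threshold."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return {name: r for name, r in scores.items() if abs(r) >= threshold}

# Made-up data: 1 = approved, 0 = rejected
approved = [1, 0, 1, 1, 0, 0, 1, 0]
features = {
    "np_app_off": [3, 0, 2, 3, 0, 1, 2, 0],
    "monthly_income": [5200, 4100, 3900, 6100, 4800, 5500, 4300, 3600],
}

flagged = flag_suspicious(features, approved)
print(flagged)
```

A flagged feature is a prompt for investigation rather than proof of leakage, and Pearson correlation only captures linear relationships, so domain knowledge remains the final check.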
Coupling explainability methods with domain knowledge is a promising way of understanding your model’s behavior and gaining some insight into the black box. Model explainability techniques make it possible to understand how much impact a feature has on the model’s prediction. Domain knowledge tells you which features should have the most impact on the model’s decision, and explainability verifies whether that is what’s actually happening inside the model. This information can guide you to iterate on data transformations and try out different representations of features that should be considered important but are not according to the model, or to drop features that should not be considered important but are (e.g., target leakage).
Some models are already easily interpretable, like linear regression (explainable by its coefficients) and decision tree models (explainable by feature importance based on the splitting criterion). Thanks to more advanced explainability techniques like SHAP, LIME and Permutation Importance, we can also gain insight into more complex models. Christoph Molnar’s Interpretable Machine Learning is a great source that goes in depth into a lot of explainability methods.
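To give a feel for how Permutation Importance works, here is a minimal from-scratch sketch: shuffle one feature column at a time and measure how much the model’s accuracy drops. The `toy_model` below is a hypothetical stand-in for a trained model, and the data is made up; libraries such as scikit-learn ship a full implementation that would normally be used instead.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, n_features, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled independently."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

def toy_model(row):
    """Hypothetical model: approves when income (feature 0) is at least 4500, ignores feature 1."""
    return int(row[0] >= 4500)

# Made-up rows of (monthly income, has_car); labels come from the toy model itself
rows = [(5200, 1), (4100, 0), (3900, 1), (6100, 0),
        (4800, 1), (5500, 0), (4300, 1), (3600, 0)]
labels = [toy_model(r) for r in rows]

importances = permutation_importance(toy_model, rows, labels, n_features=2)
print(importances)
```

Shuffling the income column breaks the link between input and label and lowers accuracy, whereas shuffling the ignored second feature changes nothing, so its importance is zero.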
Digression - Biased A.I.
One valuable insight we can gain from model explainability is a sense of how much social bias has found its way into our model. Loan officers’ opinions on credit risk should be purely objective, but there is no guarantee that their decisions are free of bias. With model explainability, we can see whether demographic features such as gender, race and ethnicity influence our model’s output too much. One might ask why we don’t just drop those features altogether and build a demographically agnostic model, but it may not be that simple.
If the data set was already extremely biased and you create an unbiased model, it would be difficult for the model to reach a satisfactory performance on that data, because the model’s output would differ greatly from the true labels. There is also no simple way of knowing whether a rejected loan would have been repaid had it been approved, and hence no way to guarantee that a model has promising results despite the disparity between label and model output. This is one of the prominent topics in the ethics of A.I.: should our models reinforce preexisting social biases, or should they transcend and evolve beyond them?
In this article, we’ve covered a handful of issues a data scientist might find in the data, such as outliers, free-text features, output inconsistency and target leakage. We also covered detection and handling methods, as well as a brief introduction to model explainability and why it’s useful from a data scientist’s point of view.
Throughout this series, we’ve seen that there are a lot of challenges that a data scientist might (and will) face when dealing with real-world data sets. It is worth emphasizing the importance of domain knowledge, whether it comes from the data scientist or from an expert on the project. A data scientist might know 10 different ways to address a certain issue, but a data scientist equipped with domain knowledge would know which one to choose.