Credit Scoring Explained: Challenges & Pitfalls 1
This blog is the second chapter of The Data Scientist’s Handbook. After introducing the credit scoring problem in the first chapter, let’s dive right into one of the problems we might come across when dealing with credit scoring data sets.
One challenge you may encounter while working on a credit scoring problem is imbalanced data. In classification problems, data is imbalanced when the number of observations in one class notably outnumbers the other(s). That skew in the class distribution poses a problem because most machine learning algorithms implicitly assume that the classes are represented roughly equally.
Imbalanced data tends to produce a model with poor predictive performance on the minority class because there are too few training examples for that class. In the credit scoring use-case, the class of interest is usually the rejected class since it bears more risk, i.e., accepting someone who should be rejected is more costly than rejecting someone who should be accepted. If the class of interest is indeed the minority by a significant margin, data imbalance must be addressed.
Detecting the problem is the first step to solving it, and using the correct evaluation metrics for your model is an easy way to achieve that. As we mentioned in the first chapter of the series, the problem is modelled as a binary classification problem, meaning we have only 2 classes.
If the data is 90% accepted loans and 10% rejected, a model that labels everything as accepted would achieve 90% accuracy. We can agree that this is not a good model, and that accuracy is not a good metric in that case. Precision and recall are generally better metrics for problems that center around imbalanced data, such as fraud detection, rare disease diagnosis and churn prediction.
Fig 1: Precision and recall. Selected elements are the model predictions and relevant elements are all the items in our class of interest.
Once we label the rejected class as the “positive class”, i.e., the “relevant elements” we’re interested in predicting correctly, we can calculate both precision and recall to get a good indication of the model’s performance.
As shown in Fig 1, precision is the number of correct positive predictions (true positives) divided by all positive predictions made by the model (true positives + false positives), while recall is the number of correct positive predictions (true positives) divided by all the items in the positive class (true positives + false negatives).
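To make the definitions concrete, here is a minimal sketch in plain Python that reuses the 90%/10% split from earlier. The labels, the "always accept" baseline and the second model's predictions are all made up for illustration, and returning 0.0 when a denominator is zero is just one common convention.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall); 0.0 is used when a denominator is zero."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical data: 1 = rejected (positive class), 0 = accepted.
y_true = [1] * 10 + [0] * 90

# Baseline that accepts everyone: high accuracy, useless on the rejected class.
always_accept = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, always_accept)) / len(y_true)
baseline_precision, baseline_recall = precision_recall(y_true, always_accept)

# Hypothetical model: catches 6 of the 10 rejections and raises 4 false alarms.
model_pred = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 86
model_precision, model_recall = precision_recall(y_true, model_pred)

print(f"baseline: accuracy={accuracy:.2f}, recall={baseline_recall:.2f}")
print(f"model:    precision={model_precision:.2f}, recall={model_recall:.2f}")
```

The baseline reaches 90% accuracy while its recall on the rejected class is zero, which is exactly the failure that accuracy alone hides.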
Precision is prioritized when false positives are more costly than false negatives. For example, in spam classification we don’t really care if a spam email finds its way to our inbox (false negative), but we do care that important emails don’t end up in spam (false positive).
Recall is prioritized when we’re more tolerant of false positives but less tolerant of false negatives. For example, using the famous cancer classification problem, falsely labelling a benign tumor as cancerous (false positive) has less severe consequences than falsely labelling a cancerous tumor as benign (false negative).
With credit scoring, it’s up to the risk assessment of the loan providers to decide what they care more about: falsely rejecting applicants who would’ve paid back their loans or falsely accepting applicants who fail to pay. If both are equally important, then using the F1-score, which combines precision and recall into a single number, is the way to go.
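As a quick illustration, the F1-score is the harmonic mean of precision and recall. The sketch below uses made-up precision and recall values to show that a model cannot reach a high F1 by excelling at only one of the two.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values: both pairs have the same arithmetic mean (0.6).
balanced = f1_score(0.6, 0.6)  # precision and recall agree
lopsided = f1_score(1.0, 0.2)  # perfect precision, poor recall

print(f"balanced F1: {balanced:.2f}")
print(f"lopsided F1: {lopsided:.2f}")
```

The lopsided case scores much lower than the balanced one, because the harmonic mean is pulled toward the weaker of the two metrics.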
Another way to overcome data imbalance is to, well, balance it. The two ways to achieve that are random under-sampling and random over-sampling of the data set.
under-sampling: removing instances from the majority class.
over-sampling: adding duplicates of the minority class.
Choosing between the two methods depends on the class distributions. Under-sampling is constrained by the number of examples in the minority class. If there are too few examples in the minority class and we balance the data by under-sampling the majority class, we would end up with a very small data set that might not be sufficient for training the model, i.e., the model would under-fit due to information loss. Over-sampling, on the other hand, just adds duplicates of existing records to the data set, so no new information enters the model, and the repeated copies may lead to over-fitting. Resampling is clearly not foolproof, but it’s usually a good approach to start with.
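Both resampling strategies can be sketched with nothing but Python's standard library. The function names and the toy records below are hypothetical, and the sketch assumes the majority class really is the larger one.

```python
import random


def random_undersample(majority, minority, seed=0):
    """Drop majority rows at random until both classes have the same size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)


def random_oversample(majority, minority, seed=0):
    """Duplicate minority rows at random (with replacement) to match the majority."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return list(majority), list(minority) + extra


# Hypothetical records: 90 accepted loans, 10 rejected loans.
accepted = [("accepted", i) for i in range(90)]
rejected = [("rejected", i) for i in range(10)]

under_maj, under_min = random_undersample(accepted, rejected)
over_maj, over_min = random_oversample(accepted, rejected)

print(len(under_maj), len(under_min))  # balanced, but only 20 rows remain
print(len(over_maj), len(over_min))    # balanced, but 80 rows are duplicates
```

Under-sampling leaves 20 of the original 100 rows, while over-sampling reaches 180 rows without adding a single new record, which mirrors the under-fitting and over-fitting risks described above.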
Synthetic Minority Over-sampling Technique (SMOTE) is a more advanced method of over-sampling. Unlike the naïve method, which merely duplicates data, SMOTE synthesizes new data points. It works as follows. For some data points in the minority class:
Identify the feature vector and its nearest neighbor
Take the difference between the two
Multiply the difference with a random number between 0 and 1
Identify a new point on the line segment between the two by adding the scaled difference to the feature vector
Repeat the process until the desired number of data points is created
Fig 2: SMOTE process
SMOTE makes the model less prone to over-fitting on the minority class, but only if there is enough variety in the minority class examples to generate meaningful new data points.
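The SMOTE steps listed above can be sketched in a few lines of plain Python. This is a simplified illustration, not a reference implementation: the function name, the toy points and the choice of k = 3 are assumptions, and the neighbor is picked at random among the k nearest minority points, as in the original SMOTE formulation.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a minority point and find its k nearest minority neighbors.
        idx = rng.randrange(len(minority))
        point = minority[idx]
        others = minority[:idx] + minority[idx + 1:]
        neighbors = sorted(others, key=lambda p: math.dist(point, p))[:k]
        neighbor = rng.choice(neighbors)
        # Scale the difference by a random number in [0, 1) ...
        gap = rng.random()
        # ... and add it to the original point: point + gap * (neighbor - point).
        new_point = tuple(a + gap * (b - a) for a, b in zip(point, neighbor))
        synthetic.append(new_point)
    return synthetic


# Hypothetical minority-class feature vectors (two features each).
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_sketch(minority, n_new=6)

for p in new_points:
    print(tuple(round(v, 2) for v in p))
```

Every synthetic point lies on a segment between two existing minority points, so if those points are few or nearly identical, the new points add little variety.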
Sometimes choosing the right model may spare you from addressing issues with the data. That’s why it’s useful to know the underlying structure of machine learning models: it helps in situations like these. With regard to imbalanced data, simple decision trees don’t inherently perform well unless the distribution of the feature space meets some criteria [1]. Weighted/cost-sensitive decision trees, on the other hand, have proven to be quite robust because of the added class weight parameter, which makes errors on the minority class count more during training. A few good models to get started with are Decision Tree Classifier and Random Forest, which both support the class_weight parameter. And of course, there is the ever-popular XGBoost, which uses scale_pos_weight instead of class_weight for the same purpose.
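To give a feel for what such weights look like, the sketch below computes them by hand for the 90%/10% example. The helper names are hypothetical. The first formula, n_samples / (n_classes * class_count), is the heuristic scikit-learn applies for class_weight="balanced", and the second, negatives divided by positives, is the typical starting value the XGBoost documentation suggests for scale_pos_weight.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def negative_to_positive_ratio(labels, positive=1):
    """Number of negative examples divided by number of positive examples."""
    n_pos = sum(1 for y in labels if y == positive)
    return (len(labels) - n_pos) / n_pos


# Hypothetical labels: 1 = rejected (minority), 0 = accepted (majority).
labels = [1] * 10 + [0] * 90

weights = balanced_class_weights(labels)
ratio = negative_to_positive_ratio(labels)

print(weights)  # the rejected class gets the larger weight
print(ratio)
```

A dictionary like weights can be passed as class_weight to scikit-learn's tree-based classifiers, and a value like ratio can be used as a starting point for scale_pos_weight. Both are starting points to tune, not final answers.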
To wrap up, in this blog we’ve seen how the nature of credit scoring data sets may present one challenge (of many) that data scientists have to deal with, namely, data imbalance. We then went through a handful of methods that help in addressing that challenge: choosing the right evaluation metrics, resampling, SMOTE and cost-sensitive models. There are of course a lot more pitfalls to watch out for in the data, so make sure you stay tuned for the rest of the series to find out more!
About the author:
Nourhan is a friendly neighborhood ML/DevOps Engineer, avid writer and skateboarding enthusiast.