Credit Scoring Explained - The Problem Statement
Updated: May 18
Credit scoring is one of the more interesting real-world financial applications a data scientist might encounter beyond classroom-level problems such as "iris classification". This post is the first chapter in The Data Scientist's Handbook.
So what is credit scoring? According to Investopedia:
Credit scoring is a statistical analysis performed by lenders and financial institutions to assess a person's creditworthiness. Credit scoring is used by lenders to help decide on whether to extend or deny credit.
In plain terms, when a client requests a loan or access to credit, the lender needs to know how risky it is to lend that client money. That risk is expressed as a credit score. Usually, credit bureaus have a set of criteria for computing the score, assigning weights to different financial features such as the client's payment history, debts and length of credit history. In an unbanked population, this score can be difficult to compute since most of the required features are missing or hard to obtain. One way loan providers overcome this is by asking clients for a set of financial documents that indicate their ability to pay back the loan. Sometimes field visits are also conducted to view and inquire about assets owned, living standards, reputation at work, etc.
The final decision on whether to approve the loan request is made by the loan officer after assessing the client’s application and other gathered information. This process is evidently labor intensive and inefficient, and it scales poorly with the volume of incoming applications. The final decision is also prone to the loan officer’s bias, which poses quite a challenge in modelling the problem.
Framing credit scoring as a machine learning problem
Framing credit scoring as a supervised machine learning problem addresses many of the limitations of the manual process. To simulate the decision issued by the lending party, the most straightforward approach is to formulate a binary classification problem whose target variable labels the client’s status: approved for the requested loan or rejected. In essence, we would like our model to step in for the application reviewer(s), so we would feed it the same data that a loan officer processes. Which brings us to the crux of any data-driven technique: the data.
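To make the binary classification framing concrete, here is a minimal sketch in plain Python. The post does not commit to a particular model, so logistic regression trained with batch gradient descent is used purely as an illustrative assumption. The two features (debt_to_income, on_time_payment_rate), the six hand-made rows and every function name are hypothetical and not taken from any real dataset.

```python
import math

# Hypothetical, hand-made training set. Each row is
# (debt_to_income, on_time_payment_rate), both already scaled to [0, 1].
# Labels mimic the loan officer's decision: 1 = approved, 0 = rejected.
FEATURES = [
    (0.1, 0.9), (0.2, 1.0), (0.3, 0.8),  # approved applications
    (0.7, 0.4), (0.8, 0.5), (0.9, 0.2),  # rejected applications
]
LABELS = [1, 1, 1, 0, 0, 0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def approval_probability(weights, bias, x):
    """Estimated probability that the application would be approved."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and a bias with batch gradient descent on the mean log loss."""
    n_samples = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = approval_probability(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n_samples
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict_approval(weights, bias, x, threshold=0.5):
    """Turn the probability into the binary target: 1 = approve, 0 = reject."""
    return 1 if approval_probability(weights, bias, x) >= threshold else 0


weights, bias = train_logistic_regression(FEATURES, LABELS)
training_predictions = [predict_approval(weights, bias, x) for x in FEATURES]
```

A new application would be scored by calling predict_approval(weights, bias, x) on its feature tuple. The 0.5 threshold is also an assumption; a lender would normally tune it to reflect how costly a wrongly approved loan is compared with a wrongly rejected one.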
One of the main input data sources that would fuel the model is the client’s application combined with gathered information from further inquiries by the loan officer. This data source might include:
Personal information such as name, gender, age, address, job title, marital status and home ownership.
Financial features such as income, rent amount, credit card limits, insurances and existing mortgages.
It is worth noting that if the lending party wants to further minimize the labor cost of the process, i.e., reduce field visits, it’s best not to include features gathered from those visits in the input data. If the applicant is a returning client, i.e., previously applied for a loan, another data source that can complement the aforementioned data is historical loan payment data. This might include:
Previous loan request status: indicates whether the client was previously accepted or rejected
Previous loan limit: indicates the loan amount requested in an earlier application
Loan payments: the amount paid back on previously funded loans
Tardiness in payments: paying on time, late or early
And this is only the tip of the iceberg for possible features. All of the features mentioned above may be extremely useful to the model; raw datasets, however, normally aren’t.
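As an illustration of how a returning client's loan history could be collapsed into a few model features, consider the sketch below. The PastLoan schema, the field names and the chosen aggregates are all assumptions made for this example. In particular, it assumes a funded loan was granted for the full requested amount, which a real lender's data may not guarantee.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PastLoan:
    """One earlier loan request by a returning client (hypothetical schema)."""
    approved: bool            # previous loan request status
    amount_requested: float   # previous loan limit
    amount_repaid: float      # total paid back (0 if the loan was never funded)
    days_late: List[int]      # per installment: >0 late, 0 on time, <0 early


def summarize_history(loans: List[PastLoan]) -> dict:
    """Collapse a client's loan history into a flat set of features.

    Ratios are None when there is nothing to compute them from, e.g. for
    a first-time applicant with no history at all.
    """
    funded = [loan for loan in loans if loan.approved]
    installments = [days for loan in funded for days in loan.days_late]
    total_requested = sum(loan.amount_requested for loan in funded)
    total_repaid = sum(loan.amount_repaid for loan in funded)
    return {
        "n_previous_requests": len(loans),
        "previous_approval_rate": len(funded) / len(loans) if loans else None,
        "repayment_ratio": (
            total_repaid / total_requested if total_requested else None
        ),
        "late_installment_share": (
            sum(1 for days in installments if days > 0) / len(installments)
            if installments else None
        ),
    }


# Made-up history: two funded loans and one rejected request.
history = [
    PastLoan(True, 1000.0, 1000.0, [0, -2, 5, 0]),
    PastLoan(False, 5000.0, 0.0, []),
    PastLoan(True, 2000.0, 1500.0, [0, 10, 3, 0]),
]
history_features = summarize_history(history)
```

Returning None instead of 0 for missing ratios keeps "no history" distinguishable from "bad history", which matters when first-time and returning applicants are scored by the same model.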
Real-world datasets come with their own set of challenges. As the saying goes, “garbage in, garbage out”, so the data needs to be clean. On top of that, a wrong prediction in a financial application can be very costly, a risk that can be mitigated by improving the data’s quality. For those reasons, a significant amount of time is spent analyzing the data, removing irrelevant features, eliminating data leakage, addressing bias, transforming raw inputs into model-readable features and much more. Some of the common challenges and pitfalls encountered in credit scoring data will be covered in dedicated posts.
To wrap up, this post explained the general credit scoring problem and the limitations of the manual process. It then defined a machine learning framework that mimics the lending party’s decision to address those limitations. Finally, it introduced two of the possible data sources that the model can use.
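A few of these cleaning steps can be sketched on a single raw application record. Everything in the example is an assumption made for illustration: the column names, the choice to treat name as an irrelevant identifier, and the approved_amount field, which stands in for any value that is only known after the decision and would therefore leak the target.

```python
from typing import Any, Dict, Optional

# Hypothetical column names; none of them come from a real dataset.
IDENTIFIER_COLUMNS = {"name"}            # identifies a person, not a risk level
LEAKY_COLUMNS = {"approved_amount"}      # only known after the decision is made
NUMERIC_COLUMNS = ("age", "income", "rent_amount")
MARITAL_STATUSES = ("single", "married", "divorced", "widowed")


def to_float(value: Any) -> Optional[float]:
    """Parse a numeric field, returning None for blanks or unparseable text."""
    if value is None:
        return None
    try:
        return float(str(value).replace(",", "").strip())
    except ValueError:
        return None


def clean_application(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Turn one raw application record into model-readable features."""
    # 1. Remove irrelevant and leaky columns.
    dropped = IDENTIFIER_COLUMNS | LEAKY_COLUMNS
    record = {key: value for key, value in raw.items() if key not in dropped}

    features: Dict[str, Any] = {}
    # 2. Parse numeric fields and flag the missing ones explicitly.
    for column in NUMERIC_COLUMNS:
        value = to_float(record.get(column))
        features[column] = value
        features[column + "_missing"] = 1 if value is None else 0

    # 3. One-hot encode a categorical field after normalizing its text.
    status = str(record.get("marital_status") or "").strip().lower()
    for option in MARITAL_STATUSES:
        features["marital_status_" + option] = 1 if status == option else 0
    return features


# Made-up raw record with typical problems: text numbers, blanks, stray spaces.
raw_application = {
    "name": "Jane Doe",
    "age": "35",
    "income": "1,200",
    "rent_amount": "",
    "marital_status": " Married ",
    "approved_amount": 500,
}
cleaned = clean_application(raw_application)
```

Missing values are flagged here and left unfilled; how to impute them (or whether to impute at all) is a separate modelling decision that this sketch deliberately leaves open.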