Credit Default Risk Prediction with Machine Learning

Python · Machine Learning · CatBoost · SHAP

Overview

This project predicts credit card default risk using machine learning models and turns model probabilities into business-oriented credit risk decisions.

The main goal is not only to classify customers as default / non-default, but also to:

  • compare several machine learning models,
  • estimate reliable default probabilities,
  • select a cost-sensitive decision threshold,
  • explain model predictions using SHAP,
  • translate model results into practical credit risk recommendations.

The project is based on the UCI Default of Credit Card Clients dataset.


Business Problem

Credit institutions need to identify customers who are likely to default on their credit card payments.

A classifier that only outputs class labels is not enough for this type of problem. In credit risk, the business also needs:

  • probability estimates, not only class labels;
  • explainability, because financial decisions should be interpretable;
  • a decision threshold that reflects business costs;
  • a way to balance missed defaults and false alarms.

In this project, a false negative means that the model misses a risky customer. This is usually more expensive than a false positive, where a safe customer is incorrectly flagged as risky.


Dataset

The dataset contains information about credit card clients, including:

  • credit limit,
  • demographic variables,
  • repayment status over previous months,
  • bill statement amounts,
  • previous payment amounts,
  • default status for the next month.

The target variable is:

default payment next month

Where:

0 = no default
1 = default

Important note about repayment columns:

The repayment status variables in the original UCI dataset are named PAY_0, PAY_2, PAY_3, PAY_4, PAY_5, and PAY_6; there is no PAY_1 column. PAY_0 is the most recent repayment status, and PAY_2 through PAY_6 cover the preceding months.

The raw dataset should be placed in the data/ folder.

Expected file name:

data/default of credit card clients.xls
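
A minimal loading sketch follows. It assumes the spreadsheet keeps its real column names on the second row (hence `header=1`) and that reading `.xls` files works in your environment (pandas typically needs the `xlrd` package for that). The short target name `DEFAULT` and the helper names are hypothetical and may differ from what the notebook uses.

```python
import pandas as pd

RAW_PATH = "data/default of credit card clients.xls"
TARGET_RAW = "default payment next month"
TARGET = "DEFAULT"  # hypothetical short name for the target column

# Repayment status columns as named in the original file (note: no PAY_1).
REPAYMENT_COLS = ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]


def load_raw(path: str = RAW_PATH) -> pd.DataFrame:
    """Read the raw spreadsheet.

    Assumption: the first row holds generic labels, so the real
    column names are taken from the second row (header=1).
    """
    return pd.read_excel(path, header=1)


def tidy_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Strip whitespace from column names, rename the target, check PAY_* columns."""
    out = df.rename(columns=lambda c: str(c).strip())
    out = out.rename(columns={TARGET_RAW: TARGET})
    missing = [c for c in REPAYMENT_COLS + [TARGET] if c not in out.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return out
```

Typical use would be `df = tidy_columns(load_raw())` once the file is in place.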

Project Structure

credit-default-risk-prediction/
│
├── data/
│   └── README.md
│
├── images/
│   ├── target_distribution.png
│   ├── model_comparison.png
│   ├── roc_curves.png
│   ├── pr_curves.png
│   ├── calibration_curve.png
│   ├── threshold_cost_curve.png
│   └── shap_top_features.png
│
├── notebooks/
│   └── credit_default_prediction_clean.ipynb
│
├── README.md
├── requirements.txt
└── .gitignore

Methods Used

The project follows a full machine learning workflow:

  1. Data loading and inspection
  2. Data quality checks
  3. Exploratory data analysis
  4. Data cleaning and preprocessing
  5. Feature engineering
  6. Train/test split with stratification
  7. Model training and comparison
  8. Probability calibration
  9. Cost-sensitive threshold selection
  10. SHAP-based model explainability
  11. Business recommendations
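
Step 6 can be sketched with scikit-learn as shown here. The 60/20/20 train/validation/test proportions, the random seed, and the synthetic data are assumptions made for illustration only; the notebook may split differently. A separate validation set is included because the decision threshold is tuned on validation data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and target (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.22).astype(int)

# First hold out 20% as the test set, keeping the default rate equal via stratify.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Then split the remaining 80% into 60% train and 20% validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

for name, part in [("train", y_train), ("valid", y_valid), ("test", y_test)]:
    print(f"{name}: n={len(part)}, default rate={part.mean():.3f}")
```

Stratification keeps the share of defaults nearly identical across the three parts, which matters when the positive class is the minority.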

Feature Engineering

Several additional features were created to better describe customer repayment behavior and credit usage patterns:

  • TOTAL_BILL_6M — total bill amount across the previous six months;
  • TOTAL_PAY_6M — total payment amount across the previous six months;
  • PAY_TO_BILL_RATIO — ratio between total payments and total bill amount;
  • MAX_DPD — maximum repayment delay across the observed months;
  • NUM_DELINQ_MONTHS — number of months with payment delay;
  • NUM_NO_CONSUMPTION — number of months with no credit card consumption;
  • BILL_CHANGE_6M — change in bill amount between the most recent and oldest observed month;
  • PAY_CHANGE_6M — change in payment amount between the most recent and oldest observed month.

These features were designed to capture repayment discipline, credit utilization behavior, and changes in customer financial activity over time.
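
The features above could be computed roughly as follows. This is a sketch under stated assumptions, not the notebook's exact code: it assumes the UCI column names `BILL_AMT1` to `BILL_AMT6` and `PAY_AMT1` to `PAY_AMT6` (with index 1 as the most recent month), treats positive repayment codes as months of delay, reads the code `-2` as "no consumption", and sets the payment-to-bill ratio to 0 when the total bill is not positive.

```python
import pandas as pd

PAY_COLS = ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
BILL_COLS = [f"BILL_AMT{i}" for i in range(1, 7)]
AMT_COLS = [f"PAY_AMT{i}" for i in range(1, 7)]


def add_behaviour_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the eight engineered features added."""
    out = df.copy()

    out["TOTAL_BILL_6M"] = out[BILL_COLS].sum(axis=1)
    out["TOTAL_PAY_6M"] = out[AMT_COLS].sum(axis=1)

    # Assumption: the ratio is defined as 0 when the total bill is zero or negative.
    safe_bill = out["TOTAL_BILL_6M"].where(out["TOTAL_BILL_6M"] > 0)
    out["PAY_TO_BILL_RATIO"] = (out["TOTAL_PAY_6M"] / safe_bill).fillna(0.0)

    # Assumption: positive codes are months of delay; zero and negative codes mean no delay.
    out["MAX_DPD"] = out[PAY_COLS].clip(lower=0).max(axis=1)
    out["NUM_DELINQ_MONTHS"] = (out[PAY_COLS] > 0).sum(axis=1)

    # Assumption: code -2 marks a month with no card consumption.
    out["NUM_NO_CONSUMPTION"] = (out[PAY_COLS] == -2).sum(axis=1)

    # Most recent month minus oldest observed month.
    out["BILL_CHANGE_6M"] = out["BILL_AMT1"] - out["BILL_AMT6"]
    out["PAY_CHANGE_6M"] = out["PAY_AMT1"] - out["PAY_AMT6"]
    return out
```

Guarding the ratio against a zero denominator avoids infinite values, which would otherwise need handling before training Logistic Regression.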


Models Compared

The following models were tested:

  • Logistic Regression
  • Random Forest
  • CatBoost

CatBoost was selected as the final model because it scored best on every reported test-set metric (ROC-AUC, PR-AUC, Brier score, and log loss; see Key Results).


Evaluation Metrics

The project uses several metrics because credit default prediction is an imbalanced classification problem.

Main metrics:

  • ROC-AUC
  • PR-AUC
  • Brier Score
  • Log Loss
  • Precision
  • Recall
  • F1-score
  • Confusion Matrix

Accuracy alone is not enough here, because the target variable is imbalanced. Most customers do not default, so a model could achieve high accuracy while still missing many risky customers.
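
The point about accuracy can be illustrated with a small sketch. The 78/22 class split and the probabilities below are made up for illustration and are not the dataset's actual figures; `probability_report` is a hypothetical helper that collects the probability-based metrics listed above using scikit-learn, with PR-AUC computed as average precision.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    brier_score_loss,
    log_loss,
    recall_score,
    roc_auc_score,
)


def probability_report(y_true, y_prob) -> dict:
    """Threshold-free metrics computed from predicted default probabilities."""
    return {
        "roc_auc": roc_auc_score(y_true, y_prob),
        "pr_auc": average_precision_score(y_true, y_prob),
        "brier": brier_score_loss(y_true, y_prob),
        "log_loss": log_loss(y_true, y_prob),
    }


# Illustrative labels: 78 non-defaults and 22 defaults.
y_true = np.array([0] * 78 + [1] * 22)

# A "model" that labels every customer as safe.
always_safe = np.zeros_like(y_true)
baseline_accuracy = accuracy_score(y_true, always_safe)  # 78 of 100 correct
baseline_recall = recall_score(y_true, always_safe)      # no default is caught

print(f"accuracy={baseline_accuracy:.2f}, recall={baseline_recall:.2f}")
```

The baseline looks acceptable on accuracy while detecting no defaults at all, which is why recall, PR-AUC, and the probability metrics are reported alongside it.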


Key Results

CatBoost achieved the strongest overall performance among the tested models.

Model comparison on the test set:

Model                 ROC-AUC   PR-AUC   Brier Score   Log Loss
CatBoost              0.7756    0.5540   0.1357        0.4332
Random Forest         0.7682    0.5432   0.1403        0.4435
Logistic Regression   0.7543    0.5137   0.1918        0.5750

CatBoost was selected as the final model because it achieved the best ROC-AUC and PR-AUC and also the lowest Brier score and log loss, meaning its predicted probabilities were the most accurate of the three models.

After probability calibration, CatBoost with isotonic calibration achieved:

Model      Calibration   Brier Score   Log Loss
CatBoost   Isotonic      0.1348        0.4300
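
Isotonic calibration can be sketched as follows. This is an illustration on synthetic scores, not the project's pipeline: it assumes the calibrator is fitted on held-out model probabilities using scikit-learn's `IsotonicRegression`, whereas the notebook may use a different wrapper. The raw scores here are deliberately distorted so that the effect is visible.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)

# Synthetic "true" default probabilities and outcomes drawn from them.
true_p = rng.uniform(0.02, 0.6, size=4000)
y = (rng.random(4000) < true_p).astype(int)

# Stand-in for uncalibrated model output: correctly ranked, but overstating risk.
raw = np.sqrt(true_p)

# Fit the calibrator on one half, evaluate on the other half.
fit_idx, eval_idx = slice(0, 2000), slice(2000, 4000)
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw[fit_idx], y[fit_idx])
calibrated = iso.predict(raw[eval_idx])

brier_raw = brier_score_loss(y[eval_idx], raw[eval_idx])
brier_cal = brier_score_loss(y[eval_idx], calibrated)
print(f"Brier before: {brier_raw:.4f}, after: {brier_cal:.4f}")
```

Isotonic regression learns a non-decreasing mapping from raw scores to probabilities, so it changes the probability values without reordering customers by risk. Fitting it on data the model was not trained on keeps the calibration estimate honest.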

The project also tested cost-sensitive decision thresholds. Under the main business scenario where a missed default is five times more costly than a false alarm, the selected validation-optimised threshold was around 0.1940 instead of the default 0.5.

At this threshold, the final CatBoost model achieved:

Threshold   Precision   Recall   F1-score   Accuracy
0.1940      0.4154      0.6719   0.5134     0.7182

Compared with the default 0.5 threshold, this lower threshold flags more customers and therefore detects more defaults, at the price of more false alarms. That trade-off suits a conservative credit risk policy.
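
A cost-based threshold search can be sketched as follows. The function names, the threshold grid, and the toy inputs are assumptions for illustration; only the 5:1 cost ratio comes from the scenario described above. As a sanity check, for perfectly calibrated probabilities the cost-minimising rule is to flag a customer when p > C_FP / (C_FP + C_FN), which is 1/6 (about 0.167) at a 5:1 ratio, so an empirical optimum well below 0.5 is expected.

```python
import numpy as np


def total_cost(y_true, y_prob, threshold, fn_cost=5.0, fp_cost=1.0) -> float:
    """Total misclassification cost when flagging customers with y_prob >= threshold."""
    y_true = np.asarray(y_true)
    flagged = np.asarray(y_prob) >= threshold
    false_negatives = np.sum((y_true == 1) & ~flagged)  # missed defaults
    false_positives = np.sum((y_true == 0) & flagged)   # safe customers flagged
    return float(fn_cost * false_negatives + fp_cost * false_positives)


def best_threshold(y_true, y_prob, fn_cost=5.0, fp_cost=1.0, grid=None) -> float:
    """Grid-search the threshold with the lowest total cost (ties go to the lowest threshold)."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    costs = [total_cost(y_true, y_prob, t, fn_cost, fp_cost) for t in grid]
    return float(grid[int(np.argmin(costs))])


# Toy validation data (illustrative only).
y_valid = np.array([0, 0, 1, 1])
p_valid = np.array([0.10, 0.40, 0.35, 0.80])

print(total_cost(y_valid, p_valid, 0.5))  # misses the default scored 0.35
print(total_cost(y_valid, p_valid, 0.3))  # catches it, with one false alarm
```

The threshold should be chosen on validation data and then reported on the test set, so that the test metrics are not tuned to the test labels.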


Model Explainability

SHAP was used to interpret the CatBoost model and identify the main drivers of predicted default risk.

The most important features were related to:

  • recent repayment status,
  • credit limit,
  • maximum delinquency,
  • number of delinquent months,
  • bill statement amounts,
  • payment behavior.

This indicates that the model relies mostly on financial behavior variables, especially recent repayment history.
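
A global ranking like this is usually obtained by averaging absolute SHAP values per feature. The sketch below shows only that aggregation step: it assumes a SHAP matrix of shape (n_samples, n_features) is already available, for example from `shap.TreeExplainer(model).shap_values(X)`. The helper name and the numbers are made up for illustration and are not results from this project.

```python
import numpy as np


def rank_features_by_mean_abs_shap(shap_values, feature_names, top_k=10):
    """Rank features by mean absolute SHAP value, largest first."""
    shap_values = np.asarray(shap_values, dtype=float)
    importance = np.abs(shap_values).mean(axis=0)   # one score per feature
    order = np.argsort(importance)[::-1][:top_k]    # indices, descending importance
    return [(feature_names[i], float(importance[i])) for i in order]


# Illustrative SHAP matrix: 2 customers x 3 features (not project results).
example_shap = [
    [0.5, -0.1, 0.0],
    [-0.3, 0.2, 0.0],
]
example_names = ["PAY_0", "LIMIT_BAL", "AGE"]

ranking = rank_features_by_mean_abs_shap(example_shap, example_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking the absolute value before averaging matters: a feature that pushes some customers' risk up and others' down would otherwise average out to near zero despite being influential.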


Business Recommendation

For credit risk management, the final model should not use the default classification threshold of 0.5.

A lower threshold is more suitable when the cost of missing a default is higher than the cost of incorrectly flagging a safe customer.

The recommended approach is:

  • use CatBoost as the final model,
  • use calibrated probabilities,
  • choose the threshold based on business cost assumptions,
  • monitor recall and false positives together,
  • use SHAP explanations to support model transparency.

Visual Results

Target Distribution

Figure 1. Target distribution. The dataset is imbalanced: most clients did not default, while defaults represent a smaller but important risk group.

Repayment Behavior

PAY_0 vs Default

Figure 2. Most recent repayment status vs default. Recent repayment delays are strongly associated with a higher number of defaults.

Model Performance

Model Comparison

Figure 3. Model comparison by PR-AUC. CatBoost achieved the highest PR-AUC, which is especially important for this imbalanced classification problem.

ROC Curves

Figure 4. ROC curves. CatBoost achieved the highest ROC-AUC among the tested models.

Precision-Recall Curves

Figure 5. Precision-Recall curves. CatBoost achieved the strongest average precision, making it the best model for identifying default cases.

Probability Calibration

Calibration Curve

Figure 6. Reliability diagram. Probability calibration was used to improve the quality of predicted default probabilities.

Business Threshold Selection

Threshold Cost Curve

Figure 7. Cost-sensitive threshold selection. The best validation threshold is much lower than 0.5 under the 5:1 cost scenario.

Model Explainability

SHAP Top Features

Figure 8. SHAP feature importance. The most important predictors are recent repayment status, credit limit, maximum delay, and bill/payment behavior.


How to Run the Project

1. Clone the repository

git clone https://github.com/Expyrix/Credit-Default-Risk-Prediction.git
cd Credit-Default-Risk-Prediction

2. Create a virtual environment

python -m venv env

Activate it on Windows PowerShell:

.\env\Scripts\Activate.ps1

3. Install dependencies

pip install -r requirements.txt

4. Add the dataset

Place the dataset file into the data/ folder:

data/default of credit card clients.xls

5. Run the notebook

Open:

notebooks/credit_default_prediction_clean.ipynb

Then run all cells.


Limitations

This project uses a public dataset, so the results should not be interpreted as production-ready banking decisions.

Main limitations:

  • the dataset is historical and limited to one credit card portfolio;
  • macroeconomic variables are not included;
  • customer income and employment variables are not available;
  • the cost ratios are simplified business assumptions;
  • model performance should be validated on newer real-world data before deployment.

Planned Improvements

The next planned improvements for this project are:

  • add hyperparameter tuning for CatBoost;
  • create a simple Streamlit app for interactive default risk scoring;
  • add cross-validation for more stable model comparison;
  • add feature importance comparison across models;
  • add a simple scoring function for new applicants;
  • add a model card with limitations and ethical considerations;
  • save the final model pipeline for reproducible inference.

Author

Yaroslav Tsibirinko

Informatics graduate from the Czech University of Life Sciences Prague
Interested in data analytics, machine learning, business intelligence, and applied data science.

Made in Prague.
