This project predicts credit card default risk using machine learning models and turns model probabilities into business-oriented credit risk decisions.
The main goal is not only to classify customers as default / non-default, but also to:
- compare several machine learning models,
- estimate reliable default probabilities,
- select a cost-sensitive decision threshold,
- explain model predictions using SHAP,
- translate model results into practical credit risk recommendations.
The project is based on the UCI Default of Credit Card Clients dataset.
Credit institutions need to identify customers who are likely to default on their credit card payments.
A standard classification model is not enough for this type of problem. In credit risk, the business also needs:
- probability estimates, not only class labels;
- explainability, because financial decisions should be interpretable;
- a decision threshold that reflects business costs;
- a way to balance missed defaults and false alarms.
In this project, a false negative means that the model misses a risky customer. This is usually more expensive than a false positive, where a safe customer is incorrectly flagged as risky.
The dataset contains information about credit card clients, including:
- credit limit,
- demographic variables,
- repayment status over previous months,
- bill statement amounts,
- previous payment amounts,
- default status for the next month.
The target variable is:
default payment next month
Where:
0 = no default
1 = default
Important note about repayment columns:
The repayment status variables are named PAY_0, PAY_2, PAY_3, PAY_4, PAY_5, and PAY_6 in the original UCI dataset. There is no PAY_1 column. PAY_0 represents the most recent repayment status, while PAY_2–PAY_6 represent previous months.
The raw dataset should be placed in the data/ folder.
Expected file name:
data/default of credit card clients.xls
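A minimal sketch of loading and sanity-checking the raw file, assuming pandas with an Excel reader such as `xlrd` is installed. In the UCI spreadsheet the real column names are expected on the second row, hence `header=1`; if your copy differs, adjust that value. The helper names `load_raw` and `check_schema` are hypothetical and not taken from the notebook.

```python
import pandas as pd

RAW_PATH = "data/default of credit card clients.xls"
TARGET = "default payment next month"
# The original dataset has PAY_0 and PAY_2..PAY_6, but no PAY_1.
PAY_STATUS_COLS = ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]


def load_raw(path=RAW_PATH):
    """Read the raw spreadsheet into a DataFrame."""
    # Assumption: the usable column names sit on the second spreadsheet row.
    return pd.read_excel(path, header=1)


def check_schema(df):
    """Return the expected columns that are missing from df (empty list if none)."""
    expected = PAY_STATUS_COLS + [TARGET]
    return [col for col in expected if col not in df.columns]
```

Calling `check_schema(load_raw())` right after loading gives an early warning if a renamed column (for example `PAY_1` instead of `PAY_0`) would silently break later steps.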
credit-default-risk-prediction/
│
├── data/
│ └── README.md
│
├── images/
│ ├── target_distribution.png
│ ├── model_comparison.png
│ ├── roc_curves.png
│ ├── pr_curves.png
│ ├── calibration_curve.png
│ ├── threshold_cost_curve.png
│ └── shap_top_features.png
│
├── notebooks/
│ └── credit_default_prediction_clean.ipynb
│
├── README.md
├── requirements.txt
└── .gitignore
The project follows a full machine learning workflow:
- Data loading and inspection
- Data quality checks
- Exploratory data analysis
- Data cleaning and preprocessing
- Feature engineering
- Train/test split with stratification
- Model training and comparison
- Probability calibration
- Cost-sensitive threshold selection
- SHAP-based model explainability
- Business recommendations
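The stratified split step in the workflow above can be sketched as follows, assuming scikit-learn. Because the decision threshold is tuned on validation data, the sketch carves out a separate validation set as well. The data here is synthetic, and the 20% positive rate, split sizes, and random seed are illustrative assumptions, not the values used in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 20% positives (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 200 + [0] * 800)

# First hold out a test set, keeping the class ratio via stratify=y.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Then split the remainder into train and validation, again stratified.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)

for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: n={len(labels)}, positive rate={labels.mean():.3f}")
```

Stratification keeps the share of defaults nearly identical in every subset, so metrics computed on the validation and test sets are comparable.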
Several additional features were created to better describe customer repayment behavior and credit usage patterns:
- TOTAL_BILL_6M: total bill amount across the previous six months;
- TOTAL_PAY_6M: total payment amount across the previous six months;
- PAY_TO_BILL_RATIO: ratio between total payments and total bill amount;
- MAX_DPD: maximum repayment delay across the observed months;
- NUM_DELINQ_MONTHS: number of months with payment delay;
- NUM_NO_CONSUMPTION: number of months with no credit card consumption;
- BILL_CHANGE_6M: change in bill amount between the most recent and oldest observed month;
- PAY_CHANGE_6M: change in payment amount between the most recent and oldest observed month.
These features were designed to capture repayment discipline, credit utilization behavior, and changes in customer financial activity over time.
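One way these features could be computed with pandas is sketched below. The exact definitions are assumptions where the descriptions leave room: a delay is taken to be a repayment status above 0, "no consumption" is taken to be status code -2, index 1 (`BILL_AMT1`, `PAY_AMT1`) is taken as the most recent month and index 6 as the oldest, and the ratio is left as NaN when the total bill is zero. The function name `add_features` is hypothetical.

```python
import numpy as np
import pandas as pd

PAY_COLS = ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
BILL_COLS = [f"BILL_AMT{i}" for i in range(1, 7)]
AMT_COLS = [f"PAY_AMT{i}" for i in range(1, 7)]


def add_features(df):
    """Return a copy of df with the eight engineered columns appended."""
    out = df.copy()
    out["TOTAL_BILL_6M"] = out[BILL_COLS].sum(axis=1)
    out["TOTAL_PAY_6M"] = out[AMT_COLS].sum(axis=1)
    # Assumption: a zero total bill yields NaN instead of a division by zero.
    nonzero_bill = out["TOTAL_BILL_6M"].where(out["TOTAL_BILL_6M"] != 0)
    out["PAY_TO_BILL_RATIO"] = out["TOTAL_PAY_6M"] / nonzero_bill
    out["MAX_DPD"] = out[PAY_COLS].max(axis=1)
    # Assumption: status > 0 means the payment was delayed that month.
    out["NUM_DELINQ_MONTHS"] = (out[PAY_COLS] > 0).sum(axis=1)
    # Assumption: status -2 marks a month with no card consumption.
    out["NUM_NO_CONSUMPTION"] = (out[PAY_COLS] == -2).sum(axis=1)
    # Assumption: index 1 is the most recent month, index 6 the oldest.
    out["BILL_CHANGE_6M"] = out["BILL_AMT1"] - out["BILL_AMT6"]
    out["PAY_CHANGE_6M"] = out["PAY_AMT1"] - out["PAY_AMT6"]
    return out


# One made-up client, for illustration only.
values = [2, 0, -1, -2, -2, 1] + [1000, 900, 800, 0, 0, 500] + [100, 100, 0, 0, 200, 400]
example = pd.DataFrame([dict(zip(PAY_COLS + BILL_COLS + AMT_COLS, values))])
row = add_features(example).iloc[0]
print(row[["TOTAL_BILL_6M", "PAY_TO_BILL_RATIO", "MAX_DPD", "NUM_DELINQ_MONTHS"]])
```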
The following models were tested:
- Logistic Regression
- Random Forest
- CatBoost
CatBoost was selected as the final model because it scored best on every comparison metric (ROC-AUC, PR-AUC, Brier score, and log loss) and worked well with the structure of the dataset.
The project uses several metrics because credit default prediction is an imbalanced classification problem.
Main metrics:
- ROC-AUC
- PR-AUC
- Brier Score
- Log Loss
- Precision
- Recall
- F1-score
- Confusion Matrix
Accuracy alone is not enough here, because the target variable is imbalanced. Most customers do not default, so a model could achieve high accuracy while still missing many risky customers.
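The point about accuracy can be illustrated with a small, made-up example, assuming scikit-learn. A "model" that labels every customer as safe scores 80% accuracy on this toy sample while detecting no defaults at all, whereas the probability-based metrics listed above reward scores that actually rank and quantify risk. All numbers below are illustrative and unrelated to the project's results.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    brier_score_loss,
    log_loss,
    recall_score,
    roc_auc_score,
)

# Toy sample: 8 non-defaulters followed by 2 defaulters.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Baseline that never flags anyone.
always_safe = np.zeros_like(y_true)

# Hypothetical predicted default probabilities from some model.
y_prob = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.7, 0.5])

report = {
    "accuracy_always_safe": accuracy_score(y_true, always_safe),
    "recall_always_safe": recall_score(y_true, always_safe),
    "roc_auc": roc_auc_score(y_true, y_prob),
    "pr_auc": average_precision_score(y_true, y_prob),
    "brier": brier_score_loss(y_true, y_prob),
    "log_loss": log_loss(y_true, y_prob),
}
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```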
CatBoost achieved the strongest overall performance among the tested models.
Model comparison on the test set:
| Model | ROC-AUC | PR-AUC | Brier Score | Log Loss |
|---|---|---|---|---|
| CatBoost | 0.7756 | 0.5540 | 0.1357 | 0.4332 |
| Random Forest | 0.7682 | 0.5432 | 0.1403 | 0.4435 |
| Logistic Regression | 0.7543 | 0.5137 | 0.1918 | 0.5750 |
CatBoost was selected as the final model because it achieved the highest ROC-AUC and PR-AUC together with the lowest Brier score and log loss, meaning it produced the most accurate probability estimates of the three models.
After probability calibration, CatBoost with isotonic calibration achieved:
| Model | Calibration | Brier Score | Log Loss |
|---|---|---|---|
| CatBoost | Isotonic | 0.1348 | 0.4300 |
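
Isotonic calibration can be sketched with scikit-learn's `IsotonicRegression`, which learns a monotone mapping from raw scores to observed default frequencies. This is an illustration on synthetic, deliberately over-confident scores, not the notebook's calibration code; in a real pipeline the mapping would be fitted on held-out validation predictions and then applied to test predictions.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Synthetic setup: true default probabilities and labels drawn from them.
true_p = rng.uniform(0.05, 0.6, size=5000)
y_val = rng.binomial(1, true_p)

# Hypothetical raw model scores that systematically overstate the risk.
raw_val = np.clip(true_p * 1.5, 0.0, 1.0)

# Fit a monotone map from raw score to observed frequency.
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(raw_val, y_val)
calibrated_val = calibrator.predict(raw_val)


def brier(prob, y):
    """Mean squared difference between predicted probability and outcome."""
    return float(np.mean((prob - y) ** 2))


print("Brier before:", round(brier(raw_val, y_val), 4))
print("Brier after: ", round(brier(calibrated_val, y_val), 4))
```

Isotonic regression preserves the ranking of customers up to ties, so it mainly affects Brier score and log loss rather than ROC-AUC.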
The project also tested cost-sensitive decision thresholds. Under the main business scenario where a missed default is five times more costly than a false alarm, the selected validation-optimised threshold was around 0.1940 instead of the default 0.5.
At this threshold, the final CatBoost model achieved:
| Threshold | Precision | Recall | F1-score | Accuracy |
|---|---|---|---|---|
| 0.1940 | 0.4154 | 0.6719 | 0.5134 | 0.7182 |
Compared with the default 0.5 threshold, this lower threshold detects more defaults at the price of more false alarms, a trade-off that suits a conservative credit risk policy.
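A cost-sensitive threshold search of this kind can be sketched with NumPy alone. The sketch assumes the total cost is `cost_fn * FN + cost_fp * FP` with a 5:1 ratio, scans a fixed grid of thresholds, and keeps the cheapest one; the function names, the grid, and the toy data are hypothetical and do not reproduce the notebook's procedure or its 0.1940 result.

```python
import numpy as np


def total_cost(y_true, y_prob, threshold, cost_fn=5.0, cost_fp=1.0):
    """Business cost of flagging every customer with y_prob >= threshold."""
    flagged = y_prob >= threshold
    false_negatives = np.sum((y_true == 1) & ~flagged)
    false_positives = np.sum((y_true == 0) & flagged)
    return float(cost_fn * false_negatives + cost_fp * false_positives)


def best_threshold(y_true, y_prob, cost_fn=5.0, cost_fp=1.0, grid=None):
    """Return the grid threshold with the lowest total cost (first one on ties)."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    costs = [total_cost(y_true, y_prob, t, cost_fn, cost_fp) for t in grid]
    return float(grid[int(np.argmin(costs))])


# Toy validation data, for illustration only.
y_val = np.array([0, 0, 0, 0, 1, 0, 1, 1])
p_val = np.array([0.05, 0.10, 0.15, 0.25, 0.30, 0.40, 0.60, 0.80])

t_best = best_threshold(y_val, p_val)
print("best threshold:", round(t_best, 2))
print("cost at best:  ", total_cost(y_val, p_val, t_best))
print("cost at 0.5:   ", total_cost(y_val, p_val, 0.5))
```

On the toy data the default 0.5 cut-off misses one defaulter, which costs more than the single false alarm accepted at the lower threshold.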
SHAP was used to interpret the CatBoost model and identify the main drivers of predicted default risk.
The most important features were related to:
- recent repayment status,
- credit limit,
- maximum delinquency,
- number of delinquent months,
- bill statement amounts,
- payment behavior.
This indicates that the model relies mostly on financial behavior variables, especially recent repayment history.
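A global SHAP ranking like the one above is commonly obtained by averaging the absolute SHAP values of each feature over all rows. The sketch below shows only that aggregation step with NumPy. The matrix is made up for illustration; in practice it would come from a SHAP explainer (for tree models, something like `shap.TreeExplainer(model).shap_values(X)`), and the function name is hypothetical.

```python
import numpy as np


def rank_by_mean_abs_shap(shap_values, feature_names):
    """Sort features by mean absolute SHAP value, most important first."""
    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(importance)[::-1]
    return [(feature_names[i], float(importance[i])) for i in order]


# Made-up SHAP matrix: 3 customers (rows) x 3 features (columns).
toy_shap = np.array([
    [0.30, -0.05, 0.10],
    [-0.20, 0.02, 0.15],
    [0.40, -0.01, -0.05],
])
toy_features = ["PAY_0", "AGE", "LIMIT_BAL"]

ranking = rank_by_mean_abs_shap(toy_shap, toy_features)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking absolute values matters: a feature that pushes some customers strongly up and others strongly down would otherwise average out to roughly zero.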
For credit risk management, the final model should not use the default classification threshold of 0.5.
A lower threshold is more suitable when the cost of missing a default is higher than the cost of incorrectly flagging a safe customer.
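The reasoning can be made concrete under two simplifying assumptions: probabilities are perfectly calibrated and correct decisions cost nothing. For a customer with default probability p, not flagging costs p * cost_fn in expectation and flagging costs (1 - p) * cost_fp, so flagging is cheaper whenever p exceeds cost_fp / (cost_fp + cost_fn). The helper below is a hypothetical illustration of that formula.

```python
def cost_optimal_threshold(cost_fn, cost_fp):
    """Theoretical cut-off above which flagging has the lower expected cost.

    Assumes perfectly calibrated probabilities and zero cost for correct
    decisions; real thresholds tuned on data can differ.
    """
    return cost_fp / (cost_fp + cost_fn)


print(round(cost_optimal_threshold(cost_fn=5, cost_fp=1), 4))  # 5:1 scenario
print(round(cost_optimal_threshold(cost_fn=1, cost_fp=1), 4))  # equal costs
```

For the 5:1 scenario this gives about 0.167, in the same range as the validation-optimised 0.1940 reported above; the gap is expected because the empirical threshold is tuned on a finite sample with imperfect calibration. With equal costs the formula returns the familiar 0.5.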
The recommended approach is:
- use CatBoost as the final model,
- use calibrated probabilities,
- choose the threshold based on business cost assumptions,
- monitor recall and false positives together,
- use SHAP explanations to support model transparency.
Figure 1. Target distribution. The dataset is imbalanced: most clients did not default, while defaults represent a smaller but important risk group.
Figure 2. Most recent repayment status vs default. Recent repayment delays are strongly associated with a higher number of defaults.
Figure 3. Model comparison by PR-AUC. CatBoost achieved the highest PR-AUC, which is especially important for this imbalanced classification problem.
Figure 4. ROC curves. CatBoost achieved the highest ROC-AUC among the tested models.
Figure 5. Precision-Recall curves. CatBoost achieved the strongest average precision, making it the best model for identifying default cases.
Figure 6. Reliability diagram. Probability calibration was used to improve the quality of predicted default probabilities.
Figure 7. Cost-sensitive threshold selection. The best validation threshold is much lower than 0.5 under the 5:1 cost scenario.
Figure 8. SHAP feature importance. The most important predictors are recent repayment status, credit limit, maximum delay, and bill/payment behavior.
git clone https://github.com/Expyrix/Credit-Default-Risk-Prediction.git
cd Credit-Default-Risk-Prediction
python -m venv env
Activate it on Windows PowerShell:
.\env\Scripts\Activate.ps1
pip install -r requirements.txt
Place the dataset file into the data/ folder:
data/default of credit card clients.xls
Open:
notebooks/credit_default_prediction_clean.ipynb
Then run all cells.
This project uses a public dataset, so the results should not be interpreted as production-ready banking decisions.
Main limitations:
- the dataset is historical and limited to one credit card portfolio;
- macroeconomic variables are not included;
- customer income and employment variables are not available;
- the cost ratios are simplified business assumptions;
- model performance should be validated on newer real-world data before deployment.
The next planned improvements for this project are:
- add hyperparameter tuning for CatBoost;
- create a simple Streamlit app for interactive default risk scoring;
- add cross-validation for more stable model comparison;
- add feature importance comparison across models;
- add a simple scoring function for new applicants;
- add a model card with limitations and ethical considerations;
- save the final model pipeline for reproducible inference.
Yaroslav Tsibirinko
Informatics graduate from the Czech University of Life Sciences Prague
Interested in data analytics, machine learning, business intelligence, and applied data science.
Made in Prague.







