ML-Based Fraud Detection + LLM Investigation Assistant
A production-style risk evaluation system that combines ensemble machine learning with Generative AI to detect suspicious financial transactions, explain risk indicators via SHAP, and generate structured investigation reports — directly mirroring the workflow of an Applied Scientist on a Buyer Risk Prevention team.
| Component | Technology |
|---|---|
| Fraud detection models | Random Forest · XGBoost · Logistic Regression |
| Class balancing | SMOTE (Synthetic Minority Over-sampling) |
| Risk scoring | Probabilistic score 0–100 with verdict thresholds |
| Explainability | SHAP TreeExplainer — feature-level attribution |
| Risk flags | Rule-based human-readable indicators |
| Investigation reports | Anthropic Claude API (claude-sonnet-4-20250514) |
| Dashboard | Streamlit (3-tab interactive UI) |
Credit Card Fraud Detection — Kaggle / ULB Machine Learning Group
| Attribute | Value |
|---|---|
| Total transactions | 284,807 |
| Fraudulent | 492 (0.173%) |
| Genuine | 284,315 (99.827%) |
| Features | V1–V28 (PCA-transformed) + Amount + Time |
| Train / Test split | 80% / 20% stratified |
The dataset is severely imbalanced (1 fraud per ~578 genuine transactions). Three strategies were combined:
- SMOTE applied to training set only (10% minority sampling ratio)
- `scale_pos_weight` in XGBoost; `class_weight="balanced"` in Logistic Regression and Random Forest
- Primary evaluation metric: F1 Score and Average Precision rather than accuracy

Feature preprocessing:
- `Amount` and `Time` standardised with `StandardScaler`
- V1–V28 PCA components used directly (no further transformation needed)
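
To make the imbalance handling concrete, the sketch below works through the class counts implied by the figures in this README. It rests on two assumptions that are not read from `train_model.py`: that the training split holds the dataset totals minus the test-set counts shown in the confusion matrix, and that the "10% minority sampling ratio" means a minority-to-majority ratio of 0.1 after resampling (the meaning of a float `sampling_strategy` in imbalanced-learn). The `scale_pos_weight` value uses the common negatives-over-positives heuristic, which may differ from the value actually used.

```python
# Dataset totals come from the dataset table; test-set counts come from the
# confusion matrix (assumed here to be the stratified 20% split).
TOTAL_GENUINE, TOTAL_FRAUD = 284_315, 492
TEST_GENUINE, TEST_FRAUD = 56_853 + 11, 19 + 79

train_genuine = TOTAL_GENUINE - TEST_GENUINE
train_fraud = TOTAL_FRAUD - TEST_FRAUD

# Common heuristic for XGBoost's scale_pos_weight: negatives / positives.
scale_pos_weight = train_genuine / train_fraud

# Assumed reading of the SMOTE setting: fraud rows are oversampled until
# they number 10% of the genuine rows in the training set.
SMOTE_RATIO = 0.10
fraud_after_smote = int(SMOTE_RATIO * train_genuine)
synthetic_rows = fraud_after_smote - train_fraud

print(f"Training rows: {train_genuine:,} genuine, {train_fraud} fraud")
print(f"scale_pos_weight heuristic: {scale_pos_weight:.1f}")
print(f"Fraud rows after SMOTE: {fraud_after_smote:,} ({synthetic_rows:,} synthetic)")
```

Because SMOTE is applied to the training set only, the test set keeps the original fraud rate, so the reported metrics reflect the real class balance.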
| Model | Precision | Recall | F1 Score | ROC-AUC | Avg Precision |
|---|---|---|---|---|---|
| Logistic Regression | 0.058 | 0.918 | 0.109 | 0.969 | 0.726 |
| Random Forest ✅ | 0.878 | 0.806 | 0.840 | 0.965 | 0.867 |
| XGBoost | 0.748 | 0.847 | 0.794 | 0.978 | 0.870 |
XGBoost achieved the highest ROC-AUC (0.978) but Random Forest was selected for deployment due to:
- Superior F1 Score (0.840 vs 0.794) — better precision-recall balance, meaning fewer false positives sent to manual review
- Higher Precision (0.878 vs 0.748) — in a fraud review queue, low-precision models waste analyst time on false alarms
- Interpretability — Random Forest SHAP values are more stable and consistent, better suited for explainable risk decisions
- Production stability — Random Forest is less sensitive to hyperparameter tuning and performs reliably without extensive calibration
In a real Buyer Risk Prevention context, a model that catches 80.6% of fraud with 87.8% precision is preferable to one that catches slightly more fraud (84.7% recall) at a precision 13 percentage points lower (74.8%), where roughly one alert in four is a false alarm.
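
The review-queue argument can be made tangible by back-calculating alert counts from the rounded precision and recall in the results table. This is a rough sketch: the helper `alert_counts` is hypothetical, the 98 fraud rows are taken from the test-set confusion matrix, and the XGBoost counts are estimates derived from rounded metrics rather than figures reported by the project.

```python
def alert_counts(precision: float, recall: float, n_fraud: int) -> tuple[int, int]:
    """Return (true_positives, false_positives) implied by precision and recall."""
    tp = round(recall * n_fraud)
    # precision = tp / (tp + fp)  =>  fp = tp * (1 - precision) / precision
    fp = round(tp * (1 - precision) / precision)
    return tp, fp


N_FRAUD_TEST = 98  # fraud rows in the test set (79 caught + 19 missed)

rf_tp, rf_fp = alert_counts(0.878, 0.806, N_FRAUD_TEST)
xgb_tp, xgb_fp = alert_counts(0.748, 0.847, N_FRAUD_TEST)

print(f"Random Forest: {rf_tp} fraud caught, {rf_fp} false alarms")   # 79 and 11
print(f"XGBoost:       {xgb_tp} fraud caught, {xgb_fp} false alarms")  # 83 and 28
```

Under these assumptions XGBoost would catch about four more fraud cases on the test set while sending roughly 17 more genuine transactions to manual review.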
prob = model.predict_proba(transaction)[0, 1]
risk_score = prob * 100 # continuous 0–100
# Decision thresholds
HIGH (≥ 70) → Block / Manual Review
MEDIUM (≥ 40) → Flag for Review
LOW (< 40) → Approve

Confusion Matrix Results (test set):

| | Predicted Genuine | Predicted Fraud |
|---|---|---|
| Actual Genuine | 56,853 | 11 |
| Actual Fraud | 19 | 79 |

Recall = 79 / (79 + 19) = 80.6%
Precision = 79 / (79 + 11) = 87.8%
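
The thresholds and the confusion-matrix arithmetic above can be checked with a short, self-contained sketch. The function name `verdict` and the `ACTIONS` mapping are illustrative and not necessarily what `risk_scoring.py` uses; the counts are the test-set values reported above.

```python
# Actions per verdict, as listed in the decision thresholds.
ACTIONS = {
    "HIGH": "Block / Manual Review",
    "MEDIUM": "Flag for Review",
    "LOW": "Approve",
}


def verdict(risk_score: float) -> str:
    """Map a 0-100 risk score to a verdict using the 70 / 40 thresholds."""
    if risk_score >= 70:
        return "HIGH"
    if risk_score >= 40:
        return "MEDIUM"
    return "LOW"


# Test-set confusion matrix for the selected Random Forest model.
TN, FP, FN, TP = 56_853, 11, 19, 79

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(verdict(91), "->", ACTIONS[verdict(91)])
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The computed F1 agrees with the 0.840 reported for Random Forest in the results table.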
SHAP (SHapley Additive exPlanations) via TreeExplainer is used to attribute each prediction to individual features.
Top Risk Drivers (Mean |SHAP Value|):
| Rank | Feature | Mean \|SHAP\| | Interpretation |
|---|---|---|---|
| 1 | V14 | 0.0799 | Spending behaviour pattern |
| 2 | V12 | 0.0720 | Transaction frequency pattern |
| 3 | V4 | 0.0683 | Risk profile indicator |
| 4 | V3 | 0.0532 | Historical deviation |
| 5 | V10 | 0.0483 | Merchant category pattern |

V1–V28 are anonymised PCA components, so the interpretations above are illustrative labels rather than documented feature meanings.
These SHAP values are used directly in the LLM investigation prompt to ground the AI report in quantitative evidence rather than heuristics.
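
As a minimal illustration of how a "mean |SHAP|" ranking is formed, the sketch below averages absolute per-transaction attributions for each feature and sorts the result. The numbers are made up for the example and are not the project's output; in the project the per-transaction values would come from the SHAP TreeExplainer.

```python
from statistics import mean

# Toy per-transaction SHAP values (made-up numbers, NOT the project's output).
shap_rows = [
    {"V14": -0.12, "V12": 0.05, "V4": 0.02},
    {"V14": 0.08, "V12": -0.09, "V4": 0.01},
    {"V14": -0.04, "V12": 0.04, "V4": -0.06},
]

features = list(shap_rows[0])

# The sign is discarded: a feature matters whether it pushes the score up or down.
mean_abs = {f: mean(abs(row[f]) for row in shap_rows) for f in features}

# Rank features from most to least influential.
ranking = sorted(mean_abs.items(), key=lambda item: item[1], reverse=True)

for rank, (feature, value) in enumerate(ranking, start=1):
    print(f"{rank}. {feature}: {value:.4f}")
```

Taking the absolute value before averaging matters because positive and negative attributions would otherwise cancel out.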
Each high-risk transaction is passed to Claude (`claude-sonnet-4-20250514`) with:
- Risk score and model verdict
- Top SHAP drivers
- Human-readable risk flags
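
One way the three inputs above could be assembled into a prompt is sketched below. The helper `build_investigation_prompt`, its wording, and the example flag are hypothetical; the actual prompt in `llm_investigator.py` is not shown in this README and may differ. The sketch only builds the text and makes no API call.

```python
REPORT_SECTIONS = [
    "RISK LEVEL",
    "EXECUTIVE SUMMARY",
    "DETAILED ANALYSIS",
    "RECOMMENDED ACTION",
    "REASONING",
]


def build_investigation_prompt(txn_id, risk_score, verdict, shap_drivers, risk_flags):
    """Assemble the prompt text for one flagged transaction (hypothetical helper)."""
    driver_lines = "\n".join(
        f"- {feature}: SHAP impact {impact:+.4f}" for feature, impact in shap_drivers
    )
    flag_lines = "\n".join(f"- {flag}" for flag in risk_flags)
    sections = ", ".join(REPORT_SECTIONS)
    return (
        f"Transaction {txn_id} received a risk score of {risk_score:.0f}/100 "
        f"(verdict: {verdict}).\n\n"
        f"Top SHAP drivers:\n{driver_lines}\n\n"
        f"Risk flags:\n{flag_lines}\n\n"
        f"Write an investigation report with exactly these sections: {sections}."
    )


prompt = build_investigation_prompt(
    "TXN-45821",
    91,
    "HIGH",
    shap_drivers=[("V14", -0.079), ("V12", -0.051)],  # illustrative values
    risk_flags=["Amount well above the account's usual range"],  # made-up flag
)
print(prompt)
```

Passing the numeric evidence explicitly keeps the model's narrative tied to the score and attributions rather than to free-form guesses.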
Claude returns a structured 5-section investigation report:
RISK LEVEL: High
EXECUTIVE SUMMARY:
Transaction TXN-45821 exhibits multiple strong indicators of fraudulent
activity, with a risk score of 91/100 driven primarily by anomalous
patterns in V14 and V12 features.
DETAILED ANALYSIS:
The dominant SHAP driver V14 (impact: 0.079) represents a strong
deviation from the account's historical spending behaviour...
RECOMMENDED ACTION: Flag for Manual Review
REASONING:
Score of 91 with 4 concurrent risk flags exceeds the high-risk
threshold; human review is warranted before blocking.
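
Because the report follows a fixed set of section labels, it can be split into fields for display or validation. The sketch below is one possible approach, assuming each label starts a line and is followed by a colon as in the example above; `parse_report` is a hypothetical helper, not part of the project's code.

```python
import re

SECTIONS = [
    "RISK LEVEL",
    "EXECUTIVE SUMMARY",
    "DETAILED ANALYSIS",
    "RECOMMENDED ACTION",
    "REASONING",
]

# Matches any section label at the start of a line, followed by a colon.
_HEADER = re.compile(
    r"^(" + "|".join(re.escape(name) for name in SECTIONS) + r"):[ \t]*",
    re.MULTILINE,
)


def parse_report(text: str) -> dict[str, str]:
    """Split a report into {section name: body}, collapsing line wraps."""
    matches = list(_HEADER.finditer(text))
    report = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        report[match.group(1)] = " ".join(text[match.end():end].split())
    return report


sample = """RISK LEVEL: High
EXECUTIVE SUMMARY:
Transaction TXN-45821 exhibits multiple strong indicators of fraudulent
activity, with a risk score of 91/100.
DETAILED ANALYSIS:
The dominant SHAP driver is V14.
RECOMMENDED ACTION: Flag for Manual Review
REASONING:
Score of 91 exceeds the high-risk threshold.
"""

parsed = parse_report(sample)
missing = [name for name in SECTIONS if name not in parsed]
print(parsed["RISK LEVEL"], "|", parsed["RECOMMENDED ACTION"], "| missing:", missing)
```

Checking `missing` before showing a report is a cheap guard against a response that omits or renames a section.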
| File | Description |
|---|---|
| `class_distribution.png` | Fraud vs Genuine count (log scale) |
| `roc_curve.png` | ROC curves for all 3 models |
| `pr_curve.png` | Precision-Recall curves (primary metric for imbalanced data) |
| `confusion_matrix.png` | Best model: TN/FP/FN/TP breakdown |
| `feature_importance.png` | Top 15 features by Random Forest Gini importance |
| `shap_summary.png` | SHAP beeswarm: feature impact direction + magnitude |
| `shap_bar.png` | SHAP bar chart: mean absolute impact per feature |
Intelligent-Risk-Review-System/
├── data/
│ └── creditcard.csv # Kaggle ULB dataset (add manually)
├── notebooks/
│ └── model_training.ipynb # Full EDA + training walkthrough
├── src/
│ ├── train_model.py # Phase 1+2: Training & evaluation
│ ├── risk_scoring.py # Phase 2: Risk score computation
│ ├── risk_flags.py # Phase 3: Human-readable risk flags
│ └── llm_investigator.py # Phase 4: Claude API investigation
├── models/ # Saved artefacts (auto-generated)
│ ├── best_model.pkl
│ └── feature_cols.pkl
├── screenshots/ # Evaluation plots (auto-generated)
├── app.py # Phase 5: Streamlit dashboard
├── requirements.txt
└── README.md
# 1. Clone and install
git clone https://github.com/YOUR_USERNAME/Intelligent-Risk-Review-System
cd Intelligent-Risk-Review-System
pip install -r requirements.txt
# 2. Add dataset (download from Kaggle)
# Place creditcard.csv in data/
# 3. Train models + generate all plots
python src/train_model.py
# 4. Set Anthropic API key
export ANTHROPIC_API_KEY=your_key_here
# 5. Launch dashboard
streamlit run app.py

- Real-time transaction streaming via Kafka
- Per-transaction SHAP waterfall plots in the dashboard
- Threshold optimisation via cost-sensitive learning (asymmetric FP/FN costs)
- A/B testing framework for model variant comparison
- Active learning loop to retrain on reviewed decisions
Intelligent Risk Review System | Python · Random Forest · XGBoost · SHAP · Anthropic Claude API · Streamlit
- Developed an ML-driven risk evaluation system for detecting fraudulent financial transactions across 284,807 labelled samples (0.17% fraud rate), achieving F1 of 0.840 and ROC-AUC of 0.965 on held-out test data.
- Addressed severe class imbalance (1:578 ratio) via SMOTE and cost-sensitive learning; selected Random Forest over XGBoost based on superior precision (87.8%) critical for minimising false positives in fraud review queues.
- Implemented SHAP TreeExplainer to produce per-transaction feature attribution, identifying V14, V12, and V4 as the dominant fraud drivers.
- Integrated Anthropic Claude API to auto-generate structured investigation reports grounding LLM reasoning in quantitative SHAP evidence and model risk scores.