🛡️ Intelligent Risk Review System

ML-Based Fraud Detection + LLM Investigation Assistant

A production-style risk evaluation system that combines ensemble machine learning with Generative AI to detect suspicious financial transactions, explain risk indicators via SHAP, and generate structured investigation reports — directly mirroring the workflow of an Applied Scientist on a Buyer Risk Prevention team.

Overview

Component	Technology
Fraud detection models	Random Forest · XGBoost · Logistic Regression
Class balancing	SMOTE (Synthetic Minority Over-sampling)
Risk scoring	Probabilistic score 0–100 with verdict thresholds
Explainability	SHAP TreeExplainer — feature-level attribution
Risk flags	Rule-based human-readable indicators
Investigation reports	Anthropic Claude API (`claude-sonnet-4-20250514`)
Dashboard	Streamlit (3-tab interactive UI)

Dataset

Credit Card Fraud Detection — Kaggle / ULB Machine Learning Group

Attribute	Value
Total transactions	284,807
Fraudulent	492 (0.173%)
Genuine	284,315 (99.827%)
Features	V1–V28 (PCA-transformed) + Amount + Time
Train / Test split	80% / 20% stratified

Methodology

Handling Class Imbalance

The dataset is severely imbalanced (1 fraud per ~578 genuine transactions). Three strategies were combined:

SMOTE applied to training set only (10% minority sampling ratio)
scale_pos_weight in XGBoost
class_weight="balanced" in Logistic Regression and Random Forest
Primary evaluation metric: F1 Score and Average Precision rather than accuracy

Feature Engineering

Amount and Time standardised with StandardScaler
V1–V28 PCA components used directly (no further transformation needed)

Model Comparison

Model	Precision	Recall	F1 Score	ROC-AUC	Avg Precision
Logistic Regression	0.058	0.918	0.109	0.969	0.726
Random Forest ✅	0.878	0.806	0.840	0.965	0.867
XGBoost	0.748	0.847	0.794	0.978	0.870

Why Random Forest over XGBoost?

XGBoost achieved the highest ROC-AUC (0.978) but Random Forest was selected for deployment due to:

Superior F1 Score (0.840 vs 0.794) — better precision-recall balance, meaning fewer false positives sent to manual review
Higher Precision (0.878 vs 0.748) — in a fraud review queue, low-precision models waste analyst time on false alarms
Interpretability — Random Forest SHAP values are more stable and consistent, better suited for explainable risk decisions
Production stability — Random Forest is less sensitive to hyperparameter tuning and performs reliably without extensive calibration

In a real Buyer Risk Prevention context, a model that catches 80.6% of fraud with 87.8% precision is preferable to one that catches slightly more fraud but generates 13% more false positives.

Risk Scoring

prob       = model.predict_proba(transaction)[0, 1]
risk_score = prob * 100   # continuous 0–100

# Decision thresholds
HIGH   (≥ 70) → Block / Manual Review
MEDIUM (≥ 40) → Flag for Review  
LOW    (< 40) → Approve

Confusion Matrix Results (test set):

                Predicted Genuine   Predicted Fraud
Actual Genuine       56,853               11
Actual Fraud             19               79

Recall    = 79 / (79 + 19) = 80.6%
Precision = 79 / (79 + 11) = 87.8%

SHAP Explainability

SHAP (SHapley Additive exPlanations) via TreeExplainer is used to attribute each prediction to individual features.

Top Risk Drivers (Mean |SHAP Value|):

| Rank | Feature | Mean |SHAP| | Interpretation | |---|---|---|---| | 1 | V14 | 0.0799 | Spending behaviour pattern | | 2 | V12 | 0.0720 | Transaction frequency pattern | | 3 | V4 | 0.0683 | Risk profile indicator | | 4 | V3 | 0.0532 | Historical deviation | | 5 | V10 | 0.0483 | Merchant category pattern |

These SHAP values are used directly in the LLM investigation prompt to ground the AI report in quantitative evidence rather than heuristics.

LLM Investigation Layer

Each high-risk transaction is passed to Claude claude-sonnet-4-20250514 with:

Risk score and model verdict
Top SHAP drivers
Human-readable risk flags

Claude returns a structured 5-section investigation report:

RISK LEVEL: High

EXECUTIVE SUMMARY:
Transaction TXN-45821 exhibits multiple strong indicators of fraudulent
activity, with a risk score of 91/100 driven primarily by anomalous
patterns in V14 and V12 features.

DETAILED ANALYSIS:
The dominant SHAP driver V14 (impact: 0.079) represents a strong
deviation from the account's historical spending behaviour...

RECOMMENDED ACTION: Flag for Manual Review

REASONING:
Score of 91 with 4 concurrent risk flags exceeds the high-risk
threshold; human review is warranted before blocking.

Visualisations

File	Description
`class_distribution.png`	Fraud vs Genuine count (log scale)
`roc_curve.png`	ROC curves — all 3 models
`pr_curve.png`	Precision-Recall curves (primary metric for imbalanced data)
`confusion_matrix.png`	Best model — TN/FP/FN/TP breakdown
`feature_importance.png`	Top 15 features — Random Forest Gini importance
`shap_summary.png`	SHAP beeswarm — feature impact direction + magnitude
`shap_bar.png`	SHAP bar chart — mean absolute impact per feature

Repository Structure

Intelligent-Risk-Review-System/
├── data/
│   └── creditcard.csv              # Kaggle ULB dataset (add manually)
├── notebooks/
│   └── model_training.ipynb        # Full EDA + training walkthrough
├── src/
│   ├── train_model.py              # Phase 1+2: Training & evaluation
│   ├── risk_scoring.py             # Phase 2: Risk score computation
│   ├── risk_flags.py               # Phase 3: Human-readable risk flags
│   └── llm_investigator.py         # Phase 4: Claude API investigation
├── models/                         # Saved artefacts (auto-generated)
│   ├── best_model.pkl
│   └── feature_cols.pkl
├── screenshots/                    # Evaluation plots (auto-generated)
├── app.py                          # Phase 5: Streamlit dashboard
├── requirements.txt
└── README.md

Setup & Run

# 1. Clone and install
git clone https://github.com/YOUR_USERNAME/Intelligent-Risk-Review-System
cd Intelligent-Risk-Review-System
pip install -r requirements.txt

# 2. Add dataset (download from Kaggle)
# Place creditcard.csv in data/

# 3. Train models + generate all plots
python src/train_model.py

# 4. Set Anthropic API key
export ANTHROPIC_API_KEY=your_key_here

# 5. Launch dashboard
streamlit run app.py

Future Work

Real-time transaction streaming via Kafka
Per-transaction SHAP waterfall plots in the dashboard
Threshold optimisation via cost-sensitive learning (asymmetric FP/FN costs)
A/B testing framework for model variant comparison
Active learning loop to retrain on reviewed decisions

Resume Entry

Intelligent Risk Review System | Python · Random Forest · XGBoost · SHAP · Anthropic Claude API · Streamlit

Developed an ML-driven risk evaluation system for detecting fraudulent financial transactions across 284,807 labelled samples (0.17% fraud rate), achieving F1 of 0.840 and ROC-AUC of 0.965 on held-out test data.
Addressed severe class imbalance (1:578 ratio) via SMOTE and cost-sensitive learning; selected Random Forest over XGBoost based on superior precision (87.8%) critical for minimising false positives in fraud review queues.
Implemented SHAP TreeExplainer to produce per-transaction feature attribution, identifying V14, V12, and V4 as the dominant fraud drivers.
Integrated Anthropic Claude API to auto-generate structured investigation reports grounding LLM reasoning in quantitative SHAP evidence and model risk scores.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🛡️ Intelligent Risk Review System

Overview

Dataset

Methodology

Handling Class Imbalance

Feature Engineering

Model Comparison

Why Random Forest over XGBoost?

Risk Scoring

SHAP Explainability

LLM Investigation Layer

Visualisations

Repository Structure

Setup & Run

Future Work

Resume Entry

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
data		data
notebooks		notebooks
screenshots		screenshots
src		src
README.md		README.md
app.py		app.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

🛡️ Intelligent Risk Review System

Overview

Dataset

Methodology

Handling Class Imbalance

Feature Engineering

Model Comparison

Why Random Forest over XGBoost?

Risk Scoring

SHAP Explainability

LLM Investigation Layer

Visualisations

Repository Structure

Setup & Run

Future Work

Resume Entry

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages