🛡️ ML-Based DNS Tunneling & Botnet Detection System

Hybrid Model: Synthetic Training + CTU-13 Real-World Validation

A machine-learning system that detects DNS tunneling and botnet C2 activity by analysing statistical features of DNS queries. It uses a hybrid data approach: a Random Forest model is trained on synthetic DNS data, and the pipeline is then validated on real CTU-13 network flow data.


✨ Features

  • Hybrid ML pipeline
  • Synthetic DNS training
  • CTU-13 validation
  • Random Forest classifier
  • Interactive Bootstrap 5 interface
  • Confidence scoring
  • Feature importance visualization
  • Confusion matrices
  • Excel-based performance report
  • Vercel deployment

🌐 Live Demo

Deployed Application

🚀 Quick Start

1 · Install dependencies

pip install -r requirements.txt

2 · Train the model (both phases)

python train_model.py

This will run two phases:

  • Phase 1 — Generate & train on 2,000 synthetic DNS records
  • Phase 2 — Load & validate on real CTU-13 binetflow data

Outputs generated:

  • models/rf_model.joblib + models/scaler.joblib ← used by Flask UI
  • models/rf_real.joblib + models/scaler_real.joblib ← CTU-13 model
  • model_results.xlsx ← 9 sheets with full results
  • static/*.png ← 5 charts (feature importance, confusion matrices, comparison)
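
As a rough illustration of what Phase 1 might involve, the sketch below fits a scaler and a Random Forest on a seven-column feature matrix. It is not the contents of `train_model.py`: the function name `train_phase1`, the 80/20 split, `n_estimators=100`, and the randomly generated stand-in data are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Feature names as listed for the synthetic dataset.
FEATURES = ["domain_length", "entropy", "num_subdomains", "query_frequency",
            "digit_count", "special_char_count", "longest_label_len"]


def train_phase1(X, y, seed=42):
    """Fit a scaler and a Random Forest; return both plus held-out accuracy.

    Hypothetical helper: split ratio and hyperparameters are assumptions.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    # Fit the scaler on the training split only, then reuse it for the test split.
    scaler = StandardScaler().fit(X_train)
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(scaler.transform(X_train), y_train)
    acc = accuracy_score(y_test, model.predict(scaler.transform(X_test)))
    return model, scaler, acc


# Stand-in data (NOT the project's dataset): two well-separated clusters.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, len(FEATURES))),  # label 0
    rng.normal(loc=3.0, scale=1.0, size=(100, len(FEATURES))),  # label 1
])
y = np.array([0] * 100 + [1] * 100)

model, scaler, acc = train_phase1(X, y)
# Persisting the pair could then look like:
#   joblib.dump(model, "models/rf_model.joblib")
#   joblib.dump(scaler, "models/scaler.joblib")
```

Saving the scaler next to the model matters because any later prediction has to apply the same scaling that was fitted during training.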

3 · Launch the web app

python app.py

Open http://127.0.0.1:5000


📁 Project Structure

dns_detection/
│
├── dataset/
│   ├── dns_data.csv              ← synthetic DNS dataset (auto-generated)
│   └── ctu13_sample.csv          ← CTU-13 real binetflow data
│
├── models/
│   ├── rf_model.joblib           ← trained synthetic RF model (used by UI)
│   ├── scaler.joblib             ← synthetic scaler
│   ├── rf_real.joblib            ← trained CTU-13 RF model
│   └── scaler_real.joblib        ← CTU-13 scaler
│
├── static/
│   ├── css/style.css
│   ├── feature_importance.png       ← synthetic model chart
│   ├── confusion_matrix.png         ← synthetic confusion matrix
│   ├── real_feature_importance.png  ← CTU-13 model chart
│   ├── real_confusion_matrix.png    ← CTU-13 confusion matrix
│   └── comparison_chart.png         ← side-by-side comparison
│
├── templates/
│   └── index.html                ← modern Bootstrap 5 UI
│
├── app.py                        ← Flask application
├── train_model.py                ← hybrid ML pipeline
├── utils.py                      ← DNS feature extraction
├── generate_dataset.py           ← synthetic data generator
├── model_results.xlsx            ← 9-sheet Excel report
├── project_details.txt           ← full documentation
├── requirements.txt
└── README.md

🔀 Hybrid Data Approach

| Phase | Dataset | Features | Purpose |
|---|---|---|---|
| Phase 1 | Synthetic (2,000 rows) | domain_length, entropy, num_subdomains, query_frequency, digit_count, special_char_count, longest_label_len | Train primary model for UI |
| Phase 2 | CTU-13 real binetflow | Dur, SrcBytes, DstBytes, TotPkts, TotBytes, SrcPkts, DstPkts | Validate on real network flows |
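
To make the Phase 2 input more concrete, here is a minimal sketch of reading flow rows and turning the label text into a binary target. It assumes the CSV has the seven feature columns named above plus a `Label` column whose text contains "Botnet" for malicious flows, as in the original CTU-13 labels; the actual layout of `ctu13_sample.csv` may differ. The helper `load_flows` and the three sample rows are invented for illustration.

```python
import csv
import io

# Feature columns as listed for Phase 2.
FLOW_FEATURES = ["Dur", "SrcBytes", "DstBytes", "TotPkts",
                 "TotBytes", "SrcPkts", "DstPkts"]


def load_flows(handle, label_column="Label"):
    """Return (rows, labels): rows are lists of floats, labels are 0 or 1.

    Assumption: any label containing "botnet" (case-insensitive) is
    malicious; everything else (normal, background) is treated as benign.
    """
    rows, labels = [], []
    for record in csv.DictReader(handle):
        rows.append([float(record[name]) for name in FLOW_FEATURES])
        labels.append(1 if "botnet" in record[label_column].lower() else 0)
    return rows, labels


# Made-up rows in the assumed layout, not real CTU-13 data.
SAMPLE = """Dur,SrcBytes,DstBytes,TotPkts,TotBytes,SrcPkts,DstPkts,Label
0.52,120,300,4,420,2,2,flow=From-Normal-V42-UDP-DNS
3.10,4000,150,40,4150,35,5,flow=From-Botnet-V42-UDP-DNS
1.00,80,80,2,160,1,1,flow=Background-UDP-Established
"""

rows, labels = load_flows(io.StringIO(SAMPLE))
```

Because these flow features differ from the seven domain-string features of Phase 1, a model fitted on one set cannot score the other directly, which is consistent with the separate `rf_real.joblib` and `scaler_real.joblib` files.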

📊 model_results.xlsx — 9 Sheets

| Sheet | Contents |
|---|---|
| Synthetic Results | Accuracy, Precision, Recall, F1 |
| Synthetic Conf Matrix | TP/TN/FP/FN |
| Synthetic Predictions | 200 sample predictions |
| Synthetic Feature Imp | Feature importance scores |
| Real Data Results | CTU-13 metrics |
| Real Data Conf Matrix | CTU-13 TP/TN/FP/FN |
| Real Data Predictions | 200 CTU-13 predictions |
| Real Feature Importance | CTU-13 feature scores |
| Comparison | Side-by-side metric delta |
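
The metrics in the results sheets follow directly from the confusion-matrix counts. The sketch below shows the standard formulas; the function name `metrics_from_counts` and the example counts are made up for illustration and are not results from this project.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def metric_delta(first, second):
    """Per-metric difference (second minus first), e.g. for a comparison sheet."""
    return {name: second[name] - first[name] for name in first}


# Illustrative counts only.
m = metrics_from_counts(tp=90, tn=95, fp=5, fn=10)
delta = metric_delta(m, metrics_from_counts(tp=80, tn=95, fp=5, fn=20))
```

With these example counts, accuracy is 185/200 = 0.925 and recall is 90/100 = 0.9.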

🖥️ Web UI

Built with Bootstrap 5 — dark, modern, responsive.

  • Enter domain → click Analyse (or press Enter)
  • See: Verdict badge · Confidence bar · 7-feature breakdown table
  • Charts embedded: feature importance, confusion matrices, comparison
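
One plausible way to derive the verdict badge and confidence bar is from the classifier's class probabilities (scikit-learn's `RandomForestClassifier` provides these via `predict_proba`). The sketch below assumes the confidence shown is simply the probability of the winning class; the helper `verdict_from_proba` and the label order are hypothetical and may not match `app.py`.

```python
def verdict_from_proba(proba, labels=("Normal", "Malicious")):
    """Map one sample's class probabilities to (verdict, confidence in percent).

    Assumption: `proba` is ordered like `labels`, e.g. one row of
    model.predict_proba(...) for a model whose classes are [0, 1].
    """
    best = max(range(len(proba)), key=lambda i: proba[i])
    return labels[best], round(100 * proba[best], 1)


verdict, confidence = verdict_from_proba([0.12, 0.88])
```

For the probabilities `[0.12, 0.88]` this returns the verdict `"Malicious"` with a confidence of `88.0`.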

Example Domains

| Domain | Expected |
|---|---|
| google.com | ✅ Normal |
| mail.yahoo.com | ✅ Normal |
| dGhpcyBpcyBhIHRlc3Q.tunnel.xyz | ⚠️ Malicious |
| a1b2c3defghijklmnopqrstuvwxyz1234567890.evil.com | ⚠️ Malicious |
| sub1.sub2.sub3.sub4.botnet.ru | ⚠️ Malicious |
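
To see how such domains differ numerically, the sketch below computes the seven Phase 1 features from a domain string. It is an approximation and not the code in `utils.py`: the function names, the definition of `num_subdomains` (labels beyond the last two), and passing `query_frequency` in as a parameter (it cannot be derived from the name alone) are all assumptions.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy in bits per character of the string."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def extract_features(domain, query_frequency=1):
    """Return the seven features for one domain (hypothetical definitions)."""
    domain = domain.strip().rstrip(".")
    labels = domain.split(".")
    return {
        "domain_length": len(domain),
        "entropy": shannon_entropy(domain),
        # Naive assumption: everything left of the last two labels is a subdomain.
        "num_subdomains": max(len(labels) - 2, 0),
        "query_frequency": query_frequency,
        "digit_count": sum(ch.isdigit() for ch in domain),
        # Counts characters that are neither alphanumeric nor a dot.
        "special_char_count": sum(
            (not ch.isalnum()) and ch != "." for ch in domain),
        "longest_label_len": max(len(label) for label in labels),
    }


normal = extract_features("google.com")
tunnel = extract_features("dGhpcyBpcyBhIHRlc3Q.tunnel.xyz")
nested = extract_features("sub1.sub2.sub3.sub4.botnet.ru")
```

Under these definitions the encoded-looking tunnel domain has a higher character entropy and a longer first label than `google.com`, while the nested domain stands out through its four subdomain labels.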

📦 Dependencies

flask>=2.3.0
scikit-learn>=1.3.0
pandas>=2.0.0
numpy>=1.24.0
matplotlib>=3.7.0
seaborn>=0.12.0
openpyxl>=3.1.0
joblib>=1.3.0

🔄 What Changed from the Previous Version (v1)

| Component | Change |
|---|---|
| train_model.py | Full rewrite: 2-phase hybrid pipeline |
| dataset/ | Added ctu13_sample.csv (CTU-13 binetflow) |
| model_results.xlsx | Expanded from 4 → 9 sheets |
| templates/index.html | Redesigned with Bootstrap 5 dark theme |
| static/ | 5 charts instead of 2 |
| models/ | Now saves 4 files (2 models + 2 scalers) |

🌐 Deployment

Hosted on Vercel
