A machine-learning system that detects DNS tunneling and botnet C2 activity by analysing statistical features of DNS queries. It now uses a hybrid data approach: the primary model is trained on synthetic DNS data, and the pipeline is validated on real CTU-13 network flow data.
- Hybrid ML pipeline
- Synthetic DNS training
- CTU-13 validation
- Random Forest classifier
- Interactive Bootstrap 5 interface
- Confidence scoring
- Feature importance visualization
- Confusion matrices
- Excel-based performance report
- Vercel deployment
Deployed Application
`pip install -r requirements.txt`

`python train_model.py`

This will run two phases:
- Phase 1 — Generate & train on 2,000 synthetic DNS records
- Phase 2 — Load & validate on real CTU-13 binetflow data
Outputs generated:
- `models/rf_model.joblib` + `models/scaler.joblib` ← used by Flask UI
- `models/rf_real.joblib` + `models/scaler_real.joblib` ← CTU-13 model
- `model_results.xlsx` ← 9 sheets with full results
- `static/*.png` ← 5 charts (feature importance, confusion matrices, comparison)
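As a rough illustration of how Phase 1 could produce and reuse the model and scaler files listed above, here is a minimal sketch. It is not the code in `train_model.py`: the random placeholder data, the labelling rule, the hyperparameters, and the temporary output directory are all assumptions made for the example.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data (assumption): 2,000 rows x 7 features with a made-up
# labelling rule, standing in for the synthetic DNS dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 7))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training split only, then train the forest.
scaler = StandardScaler().fit(X_train)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(scaler.transform(X_train), y_train)

# Persist both artefacts and load them back, as a UI process would.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "rf_model.joblib"
    scaler_path = Path(tmp) / "scaler.joblib"
    joblib.dump(model, model_path)
    joblib.dump(scaler, scaler_path)
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# Confidence is taken here as the highest class probability (assumption).
proba = loaded_model.predict_proba(loaded_scaler.transform(X_test[:1]))[0]
label = int(loaded_model.classes_[proba.argmax()])
confidence = float(proba.max())
test_accuracy = loaded_model.score(loaded_scaler.transform(X_test), y_test)
```

Saving the scaler alongside the model matters because inputs at prediction time must be scaled with the same statistics that were learned from the training split.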
`python app.py`

dns_detection/
│
├── dataset/
│ ├── dns_data.csv ← synthetic DNS dataset (auto-generated)
│ └── ctu13_sample.csv ← CTU-13 real binetflow data
│
├── models/
│ ├── rf_model.joblib ← trained synthetic RF model (used by UI)
│ ├── scaler.joblib ← synthetic scaler
│ ├── rf_real.joblib ← trained CTU-13 RF model
│ └── scaler_real.joblib ← CTU-13 scaler
│
├── static/
│ ├── css/style.css
│ ├── feature_importance.png ← synthetic model chart
│ ├── confusion_matrix.png ← synthetic confusion matrix
│ ├── real_feature_importance.png ← CTU-13 model chart
│ ├── real_confusion_matrix.png ← CTU-13 confusion matrix
│ └── comparison_chart.png ← side-by-side comparison
│
├── templates/
│ └── index.html ← modern Bootstrap 5 UI
│
├── app.py ← Flask application
├── train_model.py ← hybrid ML pipeline
├── utils.py ← DNS feature extraction
├── generate_dataset.py ← synthetic data generator
├── model_results.xlsx ← 9-sheet Excel report
├── project_details.txt ← full documentation
├── requirements.txt
└── README.md
| Phase | Dataset | Features | Purpose |
|---|---|---|---|
| Phase 1 | Synthetic (2,000 rows) | domain_length, entropy, num_subdomains, query_frequency, digit_count, special_char_count, longest_label_len | Train primary model for UI |
| Phase 2 | CTU-13 real binetflow | Dur, SrcBytes, DstBytes, TotPkts, TotBytes, SrcPkts, DstPkts | Validate on real network flows |
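To make the Phase 1 feature names concrete, the sketch below computes all seven from a domain string using only the standard library. It is an illustration, not the contents of `utils.py`: the function names, the treatment of the last two labels as the registered domain, and the definition of "special character" are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(domain: str, query_frequency: int = 1) -> dict:
    """Compute the seven Phase 1 features for a single domain name.

    Assumptions (not confirmed by the project): the last two labels are
    treated as the registered domain, and a "special character" is
    anything other than a letter, a digit, or a dot.
    """
    domain = domain.strip().lower().rstrip(".")
    labels = [label for label in domain.split(".") if label]
    return {
        "domain_length": len(domain),
        "entropy": round(shannon_entropy(domain), 4),
        "num_subdomains": max(len(labels) - 2, 0),
        # Query frequency cannot be derived from the name; the caller supplies it.
        "query_frequency": query_frequency,
        "digit_count": sum(ch.isdigit() for ch in domain),
        "special_char_count": sum(not (ch.isalnum() or ch == ".") for ch in domain),
        "longest_label_len": max((len(label) for label in labels), default=0),
    }


features = extract_features("sub1.sub2.sub3.sub4.botnet.ru")
```

Long, high-entropy labels and deep subdomain nesting are the kinds of signal these features are meant to expose; encoded payloads in tunneling traffic tend to raise `entropy` and `longest_label_len` together.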
| Sheet | Contents |
|---|---|
| Synthetic Results | Accuracy, Precision, Recall, F1 |
| Synthetic Conf Matrix | TP/TN/FP/FN |
| Synthetic Predictions | 200 sample predictions |
| Synthetic Feature Imp | Feature importance scores |
| Real Data Results | CTU-13 metrics |
| Real Data Conf Matrix | CTU-13 TP/TN/FP/FN |
| Real Data Predictions | 200 CTU-13 predictions |
| Real Feature Importance | CTU-13 feature scores |
| Comparison | Side-by-side metric delta |
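The metrics in the results sheets all follow from the four confusion-matrix counts, and the comparison is a per-metric difference. The sketch below shows that arithmetic under stated assumptions: the function names are hypothetical, the delta is taken as real minus synthetic, and the counts are made-up illustration values, not results from this project.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Derive accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def metric_delta(synthetic: dict, real: dict) -> dict:
    """Per-metric difference, real minus synthetic (direction is an assumption)."""
    return {name: real[name] - synthetic[name] for name in synthetic}


# Made-up counts for illustration only; these are not project results.
synthetic_metrics = classification_metrics(tp=40, tn=50, fp=10, fn=0)
real_metrics = classification_metrics(tp=30, tn=50, fp=10, fn=10)
delta = metric_delta(synthetic_metrics, real_metrics)
```

A negative delta under this convention means the metric dropped when moving from synthetic to real data.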
Built with Bootstrap 5 — dark, modern, responsive.
- Enter domain → click Analyse (or press Enter)
- See: Verdict badge · Confidence bar · 7-feature breakdown table
- Charts embedded: feature importance, confusion matrices, comparison
| Domain | Expected |
|---|---|
| `google.com` | ✅ Normal |
| `mail.yahoo.com` | ✅ Normal |
| `dGhpcyBpcyBhIHRlc3Q.tunnel.xyz` | |
| `a1b2c3defghijklmnopqrstuvwxyz1234567890.evil.com` | |
| `sub1.sub2.sub3.sub4.botnet.ru` | |
flask>=2.3.0
scikit-learn>=1.3.0
pandas>=2.0.0
numpy>=1.24.0
matplotlib>=3.7.0
seaborn>=0.12.0
openpyxl>=3.1.0
joblib>=1.3.0
| Component | Change |
|---|---|
| `train_model.py` | Full rewrite: 2-phase hybrid pipeline |
| `dataset/` | Added `ctu13_sample.csv` (CTU-13 binetflow) |
| `model_results.xlsx` | Expanded from 4 → 9 sheets |
| `templates/index.html` | Redesigned with Bootstrap 5 dark theme |
| `static/` | 5 charts instead of 2 |
| `models/` | Now saves 4 files (2 models + 2 scalers) |
Hosted on Vercel