Hyb-DysNet — Hybrid Feature Fusion and Ensemble Learning for Dysarthria Severity Classification in ALS Patients
Authors: Simone Cioffi, Emanuel Di Nardo, Angelo Ciaramella

Affiliation: Department of Science and Technology, University of Naples Parthenope, Naples, Italy

Venue: AIPHEA2026: AI in Predictive HEAlth: architectures for prevention, IJCNN, June 22, 2026, Maastricht, NL
This repository contains the implementation of Hyb-DysNet, a framework for automatic dysarthria severity classification in ALS patients, developed for Task 1 of the SAND (Speech-based Assessment of Neurological Disorders) Challenge.
Task 1 is a 5-class classification problem: given 8 audio recordings from a patient at an initial clinical visit, classify the severity of dysarthria according to the ALSFRS-R speech subscore.
Hyb-DysNet achieves Macro F1 = 0.69 on the official held-out test set.
Voice recordings were acquired at the ALS Centre of the Federico II University Hospital of Naples using the Vox4Health mobile application, during routine outpatient clinical visits.
| Split | Subjects | Audio Files (8 per subject) |
|---|---|---|
| Training | 219 | 1,752 |
| Validation | 53 | 424 |
| Test | 67 | 536 |
Severity classes (ALSFRS-R speech subscore):
| Label | Class | ALSFRS-R |
|---|---|---|
| 1 | Severe | ≤ 1 |
| 2 | Moderate | 2 |
| 3 | Mild | 3 |
| 4 | No Dysarthria | 4 |
| 5 | Healthy | 5 |
The dataset is severely imbalanced: Class 1 (Severe) contains only ~6 subjects in total.
Vocal tasks (8 per subject):
- Sustained phonation of the 5 Italian vowels: /a/, /e/, /i/, /o/, /u/
- Diadochokinetic (DDK) sequences: /pa/, /ta/, /ka/
- Resampled to 16 kHz
- Padded or truncated to a fixed duration of 5 seconds
- Missing recordings imputed with zeros
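The length normalization and zero imputation listed above can be sketched in a few lines. This is a minimal, standard-library illustration and not the notebook's code: the function name `fix_length`, the use of `None` to mark a missing recording, and padding at the end of the clip are all assumptions. Resampling to 16 kHz is assumed to have happened beforehand.

```python
TARGET_SR = 16_000                       # sampling rate stated above (Hz)
TARGET_SECONDS = 5                       # fixed clip duration stated above
TARGET_LEN = TARGET_SR * TARGET_SECONDS  # 80,000 samples per clip


def fix_length(samples, target_len=TARGET_LEN):
    """Truncate or zero-pad a mono waveform to exactly target_len samples.

    `samples` is a sequence of floats already resampled to 16 kHz, or None
    for a missing recording (hypothetical convention for this sketch).
    """
    if samples is None:
        # Missing recording: impute an all-zero clip.
        return [0.0] * target_len
    clipped = list(samples[:target_len])                  # truncate if too long
    return clipped + [0.0] * (target_len - len(clipped))  # right-pad if too short
```

In practice the same logic would normally operate on NumPy arrays or tensors; plain lists are used here only to keep the sketch dependency-free.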
| Stream | Tool / Model | Features |
|---|---|---|
| eGeMAPSv02 | OpenSMILE | 88 features |
| Deep embeddings | Wav2Vec2-XLS-R (mean pooling) | 1,024 features |
| Spectral descriptors | Librosa (MFCC, ZCR, SC, etc.) | ~90 features |
| Total | Concatenated hybrid vector | >1,200 |
eGeMAPSv02 captures clinically interpretable parameters: pitch, jitter, shimmer, formant frequencies (F1/F2/F3), HNR, loudness, Alpha Ratio, Hammarberg Index. Wav2Vec2-XLS-R is used as a frozen feature extractor; the last transformer layer is mean-pooled over the time dimension.
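As a rough illustration of the mean pooling and concatenation described above, the sketch below averages a (time × dimension) matrix over time and joins the three streams into one vector. The helper names `mean_pool` and `fuse` are hypothetical, the values are placeholders, and the spectral stream is fixed at 90 features purely for the example (the table gives it as approximately 90).

```python
def mean_pool(frames):
    """Average a non-empty (time x dim) list of frame vectors over the time axis."""
    n_frames = len(frames)
    return [sum(column) / n_frames for column in zip(*frames)]


def fuse(egemaps, embedding, spectral):
    """Concatenate the three per-file feature streams into one hybrid vector."""
    return list(egemaps) + list(embedding) + list(spectral)


# Toy stand-ins with the dimensions from the table (values are placeholders).
hidden_states = [[float(t)] * 1024 for t in range(3)]  # 3 frames x 1,024 dims
embedding = mean_pool(hidden_states)                   # 1,024 values, each 1.0
hybrid = fuse([0.0] * 88, embedding, [0.0] * 90)       # 90 is an assumed size
print(len(hybrid))  # 1202
```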
- SMOTE (training set only) — oversamples minority classes until all 5 are equally represented → 3,440 balanced training samples
- Z-score standardization — StandardScaler fitted on training, applied to validation/test
- Feature selection — SelectKBest (ANOVA F-value, k=200)
No information from the validation or test partition is used at any stage of training, scaling, or feature selection.
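The leakage-free standardization step can be illustrated without any ML library: statistics are estimated on training rows only and then reused unchanged for validation or test rows. This is a sketch of the idea, not the repository's code; `fit_scaler` and `transform` are hypothetical names, and the toy data is made up.

```python
from statistics import mean, pstdev


def fit_scaler(train_rows):
    """Compute per-feature mean and standard deviation from training rows only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # Population std (ddof=0), matching scikit-learn's StandardScaler;
    # constant features get a scale of 1.0 to avoid division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Apply previously fitted statistics to any split."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


train = [[1.0, 10.0], [3.0, 10.0]]  # toy training rows
val = [[5.0, 12.0]]                 # toy validation row

means, stds = fit_scaler(train)          # fitted on training data only
train_z = transform(train, means, stds)  # [[-1.0, 0.0], [1.0, 0.0]]
val_z = transform(val, means, stds)      # [[3.0, 2.0]]
```

The validation row is scaled with the training mean and deviation, so it may fall well outside the range seen in training. That is expected and is what keeps the evaluation free of leakage.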
Three heterogeneous classifiers are combined via weighted soft voting (weights: XGBoost 2, LightGBM 2, Logistic Regression 1):
| Classifier | Key Hyperparameters |
|---|---|
| XGBoost | n_estimators=2000, learning_rate=0.01, max_depth=6, subsample=0.8, GPU |
| LightGBM | n_estimators=2000, learning_rate=0.01, num_leaves=31, class_weight=balanced, GPU |
| Logistic Regression | C=0.1, class_weight=balanced |
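Weighted soft voting averages the class-probability vectors of the individual models using the stated weights (2, 2, 1) and picks the class with the highest averaged probability. The sketch below shows that arithmetic in plain Python. The function name `soft_vote` is hypothetical and the three probability vectors are invented for illustration; they are not outputs of the trained models.

```python
def soft_vote(probas, weights):
    """Weighted average of per-model class probabilities.

    Returns (predicted_label, averaged_probabilities); labels run from 1 to 5
    to match the severity table in this README.
    """
    total = sum(weights)
    n_classes = len(probas[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probas)) / total
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=averaged.__getitem__)
    return best + 1, averaged


# Hypothetical per-model probabilities for one audio file (classes 1..5).
xgb_proba = [0.10, 0.20, 0.40, 0.20, 0.10]
lgbm_proba = [0.05, 0.15, 0.30, 0.40, 0.10]
lr_proba = [0.10, 0.10, 0.20, 0.50, 0.10]

label, avg = soft_vote([xgb_proba, lgbm_proba, lr_proba], weights=[2, 2, 1])
print(label)  # 4
```

Here XGBoost alone would predict class 3, but the weighted average favors class 4 (0.34 versus 0.32), which shows how the lower-weighted Logistic Regression can still tip a close decision.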
Predictions are made independently on each of the 8 audio files per subject. The final subject-level severity class is obtained by majority voting over the 8 per-file predictions, aggregating evidence across vowel phonation and DDK tasks.
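The subject-level aggregation can be sketched as a majority vote over the per-file labels. The function name `subject_label` is hypothetical, and the tie-breaking rule (prefer the lowest label, i.e. the more severe class) is an assumption made for this example; the README does not say how ties are resolved.

```python
from collections import Counter


def subject_label(file_predictions):
    """Return the most frequent per-file label for one subject.

    Ties are broken in favor of the lowest label (assumption for this sketch).
    """
    counts = Counter(file_predictions)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)


# Eight per-file predictions (5 vowels + 3 DDK tasks) for one hypothetical subject.
print(subject_label([5, 5, 4, 5, 3, 5, 4, 5]))  # 5
```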
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Severe | 0.33 | 0.06 | 0.11 | 16 |
| Moderate | 0.39 | 0.44 | 0.41 | 32 |
| Mild | 0.40 | 0.30 | 0.35 | 96 |
| No Dysarthria | 0.35 | 0.30 | 0.33 | 112 |
| Healthy | 0.57 | 0.73 | 0.64 | 168 |
| Macro avg | 0.41 | 0.37 | 0.37 | 424 |
Accuracy: 47.41%; Macro F1: 0.3656 (per-file predictions on the 424 validation recordings)
| Team | Macro F1 |
|---|---|
| Hyb-DysNet (Ours) | 0.6900 |
| TUKE (Technical University of Kosice) | 0.6079 |
| UTL (University of Texas at Austin) | 0.6005 |
| PRIME (Université de Moncton) | 0.5945 |
```
.
├── notebook/
│   └── sand-challenge-task-1-submission.ipynb   # Full pipeline notebook
├── models/
│   ├── ensemble_opensmile.joblib                # Trained VotingClassifier (XGB + LGB + LR)
│   ├── scaler_opensmile.joblib                  # StandardScaler (OpenSMILE-only version)
│   ├── scaler_hybrid.joblib                     # StandardScaler (hybrid pipeline)
│   └── selector_hybrid.joblib                   # SelectKBest (k=200)
├── submissions/
│   └── submission_task1.csv                     # Test set predictions (67 subjects, classes 1–5)
└── README.md
```
The primary metric is the macro F1-score: the unweighted mean of the per-class F1-scores. Every class contributes equally regardless of its size, which is critical given the severe class imbalance.
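As a worked example of the metric, the snippet below recomputes the macro average from the rounded per-class F1-scores in the per-class results table above. The helper names are hypothetical. The result (0.368) differs slightly from the reported 0.3656, presumably because the table values are rounded to two decimals.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1-scores; class support is ignored."""
    return sum(per_class_f1) / len(per_class_f1)


# Rounded per-class F1-scores copied from the results table.
reported_f1 = {
    "Severe": 0.11,
    "Moderate": 0.41,
    "Mild": 0.35,
    "No Dysarthria": 0.33,
    "Healthy": 0.64,
}

macro = macro_f1(list(reported_f1.values()))
print(round(macro, 3))  # 0.368
```

Note that the Severe class, with only 16 of the 424 files, pulls the macro average down just as strongly as the Healthy class with 168 files pulls it up.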
```bibtex
@inproceedings{cioffi2026hybdysnet,
  title     = {Hybrid Feature Fusion and Ensemble Learning for Dysarthria Severity Classification in ALS Patients},
  author    = {Cioffi, Simone and Di Nardo, Emanuel and Ciaramella, Angelo},
  booktitle = {AIPHEA2026: AI in Predictive HEAlth: architectures for prevention, IJCNN},
  year      = {2026},
  address   = {Maastricht, NL}
}
```