Protein antigenicity classifier using Machine Learning.
Final project of the Saturdays.AI Machine Learning track.
Given a FASTA file containing protein sequences from a pathogen, the system predicts which ones are most likely to be antigenic, that is, recognized by the human immune system.
The output is a ranked table of antigenicity scores (0–1) per protein, plus a sorted bar chart of candidates. Each prediction is accompanied by a natural language explanation generated by Gemini 2.0 Flash, offering biological reasoning behind the score.
This kind of tool could be used as a fast in silico filter to prioritize vaccine candidate antigens, reducing the initial search time. It does not replace experimental assays or clinical trials.
This is an educational Machine Learning project applied to bioinformatics. The goal is to build a complete pipeline, from obtaining and cleaning real data to deploying a functional web interface, while learning to critically evaluate the results.
The pathogens used as case studies are SARS-CoV-2 and Influenza A, chosen for their clinical relevance and the abundance of available experimental data.
Full technical write-up: Cazando antígenos con Machine Learning: cómo construimos AiGenix
The project is structured as five sequential notebooks, each producing a concrete artifact:
| Step | Notebook | Artifact | Size |
|---|---|---|---|
| 1 | download_files | raw data | 3.9 GB |
| 2 | 00_acquisition | filtered CSV | 29 MB |
| 3 | 01_exploration | protein_labels | 1,365 proteins |
| 4 | 02_construction | dataset.csv | 1,310 proteins |
| 5 | 03_model | model.pkl | RF + meta |
- download_files: Downloads the three raw IEDB export files (~3.9 GB total): T-cell, B-cell, and antigen assays.
- 00_acquisition: Processes files in 100k-row chunks, retains only the 5 relevant columns, and filters to SARS-CoV-2 and Influenza A. Reduces the data by 99.3% (to 158,289 assays / 29 MB).
- 01_exploration: Shifts the unit of analysis from epitopes to proteins. Labels each protein `1` if it has at least one positive assay, `0` if all assays are negative. Produces 1,365 labeled proteins.
- 02_construction: Fetches full amino acid sequences from the UniProt and NCBI APIs and calculates 24 physicochemical features per protein with Biopython. Produces `dataset.csv` (1,310 × 29).
- 03_model: Trains and evaluates three classifiers. Produces `model.pkl`.
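The chunked filtering step in 00_acquisition can be sketched as follows. This is a minimal illustration, not the notebook's code: the column names (`protein_id`, `organism`, `assay_result`) and the organism strings are hypothetical stand-ins, and the real IEDB export headers and organism names differ.

```python
import io

import pandas as pd

# Hypothetical column names; the real IEDB export uses different headers
# and the project keeps 5 columns rather than 3.
KEEP_COLUMNS = ["protein_id", "organism", "assay_result"]
TARGET_ORGANISMS = ("SARS-CoV-2", "Influenza A")


def filter_in_chunks(source, chunksize=100_000):
    """Stream a large CSV and keep only rows for the target organisms."""
    kept = []
    # usecols drops unneeded columns at read time; chunksize bounds memory use.
    reader = pd.read_csv(source, usecols=KEEP_COLUMNS, chunksize=chunksize)
    for chunk in reader:
        mask = chunk["organism"].isin(TARGET_ORGANISMS)
        kept.append(chunk.loc[mask])
    return pd.concat(kept, ignore_index=True)


# Tiny in-memory stand-in for a multi-gigabyte export file.
demo_csv = io.StringIO(
    "protein_id,organism,assay_result,unused\n"
    "P1,SARS-CoV-2,Positive,x\n"
    "P2,Dengue virus,Positive,x\n"
    "P3,Influenza A,Negative,x\n"
)
filtered = filter_in_chunks(demo_csv, chunksize=2)
print(filtered)
```

Only the filtered rows of each chunk are kept in memory, which is what makes a 3.9 GB input workable on a small machine.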
Source: IEDB (Immune Epitope Database)
IEDB is a free, public database containing experimental epitope data: protein fragments recognized by the immune system in laboratory assays. It is funded by the National Institute of Allergy and Infectious Diseases (NIAID) of the United States.
Dataset construction strategy:
- Download of the full IEDB export files (T-cell, B-cell, antigen assays).
- Processing in chunks; filtering by organism (SARS-CoV-2 and Influenza A) and keeping each assay's positive or negative result.
- Aggregation from 158,289 assays to 1,365 unique proteins.
- Sequence retrieval from UniProt and NCBI (96% match rate, 1,310 proteins).
- Binary labeling: `label = 1` for proteins with at least one experimentally validated epitope, `label = 0` for proteins from the same pathogen with no antigenicity evidence.
The resulting dataset has a 7:1 class imbalance (1,198 antigenic vs. 167 non-antigenic proteins), handled with `class_weight="balanced"` in the model.
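The labeling rule and the effect of balanced class weights can be illustrated with a short sketch. The assay table and its column names are hypothetical; the weight formula, `n_samples / (n_classes * class_count)`, is the one scikit-learn documents for its "balanced" mode, applied here to the class counts reported above.

```python
import pandas as pd

# Hypothetical assay table: one row per assay, 1 = positive, 0 = negative.
assays = pd.DataFrame({
    "protein_id": ["P1", "P1", "P2", "P2", "P3"],
    "positive": [0, 1, 0, 0, 1],
})

# A protein is labeled 1 if any of its assays is positive, else 0.
labels = assays.groupby("protein_id")["positive"].max().rename("label")
print(labels.to_dict())


def balanced_weights(counts):
    """Per-class weights: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Class counts reported above: 1,198 antigenic vs. 167 non-antigenic.
weights = balanced_weights({1: 1198, 0: 167})
print({cls: round(w, 2) for cls, w in weights.items()})
```

With these counts, each non-antigenic protein weighs roughly seven times as much as an antigenic one during training, which offsets the 7:1 imbalance.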
Note: The raw IEDB files (~3.9 GB) are not included in this repository. They must be downloaded from iedb.org/database_export_v3.php.
Calculated with Biopython from the amino acid sequence of each protein (24 features total):
| Feature | Description |
|---|---|
| Length | Number of amino acids |
| Amino acid composition | Percentage of each of the 20 standard amino acids |
| Molecular weight | In Daltons |
| Isoelectric point | pH at which the net charge of the protein is zero |
| Mean hydrophobicity | GRAVY index; negative values indicate hydrophilic proteins more likely to be surface-exposed |
No protein language model embeddings or 3D structure features are used.
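As a rough illustration of what these features look like, the sketch below computes length, amino acid composition, and GRAVY in plain Python using the Kyte-Doolittle hydropathy scale. It is not the project's code, and the function and feature names are made up for this example. The project uses Biopython, whose `ProteinAnalysis` class in `Bio.SeqUtils.ProtParam` offers `molecular_weight()`, `isoelectric_point()`, and `gravy()`. Molecular weight and isoelectric point are omitted here, and skipping non-standard residues is an assumption.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_features(sequence):
    """Return length, per-residue percentages, and GRAVY for a protein."""
    # Assumption: residues outside the 20 standard ones (X, B, Z, ...) are skipped.
    residues = [aa for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    length = len(residues)
    features = {"length": length}
    for aa in KYTE_DOOLITTLE:
        features[f"pct_{aa}"] = 100.0 * residues.count(aa) / length
    # GRAVY is the mean hydropathy value over all residues.
    features["gravy"] = sum(KYTE_DOOLITTLE[aa] for aa in residues) / length
    return features


features = sequence_features("MKTLLVAGGE")
print(len(features), round(features["gravy"], 3))
```

A negative `gravy` value marks a sequence as hydrophilic on average, which is the property the table associates with surface exposure.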
Most predictive features: Molecular weight and GRAVY index consistently led feature importance rankings. The model effectively learned that hydrophilic proteins (negative GRAVY) tend to be surface-exposed on the virus, making them more accessible to the immune system, without ever seeing a 3D structure.
- Main algorithm: Random Forest (200 trees, `class_weight="balanced"`)
- Validation: Stratified cross-validation (k=5) + independent hold-out test set (20%)
- Baselines: Majority classifier and Logistic Regression
- Serialization: `models/model.pkl` with joblib
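The setup listed above can be sketched with scikit-learn as follows. This is a minimal example on synthetic data, not the contents of 03_model: the data comes from `make_classification`, the random seeds and temporary output path are arbitrary, and only the majority baseline is shown.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)

# Synthetic stand-in for dataset.csv: 24 features, roughly 7:1 imbalance.
X, y = make_classification(
    n_samples=400, n_features=24, n_informative=6,
    weights=[0.125, 0.875], random_state=0,
)

# Hold out 20% as an independent test set, preserving class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)

# Stratified 5-fold cross-validation on the training portion only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")

model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Majority-class baseline: constant scores, so no ranking ability.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])

print(f"CV AUC: {cv_auc.mean():.2f} +/- {cv_auc.std():.2f}")
print(f"Test AUC: {test_auc:.2f} (baseline {baseline_auc:.2f})")

# Persist and reload the fitted model with joblib.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(model, path)
reloaded = joblib.load(path)
```

Running cross-validation only on the training split keeps the hold-out set untouched until the final evaluation.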
| Model | AUC-ROC (CV) | AUC-ROC (Test) | F1-score (Test) | Recall (Test) |
|---|---|---|---|---|
| Random Forest | 0.72 ± σ | 0.65 | 0.943 | 1.0 |
The most important result is Recall = 1.0 on the independent test set: the model detected 100% of antigenic proteins without a single false negative. In a vaccine candidate screening context, this means no potential antigen is discarded in the initial filter, which is the critical safety property for this use case.
The drop from CV AUC (0.72) to test AUC (0.65) reflects a known structural limitation: antigenicity depends on the 3D folded surface of the protein, which sequence-only features can only approximate statistically.
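The reported F1-score and recall also determine the test precision, which helps put the recall figure in context. The short calculation below solves F1 = 2PR / (P + R) for P. The class balance of the test split is not stated, so the antigenic share of the full labeled set is used as a rough reference; that comparison is an assumption.

```python
def precision_from_f1(f1, recall):
    """Solve F1 = 2PR / (P + R) for precision P."""
    return f1 * recall / (2 * recall - f1)


# Values from the results table above.
precision = precision_from_f1(f1=0.943, recall=1.0)

# Share of antigenic proteins among the 1,365 labeled ones (assumed to be
# close to the share in the stratified test split).
antigenic_share = 1198 / (1198 + 167)

print(f"implied precision: {precision:.3f}")
print(f"antigenic share:   {antigenic_share:.3f}")
```

The implied precision sits close to the antigenic share, which suggests that most test proteins were predicted positive at the default threshold. Recall is therefore best read together with AUC when judging how well the scores rank candidates.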
- Upload any FASTA file with protein sequences
- Instant antigenicity score (0–1) per protein, ranked by likelihood
- Bar chart visualization of candidates
- AI-powered biological explanation for each prediction, generated by Gemini 2.0 Flash
Live at aigenix.streamlit.app
AIGENIX/
│
├── app/
│   └── app.py
│
├── assets/
│
├── data/
│   ├── processed/
│   └── raw/
│
├── docs/
│   ├── glossary_of_terms.md
│   └── project_viability_analysis.md
│
├── models/
│   └── best_model_mvp.pkl
├── notebooks/
│   ├── download_files.ipynb
│   ├── 00_acquisition.ipynb
│   ├── 01_exploration.ipynb
│   ├── 02_construction.ipynb
│   ├── 03_model.ipynb
│   ├── 04_model_comparison.ipynb
│   └── 05_overfitting_analysis.ipynb
│
├── sample_input/
│   ├── influenza_a_h1n1.fasta
│   └── sars_cov2_structurals.fasta
│
├── src/
│   └── train_model_mvp.py
│
├── main.py
├── check_model.py
├── pyproject.toml
└── README.md
Python 3.9+
biopython
scikit-learn
pandas
matplotlib
streamlit
joblib
Installation:
pip install biopython scikit-learn pandas matplotlib streamlit joblib
Model training:
Run the notebooks in order from Google Colab: download_files → 00 → 01 → 02 → 03.
Web interface:
cd app
streamlit run app.py
Or access the deployed app directly at aigenix.streamlit.app.
Upload a FASTA file with the proteins to evaluate. The app loads the saved model and returns the ranked score table, the chart, and a Gemini-generated biological explanation per protein.
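The parse-score-rank flow described above can be sketched in plain Python. This is an illustration only: `parse_fasta`, `rank_proteins`, and `toy_score` are hypothetical names, and `toy_score` is a placeholder for the step where the real app would compute features and call the saved model.

```python
def parse_fasta(text):
    """Return {header: sequence} from FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            # Sequence lines are appended to the current record.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def rank_proteins(records, score_fn):
    """Score each sequence and sort from highest to lowest score."""
    scored = [(name, score_fn(seq)) for name, seq in records.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def toy_score(seq):
    """Placeholder score (fraction of lysines), NOT the trained model."""
    return seq.count("K") / len(seq)


demo = """>prot_a
MKKK
KA
>prot_b
MAAA
"""
records = parse_fasta(demo)
ranking = rank_proteins(records, toy_score)
for name, score in ranking:
    print(f"{name}\t{score:.2f}")
```

Passing the scoring function as an argument keeps the parsing and ranking logic independent of whichever model produces the scores.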
- The model is trained exclusively on SARS-CoV-2 and Influenza A data. Its ability to generalize to other pathogens is unknown.
- Features are based solely on sequence. 3D structure, glycosylation, and cellular processing are not considered.
- The dataset contains label noise: proteins labeled as non-antigenic may simply be understudied, not truly non-antigenic. In immunology, absence of evidence is not evidence of absence.
- A high score does not guarantee that a protein is a good vaccine antigen. It is an indicative filter, not a clinical predictor.
Developed as the capstone project of the Saturdays.AI Machine Learning track, Madrid edition.
| Name | GitHub |
|---|---|
| Iris Amorim | GitHub |
| Alejandro Aparicio | GitHub |
| Joaquin Lazaro | GitHub |
We welcome contributions of all kinds: bug reports, feature suggestions, improvements to the model or the app.
Saturdays.AI is a global, volunteer-driven program where participants learn machine learning by building real projects over the course of several weeks. This project was developed as the capstone project of the Machine Learning track.
