External Biological Validation of Foundation-Model Gene Regulatory Networks: Perturbation Bridging, ChIP-Seq Binding Support, and Essential-Gene Agreement
Standard evaluation of GRN inference from single-cell foundation models compares predicted edges against a single curated reference database, conflating reference biases with inference quality. This paper presents a three-modality external validation framework that tests foundation-model GRNs against independent biological evidence:
- Perturbation bridging: Functional evidence from Perturb-seq experiments measuring causal transcriptional consequences of gene knockouts.
- ChIP-seq binding support: Physical binding evidence from five ChIP-seq atlases (ChEA 2015/2016/2022, ENCODE 2014/2015).
- Essential-gene agreement: Phenotypic dependency evidence from genome-wide CRISPR screens (DepMap 23Q4).
Central finding: external support is narrow, tissue-specific, and null-family sensitive. The three modalities are near-independent (|rho| < 0.2), meaning single-reference evaluation is fundamentally unreliable.
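
The near-independence claim rests on pairwise rank correlations between per-TF support scores from the three modalities. A minimal sketch of that computation follows, using only the standard library. The function names, the `scores` dictionary, and its toy values are illustrative assumptions, not the contents of `src/synthesis/cross_modality.py`.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


# Toy per-TF support scores (made-up numbers, one entry per TF, same TF order).
scores = {
    "perturbation": [0.9, 0.1, 0.4, 0.7, 0.2],
    "chipseq": [0.3, 0.8, 0.5, 0.1, 0.6],
    "essentiality": [0.2, 0.4, 0.9, 0.5, 0.1],
}

# One rho per unordered pair of modalities.
pairwise_rho = {
    (a, b): spearman_rho(scores[a], scores[b])
    for a, b in combinations(sorted(scores), 2)
}
```

Under this reading, the modalities count as near-independent when every value in `pairwise_rho` has absolute value below 0.2.
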
external-validation/
├── README.md # This file
├── LICENSE # MIT License
├── requirements.txt # Python dependencies
├── environment.yml # Conda environment specification
├── setup.py # Package installation
│
├── src/ # Source code (analysis modules)
│   ├── perturbation/ # Modality 1: perturbation bridging
│   │   ├── __init__.py
│   │   ├── enrichment.py # Bootstrap enrichment computation
│   │   ├── independent_ref.py # Independent union reference construction
│   │   ├── rank_shift.py # Cross-regime rank shift analysis
│   │   └── auprc.py # Precision-recall curve computation
│   │
│   ├── chipseq/ # Modality 2: ChIP-seq binding support
│   │   ├── __init__.py
│   │   ├── atlas_query.py # Atlas querying and TF normalization
│   │   ├── null_testing.py # Method- and source-conditioned null models
│   │   ├── support_curves.py # Top-k support curve computation
│   │   └── cross_atlas.py # Cross-atlas consistency analysis
│   │
│   ├── essentiality/ # Modality 3: essential-gene agreement
│   │   ├── __init__.py
│   │   ├── depmap_query.py # DepMap data loading and tissue grouping
│   │   ├── zscore.py # Per-TF z-score computation
│   │   ├── concordance.py # Cross-tissue concordance
│   │   └── calibration.py # Dependency threshold calibration
│   │
│   └── synthesis/ # Cross-modality synthesis
│       ├── __init__.py
│       ├── cross_modality.py # Cross-modality rank correlation
│       └── validation_card.py # External validation card construction
│
├── scripts/ # Runnable analysis scripts
│   ├── 01_run_perturbation.py
│   ├── 02_run_chipseq.py
│   ├── 03_run_essentiality.py
│   ├── 04_run_synthesis.py
│   └── 05_generate_figures.py
│
├── data/ # Data directory
│   ├── raw/ # Raw input data (not tracked; see instructions)
│   │   └── .gitkeep
│   └── processed/ # Processed intermediate results
│       └── .gitkeep
│
├── paper/ # Manuscript
│   ├── main.tex # Full paper source
│   ├── main.pdf # Compiled output
│   ├── figures/ # Generated figure PNGs and PDFs (14 figures)
│   ├── supplementary/ # Supplementary materials
│   ├── generate_figures.py # Composite figure generation
│   └── generate_standalone_figures.py # Synthesized figures (concordance, card)
│
└── tests/ # Unit tests
    ├── test_perturbation.py
    ├── test_chipseq.py
    └── test_essentiality.py

# Clone the repository
git clone https://github.com/Biodyn-AI/external-validation.git
cd external-validation
# Option 1: pip
pip install -r requirements.txt
# Option 2: conda
conda env create -f environment.yml
conda activate external-validation

This analysis requires three categories of external data, plus scGPT edge scores:
- Perturbation data: Perturb-seq datasets from Dixit et al. (2016), Adamson et al. (2016), and Shifrut et al. (2018).
- ChIP-seq atlases: ChEA (2015/2016/2022) from Enrichr and ENCODE TF ChIP-seq (2014/2015) from ENCODE.
- DepMap CRISPR dependency data: DepMap 23Q4 Chronos dependency scores.
- scGPT edge scores: Computed using the scGPT-human checkpoint on immune-tissue single-cell RNA-seq data.
Place downloaded files in data/raw/. See individual script headers for expected file formats.
# Modality 1: Perturbation bridging
python scripts/01_run_perturbation.py --tissue immune --top-k 1000
# Modality 2: ChIP-seq binding support
python scripts/02_run_chipseq.py --top-k 1000 --n-permutations 500
# Modality 3: Essential-gene agreement
python scripts/03_run_essentiality.py --depmap-release 23Q4 --n-tfs 50
# Cross-modality synthesis
python scripts/04_run_synthesis.py
# Generate paper figures
python scripts/05_generate_figures.py

# Run the unit tests
pytest tests/ -v

# Compile the manuscript
cd paper
pdflatex main.tex
pdflatex main.tex # second pass for references

| Modality | Key Metric | Value |
|---|---|---|
| Perturbation | Best enrichment (canonical) | 87.4x |
| Perturbation | Best enrichment (independent) | 265.7x |
| Perturbation | Perturbations with recall > 0 | 24.3% |
| Perturbation | Cross-regime rank agreement | r = 0.449 |
| ChIP-seq | Significant method-atlas pairs | 5/30 (all ENCODE 2014) |
| ChIP-seq | Source-conditioned null | All significance lost (p >= 0.683) |
| Essentiality | Top TF (immune) | EZH2 (z = 5.51, q = 0.016) |
| Essentiality | Significant TFs (lung/kidney) | 0 |
| Essentiality | Cross-tissue concordance | rho = 0.15-0.31 |
| Synthesis | Cross-modality correlations | All \|rho\| < 0.2 |

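
To make the enrichment and null-model rows easier to interpret, the sketch below shows one simple way to compute top-k fold enrichment against a reference edge set, with an empirical p-value from a uniform random-subset null. It is an illustration under stated assumptions: the function names and toy edges are hypothetical, and the uniform null here is deliberately simpler than the method- and source-conditioned nulls in `src/chipseq/null_testing.py`, which are not reproduced.

```python
import random


def count_hits(edges, reference):
    """Number of edges that also appear in the reference set."""
    return sum(1 for edge in edges if edge in reference)


def fold_enrichment(ranked_edges, reference, k):
    """Top-k precision divided by the reference density among all candidates."""
    if not 0 < k <= len(ranked_edges):
        raise ValueError("k must be between 1 and the number of candidate edges")
    precision = count_hits(ranked_edges[:k], reference) / k
    background = count_hits(ranked_edges, reference) / len(ranked_edges)
    if background == 0:
        raise ValueError("no candidate edge is in the reference")
    return precision / background


def permutation_p_value(ranked_edges, reference, k, n_permutations=500, seed=0):
    """Empirical p-value under a uniform null: random k-subsets of candidates."""
    rng = random.Random(seed)
    observed = count_hits(ranked_edges[:k], reference)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        sample = rng.sample(ranked_edges, k)
        if count_hits(sample, reference) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Toy candidate edges (TF, target), best-scored first; names are made up.
toy_edges = [("TF%d" % i, "GENE%d" % i) for i in range(10)]
toy_reference = {("TF0", "GENE0"), ("TF1", "GENE1")}

toy_fold = fold_enrichment(toy_edges, toy_reference, k=2)
toy_p = permutation_p_value(toy_edges, toy_reference, k=2, n_permutations=500)
```

In the toy case both reference edges sit in the top 2 of 10 candidates, so precision is 1.0 against a background density of 0.2. A conditioned null would replace `rng.sample` with sampling that preserves the relevant structure, such as per-TF or per-atlas edge counts.
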
This project is licensed under the MIT License. See LICENSE for details.