THGLab/BioEdaDatabase

BioEdaDatabase

Paper

Energetics of Non-covalent Interactions of Protein-Ligand Complexes for Drug Discovery
Yingze Wang, Dong Jun Shin, Martin Head-Gordon, Teresa Head-Gordon — ChemRxiv preprint

Scope

BioEdaDatabase provides quantum-mechanical interaction energies and ALMO energy decomposition analysis (EDA) for 14,905 protein-ligand fragment dimers extracted from high-quality experimental structures in HiQBind. Each dimer corresponds to a specific non-covalent interaction (NCI) type identified by PLIP, with reference energies computed at the ωB97X-V/def2-TZVPD level in Q-Chem and decomposed into electrostatics, Pauli repulsion, dispersion, polarization, and charge transfer. The dataset also includes interaction energies from classical force fields (GAFF2, AMOEBA) and machine-learned interaction potentials (MACE-OFF, MACE-OMOL, UMA, AIMNet2) for benchmarking.

Dataset files

All cleaned tables live in dataset/. Energies are reported in kJ/mol unless noted otherwise.

| File | Entries | Description |
| --- | --- | --- |
| hbond_EDA_clean.csv | 4,302 | Hydrogen-bonded protein-ligand fragment dimers, including neutral and charged cases (category). Charged hydrogen bonds involving oppositely charged fragments are analyzed separately from salt bridges in the paper. |
| salt_bridge_EDA_clean.csv | 2,252 | Salt-bridge dimers between charged protein and ligand fragments, including dimers reclassified from charged hydrogen bonds. lig_group identifies the anion or cation type on the ligand side (e.g., carboxylate, phosphate, guanidinium). |
| halogen_EDA_clean.csv | 1,064 | Halogen-bond dimers with C−X···Y geometry (donortype: F, Cl, Br, or I). Includes donor/acceptor atom indices, distances, and angles. |
| pi_stack_EDA_clean.csv | 966 | π−π stacking dimers between aromatic protein and ligand fragments. type is parallel (P) or T-shaped (T); centdist, angle, and offset describe the stacking geometry. |
| pi_cation_EDA_clean.csv | 1,505 | π-cation dimers between an aromatic ligand group and a cationic protein side chain (lig_group, e.g., aromatic ring, guanidinium, tertiary amine). |
| hydrophobic_EDA_clean.csv | 4,816 | Hydrophobic contacts between protein and ligand carbon atoms, filtered to exclude charged fragments and overlapping NCI motifs. |
| omol_pocket_EDA_clean.csv | 111 | Supplementary pocket-scale fragment dimer benchmark set with the same EDA and force-field/MLIP columns as the main tables. category labels the dominant interaction motif (hbond, salt, or disp). Row IDs encode the source complex, ligand, conformational state/frame, and fragment pair. |
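The entry counts listed above can serve as a quick integrity check after download. The following is a minimal sketch using only the standard library; the helper names (`count_rows`, `check_dataset`) are illustrative and not part of the repository, and the default `dataset` path assumes the script runs from the repository root.

```python
import csv
from pathlib import Path

# Entry counts copied from the file list above (six main NCI tables only).
EXPECTED_ROWS = {
    "hbond_EDA_clean.csv": 4302,
    "salt_bridge_EDA_clean.csv": 2252,
    "halogen_EDA_clean.csv": 1064,
    "pi_stack_EDA_clean.csv": 966,
    "pi_cation_EDA_clean.csv": 1505,
    "hydrophobic_EDA_clean.csv": 4816,
}


def count_rows(path):
    """Count data rows (header excluded) in one CSV file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header line
        return sum(1 for _ in reader)


def check_dataset(dataset_dir="dataset"):
    """Return {filename: (found, expected)} for every table whose count differs."""
    mismatches = {}
    for name, expected in EXPECTED_ROWS.items():
        found = count_rows(Path(dataset_dir) / name)
        if found != expected:
            mismatches[name] = (found, expected)
    return mismatches


# The six main tables together account for the 14,905 dimers quoted in Scope.
print(sum(EXPECTED_ROWS.values()))
```

Calling `check_dataset()` from the repository root should return an empty dictionary if every table has the listed number of rows.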

Common columns

The first unnamed column is a unique row identifier for each dimer. Many columns are shared across the six main NCI tables; omol_pocket_EDA_clean.csv retains the structure, EDA, and benchmark columns but omits PLIP/PDB metadata fields.
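Because the row identifier sits in an unnamed first column, it is convenient to load it as the index. The sketch below uses pandas on a made-up two-row excerpt (the row IDs and values are hypothetical, not taken from the dataset). It also reads PDBID as a string, since codes such as 1e12 can otherwise be parsed as floating-point numbers.

```python
import io

import pandas as pd

# Hypothetical excerpt standing in for dataset/hbond_EDA_clean.csv.
# Row IDs and energies are invented for illustration only.
sample_csv = io.StringIO(
    ",PDBID,restype,TOTAL\n"
    "dimer_0001,1abc,SER,-25.0\n"
    "dimer_0002,1e12,ASP,-60.5\n"
)

# index_col=0 turns the first (unnamed) column into the DataFrame index.
# dtype={"PDBID": str} keeps codes like "1e12" from being read as numbers.
df = pd.read_csv(sample_csv, index_col=0, dtype={"PDBID": str})

print(df.loc["dimer_0002", "PDBID"])
print(df.loc["dimer_0001", "TOTAL"])
```

For the real tables, replace `sample_csv` with a path such as `dataset/hbond_EDA_clean.csv`.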

Structure and provenance

| Column | Description |
| --- | --- |
| PDBID | Four-character PDB code of the parent complex. |
| full_PDBID | Full PDB entry identifier, including ligand/residue context. |
| subdir | Source subdirectory within the HiQBind-derived structure set. |
| resnr | Residue number of the interacting protein residue. |
| restype | Three-letter amino-acid residue type. |
| reschain | Protein chain identifier. |

Fragment geometry and composition

| Column | Description |
| --- | --- |
| natoms0, natoms1 | Number of atoms in fragment 0 (ligand) and fragment 1 (protein). |
| charge0, charge1 | Net formal charge of each fragment. |
| smiles0, smiles1 | SMILES strings for the ligand and protein fragments. |
| elements | Space-separated element symbols for all atoms in the dimer, in the same order as xyz. |
| xyz | Flattened Cartesian coordinates (x y z per atom) in angstroms. |
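The geometry columns can be unpacked into per-fragment arrays. This is a sketch under two assumptions that the README does not state explicitly: that the xyz values are whitespace-separated, and that the atoms of fragment 0 (ligand) come before those of fragment 1 (protein). The function name `split_dimer` and the example geometry are illustrative.

```python
import numpy as np


def split_dimer(elements, xyz, natoms0, natoms1):
    """Split one row's elements/xyz strings into (symbols, coords) per fragment.

    Assumes whitespace-separated values and fragment 0 atoms listed first.
    """
    symbols = elements.split()
    coords = np.array([float(v) for v in xyz.split()]).reshape(-1, 3)
    if len(symbols) != coords.shape[0] or len(symbols) != natoms0 + natoms1:
        raise ValueError("atom counts in elements, xyz, and natoms0/1 disagree")
    frag0 = (symbols[:natoms0], coords[:natoms0])
    frag1 = (symbols[natoms0:], coords[natoms0:])
    return frag0, frag1


def closest_contact(frag0, frag1):
    """Shortest interatomic distance between the two fragments, in angstroms."""
    diff = frag0[1][:, None, :] - frag1[1][None, :, :]
    return float(np.linalg.norm(diff, axis=-1).min())


# Made-up six-atom geometry, for illustration only.
elements = "O H H O H H"
xyz = (
    "0.000 0.000 0.000 0.757 0.586 0.000 -0.757 0.586 0.000 "
    "0.000 0.000 2.900 0.757 -0.586 2.900 -0.757 -0.586 2.900"
)
frag0, frag1 = split_dimer(elements, xyz, natoms0=3, natoms1=3)
print(frag0[0], frag1[1].shape, closest_contact(frag0, frag1))
```

If the atom ordering assumption does not hold for a given table, the length check still passes, so it is worth comparing a few fragments against smiles0 and smiles1 before relying on the split.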

ALMO-EDA reference energies (DFT)

Computed with ALMO-EDA at ωB97X-V/def2-TZVPD in Q-Chem. Component definitions follow Eq. (1) in the paper.

| Column | Description |
| --- | --- |
| ELEC | Permanent electrostatic interaction energy. |
| PAULI | Pauli repulsion energy. |
| DISP | Dispersion energy. |
| POLARIZATION | Polarization energy. |
| CHARGE_TRANSFER | Charge-transfer energy. |
| TOTAL | Total ALMO interaction energy; sum of the EDA components used for analysis. |
| CLS_ELEC | Classical electrostatic contribution used for ternary-plot analysis. |
| MOD_PAULI | Modified Pauli contribution used for ternary-plot analysis. |
| FROZEN | Frozen-core term (ELEC + PAULI) in the ternary decomposition. |
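Since TOTAL is described as the sum of the EDA components, a per-row residual is a simple consistency check. The sketch below assumes the sum runs over the five components ELEC, PAULI, DISP, POLARIZATION, and CHARGE_TRANSFER; Eq. (1) in the paper is the authoritative definition. The example row uses invented numbers.

```python
import pandas as pd

# Assumed set of components entering TOTAL; check against Eq. (1) in the paper.
COMPONENTS = ["ELEC", "PAULI", "DISP", "POLARIZATION", "CHARGE_TRANSFER"]


def eda_residual(df):
    """TOTAL minus the sum of the five components, per row (kJ/mol)."""
    return df["TOTAL"] - df[COMPONENTS].sum(axis=1)


# One invented row, for illustration only (not from the dataset).
example = pd.DataFrame(
    {
        "ELEC": [-40.0],
        "PAULI": [30.0],
        "DISP": [-10.0],
        "POLARIZATION": [-5.0],
        "CHARGE_TRANSFER": [-3.0],
        "TOTAL": [-28.0],
    },
    index=["dimer_0001"],
)

residual = eda_residual(example)
print(residual.abs().max())
```

On the real tables, rows with a residual well above rounding error would indicate that the assumed component list differs from the one used to build TOTAL.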

Force-field and MLIP benchmark energies

Each column holds the total interaction energy from one method, evaluated on the same dimer geometries as the DFT reference. A smaller deviation from TOTAL indicates better agreement with the DFT reference.

| Column | Method |
| --- | --- |
| GAFF2 | GAFF2 with AM1-BCC charges (OpenMM). In omol_pocket_EDA_clean.csv, this column holds the AMBER14SB + GAFF2 (AM1-BCC) energy. |
| GAFF2/RESP | GAFF2 with RESP charges (OpenMM); present in the six main NCI tables only. |
| AMOEBA | AMOEBA polarizable force field (Tinker), with multipoles from Poltype2 at ωB97X-V/def2-TZVPD. |
| mace_off/medium | MACE-OFF-23(M). |
| mace_omol/extra_large | MACE-OMOL (extra-large model). |
| uma-s-1p1 | UMA small model (uma-s-1.1). |
| aimnet2 | AIMNet2 neural network potential. |
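One way to compare the methods against the DFT reference is to compute the mean absolute error (MAE) and root-mean-square error (RMSE) of each benchmark column relative to TOTAL. This is a sketch, not the analysis used in the paper; the function name `benchmark_errors` and the example values are illustrative. Columns missing from a table (for example GAFF2/RESP in omol_pocket_EDA_clean.csv) are skipped.

```python
import numpy as np
import pandas as pd

# Benchmark column names as listed in the table above.
METHODS = [
    "GAFF2",
    "GAFF2/RESP",
    "AMOEBA",
    "mace_off/medium",
    "mace_omol/extra_large",
    "uma-s-1p1",
    "aimnet2",
]


def benchmark_errors(df, reference="TOTAL"):
    """MAE and RMSE (kJ/mol) of each available method against the reference."""
    rows = {}
    for method in METHODS:
        if method not in df.columns:
            continue  # not every table carries every method
        err = (df[method] - df[reference]).dropna()
        rows[method] = {
            "MAE": float(err.abs().mean()),
            "RMSE": float(np.sqrt((err ** 2).mean())),
            "n": len(err),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Invented energies for two dimers, for illustration only.
example = pd.DataFrame(
    {
        "TOTAL": [-10.0, -20.0],
        "GAFF2": [-12.0, -17.0],
        "aimnet2": [-10.5, -20.5],
    }
)

stats = benchmark_errors(example)
print(stats)
```

Rows where a method has no value are dropped per method, so the `n` column shows how many dimers each error is based on.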

Citation

If you use this dataset, please cite the preprint:

Wang, Y.; Shin, D. J.; Head-Gordon, M.; Head-Gordon, T. Energetics of Non-covalent Interactions of Protein-Ligand Complexes for Drug Discovery. ChemRxiv 2026. https://doi.org/10.26434/chemrxiv.10001956/v1
