Predicting pairwise protein-protein interactions and structures from multiple sequence alignments. Now with interspecies (host-pathogen) interactions and automatic chunking of large sequences!
- Installation
- Credit
- Command-line usage
- Python API
- Scaling up
- Issues, problems, suggestions
- Further help
yunta provides several implementations of protein-protein interaction evaluation. In increasing computational cost:
- GPU-accelerated direct coupling analysis (DCA) in PyTorch
- RoseTTAFold-2track via the rf2t-micro package
- AlphaFold2 for protein-protein structure prediction
yunta has streamlined installation, a command-line interface, a Python API, and resilience to GPU out-of-memory errors through chunking of long sequences and CPU-fallback. It takes as input unpaired multiple-sequence alignments in A3M format (as generated by tools like hhblits), and outputs a matrix of inter-residue contacts.
Rough timings for a pair of ~200 amino-acid proteins (S. cerevisiae DHFR and WW domain-containing protein) on CPU:
- DCA: 5 seconds
- RoseTTAFold-2track: 10 seconds
- AlphaFold2: 1 hour
Note that times increase quadratically with total protein length.
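As a back-of-the-envelope aid, the quadratic scaling noted above can be turned into a rough extrapolation. This is only a sketch: `scale_runtime` is a hypothetical helper (not part of yunta), the lengths below are made-up inputs, and real timings will also depend on MSA depth and hardware.

```python
def scale_runtime(base_seconds: float, base_length: int, new_length: int) -> float:
    """Extrapolate a runtime, assuming cost grows with the square of total length."""
    return base_seconds * (new_length / base_length) ** 2


# Doubling the total length quadruples the estimate
print(scale_runtime(10.0, 400, 800))  # 40.0
```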
pip install yunta
To enable AlphaFold2 with CUDA 12 (recommended for GPU):
pip install yunta[af_cuda12]
For a local CUDA 12 installation:
pip install yunta[af_cuda12_local]
For CUDA 11:
pip install yunta[af_cuda11]
For AlphaFold2 without a specific CUDA version (CPU or custom JAX install):
pip install yunta[af]
To enable RoseTTAFold-2track:
pip install yunta[rf2t]
Using the embedded models requires the RoseTTAFold-2track and AlphaFold2 weights, which are downloaded automatically on first use. By downloading them you agree that the trained RoseTTAFold weights are available for non-commercial use only, under the terms of the Rosetta-DL Software license, and that AlphaFold2's pretrained parameters fall under the CC BY 4.0 license.
| Variable | Default | Description |
|---|---|---|
| YUNTA_CACHE | ~/.cache/yunta | Directory for the organism interaction lookup table cache. |
| YUNTA_USE_CACHE | False | Set to True to load a pre-built cache from disk rather than rebuilding. |
| YUNTA_TEST | 0 | Set to 1 to build and hold the interaction lookup table in memory only (no disk write). |
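The variables in this table can also be set from Python rather than the shell. The sketch below is a minimal example under one assumption that is not confirmed here: that yunta reads these variables at import time or on first use, so they need to be set before importing it. `configure_yunta_cache` is a hypothetical helper, not part of the package.

```python
import os


def configure_yunta_cache(cache_dir="~/.cache/yunta", use_cache=True):
    """Set the documented cache variables in the current process environment."""
    os.environ["YUNTA_CACHE"] = os.path.expanduser(cache_dir)
    os.environ["YUNTA_USE_CACHE"] = "True" if use_cache else "False"
    return {key: os.environ[key] for key in ("YUNTA_CACHE", "YUNTA_USE_CACHE")}


settings = configure_yunta_cache()
print(settings)
# Import yunta only after this point (assumption: values are read at import or first use).
```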
yunta is a fork of SpeedPPI, which is itself inspired by FoldDock. This method used AlphaFold2 to evaluate 65,484 protein-protein interactions from the human proteome in the paper "Towards a structurally resolved human protein interaction network".
The idea of using DCA, RoseTTAFold-2track, and AlphaFold2 in a cascade of increasingly expensive and specific PPI detection methods has been explored in a series of papers from David Baker's lab:
- Cong et al., Protein interaction networks revealed by proteome coevolution. Science, 2019
- Humphreys et al., Computed structures of core eukaryotic protein complexes. Science, 2021
- Humphreys et al., Protein interactions in human pathogens revealed through deep learning. Nature Microbiology, 2024
yunta puts these algorithms in one place with easy installation, a command-line interface, and a Python API. It also enables interspecies co-evolutionary analysis using a built-in host-pathogen interaction mapping.
$ yunta --help
usage: yunta [-h] {dca-single,dca-many,rf2t-single,af2-single,af2-many} ...
Screening protein-protein interactions using DCA, RosettaFold-2track, and AlphaFold2.
options:
-h, --help show this help message and exit
Sub-commands:
{dca-single,dca-many,rf2t-single,af2-single,af2-many}
Use these commands to specify the tool you want to use.
dca-single Calculate DCA for one protein-protein interaction.
dca-many Calculate DCA between two sets of proteins, or all pairs in one set of proteins.
rf2t-single Calculate RF-2track contacts between one protein and a series of others.
af2-single Model one protein-protein interaction.
af2-many Model all interactions between two sets of proteins, or all pairs in one set of proteins.
All algorithms depend on pre-computed multiple-sequence alignments (MSAs) between a protein of interest and as many homologs as possible. You can generate MSAs using hhblits with pre-clustered databases like UniClust:
hhblits -e 0.01 -v 3 -d /path/to/UniClust-database -i input.fasta -oa3m output-msa.a3m -o /dev/null -cov 60 -n 3 -realign -realign_max 10000
This typically takes 1 to 40 min depending on query complexity. See the hhsuite documentation for details.
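When several query sequences need MSAs, the hhblits call above can be assembled programmatically. The following is a sketch that only builds and prints the command, using the same flags as shown above; `hhblits_command`, the output directory name, and the example FASTA filenames are hypothetical.

```python
import shlex
from pathlib import Path


def hhblits_command(fasta, database, out_dir="msas"):
    """Build the hhblits argument list for one query FASTA file."""
    fasta = Path(fasta)
    out_a3m = Path(out_dir) / f"{fasta.stem}.a3m"  # one A3M per query
    return [
        "hhblits", "-e", "0.01", "-v", "3",
        "-d", str(database),
        "-i", str(fasta),
        "-oa3m", str(out_a3m),
        "-o", "/dev/null",
        "-cov", "60", "-n", "3",
        "-realign", "-realign_max", "10000",
    ]


# Placeholder query files; print the commands rather than running them
for query in ["query-1.fasta", "query-2.fasta"]:
    print(shlex.join(hhblits_command(query, "/path/to/UniClust-database")))
```

To actually run a command, the list can be passed to `subprocess.run(cmd, check=True)` once hhblits is installed and the output directory exists.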
Given two MSAs, yunta calculates a contact map using DCA, RF2t, or AlphaFold2, and produces a summary table for each pair.
Using DCA or RF2t produces a table like this:
$ yunta dca-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/dca-single.tsv --apc
| ID | uniprot_id_1 | uniprot_id_2 | seq_len | chain_a_len | chain_b_len | msa1_depth | msa2_depth | msa_depth | n_eff | DCA:apc | DCA:mean | DCA:median | DCA:maximum | DCA:minimum | DCA:var | DCA:sigma1 | DCA:focality | DCA:top_A | DCA:top_B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| O13297-D6VTK4 | O13297 | D6VTK4 | 980 | 549 | 431 | 14246 | 1546 | 670 | 2 | False | 0.0183 | 0.0147 | 0.0743 | 2.28e-06 | ... | ... | ... | ... | ... |
Method-specific columns are prefixed with DCA: or RF2t:. Common columns across all methods:
- sigma1: leading singular value of the inter-chain contact submatrix
- focality: ratio of first to second singular value; higher values indicate a more concentrated interaction signal
- top_A, top_B: indices of the top-scoring residues in each chain (from the leading singular vectors)
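To make these definitions concrete, here is a minimal NumPy sketch of how such statistics could be derived from a contact matrix. It is an illustration, not yunta's implementation, and rests on several assumptions: the full matrix is square with chain A first, the inter-chain submatrix is the upper-right block, "top-scoring" means largest absolute entry in the leading singular vector, and `top_k` is a hypothetical parameter.

```python
import numpy as np


def interchain_summary(contacts: np.ndarray, len_a: int, top_k: int = 3):
    """Summarise the inter-chain block of a (len_a + len_b) square contact matrix."""
    inter = contacts[:len_a, len_a:]  # rows: chain A residues, columns: chain B residues
    u, s, vt = np.linalg.svd(inter, full_matrices=False)
    sigma1 = float(s[0])
    # Ratio of first to second singular value; infinite if there is no second component
    focality = float(s[0] / s[1]) if len(s) > 1 and s[1] > 0 else float("inf")
    # Residues with the largest weight in the leading singular vectors
    top_a = np.argsort(-np.abs(u[:, 0]))[:top_k].tolist()
    top_b = np.argsort(-np.abs(vt[0]))[:top_k].tolist()  # indexed from the start of chain B
    return sigma1, focality, top_a, top_b


# Toy example: 2-residue chain A, 3-residue chain B, one dominant contact
toy = np.zeros((5, 5))
toy[1, 4] = toy[4, 1] = 1.0   # strong contact: A residue 1, B residue 2
toy[0, 2] = toy[2, 0] = 0.1   # weak contact: A residue 0, B residue 0
sigma1, focality, top_a, top_b = interchain_summary(toy, len_a=2)
print(sigma1, focality, top_a[0], top_b[0])
```

A single dominant contact patch gives a large gap between the first two singular values, which is why a high focality suggests a concentrated signal.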
If you also give --plot, contact maps for the full complex and inter-chain contacts only are saved as PNG, alongside CSV files of the raw matrices:
$ yunta dca-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/dca-single.tsv --apc --plot test/outputs/DYR_YEAST-CAPZA_YEAST
yunta can feed MSAs into AlphaFold2 to predict binary protein complex structures:
$ yunta af2-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/af2-single.tsv
This writes a summary TSV with AF2:-prefixed metrics (n_contacts, mean_interface_plddt, pdockq, seed) in addition to the standard contact map statistics. PDB structure files are written to the current working directory, named by protein pair ID.
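Because the summary is a plain TSV, downstream filtering needs nothing beyond the standard library. The sketch below makes several assumptions: the column is named `AF2:pdockq` (inferred from the prefix convention, not verified against real output), the rows are invented toy values, and the cutoff is an arbitrary placeholder rather than a recommended threshold. `filter_pairs` is a hypothetical helper.

```python
import csv
import io

# Toy stand-in for a summary TSV; IDs and values are invented for illustration only
toy_tsv = (
    "ID\tAF2:pdockq\tAF2:n_contacts\n"
    "A-B\t0.45\t60\n"
    "A-C\t0.05\t4\n"
)


def filter_pairs(handle, column="AF2:pdockq", cutoff=0.2):
    """Return the IDs of rows whose score in `column` is at least `cutoff`."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["ID"] for row in reader if float(row[column]) >= cutoff]


hits = filter_pairs(io.StringIO(toy_tsv))
print(hits)  # ['A-B']
```

With a real file, the same function can be called on `open("test/outputs/af2-single.tsv", newline="")`.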
Using --plot generates contact map plots as with the other commands:
$ yunta af2-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/af2-single.tsv --plot test/outputs/af2-single-plot
*-single commands run one protein against one or more others:
$ yunta dca-single --help
usage: yunta dca-single [-h] [--msa2 [MSA2 ...]] [--list-file] [--interspecies] [--strict-match]
[--output [OUTPUT]] [--plot PLOT] [--apc] [msa1]
positional arguments:
msa1 MSA file. Default: STDIN.
options:
-h, --help show this help message and exit
--msa2 [MSA2 ...], -2 [MSA2 ...]
Second MSA file(s). Default: if not provided, all pairwise from msa1.
--list-file, -l Treat inputs as plain-text list of MSA files, rather than MSA filenames.
Default: treat as MSA filenames.
--interspecies, -i MSAs are from different species; enables built-in host-pathogen interaction
map. Default: assume same species.
--strict-match, -S For interspecies mode, require query MSA sequences to be from known
interacting species. Default: relax this constraint for query sequences.
--output [OUTPUT], -o [OUTPUT]
Output filename. Default: STDOUT.
--plot PLOT, -p PLOT Directory for saving plots. Default: don't plot.
--apc, -a Apply average product correction (APC) to DCA scores. Default: off.
If one MSA is provided (no -2), homodimeric interactions are probed. Use --list-file to pass a single plain-text file containing one MSA path per line.
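A list file of the kind expected by --list-file is easy to generate. The sketch below collects every `.a3m` file in a directory and writes one path per line; `write_msa_list` and the filenames are hypothetical, and the demonstration uses empty placeholder files in a temporary directory.

```python
import tempfile
from pathlib import Path


def write_msa_list(msa_dir, list_path, pattern="*.a3m"):
    """Write one MSA path per line and return the paths that were listed."""
    paths = sorted(Path(msa_dir).glob(pattern))
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths


# Demonstration in a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.a3m", "a.a3m", "notes.txt"):
        (Path(tmp) / name).touch()
    list_file = Path(tmp) / "msa-list.txt"
    listed = write_msa_list(tmp, list_file)
    listed_names = [p.name for p in listed]
    n_lines = len(list_file.read_text().splitlines())

print(listed_names, n_lines)
```

The resulting file would then be given as the positional input together with --list-file.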
*-many commands run all pairwise combinations across two sets:
$ yunta af2-many --help
usage: yunta af2-many [-h] [--msa2 [MSA2 ...]] [--list-file] [--interspecies] [--strict-match]
[--output [OUTPUT]] [--params PARAMS] [--recycles RECYCLES] [--plot PLOT]
[msa1 ...]
positional arguments:
msa1 MSA file(s).
options:
-h, --help show this help message and exit
--msa2 [MSA2 ...], -2 [MSA2 ...]
Second MSA file(s). Default: if not provided, all pairwise from msa1.
--list-file, -l Treat inputs as plain-text list of MSA files, rather than MSA filenames.
--interspecies, -i MSAs are from different species; enables built-in host-pathogen interaction map.
--strict-match, -S For interspecies mode, require query MSA sequences to be from known
interacting species.
--output [OUTPUT], -o [OUTPUT]
Output filename. Default: STDOUT.
--params PARAMS, -w PARAMS
Path to AlphaFold2 params file (.npz). Downloaded automatically if absent.
--recycles RECYCLES, -x RECYCLES
Maximum number of recycles through the model. Default: 10.
--plot PLOT, -p PLOT Directory for saving plots. Default: don't plot.
Use --interspecies / -i when the two MSAs come from organisms that interact as host and pathogen. yunta uses a built-in host-pathogen interaction (HPI) map to pair aligned sequences across species rather than requiring exact species identity:
$ yunta dca-single test/inputs/crypto/Q5CPK5_CRYPI.a3m \
-2 test/inputs/human/EZRI_HUMAN.a3m \
--interspecies --apc \
-o test/outputs/dca-single-interspecies.tsv
By default (without --strict-match), the HPI constraint is relaxed for the query sequences themselves, which is useful when screening an uncharacterised query against a known host or pathogen proteome. Add --strict-match to require that query sequences come from species in the HPI map.
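An interaction map of this kind can be pictured as a dictionary from a species ID to the list of species it interacts with, which is the form the Python API accepts for a custom `interaction_map`. The sketch below builds such a dictionary from a list of pairs and fills in both directions. Whether yunta requires both directions is an assumption, and `symmetric_interaction_map` is a hypothetical helper, not part of the package.

```python
from collections import defaultdict


def symmetric_interaction_map(pairs):
    """Build {species: [partners]} with every pair recorded in both directions."""
    mapping = defaultdict(set)
    for species_a, species_b in pairs:
        mapping[species_a].add(species_b)
        mapping[species_b].add(species_a)
    # Sort for a stable, readable result
    return {species: sorted(partners) for species, partners in mapping.items()}


hpi = symmetric_interaction_map([("NCBI:562", "NCBI:10710")])
print(hpi)  # {'NCBI:562': ['NCBI:10710'], 'NCBI:10710': ['NCBI:562']}
```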
Load and inspect an MSA:
from yunta.structs.msa import MSA, PairedMSA
msa = MSA.from_file("my-msa-file.a3m")
print(msa) # MSA(name=P07807) of sequence length 549, with 14246 sequences.
print(msa.neff()) # effective sequence count
Pair two MSAs and run DCA:
from yunta.structs.msa import MSA, PairedMSA
from yunta.interactions.dca.dca_torch import calculate_dca
msa1 = MSA.from_file("protein-a.a3m")
msa2 = MSA.from_file("protein-b.a3m")
paired = PairedMSA.from_msa(msa1, msa2)
contact_matrix = calculate_dca(paired, apc=True)
For interspecies pairing, pass interaction_map="builtin":
paired = PairedMSA.from_msa(msa1, msa2, interaction_map="builtin")
Or supply a custom dict mapping species IDs to lists of interacting species IDs:
paired = PairedMSA.from_msa(
msa1, msa2,
interaction_map={"NCBI:562": ["NCBI:10710"], "NCBI:10710": ["NCBI:562"]},
)
Run the full screening pipeline programmatically:
from yunta.screening import dca_one_vs_many, rf2track_one_vs_many
outputs = dca_one_vs_many(
msa_file1="query.a3m",
msa_file2=["target1.a3m", "target2.a3m"],
apc=True,
interaction_map="builtin", # omit for same-species
)
for result_matrix, interaction_matrix, metrics in outputs:
print(metrics.ID, metrics.focality)
Each element of outputs is a 3-tuple (full_contact_matrix, inter-chain_contact_matrix, metrics_dataclass). Metrics dataclasses (DCAMetrics, RF2TMetrics, AF2Metrics) can be written directly to TSV:
metrics.write("results.tsv")
(More documentation coming soon!)
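The `apc=True` argument to `calculate_dca` (and `--apc` on the command line) applies average product correction. As a rough illustration of what that correction does, the NumPy sketch below uses the textbook formula, subtracting the product of row and column means divided by the overall mean. It is not taken from yunta's code, and it simplifies by including the diagonal in the means; `average_product_correction` is a hypothetical name.

```python
import numpy as np


def average_product_correction(scores: np.ndarray) -> np.ndarray:
    """Subtract the background expected from each position's average score."""
    row_mean = scores.mean(axis=1, keepdims=True)
    col_mean = scores.mean(axis=0, keepdims=True)
    return scores - row_mean * col_mean / scores.mean()


# A matrix that is purely "background" (an outer product) is corrected to zero
background = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
corrected_background = average_product_correction(background)

# A specific coupling added on top of the background survives the correction
signal = background.copy()
signal[0, 2] += 5.0
corrected_signal = average_product_correction(signal)
print(np.unravel_index(corrected_signal.argmax(), corrected_signal.shape))
```

The point of the correction is that positions with uniformly high scores stop dominating the ranking, leaving pair-specific couplings more visible.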
While the *-many commands handle batches of PPIs, for large-scale screening across an HPC cluster our nf-ggi Nextflow pipeline is more efficient and can also generate MSAs for you.
Add to the issue tracker.




