🍐 yunta

Predicting pairwise protein-protein interactions and structures from multiple sequence alignments. Now with interspecies (host-pathogen) interactions and automatic chunking of large sequences!

Installation
Credit
Command-line usage
Python API
Scaling up
Issues, problems, suggestions
Further help

yunta provides several implementations of protein-protein interaction evaluation. In increasing computational cost:

GPU-accelerated direct coupling analysis (DCA) in PyTorch
RoseTTAFold-2track via the rf2t-micro package
AlphaFold2 for protein-protein structure prediction

yunta has streamlined installation, a command-line interface, a Python API, and resilience to GPU out-of-memory errors through chunking of long sequences and CPU fallback. It takes unpaired multiple-sequence alignments in A3M format as input (as generated by tools like hhblits) and outputs a matrix of inter-residue contacts.

Rough timings for a pair of ~200 amino-acid proteins (S. cerevisiae DHFR and WW domain-containing protein) on CPU:

DCA: 5 seconds

RoseTTAFold-2track: 10 seconds

AlphaFold2: 1 hour

Note that times increase quadratically with total protein length.
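As a rough illustration of that scaling, the sketch below extrapolates a runtime from a reference measurement, assuming time grows exactly with the square of the total length. The function name and default reference values (about 400 residues in total, 5 seconds, taken from the DCA timing above) are illustrative and not part of yunta; real timings will also depend on MSA depth and hardware.

```python
def estimate_runtime(total_length, ref_length=400, ref_seconds=5.0):
    """Extrapolate a runtime assuming purely quadratic scaling in total length."""
    return ref_seconds * (total_length / ref_length) ** 2


# Doubling the combined length quadruples the estimate.
print(estimate_runtime(800))  # 20.0
```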

Installation

pip install yunta

To enable AlphaFold2 with CUDA 12 (recommended for GPU):

pip install yunta[af_cuda12]

For a local CUDA 12 installation:

pip install yunta[af_cuda12_local]

For CUDA 11:

pip install yunta[af_cuda11]

For AlphaFold2 without a specific CUDA version (CPU or custom JAX install):

pip install yunta[af]

To enable RoseTTAFold-2track:

pip install yunta[rf2t]

Using the embedded models requires the RoseTTAFold-2track and AlphaFold2 weights, which are downloaded automatically on first use. By downloading them you agree to their licenses: the trained RoseTTAFold weights are available for non-commercial use only, under the terms of the Rosetta-DL Software license, and AlphaFold2's pretrained parameters are covered by the CC BY 4.0 license.

Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| `YUNTA_CACHE` | `~/.cache/yunta` | Directory for the organism interaction lookup table cache. |
| `YUNTA_USE_CACHE` | `False` | Set to `True` to load a pre-built cache from disk rather than rebuilding. |
| `YUNTA_TEST` | `0` | Set to `1` to build and hold the interaction lookup table in memory only (no disk write). |
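One way to set these variables for a single run, without touching your shell profile, is to build an environment dictionary and hand it to a subprocess. The helper below is a minimal sketch: `yunta_env` is a hypothetical name, and it only uses the variable names and values listed above.

```python
import os


def yunta_env(cache_dir=None, use_cache=False, test_mode=False, base=None):
    """Return a copy of the environment with the yunta variables set."""
    env = dict(os.environ if base is None else base)
    if cache_dir is not None:
        env["YUNTA_CACHE"] = str(cache_dir)
    env["YUNTA_USE_CACHE"] = "True" if use_cache else "False"
    env["YUNTA_TEST"] = "1" if test_mode else "0"
    return env


env = yunta_env(cache_dir="/tmp/yunta-cache", use_cache=True, base={})
print(env)
# To use it for one call (not executed here):
# subprocess.run(["yunta", "--help"], env=yunta_env(use_cache=True), check=True)
```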

Credit

yunta is a fork of SpeedPPI, which is itself inspired by FoldDock. This method used AlphaFold2 to evaluate 65,484 protein-protein interactions from the human proteome in Towards a structurally resolved human protein interaction network.

The idea of using DCA, RoseTTAFold-2track, and AlphaFold2 in a cascade of increasingly expensive and specific PPI detection methods has been explored in a series of papers from David Baker's lab.

yunta puts these algorithms in one place with easy installation, a command-line interface, and a Python API. It also enables interspecies co-evolutionary analysis using a built-in host-pathogen interaction mapping.

Command-line usage

$ yunta --help
usage: yunta [-h] {dca-single,dca-many,rf2t-single,af2-single,af2-many} ...

Screening protein-protein interactions using DCA, RosettaFold-2track, and AlphaFold2.

options:
  -h, --help            show this help message and exit

Sub-commands:
  {dca-single,dca-many,rf2t-single,af2-single,af2-many}
                        Use these commands to specify the tool you want to use.
    dca-single          Calculate DCA for one protein-protein interaction.
    dca-many            Calculate DCA between two sets of proteins, or all pairs in one set of proteins.
    rf2t-single         Calculate RF-2track contacts between one protein and a series of others.
    af2-single          Model one protein-protein interaction.
    af2-many            Model all interactions between two sets of proteins, or all pairs in one set of proteins.

Generating multiple-sequence alignments

All algorithms depend on pre-computed multiple-sequence alignments (MSAs) between a protein of interest and as many homologs as possible. You can generate MSAs using hhblits with pre-clustered databases like UniClust:

hhblits -e 0.01 -v 3 -d /path/to/UniClust-database -i input.fasta -oa3m output-msa.a3m -o /dev/null -cov 60 -n 3 -realign -realign_max 10000

This typically takes 1–40 min depending on query complexity. See the hhsuite documentation for details.

Calculating contact maps

Given two MSAs, yunta calculates a contact map using DCA, RF2t, or AlphaFold2, and produces a summary table for each pair.

Using DCA or RF2t produces a table like this:

$ yunta dca-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/dca-single.tsv --apc

ID	uniprot_id_1	uniprot_id_2	seq_len	chain_a_len	chain_b_len	msa1_depth	msa2_depth	msa_depth	n_eff	DCA:apc	DCA:mean	DCA:median	DCA:maximum	DCA:minimum	DCA:var	DCA:sigma1	DCA:focality	DCA:top_A	DCA:top_B
O13297-D6VTK4	O13297	D6VTK4	980	549	431	14246	1546	670	2	False	0.0183	0.0147	0.0743	2.28e-06	...	...	...	...	...

Method-specific columns are prefixed with DCA: or RF2t:. Common columns across all methods:

sigma1: leading singular value of the inter-chain contact submatrix
focality: ratio of the first to the second singular value; higher values indicate a more concentrated interaction signal
top_A, top_B: indices of the top-scoring residues in each chain (from the leading pair of singular vectors of the SVD)
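To make these definitions concrete, here is a small dependency-free sketch that computes the same kind of quantities for a toy inter-chain matrix using power iteration. It is an illustration under stated assumptions, not yunta's implementation: the function names are hypothetical, and "top-scoring" is assumed to mean the largest absolute component of the leading singular vector. In practice `numpy.linalg.svd` would give the same values more directly.

```python
import math


def _matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def leading_singular_triplet(m, iterations=200):
    """Power iteration on M^T M; returns (sigma, u, v) for the largest singular value."""
    mt = [list(col) for col in zip(*m)]
    v = [1.0] * len(m[0])
    for _ in range(iterations):
        w = _matvec(mt, _matvec(m, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # zero matrix: no signal at all
            return 0.0, [0.0] * len(m), v
        v = [x / norm for x in w]
    mv = _matvec(m, v)
    sigma = math.sqrt(sum(x * x for x in mv))
    return sigma, [x / sigma for x in mv], v


def interface_summary(m):
    """Return (sigma1, focality, top_a, top_b) for an inter-chain matrix."""
    sigma1, u, v = leading_singular_triplet(m)
    # Deflate: remove the leading rank-1 component, then find the next one.
    residual = [[m[i][j] - sigma1 * u[i] * v[j] for j in range(len(v))]
                for i in range(len(u))]
    sigma2, _, _ = leading_singular_triplet(residual)
    focality = sigma1 / sigma2 if sigma2 else math.inf
    top_a = max(range(len(u)), key=lambda i: abs(u[i]))
    top_b = max(range(len(v)), key=lambda j: abs(v[j]))
    return sigma1, focality, top_a, top_b


# Toy matrix: 3 residues in chain A (rows) by 2 residues in chain B (columns).
toy = [[0.0, 0.0], [0.0, 3.0], [1.0, 0.0]]
sigma1, focality, top_a, top_b = interface_summary(toy)
print(round(sigma1, 6), round(focality, 6), top_a, top_b)
```

In the toy matrix the strongest contact is between row 1 and column 1, so those indices come out as the top residues, and the ratio between the strong and the weak contact is the focality.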

If you also give --plot, contact maps for the full complex and inter-chain contacts only are saved as PNG, alongside CSV files of the raw matrices:

$ yunta dca-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/dca-single.tsv --apc --plot test/outputs/DYR_YEAST-CAPZA_YEAST

Predicting protein complex structures

yunta can feed MSAs into AlphaFold2 to predict binary protein complex structures:

$ yunta af2-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/af2-single.tsv

This writes a summary TSV with AF2:-prefixed metrics — n_contacts, mean_interface_plddt, pdockq, seed — in addition to the standard contact map statistics. PDB structure files are written to the current working directory, named by protein pair ID.

Using --plot generates contact map plots as with the other commands:

$ yunta af2-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/af2-single.tsv --plot test/outputs/af2-single-plot

Command-line tools

*-single commands run one protein against one or more others:

$ yunta dca-single --help
usage: yunta dca-single [-h] [--msa2 [MSA2 ...]] [--list-file] [--interspecies] [--strict-match]
                        [--output [OUTPUT]] [--plot PLOT] [--apc] [msa1]

positional arguments:
  msa1                  MSA file. Default: STDIN.

options:
  -h, --help            show this help message and exit
  --msa2 [MSA2 ...], -2 [MSA2 ...]
                        Second MSA file(s). Default: if not provided, all pairwise from msa1.
  --list-file, -l       Treat inputs as plain-text list of MSA files, rather than MSA filenames.
                        Default: treat as MSA filenames.
  --interspecies, -i    MSAs are from different species; enables built-in host-pathogen interaction
                        map. Default: assume same species.
  --strict-match, -S    For interspecies mode, require query MSA sequences to be from known
                        interacting species. Default: relax this constraint for query sequences.
  --output [OUTPUT], -o [OUTPUT]
                        Output filename. Default: STDOUT.
  --plot PLOT, -p PLOT  Directory for saving plots. Default: don't plot.
  --apc, -a             Apply average product correction (APC) to DCA scores. Default: off.

If one MSA is provided (no -2), homodimeric interactions are probed. Use --list-file to pass a single plain-text file containing one MSA path per line.
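A list file of this kind is easy to generate. The sketch below collects every `.a3m` file in a directory and writes one path per line; `write_msa_list` is a hypothetical helper, and whether yunta expects absolute or relative paths is not stated here, so check against your own setup.

```python
import tempfile
from pathlib import Path


def write_msa_list(msa_dir, list_path, pattern="*.a3m"):
    """Write one MSA path per line (sorted for reproducibility); return the count."""
    paths = sorted(Path(msa_dir).glob(pattern))
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return len(paths)


# Demonstration in a throwaway directory with two MSAs and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.a3m", "a.a3m", "notes.txt"):
        (Path(tmp) / name).touch()
    list_file = Path(tmp) / "msa-list.txt"
    n_written = write_msa_list(tmp, list_file)
    listed = [Path(line).name for line in list_file.read_text().splitlines()]

print(n_written, listed)
```

The resulting file would then be given as the positional input together with `--list-file`.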

*-many commands run all pairwise combinations across two sets:

$ yunta af2-many --help
usage: yunta af2-many [-h] [--msa2 [MSA2 ...]] [--list-file] [--interspecies] [--strict-match]
                      [--output [OUTPUT]] [--params PARAMS] [--recycles RECYCLES] [--plot PLOT]
                      [msa1 ...]

positional arguments:
  msa1                  MSA file(s).

options:
  -h, --help            show this help message and exit
  --msa2 [MSA2 ...], -2 [MSA2 ...]
                        Second MSA file(s). Default: if not provided, all pairwise from msa1.
  --list-file, -l       Treat inputs as plain-text list of MSA files, rather than MSA filenames.
  --interspecies, -i    MSAs are from different species; enables built-in host-pathogen interaction map.
  --strict-match, -S    For interspecies mode, require query MSA sequences to be from known
                        interacting species.
  --output [OUTPUT], -o [OUTPUT]
                        Output filename. Default: STDOUT.
  --params PARAMS, -w PARAMS
                        Path to AlphaFold2 params file (.npz). Downloaded automatically if absent.
  --recycles RECYCLES, -x RECYCLES
                        Maximum number of recycles through the model. Default: 10.
  --plot PLOT, -p PLOT  Directory for saving plots. Default: don't plot.

Interspecies (host-pathogen) analysis

Use --interspecies / -i when the two MSAs come from organisms that interact as host and pathogen. yunta uses a built-in host-pathogen interaction (HPI) map to pair aligned sequences across species rather than requiring exact species identity:

$ yunta dca-single test/inputs/crypto/Q5CPK5_CRYPI.a3m \
    -2 test/inputs/human/EZRI_HUMAN.a3m \
    --interspecies --apc \
    -o test/outputs/dca-single-interspecies.tsv

By default (without --strict-match), the HPI constraint is relaxed for the query sequences themselves — useful when screening an uncharacterised query against a known host or pathogen proteome. Add --strict-match to require that query sequences come from species in the HPI map.

Python API

Load and inspect an MSA:

from yunta.structs.msa import MSA, PairedMSA

msa = MSA.from_file("my-msa-file.a3m")
print(msa)         # MSA(name=P07807) of sequence length 549, with 14246 sequences.
print(msa.neff())  # effective sequence count

Pair two MSAs and run DCA:

from yunta.structs.msa import MSA, PairedMSA
from yunta.interactions.dca.dca_torch import calculate_dca

msa1 = MSA.from_file("protein-a.a3m")
msa2 = MSA.from_file("protein-b.a3m")
paired = PairedMSA.from_msa(msa1, msa2)
contact_matrix = calculate_dca(paired, apc=True)

For interspecies pairing, pass interaction_map="builtin":

paired = PairedMSA.from_msa(msa1, msa2, interaction_map="builtin")

Or supply a custom dict mapping species IDs to lists of interacting species IDs:

paired = PairedMSA.from_msa(
    msa1, msa2,
    interaction_map={"NCBI:562": ["NCBI:10710"], "NCBI:10710": ["NCBI:562"]},
)
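The example above lists each species pair in both directions. If you start from a flat list of host-pathogen pairs, a helper like the following can build such a dict for you. This is a sketch: `interaction_map_from_pairs` is a hypothetical name, and recording both directions is an assumption based on the example, not a documented requirement.

```python
from collections import defaultdict


def interaction_map_from_pairs(pairs):
    """Build {species: [partners]} with every pair recorded in both directions."""
    partners = defaultdict(set)
    for host, pathogen in pairs:
        partners[host].add(pathogen)
        partners[pathogen].add(host)
    return {species: sorted(ids) for species, ids in partners.items()}


custom_map = interaction_map_from_pairs([("NCBI:562", "NCBI:10710")])
print(custom_map)
```

The result can be passed as `interaction_map=custom_map` in the call shown above.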

Run the full screening pipeline programmatically:

from yunta.screening import dca_one_vs_many, rf2track_one_vs_many

outputs = dca_one_vs_many(
    msa_file1="query.a3m",
    msa_file2=["target1.a3m", "target2.a3m"],
    apc=True,
    interaction_map="builtin",  # omit for same-species
)
for result_matrix, interaction_matrix, metrics in outputs:
    print(metrics.ID, metrics.focality)

Each element of outputs is a 3-tuple (full_contact_matrix, inter-chain_contact_matrix, metrics_dataclass). Metrics dataclasses (DCAMetrics, RF2TMetrics, AF2Metrics) can be written directly to TSV:

metrics.write("results.tsv")
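When screening one query against many targets, you will usually want to look at the most promising pairs first. The sketch below sorts the 3-tuples by focality, highest first. `FakeMetrics` is a hypothetical stand-in with only the two attributes used in the loop above, so the example runs without yunta installed; choosing focality as the ranking score is an illustrative choice, not a recommendation from the package.

```python
from dataclasses import dataclass


@dataclass
class FakeMetrics:
    """Hypothetical stand-in exposing only the fields used below."""
    ID: str
    focality: float


def rank_by_focality(outputs):
    """Sort (full_matrix, inter_chain_matrix, metrics) tuples, highest focality first."""
    return sorted(outputs, key=lambda item: item[2].focality, reverse=True)


fake_outputs = [
    (None, None, FakeMetrics("A-B", 1.2)),
    (None, None, FakeMetrics("A-C", 4.5)),
    (None, None, FakeMetrics("A-D", 2.0)),
]
ranked_ids = [metrics.ID for _, _, metrics in rank_by_focality(fake_outputs)]
print(ranked_ids)  # ['A-C', 'A-D', 'A-B']
```

Note that `sorted` consumes its input, so if the screening function returns a generator, the results are all held in memory once ranked.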

(More documentation coming soon!)

... if you want to scale up

While the *-many commands handle batches of PPIs, for large-scale screening across an HPC cluster our nf-ggi Nextflow pipeline is more efficient and can also generate MSAs for you.

Issues, problems, suggestions

Add to the issue tracker.
