scbirlab/yunta

🍐 yunta

Predicting pairwise protein-protein interactions and structures from multiple sequence alignments. Now with interspecies (host-pathogen) interactions and automatic chunking of large sequences!

yunta provides several implementations of protein-protein interaction evaluation. In order of increasing computational cost:

  • GPU-accelerated direct coupling analysis (DCA) in PyTorch
  • RoseTTAFold-2track via the rf2t-micro package
  • AlphaFold2 for protein-protein structure prediction

yunta has streamlined installation, a command-line interface, a Python API, and resilience to GPU out-of-memory errors through chunking of long sequences and CPU fallback. It takes as input unpaired multiple-sequence alignments in A3M format (as generated by tools like hhblits), and outputs a matrix of inter-residue contacts.

Rough timings for a pair of ~200 amino-acid proteins (S. cerevisiae DHFR and WW domain-containing protein) on CPU:

  • DCA: 5 seconds
  • RoseTTAFold-2track: 10 seconds
  • AlphaFold2: 1 hour

Note that times increase quadratically with total protein length.
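As a rough illustration of the quadratic scaling, the sketch below extrapolates the reference timings to a longer complex. The reference total length of 400 residues is an assumption (two proteins of ~200 amino acids each), and `scaled_runtime` is a hypothetical helper, not part of yunta.

```python
def scaled_runtime(base_seconds: float, base_total_len: int, total_len: int) -> float:
    """Extrapolate a runtime, assuming cost grows with the square of total length."""
    if base_total_len <= 0 or total_len <= 0:
        raise ValueError("lengths must be positive")
    return base_seconds * (total_len / base_total_len) ** 2


# Reference CPU timings from the list above, in seconds.
REFERENCE_SECONDS = {"DCA": 5, "RoseTTAFold-2track": 10, "AlphaFold2": 3600}
ASSUMED_REFERENCE_LENGTH = 400  # assumption: ~200 + ~200 residues

for method, seconds in REFERENCE_SECONDS.items():
    # Doubling the total length roughly quadruples the estimated runtime.
    estimate = scaled_runtime(seconds, ASSUMED_REFERENCE_LENGTH, 800)
    print(f"{method}: ~{estimate:.0f} s for 800 residues in total")
```

These are order-of-magnitude estimates only; actual timings depend on MSA depth and hardware.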

Installation

pip install yunta

To enable AlphaFold2 with CUDA 12 (recommended for GPU):

pip install yunta[af_cuda12]

For a local CUDA 12 installation:

pip install yunta[af_cuda12_local]

For CUDA 11:

pip install yunta[af_cuda11]

For AlphaFold2 without a specific CUDA version (CPU or custom JAX install):

pip install yunta[af]

To enable RoseTTAFold-2track:

pip install yunta[rf2t]

Using the embedded models requires the RoseTTAFold-2track and AlphaFold2 weights, which are downloaded automatically on first use. By downloading them, you agree that the trained RoseTTAFold weights are available for non-commercial use only, under the terms of the Rosetta-DL Software license, and that AlphaFold2's pretrained parameters are covered by the CC BY 4.0 license.

Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| YUNTA_CACHE | ~/.cache/yunta | Directory for the organism interaction lookup table cache. |
| YUNTA_USE_CACHE | False | Set to True to load a pre-built cache from disk rather than rebuilding. |
| YUNTA_TEST | 0 | Set to 1 to build and hold the interaction lookup table in memory only (no disk write). |
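The sketch below shows one way to resolve these variables with their documented defaults from Python. `yunta_env_settings` is a hypothetical helper; how yunta itself parses the values (for example, case sensitivity of `True`) is an assumption here.

```python
import os
from pathlib import Path


def yunta_env_settings(environ=None) -> dict:
    """Resolve the documented variables, falling back to the documented defaults."""
    env = os.environ if environ is None else environ
    return {
        "cache_dir": Path(env.get("YUNTA_CACHE", "~/.cache/yunta")).expanduser(),
        # Assumption: only the exact strings "True" and "1" enable these options.
        "use_cache": env.get("YUNTA_USE_CACHE", "False") == "True",
        "test_mode": env.get("YUNTA_TEST", "0") == "1",
    }


# Example with an explicit mapping instead of the real environment.
settings = yunta_env_settings({"YUNTA_CACHE": "/tmp/yunta-cache", "YUNTA_USE_CACHE": "True"})
print(settings)
```

To change the behaviour for real, export the variables in the shell (or set them in `os.environ`) before yunta is started.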

Credit

yunta is a fork of SpeedPPI, which is itself inspired by FoldDock. This method used AlphaFold2 to evaluate 65,484 protein-protein interactions from the human proteome in Towards a structurally resolved human protein interaction network.

The idea of using DCA, RoseTTAFold-2track, and AlphaFold2 in a cascade of increasingly expensive and specific PPI detection methods has been explored in a series of papers from David Baker's lab.

yunta puts these algorithms in one place with easy installation, a command-line interface, and a Python API. It also enables interspecies co-evolutionary analysis using a built-in host-pathogen interaction mapping.

Command-line usage

$ yunta --help
usage: yunta [-h] {dca-single,dca-many,rf2t-single,af2-single,af2-many} ...

Screening protein-protein interactions using DCA, RosettaFold-2track, and AlphaFold2.

options:
  -h, --help            show this help message and exit

Sub-commands:
  {dca-single,dca-many,rf2t-single,af2-single,af2-many}
                        Use these commands to specify the tool you want to use.
    dca-single          Calculate DCA for one protein-protein interaction.
    dca-many            Calculate DCA between two sets of proteins, or all pairs in one set of proteins.
    rf2t-single         Calculate RF-2track contacts between one protein and a series of others.
    af2-single          Model one protein-protein interaction.
    af2-many            Model all interactions between two sets of proteins, or all pairs in one set of proteins.

Generating multiple-sequence alignments

All algorithms depend on pre-computed multiple-sequence alignments (MSAs) between a protein of interest and as many homologs as possible. You can generate MSAs using hhblits with pre-clustered databases like UniClust:

hhblits -e 0.01 -v 3 -d /path/to/UniClust-database -i input.fasta -oa3m output-msa.a3m -o /dev/null -cov 60 -n 3 -realign -realign_max 10000

This typically takes 1–40 min depending on query complexity. See the hhsuite documentation for details.
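When many proteins need MSAs, the hhblits call above can be assembled per FASTA file from Python. This is a minimal sketch: `hhblits_command`, the `msas` output directory, and the example FASTA names are hypothetical; the flags are copied from the command above.

```python
import shlex
from pathlib import Path


def hhblits_command(fasta, database, out_dir="msas"):
    """Build the hhblits argument list for one FASTA file (does not run it)."""
    fasta = Path(fasta)
    out_a3m = Path(out_dir) / f"{fasta.stem}.a3m"  # one A3M per input, named after it
    return [
        "hhblits", "-e", "0.01", "-v", "3",
        "-d", str(database),
        "-i", str(fasta),
        "-oa3m", str(out_a3m),
        "-o", "/dev/null",
        "-cov", "60", "-n", "3",
        "-realign", "-realign_max", "10000",
    ]


# Print the commands for two hypothetical inputs.
for fasta in ["DYR_YEAST.fasta", "CAPZA_YEAST.fasta"]:
    print(shlex.join(hhblits_command(fasta, "/path/to/UniClust-database")))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once hhblits is installed and the output directory exists.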

Calculating contact maps

Given two MSAs, yunta calculates a contact map using DCA, RF2t, or AlphaFold2, and produces a summary table for each pair.

Using DCA or RF2t produces a table like this:

$ yunta dca-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/dca-single.tsv --apc
ID uniprot_id_1 uniprot_id_2 seq_len chain_a_len chain_b_len msa1_depth msa2_depth msa_depth n_eff DCA:apc DCA:mean DCA:median DCA:maximum DCA:minimum DCA:var DCA:sigma1 DCA:focality DCA:top_A DCA:top_B
O13297-D6VTK4 O13297 D6VTK4 980 549 431 14246 1546 670 2 False 0.0183 0.0147 0.0743 2.28e-06 ... ... ... ... ...

Method-specific columns are prefixed with DCA: or RF2t:. Common columns across all methods:

  • sigma1: leading singular value of the inter-chain contact submatrix
  • focality: ratio of first to second singular value; higher values indicate a more concentrated interaction signal
  • top_A, top_B: indices of the top-scoring residues in each chain (from the leading pair of singular vectors)
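The three statistics above can be sketched with NumPy as follows. This illustrates the definitions and is not yunta's own code: `interchain_summary`, the `top_k` parameter, the 0-based indexing, and the ranking by absolute singular-vector entries are all assumptions.

```python
import numpy as np


def interchain_summary(contacts, chain_a_len, top_k=5):
    """Summarise the inter-chain block of a square contact matrix via SVD."""
    contacts = np.asarray(contacts, dtype=float)
    # Rows belong to chain A, columns to chain B.
    inter = contacts[:chain_a_len, chain_a_len:]
    u, s, vt = np.linalg.svd(inter, full_matrices=False)
    sigma1 = float(s[0])
    # Ratio of first to second singular value; infinite if there is no second one.
    focality = float(s[0] / s[1]) if len(s) > 1 and s[1] > 0 else float("inf")
    # Residues with the largest weight in the leading singular vectors.
    top_a = np.argsort(-np.abs(u[:, 0]))[:top_k].tolist()
    top_b = np.argsort(-np.abs(vt[0]))[:top_k].tolist()
    return sigma1, focality, top_a, top_b


# Toy complex: chains of 3 residues each, one strong A-B contact on a weak background.
toy = np.full((6, 6), 0.01)
toy[1, 5] = toy[5, 1] = 1.0  # residue 1 of chain A contacts residue 2 of chain B
sigma1, focality, top_a, top_b = interchain_summary(toy, chain_a_len=3, top_k=1)
print(sigma1, focality, top_a, top_b)
```

A single dominant contact patch gives a large focality, whereas diffuse background signal spreads across several singular values and lowers it.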

If you also pass --plot, contact maps for the full complex and for the inter-chain contacts alone are saved as PNG files, alongside CSV files of the raw matrices:

$ yunta dca-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/dca-single.tsv --apc --plot test/outputs/DYR_YEAST-CAPZA_YEAST
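The summary table written with -o can be post-processed with the standard library. The sketch below ranks pairs by one column; it assumes the file is tab-delimited with a header row containing `ID` and the chosen column (here `DCA:focality`, as in the example header above). `rank_pairs` is a hypothetical helper.

```python
import csv


def rank_pairs(tsv_path, column="DCA:focality", top=10):
    """Return the top (pair ID, score) tuples, sorted by descending score."""
    with open(tsv_path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    scored = []
    for row in rows:
        try:
            score = float(row[column])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows where the column is missing or not numeric
        scored.append((score, row["ID"]))
    scored.sort(reverse=True)
    return [(pair_id, score) for score, pair_id in scored[:top]]


# Usage (path from the example above):
# for pair_id, score in rank_pairs("test/outputs/dca-single.tsv"):
#     print(pair_id, score)
```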

Predicting protein complex structures

yunta can feed MSAs into AlphaFold2 to predict binary protein complex structures:

$ yunta af2-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/af2-single.tsv

This writes a summary TSV with AF2:-prefixed metrics (n_contacts, mean_interface_plddt, pdockq, seed) in addition to the standard contact map statistics. PDB structure files are written to the current working directory, named by protein pair ID.
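For orientation, the pDockQ score as published with FoldDock is a sigmoid of the mean interface pLDDT multiplied by the log of the number of interface contacts. The sketch below uses the published FoldDock constants; that yunta computes its pdockq column in exactly this way is an assumption, and the function name is illustrative.

```python
import math


def pdockq(mean_interface_plddt: float, n_contacts: int) -> float:
    """pDockQ as defined for FoldDock: sigmoid of pLDDT * log10(contacts)."""
    if n_contacts < 1:
        return 0.0  # no interface contacts, so no predicted interface
    x = mean_interface_plddt * math.log10(n_contacts)
    # Published FoldDock fit: L = 0.724, x0 = 152.611, k = 0.052, b = 0.018
    return 0.724 / (1.0 + math.exp(-0.052 * (x - 152.611))) + 0.018


# A confident, well-packed interface scores higher than a weak one.
print(pdockq(90.0, 100), pdockq(50.0, 10))
```

Higher values indicate a more confident predicted interface; the score saturates at 0.742 (0.724 + 0.018).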

Using --plot generates contact map plots as with the other commands:

$ yunta af2-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/af2-single.tsv --plot test/outputs/af2-single-plot

Command-line tools

*-single commands run one protein against one or more others:

$ yunta dca-single --help
usage: yunta dca-single [-h] [--msa2 [MSA2 ...]] [--list-file] [--interspecies] [--strict-match]
                        [--output [OUTPUT]] [--plot PLOT] [--apc] [msa1]

positional arguments:
  msa1                  MSA file. Default: STDIN.

options:
  -h, --help            show this help message and exit
  --msa2 [MSA2 ...], -2 [MSA2 ...]
                        Second MSA file(s). Default: if not provided, all pairwise from msa1.
  --list-file, -l       Treat inputs as plain-text list of MSA files, rather than MSA filenames.
                        Default: treat as MSA filenames.
  --interspecies, -i    MSAs are from different species; enables built-in host-pathogen interaction
                        map. Default: assume same species.
  --strict-match, -S    For interspecies mode, require query MSA sequences to be from known
                        interacting species. Default: relax this constraint for query sequences.
  --output [OUTPUT], -o [OUTPUT]
                        Output filename. Default: STDOUT.
  --plot PLOT, -p PLOT  Directory for saving plots. Default: don't plot.
  --apc, -a             Apply average product correction (APC) to DCA scores. Default: off.

If one MSA is provided (no -2), homodimeric interactions are probed. Use --list-file to pass a single plain-text file containing one MSA path per line.
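A list file for --list-file can be generated from a directory of alignments. This is a minimal sketch: `write_msa_list` and the file names are hypothetical, and it assumes yunta accepts one path per line as described above.

```python
from pathlib import Path


def write_msa_list(msa_dir, list_path, pattern="*.a3m"):
    """Write one MSA path per line, sorted for reproducibility; return the paths."""
    paths = sorted(str(p) for p in Path(msa_dir).glob(pattern))
    Path(list_path).write_text("\n".join(paths) + ("\n" if paths else ""))
    return paths


# Usage:
# write_msa_list("test/inputs", "msa-list.txt")
# then pass msa-list.txt as the input together with --list-file
```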

*-many commands run all pairwise combinations across two sets:

$ yunta af2-many --help
usage: yunta af2-many [-h] [--msa2 [MSA2 ...]] [--list-file] [--interspecies] [--strict-match]
                      [--output [OUTPUT]] [--params PARAMS] [--recycles RECYCLES] [--plot PLOT]
                      [msa1 ...]

positional arguments:
  msa1                  MSA file(s).

options:
  -h, --help            show this help message and exit
  --msa2 [MSA2 ...], -2 [MSA2 ...]
                        Second MSA file(s). Default: if not provided, all pairwise from msa1.
  --list-file, -l       Treat inputs as plain-text list of MSA files, rather than MSA filenames.
  --interspecies, -i    MSAs are from different species; enables built-in host-pathogen interaction map.
  --strict-match, -S    For interspecies mode, require query MSA sequences to be from known
                        interacting species.
  --output [OUTPUT], -o [OUTPUT]
                        Output filename. Default: STDOUT.
  --params PARAMS, -w PARAMS
                        Path to AlphaFold2 params file (.npz). Downloaded automatically if absent.
  --recycles RECYCLES, -x RECYCLES
                        Maximum number of recycles through the model. Default: 10.
  --plot PLOT, -p PLOT  Directory for saving plots. Default: don't plot.
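To gauge how many predictions a *-many run implies, the pairings can be enumerated up front. This sketch is illustrative: `enumerate_pairs` is a hypothetical helper, and whether yunta also includes self-pairs in the one-set case is not stated above, so it is left as an option.

```python
from itertools import combinations, product


def enumerate_pairs(set1, set2=None, include_self=False):
    """List the MSA pairs for two sets (all cross pairs) or one set (all pairs within it)."""
    if set2:
        return list(product(set1, set2))
    pairs = list(combinations(set1, 2))
    if include_self:
        # Optionally add homodimer pairs (assumption: not included by default).
        pairs = [(msa, msa) for msa in set1] + pairs
    return pairs


one_set = ["a.a3m", "b.a3m", "c.a3m"]
print(len(enumerate_pairs(one_set)))                      # pairs within one set
print(len(enumerate_pairs(["a.a3m"], ["x.a3m", "y.a3m"])))  # cross pairs
```

The number of pairs within one set grows quadratically with its size, which, combined with the per-pair AlphaFold2 cost, dominates the runtime of large screens.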

Interspecies (host-pathogen) analysis

Use --interspecies / -i when the two MSAs come from organisms that interact as host and pathogen. yunta uses a built-in host-pathogen interaction (HPI) map to pair aligned sequences across species rather than requiring exact species identity:

$ yunta dca-single test/inputs/crypto/Q5CPK5_CRYPI.a3m \
    -2 test/inputs/human/EZRI_HUMAN.a3m \
    --interspecies --apc \
    -o test/outputs/dca-single-interspecies.tsv

By default (without --strict-match), the HPI constraint is relaxed for the query sequences themselves, which is useful when screening an uncharacterised query against a known host or pathogen proteome. Add --strict-match to require that query sequences come from species in the HPI map.

Python API

Load and inspect an MSA:

from yunta.structs.msa import MSA, PairedMSA

msa = MSA.from_file("my-msa-file.a3m")
print(msa)         # MSA(name=P07807) of sequence length 549, with 14246 sequences.
print(msa.neff())  # effective sequence count

Pair two MSAs and run DCA:

from yunta.structs.msa import MSA, PairedMSA
from yunta.interactions.dca.dca_torch import calculate_dca

msa1 = MSA.from_file("protein-a.a3m")
msa2 = MSA.from_file("protein-b.a3m")
paired = PairedMSA.from_msa(msa1, msa2)
contact_matrix = calculate_dca(paired, apc=True)
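For reference, the average product correction enabled by `apc=True` is commonly defined as subtracting, from each score, the product of its row and column means divided by the overall mean. The NumPy sketch below shows that textbook form; it is not yunta's implementation, which may for example treat the diagonal differently.

```python
import numpy as np


def average_product_correction(scores):
    """Textbook APC: S_ij - mean_i(S) * mean_j(S) / mean(S)."""
    scores = np.asarray(scores, dtype=float)
    row_mean = scores.mean(axis=1, keepdims=True)
    col_mean = scores.mean(axis=0, keepdims=True)
    overall = scores.mean()
    if overall == 0:
        return scores.copy()  # nothing to correct; avoids division by zero
    return scores - row_mean * col_mean / overall


# A purely "background" matrix (outer product of per-residue biases) is removed entirely.
bias = np.array([1.0, 2.0, 3.0])
print(average_product_correction(np.outer(bias, bias)))
```

The correction removes signal that can be explained by per-position biases alone, so what remains is more specific to individual residue pairs.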

For interspecies pairing, pass interaction_map="builtin":

paired = PairedMSA.from_msa(msa1, msa2, interaction_map="builtin")

Or supply a custom dict mapping species IDs to lists of interacting species IDs:

paired = PairedMSA.from_msa(
    msa1, msa2,
    interaction_map={"NCBI:562": ["NCBI:10710"], "NCBI:10710": ["NCBI:562"]},
)
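A custom map like the one above can be built from a list of interacting species pairs. This is a minimal sketch: `symmetric_interaction_map` is a hypothetical helper, and listing both directions mirrors the example above rather than a documented requirement.

```python
from collections import defaultdict


def symmetric_interaction_map(pairs):
    """Turn (species_a, species_b) pairs into a dict listing partners in both directions."""
    partners = defaultdict(set)
    for a, b in pairs:
        partners[a].add(b)
        partners[b].add(a)
    # Sorted lists keep the output deterministic.
    return {species: sorted(others) for species, others in partners.items()}


interaction_map = symmetric_interaction_map([("NCBI:562", "NCBI:10710")])
print(interaction_map)
```

The resulting dict has the same shape as the `interaction_map` argument shown above.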

Run the full screening pipeline programmatically:

from yunta.screening import dca_one_vs_many, rf2track_one_vs_many

outputs = dca_one_vs_many(
    msa_file1="query.a3m",
    msa_file2=["target1.a3m", "target2.a3m"],
    apc=True,
    interaction_map="builtin",  # omit for same-species
)
for result_matrix, interaction_matrix, metrics in outputs:
    print(metrics.ID, metrics.focality)

Each element of outputs is a 3-tuple (full_contact_matrix, inter_chain_contact_matrix, metrics_dataclass). Metrics dataclasses (DCAMetrics, RF2TMetrics, AF2Metrics) can be written directly to TSV:

metrics.write("results.tsv")

(More documentation coming soon!)

... if you want to scale up

While the *-many commands handle batches of PPIs, for large-scale screening across an HPC cluster our nf-ggi Nextflow pipeline is more efficient and can also generate MSAs for you.

Issues, problems, suggestions

Add to the issue tracker.

Further help
