timodonnell/pyconfind

pyconfind

A modern Python implementation of ConFind, the rotamer-based protein side-chain contact-degree analysis introduced in Zheng et al. 2015 and Zheng et al. 2017.

The Python output is byte-for-byte identical to the upstream C++ binary on 248 of 253 real structures tested (100 single-chain PDB + 100 AlphaFold DB + 50 multi-chain + 3 high-resolution; see docs/stress_test_results.md). A further 100 RCSB entries were cross-checked as both PDB and mmCIF. The 5 exceptions are structures with insertion codes, where the C++ ordering relies on undefined behavior (documented). The test suite runs against real PDB/mmCIF structures with committed C++-reference contact maps.

pyconfind is also faster than the C++ binary, with two interchangeable contact-degree backends (both byte-identical to the reference):

  • a pure NumPy/SciPy reference, which on its own already beats the C++ binary;
  • an optional Numba JIT/multi-threaded backend (pip install pyconfind[fast]) that is ~3× faster again.

With the Numba backend and the rotamer library pre-loaded, per-structure analysis in full mode is ~7-11× faster than the C++ binary (median ~10× over the benchmark set). Setting native_only=True is ~26× faster again: under 0.36 s for every structure in the benchmark set (the largest has 555 residues) and ~0.07 s for small ones.

Figure: runtime vs sequence length.

Left: full analysis (every position considers all 18 substitutable AAs). Right: native_only=True — only the native AA is placed at each position (see native-only mode). The rotamer library is loaded once before measurement and excluded from every timing, so the numbers reflect per-structure analysis only. See docs/benchmark.md for the structure set and the harness.
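The timing convention described above (load the rotamer library once, then time only the per-structure work) can be sketched as a small harness. This is a minimal illustration and not the harness from docs/benchmark.md; `time_per_structure`, `setup`, and `work` are hypothetical names, and the stand-in callables below take the place of the real library load and analysis.

```python
import time
from statistics import median


def time_per_structure(setup, work, items, repeats=3):
    """Run setup() once, untimed, then time work(state, item) for each item.

    Returns {item: median seconds over `repeats` runs}.
    """
    state = setup()  # one-time cost (e.g. loading a rotamer library), excluded from timing
    timings = {}
    for item in items:
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            work(state, item)
            samples.append(time.perf_counter() - start)
        timings[item] = median(samples)
    return timings


# Stand-ins for demonstration only: a fake "library" and a fake "analysis".
calls = {"setup": 0}


def fake_setup():
    calls["setup"] += 1
    return {"library": "loaded"}


def fake_work(state, item):
    return sum(range(len(item) * 1000))


timings = time_per_structure(fake_setup, fake_work, ["1abc", "2xyzz"])
```

Taking the median over a few repeats reduces the influence of one-off effects such as JIT compilation on the first call, which matters when comparing a JIT backend against a precompiled binary.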

Install

pip install pyconfind            # pure-Python reference backend
pip install "pyconfind[fast]"    # + Numba JIT/multi-threaded backend

From source (for development):

pip install -e ".[dev]"          # editable install with test/lint tooling

Example notebook

Open In Colab

examples/pyconfind_demo.ipynb is a runnable walkthrough (install → fetch a PDB → analyze via the library API → visualize a contact map, per-residue scores, and a 3D structure colored by contact degree). Click the badge to run it on a free Colab CPU runtime.

Quick start

The rotamer library is optional — if you don't pass one, pyconfind downloads the Dunbrack 2010 library once (~6 MB) and caches it per-user (via platformdirs), so the simplest invocation is just:

pyconfind --p input.pdb --o out.cont          # library auto-downloaded + cached

CLI (matches the original confind flag names, so existing pipelines drop in; pass --rLib to use your own library):

# Inputs may be PDB or mmCIF (format auto-detected via gemmi):
pyconfind --p input.cif --o out.cont
# Modern structured output:
pyconfind --p input.pdb --json --o out.json
# Only consider the native AA at each position (no AA substitution):
pyconfind --p input.pdb --native-only --o out.cont
# Restrict the computed/output residues (MSL selection language):
pyconfind --p input.pdb --sel "chain A AND resi 20-60" --o out.cont
# Pre-select part of the structure before anything runs:
pyconfind --p input.pdb --psel "NAME CA WITHIN 25 OF CHAIN A" --o out.cont
# Use your own library:
pyconfind --p input.pdb --rLib path/to/rotlibs --o out.cont
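For batch pipelines, the flags shown above can be assembled programmatically. The sketch below only builds the argument list and does not run anything; `build_command` is a hypothetical helper, and it assumes the listed flags can be freely combined in one invocation, which the examples above do not confirm.

```python
import shlex


def build_command(input_path, output_path, *, native_only=False, as_json=False,
                  sel=None, psel=None, rlib=None):
    """Build a pyconfind argument list from the flags shown in this README."""
    cmd = ["pyconfind", "--p", str(input_path), "--o", str(output_path)]
    if as_json:
        cmd.append("--json")
    if native_only:
        cmd.append("--native-only")
    if sel is not None:
        cmd += ["--sel", sel]      # restrict computed/output residues
    if psel is not None:
        cmd += ["--psel", psel]    # pre-select part of the structure
    if rlib is not None:
        cmd += ["--rLib", str(rlib)]
    return cmd


cmd = build_command("input.pdb", "out.cont",
                    native_only=True, sel="chain A AND resi 20-60")
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting problems with selection strings that contain spaces.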

Library API:

from pyconfind import analyze

result = analyze("input.pdb")           # library auto-downloaded + cached
positions = result.positions_dataframe()  # one row per residue
contacts  = result.contacts_dataframe()   # one row per residue-residue contact
contacts.nlargest(10, "degree")           # the ten highest-degree contacts

analyze() also takes an assembly= argument. By default it picks the first biological assembly, which is what you want for crystal structures whose asymmetric unit contains multiple independent copies of the complex (e.g. antibody/antigen structures like 5TRU). Pass assembly=None to keep the asymmetric unit as-is.
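To see what the assembly choice changes for a given structure, one option is to run both settings and compare the number of contact rows. This is a minimal sketch that uses only the calls shown above (`analyze`, `assembly=None`, `contacts_dataframe()`); `contact_counts_by_assembly` and its `analyze_fn` parameter are hypothetical, the latter added so the function can be exercised without pyconfind installed.

```python
def contact_counts_by_assembly(path, analyze_fn=None):
    """Count contact rows for the default assembly and for the asymmetric unit."""
    if analyze_fn is None:
        # Imported lazily so this sketch can be loaded without pyconfind installed.
        from pyconfind import analyze as analyze_fn
    default_result = analyze_fn(path)              # default: first biological assembly
    asu_result = analyze_fn(path, assembly=None)   # asymmetric unit kept as-is
    return {
        "first_assembly": len(default_result.contacts_dataframe()),
        "asymmetric_unit": len(asu_result.contacts_dataframe()),
    }
```

With pyconfind installed, `contact_counts_by_assembly("input.pdb")` would run the analysis twice, so expect roughly double the runtime of a single call.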

Rotamer libraries

Out of the box, pyconfind supports the Dunbrack 2010 MSL-format library that ships with the upstream confind source (EBL.out + BEBL.out); leave --rLib unset to auto-download it. Point --rLib at your own directory containing both files to use a different library. Only backbone-dependent libraries are supported.
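Since a custom library directory must contain both files, a quick pre-flight check can catch a wrong path before a long batch run. The sketch below is an illustration only; `missing_rotlib_files` is a hypothetical helper, and it checks file presence, not file format.

```python
from pathlib import Path

# File names taken from the paragraph above.
REQUIRED_FILES = ("EBL.out", "BEBL.out")


def missing_rotlib_files(rlib_dir):
    """Return the required library files that are absent from rlib_dir."""
    root = Path(rlib_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list means the directory looks usable as an `--rLib` argument; anything else names the files still to be added.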

Modern Dunbrack and Richardson-style libraries are next on the roadmap.

Native-only mode (extension over the C++ binary)

The original C++ confind substitutes in all 18 non-Gly/Pro amino acids at every position and computes contact degree across the full rotamer space. pyconfind adds --native-only, which places only rotamers of the native amino acid at each position (while still considering every rotamer of that amino acid).
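The difference in search space between the two modes can be illustrated with a toy count of amino-acid identities placed per position. This is a conceptual sketch and not pyconfind's internals: `candidates` and `placements` are hypothetical names, rotamer counts per amino acid are ignored, and how native Gly/Pro positions are treated in native-only mode is not covered here, so the example sequence avoids them.

```python
STANDARD_AAS = "ACDEFGHIKLMNPQRSTVWY"
# The 18 substitutable amino acids: all standard ones except Gly (G) and Pro (P).
SUBSTITUTABLE = tuple(aa for aa in STANDARD_AAS if aa not in "GP")


def candidates(native, native_only=False):
    """Amino-acid identities considered at a position with the given native residue."""
    if native_only:
        return (native,)
    return SUBSTITUTABLE


def placements(sequence, native_only=False):
    """Total amino-acid identities placed across all positions of a sequence."""
    return sum(len(candidates(aa, native_only)) for aa in sequence)


full = placements("MKVL")                         # 4 positions x 18 = 72
native = placements("MKVL", native_only=True)     # 4 positions x 1 = 4
```

The observed speedup need not match this 18-fold ratio, because the number of rotamers differs between amino acids and the pairwise contact computation does not scale linearly with the number of placements.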

Validation

The C++ reference binary is built from the upstream tarball by:

scripts/build-reference.sh

The byte-identity tests then compare pyconfind's output against the C++ output on every example PDB. To run them yourself:

pytest tests/
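For a one-off check outside the test suite, byte identity between two output files can be verified directly. This is a minimal sketch, not the project's test code; `first_difference` is a hypothetical helper, and the file names in the usage note are placeholders.

```python
from pathlib import Path


def first_difference(path_a, path_b):
    """Return None if the files are byte-identical, else the first differing offset."""
    a = Path(path_a).read_bytes()
    b = Path(path_b).read_bytes()
    if a == b:
        return None
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # No differing byte in the shared prefix: one file is a prefix of the other.
    return min(len(a), len(b))
```

For example, `first_difference("python_out.cont", "cpp_out.cont")` returns `None` when the two outputs match exactly, and otherwise an offset that shows where to start looking.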

References

  • "Sequence statistics of tertiary structural motifs reflect protein stability", F. Zheng, G. Grigoryan, PLoS ONE, 12(5): e0178272, 2017.

  • "Tertiary Structural Propensities Reveal Fundamental Sequence/Structure Relationships", F. Zheng, J. Zhang, G. Grigoryan, Structure, 23(5): 961-971, 2015.

About

Python implementation of ConFind from Grigoryan Lab
