A modern Python implementation of ConFind, the rotamer-based protein side-chain contact-degree analysis introduced in Zheng et al. (2015) and Zheng et al. (2017).
The Python output is byte-for-byte identical to the upstream C++ binary on 248 of 253 real structures tested (100 single-chain PDB + 100 AlphaFold DB + 50 multi-chain + 3 high-resolution; see docs/stress_test_results.md), plus a further 100 RCSB entries cross-checked as both PDB and mmCIF. The 5 exceptions are insertion-code structures where the C++ ordering relies on undefined behavior (documented). The test suite runs against real PDB/mmCIF structures with committed C++-reference contact maps.
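To make "byte-for-byte identical" concrete, a comparison of this kind can be sketched as below. This is not the repository's actual test code; the function name `first_difference` and the file names in the example call are hypothetical, and only the standard library is used.

```python
from pathlib import Path


def first_difference(produced, reference):
    """Return None if two files are byte-identical, else the first differing offset."""
    a = Path(produced).read_bytes()
    b = Path(reference).read_bytes()
    if a == b:
        return None
    # Scan the common prefix for the first mismatching byte.
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # One file is a strict prefix of the other: they differ where the shorter ends.
    return min(len(a), len(b))


# Example call (hypothetical file names):
# offset = first_difference("out.cont", "reference/out.cont")
```

A return value of `None` means the two contact maps match exactly; any integer points at the first byte worth inspecting.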
pyconfind is also faster than the C++ binary, with two interchangeable contact-degree backends (both byte-identical to the reference):
- a pure NumPy/SciPy reference, which on its own already beats the C++ binary;
- an optional Numba JIT/multi-threaded backend (pip install pyconfind[fast]) that is ~3× faster again.
With the Numba backend and the rotamer library pre-loaded, per-structure
analysis is ~7-11× faster than the C++ binary in full mode (median ~10×
over the benchmark set). native_only=True is a further ~26× faster:
under 0.36 s for every structure in the benchmark set (largest 555
residues), and ~0.07 s for small ones.
[Figure: benchmark timings.] Left: full analysis (every position considers all 18 substitutable AAs).
Right: native_only=True, where only the native AA is placed at each position
(see native-only mode). The
rotamer library is loaded once before measurement and excluded from every
timing, so the numbers reflect per-structure analysis only. See
docs/benchmark.md for the structure set and the harness.
pip install pyconfind # pure-Python reference backend
pip install "pyconfind[fast]" # + Numba JIT/multi-threaded backend

From source (for development):
pip install -e ".[dev]" # editable install with test/lint tooling

examples/pyconfind_demo.ipynb is a runnable
walkthrough (install → fetch a PDB → analyze via the library API → visualize a
contact map, per-residue scores, and a 3D structure colored by contact degree).
It runs on a free Colab CPU runtime.
The rotamer library is optional — if you don't pass one, pyconfind downloads
the Dunbrack 2010 library once (~6 MB) and caches it per-user (via
platformdirs), so the simplest invocation is just:
pyconfind --p input.pdb --o out.cont # library auto-downloaded + cached

CLI (matches the original confind flag names, so existing pipelines drop in;
pass --rLib to use your own library):
# Inputs may be PDB or mmCIF (format auto-detected via gemmi):
pyconfind --p input.cif --o out.cont
# Modern structured output:
pyconfind --p input.pdb --json --o out.json
# Only consider the native AA at each position (no AA substitution):
pyconfind --p input.pdb --native-only --o out.cont
# Restrict the computed/output residues (MSL selection language):
pyconfind --p input.pdb --sel "chain A AND resi 20-60" --o out.cont
# Pre-select part of the structure before anything runs:
pyconfind --p input.pdb --psel "NAME CA WITHIN 25 OF CHAIN A" --o out.cont
# Use your own library:
pyconfind --p input.pdb --rLib path/to/rotlibs --o out.cont

Library API:
from pyconfind import analyze
result = analyze("input.pdb") # library auto-downloaded + cached
positions = result.positions_dataframe() # one row per residue
contacts = result.contacts_dataframe() # one row per residue-residue contact
contacts.nlargest(10, "degree") # ten highest-degree contacts

analyze() also takes an assembly= argument; by default it picks the first
biological assembly, which is what you want for crystal structures whose
asymmetric unit contains multiple independent copies of the complex
(e.g. antibody/antigen structures like 5TRU). Pass assembly=None to keep
the asymmetric unit as-is.
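As a rough sketch of how the assembly choice might be compared in practice, the helper below runs analyze() twice, once with the default (first biological assembly) and once with assembly=None, and reports how many contact rows each yields. Only analyze(), assembly=None, and contacts_dataframe() come from the text above; the helper name `compare_assembly_choice` is hypothetical, and the import is deferred so the function can be defined without pyconfind installed.

```python
def compare_assembly_choice(path):
    """Count contact rows for the default assembly vs. the raw asymmetric unit.

    Hypothetical helper: only analyze(), assembly=None, and
    contacts_dataframe() are taken from the README text.
    """
    # Deferred import: defining this helper does not require pyconfind.
    from pyconfind import analyze

    default_result = analyze(path)              # first biological assembly
    asym_result = analyze(path, assembly=None)  # asymmetric unit as-is

    return {
        "first_assembly": len(default_result.contacts_dataframe()),
        "asymmetric_unit": len(asym_result.contacts_dataframe()),
    }


# Example call (needs pyconfind installed and a structure file on disk):
# counts = compare_assembly_choice("input.pdb")
```

Whether the two counts differ depends on whether the asymmetric unit holds more than one copy of the complex.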
Out of the box, pyconfind supports the Dunbrack 2010 MSL-format library that
ships with the upstream confind source (EBL.out + BEBL.out); leave
--rLib unset to auto-download it. Point --rLib at your own directory
containing both files to use a different library. Only backbone-dependent
libraries are supported.
Modern Dunbrack and Richardson-style libraries are next on the roadmap.
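When scripting the CLI, it can help to check a custom library directory before handing it to --rLib. The sketch below assumes only what is stated above: the directory must contain EBL.out and BEBL.out, and omitting --rLib triggers the auto-download. The helper names `rlib_args` and `build_command` are hypothetical, and pyconfind may well perform its own validation.

```python
from pathlib import Path

REQUIRED_FILES = ("EBL.out", "BEBL.out")  # file names given in the README


def rlib_args(rotlib_dir=None):
    """Return the CLI arguments that select a rotamer library.

    None -> [] (leave --rLib unset so the default library is auto-downloaded).
    A directory -> ["--rLib", <dir>] after checking both files are present.
    """
    if rotlib_dir is None:
        return []
    rotlib_dir = Path(rotlib_dir)
    missing = [name for name in REQUIRED_FILES if not (rotlib_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(
            f"{rotlib_dir} is missing rotamer library file(s): {', '.join(missing)}"
        )
    return ["--rLib", str(rotlib_dir)]


def build_command(structure, output, rotlib_dir=None):
    """Assemble a pyconfind command as an argument list (nothing is executed here)."""
    return ["pyconfind", "--p", str(structure), "--o", str(output), *rlib_args(rotlib_dir)]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once pyconfind is installed.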
The original C++ confind substitutes in all 18 non-Gly/Pro amino acids at
every position and computes contact degree across the full rotamer space.
pyconfind adds --native-only: at each position, only place rotamers of the
native amino acid (but still consider every rotamer of that AA).
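The difference between the two modes can be illustrated with a toy model that counts amino-acid types per position. This is not pyconfind's implementation: the function name `candidate_amino_acids` and the four-residue sequence are made up, the model counts amino-acid types rather than rotamers, and it does not attempt to model how native Gly or Pro positions are treated, which the text above does not specify.

```python
# The 20 standard amino acids minus Gly and Pro leaves 18.
SUBSTITUTABLE = (
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "SER", "THR", "TRP", "TYR", "VAL",
)


def candidate_amino_acids(native, native_only=False):
    """Amino acids whose rotamers would be placed at one position (toy model)."""
    if native_only:
        return (native,)  # only the native amino acid
    return SUBSTITUTABLE  # every substitutable amino acid


sequence = ["MET", "LYS", "TRP", "VAL"]  # made-up four-residue example
full = sum(len(candidate_amino_acids(aa)) for aa in sequence)
native = sum(len(candidate_amino_acids(aa, native_only=True)) for aa in sequence)
print(full, native)  # 72 4
```

In this toy count the full mode considers 18 amino-acid types per position and native-only considers one, which gives a feel for why native-only runs are so much cheaper.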
The C++ reference binary is built from the upstream tarball by:
scripts/build-reference.sh

The byte-identity tests then compare pyconfind's output against the C++ output on every example PDB. To run them yourself:
pytest tests/

- "Sequence statistics of tertiary structural motifs reflect protein stability", F. Zheng, G. Grigoryan, PLoS ONE, 12(5): e0178272, 2017.
- "Tertiary Structural Propensities Reveal Fundamental Sequence/Structure Relationships", F. Zheng, J. Zhang, G. Grigoryan, Structure, 23(5): 961-971, 2015.
