Adding ligand processing to dataset and encoders.#86
Conversation
📝 WalkthroughWalkthrough
ChangesLigand atoms and cached encoder fusion
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR extends the data pipeline and cached-embedding encoders to include non-protein, non-water heavy atoms (“ligands”) as additional protein-type context nodes by default, and updates related scripts/tests accordingly.
Changes:
- Update
parse_asu_with_biotite()andProteinWaterDatasetpreprocessing to produce/append ligand atoms and persist anis_ligandmask withresidue_index = -1. - Add ligand handling to cached embedding encoders via a learnable ligand projection (
ligand_embed) and ensure residue pooling ignores the-1sentinel. - Update embedding-generation scripts and expand integration tests/fixtures for ligand parsing and node inclusion.
Reviewed changes
Copilot reviewed 9 out of 11 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
uv.lock |
Updates locked dependency set (adds jaxtyping, adjusts PyMOL package, other lock changes). |
pyproject.toml |
Switches PyMOL dependency to pymol-open-source-whl>=3.1.0.4 and adds jaxtyping. |
src/dataset.py |
Adds ligand parsing, dataset flag include_ligands, appends ligand nodes, and stores is_ligand in cache/data. |
src/encoder_base.py |
Requires embedding_dim for cached encoders and adds ligand projection for ligand nodes. |
src/gvp_encoder.py |
Filters out residue_index < 0 entries during residue pooling to avoid scatter issues. |
scripts/generate_esm_embeddings.py |
Updates unpacking of parse_asu_with_biotite() return to 3-tuple. |
scripts/generate_slae_embeddings.py |
Marks SLAE as legacy and updates unpacking of parse_asu_with_biotite() return to 3-tuple. |
tests/conftest.py |
Adds a pdb_4h0b fixture for ligand integration tests. |
tests/test_dataset.py |
Adds ligand parsing + include_ligands integration tests and updates existing parsing call sites. |
tests/test_encoder.py |
Updates cached encoder tests for required embedding_dim and ligand projection parameterization. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| self.cache_dir = Path(processed_dir) | ||
| # Directory-based separation: geometry/ vs geometry_mates/ | ||
| # Directory-based separation: geometry/ vs geometry_mates/. Ligand inclusion | ||
| # is governed by the include_ligands config flag, not the cache directory | ||
| # name, so the geometry cache name is unaffected by include_ligands. | ||
| cache_suffix = "_mates" if include_mates else "" | ||
| self.geometry_dir = self.cache_dir / f"{geometry_cache_name}{cache_suffix}" | ||
| self.base_pdb_dir = Path(base_pdb_dir) |
There was a problem hiding this comment.
include_ligands should be on by default, not a concern here, both caches should have ligands
There was a problem hiding this comment.
Why is this not a concern? I understand that the default is include_ligands=True so for your own caches this is fine, but if another user wants to toggle and try excluding ligands would there not be silent errors from loading caches?
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/dataset.py (1)
678-715: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy liftVersion or canonicalize the geometry cache before reusing it across ligand settings.
The cache path ignores
include_ligands, but the saved payload changes at Line 1080 and__getitem__always trustscached["is_ligand"]at Line 1217. Reusing the sameprocessed_dir/geometry[_mates]after preprocessing once with the opposite ligand setting will silently return the wrong graph; pre-existing caches from before this PR will also raiseKeyErrorbecause they lackis_ligand.Either make the cache payload canonical and apply
include_ligandsat load time, or store schema/config metadata and invalidate/rebuild on mismatch.Possible localized guard to avoid silent cache misuse
torch.save( { + "cache_schema_version": 2, + "include_ligands": self.include_ligands, "protein_pos": final_protein_pos, "protein_x": final_protein_x, "protein_res_idx": final_protein_res_idx, "is_ligand": is_ligand,cached = torch.load(cache_path, weights_only=False) + if cached.get("cache_schema_version") != 2: + raise ValueError( + f"Geometry cache {cache_path} uses an old schema; regenerate it." + ) + if cached.get("include_ligands") != self.include_ligands: + raise ValueError( + f"Geometry cache {cache_path} was generated with " + f"include_ligands={cached.get('include_ligands')}; regenerate it " + f"or use a separate cache." + ) # load all data directly from cache (already includes mates if applicable)Also applies to: 741-754, 1072-1123, 1217-1239
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/dataset.py` around lines 678 - 715, The geometry cache path ignores the `include_ligands` parameter, but the saved data includes or excludes ligands based on this setting, causing silent cache misuse when the parameter changes. The `__getitem__` method at line 1217 always trusts the cached `is_ligand` field without validating the cache was created with matching settings. To fix this, either make the cache payload canonical by always storing all atoms with ligand metadata, then apply `include_ligands` filtering at load time in `__getitem__`, or embed cache schema and configuration metadata in the cache and validate it on load, rebuilding when settings mismatch. Also handle backward compatibility for pre-existing caches that lack the `is_ligand` field by detecting and regenerating them.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/encoder_base.py`:
- Around line 223-233: The cached embeddings retrieved using self._embedding_key
on line 223 are not validated against the configured embedding_dim, which can
cause shape mismatches when ligand embeddings are assigned on line 233 or
silently return wrong shapes. After retrieving the embeddings tensor, add
validation to ensure its width (second dimension) matches self.embedding_dim
before the embeddings are used. If the width does not match, raise an
appropriate error to catch configuration mismatches early rather than allowing
silent failures or crashes downstream.
---
Outside diff comments:
In `@src/dataset.py`:
- Around line 678-715: The geometry cache path ignores the `include_ligands`
parameter, but the saved data includes or excludes ligands based on this
setting, causing silent cache misuse when the parameter changes. The
`__getitem__` method at line 1217 always trusts the cached `is_ligand` field
without validating the cache was created with matching settings. To fix this,
either make the cache payload canonical by always storing all atoms with ligand
metadata, then apply `include_ligands` filtering at load time in `__getitem__`,
or embed cache schema and configuration metadata in the cache and validate it on
load, rebuilding when settings mismatch. Also handle backward compatibility for
pre-existing caches that lack the `is_ligand` field by detecting and
regenerating them.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4b879dd2-d028-46c3-b2cb-f1094471d9cd
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (10)
pyproject.tomlscripts/generate_esm_embeddings.pyscripts/generate_slae_embeddings.pysrc/dataset.pysrc/encoder_base.pysrc/gvp_encoder.pytests/conftest.pytests/test_dataset.pytests/test_encoder.pytests/test_files/4h0b/4h0b_final.pdb
|
I made a few comments that should mostly be minor issues to address or clarifying questions related to other PRs perhaps. Other than that, README.md should also be updated wherever ligand should be mentioned as optionally a part of "protein" now. |
| self.cache_dir = Path(processed_dir) | ||
| # Directory-based separation: geometry/ vs geometry_mates/ | ||
| # Directory-based separation: geometry/ vs geometry_mates/. Ligand inclusion | ||
| # is governed by the include_ligands config flag, not the cache directory | ||
| # name, so the geometry cache name is unaffected by include_ligands. | ||
| cache_suffix = "_mates" if include_mates else "" | ||
| self.geometry_dir = self.cache_dir / f"{geometry_cache_name}{cache_suffix}" | ||
| self.base_pdb_dir = Path(base_pdb_dir) |
There was a problem hiding this comment.
Why is this not a concern? I understand that the default is include_ligands=True so for your own caches this is fine, but if another user wants to toggle and try excluding ligands would there not be silent errors from loading caches?
| @@ -665,6 +675,7 @@ | |||
| base_pdb_dir: str = "/sb/wankowicz_lab/data/srivasv/pdb_redo_data", | |||
There was a problem hiding this comment.
out side of the diff, but just hard-coded path to remove
| self.entries = valid_entries | ||
| logger.info(f"Dataset contains {len(self.entries)} valid entries.") | ||
|
|
||
| def _preprocess_one(self, entry: dict, cache_path: Path): |
There was a problem hiding this comment.
this function is getting very very long...not really a blocking comment for this PR but if there is a future PR for cleaning up / refactoring this should be on the top of the list.
| # Ligands always go last so num_asu_protein and mate counts are unaffected, | ||
| # preserving ESM/SLAE embedding alignment via _pad_atom_embeddings_for_mates. | ||
|
|
||
| # TODO(ligands+mates): this only adds ASU ligands. Until dev_crystal_mates |
There was a problem hiding this comment.
I am a bit confused by the comment. Is the intention (future PR) to add ligands as part of mate protein nodes or not?
| @@ -480,32 +480,223 @@ class TestParseAsuWithBiotite: | |||
| """Tests for PDB parsing with biotite.""" | |||
|
|
|||
| def test_parse_returns_protein_and_water(self, pdb_6eey): | |||
There was a problem hiding this comment.
should rename this function
| "embedding_dim": 1536, | ||
| "hidden_s": 64, |
There was a problem hiding this comment.
where do these hard coded numbers come from (and similarly 128 above)? should some constants be declared/defined?
| ASU protein atoms carry the real ESM/SLAE vector; symmetry mates and | ||
| ligand atoms are zero-padded (they have no residue embedding). |
There was a problem hiding this comment.
I am not seeing relevant tests for testing if the embeddings work as expected -- whether should be zero-padded or not and have the one-hot component etc for ASU protein, mate, and ligands? I know the mate is not part of this PR, but probably should start adding tests to catch the kind of bug that you found that prevented crystal contact to work.
Adds non-protein, non-water heavy atoms as context nodes, on by default (
include_ligands=True).parse_asu_with_biotite()returns(protein, water, ligand); ligand atoms appended after protein/mate atoms with anis_ligandmask andresidue_index = -1. Cache dir is unaffected byinclude_ligands(only_mates).CachedEmbeddingEncodergets a learnableligand_embedprojection (embedding_dimnow required at init); device-safe indexing; pooling ignores the-1sentinel.rdkit; added ligand tests +4h0bfixture.Summary by CodeRabbit
New Features
Changes
Tests