Adding ligand processing to dataset and encoders.#86

Open
vratins wants to merge 11 commits into main from dev_ligands
vratins commented Jun 24, 2026

Contributor

Adds non-protein, non-water heavy atoms as context nodes, on by default (include_ligands=True).

  • Dataset: parse_asu_with_biotite() returns (protein, water, ligand); ligand atoms appended after protein/mate atoms with an is_ligand mask and residue_index = -1. Cache dir is unaffected by include_ligands (only _mates).
  • Encoder: CachedEmbeddingEncoder gets a learnable ligand_embed projection (embedding_dim now required at init); device-safe indexing; pooling ignores the -1 sentinel.
  • Misc: ESM/SLAE scripts unpack the 3-tuple (SLAE marked legacy); removed unused rdkit; added ligand tests + 4h0b fixture.
  • Added a fusion MLP to the encoders to make the logic cleaner: ESM embeddings and element one-hots are each projected to a common dimension, concatenated, passed through another MLP, and then passed to the flow.
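
The Dataset bullet above can be illustrated with a minimal sketch. It uses plain Python lists rather than the tensors the dataset presumably stores, and `append_ligand_nodes` and its argument names are hypothetical, not the PR's code. Only the ligands-last ordering, the `is_ligand` mask, and the `-1` sentinel come from the description.

```python
LIGAND_RESIDUE_INDEX = -1  # sentinel value stated in the PR description


def append_ligand_nodes(protein_pos, protein_res_idx, ligand_pos, include_ligands=True):
    """Append ligand atoms after protein/mate atoms (hypothetical helper).

    Returns positions, residue indices, and a boolean is_ligand mask,
    all of the same length.
    """
    pos = list(protein_pos)
    res_idx = list(protein_res_idx)
    is_ligand = [False] * len(pos)
    if include_ligands:
        pos.extend(ligand_pos)
        res_idx.extend([LIGAND_RESIDUE_INDEX] * len(ligand_pos))
        is_ligand.extend([True] * len(ligand_pos))
    return pos, res_idx, is_ligand


protein_pos = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
protein_res_idx = [0, 0]
ligand_pos = [(4.0, 1.0, 2.0)]

pos, res_idx, is_ligand = append_ligand_nodes(protein_pos, protein_res_idx, ligand_pos)
```

Because ligands are appended last, the indices of the protein and mate atoms are unchanged whether or not the flag is set.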

Summary by CodeRabbit

  • New Features

    • Added support for including ligand atoms from PDB structures in model training
    • Introduced configurable toggle to enable/disable ligand atom inclusion
  • Changes

    • Encoder initialization now requires explicit embedding dimension specification at configuration time
    • Updated dependencies for improved compatibility
  • Tests

    • Expanded test coverage for ligand processing and atom filtering

Copilot AI review requested due to automatic review settings June 24, 2026 01:45
coderabbitai Bot commented Jun 24, 2026

Walkthrough

parse_asu_with_biotite now extracts and returns ligand atoms as a 3-tuple (protein, water, ligand). ProteinWaterDataset adds an include_ligands flag that appends ligand nodes with residue_index = -1 and an is_ligand mask into the graph. CachedEmbeddingEncoder requires embedding_dim at construction, removes lazy inference, and fuses cached embeddings with element one-hot features to produce learned fusion_dim-width scalar outputs instead of raw cached embeddings. GVPEncoder._pool_by_residue filters atoms with negative residue indices before pooling. All call sites and tests are updated accordingly. The pymol-open-source dependency is replaced with pymol-open-source-whl>=3.1.0.4.

Changes

Ligand atoms and cached encoder fusion

| Layer / File(s) | Summary |
| --- | --- |
| parse_asu_with_biotite 3-tuple return (src/dataset.py) | Extracts non-protein, non-water heavy atoms as a ligand set and returns (protein, water, ligand) instead of (protein, water). |
| ProteinWaterDataset ligand node integration (src/dataset.py) | Adds include_ligands constructor parameter (default True); preprocessing unpacks 3-tuple, appends ligand atom positions/features after protein/mate nodes, assigns residue_index = -1 for ligand atoms, builds an is_ligand boolean mask, persists it to the geometry cache, and exposes it on the returned graph as data["protein"].is_ligand. |
| CachedEmbeddingEncoder: required embedding_dim and embedding+element fusion (src/encoder_base.py) | Constructor now requires embedding_dim and fusion_dim (removes lazy inference). Initializes learned projections for cached embeddings and element one-hot features. output_dims returns (fusion_dim, 0) immediately. forward fuses cached embeddings with element features to produce scalar output width fusion_dim and empty vectors (N, 0, 3). from_config requires embedding_dim and hidden_s in config; raises ValueError if missing. |
| GVPEncoder negative residue index guard (src/gvp_encoder.py) | _pool_by_residue filters out atoms with residue_index < 0 (ligand atoms marked with -1) before scatter-based pooling to prevent invalid indices. |
| Embedding script call-site updates (scripts/generate_esm_embeddings.py, scripts/generate_slae_embeddings.py) | Updates parse_asu_with_biotite unpacking from 2-tuple to 3-tuple. generate_slae_embeddings.py docstring is updated to note SLAE encoder logic as legacy/reference-only. |
| Tests: ligand parsing, dataset integration, and encoder contracts (tests/conftest.py, tests/test_dataset.py, tests/test_encoder.py, tests/test_train_config.py) | New pdb_4h0b fixture added. Parse-and-dataset tests updated for 3-tuple unpacking; new ligand-focused test classes verify partitioning, is_ligand mask shape/type, sentinel residue_index = -1, and include_ligands flag toggling. Quality-filter and water-filtering tests updated to ignore ligand return. Encoder tests require embedding_dim at construction, assert immediate output_dims = (fusion_dim, 0), validate learnable parameters including element projections, add error path for missing embedding_dim, and adjust output-shape expectations to match fusion width. |
| Dependency update (pyproject.toml) | Replaces pymol-open-source with pymol-open-source-whl>=3.1.0.4. |
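
The negative-index guard described for `_pool_by_residue` can be sketched in a few lines. This is an illustration only: it assumes mean pooling over scalar features and uses dictionaries in place of a scatter operation, and `mean_pool_by_residue` is a hypothetical name rather than the method in `src/gvp_encoder.py`.

```python
from collections import defaultdict


def mean_pool_by_residue(features, residue_index):
    """Mean-pool per-atom scalar features per residue, skipping negative indices."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, res in zip(features, residue_index):
        if res < 0:  # ligand sentinel (-1): the atom belongs to no residue
            continue
        sums[res] += value
        counts[res] += 1
    return {res: sums[res] / counts[res] for res in sorted(sums)}


# Three protein atoms in residues 0 and 1, plus one ligand atom marked -1.
pooled = mean_pool_by_residue([1.0, 3.0, 10.0, 99.0], [0, 0, 1, -1])
```

Without the filter, a scatter-style reduction would receive `-1` as a target index, which is the failure the guard is meant to prevent.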

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐇 Ligands hop in with masks of gold,
The encoder learns their stories untold.
Fused embeddings dance with elements bright,
No scatter stumbles on negative light.
PyMOL wheels spin smooth and true,
This rabbit's PR brings atoms anew! 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding ligand processing to dataset and encoder components, which is the primary focus across multiple files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 95.83% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

Copilot AI left a comment


Pull request overview

This PR extends the data pipeline and cached-embedding encoders to include non-protein, non-water heavy atoms (“ligands”) as additional protein-type context nodes by default, and updates related scripts/tests accordingly.

Changes:

  • Update parse_asu_with_biotite() and ProteinWaterDataset preprocessing to produce/append ligand atoms and persist an is_ligand mask with residue_index = -1.
  • Add ligand handling to cached embedding encoders via a learnable ligand projection (ligand_embed) and ensure residue pooling ignores the -1 sentinel.
  • Update embedding-generation scripts and expand integration tests/fixtures for ligand parsing and node inclusion.

Reviewed changes

Copilot reviewed 9 out of 11 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| uv.lock | Updates locked dependency set (adds jaxtyping, adjusts PyMOL package, other lock changes). |
| pyproject.toml | Switches PyMOL dependency to pymol-open-source-whl>=3.1.0.4 and adds jaxtyping. |
| src/dataset.py | Adds ligand parsing, dataset flag include_ligands, appends ligand nodes, and stores is_ligand in cache/data. |
| src/encoder_base.py | Requires embedding_dim for cached encoders and adds ligand projection for ligand nodes. |
| src/gvp_encoder.py | Filters out residue_index < 0 entries during residue pooling to avoid scatter issues. |
| scripts/generate_esm_embeddings.py | Updates unpacking of parse_asu_with_biotite() return to 3-tuple. |
| scripts/generate_slae_embeddings.py | Marks SLAE as legacy and updates unpacking of parse_asu_with_biotite() return to 3-tuple. |
| tests/conftest.py | Adds a pdb_4h0b fixture for ligand integration tests. |
| tests/test_dataset.py | Adds ligand parsing + include_ligands integration tests and updates existing parsing call sites. |
| tests/test_encoder.py | Updates cached encoder tests for required embedding_dim and ligand projection parameterization. |

Comment thread src/dataset.py
Comment on lines 740 to 746
self.cache_dir = Path(processed_dir)
# Directory-based separation: geometry/ vs geometry_mates/
# Directory-based separation: geometry/ vs geometry_mates/. Ligand inclusion
# is governed by the include_ligands config flag, not the cache directory
# name, so the geometry cache name is unaffected by include_ligands.
cache_suffix = "_mates" if include_mates else ""
self.geometry_dir = self.cache_dir / f"{geometry_cache_name}{cache_suffix}"
self.base_pdb_dir = Path(base_pdb_dir)

Contributor Author


include_ligands is meant to be on by default, so this is not a concern here; both caches should contain ligands.

Collaborator


Why is this not a concern? I understand that the default is include_ligands=True so for your own caches this is fine, but if another user wants to toggle and try excluding ligands would there not be silent errors from loading caches?
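
One way to address the concern raised in this thread, sketched under assumptions, is to encode the ligand setting in the cache directory name, as is already done for mates. The `geometry_cache_dir` helper, the `_noligands` suffix, and the `"geometry"` default are hypothetical; only the `_mates` suffix logic comes from the quoted code.

```python
from pathlib import Path


def geometry_cache_dir(
    processed_dir,
    geometry_cache_name="geometry",  # assumed default, based on the quoted comment
    include_mates=False,
    include_ligands=True,
):
    """Build a cache directory whose name reflects both preprocessing flags."""
    suffix = "_mates" if include_mates else ""
    if not include_ligands:
        # Hypothetical suffix: keeps the default (ligands on) path unchanged,
        # so existing default caches would still be found.
        suffix += "_noligands"
    return Path(processed_dir) / f"{geometry_cache_name}{suffix}"


default_dir = geometry_cache_dir("processed")
toggled_dir = geometry_cache_dir("processed", include_mates=True, include_ligands=False)
```

With this naming, toggling `include_ligands` selects a different directory, so a cache built with the opposite setting is never loaded by accident.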

Comment thread src/dataset.py

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/dataset.py (1)

678-715: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Version or canonicalize the geometry cache before reusing it across ligand settings.

The cache path ignores include_ligands, but the saved payload changes at Line 1080 and __getitem__ always trusts cached["is_ligand"] at Line 1217. Reusing the same processed_dir/geometry[_mates] after preprocessing once with the opposite ligand setting will silently return the wrong graph; pre-existing caches from before this PR will also raise KeyError because they lack is_ligand.

Either make the cache payload canonical and apply include_ligands at load time, or store schema/config metadata and invalidate/rebuild on mismatch.

Possible localized guard to avoid silent cache misuse
         torch.save(
             {
+                "cache_schema_version": 2,
+                "include_ligands": self.include_ligands,
                 "protein_pos": final_protein_pos,
                 "protein_x": final_protein_x,
                 "protein_res_idx": final_protein_res_idx,
                 "is_ligand": is_ligand,
         cached = torch.load(cache_path, weights_only=False)
+        if cached.get("cache_schema_version") != 2:
+            raise ValueError(
+                f"Geometry cache {cache_path} uses an old schema; regenerate it."
+            )
+        if cached.get("include_ligands") != self.include_ligands:
+            raise ValueError(
+                f"Geometry cache {cache_path} was generated with "
+                f"include_ligands={cached.get('include_ligands')}; regenerate it "
+                f"or use a separate cache."
+            )
 
         # load all data directly from cache (already includes mates if applicable)

Also applies to: 741-754, 1072-1123, 1217-1239
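
The other option named above, a canonical cache with `include_ligands` applied at load time, might look roughly like the following. This is a sketch with plain lists, not the dataset's tensor code. The `select_nodes` helper is hypothetical, and the cache keys are the ones visible in the diff above.

```python
def select_nodes(cached, include_ligands):
    """Apply the ligand toggle at load time to a canonical cache entry."""
    n = len(cached["protein_pos"])
    if "is_ligand" in cached:
        is_ligand = cached["is_ligand"]
    elif include_ligands:
        # A cache written before ligand support cannot supply ligand atoms.
        raise ValueError("Geometry cache has no is_ligand mask; regenerate it.")
    else:
        # Old cache, ligands not requested: every atom is a non-ligand atom.
        is_ligand = [False] * n

    keep = [include_ligands or not lig for lig in is_ligand]
    return {
        "protein_pos": [p for p, k in zip(cached["protein_pos"], keep) if k],
        "protein_res_idx": [r for r, k in zip(cached["protein_res_idx"], keep) if k],
        "is_ligand": [lig for lig, k in zip(is_ligand, keep) if k],
    }


cached = {
    "protein_pos": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (4.0, 1.0, 2.0)],
    "protein_res_idx": [0, 0, -1],
    "is_ligand": [False, False, True],
}
without_ligands = select_nodes(cached, include_ligands=False)
with_ligands = select_nodes(cached, include_ligands=True)
```

The trade-off is that every cache always stores ligand atoms, and the flag becomes a cheap filter rather than a preprocessing-time decision.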

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dataset.py` around lines 678 - 715, The geometry cache path ignores the
`include_ligands` parameter, but the saved data includes or excludes ligands
based on this setting, causing silent cache misuse when the parameter changes.
The `__getitem__` method at line 1217 always trusts the cached `is_ligand` field
without validating the cache was created with matching settings. To fix this,
either make the cache payload canonical by always storing all atoms with ligand
metadata, then apply `include_ligands` filtering at load time in `__getitem__`,
or embed cache schema and configuration metadata in the cache and validate it on
load, rebuilding when settings mismatch. Also handle backward compatibility for
pre-existing caches that lack the `is_ligand` field by detecting and
regenerating them.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/encoder_base.py`:
- Around line 223-233: The cached embeddings retrieved using self._embedding_key
on line 223 are not validated against the configured embedding_dim, which can
cause shape mismatches when ligand embeddings are assigned on line 233 or
silently return wrong shapes. After retrieving the embeddings tensor, add
validation to ensure its width (second dimension) matches self.embedding_dim
before the embeddings are used. If the width does not match, raise an
appropriate error to catch configuration mismatches early rather than allowing
silent failures or crashes downstream.
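
The width check requested here could be as small as the sketch below. It takes a shape tuple so that it stays independent of any tensor library; `check_embedding_width` is a hypothetical helper, and in the encoder it would presumably be called with something like `tuple(embeddings.shape)` and `self.embedding_dim`.

```python
def check_embedding_width(shape, embedding_dim, key="embedding"):
    """Raise early if a cached (N, D) embedding does not have the configured width D."""
    if len(shape) != 2 or shape[1] != embedding_dim:
        raise ValueError(
            f"Cached '{key}' has shape {tuple(shape)}, expected (N, {embedding_dim}); "
            "check embedding_dim in the encoder config."
        )


check_embedding_width((10, 1536), embedding_dim=1536)  # matching width: no error
```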

📥 Commits

Reviewing files that changed from the base of the PR and between c3b9db6 and 6f14f4a.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • pyproject.toml
  • scripts/generate_esm_embeddings.py
  • scripts/generate_slae_embeddings.py
  • src/dataset.py
  • src/encoder_base.py
  • src/gvp_encoder.py
  • tests/conftest.py
  • tests/test_dataset.py
  • tests/test_encoder.py
  • tests/test_files/4h0b/4h0b_final.pdb

Comment thread src/encoder_base.py Outdated

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 12 changed files in this pull request and generated 2 comments.

Comment thread src/dataset.py
Comment thread src/encoder_base.py
@DorisMai

Collaborator

I left a few comments; most are minor issues to address or clarifying questions, some of which may relate to other PRs. Beyond those, README.md should be updated wherever ligands now need to be mentioned as an optional part of "protein".

Comment thread src/dataset.py
@@ -665,6 +675,7 @@
base_pdb_dir: str = "/sb/wankowicz_lab/data/srivasv/pdb_redo_data",

Collaborator

Outside of the diff, but this hard-coded path should be removed.

Comment thread src/dataset.py
self.entries = valid_entries
logger.info(f"Dataset contains {len(self.entries)} valid entries.")

def _preprocess_one(self, entry: dict, cache_path: Path):

Collaborator

This function is getting very long. This is not a blocking comment for this PR, but if there is a future PR for cleanup or refactoring, this function should be at the top of the list.

Comment thread src/dataset.py
# Ligands always go last so num_asu_protein and mate counts are unaffected,
# preserving ESM/SLAE embedding alignment via _pad_atom_embeddings_for_mates.

# TODO(ligands+mates): this only adds ASU ligands. Until dev_crystal_mates

Collaborator


I am a bit confused by the comment. Is the intention (future PR) to add ligands as part of mate protein nodes or not?

Comment thread tests/test_dataset.py
@@ -480,32 +480,223 @@ class TestParseAsuWithBiotite:
"""Tests for PDB parsing with biotite."""

def test_parse_returns_protein_and_water(self, pdb_6eey):

Collaborator

This function should be renamed, since the parser now also returns ligand atoms.

Comment thread tests/test_encoder.py
Comment on lines +161 to +162
"embedding_dim": 1536,
"hidden_s": 64,

Collaborator

Where do these hard-coded numbers come from (and similarly the 128 above)? Should constants be declared for them?
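
A minimal sketch of what declared constants could look like in the test module. The constant names and the `make_cached_encoder_config` helper are hypothetical; the values are the two shown in the diff, and the sketch makes no claim about where they originate.

```python
# Hypothetical names; the values are the ones that appear in the diff.
TEST_EMBEDDING_DIM = 1536
TEST_HIDDEN_S = 64


def make_cached_encoder_config(embedding_dim=TEST_EMBEDDING_DIM, hidden_s=TEST_HIDDEN_S):
    """Build the config dict used by the encoder tests from named constants."""
    return {"embedding_dim": embedding_dim, "hidden_s": hidden_s}


config = make_cached_encoder_config()
```

Tests that need a different width can then override one argument instead of repeating literals.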

Comment thread src/encoder_base.py
Comment on lines +150 to +151
ASU protein atoms carry the real ESM/SLAE vector; symmetry mates and
ligand atoms are zero-padded (they have no residue embedding).

Collaborator

I am not seeing tests that check whether the embeddings behave as expected for ASU protein, mate, and ligand atoms: whether each is zero-padded or not, whether the one-hot component is present, and so on. I know mates are not part of this PR, but it is probably time to start adding tests that would catch the kind of bug you found that prevented crystal contacts from working.
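
A test of the kind asked for here might start from something like the sketch below. It assumes the zero-padding behavior stated in the quoted docstring and uses a hypothetical list-based stand-in, `pad_atom_embeddings`, rather than the repository's own padding code.

```python
def pad_atom_embeddings(asu_embeddings, num_total_atoms):
    """Zero-pad per-atom embeddings so atoms appended after the ASU get zero rows."""
    if num_total_atoms < len(asu_embeddings):
        raise ValueError("num_total_atoms is smaller than the number of ASU atoms")
    width = len(asu_embeddings[0])
    padding = [[0.0] * width for _ in range(num_total_atoms - len(asu_embeddings))]
    return [list(row) for row in asu_embeddings] + padding


def test_mates_and_ligands_are_zero_padded():
    asu = [[0.1, 0.2], [0.3, 0.4]]  # 2 ASU protein atoms with real embeddings
    padded = pad_atom_embeddings(asu, num_total_atoms=5)  # + 2 mate atoms + 1 ligand atom
    assert len(padded) == 5
    assert padded[:2] == asu  # ASU rows are untouched
    assert all(v == 0.0 for row in padded[2:] for v in row)  # mates and ligand are zero
```

A fuller version would run the real encoder on a fixture such as 4h0b and also assert that the element one-hot component is non-zero for ligand rows.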

3 participants