Adding ligand processing to dataset and encoders. by vratins · Pull Request #86 · diff-use/WaterFlow

vratins · 2026-06-24T01:45:15Z

Adds non-protein, non-water heavy atoms as context nodes, on by default (include_ligands=True).

Dataset: parse_asu_with_biotite() returns (protein, water, ligand); ligand atoms appended after protein/mate atoms with an is_ligand mask and residue_index = -1. Cache dir is unaffected by include_ligands (only _mates).
Encoder: CachedEmbeddingEncoder gets a learnable ligand_embed projection (embedding_dim now required at init); device-safe indexing; pooling ignores the -1 sentinel.
Misc: ESM/SLAE scripts unpack the 3-tuple (SLAE marked legacy); removed unused rdkit; added ligand tests + 4h0b fixture.
Encoders: added a fusion MLP to make the logic cleaner. ESM embeddings and element one-hots are each projected to a common dimension, concatenated, passed through another MLP, and then passed to the flow.
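As a rough illustration of the dataset change described above, the sketch below splits atoms into protein, water, and ligand sets, then appends ligands last with an `is_ligand` mask and `residue_index = -1`. The `Atom` record, the abbreviated `PROTEIN_RESIDUES` and `WATER_NAMES` sets, and both helper functions are hypothetical stand-ins in plain Python, not the PR's biotite-based code.

```python
from dataclasses import dataclass

# Hypothetical, abbreviated residue sets; the real parser relies on biotite.
PROTEIN_RESIDUES = {"ALA", "GLY", "SER", "LYS"}
WATER_NAMES = {"HOH", "WAT"}
LIGAND_SENTINEL = -1  # residue_index assigned to every ligand atom


@dataclass(frozen=True)
class Atom:
    res_name: str
    element: str
    res_index: int


def partition_atoms(atoms):
    """Split atoms into (protein, water, ligand).

    Ligand atoms are heavy atoms that are neither protein nor water.
    """
    protein, water, ligand = [], [], []
    for atom in atoms:
        if atom.res_name in PROTEIN_RESIDUES:
            protein.append(atom)
        elif atom.res_name in WATER_NAMES:
            water.append(atom)
        elif atom.element != "H":  # keep heavy atoms only
            ligand.append(atom)
    return protein, water, ligand


def build_context_nodes(protein, ligand, include_ligands=True):
    """Append ligand atoms after protein atoms.

    Returns (nodes, residue_index, is_ligand), all of equal length.
    """
    nodes = list(protein)
    residue_index = [a.res_index for a in protein]
    is_ligand = [False] * len(protein)
    if include_ligands:
        nodes.extend(ligand)
        residue_index.extend([LIGAND_SENTINEL] * len(ligand))
        is_ligand.extend([True] * len(ligand))
    return nodes, residue_index, is_ligand


atoms = [
    Atom("ALA", "C", 0),
    Atom("GLY", "N", 1),
    Atom("HOH", "O", 2),
    Atom("ATP", "P", 3),
    Atom("ATP", "H", 3),  # ligand hydrogen, dropped by the heavy-atom filter
]
protein, water, ligand = partition_atoms(atoms)
nodes, residue_index, is_ligand = build_context_nodes(protein, ligand)
```

Because ligands are appended last, the indices of the protein atoms do not change when `include_ligands` is toggled.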

Summary by CodeRabbit

New Features
- Added support for including ligand atoms from PDB structures in model training
- Introduced configurable toggle to enable/disable ligand atom inclusion
Changes
- Encoder initialization now requires explicit embedding dimension specification at configuration time
- Updated dependencies for improved compatibility
Tests
- Expanded test coverage for ligand processing and atom filtering

coderabbitai · 2026-06-24T01:45:27Z

📝 Walkthrough

parse_asu_with_biotite now extracts and returns ligand atoms as a 3-tuple (protein, water, ligand). ProteinWaterDataset adds an include_ligands flag that appends ligand nodes with residue_index = -1 and an is_ligand mask into the graph. CachedEmbeddingEncoder requires embedding_dim at construction, removes lazy inference, and fuses cached embeddings with element one-hot features to produce learned fusion_dim-width scalar outputs instead of raw cached embeddings. GVPEncoder._pool_by_residue filters atoms with negative residue indices before pooling. All call sites and tests are updated accordingly. The pymol-open-source dependency is replaced with pymol-open-source-whl>=3.1.0.4.

Changes

Ligand atoms and cached encoder fusion

Layer / File(s)	Summary
`parse_asu_with_biotite` 3-tuple return `src/dataset.py`	Extracts non-protein, non-water heavy atoms as a ligand set and returns `(protein, water, ligand)` instead of `(protein, water)`.
`ProteinWaterDataset` ligand node integration `src/dataset.py`	Adds `include_ligands` constructor parameter (default `True`); preprocessing unpacks 3-tuple, appends ligand atom positions/features after protein/mate nodes, assigns `residue_index = -1` for ligand atoms, builds an `is_ligand` boolean mask, persists it to the geometry cache, and exposes it on the returned graph as `data["protein"].is_ligand`.
`CachedEmbeddingEncoder`: required `embedding_dim` and embedding+element fusion `src/encoder_base.py`	Constructor now requires `embedding_dim` and `fusion_dim` (removes lazy inference). Initializes learned projections for cached embeddings and element one-hot features. `output_dims` returns `(fusion_dim, 0)` immediately. `forward` fuses cached embeddings with element features to produce scalar output width `fusion_dim` and empty vectors `(N, 0, 3)`. `from_config` requires `embedding_dim` and `hidden_s` in config; raises `ValueError` if missing.
`GVPEncoder` negative residue index guard `src/gvp_encoder.py`	`_pool_by_residue` filters out atoms with `residue_index < 0` (ligand atoms marked with `-1`) before scatter-based pooling to prevent invalid indices.
Embedding script call-site updates `scripts/generate_esm_embeddings.py`, `scripts/generate_slae_embeddings.py`	Updates `parse_asu_with_biotite` unpacking from 2-tuple to 3-tuple. `generate_slae_embeddings.py` docstring is updated to note SLAE encoder logic as legacy/reference-only.
Tests: ligand parsing, dataset integration, and encoder contracts `tests/conftest.py`, `tests/test_dataset.py`, `tests/test_encoder.py`, `tests/test_train_config.py`	New `pdb_4h0b` fixture added. Parse-and-dataset tests updated for 3-tuple unpacking; new ligand-focused test classes verify partitioning, `is_ligand` mask shape/type, sentinel `residue_index = -1`, and `include_ligands` flag toggling. Quality-filter and water-filtering tests updated to ignore ligand return. Encoder tests require `embedding_dim` at construction, assert immediate `output_dims = (fusion_dim, 0)`, validate learnable parameters including element projections, add error path for missing `embedding_dim`, and adjust output-shape expectations to match fusion width.
Dependency update `pyproject.toml`	Replaces `pymol-open-source` with `pymol-open-source-whl>=3.1.0.4`.
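The residue-pooling guard summarized in the table can be pictured with a small pure-Python stand-in. This is only a sketch of the idea: the PR uses scatter-based pooling on tensors, and the function name `pool_by_residue` and the list-based inputs here are assumptions made for illustration.

```python
from collections import defaultdict


def pool_by_residue(features, residue_index):
    """Mean-pool per-atom feature vectors by residue, skipping negative indices."""
    sums = {}
    counts = defaultdict(int)
    for vec, idx in zip(features, residue_index):
        if idx < 0:  # ligand sentinel (-1): the atom belongs to no residue
            continue
        if idx not in sums:
            sums[idx] = [0.0] * len(vec)
        sums[idx] = [s + v for s, v in zip(sums[idx], vec)]
        counts[idx] += 1
    return {idx: [s / counts[idx] for s in total] for idx, total in sums.items()}


features = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [5.0, 6.0]]
residue_index = [0, 0, -1, 1]  # third atom is a ligand
pooled = pool_by_residue(features, residue_index)
```

The ligand row never contributes to a residue average, so only residues 0 and 1 appear in `pooled`.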

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐇 Ligands hop in with masks of gold,
The encoder learns their stories untold.
Fused embeddings dance with elements bright,
No scatter stumbles on negative light.
PyMOL wheels spin smooth and true,
This rabbit's PR brings atoms anew! 🌟

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main change: adding ligand processing to dataset and encoder components, which is the primary focus across multiple files.
Docstring Coverage	✅ Passed	Docstring coverage is 95.83% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

Copilot

Pull request overview

This PR extends the data pipeline and cached-embedding encoders to include non-protein, non-water heavy atoms (“ligands”) as additional protein-type context nodes by default, and updates related scripts/tests accordingly.

Changes:

Update parse_asu_with_biotite() and ProteinWaterDataset preprocessing to produce/append ligand atoms and persist an is_ligand mask with residue_index = -1.
Add ligand handling to cached embedding encoders via a learnable ligand projection (ligand_embed) and ensure residue pooling ignores the -1 sentinel.
Update embedding-generation scripts and expand integration tests/fixtures for ligand parsing and node inclusion.

Reviewed changes

Copilot reviewed 9 out of 11 changed files in this pull request and generated 2 comments.

File	Description
`uv.lock`	Updates locked dependency set (adds `jaxtyping`, adjusts PyMOL package, other lock changes).
`pyproject.toml`	Switches PyMOL dependency to `pymol-open-source-whl>=3.1.0.4` and adds `jaxtyping`.
`src/dataset.py`	Adds ligand parsing, dataset flag `include_ligands`, appends ligand nodes, and stores `is_ligand` in cache/data.
`src/encoder_base.py`	Requires `embedding_dim` for cached encoders and adds ligand projection for ligand nodes.
`src/gvp_encoder.py`	Filters out `residue_index < 0` entries during residue pooling to avoid scatter issues.
`scripts/generate_esm_embeddings.py`	Updates unpacking of `parse_asu_with_biotite()` return to 3-tuple.
`scripts/generate_slae_embeddings.py`	Marks SLAE as legacy and updates unpacking of `parse_asu_with_biotite()` return to 3-tuple.
`tests/conftest.py`	Adds a `pdb_4h0b` fixture for ligand integration tests.
`tests/test_dataset.py`	Adds ligand parsing + include_ligands integration tests and updates existing parsing call sites.
`tests/test_encoder.py`	Updates cached encoder tests for required `embedding_dim` and ligand projection parameterization.

vratins · 2026-06-24T01:49:37Z

        self.cache_dir = Path(processed_dir)
-        # Directory-based separation: geometry/ vs geometry_mates/
+        # Directory-based separation: geometry/ vs geometry_mates/. Ligand inclusion
+        # is governed by the include_ligands config flag, not the cache directory
+        # name, so the geometry cache name is unaffected by include_ligands.
        cache_suffix = "_mates" if include_mates else ""
        self.geometry_dir = self.cache_dir / f"{geometry_cache_name}{cache_suffix}"
        self.base_pdb_dir = Path(base_pdb_dir)


include_ligands is on by default, so this is not a concern here; both caches should contain ligands.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/dataset.py (1)

678-715: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Version or canonicalize the geometry cache before reusing it across ligand settings.

The cache path ignores include_ligands, but the saved payload changes at Line 1080 and __getitem__ always trusts cached["is_ligand"] at Line 1217. Reusing the same processed_dir/geometry[_mates] after preprocessing once with the opposite ligand setting will silently return the wrong graph; pre-existing caches from before this PR will also raise KeyError because they lack is_ligand.

Either make the cache payload canonical and apply include_ligands at load time, or store schema/config metadata and invalidate/rebuild on mismatch.

Possible localized guard to avoid silent cache misuse

         torch.save(
             {
+                "cache_schema_version": 2,
+                "include_ligands": self.include_ligands,
                 "protein_pos": final_protein_pos,
                 "protein_x": final_protein_x,
                 "protein_res_idx": final_protein_res_idx,
                 "is_ligand": is_ligand,

         cached = torch.load(cache_path, weights_only=False)
+        if cached.get("cache_schema_version") != 2:
+            raise ValueError(
+                f"Geometry cache {cache_path} uses an old schema; regenerate it."
+            )
+        if cached.get("include_ligands") != self.include_ligands:
+            raise ValueError(
+                f"Geometry cache {cache_path} was generated with "
+                f"include_ligands={cached.get('include_ligands')}; regenerate it "
+                f"or use a separate cache."
+            )
 
         # load all data directly from cache (already includes mates if applicable)

Also applies to: 741-754, 1072-1123, 1217-1239
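The suggested diff covers the metadata-and-validate option. The other option named in the comment, a canonical cache that always stores ligands and applies `include_ligands` at load time, might look roughly like the sketch below. The payload keys are taken from the diff above; the function name `apply_ligand_setting` and the use of plain lists in place of tensors are assumptions.

```python
PER_ATOM_KEYS = ("protein_pos", "protein_x", "protein_res_idx", "is_ligand")


def apply_ligand_setting(cached, include_ligands):
    """Filter a canonical payload (ligands always stored) at load time."""
    if "is_ligand" not in cached:
        # Caches written before the is_ligand field existed cannot be filtered.
        raise ValueError("Geometry cache lacks 'is_ligand'; regenerate it.")
    mask = cached["is_ligand"]
    if include_ligands:
        keep = [True] * len(mask)
    else:
        keep = [not flag for flag in mask]
    # Apply the same keep-mask to every per-atom field so they stay aligned.
    return {
        key: [value for value, k in zip(cached[key], keep) if k]
        for key in PER_ATOM_KEYS
    }


cached = {
    "protein_pos": [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0]],
    "protein_x": [[1, 0], [0, 1], [1, 1]],
    "protein_res_idx": [0, 1, -1],
    "is_ligand": [False, False, True],
}
without_ligands = apply_ligand_setting(cached, include_ligands=False)
with_ligands = apply_ligand_setting(cached, include_ligands=True)
```

With this layout a single cache serves both settings, at the cost of always parsing and storing ligand atoms during preprocessing.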

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dataset.py` around lines 678 - 715, The geometry cache path ignores the
`include_ligands` parameter, but the saved data includes or excludes ligands
based on this setting, causing silent cache misuse when the parameter changes.
The `__getitem__` method at line 1217 always trusts the cached `is_ligand` field
without validating the cache was created with matching settings. To fix this,
either make the cache payload canonical by always storing all atoms with ligand
metadata, then apply `include_ligands` filtering at load time in `__getitem__`,
or embed cache schema and configuration metadata in the cache and validate it on
load, rebuilding when settings mismatch. Also handle backward compatibility for
pre-existing caches that lack the `is_ligand` field by detecting and
regenerating them.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/encoder_base.py`:
- Around line 223-233: The cached embeddings retrieved using self._embedding_key
on line 223 are not validated against the configured embedding_dim, which can
cause shape mismatches when ligand embeddings are assigned on line 233 or
silently return wrong shapes. After retrieving the embeddings tensor, add
validation to ensure its width (second dimension) matches self.embedding_dim
before the embeddings are used. If the width does not match, raise an
appropriate error to catch configuration mismatches early rather than allowing
silent failures or crashes downstream.
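A minimal sketch of the width check requested for `src/encoder_base.py`, written against nested lists so it runs without any tensor library. The function name `check_embedding_width` is hypothetical; with a tensor, the equivalent test would compare `embeddings.shape[-1]` to `embedding_dim`.

```python
def check_embedding_width(embeddings, embedding_dim):
    """Raise ValueError if any cached embedding row is not embedding_dim wide."""
    for row_number, row in enumerate(embeddings):
        if len(row) != embedding_dim:
            raise ValueError(
                f"Cached embedding row {row_number} has width {len(row)}, "
                f"expected embedding_dim={embedding_dim}."
            )
    return embeddings


ok = check_embedding_width([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], embedding_dim=3)

try:
    check_embedding_width([[0.1, 0.2]], embedding_dim=3)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Failing at this point reports the configuration mismatch directly instead of surfacing later as a shape error in the fusion layers.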

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4b879dd2-d028-46c3-b2cb-f1094471d9cd

📥 Commits

Reviewing files that changed from the base of the PR and between c3b9db6 and 6f14f4a.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (10)

pyproject.toml
scripts/generate_esm_embeddings.py
scripts/generate_slae_embeddings.py
src/dataset.py
src/encoder_base.py
src/gvp_encoder.py
tests/conftest.py
tests/test_dataset.py
tests/test_encoder.py
tests/test_files/4h0b/4h0b_final.pdb

Copilot

Pull request overview

Copilot reviewed 10 out of 12 changed files in this pull request and generated 2 comments.

DorisMai · 2026-06-25T15:52:32Z

I left a few comments; most are minor issues to address or clarifying questions, some perhaps related to other PRs. In addition, README.md should be updated wherever ligands should now be mentioned as an optional part of "protein".

DorisMai · 2026-06-25T00:15:00Z

        self.cache_dir = Path(processed_dir)
-        # Directory-based separation: geometry/ vs geometry_mates/
+        # Directory-based separation: geometry/ vs geometry_mates/. Ligand inclusion
+        # is governed by the include_ligands config flag, not the cache directory
+        # name, so the geometry cache name is unaffected by include_ligands.
        cache_suffix = "_mates" if include_mates else ""
        self.geometry_dir = self.cache_dir / f"{geometry_cache_name}{cache_suffix}"
        self.base_pdb_dir = Path(base_pdb_dir)


Why is this not a concern? I understand that the default is include_ligands=True so for your own caches this is fine, but if another user wants to toggle and try excluding ligands would there not be silent errors from loading caches?
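One possible way to address this question is to encode the ligand setting in the cache directory name, in the same way `_mates` already is. The sketch below is an assumption, not what the PR does: the `_noligands` suffix, the function name `geometry_cache_dir`, and the default `geometry_cache_name="geometry"` are hypothetical.

```python
from pathlib import Path


def geometry_cache_dir(
    processed_dir,
    geometry_cache_name="geometry",
    include_mates=False,
    include_ligands=True,
):
    """Build a cache directory whose name reflects both dataset flags."""
    suffix = "_mates" if include_mates else ""
    if not include_ligands:
        # Hypothetical suffix: only the non-default setting changes the name,
        # so the directory names used by the default setting stay unchanged.
        suffix += "_noligands"
    return Path(processed_dir) / f"{geometry_cache_name}{suffix}"


default_dir = geometry_cache_dir("processed")
no_ligand_dir = geometry_cache_dir("processed", include_ligands=False)
mates_no_ligand_dir = geometry_cache_dir(
    "processed", include_mates=True, include_ligands=False
)
```

A user who toggles `include_ligands` would then read and write a separate directory rather than silently reusing a cache built with the other setting.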

DorisMai · 2026-06-25T00:16:19Z

@@ -665,6 +675,7 @@
        base_pdb_dir: str = "/sb/wankowicz_lab/data/srivasv/pdb_redo_data",


Outside of the diff, but this hard-coded path should be removed.

DorisMai · 2026-06-25T00:21:20Z

        self.entries = valid_entries
        logger.info(f"Dataset contains {len(self.entries)} valid entries.")

    def _preprocess_one(self, entry: dict, cache_path: Path):


This function is getting very long. Not a blocking comment for this PR, but if there is a future cleanup/refactoring PR, this should be at the top of the list.

DorisMai · 2026-06-25T00:22:17Z

+        # Ligands always go last so num_asu_protein and mate counts are unaffected,
+        # preserving ESM/SLAE embedding alignment via _pad_atom_embeddings_for_mates.
+
+        # TODO(ligands+mates): this only adds ASU ligands. Until dev_crystal_mates


I am a bit confused by the comment. Is the intention (future PR) to add ligands as part of mate protein nodes or not?

DorisMai · 2026-06-25T02:24:06Z

@@ -480,32 +480,223 @@ class TestParseAsuWithBiotite:
    """Tests for PDB parsing with biotite."""

    def test_parse_returns_protein_and_water(self, pdb_6eey):


This test function should be renamed now that the parser also returns ligands.

DorisMai · 2026-06-25T15:31:46Z

+            "embedding_dim": 1536,
+            "hidden_s": 64,


Where do these hard-coded numbers come from (and similarly the 128 above)? Should some constants be declared or defined?

DorisMai · 2026-06-25T15:47:54Z

+        ASU protein atoms carry the real ESM/SLAE vector; symmetry mates and
+        ligand atoms are zero-padded (they have no residue embedding).


I don't see tests that check whether the embeddings behave as expected: whether ASU protein, mate, and ligand atoms are zero-padded or not, whether they carry the one-hot component, and so on. I know mates are not part of this PR, but we should probably start adding tests that would catch the kind of bug you found that prevented crystal contacts from working.
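A test along the lines requested here could start from something like the following sketch. `pad_embeddings` is a hypothetical pure-Python stand-in for the real padding helper, and the node order (ASU protein, then mates, then ligands) is assumed from the code comments quoted in this review; a real test would exercise the dataset and encoder directly.

```python
def pad_embeddings(asu_embeddings, num_mates, num_ligands):
    """Return per-node embeddings: real vectors for ASU atoms, zeros for the rest.

    Assumes at least one ASU atom so the embedding width can be inferred.
    """
    dim = len(asu_embeddings[0])
    zeros = [[0.0] * dim for _ in range(num_mates + num_ligands)]
    return [list(row) for row in asu_embeddings] + zeros


def test_padding_by_node_type():
    asu = [[0.5, -1.0], [2.0, 3.0]]
    padded = pad_embeddings(asu, num_mates=1, num_ligands=2)
    # ASU protein atoms keep their real embedding vectors.
    assert padded[:2] == asu
    # Mate and ligand atoms are zero-padded.
    assert all(value == 0.0 for row in padded[2:] for value in row)
    # One row per node: 2 ASU + 1 mate + 2 ligands.
    assert len(padded) == 5


test_padding_by_node_type()
```

Analogous assertions on the element one-hot component for each node type would cover the second half of the question.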

vratins and others added 10 commits March 30, 2026 12:08

code to parse ligands as a 'non-water' node alongside proteins, tests…

38c2e0b

… for the same

Merge branch 'main' into dev_ligands

b8568a5

rdkit dependencies

7cf6de1

tests

1529943

Auto-commit ruff fixes [skip ci]

bebf742

mates comments

c678fc0

addressing het atom discrepancy

e78be52

addressing comments

c824e2f

Auto-commit ruff fixes [skip ci]

aca3180

include ligands by default

6f14f4a

fusion encoders to make the ligand logic cleaner

4979184
