
Optimizations to reduce per-step I/O overhead and to reduce graph density #85

Merged
vratins merged 5 commits into main from dev_dataset_perf
Jun 24, 2026

Conversation

@vratins

@vratins vratins commented Jun 17, 2026

Contributor

mmap-backed cache loading (_load_torch_cache)

  • Wraps torch.load with mmap=True so geometry/embedding .pt files are memory-mapped. Falls back to regular load if the runtime or file format doesn't support it.
  • Threaded through load_slae_embedding, load_esm_embedding, and __getitem__.
  • Controlled by --cache_load_mmap (default: off).
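
The mmap-with-fallback behaviour described above could look roughly like the sketch below. This is not the PR's actual `_load_torch_cache`: the name `load_cache_with_mmap_fallback`, the injectable `load_fn` parameter, and the set of exception types treated as "mmap not supported" are assumptions made for illustration. `torch.load` itself does accept `mmap=True` and `weights_only=False` in recent PyTorch releases.

```python
from pathlib import Path
from typing import Any, Callable, Optional


def load_cache_with_mmap_fallback(
    path: Path,
    cache_load_mmap: bool = False,
    load_fn: Optional[Callable[..., Any]] = None,
) -> Any:
    """Load a cache file, trying mmap first when requested (illustrative sketch)."""
    if load_fn is None:
        # Imported lazily so the sketch can also be exercised with a stub loader.
        import torch

        load_fn = torch.load

    if cache_load_mmap:
        try:
            return load_fn(path, mmap=True, weights_only=False)
        except (RuntimeError, ValueError, TypeError):
            # Assumption: an unsupported file format or runtime surfaces as one
            # of these exception types; fall through to a regular load.
            pass
    return load_fn(path, weights_only=False)
```

Passing the loader in as a parameter is only a convenience for testing the fallback path without real `.pt` files; the PR may well call `torch.load` directly.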

Per-worker sample LRU cache (sample_cache_size)

  • Optional in-process OrderedDict-backed LRU cache in __getitem__. When a sample is already in cache, skips all I/O and returns a .clone(). Evicts least-recently-used entries when the cache is full.
  • Disabled by default (0); set --sample_cache_size N to hold N samples per worker. Raise it only if enough system RAM is available, since each worker keeps its own cache.
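
As a rough illustration of the cache behaviour described above (hit returns a copy, miss builds and stores, least-recently-used entry is evicted when full), here is a minimal sketch. The class name `SampleLRUCache` and the `get_or_build` method are hypothetical, and `copy.deepcopy` stands in for the `.clone()` call the PR uses on its samples.

```python
import copy
from collections import OrderedDict
from typing import Any, Callable, Hashable


class SampleLRUCache:
    """Tiny LRU cache that hands out copies so callers cannot mutate entries."""

    def __init__(self, capacity: int) -> None:
        if capacity < 0:
            raise ValueError("capacity must be >= 0")
        self.capacity = capacity
        self._entries: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get_or_build(self, key: Hashable, build: Callable[[], Any]) -> Any:
        if self.capacity == 0:
            return build()  # caching disabled: always do the full load
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return copy.deepcopy(self._entries[key])
        sample = build()
        self._entries[key] = sample
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return copy.deepcopy(sample)
```

Returning a copy on both the hit and the miss path keeps the stored entry isolated from whatever the training loop does to the sample afterwards, at the cost of one extra copy per access.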

max_neighbors cap on radius_graph

  • ProteinWaterDataset now accepts max_neighbors (default 256) and passes it to radius_graph as max_num_neighbors at preprocessing time.
  • Stored in the geometry cache metadata for traceability.
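
To make the effect of the cap concrete, the pure-Python sketch below builds a radius graph and limits the number of incoming edges per node. It is only a toy model: the function name `capped_radius_graph` is hypothetical, and keeping the nearest neighbours when the cap binds is an assumption of this sketch, not a statement about which neighbours the library's `radius_graph` retains.

```python
import math
from typing import List, Sequence, Tuple


def capped_radius_graph(
    points: Sequence[Sequence[float]],
    radius: float,
    max_neighbors: int,
) -> List[Tuple[int, int]]:
    """Return (neighbor, center) edges with at most max_neighbors per center."""
    if max_neighbors <= 0:
        raise ValueError("max_neighbors must be > 0")
    edges: List[Tuple[int, int]] = []
    for center, p in enumerate(points):
        # All other points within the cutoff, paired with their distance.
        in_range = [
            (math.dist(p, q), other)
            for other, q in enumerate(points)
            if other != center and math.dist(p, q) <= radius
        ]
        # Assumption of this sketch: keep the nearest ones when the cap binds.
        in_range.sort()
        edges.extend((other, center) for _, other in in_range[:max_neighbors])
    return edges
```

With four collinear points spaced one unit apart and a radius of 2.5, the uncapped graph has 10 directed edges, while `max_neighbors=1` leaves 4, one per node.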

Summary by CodeRabbit

Release Notes

  • New Features

    • Added --sample_cache_size command-line argument to enable per-worker in-process sample caching with configurable capacity
    • Added --cache_load_mmap flag to enable optimized dataset cache loading
    • Samples are cached in-process with automatic capacity management
  • Tests

    • Added tests verifying cache data isolation between retrievals
    • Added tests confirming configuration propagation to dataset loaders

Copilot AI review requested due to automatic review settings June 17, 2026 08:20

Copilot AI left a comment


Pull request overview

This PR reduces runtime overhead in ProteinWaterDataset by optimizing cache loading, adding an optional per-worker in-process sample cache, and capping preprocessing-time graph neighborhood density to limit graph size.

Changes:

  • Added _load_torch_cache() wrapper to optionally use torch.load(..., mmap=True) with a safe fallback, and threaded the option through geometry + embedding cache loads.
  • Added an optional per-process LRU cache for fully built HeteroData samples in __getitem__ (sample_cache_size), returning mutation-safe clones on cache hits.
  • Added a max_neighbors cap for radius_graph during preprocessing and stored it in cache metadata for traceability; exposed new runtime flags in scripts/train.py.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/test_dataset.py | Adds tests for mutation-safe sample caching and for passing the mmap flag through dataset geometry loading. |
| src/dataset.py | Implements mmap-backed cache loading, per-process sample LRU caching, and radius_graph neighbor capping + metadata. |
| scripts/train.py | Exposes --sample_cache_size and --cache_load_mmap and threads them into dataset configuration. |


@coderabbitai

coderabbitai Bot commented Jun 17, 2026



ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1edea8dd-756b-4cbf-8339-bc2512824bd6

📥 Commits

Reviewing files that changed from the base of the PR and between b13dc7d and 82acb56.

📒 Files selected for processing (1)
  • src/dataset.py
📝 Walkthrough

Walkthrough

Adds mmap-backed .pt file loading via a new _load_torch_cache helper and an in-process LRU sample cache to ProteinWaterDataset, also introducing a max_neighbors cap for radius_graph. Both new options are threaded through embedding loaders, __getitem__, the dataset constructor, and the training script CLI. Two tests cover clone mutation safety and mmap flag propagation.

Changes

Dataset caching and loading optimizations

| Layer / File(s) | Summary |
| --- | --- |
| _load_torch_cache helper and embedding loader updates (src/dataset.py) | Adds OrderedDict import, introduces _load_torch_cache with mmap-with-fallback and debug logging, and updates load_slae_embedding and load_esm_embedding to accept and use cache_load_mmap. |
| ProteinWaterDataset constructor and preprocessing (src/dataset.py) | Adds max_neighbors, sample_cache_size, and cache_load_mmap to __init__, validates and normalizes them, initializes the OrderedDict-backed _sample_cache, passes max_num_neighbors into radius_graph, and includes max_neighbors in the saved geometry cache payload. |
| __getitem__ LRU cache, mmap geometry loading, and tests (src/dataset.py, tests/test_dataset.py) | Replaces direct torch.load with _load_torch_cache in __getitem__, adds early LRU cache lookup returning a clone, stores newly built HeteroData samples with LRU eviction, threads cache_load_mmap into _annotate_data_with_embeddings, and covers clone mutation safety and mmap propagation in two new tests. |
| CLI argument parsing and dataset_kwargs wiring (scripts/train.py) | Adds --sample_cache_size and --cache_load_mmap CLI arguments, validates sample_cache_size >= 0, and routes both into the dataset_kwargs dictionary used for dataloader creation. |
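
The CLI wiring summarized in the last row could be sketched as follows. The flag names and the `>= 0` check come from the PR description; the function name `parse_dataset_flags`, the use of `action="store_true"` for the mmap flag, and the help strings are assumptions, since the actual `scripts/train.py` is not shown here.

```python
import argparse
from typing import Any, Dict, List, Optional


def parse_dataset_flags(argv: Optional[List[str]] = None) -> Dict[str, Any]:
    """Parse the two new flags and return them as dataset keyword arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--sample_cache_size",
        type=int,
        default=0,
        help="Samples to keep per worker; 0 disables the cache.",
    )
    parser.add_argument(
        "--cache_load_mmap",
        action="store_true",  # assumption: a plain on/off switch, off by default
        help="Memory-map .pt cache files when loading.",
    )
    args = parser.parse_args(argv)
    if args.sample_cache_size < 0:
        parser.error("--sample_cache_size must be >= 0")
    return {
        "sample_cache_size": args.sample_cache_size,
        "cache_load_mmap": args.cache_load_mmap,
    }
```

The returned dictionary plays the role of the `dataset_kwargs` mentioned above and would be unpacked into the dataset constructor.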

Sequence Diagram(s)

sequenceDiagram
  participant DL as DataLoader
  participant DS as ProteinWaterDataset.__getitem__
  participant Cache as _sample_cache
  participant LTC as _load_torch_cache
  participant Ann as _annotate_data_with_embeddings

  DL->>DS: __getitem__(idx)
  DS->>Cache: lookup (actual_idx, cache_key)
  alt cache hit
    Cache-->>DS: HeteroData
    DS-->>DL: clone(HeteroData)
  else cache miss
    DS->>LTC: geometry_path, cache_load_mmap
    LTC-->>DS: geometry dict (mmap or fallback)
    DS->>Ann: data, cache_load_mmap
    Ann-->>DS: annotated HeteroData
    DS->>Cache: store + evict LRU if over capacity
    DS-->>DL: clone(HeteroData)
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 A cache for my carrots, an mmap for speed,
Neighbors now capped so the graph won't stampede,
Each sample returned is a clone, safe and sound,
No mutations corrupt what the LRU found.
With flags from the CLI, the pipeline's complete—
This bunny hops faster on memory-mapped feet! 🌿

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title mentions reducing per-step I/O overhead and graph density, which aligns with the PR's optimization objectives: mmap-backed caching reduces I/O, sample LRU cache eliminates redundant I/O operations, and max_neighbors parameter controls graph sparsity. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/dataset.py (1)

849-853: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject stale geometry caches when max_neighbors changes.

Line 1119 records the cap, but Lines 849-853 still reuse any existing .pt solely by path. Re-running with the same processed_dir and a different max_neighbors silently serves old pp_edge_index and edge features, so the dataset no longer matches its configuration.

🐛 Minimal fail-fast guard
         cached = _load_torch_cache(cache_path, cache_load_mmap=self.cache_load_mmap)
+        cached_max_neighbors = cached.get("max_neighbors")
+        if cached_max_neighbors != self.max_neighbors:
+            raise ValueError(
+                f"Geometry cache {cache_path} was generated with "
+                f"max_neighbors={cached_max_neighbors}, but this dataset was "
+                f"configured with max_neighbors={self.max_neighbors}. "
+                "Regenerate the geometry cache or use a distinct geometry_cache_name."
+            )
 
         # load all data directly from cache (already includes mates if applicable)

Also applies to: 1119-1119, 1207-1215

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dataset.py` around lines 849 - 853, The list comprehension filtering
to_process entries at lines 849-853 only checks if the geometry cache file
exists by path, but does not validate that the cached file was created with the
same max_neighbors configuration. When max_neighbors changes on re-run with the
same processed_dir, stale cache files are silently reused despite no longer
matching the dataset configuration. Enhance the condition that checks for file
existence to also validate that the cached geometry was created with the current
max_neighbors value (as recorded at line 1119), ensuring entries with
incompatible caches are included in to_process and re-processed.
🧹 Nitpick comments (1)
tests/test_dataset.py (1)

656-698: ⚡ Quick win

Cover the mmap opt-in path too.

This test only asserts cache_load_mmap=False; if __getitem__ accidentally hard-coded False, the new opt-in behavior would still pass. Parameterize both values.

🧪 Proposed test tightening
-    def test_getitem_passes_mmap_flag_to_geometry_loader(self, tmp_path, monkeypatch):
+    @pytest.mark.parametrize("cache_load_mmap", [False, True])
+    def test_getitem_passes_mmap_flag_to_geometry_loader(
+        self, tmp_path, monkeypatch, cache_load_mmap
+    ):
         """Dataset geometry loading should use the configured mmap option."""
@@
             include_mates=False,
             preprocess=False,
-            cache_load_mmap=False,
+            cache_load_mmap=cache_load_mmap,
         )
@@
         assert data["protein"].num_nodes == 1
-        assert calls == [(cache_path, False)]
+        assert calls == [(cache_path, cache_load_mmap)]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_dataset.py` around lines 656 - 698, The test
test_getitem_passes_mmap_flag_to_geometry_loader only covers the case where
cache_load_mmap=False, which means if the code accidentally hard-coded False in
__getitem__, the test would still pass. Parameterize this test using
pytest.mark.parametrize to run with both cache_load_mmap=True and
cache_load_mmap=False, and update the corresponding assertion on the calls
variable to verify the correct mmap flag value is passed to the geometry loader
in each case.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/dataset.py`:
- Around line 746-755: Add validation for the max_neighbors parameter in the
constructor to ensure it contains only positive values, similar to the existing
validation for sample_cache_size. After the sample_cache_size validation check,
add a check that max_neighbors is greater than 0 and raise a ValueError with an
appropriate message if it is not. This will prevent invalid graph topology when
max_neighbors is later passed to the radius_graph call and ensures consistency
with other parameter validations in the constructor.
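
A minimal sketch of the validation requested here, assuming a standalone helper rather than the real constructor. The name `validate_dataset_options` and the error messages are hypothetical; the two conditions (`sample_cache_size >= 0`, `max_neighbors > 0`) are the ones named in the review.

```python
from typing import Tuple


def validate_dataset_options(
    max_neighbors: int, sample_cache_size: int
) -> Tuple[int, int]:
    """Normalize both options to int and reject out-of-range values."""
    max_neighbors = int(max_neighbors)
    sample_cache_size = int(sample_cache_size)
    if sample_cache_size < 0:
        raise ValueError(
            f"sample_cache_size must be >= 0, got {sample_cache_size}"
        )
    if max_neighbors <= 0:
        raise ValueError(f"max_neighbors must be > 0, got {max_neighbors}")
    return max_neighbors, sample_cache_size
```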


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 55892353-48b4-479c-b491-bfebf0ad51ca

📥 Commits

Reviewing files that changed from the base of the PR and between c3b9db6 and b13dc7d.

📒 Files selected for processing (3)
  • scripts/train.py
  • src/dataset.py
  • tests/test_dataset.py

Comment thread src/dataset.py

@marcuscollins marcuscollins left a comment

Collaborator


A couple small things for you to consider before merging, but otherwise approving.

Comment thread src/dataset.py
def _load_torch_cache(path: Path, cache_load_mmap: bool = True) -> dict:
    """Load a torch cache file, using mmap when supported by the file/runtime."""
    if not cache_load_mmap:
        return torch.load(path, weights_only=False)

Collaborator


should you pipe through weights_only?

Contributor Author


The weights_only parameter is True by default. I've set it to False everywhere torch.load is used to suppress the pickle-safety warning, since every call site loads .pt cache files that we generate ourselves. I don't expect to set it to True anywhere, so I'm not threading it through as a parameter.

Comment thread src/dataset.py
self.filter_by_bfactor = filter_by_bfactor
self.sample_cache_size = int(sample_cache_size)
self.cache_load_mmap = bool(cache_load_mmap)
self._sample_cache: OrderedDict[tuple[int, str], HeteroData] = OrderedDict()

Collaborator


does it need to be ordered?

Contributor Author


LRU semantics require it to be ordered; it uses move_to_end to mark recently-used keys and popitem(last=False) to evict the oldest, neither of which a regular dict supports.
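
For readers unfamiliar with the two methods mentioned in this reply, the short demonstration below shows them on a toy `OrderedDict` and checks that a plain `dict` offers neither `move_to_end` nor a `last` argument to `popitem`. The keys and values are made up for illustration.

```python
from collections import OrderedDict

cache = OrderedDict([("a", 1), ("b", 2), ("c", 3)])
cache.move_to_end("a")  # "a" becomes the most recently used key
oldest_key, _ = cache.popitem(last=False)  # removes the oldest remaining key

plain = {"a": 1, "b": 2, "c": 3}
has_move_to_end = hasattr(plain, "move_to_end")
try:
    plain.popitem(last=False)  # dict.popitem accepts no arguments
    accepts_last = True
except TypeError:
    accepts_last = False
```

A plain `dict` does preserve insertion order, so similar behaviour could be emulated by deleting and re-inserting keys, but `OrderedDict` expresses the intent directly.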

@vratins vratins merged commit aa8b771 into main Jun 24, 2026
4 checks passed
@vratins vratins deleted the dev_dataset_perf branch June 24, 2026 20:50

3 participants