Skip to content

Strict type-checking + unit tests for foundry.utils; fix dataset fallback weights; drop dead rf3 module#325

Open
lyskov-ai wants to merge 4 commits into
RosettaCommons:productionfrom
lyskov-ai:0042-foundry-datasets-mypy-strict-and-tests
Open

Strict type-checking + unit tests for foundry.utils; fix dataset fallback weights; drop dead rf3 module#325
lyskov-ai wants to merge 4 commits into
RosettaCommons:productionfrom
lyskov-ai:0042-foundry-datasets-mypy-strict-and-tests

Conversation

@lyskov-ai

Copy link
Copy Markdown
Contributor

Bundles four related changes that tighten type-checking and test coverage on the shared foundry layer.

Strict type-checking for the rest of foundry.utils

Enables disallow_untyped_defs + check_untyped_defs for datasets, instantiators, ddp, squashfs, and logging, completing strict type-checking across src/foundry/utils/. These modules were already annotated, so this is mostly a lock-in (new untyped defs now fail the gate); ddp and logging needed a handful of return/param annotations.

Fix: dataset fallback sampler weights

wrap_dataset_and_sampler_with_fallbacks selected the fallback sampler's weights with "weights" in sampler. A Sampler has no __contains__, so in iterates the sampler's integer indices and the string never matches — the "use the sampler's weights" branch was effectively dead, and the test needlessly drew samples. Switched to hasattr(sampler, "weights") (the form atomworks uses internally); the type checker then narrows the type, so a now-redundant suppression was removed.

Reviewer / mpnn maintainers — behaviour change: for the distributed dataloader path this is a no-op (its samplers have no weights, so still uniform). But models/mpnn/src/mpnn/train.py passes a WeightedRandomSampler built from computed PDB weights as the fallback sampler — so its training fallback sampling now uses those weights instead of silently-uniform ones (the function's documented intent). mpnn/train.py is a cluster-only script and is not exercised in CI, so this could not be verified here; please validate when running mpnn training.

New unit tests (CPU-only, no cluster data)

  • tests/test_datasets.py (18) — the fallback-weights selection (incl. a regression test for the fix), the config-key sampler dispatch in instantiate_single_dataset_and_sampler, and the sampler conversions + input-validation guards in assemble_distributed_loader.
  • tests/test_instantiators.py (10) and tests/test_logging.py (6) — the instantiation control flow and the two pure logging helpers.

Remove dead, broken rf3.data.paired_msa

The module subclassed an atomworks class that has since become a factory function, so it raised TypeError on import, and it was reachable only through a commented-out (disabled) dataset config. Removed the module, its orphaned domain_distillation.yaml config, and the dangling commented references.

Verification: ruff format/ruff check clean, mypy clean (152 files), pytest 356 passed.

lyskov and others added 4 commits June 16, 2026 19:43
instantiators is already fully annotated, so this adds it to the
per-module strict mypy override (a lock-in for future defs) and fills the
test gap with tests/test_instantiators.py covering the callback/logger
instantiation control flow.

Co-authored-by: lyskov-ai <277346777+lyskov-ai@users.noreply.github.com>
…st logging

Adds the three remaining small foundry.utils modules to the per-module strict
mypy override. squashfs was already fully annotated (pure lock-in); ddp needed
one annotation (RankedLogger.log varargs); logging needed five. New
tests/test_logging.py covers the two pure helpers (CachedDataFilter.filter and
condense_count_columns_of_grouped_df).

Co-authored-by: lyskov-ai <277346777+lyskov-ai@users.noreply.github.com>
rf3.data.paired_msa no longer imports against the installed atomworks:
MultiInputDatasetWrapper subclasses StructuralDatasetWrapper, which atomworks
turned into a deprecated factory function, so subclassing it raises TypeError
at import. The module was reachable only through domain_distillation.yaml,
which is itself referenced only in commented-out lines of
pdb_and_distillation.yaml, and its LoadPairedMSAs class was used nowhere.

Remove the module, its orphaned domain_distillation.yaml config, the dangling
commented references in pdb_and_distillation.yaml, and the stale pyproject
mypy-exemption comment. rf3 now type-checks fully with no in-file suppressions.

Co-authored-by: lyskov-ai <277346777+lyskov-ai@users.noreply.github.com>
…+ tests

In wrap_dataset_and_sampler_with_fallbacks the fallback-weights branch was
gated on `"weights" in sampler_to_fallback_to`. Sampler defines no
__contains__, so `in` iterates the sampler's integer indices and never
matches the string — the documented "use the sampler's weights" branch was
dead and the membership test needlessly drew samples. Switch to
`hasattr(sampler_to_fallback_to, "weights")` (the idiomatic form atomworks
itself uses); mypy then narrows the type so the `# type: ignore[attr-defined]`
on `.weights` is dropped.

Behaviour change: assemble_distributed_loader's reachable samplers
(DistributedSampler / DistributedMixedSampler) have no `.weights`, so it stays
uniform (behaviour-neutral). But models/mpnn/src/mpnn/train.py passes a
WeightedRandomSampler built from computed PDB weights as the fallback sampler,
so its training fallback sampling now uses those weights instead of
silently-uniform ones — the function's documented intent. mpnn train.py is a
cluster-coupled script (not in CI), so this is unverified here; flagged for
mpnn owners to validate.

Also add foundry.utils.datasets to the strict mypy override (already fully
annotated — a pure lock-in) and add tests/test_datasets.py (18 tests, no prior
coverage) covering the fallback-weights fix, the config-key sampler dispatch,
and the distributed-loader sampler conversions + validation guards.

Co-authored-by: lyskov-ai <277346777+lyskov-ai@users.noreply.github.com>
@lyskov-ai lyskov-ai requested a review from woodsh17 June 17, 2026 19:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants