perf(distributed): add retrieval tuning knobs by yuhezhang-ai · Pull Request #2452 · NVIDIA-NeMo/Automodel

yuhezhang-ai · 2026-06-08T16:15:43Z

What does this PR do ?

Adds distributed tuning and retrieval training fixes used for Nemotron VL retrieval fine-tuning benchmarks. The main goals are to make DDP configurable and faster for retrieval, preserve retrieval optimizer parameter groups, and make loss logging easier to compare with nemo-retriever-research.

Changelog

Expose additional DDP config flags:
- broadcast_buffers
- find_unused_parameters
- static_graph
- bucket_cap_mb
- gradient_as_bucket_view
Forward the new DDP flags into torch.nn.parallel.DistributedDataParallel.
Expose FSDP2 reshard_after_forward so no-reshard variants can be configured from YAML.
Thread reshard_after_forward through the FSDP2 manager and recursive sharding helper.
Speed up DDP gradient clipping by clipping directly on DDP bucket gradients when available.
Fix DDP recipe metric logging by reducing metrics across ranks even when no device mesh is configured.
Fix retrieval recipe attribute access when the bi-encoder is wrapped by DDP.
Preserve retrieval decay/no-decay optimizer parameter groups when constructing typed optimizer configs via build_from_param_groups(...).
Add step_scheduler.loss_average_window_steps with default 50.
Log retrieval loss_avg_window alongside raw per-step loss, so noisy retrieval loss curves can be compared more easily in W&B or local Slurm logs.
Add/update unit coverage for DDP config parsing, DDP manager wiring, FSDP2 reshard override behavior, optimizer param-group construction, retrieval optimizer setup, and averaged retrieval loss logging.
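As an illustration of the DDP flag forwarding described in the changelog, here is a minimal sketch. The `DDPTuningConfig` class and `wrap_with_ddp` helper are hypothetical names and are not taken from the PR. The defaults are assumed to mirror PyTorch's documented `DistributedDataParallel` defaults; the defaults actually chosen in this PR are not shown here.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class DDPTuningConfig:
    """Hypothetical container for the five DDP flags listed in the changelog."""

    broadcast_buffers: bool = True
    find_unused_parameters: bool = False
    static_graph: bool = False
    bucket_cap_mb: int = 25
    gradient_as_bucket_view: bool = False

    def to_ddp_kwargs(self) -> Dict[str, Any]:
        # Field names match DistributedDataParallel keyword arguments one-to-one.
        return asdict(self)


def wrap_with_ddp(module, cfg: DDPTuningConfig, device_ids=None):
    """Wrap a module with DDP, forwarding the tuning flags (assumed wiring)."""
    # Imported lazily so the config class can be used without torch installed.
    from torch.nn.parallel import DistributedDataParallel as DDP

    return DDP(module, device_ids=device_ids, **cfg.to_ddp_kwargs())


# Example: a static-graph configuration with larger gradient buckets.
cfg = DDPTuningConfig(static_graph=True, gradient_as_bucket_view=True, bucket_cap_mb=100)
ddp_kwargs = cfg.to_ddp_kwargs()
```

`wrap_with_ddp` needs an initialized process group, so it is defined here but not called.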

Notes

loss_average_window_steps is intentionally scoped under step_scheduler because it controls training-step logging behavior, similar to log_every_steps and log_remote_every_steps.
The averaged loss is logging-only. It does not change the training loss, backward pass, gradient scaling, optimizer step, or scheduler behavior.
The optimizer API change keeps the existing one-line simple optimizer construction path intact while adding a path for recipes that need explicit parameter groups.
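For readers unfamiliar with decay/no-decay parameter groups, here is a minimal sketch of the general pattern. The `split_decay_groups` helper and its grouping rule (no weight decay for biases and one-dimensional parameters such as norm weights) are assumptions for illustration. The rule used by the retrieval recipe and the signature of `build_from_param_groups(...)` are not shown in this PR description. Stand-in objects replace real tensors so the example runs without torch.

```python
from types import SimpleNamespace


def split_decay_groups(named_params, weight_decay):
    """Split parameters into decay / no-decay groups (assumed heuristic)."""
    decay, no_decay = [], []
    for name, param in named_params:
        if not param.requires_grad:
            continue  # frozen parameters are left out of the optimizer
        if param.ndim < 2 or name.endswith(".bias"):
            no_decay.append(param)
        else:
            decay.append(param)
    # This list-of-dicts layout is the parameter-group format torch optimizers accept.
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


# Stand-ins exposing only the attributes the helper reads.
fake_named_params = [
    ("encoder.linear.weight", SimpleNamespace(ndim=2, requires_grad=True)),
    ("encoder.linear.bias", SimpleNamespace(ndim=1, requires_grad=True)),
    ("encoder.norm.weight", SimpleNamespace(ndim=1, requires_grad=True)),
    ("encoder.frozen.weight", SimpleNamespace(ndim=2, requires_grad=False)),
]
groups = split_decay_groups(fake_named_params, weight_decay=0.1)
```

With a real model, `model.named_parameters()` would take the place of `fake_named_params`, and the resulting groups could be passed directly to an optimizer such as `torch.optim.AdamW`.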

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?

Additional Information

Related to: Nemotron VL 1B retrieval fine-tuning performance/loss-curve debugging.
Branch was rebased onto latest main before the latest update.

Validation on current rebased branch:

git diff --check origin/main...HEAD
python -m py_compile nemo_automodel/components/optim/optimizer.py nemo_automodel/components/training/step_scheduler.py nemo_automodel/recipes/retrieval/train_bi_encoder.py nemo_automodel/components/distributed/ddp.py nemo_automodel/components/training/utils.py

Earlier validation before the latest rebase:

source work/runs/_shared/env.sh && uv run --no-sync ruff check ...
source work/runs/_shared/env.sh && uv run --no-sync ruff format --check ...
git diff --check
source work/runs/_shared/env.sh && uv run --no-sync pytest tests/unit_tests/recipes/test_dist_setup.py tests/unit_tests/distributed/test_parallelizer.py tests/unit_tests/distributed/test_ddp_manager.py -q
Result: 126 passed, 17 warnings

Experiment sanity checks:

Real-data DDP torch AdamW fp32, weight_decay=0.1, 40-minute run reached step 1763 and ended due to Slurm time limit without a Python exception.
Real-data DDP TE FusedAdam bf16, weight_decay=0.1, 40-minute run reached step 2002 and ended due to Slurm time limit without a Python exception.
Reconstructed local 50-step averaged loss curves from Slurm logs show both DDP variants descending normally.
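The windowed averaging mentioned above can be sketched as a trailing mean over the most recent N per-step losses. The `WindowedLossAverage` class is a hypothetical name. Treating the window as a trailing mean is an assumption, since the PR description does not specify how the window is computed or how partially filled windows are handled.

```python
from collections import deque


class WindowedLossAverage:
    """Trailing mean over the last `window_steps` losses (logging-only helper)."""

    def __init__(self, window_steps: int = 50):
        if window_steps < 1:
            raise ValueError("window_steps must be >= 1")
        # A bounded deque drops the oldest value once the window is full.
        self._values = deque(maxlen=window_steps)

    def update(self, loss: float) -> float:
        """Record one per-step loss and return the current windowed mean."""
        self._values.append(float(loss))
        return sum(self._values) / len(self._values)


# Small window so the behaviour is easy to follow by hand.
tracker = WindowedLossAverage(window_steps=3)
averages = [tracker.update(loss) for loss in [4.0, 2.0, 3.0, 1.0]]
# averages == [4.0, 3.0, 3.0, 2.0]
```

Because the helper only reads already-computed loss values, it has no effect on the backward pass or optimizer step, consistent with the logging-only note in the description.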

Latest Update (2026-06-13)

Added configurable retrieval autocast via distributed.autocast_dtype; default remains disabled.
Wired top-level compile: config into retrieval bi-encoder model instantiation for DDP compile experiments.
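A minimal sketch of how an optional autocast setting that is disabled by default could be turned into a context manager. The `make_autocast_context` helper and the accepted dtype strings are assumptions for illustration. The values actually accepted by `distributed.autocast_dtype` are not listed in this PR description.

```python
import contextlib

# Assumed spellings; the real config may accept a different set of names.
_DTYPE_NAMES = {
    "bf16": "bfloat16",
    "bfloat16": "bfloat16",
    "fp16": "float16",
    "float16": "float16",
}


def make_autocast_context(autocast_dtype=None, device_type="cuda"):
    """Return a no-op context when autocast is disabled, else torch.autocast."""
    if autocast_dtype is None:
        # Default: autocast stays off and the forward pass runs unchanged.
        return contextlib.nullcontext()
    try:
        attr = _DTYPE_NAMES[autocast_dtype.lower()]
    except KeyError:
        raise ValueError(f"unsupported autocast dtype: {autocast_dtype!r}") from None
    # Imported lazily so the disabled path works without torch installed.
    import torch

    return torch.autocast(device_type=device_type, dtype=getattr(torch, attr))


# Disabled by default: the model call is wrapped in a no-op context.
with make_autocast_context(None):
    pass
```

In a training step, the forward pass would sit inside `with make_autocast_context(cfg_value):`, where `cfg_value` is whatever string the config supplies.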

Additional validation:

uv run pytest tests/unit_tests/recipes/test_retrieval_bi_encoder_recipe.py tests/unit_tests/recipes/test_dist_utils.py -q (77 passed)
uv run ruff check on touched retrieval/distributed files

copy-pr-bot · 2026-06-08T16:15:47Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

yuhezhang-ai · 2026-06-12T15:49:27Z

/ok to test e703720

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

yuhezhang-ai · 2026-06-12T15:55:06Z

/ok to test 3239799

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

yuhezhang-ai · 2026-06-13T19:40:39Z

/ok to test 4d43a66

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

yuhezhang-ai · 2026-06-13T23:17:05Z

/ok to test 55343c9

yuhezhang-ai added 4 commits June 9, 2026 18:22

perf(distributed): add retrieval tuning knobs

daec554

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

fix(retrieval): unwrap ddp model attrs

7d80687

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

perf(retrieval): speed up ddp grad clipping

c30d86c

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

fix(distributed): reduce DDP recipe metrics

52eeef3

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

yuhezhang-ai force-pushed the yuhez/perf/retrieval-distributed-tuning branch from c564193 to fcfcc72 Compare June 10, 2026 01:30

fix(retrieval): preserve optimizer groups and log average loss

3387071

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

yuhezhang-ai force-pushed the yuhez/perf/retrieval-distributed-tuning branch from fcfcc72 to 3387071 Compare June 10, 2026 01:49

chore(retrieval): drop megatron fsdp side changes

e703720

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

yuhezhang-ai marked this pull request as ready for review June 12, 2026 15:49

yuhezhang-ai requested review from a team as code owners June 12, 2026 15:49

copy-pr-bot Bot had a problem deploying to test June 12, 2026 15:49 Error

copy-pr-bot Bot had a problem deploying to nemo-ci June 12, 2026 15:49 Error

copy-pr-bot Bot temporarily deployed to nemo-ci June 12, 2026 15:49 Inactive

copy-pr-bot Bot temporarily deployed to public June 12, 2026 15:49 Inactive

copy-pr-bot Bot had a problem deploying to public June 12, 2026 15:51 Failure

style(retrieval): sort bi-encoder imports

3239799

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

copy-pr-bot Bot temporarily deployed to test June 12, 2026 15:55 Inactive

copy-pr-bot Bot temporarily deployed to public June 12, 2026 15:55 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci June 12, 2026 15:56 Inactive

copy-pr-bot Bot temporarily deployed to public June 12, 2026 15:58 Inactive

copy-pr-bot Bot temporarily deployed to public June 12, 2026 18:20 Inactive

copy-pr-bot Bot temporarily deployed to public June 12, 2026 18:23 Inactive

copy-pr-bot Bot temporarily deployed to public June 12, 2026 18:24 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci June 12, 2026 18:26 Inactive

copy-pr-bot Bot temporarily deployed to public June 12, 2026 18:33 Inactive

yuhezhang-ai added 2 commits June 13, 2026 12:28

perf(retrieval): make autocast configurable

ad3ff9f

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

perf(retrieval): wire compile config

4d43a66

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

copy-pr-bot Bot temporarily deployed to nemo-ci June 13, 2026 19:40 Inactive

copy-pr-bot Bot temporarily deployed to test June 13, 2026 19:40 Inactive

copy-pr-bot Bot temporarily deployed to public June 13, 2026 19:41 Inactive

copy-pr-bot Bot temporarily deployed to public June 13, 2026 19:43 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci June 13, 2026 19:45 Inactive

copy-pr-bot Bot temporarily deployed to public June 13, 2026 19:52 Inactive

test(diffusion): update DDP manager config expectation

55343c9

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

copy-pr-bot Bot deployed to test June 13, 2026 23:17 Active

copy-pr-bot Bot deployed to nemo-ci June 13, 2026 23:17 Active

perf(distributed): add retrieval tuning knobs #2452
yuhezhang-ai wants to merge 12 commits into main from yuhez/perf/retrieval-distributed-tuning
