action dataloader: episode-shuffle stream (fix DROID grad-norm instability) by fwd4 · Pull Request #37 · NVIDIA/cosmos-framework

fwd4 · 2026-06-12T03:20:14Z

Problem

Training with the DROID action SFT dataloader showed an unstable, slow-settling grad-norm and a noisy action-loss plateau compared with the internal reference. Root cause: the DROID action dataset is map-style and, unlike the iterable vision SFTDataset (which self-shuffles), does not shuffle on its own. RankPartitionedDataLoader wraps it in a DataLoader with no shuffle, which means a SequentialSampler. Every rank then iterates the same consecutive, overlapping windows, so the all-reduced global batch effectively covers ~1 episode, which gives high gradient variance.

(Forward + gradients were verified numerically equivalent to the internal model on identical input, so this was a data-path issue, not the model/loss/optimizer.)
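The effect on batch diversity can be illustrated with a toy model. This is a minimal sketch, not the repository's code: the function names, the fixed `windows_per_episode`, and the toy sizes are all assumptions made for illustration. It counts how many distinct episodes contribute to one global batch when every rank reads the same consecutive indices, versus when a shuffled episode order is sharded across ranks.

```python
import random


def episode_of(flat_index, windows_per_episode):
    # Toy assumption: every episode contributes the same number of windows.
    return flat_index // windows_per_episode


def sequential_global_batch(num_ranks, per_rank_batch, step):
    # Models the described failure mode: no shuffle and no rank sharding,
    # so every rank reads the same consecutive flat indices.
    start = step * per_rank_batch
    local = list(range(start, start + per_rank_batch))
    return [list(local) for _ in range(num_ranks)]


def episode_shuffled_global_batch(num_ranks, per_rank_batch, num_episodes,
                                  windows_per_episode, seed):
    # Shuffle the episode order once, give each rank a disjoint slice of it,
    # and read sequentially within each episode.
    order = list(range(num_episodes))
    random.Random(seed).shuffle(order)
    batches = []
    for rank in range(num_ranks):
        my_episodes = order[rank::num_ranks]
        indices = [ep * windows_per_episode + w
                   for ep in my_episodes
                   for w in range(windows_per_episode)]
        batches.append(indices[:per_rank_batch])
    return batches


def distinct_episodes(global_batch, windows_per_episode):
    return len({episode_of(i, windows_per_episode)
                for local in global_batch for i in local})


seq = sequential_global_batch(num_ranks=8, per_rank_batch=32, step=0)
shuf = episode_shuffled_global_batch(num_ranks=8, per_rank_batch=32,
                                     num_episodes=64, windows_per_episode=100,
                                     seed=0)
print(distinct_episodes(seq, 100))   # 1 episode in the whole global batch
print(distinct_episodes(shuf, 100))  # 8 episodes, one per rank
```

In this toy setup the sequential path draws the whole global batch from a single episode, while the sharded, episode-shuffled path draws from as many episodes as there are ranks.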

Fix

ActionIterableShuffleDataset (iterable_shuffle=True): an IterableDataset view of the map-style dataset. It shards episodes across rank × worker, shuffles the episode order, and reads sequentially within each episode. The result is decorrelated batches with sequential reads, which preserves I/O locality and copy-on-write (COW) memory sharing. A plain shuffle=True/RandomSampler instead does random-access I/O, which led to ~11 min/iter and OOM from broken COW. The design mirrors the internal iterable dataset's per-worker episode assignment.

Adds DROIDLeRobotDataset.get_shuffle_blocks(), which returns the per-episode/segment flat-index blocks that the iterable streams.
No DataLoader/sampler change needed: IterableDataset is handled natively (sampler=None).
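The streaming order described above can be sketched in plain Python. This is an illustration under stated assumptions, not the PR's implementation: `build_episode_blocks` is a hypothetical stand-in for whatever `get_shuffle_blocks()` returns, and `stream_indices` is a hypothetical helper. In a real PyTorch `IterableDataset.__iter__`, the worker id and worker count would typically come from `torch.utils.data.get_worker_info()`, and the rank and world size from the distributed setup.

```python
import random
from typing import Iterator, List, Sequence


def build_episode_blocks(episode_lengths: Sequence[int]) -> List[List[int]]:
    # Hypothetical stand-in for get_shuffle_blocks(): one block of
    # consecutive flat indices per episode.
    blocks, start = [], 0
    for length in episode_lengths:
        blocks.append(list(range(start, start + length)))
        start += length
    return blocks


def stream_indices(blocks: List[List[int]], rank: int, world_size: int,
                   worker_id: int, num_workers: int,
                   seed: int, epoch: int = 0) -> Iterator[int]:
    # Every shard computes the same permutation of block ids from the same
    # seed, so the shards stay disjoint without any communication.
    order = list(range(len(blocks)))
    random.Random(seed + epoch).shuffle(order)

    # One shard per (rank, worker) pair.
    shard = rank * num_workers + worker_id
    num_shards = world_size * num_workers

    # Episode order is shuffled; reads inside an episode stay sequential.
    for block_id in order[shard::num_shards]:
        yield from blocks[block_id]


blocks = build_episode_blocks([3, 2, 4, 1])
streams = [list(stream_indices(blocks, rank, 2, worker, 2, seed=0))
           for rank in range(2) for worker in range(2)]
for s in streams:
    print(s)
```

With four episodes and four (rank, worker) shards, each shard streams exactly one episode in order, and together the shards cover every flat index once. Seeding the permutation identically on every shard is one simple way to keep the shards disjoint; the actual dataset may coordinate this differently.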

Validation (8192 global batch)

iter	grad-norm (this fix)	grad-norm (internal ref)	grad-norm (no-shuffle)
100	2.9	4.7	21
450	1.7	1.9	n/a

Per-component action loss converges to ~0.0055 (matches internal ~0.005; the no-shuffle run plateaued noisily at 0.03–0.07). Builds on #24 (recipe + FusedAdam optimizer).

🤖 Generated with Claude Code

The DROID action dataset is map-style and (unlike the iterable vision SFTDataset) does not self-shuffle, and RankPartitionedDataLoader wrapped it in a DataLoader with no shuffle -> SequentialSampler. Every rank then iterated the same consecutive, overlapping windows, so the all-reduced global batch was ~1 episode -> high gradient variance and an unstable, slow-settling grad-norm.

Fix: ActionIterableShuffleDataset (iterable_shuffle=True) streams rank x worker-sharded, episode-order-shuffled, sequential-within-episode -- decorrelated batches with sequential reads (I/O locality + copy-on-write preserved; a plain RandomSampler instead does random-access I/O -> ~11min/iter + OOM). Mirrors i4's ActionUnifiedIterableDataset worker assignment.

Adds DROIDLeRobotDataset.get_shuffle_blocks() for the per-episode/segment index blocks the iterable streams. No DataLoader change needed -- IterableDataset is handled natively (sampler=None).

Validated (256-rank-equivalent, 8192 global): grad-norm settles 27.8->2.9->1.7, tracking the internal reference (43->4.7->1.9) vs the no-shuffle run stuck at ~21; per-component action loss converges to ~0.0055 (matches internal ~0.005 vs the broken run's noisy 0.03-0.07).

Signed-off-by: Hao Liang <haolia@nvidia.com>

mli0603 · 2026-06-12T05:06:14Z

LGTM

lfengad

overall LGTM

fwd4 force-pushed the droid-action-shuffle branch from f786168 to 8eec346 June 12, 2026 03:25

fwd4 requested review from lfengad, mli0603 and ychao-nvidia June 12, 2026 03:38

mli0603 enabled auto-merge (squash) June 12, 2026 05:06

lfengad reviewed Jun 12, 2026

Comment thread cosmos_framework/data/vfm/action/datasets/action_sft_dataset.py

lfengad approved these changes Jun 12, 2026

fwd4 wants to merge 1 commit into NVIDIA:main from fwd4:droid-action-shuffle
