Evaluating Robot Policies in a World Model [paper] [website] [demo]

Julian Quevedo¹, Percy Liang¹, Sherry Yang^1,2,3

Stanford University¹ New York University² Google DeepMind³

Overview

This repository contains the implementation accompanying the paper Evaluating Robot Policies in a World Model.

News:

6/24/25: Dataset download script and VLM reward script released
6/11/25: Initial training code released

TODO:

Release dataset preparation scripts
Release instructions for training on OpenVLA

Installation

# Install PyTorch (replace cu124 with your local CUDA version)
pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# Install other dependencies
pip install diffusers accelerate fire einops pytorchvideo tqdm imageio matplotlib

Quick Start

This is how you launch training. It will train on the tiny 10-example dataset in sample_data/.

# Replace N with the number of available GPUs
torchrun --nproc_per_node=N train.py

Checkpoints and generated GIF samples will be written to outputs/<timestamp>/.

Train on Open X-Embodiment Datasets

To train on the Open X-Embodiment datasets we used in the paper:

# We'll need tensorflow datasets and tensorflow since this code is 
# based on the original Open X-Embodiment repo.
pip install tensorflow tensorflow_datasets
# For example, download just the RT-1 dataset:
python download_data.py --dataset_name rt_1
# By default the data will be written to ./converted_datasets.
# To choose your own output directory:
python download_data.py --dataset_name rt_1 --output_dir <your output dir>

See download_data.py for more dataset names to choose from.

Then launch training with the correct dataset path:

torchrun --nproc_per_node=N train.py --dataset_dir ./converted_datasets --subset_names rt_1
# Replace ./converted_datasets if your path is different.

You can enter a comma separated list for subset_names to train on a mixture of multiple datasets. For example, after downloading the rt_1 and bridge_v2 datasets, you can do --subset_names rt_1,bridge_v2 to train on both the RT-1 and Bridge V2 datasets.

Training on Bridge V2

Since Bridge V2 was not included in the original Open X-Embodiment dataset, you'll need to first download the TFDS dataset to your machine like this:

wget -r -np -R "index.html*" https://rail.eecs.berkeley.edu/datasets/bridge_release/data/tfds/bridge_dataset/

Then, convert the dataset to our format with python download_data.py --dataset_name bridge_v2, changing BRIDGE_V2_PATH at the top of the script if necessary. Since Bridge V2 is a superset of Bridge V1, choose between either downloading bridge or bridge_v2.

VLM-based reward labeling

This script demonstrates how we use GPT-4o to judge the success of generated policy rollouts:

python vlm_reward.py --video_path <path to your .mp4> --task <rollout task instructions>

Citation

If you find this work useful, please cite:

@misc{quevedo2025evaluatingrobotpoliciesworld,
      title={Evaluating Robot Policies in a World Model}, 
      author={Julian Quevedo and Percy Liang and Sherry Yang},
      year={2025},
      eprint={2506.00613},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2506.00613}, 
}

Acknowledgements

Boyuan Chen and Kiwhan Song for Diffusion Forcing
DiT
Oasis

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
media		media
sample_data/bridge		sample_data/bridge
README.md		README.md
dataset.py		dataset.py
diffusion.py		diffusion.py
download_data.py		download_data.py
model.py		model.py
train.py		train.py
vae.py		vae.py
vlm_reward.py		vlm_reward.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Evaluating Robot Policies in a World Model [paper] [website] [demo]

Overview

Installation

Quick Start

Train on Open X-Embodiment Datasets

Training on Bridge V2

VLM-based reward labeling

Citation

Acknowledgements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Evaluating Robot Policies in a World Model [paper] [website] [demo]

Overview

Installation

Quick Start

Train on Open X-Embodiment Datasets

Training on Bridge V2

VLM-based reward labeling

Citation

Acknowledgements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages