Julian Quevedo¹, Percy Liang¹, Sherry Yang¹,²,³
Stanford University¹, New York University², Google DeepMind³
This repository contains the implementation accompanying the paper Evaluating Robot Policies in a World Model.
News:
- 6/24/25: Dataset download script and VLM reward script released
- 6/11/25: Initial training code released
TODO:
- Release dataset preparation scripts
- Release instructions for training on OpenVLA
# Install PyTorch (replace cu124 with your local CUDA version)
pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
# Install other dependencies
pip install diffusers accelerate fire einops pytorchvideo tqdm imageio matplotlib

To launch training on the tiny 10-example dataset in sample_data/:
# Replace N with the number of available GPUs
torchrun --nproc_per_node=N train.py

Checkpoints and generated GIF samples will be written to outputs/<timestamp>/.
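After a run, it can be handy to find the newest directory under outputs/ and the GIF samples inside it. The sketch below is not part of this repository: latest_run_dir and list_gifs are hypothetical helper names. It assumes only that each run gets its own subdirectory of outputs/, and it picks the newest run by modification time because the exact timestamp format is not specified here.

```python
from pathlib import Path
from typing import List, Optional


def latest_run_dir(outputs_root: str = "outputs") -> Optional[Path]:
    """Return the most recently modified subdirectory of outputs_root, or None."""
    root = Path(outputs_root)
    if not root.is_dir():
        return None
    run_dirs = [p for p in root.iterdir() if p.is_dir()]
    if not run_dirs:
        return None
    # Compare modification times instead of directory names, since the
    # format of the <timestamp> directory name is not assumed here.
    return max(run_dirs, key=lambda p: p.stat().st_mtime)


def list_gifs(run_dir: Path) -> List[Path]:
    """Recursively collect the .gif files under a run directory, sorted by path."""
    return sorted(run_dir.rglob("*.gif"))
```

For example, calling latest_run_dir() from the repository root and passing the result to list_gifs would list whatever GIF samples the latest run has produced so far.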
To train on the Open X-Embodiment datasets we used in the paper:
# We'll need tensorflow datasets and tensorflow since this code is
# based on the original Open X-Embodiment repo.
pip install tensorflow tensorflow_datasets
# For example, download just the RT-1 dataset:
python download_data.py --dataset_name rt_1
# By default the data will be written to ./converted_datasets.
# To choose your own output directory:
python download_data.py --dataset_name rt_1 --output_dir <your output dir>

See download_data.py for more dataset names to choose from.
Then launch training with the correct dataset path:
torchrun --nproc_per_node=N train.py --dataset_dir ./converted_datasets --subset_names rt_1
# Replace ./converted_datasets if your path is different.

To train on a mixture of multiple datasets, pass a comma-separated list to --subset_names. For example, after downloading the rt_1 and bridge_v2 datasets, use --subset_names rt_1,bridge_v2 to train on both RT-1 and Bridge V2.
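If you launch training from a script, the command line can be assembled programmatically. This is a minimal sketch, not code from this repository: build_train_command is a hypothetical helper, and it uses only the flags shown above (--nproc_per_node, --dataset_dir, --subset_names).

```python
from typing import List, Optional, Sequence


def build_train_command(
    num_gpus: int,
    dataset_dir: Optional[str] = None,
    subset_names: Sequence[str] = (),
) -> List[str]:
    """Build the torchrun argument list for train.py (hypothetical helper)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    cmd = ["torchrun", f"--nproc_per_node={num_gpus}", "train.py"]
    if dataset_dir is not None:
        cmd += ["--dataset_dir", dataset_dir]
    if subset_names:
        # A dataset mixture is passed as a single comma-separated value.
        cmd += ["--subset_names", ",".join(subset_names)]
    return cmd
```

The returned list can be handed to subprocess.run(cmd, check=True). With no dataset arguments, it reduces to the sample-data command shown earlier.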
Since Bridge V2 was not included in the original Open X-Embodiment dataset, you'll need to first download the TFDS dataset to your machine like this:
wget -r -np -R "index.html*" https://rail.eecs.berkeley.edu/datasets/bridge_release/data/tfds/bridge_dataset/
Then convert the dataset to our format with python download_data.py --dataset_name bridge_v2, changing BRIDGE_V2_PATH at the top of the script if necessary. Since Bridge V2 is a superset of Bridge V1, download either bridge or bridge_v2, not both.
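Because Bridge V2 already contains Bridge V1, listing both in a mixture would duplicate data. The following is a small sketch of a pre-launch check, under the assumption that you validate the value yourself before passing it to --subset_names. parse_subset_names is a hypothetical helper, and it is not known whether train.py performs a similar check.

```python
from typing import List


def parse_subset_names(value: str) -> List[str]:
    """Split a comma-separated dataset list and reject bridge + bridge_v2 together."""
    # Ignore surrounding whitespace and empty entries such as a trailing comma.
    names = [name.strip() for name in value.split(",") if name.strip()]
    if not names:
        raise ValueError("expected at least one dataset name")
    if "bridge" in names and "bridge_v2" in names:
        raise ValueError("bridge_v2 is a superset of bridge; choose only one")
    return names
```

Joining the returned names with commas gives a value suitable for --subset_names.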
This script demonstrates how we use GPT-4o to judge the success of generated policy rollouts:
python vlm_reward.py --video_path <path to your .mp4> --task <rollout task instructions>

If you find this work useful, please cite:
@misc{quevedo2025evaluatingrobotpoliciesworld,
  title={Evaluating Robot Policies in a World Model},
  author={Julian Quevedo and Percy Liang and Sherry Yang},
  year={2025},
  eprint={2506.00613},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2506.00613},
}
Acknowledgements:
- Boyuan Chen and Kiwhan Song for Diffusion Forcing
- DiT
- Oasis



