Skip to content

anshks/world-model-eval

 
 

Repository files navigation

Evaluating Robot Policies in a World Model [paper] [website] [demo]

sweep z sweep y sweep x gripper

Julian Quevedo1, Percy Liang1, Sherry Yang1,2,3

Stanford University1    New York University2    Google DeepMind3

Overview

This repository contains the implementation accompanying the paper Evaluating Robot Policies in a World Model.

News:

  • 6/24/25: Dataset download script and VLM reward script released
  • 6/11/25: Initial training code released

TODO:

  • Release dataset preparation scripts
  • Release instructions for training on OpenVLA

Installation

# Install PyTorch (replace cu124 with your local CUDA version)
pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# Install other dependencies
pip install diffusers accelerate fire einops pytorchvideo tqdm imageio matplotlib

Quick Start

This is how you launch training. It will train on the tiny 10-example dataset in sample_data/.

# Replace N with the number of available GPUs
torchrun --nproc_per_node=N train.py

Checkpoints and generated GIF samples will be written to outputs/<timestamp>/.

Train on Open X-Embodiment Datasets

To train on the Open X-Embodiment datasets we used in the paper:

# We'll need tensorflow datasets and tensorflow since this code is 
# based on the original Open X-Embodiment repo.
pip install tensorflow tensorflow_datasets
# For example, download just the RT-1 dataset:
python download_data.py --dataset_name rt_1
# By default the data will be written to ./converted_datasets.
# To choose your own output directory:
python download_data.py --dataset_name rt_1 --output_dir <your output dir>

See download_data.py for more dataset names to choose from.

Then launch training with the correct dataset path:

torchrun --nproc_per_node=N train.py --dataset_dir ./converted_datasets --subset_names rt_1
# Replace ./converted_datasets if your path is different.

You can enter a comma separated list for subset_names to train on a mixture of multiple datasets. For example, after downloading the rt_1 and bridge_v2 datasets, you can do --subset_names rt_1,bridge_v2 to train on both the RT-1 and Bridge V2 datasets.

Training on Bridge V2

Since Bridge V2 was not included in the original Open X-Embodiment dataset, you'll need to first download the TFDS dataset to your machine like this:

wget -r -np -R "index.html*" https://rail.eecs.berkeley.edu/datasets/bridge_release/data/tfds/bridge_dataset/

Then, convert the dataset to our format with python download_data.py --dataset_name bridge_v2, changing BRIDGE_V2_PATH at the top of the script if necessary. Since Bridge V2 is a superset of Bridge V1, choose between either downloading bridge or bridge_v2.

VLM-based reward labeling

This script demonstrates how we use GPT-4o to judge the success of generated policy rollouts:

python vlm_reward.py --video_path <path to your .mp4> --task <rollout task instructions>

Citation

If you find this work useful, please cite:

@misc{quevedo2025evaluatingrobotpoliciesworld,
      title={Evaluating Robot Policies in a World Model}, 
      author={Julian Quevedo and Percy Liang and Sherry Yang},
      year={2025},
      eprint={2506.00613},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2506.00613}, 
}

Acknowledgements

About

Code for "Evaluating Robot Policies in a World Model".

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 100.0%