102 commits
9744713
feat: add NPU device_memory_used and vllm support
UsernameFull Jan 28, 2026
3077bef
(feat): publish roll v0.2.0.
PanAndy Feb 3, 2026
c8f8029
(chore): append commiter for v0.2.0.
PanAndy Feb 5, 2026
f41a8f1
(chore): append commiter for v0.2.0.
PanAndy Feb 5, 2026
f3f13dc
Remove upload_to_mos call after checkpoint save
chocoded Feb 7, 2026
4a0ce56
fix: correct typo in async_parallel_rollout.md
WeiyaoLuo Feb 9, 2026
777dad6
feat: add katex to docs markdown
kkkky123 Feb 9, 2026
4a49bab
(docs): update readme.
PanAndy Feb 10, 2026
afc4d30
(fix): use default_factory for mutable SequencePackingConfig field.
hydrozhao Feb 12, 2026
ae69fd8
(fix): fix train_infer_is_weight KeyError for rlvr_vlm_pipeline and …
guoshengCS Feb 6, 2026
c70c473
(fix): handle is_last_step in DeepSpeedTrainStrategy.save_checkpoint
XucSh Feb 28, 2026
ce4e3a2
fix: address resource leaks and code quality issues
hobostay Feb 9, 2026
bec2a4b
(fix): set vllm VLLM_USE_FLASHINFER_SAMPLER=0 for torch 280.
PanAndy Feb 10, 2026
81e9c5c
(fix): set sglang port range to avoid conflicting.
HuangJoJo Feb 9, 2026
1054785
(fix): fix sglang multi-nodes fail when worker num > 1.
emiedon Feb 4, 2026
c72d283
(fix): optimize port allocation logic with atomic operation.
Feb 4, 2026
2e783fe
(chore): fix qwen3-vl-32B 80GB config.
HuangJoJo Feb 5, 2026
d6dad8f
(fix): hardcode default async concurrency limit to 1000 to remove dep…
hydrozhao Feb 26, 2026
526f7b5
(fix): fix reward metrics expo.
PanAndy Feb 26, 2026
53c6da3
(fix): fix batch num tokens.
PanAndy Mar 2, 2026
3b0398a
(fix): fix vllm process weights.
PanAndy Mar 3, 2026
cda8262
(fix): fix func download get_node_ip.
PanAndy Mar 4, 2026
ae0a39b
(fix): fix sglang process weights.
hydrozhao Mar 5, 2026
ca8e9e5
(fix): Make offload states configurable and Fix batch size setting in…
Schnabel-8 Feb 10, 2026
36d0064
(feat): support vllm 0.15.1.
hydrozhao Feb 11, 2026
c356bb9
(fix): FSDP2 DCP Saving when CPU Offload.
Feb 26, 2026
c601cd1
(feat): support sglang-router.
hydrozhao Feb 28, 2026
9087c02
(feat): add Dockerfile for torch2.10.0, support vllm 0.16.dev.
hydrozhao Mar 3, 2026
85100e8
(fix): pyarrow>15.0.0 jemalloc coredump, add torch2.10.0 deps, fix ro…
HuangJoJo Mar 3, 2026
f33540c
(feat): update mcore adapter.
chocoded Mar 3, 2026
4449a31
(feat): support training for qwen3.5-27B.
xuehuanran Mar 4, 2026
a35fbce
(fix): refactor sharded state dict metadata handling and integrate in…
chocoded Mar 5, 2026
b4facdd
(chore): move EnvAffinityRouter and PartialGPUManager to router.py.
hydrozhao Mar 5, 2026
5dfecbb
(fix): gracefully shutdown of Router.
hydrozhao Mar 5, 2026
31c99c9
(chore): release docker image for torch2.10.0.
hydrozhao Mar 5, 2026
16b3ca8
(feat): add example config for qwen3_5_35ba3.
xuehuanran Mar 5, 2026
50c8954
(fix): correct parameter name when constructing reward cluster.
hydrozhao Mar 5, 2026
82436a5
(feat): support onpolicy distillation.
Schnabel-8 Mar 6, 2026
5b488cc
(fix): fix version compare of torch for pg_options_param_name.
hydrozhao Mar 6, 2026
16d2113
(fix): separated the system role check from the skip_mock_system_prom…
hydrozhao Mar 6, 2026
0257ca3
(fix): prevent sync generate request execution during shutdown.
hydrozhao Mar 6, 2026
921dc28
(docs): update readme.
PanAndy Mar 6, 2026
d2dcd86
(fix): FSDP2 Model Initialization & Casting.
Mar 6, 2026
b63b3a4
fix bugs in strategy config and opd config
Schnabel-8 Mar 6, 2026
2eba7c3
(fix): add context parallel loss reduction in trainer.
chocoded Mar 9, 2026
5cd926f
fix: add sft support on npu
UsernameFull Feb 4, 2026
d133109
feat: add npu mindspeed
jiaqiw09 Feb 6, 2026
7f56229
feat: add NPU (Ascend) support for FSDP2, vLLM, model update, and pla…
UsernameFull Feb 10, 2026
9640566
Revert "adapt mindspeed"
UsernameFull Feb 11, 2026
91085a8
fix: rng_state on npu
UsernameFull Feb 11, 2026
079bd47
fix: DeepSpeedEngine.load_checkpoint method doesn't take an is_last_…
UsernameFull Mar 3, 2026
562c25a
platform add empty_cache get_rng_state set_rng_state
UsernameFull Mar 5, 2026
3210f2b
fix: support _set_allocator_settings in NPU
UsernameFull Mar 9, 2026
bd17364
feat: add reward model cluster mode for LLM-as-judge in RLVR pipeline
tanzelin430 Mar 10, 2026
53eddee
Add notable works section to README
taoluo Mar 16, 2026
2d6ab4f
Enhance model saving with PEFT support
chocoded Mar 17, 2026
13a5027
Import is_peft_available from transformers.utils
chocoded Mar 17, 2026
0ce37ff
Rename linear_attention_type to experimental_attention_variant
chocoded Mar 17, 2026
a26d90a
Rename linear_attention_type to experimental_attention_variant
chocoded Mar 17, 2026
f16628f
Rename linear_attention_type to experimental_attention_variant
chocoded Mar 17, 2026
2fd2606
feat: support rock native env and provide demo to run agent rollout a…
jingyushen Mar 17, 2026
c0f40a3
docs: Add Huawei Ascend hardware support doc
UsernameFull Mar 13, 2026
9de2784
fix: fix rlvr metrics update
UsernameFull Mar 17, 2026
9c6ce5c
fix: MultipleChoiceBoxedRuleRewardWorker returns a zero reward
luyouqi233 Mar 19, 2026
cb617db
Revise RLix description for clarity and detail
taoluo Mar 20, 2026
4fd4147
Add Qwen3.5 ROCK agentic SWE example
shamanez Mar 23, 2026
6190d06
minor comment.
shamanez Mar 24, 2026
52e0978
fix: disable reward normalization for SWE configs with group_size=1
shamanez Mar 24, 2026
345edea
Update config name in run_onpolicy_distill_pipeline.sh
joeyzyz Mar 24, 2026
f509efc
add reference to notable work
pUmpKin-Co Mar 26, 2026
53fce5a
added OpenRewards
Mar 25, 2026
bc9af12
added the openreward support.
Mar 25, 2026
6e1a5df
revert agent_native_env_manager.py to upstream version
shamanez Mar 25, 2026
b39681b
remove IPA config yaml not needed for OpenReward integration
shamanez Mar 25, 2026
a49a915
fix: use Cluster instead of WorkerConfig for dynamic batching dp_size
dubin555 Mar 14, 2026
7526682
add initial trackio integration for roll.
ParagEkbote Mar 26, 2026
3432271
(feat): tensorboard log in new executor.
PanAndy Apr 29, 2026
942703d
feat: add npu dockerfile and useage
UsernameFull Apr 27, 2026
034d38e
feat: add npu dockerfile and useage
UsernameFull Apr 29, 2026
19e740d
Optimize ROCm for send_recv and model_update
aaab8b Apr 15, 2026
4bb7d74
Update README.md
histmeisah May 4, 2026
2611542
Add support for ROCm 7.2 and PyTorch 2.10
aaab8b May 12, 2026
084f0ed
feat(agentic): integrate Atropos environment as gem.Env adapter
RUFFY-369 Apr 17, 2026
4960b49
docs: add atropos-gsm8k training demo configuration and launch script
RUFFY-369 Apr 17, 2026
6069cda
fix: move max_steps to yaml to avoid unrecognized cli args in start_p…
RUFFY-369 Apr 17, 2026
6d0df55
fix(yaml): remove duplicate max_steps key
RUFFY-369 Apr 17, 2026
190f124
fix(yaml): add explicit val_env_manager and RL params to avoid config…
RUFFY-369 Apr 17, 2026
e91f69e
fix(yaml): resolve ZeroDivisionError by providing valid val_env_config
RUFFY-369 Apr 17, 2026
5b45342
fix(scheduler): defensive Ray resource allocation for modern versions
RUFFY-369 Apr 17, 2026
8a53cb1
feat: integrate Atropos deep reasoning with GRPO and universal reward…
RUFFY-369 Apr 21, 2026
e5c91c7
fix: restore OpenReward demo config mistakenly pruned during cleanup
RUFFY-369 Apr 21, 2026
aa2052b
refactor: simplify resource allocation and restore original node sort…
RUFFY-369 May 11, 2026
9ec88d6
docs: Add Huawei Ascend hardware support doc
UsernameFull Mar 13, 2026
5d1e3c3
add npu doc
UsernameFull May 14, 2026
c09bc8b
[ascend adapt] qwen3-30b model vllm fsdp2
shun001 May 11, 2026
5d6d554
fix: correct typos and broken link in README
Galleons2029 May 19, 2026
2640a80
docs: add careers page
kkkky123 May 21, 2026
baaa682
docs: add careers application links
kkkky123 May 21, 2026
47202f5
[bugfix] ascend qwen3-30b fsdp2 model update bugfix and update yaml
shun001 May 26, 2026
ae4c065
fix image mismatch
sanmuf May 21, 2026
02b8736
feat: add npu ci yaml and fix tests
UsernameFull May 25, 2026
6de0a69
fix: fix TestVllmStrategyBeamSearch test
UsernameFull Jun 12, 2026
318 changes: 318 additions & 0 deletions .github/workflows/ci-npu-test.yml
@@ -0,0 +1,318 @@
name: Tests

on:
  workflow_dispatch:
  push:
    branches: [main]
    paths-ignore:
      - "docs_roll/**"
      - "**/*.md"
      - ".github/workflows/deploy.yml"
      - ".github/workflows/daily-stats.yml"
  pull_request:
    branches: [main]
    paths-ignore:
      - "docs_roll/**"
      - "**/*.md"
      - ".github/workflows/deploy.yml"
      - ".github/workflows/daily-stats.yml"

permissions:
  contents: read

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  unit-test:
    name: Unit Tests (CPU)
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: "pip"
          cache-dependency-path: |
            requirements_common.txt
            mcore_adapter/pyproject.toml
            mcore_adapter/requirements.txt
            setup.py
            pyproject.toml

      - name: Install dependencies
        run: |
          pip install --upgrade pip
          # Install PyTorch CPU-only to keep CI lightweight
          pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
          # Install core test dependencies (subset of requirements_common.txt)
          pip install pytest pytest-timeout pytest-asyncio numpy tensordict pydantic dacite \
            more_itertools hydra-core omegaconf peft==0.12.0 datasets==3.1.0 \
            trl==0.9.6 transformers ray[default] sympy deprecated codetiming pybase64 imageio \
            jsonschema mcp gem-llm==0.0.4 openai==2.31.0 gym 'gymnasium[toy-text]' gym_sokoban rl-rock
          # Install mcore_adapter and roll itself
          pip install -e ./mcore_adapter
          pip install -e .
          rock admin start

      - name: Run CPU-compatible unit tests
        run: |
          pytest tests/utils \
            tests/datasets \
            tests/agentic \
            tests/test_ref_worker_type_consistency.py \
            -v --timeout=300 --durations=0 --durations-min=0 -x
        env:
          PYTHONPATH: ${{ github.workspace }}
          ROLL_RUN_EXTERNAL_AGENTIC_TESTS: "0"
          ROLL_RUN_AGENTIC_SANDBOX_TESTS: "0"
          ROLL_RUN_AGENTIC_ENV_MANAGER_DEBUG_TESTS: "0"

  npu-test:
    name: NPU Integration Tests
    if: github.event_name != 'pull_request' || github.event.pull_request.head.repo.full_name == github.repository
    runs-on: linux-aarch64-a3-8
    timeout-minutes: 120
    container:
      # Pre-built NPU docker image (built from docker/Dockerfile.A3) with all deps pre-installed
      image: quay.io/ascend/vllm-ascend:v0.18.0-a3
    env:
      PIP_CACHE_DIR: ${{ github.workspace }}/.pip-cache
      PIP_INDEX_URL: https://repo.huaweicloud.com/repository/pypi/simple
      PIP_TRUSTED_HOST: repo.huaweicloud.com
      HF_ENDPOINT: https://hf-mirror.com
      # vLLM-Ascend sleep/offload uses memory pools that are incompatible with
      # expandable segments.
      PYTORCH_NPU_ALLOC_CONF: ""
      TASK_QUEUE_ENABLE: "1"
      VLLM_USE_V1: "1"
      # The CI vLLM smoke uses TP=1; FlashComm sequence parallelism requires TP>1.
      VLLM_ASCEND_ENABLE_FLASHCOMM: "0"
      # vLLM-Ascend sleep/wake rejects FRACTAL_NZ for RL-style weight reload flows.
      VLLM_ASCEND_ENABLE_NZ: "0"
      SGLANG_KERNEL_NPU_REPO: https://github.com/sgl-project/sgl-kernel-npu.git
      SGLANG_KERNEL_NPU_BRANCH: main
      SGLANG_KERNEL_NPU_CACHE_KEY: main
      SGLANG_REPO: https://github.com/sgl-project/sglang.git
      SGLANG_BRANCH: ifmn/eagle-dp-attn
      SGLANG_CACHE_KEY: ifmn-eagle-dp-attn

    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          submodules: recursive

      - name: Cache NPU pip packages
        uses: actions/cache@v4
        with:
          path: .pip-cache
          key: ${{ runner.os }}-npu-pip-${{ env.SGLANG_KERNEL_NPU_CACHE_KEY }}-${{ env.SGLANG_CACHE_KEY }}-${{ hashFiles('requirements_common.txt', 'requirements_vision.txt', 'mcore_adapter/pyproject.toml', 'mcore_adapter/requirements.txt', 'setup.py', 'pyproject.toml', '.github/workflows/ci-npu-test.yml') }}
          restore-keys: |
            ${{ runner.os }}-npu-pip-${{ env.SGLANG_KERNEL_NPU_CACHE_KEY }}-${{ env.SGLANG_CACHE_KEY }}-
            ${{ runner.os }}-npu-pip-${{ env.SGLANG_CACHE_KEY }}-
            ${{ runner.os }}-npu-pip-

      - name: Configure pip cache
        run: |
          mkdir -p "$PIP_CACHE_DIR"
          python3 -m pip cache dir

      - name: Configure Ascend runtime
        shell: bash
        run: |
          set -eo pipefail
          if [ -f /usr/local/Ascend/ascend-toolkit/set_env.sh ]; then
            source /usr/local/Ascend/ascend-toolkit/set_env.sh
          fi
          if [ -f /usr/local/Ascend/nnal/atb/set_env.sh ]; then
            source /usr/local/Ascend/nnal/atb/set_env.sh
          fi

          export ASCEND_HOME_PATH="${ASCEND_HOME_PATH:-/usr/local/Ascend/ascend-toolkit/latest}"
          export ASCEND_TOOLKIT_HOME="${ASCEND_TOOLKIT_HOME:-${ASCEND_HOME_PATH}}"
          export ASCEND_OPP_PATH="${ASCEND_OPP_PATH:-${ASCEND_HOME_PATH}/opp}"
          export ASCEND_AICPU_PATH="${ASCEND_AICPU_PATH:-${ASCEND_HOME_PATH}}"
          export LD_LIBRARY_PATH="/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/runtime/lib64:/usr/local/Ascend/ascend-toolkit/latest/runtime/lib64/stub:/usr/local/Ascend/ascend-toolkit/latest/tools/hccl/lib64:/usr/local/Ascend/ascend-toolkit/latest/hccl/lib64:${LD_LIBRARY_PATH:-}"

          cann_python_paths=()
          for path in \
            "${ASCEND_HOME_PATH}/python/site-packages" \
            "${ASCEND_HOME_PATH}/opp/built-in/op_impl/ai_core/tbe"; do
            if [ -d "$path" ]; then
              cann_python_paths+=("$path")
            fi
          done
          if [ ${#cann_python_paths[@]} -gt 0 ]; then
            export PYTHONPATH="$(IFS=:; echo "${cann_python_paths[*]}"):${PYTHONPATH:-}"
          fi

          echo "ASCEND_HOME_PATH=${ASCEND_HOME_PATH}" >> "$GITHUB_ENV"
          echo "ASCEND_TOOLKIT_HOME=${ASCEND_TOOLKIT_HOME}" >> "$GITHUB_ENV"
          echo "ASCEND_OPP_PATH=${ASCEND_OPP_PATH}" >> "$GITHUB_ENV"
          echo "ASCEND_AICPU_PATH=${ASCEND_AICPU_PATH}" >> "$GITHUB_ENV"
          echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" >> "$GITHUB_ENV"
          echo "PYTHONPATH=${PYTHONPATH:-}" >> "$GITHUB_ENV"
          echo "${ASCEND_HOME_PATH}/bin" >> "$GITHUB_PATH"
          echo "${ASCEND_HOME_PATH}/compiler/ccec_compiler/bin" >> "$GITHUB_PATH"

      - name: Show environment info
        run: |
          python3 - <<'PY'
          import importlib.util
          import importlib.metadata as metadata
          import sys

          import torch
          import torch_npu

          def module_available(name):
              return importlib.util.find_spec(name) is not None

          print(f"python={sys.version.split()[0]}")
          print(f"pip={metadata.version('pip')}")
          print(f"torch={torch.__version__}")
          print(f'torch_npu={torch_npu.__version__}')

          modules = ('tbe', 'decorator', 'attrs', 'psutil', 'scipy', 'cloudpickle', 'tornado', 'ml_dtypes')
          for module_name in modules:
              print(f'{module_name}_module={module_available(module_name)}')

          if not module_available('tbe'):
              raise RuntimeError('CANN tbe Python module is not visible in PYTHONPATH')
          if not torch.npu.is_available():
              raise RuntimeError('torch.npu.is_available() is False')
          print(f'npu_device_count={torch.npu.device_count()}')
          PY
          npu-smi info

      - name: Install pytest dependencies
        run: |
          pip install pytest-timeout

      - name: Install ROLL requirements
        run: |
          python3 -m pip install -r requirements_common.txt
          python3 -m pip install deepspeed==0.16.4 tensorboard

      - name: Install SGLang NPU kernel from source
        shell: bash
        run: |
          set -eo pipefail
          export SGLANG_KERNEL_NPU_SRC="/tmp/sgl-kernel-npu"
          rm -rf "${SGLANG_KERNEL_NPU_SRC}"
          git clone --depth 1 --branch "${SGLANG_KERNEL_NPU_BRANCH}" --recurse-submodules --shallow-submodules "${SGLANG_KERNEL_NPU_REPO}" "${SGLANG_KERNEL_NPU_SRC}"
          cd "${SGLANG_KERNEL_NPU_SRC}"
          git submodule status --recursive
          python3 -m pip install pybind11 wheel
          bash build.sh -a kernels
          python3 -m pip install output/sgl_kernel_npu*.whl
          python3 - <<'PY'
          import sgl_kernel_npu

          print(f"sgl_kernel_npu={sgl_kernel_npu.__path__}")
          PY

      - name: Install SGLang from source
        shell: bash
        run: |
          set -eo pipefail
          export SGLANG_SRC="/tmp/sglang"
          rm -rf "${SGLANG_SRC}"
          git clone --depth 1 --branch "${SGLANG_BRANCH}" "${SGLANG_REPO}" "${SGLANG_SRC}"
          python3 - <<'PY' > "${SGLANG_SRC}/ci-requirements.txt"
          import importlib.metadata
          import os
          import re
          import tomllib
          from pathlib import Path

          skip_packages = {
              "cuda-python",
              "flashinfer-cubin",
              "flashinfer-python",
              "nvidia-cutlass-dsl",
              "nvidia-ml-py",
              "sgl-kernel",
              "sglang-router",
              "torch",
              "torch-memory-saver",
              "torchaudio",
              "torchao",
              "torchcodec",
              "torchvision",
              "transformers",
          }

          pyproject = Path(os.environ["SGLANG_SRC"]) / "python" / "pyproject.toml"
          dependencies = tomllib.loads(pyproject.read_text())["project"]["dependencies"]
          for dependency in dependencies:
              package_name = re.split(r"[\[<>=!~; ]", dependency, maxsplit=1)[0]
              package_name = package_name.replace("_", "-").lower()
              if package_name in skip_packages:
                  continue
              try:
                  importlib.metadata.version(package_name)
              except importlib.metadata.PackageNotFoundError:
                  print(dependency)
          PY
          echo "Missing SGLang dependencies for CI:"
          cat "${SGLANG_SRC}/ci-requirements.txt"
          python3 -m pip install -r "${SGLANG_SRC}/ci-requirements.txt"
          python3 -m pip install --no-deps -e "${SGLANG_SRC}/python"
          python3 - <<'PY'
          import importlib.metadata

          print(f"sglang={importlib.metadata.version('sglang')}")
          PY

      - name: Install ROLL
        run: |
          python3 -m pip install -e .

      - name: Show vLLM Ascend info
        run: |
          python3 - <<'PY'
          import importlib.metadata as metadata

          import vllm
          from roll.platforms import current_platform

          def package_version(name):
              try:
                  return metadata.version(name)
              except metadata.PackageNotFoundError:
                  return "not installed"

          packages = ("vllm-ascend", "transformers", "deepspeed", "triton-ascend")
          for package_name in packages:
              print(f"{package_name}={package_version(package_name)}")

          print(f"vllm={vllm.__version__}")
          print(f"platform={current_platform.device_type}")
          PY

      - name: Run remaining NPU-compatible unit tests
        run: |
          export PYTHONPATH="${GITHUB_WORKSPACE}:${PYTHONPATH:-}"
          python3 -m pytest tests/third_party/sglang \
            tests/third_party/vllm \
            tests/third_party/deepspeed \
            tests/distributed \
            tests/models \
            tests/pipeline \
            tests/test_ref_worker_type_consistency.py \
            --ignore=tests/models/cuda_mem \
            --ignore=tests/distributed/scheduler/test_generate_scheduler.py \
            --ignore=tests/distributed/scheduler/test_initialize.py \
            --ignore=tests/distributed/scheduler/test_resource_manager.py \
            --ignore=tests/distributed/executor/test_ray_thread_actor_cuda_mem_leak.py \
            -v --timeout=600 --durations=0 --durations-min=0 -x
        env:
          ROLL_NPU_CI: "1"
          DS_UNITTEST_TIMEOUT: "600"
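
Review note on the "Install SGLang from source" step: the inline script derives a package name from each requirement string and drops entries that are on the skip list or already installed. The sketch below isolates that logic so it can be checked without an NPU runner. It is a minimal illustration, not the workflow's code: `SKIP_PACKAGES` is a hypothetical three-entry subset of the real list, and the `installed` argument is an assumed stand-in for the `importlib.metadata.version` lookup the workflow performs.

```python
import re

# Hypothetical subset of the workflow's skip list, for illustration only.
SKIP_PACKAGES = {"torch", "sgl-kernel", "flashinfer-python"}


def requirement_name(dependency: str) -> str:
    """Return the lowercased, dash-normalized name from a requirement string."""
    # Same split as the workflow script: cut at the first extras bracket,
    # version operator character, marker separator, or space.
    name = re.split(r"[\[<>=!~; ]", dependency, maxsplit=1)[0]
    return name.replace("_", "-").lower()


def filter_requirements(dependencies, installed):
    """Keep requirements that are neither skipped nor already installed.

    `installed` is an assumed set of normalized names; the workflow instead
    queries importlib.metadata for each package.
    """
    kept = []
    for dependency in dependencies:
        name = requirement_name(dependency)
        if name in SKIP_PACKAGES or name in installed:
            continue
        kept.append(dependency)
    return kept


if __name__ == "__main__":
    deps = ["torch==2.8.0", "uvicorn[standard]>=0.30", "flashinfer_python==0.2.3"]
    print(filter_requirements(deps, installed=set()))
```

One design point worth noting: the normalization only maps underscores to dashes and lowercases, so a requirement whose name contains a dot would not be matched against a dash-normalized skip entry.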
4 changes: 0 additions & 4 deletions .gitignore
@@ -1,8 +1,4 @@
# Ignore all png files
*.png

# But allow png files in static/img directory
!docs_roll/static/img/*.png
*.pyc
*/checkpoint_dir
*/dataset