Refine rollout worker health check and recovery lifecycle by YanhuiDua · Pull Request #1877 · InternLM/xtuner

YanhuiDua · 2026-06-05T08:01:35Z

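Overview

This PR consists of two refactors:

Refactor the rollout server launch layout, adding EngineLaunchSpec / ServerProcessSpec.
Refactor the rollout health manager to make the responsibility boundaries among RolloutController, RolloutHealthManager, and RolloutWorker explicit. When one worker in an EP group fails, all workers in that group are restarted. In addition, when RolloutHealthManager detects that all workers have failed, it restarts all of them immediately.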

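Why the Health Manager refactor depends on the Server Launch Spec refactor

Rollout health recovery cannot rely only on knowing which workers are still alive. It also needs to know how each rollout server was launched and at what granularity to recover after a failure.

In scenarios such as LMDeploy EP and SGLang multi-node, a logical engine and a server process do not map one-to-one:

LMDeploy EP: one engine may contain multiple server processes, and several of those servers can accept rollout requests.
SGLang multi-node: one engine may run one server process per node, but only the node 0 server accepts rollout requests.
Recovery cannot restart only the single failed rank; it has to know which server processes in the same lifecycle group must be stopped and restarted together.
Routing likewise cannot send requests to every server, only to request entrypoints.

The Health Manager therefore depends on the structured information provided by the Server Launch Spec:

engine_ranks: which worker ranks make up a logical engine.
server_processes: which rollout server processes this engine actually launched.
server_worker_ranks: which worker ranks own a server process and must take part in lifecycle management.
accepts_rollout_requests: which servers are request entrypoints and can receive generate requests.
dist_init_addr / placement_group_bundle_idxs: reused from the original launch layout during worker recovery, so that the server address and resource binding do not change after a restart.

So the first commit makes the server launch layout explicit. Only then can the second commit use that structured information to move health checking, state transitions, group recovery, and request routing responsibilities out of the controller.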
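The two topologies described above can be sketched as plain dataclasses. This is only an illustration of the kind of information a launch spec carries: the class and field names come from the PR description, but the exact field layout, the two helper properties, the rank counts, and the address are assumptions and are not taken from the xtuner source.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ServerProcessSpec:
    """One rollout server process (field layout is an assumption)."""

    worker_rank: int  # rank of the worker that owns this server process
    accepts_rollout_requests: bool  # True if this server is a request entrypoint
    placement_group_bundle_idxs: Tuple[int, ...] = ()


@dataclass(frozen=True)
class EngineLaunchSpec:
    """Launch layout of one logical engine (field layout is an assumption)."""

    engine_ranks: Tuple[int, ...]
    server_processes: Tuple[ServerProcessSpec, ...]
    dist_init_addr: Optional[str] = None

    @property
    def server_worker_ranks(self) -> Tuple[int, ...]:
        """Ranks that own a server process and take part in lifecycle management."""
        return tuple(p.worker_rank for p in self.server_processes)

    @property
    def request_entrypoint_ranks(self) -> Tuple[int, ...]:
        """Ranks whose server may receive generate requests."""
        return tuple(
            p.worker_rank for p in self.server_processes if p.accepts_rollout_requests
        )


# LMDeploy EP: every EP rank runs a server that accepts rollout requests.
lmdeploy_ep = EngineLaunchSpec(
    engine_ranks=(0, 1, 2, 3),
    server_processes=tuple(ServerProcessSpec(r, True) for r in range(4)),
)

# SGLang across two 8-GPU nodes: one server per node, only node 0 is an entrypoint.
sglang_multi_node = EngineLaunchSpec(
    engine_ranks=tuple(range(16)),
    server_processes=(ServerProcessSpec(0, True), ServerProcessSpec(8, False)),
    dist_init_addr="10.0.0.1:29500",  # made-up address for illustration
)
```

With this shape, routing only needs request_entrypoint_ranks, while recovery needs server_worker_ranks, which is why the two are kept as separate views of the same spec.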

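Main changes

Server Launch Spec refactor

Add ServerProcessSpec and EngineLaunchSpec to describe explicitly which server processes each inference engine should launch.
Refactor RolloutController._init_workers to build the launch specs first and then launch servers from them.
Push the LMDeploy / SGLang backend launch differences down into each backend's build_engine_launch_specs.
Support and express explicitly:
- LMDeploy EP: each EP rank launches one server that can accept rollout requests.
- SGLang multi-node: each node launches one server, and only the node 0 server acts as the request entrypoint.
Workers cache the launch spec on first init, and later recovery restarts reuse the original placement / dist-init layout.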

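Rollout Health Manager refactor

Add RolloutHealthManager, which is solely responsible for worker health checks, state transitions, and failure recovery.
RolloutController keeps only worker creation, request routing, training lifecycle control, and similar responsibilities.
RolloutWorker remains responsible for backend-specific server start/stop, health probes, and generate behavior.
Replace the previous boolean active state with WorkerLifecycleState, which distinguishes explicitly between:
- ACTIVE
- INACTIVE
- RECOVERING
Recovery is handled per lifecycle group, so that only part of the servers in one engine is never restarted on its own.
Introduce request entrypoint semantics so that SessionRouter and the generate readiness check target only servers that actually receive requests.
After a worker restart, a generate readiness check is run against the request-serving endpoints.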
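The interaction between the three states, lifecycle groups, and request entrypoints can be sketched as follows. The state names come from this PR; the enum values, the two helper functions, and the rank layout are hypothetical and only illustrate the idea that a whole group changes state together while routing considers ACTIVE entrypoints only.

```python
from enum import Enum, auto


class WorkerLifecycleState(Enum):
    # State names follow the PR description; the values are an assumption.
    ACTIVE = auto()
    INACTIVE = auto()
    RECOVERING = auto()


def set_group_state(states, group_ranks, new_state):
    """Move every rank of one lifecycle group to the same state."""
    for rank in group_ranks:
        states[rank] = new_state


def routable_ranks(states, entrypoint_ranks):
    """Ranks that may receive generate requests: ACTIVE request entrypoints only."""
    return [
        rank
        for rank in sorted(entrypoint_ranks)
        if states[rank] is WorkerLifecycleState.ACTIVE
    ]


# Two lifecycle groups, (0, 1) and (2, 3); ranks 0 and 2 are request entrypoints.
states = {rank: WorkerLifecycleState.ACTIVE for rank in range(4)}

# Rank 3 fails its health check, so its whole group leaves service together.
set_group_state(states, (2, 3), WorkerLifecycleState.INACTIVE)
routable_after_failure = routable_ranks(states, {0, 2})

# While the group restarts it is RECOVERING and still not routable.
set_group_state(states, (2, 3), WorkerLifecycleState.RECOVERING)
routable_during_recovery = routable_ranks(states, {0, 2})

# After a successful generate readiness check the group becomes ACTIVE again.
set_group_state(states, (2, 3), WorkerLifecycleState.ACTIVE)
routable_after_recovery = routable_ranks(states, {0, 2})
```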

Note: this PR includes the functionality of PR 1873: #1873

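Testing

Verified that EP/TP training of an MoE model runs normally, and that after the rank 2 worker dies, the whole EP/TP group recovers normally. When all workers die, all workers restart immediately. One known issue remains: when all workers die, some data is lost. Fixing it requires changes to the data flow, so this PR does not address it.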

YanhuiDua · 2026-06-11T09:09:26Z

@claude review

claude · 2026-06-11T09:15:51Z

Claude: Summary

Well-structured refactor that introduces EngineLaunchSpec / ServerProcessSpec to make backend launch topology explicit, and extracts health state management into RolloutHealthManager with clear lifecycle-group recovery semantics.

Issues

Critical

[health_manager.py:137-143] run_once() only triggers recovery when all workers are down. A single failed lifecycle group stays INACTIVE indefinitely during generation (no offload, no shutdown) — GPU memory may remain allocated. The old code at least offloaded and shut down failed workers.

Warning

[health_manager.py:85-89] stop() calls self._thread.join() without a timeout. The old code used join(timeout=5). If recovery is mid-flight (up to 5 min for generate readiness checks), shutdown blocks indefinitely.
[health_manager.py:268-280] Generate-readiness polling is sequential per worker per attempt. With N workers × 60s timeout, worst case is N×60s per attempt. Consider parallelizing like init and check_health already are.
[controller.py:350] _broadcast_to_active_workers silently drops **kwargs — the signature still accepts them but the body ignores them.
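The parallelization suggested for generate-readiness polling could look roughly like the sketch below. It is not the PR's code: wait_all_ready, the probe callables, and the polling interval are hypothetical stand-ins for per-worker readiness checks. The point is that each worker gets its own deadline on its own thread, so the worst case is about one timeout rather than N timeouts.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def wait_all_ready(probes, timeout_s, poll_interval_s=0.01):
    """Poll every probe concurrently; return {name: ready_before_timeout}.

    probes maps a worker name to a zero-argument callable returning bool.
    """

    def wait_one(probe):
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if probe():
                return True
            time.sleep(poll_interval_s)
        return False

    with ThreadPoolExecutor(max_workers=max(1, len(probes))) as pool:
        futures = {name: pool.submit(wait_one, probe) for name, probe in probes.items()}
        return {name: future.result() for name, future in futures.items()}


# Three hypothetical workers: two become ready at once, one never does.
started = time.monotonic()
readiness = wait_all_ready(
    {"rank0": lambda: True, "rank1": lambda: False, "rank2": lambda: True},
    timeout_s=0.1,
)
elapsed = time.monotonic() - started
```

Here elapsed stays close to a single timeout_s even though one probe never succeeds, whereas a sequential loop would add one full timeout per unready worker.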

Nit

[health_manager.py:124-130] getattr(info, "lifecycle_group_ranks", ()) defensive patterns on a concrete dataclass — if the field is missing at runtime, it's a bug worth surfacing.
[worker.py:586-589] init() signature changed from positional dist_init_addr to keyword-only engine_launch_spec — worth noting in docstring since test mocks and subclasses must match.

ProduceBatchResult impact

Not affected — no changes to batch accounting, status conversion, or reward/filter paths.

RoutedExperts impact

Not affected — no changes to routed-experts plumbing, object ref ownership, or extra_fields handling.

Verdict

REQUEST_CHANGES — the run_once recovery gap (failed groups left without offload/shutdown) risks GPU memory leaks in production.

YanhuiDua · 2026-06-12T01:14:03Z

@claude review

claude · 2026-06-12T01:17:32Z

+                }
+                for future in as_completed(future_to_group):
+                    group = future_to_group[future]
+                    try:
+                        group_recovery_results[group.ranks] = future.result()
+                    except Exception:
+                        logger.exception(f"Failed to restart rollout worker group ranks={group.ranks}.")
+                        group_recovery_results[group.ranks] = False
+
+            if self._is_stopping():
+                for group in sorted_failed_groups:


Claude: Critical: run_once() still only triggers recover_failed_workers() when all workers are down after the health check. If a single lifecycle group fails, it's marked INACTIVE by _check_active_workers_and_mark_failed_groups() but no recovery or cleanup is attempted — the failed group's GPU memory remains allocated indefinitely until the next training barrier (ensure_workers_healthy_before_training).

The old RolloutHealthChecker.run_once() at least offloaded and shut down failed workers immediately. The new code should either:

Attempt recovery or at least offload/shutdown for failed groups during generation, or

Document this as an intentional design decision with a comment explaining why partial-group recovery during generation is unsafe.

Without either, a long-running generation phase can leak GPU memory from dead workers.
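The first option could be sketched as a small decision step inside the periodic check. This is a hypothetical illustration of the suggestion, not the PR's run_once(): plan_run_once and its action labels are invented here, and the actual offload/shutdown calls are left out.

```python
def plan_run_once(group_health):
    """Decide what one health-check pass should do.

    group_health maps a lifecycle group (tuple of ranks) to True if healthy.
    Returns (action, failed_groups), where action is one of:
      "none"        - every group is healthy
      "cleanup"     - some groups failed: offload and shut them down now
      "recover_all" - every group failed: restart everything immediately
    """
    failed = sorted(group for group, healthy in group_health.items() if not healthy)
    if not failed:
        return "none", []
    if len(failed) == len(group_health):
        return "recover_all", failed
    return "cleanup", failed


partial = plan_run_once({(0, 1): True, (2, 3): False})
total = plan_run_once({(0, 1): False, (2, 3): False})
healthy = plan_run_once({(0, 1): True, (2, 3): True})
```

The "cleanup" branch is the case this comment is about: a partially failed pool gets an explicit action instead of waiting for the next training barrier.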

ProduceBatchResult impact: Not directly affected, but reduced worker pool capacity can affect group_gen_* timing if generation load concentrates on fewer workers.

claude · 2026-06-12T01:17:36Z

+        assert self._stop_event is not None
+        self._stopped = True
+        self._stop_event.set()
+        if self._pause_event:


Claude: Nit: When join() times out, the method returns early without clearing self._thread, self._stop_event, or self._pause_event. This means a subsequent start() call will see self._thread as non-None and is_alive() still True, so it short-circuits without creating a new thread. The stale references could also confuse is_paused() or other state queries.

Consider at least logging that the manager is in a degraded state, or setting self._stopped = True so _is_stopping() guards subsequent operations.
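One way to make the timeout path explicit is sketched below. LoopThread is a hypothetical stand-in for the manager's thread handling, not the PR's class: stop() sets the stopped flag before joining, reports a timeout through its return value, and clears its references only once the thread has really exited.

```python
import threading


class LoopThread:
    """Minimal background loop whose stop() stays consistent after a join timeout."""

    def __init__(self, body):
        self._body = body  # callable taking the stop event
        self._thread = None
        self._stop_event = None
        self._stopped = False

    def start(self):
        if self._thread is not None and self._thread.is_alive():
            return False  # an earlier loop is still running; do not start a second one
        self._stopped = False
        self._stop_event = threading.Event()
        self._thread = threading.Thread(
            target=self._body, args=(self._stop_event,), daemon=True
        )
        self._thread.start()
        return True

    def stop(self, timeout):
        if self._thread is None:
            return True
        self._stopped = True  # set before joining so guards see it even on timeout
        self._stop_event.set()
        self._thread.join(timeout=timeout)
        if self._thread.is_alive():
            # Degraded: keep the references so start() cannot run two loops at once.
            return False
        self._thread = None
        self._stop_event = None
        return True


# A loop body that ignores the stop event until an external release, to force a timeout.
release = threading.Event()
loop = LoopThread(lambda stop_event: release.wait())
first_start = loop.start()
timed_out_stop = loop.stop(timeout=0.05)  # join times out, loop is still alive
restart_while_stuck = loop.start()  # refused while the old thread is alive
release.set()
clean_stop = loop.stop(timeout=1.0)  # thread exits, references are cleared
```

Returning False from stop() gives the caller a place to log the degraded state, which is the minimum this comment asks for.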

claude · 2026-06-12T01:17:42Z

+
+        if failed_groups and not self._is_stopping():
+            with self._operation_lock:
+                current_workers_snapshot = self.snapshot_workers()
+                active_groups = {
+                    worker.lifecycle_group_ranks or (worker.rank,)
+                    for worker in current_workers_snapshot.values()
+                    if worker.active
+                }
+                failed_groups = failed_groups & active_groups
+                for group_ranks in failed_groups:
+                    self._set_group_lifecycle_state(group_ranks, WorkerLifecycleState.INACTIVE)
+
+        return len(workers_to_check)
+
+    def _check_workers_health(self, workers_to_check: list[WorkerSnapshot]) -> list[bool]:
+        """Run periodic check_health probes concurrently."""
+        if self._check_failure_threshold <= 0:


Claude: Warning: Periodic health checks now use only check_health (basic HTTP liveness) for all workers, including request-entrypoint workers. The previous implementation (in the initial patch 2 version) used check_health_generate for request entrypoints, which validates the inference engine is actually functional.

This means a worker whose HTTP server is alive but whose inference engine is broken (e.g., CUDA OOM, stuck model load) will pass periodic health checks indefinitely. The generate-readiness verification now only happens during recovery restart, not during ongoing monitoring.

If this is intentional (to avoid the cost of generate probes every check interval), consider documenting why — otherwise this is a regression in failure detection coverage.
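If the cost of generate probes is the concern, a middle ground is to run the cheap probe on every pass and the generate probe only on every Nth pass for request entrypoints. The sketch below illustrates that cadence; choose_probe and generate_every are hypothetical, and only the probe names check_health and check_health_generate come from the PR.

```python
def choose_probe(check_index, is_request_entrypoint, generate_every=5):
    """Pick which probe one periodic health check should run.

    Request entrypoints get the expensive generate probe on every
    generate_every-th pass; all other checks use the basic liveness probe.
    A generate_every of 0 or less disables the generate probe entirely.
    """
    if is_request_entrypoint and generate_every > 0 and check_index % generate_every == 0:
        return "check_health_generate"
    return "check_health"


entrypoint_schedule = [choose_probe(i, True, generate_every=3) for i in range(6)]
non_entrypoint_schedule = [choose_probe(i, False, generate_every=3) for i in range(3)]
```

With generate_every=3, an entrypoint whose inference engine is broken is detected within three check intervals instead of only at the next recovery.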

claude · 2026-06-12T01:18:12Z

Claude: Re-review Summary (after patches 4-5)

Good progress since the first review. The stop() timeout and _broadcast_to_active_workers kwargs issues are fixed. The parallel recovery via ThreadPoolExecutor and the failure-threshold counter are solid additions.

Remaining Issues

Critical

[health_manager.py:212-222] run_once() still only recovers when all workers are down. A single failed lifecycle group stays INACTIVE without offload/shutdown, leaking GPU memory until the next training barrier. This was the critical issue from the first review and remains unresolved.

Warning

[health_manager.py:284-301] Periodic health checks downgraded from check_health_generate to plain check_health — workers with a live HTTP server but broken inference engine won't be caught until recovery.
[worker.py:900-902] check_health_generate now raises NotImplementedError instead of falling back to check_health(). New backends that miss this will crash during recovery rather than degrade gracefully.

Nit

[health_manager.py:95-98] stop() returns early on timeout without clearing stale _thread/_stop_event references, which could confuse subsequent start()/is_paused() calls.

Previous Issues — Fixed

stop() timeout: now uses HEALTH_MANAGER_STOP_JOIN_TIMEOUT = 30.0 ✓
_broadcast_to_active_workers kwargs dropped: now forwards **kwargs ✓

ProduceBatchResult impact: not affected — no changes to batch accounting, status conversion, or reward/filter paths.

RoutedExperts impact: not affected — no changes to routed-experts plumbing, object ref ownership, or extra_fields handling.

Verdict

REQUEST_CHANGES — the run_once recovery/cleanup gap for partially-failed groups remains the blocking issue.

YanhuiDua mentioned this pull request Jun 5, 2026

Support EP rollout worker group recovery #1873

Open

YanhuiDua added 3 commits June 11, 2026 08:35

Refactor rollout server launch specs

f3da8ce

Refactor rollout health manager

13a81cd

Restore rollout worker session URLs

a766fdb

YanhuiDua force-pushed the fix-health-check-part3 branch from 9cb357d to a766fdb June 11, 2026 09:08