Refine rollout worker health check and recovery lifecycle #1877

Open
YanhuiDua wants to merge 7 commits into InternLM:main from YanhuiDua:fix-health-check-part3
@YanhuiDua YanhuiDua commented Jun 5, 2026

Collaborator

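Overview

This PR contains two refactors:

  1. Refactor the rollout server launch layout by adding EngineLaunchSpec / ServerProcessSpec.
  2. Refactor the rollout health manager to make the responsibility boundaries between RolloutController, RolloutHealthManager, and RolloutWorker explicit. When one worker in an EP group fails, all workers in that group are restarted. In addition, when RolloutHealthManager detects that all workers have failed, it restarts all of them immediately.

Why the Health Manager refactor depends on the Server Launch Spec refactor

Rollout health recovery needs to know more than which workers are still alive. It must also know how each rollout server was launched and at what granularity to recover after a failure.

In scenarios such as LMDeploy EP and cross-node SGLang, a logical engine and a server process do not map one-to-one:

  • LMDeploy EP: one engine may contain multiple server processes, and several of those servers can accept rollout requests.
  • Cross-node SGLang: one engine may have one server process per node, but only the node 0 server accepts rollout requests.
  • Recovery cannot restart only the single failed rank. It needs to know which server processes in the same lifecycle group must be stopped and restarted together.
  • Routing cannot send requests to every server either. Requests may only go to request entrypoints.

The Health Manager therefore depends on the structured information provided by the Server Launch Spec:

  • engine_ranks: the worker ranks that make up one logical engine.
  • server_processes: the rollout server processes this engine actually launched.
  • server_worker_ranks: the worker ranks that own a server process and must take part in lifecycle management.
  • accepts_rollout_requests: which servers are request entrypoints and can accept generate requests.
  • dist_init_addr / placement_group_bundle_idxs: reused from the original launch layout during worker recovery, so that server addresses and resource bindings do not change after a restart.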
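To make the fields above concrete, here is a minimal sketch of how such a spec could be modelled. It is not the code in this PR: the class and field names come from the description, but the exact shape (frozen dataclasses, accepts_rollout_requests living on ServerProcessSpec, server_worker_ranks as a derived property, the request_entrypoint_ranks helper) and the rank and address values in the two examples are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ServerProcessSpec:
    """One rollout server process inside a logical engine (assumed shape)."""

    worker_rank: int                # rank of the worker that owns this server process
    accepts_rollout_requests: bool  # True only for request entrypoints


@dataclass(frozen=True)
class EngineLaunchSpec:
    """Launch layout of one logical engine (assumed shape)."""

    engine_ranks: Tuple[int, ...]
    server_processes: Tuple[ServerProcessSpec, ...]
    dist_init_addr: Optional[str] = None
    placement_group_bundle_idxs: Tuple[int, ...] = ()

    @property
    def server_worker_ranks(self) -> Tuple[int, ...]:
        # Ranks that own a server process and take part in lifecycle management.
        return tuple(p.worker_rank for p in self.server_processes)

    @property
    def request_entrypoint_ranks(self) -> Tuple[int, ...]:
        # Hypothetical helper: ranks whose server may receive generate requests.
        return tuple(
            p.worker_rank for p in self.server_processes if p.accepts_rollout_requests
        )


# LMDeploy EP style: every EP rank runs a server that accepts requests.
lmdeploy_ep = EngineLaunchSpec(
    engine_ranks=(0, 1, 2, 3),
    server_processes=tuple(ServerProcessSpec(r, True) for r in (0, 1, 2, 3)),
)

# Cross-node SGLang style: one server per node, only node 0 is an entrypoint.
# The 2 x 8 layout and the address are made-up illustration values.
sglang_cross_node = EngineLaunchSpec(
    engine_ranks=tuple(range(16)),
    server_processes=(ServerProcessSpec(0, True), ServerProcessSpec(8, False)),
    dist_init_addr="10.0.0.1:29500",
)
```

With a layout like this, routing would only consult request_entrypoint_ranks, while recovery would act on server_worker_ranks for the whole engine.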

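So the first commit makes the server launch layout explicit. Only then can the second commit use that structured information to move health checking, state transitions, group recovery, and request routing responsibilities out of the controller.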

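Main changes

Server Launch Spec refactor

  • Add ServerProcessSpec / EngineLaunchSpec to describe explicitly which server processes each inference engine should launch.
  • Refactor RolloutController._init_workers to build the launch specs first and then launch servers from those specs.
  • Push the LMDeploy / SGLang backend launch differences down into each backend's build_engine_launch_specs.
  • Support and express explicitly:
    • LMDeploy EP: each EP rank launches a server that can accept rollout requests.
    • Cross-node SGLang: each node launches one server, and only the node 0 server acts as the request entrypoint.
  • Workers cache the launch spec on first init and reuse the original placement / dist-init layout when restarted during recovery.

Rollout Health Manager refactor

  • Add RolloutHealthManager, which owns worker health checking, state transitions, and failure recovery.
  • RolloutController keeps only worker creation, request routing, and training lifecycle control.
  • RolloutWorker remains responsible for backend-specific server start/stop, health probes, and generate behavior.
  • Replace the previous boolean active state with WorkerLifecycleState, which distinguishes:
    • ACTIVE
    • INACTIVE
    • RECOVERING
  • Recovery is handled per lifecycle group, so that a subset of the servers in one engine is never restarted on its own.
  • Introduce request entrypoint semantics so that SessionRouter and the generate readiness check only target servers that actually receive requests.
  • After a worker restart, a generate readiness check is run against the request-serving endpoints.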
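The lifecycle-group idea can be sketched roughly as follows. WorkerLifecycleState and its three members are named in this PR, but the enum values and the two helper functions below (mark_failed_groups, routable_ranks) are hypothetical and only illustrate the intended semantics: one failed rank takes its whole group out of service, and routing only considers ACTIVE request entrypoints.

```python
from enum import Enum
from typing import Dict, Iterable, List, Set, Tuple


class WorkerLifecycleState(Enum):
    # Member names follow the PR description; the string values are assumptions.
    ACTIVE = "active"
    INACTIVE = "inactive"
    RECOVERING = "recovering"


def mark_failed_groups(
    states: Dict[int, WorkerLifecycleState],
    lifecycle_groups: Iterable[Tuple[int, ...]],
    failed_ranks: Set[int],
) -> List[Tuple[int, ...]]:
    """Mark every rank of a group INACTIVE if any member of that group failed."""
    failed_groups = []
    for group in lifecycle_groups:
        if any(rank in failed_ranks for rank in group):
            for rank in group:
                states[rank] = WorkerLifecycleState.INACTIVE
            failed_groups.append(group)
    return failed_groups


def routable_ranks(
    states: Dict[int, WorkerLifecycleState],
    entrypoint_ranks: Iterable[int],
) -> List[int]:
    """Only ACTIVE request entrypoints are eligible to receive generate requests."""
    return [
        rank
        for rank in sorted(entrypoint_ranks)
        if states[rank] is WorkerLifecycleState.ACTIVE
    ]


# Two groups of two ranks each; rank 2 fails, so ranks 2 and 3 both go INACTIVE.
states = {rank: WorkerLifecycleState.ACTIVE for rank in range(4)}
failed = mark_failed_groups(states, [(0, 1), (2, 3)], failed_ranks={2})
```

A recovery step would then move the failed groups to RECOVERING, restart all of their server processes together, and only flip them back to ACTIVE after the readiness check passes.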

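Note: this PR includes the functionality of PR 1873: #1873

Testing

Verified that EP/TP training of an MoE model runs normally. When the worker on rank 2 dies, the whole EP/TP group recovers; when all workers die, all workers are restarted immediately. One known issue remains: when all workers die, some data is lost. Fixing that requires changes to the data flow, so it is out of scope for this PR.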

@YanhuiDua YanhuiDua force-pushed the fix-health-check-part3 branch from 9cb357d to a766fdb Compare June 11, 2026 09:08
@YanhuiDua

Collaborator Author

@claude review

Comment thread xtuner/v1/rl/rollout/health_manager.py
Comment thread xtuner/v1/rl/rollout/health_manager.py Outdated
Comment thread xtuner/v1/rl/rollout/health_manager.py Outdated
Comment thread xtuner/v1/rl/rollout/health_manager.py
Comment thread xtuner/v1/rl/rollout/health_manager.py Outdated
Comment thread xtuner/v1/rl/rollout/worker.py
Comment thread xtuner/v1/rl/rollout/controller.py
@claude

claude Bot commented Jun 11, 2026

Contributor

Claude: Summary

Well-structured refactor that introduces EngineLaunchSpec / ServerProcessSpec to make backend launch topology explicit, and extracts health state management into RolloutHealthManager with clear lifecycle-group recovery semantics.

Issues

Critical

  • [health_manager.py:137-143] run_once() only triggers recovery when all workers are down. A single failed lifecycle group stays INACTIVE indefinitely during generation (no offload, no shutdown) — GPU memory may remain allocated. The old code at least offloaded and shut down failed workers.

Warning

  • [health_manager.py:85-89] stop() calls self._thread.join() without a timeout. The old code used join(timeout=5). If recovery is mid-flight (up to 5 min for generate readiness checks), shutdown blocks indefinitely.
  • [health_manager.py:268-280] Generate-readiness polling is sequential per worker per attempt. With N workers × 60s timeout, worst case is N×60s per attempt. Consider parallelizing like init and check_health already are.
  • [controller.py:350] _broadcast_to_active_workers silently drops **kwargs — the signature still accepts them but the body ignores them.

Nit

  • [health_manager.py:124-130] getattr(info, "lifecycle_group_ranks", ()) defensive patterns on a concrete dataclass — if the field is missing at runtime, it's a bug worth surfacing.
  • [worker.py:586-589] init() signature changed from positional dist_init_addr to keyword-only engine_launch_spec — worth noting in docstring since test mocks and subclasses must match.

ProduceBatchResult impact

Not affected — no changes to batch accounting, status conversion, or reward/filter paths.

RoutedExperts impact

Not affected — no changes to routed-experts plumbing, object ref ownership, or extra_fields handling.

Verdict

REQUEST_CHANGES — the run_once recovery gap (failed groups left without offload/shutdown) risks GPU memory leaks in production.

@YanhuiDua

Collaborator Author

@claude review

Comment on lines +212 to +222
}
for future in as_completed(future_to_group):
    group = future_to_group[future]
    try:
        group_recovery_results[group.ranks] = future.result()
    except Exception:
        logger.exception(f"Failed to restart rollout worker group ranks={group.ranks}.")
        group_recovery_results[group.ranks] = False

if self._is_stopping():
    for group in sorted_failed_groups:

Contributor


Claude: Critical: run_once() still only triggers recover_failed_workers() when all workers are down after the health check. If a single lifecycle group fails, it's marked INACTIVE by _check_active_workers_and_mark_failed_groups() but no recovery or cleanup is attempted — the failed group's GPU memory remains allocated indefinitely until the next training barrier (ensure_workers_healthy_before_training).

The old RolloutHealthChecker.run_once() at least offloaded and shut down failed workers immediately. The new code should either:

  1. Attempt recovery or at least offload/shutdown for failed groups during generation, or
  2. Document this as an intentional design decision with a comment explaining why partial-group recovery during generation is unsafe.

Without either, a long-running generation phase can leak GPU memory from dead workers.
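As a rough illustration of option 1, the decision step after a health check could look like the sketch below. handle_health_results, cleanup_group, and recover_all are hypothetical names, not functions from this PR; cleanup_group stands in for whatever offload/shutdown the manager would perform for a failed group.

```python
from typing import Callable, Dict, List, Tuple

Group = Tuple[int, ...]


def handle_health_results(
    group_health: Dict[Group, bool],
    cleanup_group: Callable[[Group], None],
    recover_all: Callable[[], None],
) -> List[Group]:
    """Decide what to do after one health-check pass.

    - every group failed: trigger a full recovery immediately
    - some groups failed: release their resources right away instead of
      leaving them allocated until the next training barrier
    """
    failed = sorted(group for group, healthy in group_health.items() if not healthy)
    if failed and len(failed) == len(group_health):
        recover_all()
        return failed
    for group in failed:
        cleanup_group(group)
    return failed


# Example: group (2, 3) failed while (0, 1) is still healthy.
cleaned: List[Group] = []
recovered: List[str] = []
partial = handle_health_results(
    {(0, 1): True, (2, 3): False},
    cleanup_group=cleaned.append,
    recover_all=lambda: recovered.append("all"),
)
```

In the partial-failure case only cleanup_group runs; recover_all is reserved for the case where no group is left to serve requests.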

ProduceBatchResult impact: Not directly affected, but reduced worker pool capacity can affect group_gen_* timing if generation load concentrates on fewer workers.

Comment on lines +95 to +98
assert self._stop_event is not None
self._stopped = True
self._stop_event.set()
if self._pause_event:

Contributor


Claude: Nit: When join() times out, the method returns early without clearing self._thread, self._stop_event, or self._pause_event. This means a subsequent start() call will see self._thread as non-None and is_alive() still True, so it short-circuits without creating a new thread. The stale references could also confuse is_paused() or other state queries.

Consider at least logging that the manager is in a degraded state, or setting self._stopped = True so _is_stopping() guards subsequent operations.
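A minimal sketch of the stop/start handling suggested here, using a stand-in class rather than the real RolloutHealthManager. PeriodicManager, the degraded flag, and the boolean return values are assumptions for illustration: references are cleared only after a successful join, and a timed-out join is recorded instead of being silently ignored.

```python
import threading
from typing import Optional


class PeriodicManager:
    """Stand-in for a background health loop (not the PR's class)."""

    def __init__(self, interval: float = 0.01) -> None:
        self._interval = interval
        self._thread: Optional[threading.Thread] = None
        self._stop_event: Optional[threading.Event] = None
        self._stopped = False
        self.degraded = False  # hypothetical flag: stop() timed out earlier

    def _run(self, stop_event: threading.Event) -> None:
        # Event.wait returns True once the event is set, which ends the loop.
        while not stop_event.wait(self._interval):
            pass  # a real manager would run one health-check pass here

    def start(self) -> bool:
        if self._thread is not None and self._thread.is_alive():
            return False  # still running, or a previous stop() timed out
        self._stopped = False
        self.degraded = False
        self._stop_event = threading.Event()
        self._thread = threading.Thread(
            target=self._run, args=(self._stop_event,), daemon=True
        )
        self._thread.start()
        return True

    def stop(self, timeout: float = 1.0) -> bool:
        if self._thread is None or self._stop_event is None:
            return True
        self._stopped = True
        self._stop_event.set()
        self._thread.join(timeout=timeout)
        if self._thread.is_alive():
            # Keep the references so a later stop() can retry the join,
            # but make the degraded state visible to callers.
            self.degraded = True
            return False
        self._thread = None
        self._stop_event = None
        return True
```

A caller could log or raise when stop() returns False, so that the stale thread is at least surfaced rather than left as a silent short-circuit in start().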

Comment on lines +284 to +301

        if failed_groups and not self._is_stopping():
            with self._operation_lock:
                current_workers_snapshot = self.snapshot_workers()
                active_groups = {
                    worker.lifecycle_group_ranks or (worker.rank,)
                    for worker in current_workers_snapshot.values()
                    if worker.active
                }
                failed_groups = failed_groups & active_groups
                for group_ranks in failed_groups:
                    self._set_group_lifecycle_state(group_ranks, WorkerLifecycleState.INACTIVE)

        return len(workers_to_check)

    def _check_workers_health(self, workers_to_check: list[WorkerSnapshot]) -> list[bool]:
        """Run periodic check_health probes concurrently."""
        if self._check_failure_threshold <= 0:

Contributor


Claude: Warning: Periodic health checks now use only check_health (basic HTTP liveness) for all workers, including request-entrypoint workers. The previous implementation (in the initial patch 2 version) used check_health_generate for request entrypoints, which validates the inference engine is actually functional.

This means a worker whose HTTP server is alive but whose inference engine is broken (e.g., CUDA OOM, stuck model load) will pass periodic health checks indefinitely. The generate-readiness verification now only happens during recovery restart, not during ongoing monitoring.

If this is intentional (to avoid the cost of generate probes every check interval), consider documenting why — otherwise this is a regression in failure detection coverage.

Comment thread xtuner/v1/rl/rollout/worker.py Outdated
@claude

claude Bot commented Jun 12, 2026

Contributor

Claude: Re-review Summary (after patches 4-5)

Good progress since the first review. The stop() timeout and _broadcast_to_active_workers kwargs issues are fixed. The parallel recovery via ThreadPoolExecutor and the failure-threshold counter are solid additions.

Remaining Issues

Critical

  • [health_manager.py:212-222] run_once() still only recovers when all workers are down. A single failed lifecycle group stays INACTIVE without offload/shutdown, leaking GPU memory until the next training barrier. This was the critical issue from the first review and remains unresolved.

Warning

  • [health_manager.py:284-301] Periodic health checks downgraded from check_health_generate to plain check_health — workers with a live HTTP server but broken inference engine won't be caught until recovery.
  • [worker.py:900-902] check_health_generate now raises NotImplementedError instead of falling back to check_health(). New backends that miss this will crash during recovery rather than degrade gracefully.

Nit

  • [health_manager.py:95-98] stop() returns early on timeout without clearing stale _thread/_stop_event references, which could confuse subsequent start()/is_paused() calls.

Previous Issues — Fixed

  • stop() timeout: now uses HEALTH_MANAGER_STOP_JOIN_TIMEOUT = 30.0
  • _broadcast_to_active_workers kwargs dropped: now forwards **kwargs

ProduceBatchResult impact: not affected — no changes to batch accounting, status conversion, or reward/filter paths.

RoutedExperts impact: not affected — no changes to routed-experts plumbing, object ref ownership, or extra_fields handling.

Verdict

REQUEST_CHANGES — the run_once recovery/cleanup gap for partially-failed groups remains the blocking issue.

@YanhuiDua YanhuiDua force-pushed the fix-health-check-part3 branch from be11ead to 50f26e4 Compare June 12, 2026 01:22
Comment thread xtuner/v1/rl/rollout/trace_store.py Outdated
Comment thread xtuner/v1/rl/rollout/controller.py Outdated
Comment thread xtuner/v1/rl/rollout/controller.py Outdated
Comment thread xtuner/v1/rl/rollout/controller.py
Comment thread xtuner/v1/rl/rollout/controller.py Outdated
Comment thread xtuner/v1/rl/rollout/lmdeploy.py Outdated
Comment thread xtuner/v1/rl/rollout/trace_store.py Outdated
@hhaAndroid hhaAndroid requested review from CyCle1024 and jayhenry June 12, 2026 05:53
Comment thread xtuner/v1/rl/rollout/controller.py