Refine rollout worker health check and recovery lifecycle#1877
Refine rollout worker health check and recovery lifecycle#1877YanhuiDua wants to merge 7 commits into
Conversation
9cb357d to
a766fdb
Compare
|
@claude review |
|
Claude: ## Summary Well-structured refactor that introduces IssuesCritical
Warning
Nit
ProduceBatchResult impactNot affected — no changes to batch accounting, status conversion, or reward/filter paths. RoutedExperts impactNot affected — no changes to routed-experts plumbing, object ref ownership, or extra_fields handling. VerdictREQUEST_CHANGES — the |
|
@claude review |
| } | ||
| for future in as_completed(future_to_group): | ||
| group = future_to_group[future] | ||
| try: | ||
| group_recovery_results[group.ranks] = future.result() | ||
| except Exception: | ||
| logger.exception(f"Failed to restart rollout worker group ranks={group.ranks}.") | ||
| group_recovery_results[group.ranks] = False | ||
|
|
||
| if self._is_stopping(): | ||
| for group in sorted_failed_groups: |
There was a problem hiding this comment.
Claude: Critical: run_once() still only triggers recover_failed_workers() when all workers are down after the health check. If a single lifecycle group fails, it's marked INACTIVE by _check_active_workers_and_mark_failed_groups() but no recovery or cleanup is attempted — the failed group's GPU memory remains allocated indefinitely until the next training barrier (ensure_workers_healthy_before_training).
The old RolloutHealthChecker.run_once() at least offloaded and shut down failed workers immediately. The new code should either:
- Attempt recovery or at least offload/shutdown for failed groups during generation, or
- Document this as an intentional design decision with a comment explaining why partial-group recovery during generation is unsafe.
Without either, a long-running generation phase can leak GPU memory from dead workers.
ProduceBatchResult impact: Not directly affected, but reduced worker pool capacity can affect group_gen_* timing if generation load concentrates on fewer workers.
| assert self._stop_event is not None | ||
| self._stopped = True | ||
| self._stop_event.set() | ||
| if self._pause_event: |
There was a problem hiding this comment.
Claude: Nit: When join() times out, the method returns early without clearing self._thread, self._stop_event, or self._pause_event. This means a subsequent start() call will see self._thread as non-None and is_alive() still True, so it short-circuits without creating a new thread. The stale references could also confuse is_paused() or other state queries.
Consider at least logging that the manager is in a degraded state, or setting self._stopped = True so _is_stopping() guards subsequent operations.
|
|
||
| if failed_groups and not self._is_stopping(): | ||
| with self._operation_lock: | ||
| current_workers_snapshot = self.snapshot_workers() | ||
| active_groups = { | ||
| worker.lifecycle_group_ranks or (worker.rank,) | ||
| for worker in current_workers_snapshot.values() | ||
| if worker.active | ||
| } | ||
| failed_groups = failed_groups & active_groups | ||
| for group_ranks in failed_groups: | ||
| self._set_group_lifecycle_state(group_ranks, WorkerLifecycleState.INACTIVE) | ||
|
|
||
| return len(workers_to_check) | ||
|
|
||
| def _check_workers_health(self, workers_to_check: list[WorkerSnapshot]) -> list[bool]: | ||
| """Run periodic check_health probes concurrently.""" | ||
| if self._check_failure_threshold <= 0: |
There was a problem hiding this comment.
Claude: Warning: Periodic health checks now use only check_health (basic HTTP liveness) for all workers, including request-entrypoint workers. The previous implementation (in the initial patch 2 version) used check_health_generate for request entrypoints, which validates the inference engine is actually functional.
This means a worker whose HTTP server is alive but whose inference engine is broken (e.g., CUDA OOM, stuck model load) will pass periodic health checks indefinitely. The generate-readiness verification now only happens during recovery restart, not during ongoing monitoring.
If this is intentional (to avoid the cost of generate probes every check interval), consider documenting why — otherwise this is a regression in failure detection coverage.
|
Claude: ## Re-review Summary (after patches 4-5) Good progress since the first review. The Remaining IssuesCritical
Warning
Nit
Previous Issues — Fixed
ProduceBatchResult impact: not affected — no changes to batch accounting, status conversion, or reward/filter paths. RoutedExperts impact: not affected — no changes to routed-experts plumbing, object ref ownership, or extra_fields handling. VerdictREQUEST_CHANGES — the |
be11ead to
50f26e4
Compare
概述
这个 PR 主要做两部分重构:
EngineLaunchSpec/ServerProcessSpec。RolloutController、RolloutHealthManager、RolloutWorker的职责边界,并且支持ep group中一个worker失败后,将会重启所有的worker,另外, 当RolloutHealthManager检测到所有worker失败后,会立即重启所有的worker为什么 Health Manager 重构依赖 Server Launch Spec 重构
Rollout health recovery 不能只知道“哪些 worker 还活着”,还必须知道每个 rollout server 是怎么启动出来的,以及失败后应该按什么粒度恢复。
在 LMDeploy EP、SGLang 跨节点等场景下,一个 logical engine 和 server process 不是简单的一一对应关系:
因此,Health Manager 需要依赖 Server Launch Spec 提供的结构化信息:
engine_ranks:一个 logical engine 由哪些 worker rank 组成。server_processes:这个 engine 实际启动了哪些 rollout server process。server_worker_ranks:哪些 worker rank 拥有 server process,需要参与生命周期管理。accepts_rollout_requests:哪些 server 是 request entrypoint,可以接收 generate 请求。dist_init_addr/placement_group_bundle_idxs:worker recovery 时复用原始启动布局,避免重启后 server 地址或资源绑定发生变化。所以第一个 commit 先把 server launch layout 显式化;第二个 commit 才能基于这些结构化信息,把 health check、状态流转、group recovery 和 request routing 的职责从 controller 中拆出来。
主要改动
Server Launch Spec 重构
ServerProcessSpec和EngineLaunchSpec,显式描述每个推理 engine 应该启动哪些 server process。RolloutController._init_workers,先构造 launch spec,再根据 spec 启动 server。build_engine_launch_specs中。init时缓存 launch spec,后续 recovery 重启时复用原始 placement / dist-init 布局。Rollout Health Manager 重构
RolloutHealthManager,统一负责 worker 健康状态检查、状态流转和失败恢复。RolloutController只保留 worker 创建、请求路由、训练生命周期控制等职责。RolloutWorker继续负责后端相关的 server 启停、health probe 和 generate 行为。WorkerLifecycleState,明确区分:ACTIVEINACTIVERECOVERINGSessionRouter和 generate readiness check 只面向真正接收请求的 server。说明:该PR包含了PR1873的功能:#1873
测试
已验证通过moe模型EP/TP正常训练,并且当rank2的worker挂掉后,EP/TP 整组可正常恢复;当所有worker挂掉后,所有worker立即重启。遗留的一个问题:当所有worker挂掉后,会丢掉一些数据,但这个涉及到数据流的改动,本PR先不做该问题的修复