Skip to content

fix(rollout): survive and prevent mid-session aborted turns in agentic rollouts#42

Open
nightlessbaron wants to merge 4 commits into
integration/agentic-rl-2026-06-10from
e2e/abort
Open

fix(rollout): survive and prevent mid-session aborted turns in agentic rollouts#42
nightlessbaron wants to merge 4 commits into
integration/agentic-rl-2026-06-10from
e2e/abort

Conversation

@nightlessbaron

Copy link
Copy Markdown

Problem

Training jobs crashed with AssertionError: a.status must be COMPLETED, got Status.ABORTED in merge_samples during the end-of-rollout abort. Root cause: abort() posts /abort_request to sglang workers, which kills the in-flight turn of a multi-turn agent session (finish_reason="abort"), but the external agent doesn't see the finish_reason, keeps looping on the truncated output, and produces COMPLETED turns after the aborted one. merge_samples cannot represent a non-final non-COMPLETED turn — and the exception escaped abort()'s unguarded collection loop, killing the whole job after the step's data was already collected.

Fix (4 commits, layered)

  1. drop_samples_after_first_non_completed — establishes the merge invariant at the producer: turns generated after an aborted/length-limited turn are conditioned on incomplete output and are dropped (warning log + dropped_trailing_turns metadata). Also covers the generate_multi_samples path, which previously shipped post-abort turns to training silently.
  2. Hardened abort() collection — pending tasks are gathered with per-task error handling (mirroring the main loop), so one malformed session degrades to a logged drop instead of a dead run.
  3. Session-server drain/resume endpointsPOST /sessions/drain marks all open sessions closing and rejects new create_session (503) / chat_completions (404, non-retryable for OpenAI-style clients) until POST /sessions/resume.
  4. abort() drains before killing — session servers are drained before /abort_request, closing the race entirely; resume runs in a finally after all pending tasks return so a failed abort can't leave servers drained and deadlock later rollouts/evals.

Testing

TDD throughout — every new test observed failing for the right reason first. tests/fast/rollout/generate_utils, tests/fast/rollout/inference_rollout/test_inference_rollout_train.py, and tests/fast/router/test_sessions.py all pass (49 tests) in the miles CI container. Also realigns test_get_session_not_found with the empty-records semantics introduced by the agentic patch (bdf32f0); it was failing before this PR.

Companion PR: LLM360/harbor e2e/abort (agent treats the abort signals as terminal).

nightlessbaron and others added 4 commits June 11, 2026 20:43
… turn

When the engine aborts an in-flight request mid-session (e.g. the
end-of-rollout abort_request) or a turn hits the per-turn length limit,
the external agent does not see the finish_reason and keeps issuing
turns conditioned on the incomplete output. Those trailing turns are
invalid training data, and merge_samples cannot represent a non-final
non-COMPLETED turn (a.status must be COMPLETED assertion).

Establish the merge invariant at the producer: keep turns up to and
including the first non-COMPLETED one, drop the rest, log a warning and
stamp dropped_trailing_turns into sample metadata for observability.
Applies to both the merge path and generate_multi_samples, which
previously shipped post-abort turns to training silently.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rt()

The main generation loop guards task.result() with try/except, but the
abort() collection loop awaited tasks unguarded via as_completed_async,
so a single malformed session raised through RolloutManager.generate and
crashed the whole training job after the step's data was already
collected. Collect pendings with asyncio.gather(return_exceptions=True)
and log-and-skip failures, mirroring the main loop's error handling.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rollout abort

POST /sessions/drain marks every open session closing and rejects new
create_session (503) and chat_completions (404) calls until POST
/sessions/resume. This lets the trainer stop external agents from
issuing turns conditioned on aborted (mid-decode truncated) output:
without the gate, /abort_request only kills in-flight engine requests
and agents keep extending their session on top of the partial turn.

chat_completions returns 404 rather than 503 while draining because
agent-side OpenAI clients retry 5xx but treat 404 as terminal.

Also realigns test_get_session_not_found with the empty-records
semantics for unknown sessions introduced by the agentic patch
(bdf32f0); the test had been asserting the old 404 behavior.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…tend aborted sessions

abort() now drains all session_server_backends BEFORE posting
/abort_request to the engine workers, closing the race where an agent
starts a new turn after its in-flight turn was aborted; such turns are
invalid training data (the merge invariant fix backstops them) and
waste GPU time while the weight update waits. Resume happens in a
finally block after all pending generate tasks have returned, so a
failed abort cannot leave the servers drained and deadlock subsequent
rollouts/evals. Drain/resume posts use low retries and tolerate
unreachable servers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@nightlessbaron nightlessbaron requested a review from a team as a code owner June 11, 2026 23:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant