fix(rollout): survive and prevent mid-session aborted turns in agentic rollouts#42
Open
nightlessbaron wants to merge 4 commits into
Open
fix(rollout): survive and prevent mid-session aborted turns in agentic rollouts#42nightlessbaron wants to merge 4 commits into
nightlessbaron wants to merge 4 commits into
Conversation
… turn When the engine aborts an in-flight request mid-session (e.g. the end-of-rollout abort_request) or a turn hits the per-turn length limit, the external agent does not see the finish_reason and keeps issuing turns conditioned on the incomplete output. Those trailing turns are invalid training data, and merge_samples cannot represent a non-final non-COMPLETED turn (a.status must be COMPLETED assertion). Establish the merge invariant at the producer: keep turns up to and including the first non-COMPLETED one, drop the rest, log a warning and stamp dropped_trailing_turns into sample metadata for observability. Applies to both the merge path and generate_multi_samples, which previously shipped post-abort turns to training silently. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rt() The main generation loop guards task.result() with try/except, but the abort() collection loop awaited tasks unguarded via as_completed_async, so a single malformed session raised through RolloutManager.generate and crashed the whole training job after the step's data was already collected. Collect pendings with asyncio.gather(return_exceptions=True) and log-and-skip failures, mirroring the main loop's error handling. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rollout abort POST /sessions/drain marks every open session closing and rejects new create_session (503) and chat_completions (404) calls until POST /sessions/resume. This lets the trainer stop external agents from issuing turns conditioned on aborted (mid-decode truncated) output: without the gate, /abort_request only kills in-flight engine requests and agents keep extending their session on top of the partial turn. chat_completions returns 404 rather than 503 while draining because agent-side OpenAI clients retry 5xx but treat 404 as terminal. Also realigns test_get_session_not_found with the empty-records semantics for unknown sessions introduced by the agentic patch (bdf32f0); the test had been asserting the old 404 behavior. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…tend aborted sessions abort() now drains all session_server_backends BEFORE posting /abort_request to the engine workers, closing the race where an agent starts a new turn after its in-flight turn was aborted; such turns are invalid training data (the merge invariant fix backstops them) and waste GPU time while the weight update waits. Resume happens in a finally block after all pending generate tasks have returned, so a failed abort cannot leave the servers drained and deadlock subsequent rollouts/evals. Drain/resume posts use low retries and tolerate unreachable servers. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Training jobs crashed with
AssertionError: a.status must be COMPLETED, got Status.ABORTEDinmerge_samplesduring the end-of-rollout abort. Root cause:abort()posts/abort_requestto sglang workers, which kills the in-flight turn of a multi-turn agent session (finish_reason="abort"), but the external agent doesn't see the finish_reason, keeps looping on the truncated output, and produces COMPLETED turns after the aborted one.merge_samplescannot represent a non-final non-COMPLETED turn — and the exception escapedabort()'s unguarded collection loop, killing the whole job after the step's data was already collected.Fix (4 commits, layered)
drop_samples_after_first_non_completed— establishes the merge invariant at the producer: turns generated after an aborted/length-limited turn are conditioned on incomplete output and are dropped (warning log +dropped_trailing_turnsmetadata). Also covers thegenerate_multi_samplespath, which previously shipped post-abort turns to training silently.abort()collection — pending tasks are gathered with per-task error handling (mirroring the main loop), so one malformed session degrades to a logged drop instead of a dead run.POST /sessions/drainmarks all open sessions closing and rejects newcreate_session(503) /chat_completions(404, non-retryable for OpenAI-style clients) untilPOST /sessions/resume.abort()drains before killing — session servers are drained before/abort_request, closing the race entirely; resume runs in afinallyafter all pending tasks return so a failed abort can't leave servers drained and deadlock later rollouts/evals.Testing
TDD throughout — every new test observed failing for the right reason first.
tests/fast/rollout/generate_utils,tests/fast/rollout/inference_rollout/test_inference_rollout_train.py, andtests/fast/router/test_sessions.pyall pass (49 tests) in the miles CI container. Also realignstest_get_session_not_foundwith the empty-records semantics introduced by the agentic patch (bdf32f0); it was failing before this PR.Companion PR: LLM360/harbor e2e/abort (agent treats the abort signals as terminal).