Skip to content

Deflake the session-level timeout test with trio's virtual clock#2788

Merged
maxisbey merged 1 commit into
mainfrom
maxisbey/deflake-session-timeout-test
Jun 5, 2026
Merged

Deflake the session-level timeout test with trio's virtual clock#2788
maxisbey merged 1 commit into
mainfrom
maxisbey/deflake-session-timeout-test

Conversation

@maxisbey
Copy link
Copy Markdown
Contributor

@maxisbey maxisbey commented Jun 5, 2026

test_session_level_timeout_applies_to_every_request flaked three times in its first week. This switches it to trio's virtual clock so it is deterministic by construction, with no real waiting.

Motivation and Context

The test sets Client(server, read_timeout_seconds=0.05). A session-level read timeout applies to every request that doesn't set its own — including the InitializeRequest sent inside Client.__aenter__ — so the in-process handshake had to beat 50ms of real clock before the test body even ran. Under CI load it sometimes didn't, and the test died in __aenter__ with Timed out while waiting for response to InitializeRequest:

Looping the in-memory handshake on a CPU-saturated windows-latest runner (3.12, locked) puts its tail at p99 ≈ 70ms and max ≈ 190ms — 4.3% of handshakes exceeded 50ms; even idle, the max was 26ms. So the existing value was a standing bet against scheduler stalls, and any wider real-clock margin pays for itself in wall-clock time on every run.

The file's other tests are deterministic because the timed request can never be answered (the handler blocks forever), so an effectively-zero timeout always wins. The session-default test can't use that trick: the same knob governs the handshake, which must succeed first.

Running this one test on trio's MockClock(autojump_threshold=0) removes the bet instead of widening it. Virtual time only advances when every task is blocked: during the handshake some task is always runnable, so the handshake cannot time out no matter how slow the runner, and once the tool call blocks on the never-answered request the run goes idle and the clock jumps straight to the deadline. The timeout value and snapshot stay as they were; the file now runs in milliseconds. trio is already a dev dependency, and the per-test anyio_backend parametrization keeps the rest of the suite on asyncio.

How Has This Been Tested?

  • 500 iterations of the test shape under the MockClock via the pytest plugin path: identical outcome every time, ~1.1s of real time for 25s of virtual timeouts.
  • 30 consecutive runs of the file, all green; full tests/interaction/ suite passes (529 tests, ~5s).
  • Sanity-checked the failure mode is unchanged: a tiny real-clock timeout racing the handshake is not a viable alternative — on Windows 3.12, time.monotonic has 15.625ms granularity, so fail_after(1e-6) actually lets the handshake win most races (and 3.13+ switched to QPC, flipping the behavior). Virtual time sidesteps clock granularity entirely.

Breaking Changes

None — test-only.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

All three flakes hit the 3.12/locked matrix cell. For the two Windows hits there's a plausible amplifier: on 3.12, asyncio's clock is GetTickCount64-based, so a 50ms deadline computed from a floor-quantized clock can fire up to ~15.6ms early in real time.

The CI tracebacks above are also a live reproduction of #2114: the MCPError from the failed initialize surfaces from Client.__aenter__ double-wrapped in ExceptionGroups.

AI Disclaimer

The session-level read timeout also governs the initialize handshake sent
by Client.__aenter__, so the test's 50ms value doubled as a real-time
deadline for the in-process handshake. Under CI load the handshake tail
exceeds 50ms (observed max ~190ms on a saturated windows runner), which
failed the test before its body ran -- three times in the week since it
landed, twice on windows and once on ubuntu, always on the 3.12/locked
matrix cell.

Instead of widening the margin and paying for it in real wait time, run
this one test on trio's MockClock with autojump: virtual time advances
only when every task is blocked, so the handshake can never time out no
matter how slow the runner, and the blocked tool call hits its deadline
the moment the run goes idle. The test keeps its original timeout value
and snapshot, is immune to scheduler stalls by construction, and the
file now runs in milliseconds.
@maxisbey maxisbey marked this pull request as ready for review June 5, 2026 20:27
@maxisbey maxisbey merged commit ac96f88 into main Jun 5, 2026
33 checks passed
@maxisbey maxisbey deleted the maxisbey/deflake-session-timeout-test branch June 5, 2026 20:33
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — test-only deflake of a single timeout test using trio's virtual clock; the only finding is a one-word README wording nit.

Extended reasoning...

Overview

This PR modifies a single test file, tests/interaction/lowlevel/test_timeouts.py, switching test_session_level_timeout_applies_to_every_request from a real-clock 50ms timeout to trio's MockClock(autojump_threshold=0) via a per-test anyio_backend parametrization. The timeout value, assertion, and snapshot are unchanged; the module docstring and inline comments are updated to explain the approach. No production code is touched.

Security risks

None. This is a test-only change with no impact on library code, transports, auth, or any runtime behavior shipped to users.

Level of scrutiny

Low. The change is small, self-contained, and follows an established anyio/trio testing pattern (per-test backend parametrization with a mock clock). The PR description documents the flake history, the reasoning for why the effectively-zero-timeout pattern used by the file's other tests cannot apply here, and stress-test results (500 iterations under MockClock, 30 consecutive file runs). trio is already a dev dependency, so no new dependency is introduced.

Other factors

The bug-hunting pass surfaced only a documentation nit: tests/interaction/README.md still refers to these as "real-clock timeout tests", which is now slightly imprecise for the session-level test. That is a one-word follow-up and does not affect correctness, so it does not block approval. The remaining two tests in the file are unchanged and still run on asyncio.

Comment on lines +90 to 103
# A session-level timeout cannot use the effectively-zero pattern above: it also governs the
# initialize handshake, which must complete before the blocked tool call can wait the timeout
# out in full. Any real-clock margin is a bet against CI scheduler stalls (a 50ms value lost
# that bet in CI; the in-process handshake tail reaches ~190ms on a loaded windows runner), so
# this test runs on trio's virtual clock instead. With autojump, time advances only when every
# task is blocked: the handshake always has a runnable task and therefore cannot time out no
# matter how slow the runner, and once the tool call blocks on the never-answered request the
# run goes idle and the clock jumps straight to the deadline — deterministic, with no real wait.
@requirement("protocol:timeout:session-default")
@pytest.mark.parametrize(
"anyio_backend",
[pytest.param(("trio", {"clock": MockClock(autojump_threshold=0)}), id="trio-mockclock")],
)
async def test_session_level_timeout_applies_to_every_request() -> None:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Nit: tests/interaction/README.md (line 64) still describes these as "the real-clock timeout tests" when explaining why they skip the transport-parametrized connect fixture, but after this change the session-level test runs on trio's MockClock virtual clock rather than the real clock. Consider rewording that phrase (e.g. just "the timeout tests") so the README stays consistent with the updated docstring and comments here.

Extended reasoning...

What's stale: tests/interaction/README.md lines 62–66 list the test groups that don't use the transport-parametrized connect fixture, and one of them is described as "the real-clock timeout tests (the timeout machinery is transport-independent and must not race transport latency)". That phrasing was accurate before this PR, when all three tests in test_timeouts.py ran against the real clock with effectively-zero or 50ms timeouts.

Why this PR makes it inaccurate: This change moves test_session_level_timeout_applies_to_every_request onto trio's MockClock(autojump_threshold=0) via the per-test anyio_backend parametrization. Concretely: the test no longer waits any real time at all — the handshake runs on virtual time that only advances when every task is blocked, and the 0.05s deadline is reached by an autojump, not by the real clock. So of the three timeout tests in the file, only the two per-request tests (test_request_timeout_fails_the_pending_call, test_session_serves_requests_after_timeout) still run on the real clock; the README's collective label "real-clock timeout tests" no longer covers the file.

Step-by-step: (1) A reader sees "real-clock timeout tests" in the README and opens tests/interaction/lowlevel/test_timeouts.py. (2) The module docstring now says the session-level test "runs on trio's virtual clock instead", and the comment block above test_session_level_timeout_applies_to_every_request (lines 90–97) explains the MockClock approach in detail. (3) The README and the file now describe the same tests with contradictory clock semantics — exactly the kind of drift the PR was otherwise careful to avoid, since it updated both the module docstring and the inline comments.

Why nothing else catches it: the README is prose; nothing enforces consistency between it and the test file, so the drift will persist until someone notices it manually.

Addressing the counter-argument: one reviewer noted that the README sentence's real purpose — explaining why the timeout tests skip the connect fixture — remains fully valid, and that two of the three tests still use the real clock, so this is a single slightly-imprecise adjective. That's all true, which is why this is filed as a nit rather than a blocking issue: the substantive rationale (transport-independence, not racing transport latency) is unchanged and arguably strengthened. But the PR's own description checks "I have added or updated documentation as needed" as incomplete, and the change deliberately updated every other piece of documentation describing these tests; keeping the suite README in step is a one-word fix.

How to fix: in tests/interaction/README.md line 64, drop or adjust the adjective — e.g. "the timeout tests (the timeout machinery is transport-independent and must not race transport latency; the session-level one runs on trio's virtual clock)" or simply "the timeout tests (...)".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants