Skip to content

[Cosmos] Shield cross-region failover for metadata reads from caller cancellation#47513

Open
NaluTripician wants to merge 3 commits into
mainfrom
nalutripician/cosmos-detached-metadata-executor
Open

[Cosmos] Shield cross-region failover for metadata reads from caller cancellation#47513
NaluTripician wants to merge 3 commits into
mainfrom
nalutripician/cosmos-detached-metadata-executor

Conversation

@NaluTripician

Copy link
Copy Markdown
Contributor

Summary

Ports Azure/azure-cosmos-dotnet-v3#5844 (detached metadata executor) to the Python SDK, addressing azure-sdk-for-python#46471 (same defect class as dotnet #5805).

A caller request-level timeout / cancellation that fires mid-flight during a cold control-plane metadata (container/collection) read could preempt the cross-region failover retry before the retry policy was consulted — surfacing a cancellation (asyncio.CancelledError async / a BaseException sync) to the customer instead of failing over to the next preferred region.

Root cause

_retry_utility.Execute / aio/_retry_utility_async.ExecuteAsync catch only CosmosHttpResponseError, ServiceRequestError, and ServiceResponseError. A caller cancellation is a BaseException that bypasses all of these and exits the while True loop before the failover policy (_TimeoutFailoverRetryPolicy.ShouldRetry) runs.

Proof the bug exists (done before the fix)

A repro driving the real retry loops with a failover policy that would retry:

  • Retryable 503 mid-flight → 2 attempts → success (proves the policy will fail over cross-region).
  • Caller cancellation mid-flight1 attempt only, cancellation propagates, ShouldRetry never consulted — confirmed for both sync and async.

Fix

Scoped strictly to cold collection metadata reads (resource_type == Collection, read-only) — matching .NET, which wires the detached executor only into the collection-cache path. Data-plane cancellation semantics are unchanged.

  • New azure/cosmos/_metadata_failover_grace.py:
    • is_metadata_failover_candidate(args) — detect a collection read with a request object.
    • get_grace_seconds() — default 10s, env override AZURE_COSMOS_METADATA_FAILOVER_GRACE_SECONDS, clamped to [0, 86400]; 0 disables the grace (restores prior behavior).
    • run_grace_attempt_sync() — runs one cross-region attempt on a detached daemon thread with a bounded join.
  • _retry_utility.py (sync): on caller cancellation, for metadata candidates only, consult the failover policy and grant one bounded, cancellation-shielded cross-region attempt; return on success, else surface the original cancellation. KeyboardInterrupt / SystemExit / GeneratorExit / CosmosClientTimeoutError are never intercepted.
  • aio/_retry_utility_async.py (async): mirror via await asyncio.wait_for(asyncio.shield(ExecuteFunctionAsync(...)), grace_seconds) so the attempt detaches from the caller's cancellation.

Tests

15 new unit tests (sync + async) — all pass:

  • metadata read + caller-cancel → one cross-region grace attempt → success returned
  • docs read + caller-cancel → no grace, original cancellation propagates
  • failover policy declines (endpoint discovery disabled) → original propagates
  • grace attempt also fails → original cancellation surfaced
  • grace disabled via env → original propagates
  • helper: get_grace_seconds parsing/clamp/default, run_grace_attempt_sync success/exception/timeout

(Tests run standalone via unittest; the cosmos conftest.py requires the emulator for the full pytest suite.)

Notes

Fixes #46471


🤖 Generated with GitHub Copilot CLI

…ancellation

A caller request-level timeout/cancellation firing mid-flight during a cold control-plane metadata (collection) read preempted the cross-region failover retry before the retry policy was consulted, surfacing a cancellation instead of failing over to the next preferred region (azure-sdk-for-python#46471 / azure-cosmos-dotnet-v3#5805, port of #5844).

Sync Execute and async ExecuteAsync now, for collection read-only requests only, grant one bounded, cancellation-shielded cross-region attempt when the failover policy would retry. Configurable via AZURE_COSMOS_METADATA_FAILOVER_GRACE_SECONDS (default 10s; 0 disables). Non-metadata operations keep existing cancellation semantics.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Self-review (PR Deep Reviewer, 2 cycles) on #47513:
- Async: run the cross-region grace attempt as an explicit task and observe the detached task on grace timeout/failure, preventing 'Task exception was never retrieved'.
- Sync + async: log the grace-attempt failure at debug before surfacing the original cancellation (no misleading __cause__ chain).
- is_metadata_failover_candidate now guards with isinstance(RequestObject).
- Documented the abandoned-thread caveat in run_grace_attempt_sync (single leaf send; no deepcopy).
- Added async tests: grace-timeout clean __cause__, and detached-failing-attempt-is-observed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician NaluTripician marked this pull request as ready for review June 15, 2026 22:27
@NaluTripician NaluTripician requested a review from a team as a code owner June 15, 2026 22:27
Copilot AI review requested due to automatic review settings June 15, 2026 22:27

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a “grace window” mechanism for cold container/collection metadata reads so that cross-region failover can still occur even if the caller cancels mid-flight, preventing the retry policy from being consulted (fixing #46471).

Changes:

  • Introduces _metadata_failover_grace.py to detect metadata-read candidates and run one bounded, cancellation-shielded cross-region attempt (sync thread + async shield/wait_for).
  • Updates sync and async retry loops to, on cancellation during metadata reads, consult the timeout failover policy and perform one grace attempt.
  • Adds unit tests (sync + async) and a changelog entry documenting the new behavior and env var override.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
sdk/cosmos/azure-cosmos/azure/cosmos/_metadata_failover_grace.py New helper module for grace-window selection and detached execution for metadata reads.
sdk/cosmos/azure-cosmos/azure/cosmos/_retry_utility.py Sync retry loop updated to perform one bounded grace attempt on cancellation for metadata reads.
sdk/cosmos/azure-cosmos/azure/cosmos/aio/_retry_utility_async.py Async retry loop updated to perform one bounded shielded grace attempt on cancellation for metadata reads and observe detached task outcomes.
sdk/cosmos/azure-cosmos/azure/cosmos/_constants.py Adds env var + defaults for metadata failover grace window configuration.
sdk/cosmos/azure-cosmos/tests/test_metadata_failover_grace_unit.py New sync unit tests covering candidate detection, grace window parsing, and retry-loop grace behavior.
sdk/cosmos/azure-cosmos/tests/test_metadata_failover_grace_unit_async.py New async unit tests covering grace behavior, timeouts, and “no unobserved task exception” behavior.
sdk/cosmos/azure-cosmos/CHANGELOG.md Documents the bug fix and the new AZURE_COSMOS_METADATA_FAILOVER_GRACE_SECONDS setting.

Comment on lines +53 to +56
if value < 0:
value = 0.0
if value > _Constants.METADATA_FAILOVER_GRACE_SECONDS_MAX:
value = _Constants.METADATA_FAILOVER_GRACE_SECONDS_MAX
Comment on lines +360 to +376
try:
grace_result = await asyncio.wait_for(
asyncio.shield(grace_task), timeout=grace_seconds)
except BaseException as grace_exc: # pylint: disable=broad-except
# Grace window expired or the cross-region attempt failed. Make sure the
# detached attempt's eventual result/exception is observed, then surface
# the original caller cancellation. The detached attempt (if still
# running) continues in the background to warm caches for subsequent
# callers.
if grace_task.done():
_observe_detached_grace_task(grace_task)
else:
grace_task.add_done_callback(_observe_detached_grace_task)
logger.debug(
"Metadata cross-region failover grace attempt did not complete "
"before the caller cancellation was surfaced: %r", grace_exc)
raise e from None
Comment on lines +302 to +315
except BaseException as e: # pylint: disable=broad-except
# A caller request-level timeout / cancellation can fire mid-flight during a
# cold control-plane metadata (collection) read and preempt the retry loop
# before the cross-region failover policy is consulted, surfacing the
# cancellation instead of a successful failover to the next preferred region
# (see azure-sdk-for-python#46471 / azure-cosmos-dotnet-v3#5805). For metadata
# reads only, give the failover policy one bounded, cancellation-shielded
# attempt against the next region. Process-control signals and the SDK's own
# timeout error are never intercepted.
if isinstance(e, (KeyboardInterrupt, SystemExit, GeneratorExit,
exceptions.CosmosClientTimeoutError)):
raise
if not _metadata_failover_grace.is_metadata_failover_candidate(args):
raise
…er grace

Resolves CI Analyze (pylint R1730 consider-using-min-builtin) and cspell (unknown word 'ppaf') failures on PR #47513.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[Cosmos] Metadata retry: cross-region failover preempted when caller cancels during control-plane timeout escalation

2 participants