[Cosmos] Shield cross-region failover for metadata reads from caller cancellation#47513
Open
NaluTripician wants to merge 3 commits into
Open
[Cosmos] Shield cross-region failover for metadata reads from caller cancellation#47513NaluTripician wants to merge 3 commits into
NaluTripician wants to merge 3 commits into
Conversation
…ancellation A caller request-level timeout/cancellation firing mid-flight during a cold control-plane metadata (collection) read preempted the cross-region failover retry before the retry policy was consulted, surfacing a cancellation instead of failing over to the next preferred region (azure-sdk-for-python#46471 / azure-cosmos-dotnet-v3#5805, port of #5844). Sync Execute and async ExecuteAsync now, for collection read-only requests only, grant one bounded, cancellation-shielded cross-region attempt when the failover policy would retry. Configurable via AZURE_COSMOS_METADATA_FAILOVER_GRACE_SECONDS (default 10s; 0 disables). Non-metadata operations keep existing cancellation semantics. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Self-review (PR Deep Reviewer, 2 cycles) on #47513: - Async: run the cross-region grace attempt as an explicit task and observe the detached task on grace timeout/failure, preventing 'Task exception was never retrieved'. - Sync + async: log the grace-attempt failure at debug before surfacing the original cancellation (no misleading __cause__ chain). - is_metadata_failover_candidate now guards with isinstance(RequestObject). - Documented the abandoned-thread caveat in run_grace_attempt_sync (single leaf send; no deepcopy). - Added async tests: grace-timeout clean __cause__, and detached-failing-attempt-is-observed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR adds a “grace window” mechanism for cold container/collection metadata reads so that cross-region failover can still occur even if the caller cancels mid-flight, preventing the retry policy from being consulted (fixing #46471).
Changes:
- Introduces
_metadata_failover_grace.pyto detect metadata-read candidates and run one bounded, cancellation-shielded cross-region attempt (sync thread + async shield/wait_for). - Updates sync and async retry loops to, on cancellation during metadata reads, consult the timeout failover policy and perform one grace attempt.
- Adds unit tests (sync + async) and a changelog entry documenting the new behavior and env var override.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/azure/cosmos/_metadata_failover_grace.py | New helper module for grace-window selection and detached execution for metadata reads. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_retry_utility.py | Sync retry loop updated to perform one bounded grace attempt on cancellation for metadata reads. |
| sdk/cosmos/azure-cosmos/azure/cosmos/aio/_retry_utility_async.py | Async retry loop updated to perform one bounded shielded grace attempt on cancellation for metadata reads and observe detached task outcomes. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_constants.py | Adds env var + defaults for metadata failover grace window configuration. |
| sdk/cosmos/azure-cosmos/tests/test_metadata_failover_grace_unit.py | New sync unit tests covering candidate detection, grace window parsing, and retry-loop grace behavior. |
| sdk/cosmos/azure-cosmos/tests/test_metadata_failover_grace_unit_async.py | New async unit tests covering grace behavior, timeouts, and “no unobserved task exception” behavior. |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Documents the bug fix and the new AZURE_COSMOS_METADATA_FAILOVER_GRACE_SECONDS setting. |
Comment on lines
+53
to
+56
| if value < 0: | ||
| value = 0.0 | ||
| if value > _Constants.METADATA_FAILOVER_GRACE_SECONDS_MAX: | ||
| value = _Constants.METADATA_FAILOVER_GRACE_SECONDS_MAX |
Comment on lines
+360
to
+376
| try: | ||
| grace_result = await asyncio.wait_for( | ||
| asyncio.shield(grace_task), timeout=grace_seconds) | ||
| except BaseException as grace_exc: # pylint: disable=broad-except | ||
| # Grace window expired or the cross-region attempt failed. Make sure the | ||
| # detached attempt's eventual result/exception is observed, then surface | ||
| # the original caller cancellation. The detached attempt (if still | ||
| # running) continues in the background to warm caches for subsequent | ||
| # callers. | ||
| if grace_task.done(): | ||
| _observe_detached_grace_task(grace_task) | ||
| else: | ||
| grace_task.add_done_callback(_observe_detached_grace_task) | ||
| logger.debug( | ||
| "Metadata cross-region failover grace attempt did not complete " | ||
| "before the caller cancellation was surfaced: %r", grace_exc) | ||
| raise e from None |
Comment on lines
+302
to
+315
| except BaseException as e: # pylint: disable=broad-except | ||
| # A caller request-level timeout / cancellation can fire mid-flight during a | ||
| # cold control-plane metadata (collection) read and preempt the retry loop | ||
| # before the cross-region failover policy is consulted, surfacing the | ||
| # cancellation instead of a successful failover to the next preferred region | ||
| # (see azure-sdk-for-python#46471 / azure-cosmos-dotnet-v3#5805). For metadata | ||
| # reads only, give the failover policy one bounded, cancellation-shielded | ||
| # attempt against the next region. Process-control signals and the SDK's own | ||
| # timeout error are never intercepted. | ||
| if isinstance(e, (KeyboardInterrupt, SystemExit, GeneratorExit, | ||
| exceptions.CosmosClientTimeoutError)): | ||
| raise | ||
| if not _metadata_failover_grace.is_metadata_failover_candidate(args): | ||
| raise |
…er grace Resolves CI Analyze (pylint R1730 consider-using-min-builtin) and cspell (unknown word 'ppaf') failures on PR #47513. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Ports Azure/azure-cosmos-dotnet-v3#5844 (detached metadata executor) to the Python SDK, addressing azure-sdk-for-python#46471 (same defect class as dotnet #5805).
A caller request-level timeout / cancellation that fires mid-flight during a cold control-plane metadata (container/collection) read could preempt the cross-region failover retry before the retry policy was consulted — surfacing a cancellation (
asyncio.CancelledErrorasync / aBaseExceptionsync) to the customer instead of failing over to the next preferred region.Root cause
_retry_utility.Execute/aio/_retry_utility_async.ExecuteAsynccatch onlyCosmosHttpResponseError,ServiceRequestError, andServiceResponseError. A caller cancellation is aBaseExceptionthat bypasses all of these and exits thewhile Trueloop before the failover policy (_TimeoutFailoverRetryPolicy.ShouldRetry) runs.Proof the bug exists (done before the fix)
A repro driving the real retry loops with a failover policy that would retry:
ShouldRetrynever consulted — confirmed for both sync and async.Fix
Scoped strictly to cold collection metadata reads (
resource_type == Collection, read-only) — matching .NET, which wires the detached executor only into the collection-cache path. Data-plane cancellation semantics are unchanged.azure/cosmos/_metadata_failover_grace.py:is_metadata_failover_candidate(args)— detect a collection read with a request object.get_grace_seconds()— default10s, env overrideAZURE_COSMOS_METADATA_FAILOVER_GRACE_SECONDS, clamped to[0, 86400];0disables the grace (restores prior behavior).run_grace_attempt_sync()— runs one cross-region attempt on a detached daemon thread with a bounded join._retry_utility.py(sync): on caller cancellation, for metadata candidates only, consult the failover policy and grant one bounded, cancellation-shielded cross-region attempt; return on success, else surface the original cancellation.KeyboardInterrupt/SystemExit/GeneratorExit/CosmosClientTimeoutErrorare never intercepted.aio/_retry_utility_async.py(async): mirror viaawait asyncio.wait_for(asyncio.shield(ExecuteFunctionAsync(...)), grace_seconds)so the attempt detaches from the caller's cancellation.Tests
15 new unit tests (sync + async) — all pass:
get_grace_secondsparsing/clamp/default,run_grace_attempt_syncsuccess/exception/timeout(Tests run standalone via
unittest; the cosmosconftest.pyrequires the emulator for the fullpytestsuite.)Notes
Fixes #46471
🤖 Generated with GitHub Copilot CLI