Skip to content

[Cosmos] Add cold-start metadata cache cross-region hedging#47509

Draft
NaluTripician wants to merge 2 commits into
mainfrom
nalutripician/cosmos-metadata-hedging
Draft

[Cosmos] Add cold-start metadata cache cross-region hedging#47509
NaluTripician wants to merge 2 commits into
mainfrom
nalutripician/cosmos-metadata-hedging

Conversation

@NaluTripician

@NaluTripician NaluTripician commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Description

Python port of .NET PR Azure/azure-cosmos-dotnet-v3#5923 — cold-start metadata cache cross-region hedging.

When enabled, the SDK hedges the first-time population (and refresh) of the container (Collection) and partition-key-range metadata caches across regions: the primary request is dispatched immediately and, if it has not produced an acceptable response within a fixed SDK-derived threshold (1.5s), a single hedge request is dispatched to a second region. The first acceptable winner is returned. This prevents a slow/unavailable region from stalling client init and the first request.

Customer-facing API

New tri-state client keyword enable_metadata_hedging_for_cold_start (sync + async CosmosClient):

  • None (default) — follows the account's PPAF (Per-Partition Automatic Failover) state.
  • True — hedge even when PPAF is disabled.
  • False — disable regardless of PPAF.

The threshold and concurrency budget are SDK-derived defaults and are not customer-configurable.

Scope (important)

Hedging is strictly limited to internal cold-start metadata-cache population reads. It is gated on an explicit internal metadataCachePopulation request flag that is set by exactly three callsites:

  • container-properties cache refresh (_refresh_container_properties_cache),
  • cold partition-key-definition read (_get_partition_key_definition),
  • routing-map (partition-key-range) fetch (_fetch_routing_map).

Public container.read(), hybrid-search partition-key-range reads, and GetDatabaseAccount are not hedged.

Implementation

  • New azure/cosmos/_metadata_hedging.py and azure/cosmos/aio/_metadata_hedging.py — bounded primary + single-hedge handlers reusing the existing AvailabilityStrategyHandlerMixin for endpoint resolution and excluded-region routing. Includes:
    • per-client concurrency budget (semaphore); falls back to primary-only when exhausted,
    • dedicated thread pool sized to max(cpu_count, 2 × budget) so each in-flight hedge always has its two threads (sync),
    • first-acceptable-winner with the regional-failure classification (is_regional_failure) + hedge 401/403 auth-reject overlay (a hedge auth failure can never win),
    • single-region / unsupported-resource / not-cache-population short-circuits.
  • _availability_strategy_config.pyMetadataCrossRegionHedgingStrategy (fixed threshold), resolve_metadata_hedging_opt_in, constants.
  • _request_object.pymetadataCachePopulation options flag + is_metadata_cache_population.
  • WiringSynchronizedRequest / AsynchronousRequest gain _is_metadata_hedging_applicable (requires the cache-population flag) and route eligible reads through the metadata handler; the option is threaded through both clients and connections.
  • Teststests/test_metadata_hedging.py + tests/test_metadata_hedging_async.py (25 unit tests, no live service).

Self-review (skeptic lens)

Three independent reviewers were run on the diff. Key outcomes folded into the latest commit:

  • Narrowed scope from resource-type-only to an explicit cache-population flag (above) — the main correctness/over-hedging fix.
  • Right-sized the hedge thread pool to avoid starvation on low-core hosts.
  • Removed a dead record_failure no-op (metadata cache health is account-global, not per-partition).
  • Verified non-issues: losing hedge requests abort (should_cancel_request is checked every retry), PK-range continuation pages are safe cross-region (pkranges metadata is account-global), and the async budget check is not a race.

Accepted design choices

  • Default-on for PPAF accounts (None → follow PPAF), now narrowly scoped to cold-start cache reads — consistent with the existing data-plane PPAF hedging default.
  • Fixed 1.5s threshold, not customer-configurable.
  • Routing-map refreshes (not only first-ever population) are also hedged; bounded by the budget and safe because pkranges metadata is account-global.

Validation

  • 25/25 new unit tests pass (sync + async).
  • pylint clean on the new modules.

)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Self-review (Seon thorough) fixes:

- Gate hedging on an explicit metadataCachePopulation request flag set only by the container-properties refresh, cold partition-key read, and routing-map fetch callsites, so public container reads and hybrid-search PK reads are no longer hedged.

- Size the hedge thread pool to max(cpu_count, 2*budget) so each in-flight hedge always has its primary+hedge threads.

- Remove the dead record_failure no-op (metadata cache health is account-global, not per-partition).

- Clarify the threshold rationale and the async budget non-blocking check.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

2 participants