
[SPARK-56743][SPARK-56773][SQL][CORE][TESTS] Exercise writer-stage retries in DSv2 DML metric tests and fix injection-state cleanup under AQE #56597

Open
juliuszsompolski wants to merge 1 commit into apache:master from juliuszsompolski:SPARK-56743-extratests


What changes were proposed in this pull request?

Follow-up to SPARK-56743 (the
INJECT_SHUFFLE_FETCH_FAILURES injection infra and SQLLastAttemptMetric, or SLAM, for DSv2 DML
metrics). It strengthens the DSv2 MERGE/UPDATE retry tests so they actually exercise a
writer-stage retry, fixes a DAGScheduler bug that prevented that under AQE, and closes the
test-coverage gap that let the bug through.

Three parts:

1. DAGScheduler: don't evict the test-injection state per-stage.
cleanupStateForJobAndIndependentStages removed the per-shuffle injection bookkeeping (the three
injectShuffleFetchFailures* maps) whenever a stage was removed. Under AQE each Exchange is
materialized as its own map-stage job, so that cleanup ran between the producer job and the
consumer job and dropped the pending deferred corruption before the consumer was ever submitted -
no FetchFailed, no retry. The maps are keyed by the globally-unique (never-reused) shuffleId,
only allocated under Utils.isTesting, and bounded by the number of shuffles in a test
SparkContext, so they are simply retained for its lifetime (restoring the original body of that
cleanup loop).

2. MetricsFailureInjectionSuite: add AQE coverage. The existing
INJECT_SHUFFLE_FETCH_FAILURES tests all run with AQE disabled (the suite mixes in
DisableAdaptiveExecutionSuite), so none exercised AQE's per-shuffle-materialization path - the
exact path where the eviction above suppressed the retry. New test
Three stage metrics block failure injection with AQE runs the same 3-stage query with
ADAPTIVE_EXECUTION_ENABLED=true and asserts the non-leaf stage's raw counter overcounts (a retry
actually fired) while SLAM stays stable. It fails on the pre-fix code and passes after.

3. DSv2 MERGE/UPDATE retry tests (MergeIntoTableSuiteBase, UpdateTableSuiteBase): the
"metric values are stable across stage retries" tests now run under the injection and exercise a
real retry. For the metadata MERGE variants - where the writer's RequiresDistributionAndOrdering
forces a re-shuffle between MergeRowsExec and the writer - MergeRowsExec sits in a non-leaf
shuffle map stage, re-runs under the injection, and its raw per-row counters overcount, while the
SLAM-aware MergeSummary stays correct; the test asserts both. The noMetadata variants skip the
overcount assertion (there MergeRowsExec is in the result stage and cannot be re-run by an
upstream injection). UPDATE writer-side metrics live on the result stage and single-count by
design (ResultStage.findMissingPartitions only re-runs not-yet-completed partitions), so that
test is regression coverage that retries don't break the SLAM-aware UpdateSummary. A noMetadata
accessor is added on RowLevelOperationSuiteBase so MERGE variants can branch on whether the
writer requires a re-shuffle.
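
The eviction problem from part 1 can be sketched with a toy model. This is not DAGScheduler code: `InjectionStateModel`, `Scheduler`, `pendingCorruption`, and the method names are hypothetical stand-ins, and the model assumes only what is described above (state keyed by shuffleId, the producer finishing as its own job under AQE, and cleanup running before the consumer is submitted).

```scala
import scala.collection.mutable

// Toy model only; every name here is hypothetical and none of it is Spark code.
object InjectionStateModel {

  final class Scheduler(evictOnStageRemoval: Boolean) {
    // Shuffles whose output should be corrupted once a consumer reads them.
    private val pendingCorruption = mutable.Set.empty[Int]

    // The producer (map-stage) job registers a deferred corruption for its shuffle.
    def runProducerJob(shuffleId: Int): Unit = {
      pendingCorruption += shuffleId
      // Under AQE the map stage is its own job, so cleanup runs here,
      // before any consumer of this shuffle has been submitted.
      cleanupAfterJob(shuffleId)
    }

    // Pre-fix behaviour evicts the entry; post-fix behaviour retains it.
    private def cleanupAfterJob(shuffleId: Int): Unit =
      if (evictOnStageRemoval) pendingCorruption -= shuffleId

    // The consumer sees a fetch failure only if the corruption is still pending.
    def consumerSeesFetchFailure(shuffleId: Int): Boolean =
      pendingCorruption.contains(shuffleId)
  }

  // Runs producer then consumer and reports whether a retry would be triggered.
  def retryFires(evictOnStageRemoval: Boolean): Boolean = {
    val scheduler = new Scheduler(evictOnStageRemoval)
    scheduler.runProducerJob(shuffleId = 0)
    scheduler.consumerSeesFetchFailure(shuffleId = 0)
  }
}
```

In this model `retryFires(evictOnStageRemoval = true)` is false and `retryFires(evictOnStageRemoval = false)` is true, which mirrors the pre-fix and post-fix behaviour described in part 1.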

Why are the changes needed?

The DSv2 DML retry tests added in SPARK-56743 only verified that SLAM values stay correct when
retries happen, a check that passes vacuously when no retry fires. With the merged injection infra
they did not actually trigger a writer-stage retry under AQE, because the per-stage eviction
dropped the deferred corruption between AQE's per-shuffle jobs. This PR makes the tests demand a
real retry (raw-metric overcount), fixes the infra so that the retry actually happens under AQE, and
adds an infra-level AQE test so the regression is caught directly in MetricsFailureInjectionSuite
rather than only end-to-end.
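
A minimal sketch of why the old assertion was vacuous, and what the stronger one adds. It is a simplification under stated assumptions: `RetryAssertionModel`, `StageMetrics`, and the check names are hypothetical, and the model treats a retry as re-running the whole stage, whereas a real retry may re-run only some tasks.

```scala
// Toy model only; the names are hypothetical and this is not the test code in the PR.
object RetryAssertionModel {

  // One entry per stage attempt, holding the rows that attempt counted.
  final case class StageMetrics(rowsPerAttempt: Seq[Long]) {
    // A plain accumulator sums every attempt, so re-runs overcount.
    def rawCounter: Long = rowsPerAttempt.sum
    // A last-attempt-style metric reports only the final attempt.
    def lastAttempt: Long = rowsPerAttempt.last
  }

  // Simulates a stage that counts `rows` rows and is re-run `retries` times.
  def run(rows: Long, retries: Int): StageMetrics =
    StageMetrics(Seq.fill(retries + 1)(rows))

  // Weak check: holds whether or not a retry happened.
  def stableOnly(m: StageMetrics, expected: Long): Boolean =
    m.lastAttempt == expected

  // Strong check: also demands evidence that a retry actually fired.
  def stableAndRetried(m: StageMetrics, expected: Long): Boolean =
    m.lastAttempt == expected && m.rawCounter > expected
}
```

With no retry, `stableOnly` passes but `stableAndRetried` fails, so only the stronger check notices that the injection never fired.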

Does this PR introduce any user-facing change?

No. The DAGScheduler change only affects test-only state allocated under Utils.isTesting; the
rest is test code.

How was this patch tested?

  • New + existing MetricsFailureInjectionSuite (13 tests, incl. the new AQE test) pass; the new
    AQE test was confirmed to fail on the pre-fix code (eviction present) and pass after.
  • SQLLastAttemptMetricIntegrationSuite (+ WithStageRetries / WithChecksumMismatch) and
    SQLLastAttemptMetricPlanShapesSuite still pass (258 tests) - no regression from the
    DAGScheduler change.
  • All 4 MERGE and 4 UPDATE row-level-operation variants pass; metadata MERGE genuinely overcounts
    the raw MergeRowsExec accumulator (numTargetRowsUpdated=6) while MergeSummary reports 2.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code, Opus 4.8.

…ric tests; add AQE injection coverage

Follow-up to SPARK-56743 (the INJECT_SHUFFLE_FETCH_FAILURES injection infra
and SQLLastAttemptMetric for DSv2 DML metrics). Makes the DSv2 MERGE/UPDATE
retry tests actually trigger a writer-stage retry, fixes a DAGScheduler bug
that prevented that under AQE, and closes the test gap that let the bug
through.

DAGScheduler: stop evicting the test-only injection state per-stage.
cleanupStateForJobAndIndependentStages removed the per-shuffle injection
bookkeeping (the three injectShuffleFetchFailures* maps) whenever a stage was
removed. Under AQE each Exchange is materialized as its own map-stage job, so
that cleanup ran between the producer job and the consumer job and dropped the
pending deferred corruption before the consumer was ever submitted - no
FetchFailed, no retry. The maps are keyed by the globally-unique (never-reused)
shuffleId, only allocated under Utils.isTesting, and bounded by the number of
shuffles in a test SparkContext, so they are retained for its lifetime
(restoring the original body of that cleanup loop).

MetricsFailureInjectionSuite: add AQE coverage. The existing
INJECT_SHUFFLE_FETCH_FAILURES tests all run with AQE disabled (the suite mixes
in DisableAdaptiveExecutionSuite), so none exercised AQE's per-shuffle
materialization - the exact path where the eviction above suppressed the retry.
New test "Three stage metrics block failure injection with AQE" runs the same
3-stage query with ADAPTIVE_EXECUTION_ENABLED=true and asserts the non-leaf
stage's raw counter overcounts (a retry actually fired) while SLAM stays stable.
It fails on the pre-fix code and passes after.

DSv2 MERGE/UPDATE retry tests: the "metric values are stable across stage
retries" tests now run under the injection and exercise a real retry. For the
metadata MERGE variants - where the writer's RequiresDistributionAndOrdering
forces a re-shuffle between MergeRowsExec and the writer - MergeRowsExec sits in
a non-leaf shuffle map stage, re-runs under the injection, and its raw per-row
counters overcount while the SLAM-aware MergeSummary stays correct; the test
asserts both. The noMetadata variants skip the overcount assertion (there
MergeRowsExec is in the result stage and cannot be re-run by an upstream
injection). UPDATE writer-side metrics live on the result stage and single-count
by design, so that test is regression coverage that retries don't break the
SLAM-aware UpdateSummary. Adds a noMetadata accessor on
RowLevelOperationSuiteBase so MERGE variants can branch on whether the writer
requires a re-shuffle.

Co-authored-by: Isaac
