[SPARK-56743][SPARK-56773][SQL][CORE][TESTS] Exercise writer-stage retries in DSv2 DML metric tests and fix injection-state cleanup under AQE #56597
Open
juliuszsompolski wants to merge 1 commit into
…ric tests; add AQE injection coverage Follow-up to SPARK-56743 (the INJECT_SHUFFLE_FETCH_FAILURES injection infra and SQLLastAttemptMetric for DSv2 DML metrics). Makes the DSv2 MERGE/UPDATE retry tests actually trigger a writer-stage retry, fixes a DAGScheduler bug that prevented that under AQE, and closes the test gap that let the bug through. DAGScheduler: stop evicting the test-only injection state per-stage. cleanupStateForJobAndIndependentStages removed the per-shuffle injection bookkeeping (the three injectShuffleFetchFailures* maps) whenever a stage was removed. Under AQE each Exchange is materialized as its own map-stage job, so that cleanup ran between the producer job and the consumer job and dropped the pending deferred corruption before the consumer was ever submitted - no FetchFailed, no retry. The maps are keyed by the globally-unique (never-reused) shuffleId, only allocated under Utils.isTesting, and bounded by the number of shuffles in a test SparkContext, so they are retained for its lifetime (restoring the original body of that cleanup loop). MetricsFailureInjectionSuite: add AQE coverage. The existing INJECT_SHUFFLE_FETCH_FAILURES tests all run with AQE disabled (the suite mixes in DisableAdaptiveExecutionSuite), so none exercised AQE's per-shuffle materialization - the exact path where the eviction above suppressed the retry. New test "Three stage metrics block failure injection with AQE" runs the same 3-stage query with ADAPTIVE_EXECUTION_ENABLED=true and asserts the non-leaf stage's raw counter overcounts (a retry actually fired) while SLAM stays stable. It fails on the pre-fix code and passes after. DSv2 MERGE/UPDATE retry tests: the "metric values are stable across stage retries" tests now run under the injection and exercise a real retry. 
For the metadata MERGE variants - where the writer's RequiresDistributionAndOrdering forces a re-shuffle between MergeRowsExec and the writer - MergeRowsExec sits in a non-leaf shuffle map stage, re-runs under the injection, and its raw per-row counters overcount while the SLAM-aware MergeSummary stays correct; the test asserts both. The noMetadata variants skip the overcount assertion (there MergeRowsExec is in the result stage and cannot be re-run by an upstream injection). UPDATE writer-side metrics live on the result stage and single-count by design, so that test is regression coverage that retries don't break the SLAM-aware UpdateSummary. Adds a noMetadata accessor on RowLevelOperationSuiteBase so MERGE variants can branch on whether the writer requires a re-shuffle. Co-authored-by: Isaac
What changes were proposed in this pull request?
Follow-up to SPARK-56743 (the INJECT_SHUFFLE_FETCH_FAILURES injection infra and SLAM, i.e. SQLLastAttemptMetric, for DSv2 DML metrics). It strengthens the DSv2 MERGE/UPDATE retry tests so they actually exercise a writer-stage retry, fixes a DAGScheduler bug that prevented that under AQE, and closes the test-coverage gap that let the bug through.
Three parts:
1. DAGScheduler: don't evict the test-injection state per-stage. cleanupStateForJobAndIndependentStages removed the per-shuffle injection bookkeeping (the three injectShuffleFetchFailures* maps) whenever a stage was removed. Under AQE each Exchange is materialized as its own map-stage job, so that cleanup ran between the producer job and the consumer job and dropped the pending deferred corruption before the consumer was ever submitted - no FetchFailed, no retry. The maps are keyed by the globally-unique (never-reused) shuffleId, only allocated under Utils.isTesting, and bounded by the number of shuffles in a test SparkContext, so they are simply retained for its lifetime (restoring the original body of that cleanup loop).
2. MetricsFailureInjectionSuite: add AQE coverage. The existing INJECT_SHUFFLE_FETCH_FAILURES tests all run with AQE disabled (the suite mixes in DisableAdaptiveExecutionSuite), so none exercised AQE's per-shuffle-materialization path - the exact path where the eviction above suppressed the retry. New test "Three stage metrics block failure injection with AQE" runs the same 3-stage query with ADAPTIVE_EXECUTION_ENABLED=true and asserts the non-leaf stage's raw counter overcounts (a retry actually fired) while SLAM stays stable. It fails on the pre-fix code and passes after.
3. DSv2 MERGE/UPDATE retry tests (MergeIntoTableSuiteBase, UpdateTableSuiteBase): the "metric values are stable across stage retries" tests now run under the injection and exercise a real retry. For the metadata MERGE variants - where the writer's RequiresDistributionAndOrdering forces a re-shuffle between MergeRowsExec and the writer - MergeRowsExec sits in a non-leaf shuffle map stage, re-runs under the injection, and its raw per-row counters overcount, while the SLAM-aware MergeSummary stays correct; the test asserts both. The noMetadata variants skip the overcount assertion (there MergeRowsExec is in the result stage and cannot be re-run by an upstream injection). UPDATE writer-side metrics live on the result stage and single-count by design (ResultStage.findMissingPartitions only re-runs not-yet-completed partitions), so that test is regression coverage that retries don't break the SLAM-aware UpdateSummary. A noMetadata accessor is added on RowLevelOperationSuiteBase so MERGE variants can branch on whether the writer requires a re-shuffle.
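The eviction problem in part 1 can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyScheduler`, `registerInjection`, `cleanupAfterJob`, and `consumerFetch` are hypothetical names invented for illustration, not the DAGScheduler API, and the single map stands in for the three injectShuffleFetchFailures* maps.

```scala
import scala.collection.mutable

// Toy model only: none of these names exist in Spark.
object InjectionEvictionSketch {

  final class ToyScheduler(evictOnStageRemoval: Boolean) {
    // Pending test-only injections, keyed by shuffleId (assumed never reused).
    private val pendingInjection = mutable.Map.empty[Int, String]

    /** The producer side registers a deferred corruption for its shuffle. */
    def registerInjection(shuffleId: Int): Unit =
      pendingInjection(shuffleId) = "corrupt-on-fetch"

    /** Runs when a job finishes and its stages are removed. */
    def cleanupAfterJob(shuffleIds: Seq[Int]): Unit =
      if (evictOnStageRemoval) {
        shuffleIds.foreach(id => pendingInjection.remove(id))
      }

    /** Returns true if a fetch failure is injected for this shuffle. */
    def consumerFetch(shuffleId: Int): Boolean =
      pendingInjection.contains(shuffleId)
  }

  /** Models AQE ordering: the producer job is cleaned up before the consumer job is submitted. */
  def retryFires(evictOnStageRemoval: Boolean): Boolean = {
    val scheduler = new ToyScheduler(evictOnStageRemoval)
    val shuffleId = 0
    scheduler.registerInjection(shuffleId)    // producer map-stage job runs
    scheduler.cleanupAfterJob(Seq(shuffleId)) // producer job ends, its stages are removed
    scheduler.consumerFetch(shuffleId)        // consumer job is submitted afterwards
  }
}
```

In this model `retryFires(evictOnStageRemoval = true)` returns false because the pending entry is gone before the consumer looks it up, while `retryFires(evictOnStageRemoval = false)` returns true. The real scheduler state is more involved; the sketch only captures the ordering between per-job cleanup and consumer submission.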
Why are the changes needed?
The DSv2 DML retry tests added in SPARK-56743 only verified that SLAM values stay correct given
retries happen - which is vacuously true even when no retry fires. With the merged injection infra
they did not actually trigger a writer-stage retry under AQE, because the per-stage eviction
dropped the deferred corruption between AQE's per-shuffle jobs. This PR makes the tests demand a
real retry (raw-metric overcount), fixes the infra so that retry actually happens under AQE, and
adds an infra-level AQE test so the regression is caught directly in MetricsFailureInjectionSuite rather than only end-to-end.
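The difference between a vacuous stability check and one that demands a real retry can be sketched with a toy model of the two counting behaviours. The names here (`TaskAttempt`, `rawCount`, `lastAttemptCount`, `retryObserved`) are hypothetical and are not Spark's metric classes, and the row counts are made up for illustration rather than taken from the tests.

```scala
// Toy model only: these names are illustrative, not Spark's metric classes.
object RetryMetricSketch {

  /** One task attempt: its partition, the stage attempt it ran in, and the rows it counted. */
  final case class TaskAttempt(partition: Int, stageAttempt: Int, rows: Long)

  /** Raw accumulator behaviour: every attempt's contribution is summed. */
  def rawCount(attempts: Seq[TaskAttempt]): Long =
    attempts.map(_.rows).sum

  /** Last-attempt behaviour: per partition, only the latest stage attempt counts. */
  def lastAttemptCount(attempts: Seq[TaskAttempt]): Long =
    attempts.groupBy(_.partition).values.map(_.maxBy(_.stageAttempt).rows).sum

  /** A retry is only demonstrated if the raw counter exceeds the last-attempt value. */
  def retryObserved(attempts: Seq[TaskAttempt]): Boolean =
    rawCount(attempts) > lastAttemptCount(attempts)
}
```

With two partitions of one row each and no retry, both counts are 2, so asserting only that the last-attempt value equals 2 passes whether or not a retry happened. If both partitions re-run in a second stage attempt, the raw count becomes 4 while the last-attempt count stays 2, and `retryObserved` turns true; that extra assertion is what makes the test fail when the injection silently does nothing.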
Does this PR introduce any user-facing change?
No. The DAGScheduler change only affects test-only state allocated under Utils.isTesting; the rest is test code.
How was this patch tested?
MetricsFailureInjectionSuite (13 tests, incl. the new AQE test) passes; the new AQE test was confirmed to fail on the pre-fix code (eviction present) and pass after.
SQLLastAttemptMetricIntegrationSuite (+ WithStageRetries / WithChecksumMismatch) and SQLLastAttemptMetricPlanShapesSuite still pass (258 tests) - no regression from the DAGScheduler change.
[...] the raw MergeRowsExec accumulator (numTargetRowsUpdated=6) while MergeSummary reports 2.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code, Opus 4.8.