[SPARK-56919][SQL] Move setupJob before materializeAdaptiveSparkPlan to prevent table path loss#56126
Conversation
|
@dongjoon-hyun / @gengliangwang. Could you please review? This is a one-line fix - moves |
|
Gentle ping @cloud-fan @gengliangwang - this is a one-line fix for a data loss bug: |
yadavay-amzn
left a comment
There was a problem hiding this comment.
Comments (no approval)
Assessment: The fix correctly identifies a real data-loss bug: INSERT OVERWRITE deletes the table path before calling FileFormatWriter.write(), and if materializeAdaptiveSparkPlan throws (AQE shuffle failure), setupJob was never reached, so the output directory was never recreated — permanent path loss.
Moving setupJob before materializeAdaptiveSparkPlan ensures the directory is recreated before anything can throw. The test is well-constructed and exercises the actual runtime failure path (not just plan shape).
Blocking concern:
-
Staging directory leak on AQE failure: With the new ordering, if
materializeAdaptiveSparkPlanthrows aftersetupJobbut beforewriteAndCommit, neithercommitJobnorabortJobis called.setupJob(via Hadoop'sFileOutputCommitter) creates a staging directory atoutputPath/.spark-staging-{jobId}/_temporary/. This directory will be leaked. The old code had the same semantic guarantee —setupJobwas intentionally placed outside the try/catch block inwriteAndCommit— but in the old code, ifsetupJobsucceeded, execution always enteredwriteAndCommit's try block, soabortJobwould fire on any subsequent failure. Now there's a gap. Consider wrapping the post-setupJobsection ofwrite()in a try/catch that callscommitter.abortJob(job)on failure, e.g.:committer.setupJob(job) try { // materializeAdaptiveSparkPlan + the rest of write() ... } catch { case cause: Throwable => committer.abortJob(job) throw cause }
Or at minimum, document that the staging dir may leak in this failure window and explain why that's acceptable (it arguably is — temp dirs are ephemeral and the data-loss fix is more important).
Non-blocking:
-
nit: The removed comment ("This call shouldn't be put into the
tryblock below because it only initializes and prepares the job, any exception thrown from here shouldn't cause abortJob() to be called") was historically accurate. Now thatsetupJobis in a different location entirely, the deletion of that comment is correct, but the new comment could mention why it's safe to not have abortJob protection here (staging dir is acceptable to leak vs. losing the table path). -
nit: The test uses
"spark.sql.optimizer.plannedWrite.enabled"as a string literal. Consider usingSQLConf.PLANNED_WRITE_ENABLED.keyfor consistency with the other config references. -
The fix also affects
FileStreamSink(another caller ofFileFormatWriter.write). This should be benign since streaming sinks don't do the delete-before-write pattern, but noting for awareness.
What I verified
-
Code-traced the full write path:
InsertIntoHadoopFsRelationCommand.run()→deleteMatchingPartitions(deletes table path) →FileFormatWriter.write()→ (old: materialize → writeAndCommit → setupJob; new: setupJob → materialize → writeAndCommit). Confirmed the ordering change is the minimal fix. -
Verified setupJob semantics:
HadoopMapReduceCommitProtocol.setupJobcreates job/task IDs in config, instantiates the OutputCommitter, and calls Hadoop'scommitter.setupJobwhich creates the output directory. Nothing between the newsetupJobposition andwriteAndCommitdepends on config set bysetupJob, andwriteJobUUID(set aftersetupJob) is not consumed bysetupJob. -
Verified no double-call:
setupJobis removed fromwriteAndCommitand added once inwrite(), covering both the planned-write and unplanned-write paths via the twoexecuteWriteoverloads. -
Verified test exercises real bug path:
plannedWrite.enabled=falseensuresmaterializeAdaptiveSparkPlanis called;repartition(2)forces shuffle;fail_udfthrows during shuffle execution insideAdaptiveSparkPlanExec.finalPhysicalPlan; the assertion checks the actual filesystem path existence. This is a proper runtime regression test, not just a plan-shape check. Rating: ADEQUATE — exercises the exact failure mode. -
Did not run the test locally — assessed via code trace only.
cloud-fan
left a comment
There was a problem hiding this comment.
Review summary: correct, minimal fix for a real data-loss bug — 0 blocking, 1 non-blocking, 2 nits.
Design / architecture (non-blocking)
FileFormatWriter.writenow callssetupJoboutsidewriteAndCommit'stry/catch, so ifmaterializeAdaptiveSparkPlanthrows, neithercommitJobnorabortJobruns and the committer staging dir (_temporary/.spark-staging-{jobId}) is leaked. This is the concern @yadavay-amzn already raised as blocking — I'd rate it non-blocking: the leaked dirs are_/.-prefixed (filtered fromFileIndexreads) and self-heal on the next overwrite'sdeleteMatchingPartitions, and preventing the table-path loss is the priority. The clean form is to wrap materialize + the rest ofwrite()intry { ... } catch { case c: Throwable => committer.abortJob(job); throw c }; I confirmedHadoopMapReduceCommitProtocol.abortJobdeletes only_temporary/stagingDir, not the output path, so this preserves the fix rather than re-introducing the bug.
Nits (both already noted by @yadavay-amzn):
- Test: prefer
SQLConf.PLANNED_WRITE_ENABLED.keyover the"spark.sql.optimizer.plannedWrite.enabled"string literal. - The new comment could note why there's no
abortJobprotection at the newsetupJobsite (a leaked temp dir is acceptable vs. losing the table path).
Verification
Independently confirmed the reorder is safe:
setupJobdoes not readspark.sql.sources.writeJobUUID(set on the job config at line 174, aftersetupJob) — its jobId comes from the ctor arg /createJobID, so moving it earlier introduces no ordering dependency.setupJobis still invoked exactly once: removed fromwriteAndCommit(the shared path for bothexecuteWriteoverloads) and added once inwrite().- The other caller,
FileStreamSink, doesn't delete-before-write, so the earliersetupJobis benign there.
Since @yadavay-amzn's review already covers every point, I'm not adding duplicate inline comments.
65ae865 to
92d3b62
Compare
|
Thanks @yadavay-amzn and @cloud-fan for the reviews! Addressed all points in the latest commit:
|
cloud-fan
left a comment
There was a problem hiding this comment.
3 addressed, 0 remaining, 4 new. (4 = 4 newly introduced, 0 late catches.)
0 blocking, 3 non-blocking, 1 nit.
Correct, minimal fix for a real data-loss bug; the prior round's concerns are resolved. The new items are all consequences of the latest commit's additions - none blocks merge.
Design / architecture (1)
- FileFormatWriter.scala:208: outer catch double-calls
abortJobon write/commit failure (writeAndCommitalready aborts) - see inline
Correctness (1)
- FileFormatWriter.scala:153: "leaked staging dir is acceptable" comment is stale - the added
abortJobcleans it up - see inline
Suggestions (1)
- FileFormatWriter.scala:158:
trybody not re-indented (optional; diff-size tradeoff) - see inline
Nits: 1 minor item (see inline comments).
Verification
Confirmed the bug and fix by code trace: InsertIntoHadoopFsRelationCommand eagerly deletes the output path (deleteMatchingPartitions -> fs.delete, line 136) before FileFormatWriter.write; pre-fix, a materializeAdaptiveSparkPlan throw skipped setupJob, so the dir was never recreated. Moving setupJob earlier recreates it before materialize can throw, and setupJob doesn't read writeJobUUID (set later at line 178), so the reorder is safe. HadoopMapReduceCommitProtocol.abortJob deletes only _temporary/staging, not the output path, so the new catch preserves the fix.
PR description suggestions
- Document: the
try/catch+abortJobcleanup added in the latest commit. The "What changes were proposed" section still describes only the original one-linesetupJobmove; the abort-on-failure handling (and thatabortJobnow runs whenmaterializeAdaptiveSparkPlanthrows) is a substantive part of the change and is undocumented.
| } | ||
| } catch { | ||
| case cause: Throwable => | ||
| committer.abortJob(job) |
There was a problem hiding this comment.
writeAndCommit already calls abortJob and rethrows on a write- or commit-failure, and both executeWrite calls sit inside this new try - so on those paths abortJob now runs twice (it was once before this PR). It's harmless for the built-in committers (their abortJob is idempotent), but it's a behavior change to the FileCommitProtocol.abortJob contract, and a non-idempotent custom committer could double-clean. The materialize-failure path you're fixing only hits this outer catch, so it's unaffected. Simplest single-abort form: drop writeAndCommit's now-redundant try/catch and move its logError here.
| // A leaked staging dir (_temporary / .spark-staging-*) is acceptable vs. losing the table | ||
| // path — these dirs are dot/underscore-prefixed (filtered from reads) and self-heal on the | ||
| // next overwrite. |
There was a problem hiding this comment.
This rationale is now stale: the try/catch you added below calls abortJob on any failure after setupJob, and HadoopMapReduceCommitProtocol.abortJob deletes _temporary / .spark-staging-* - so the staging dir isn't actually leaked on the materialize-failure path. The only leak window is setupJob itself throwing (it's outside the try). The rewrite below also drops the non-ASCII em-dash that would otherwise fail scalastyle.
| // A leaked staging dir (_temporary / .spark-staging-*) is acceptable vs. losing the table | |
| // path — these dirs are dot/underscore-prefixed (filtered from reads) and self-heal on the | |
| // next overwrite. | |
| // setupJob is outside the try below because it only initializes the job; the try/catch | |
| // calls abortJob on any later failure (e.g. materialize throwing), which cleans up the | |
| // staging dir (_temporary / .spark-staging-*). |
| // next overwrite. | ||
| committer.setupJob(job) | ||
|
|
||
| try { |
There was a problem hiding this comment.
Optional: the body here isn't re-indented, so def materializeAdaptiveSparkPlan and the locals read as method-level rather than inside the try. Re-indenting ~47 lines bloats the diff, so leaving it is a defensible call - your judgment on the project norm.
…to prevent table path loss INSERT OVERWRITE deletes the output path before calling FileFormatWriter.write(). If materializeAdaptiveSparkPlan throws (e.g., AQE shuffle stage failure), setupJob inside writeAndCommit is never reached, leaving the path permanently deleted. Move committer.setupJob(job) to before the materializeAdaptiveSparkPlan call so the output path is recreated regardless of whether AQE succeeds. Closes SPARK-56919'
92d3b62 to
e4736d8
Compare
|
Thanks @cloud-fan! for the review. Addressed all items in the latest commit:
|
What changes were proposed in this pull request?
committer.setupJob(job)from insidewriteAndCommit()to beforematerializeAdaptiveSparkPlan()inFileFormatWriter.write(), so the output path is recreated before anything can throw.setupJobbody intry { ... } catch { committer.abortJob(job); throw }so the staging dir is cleaned up on any failure (e.g., AQE shuffle stage failure inmaterializeAdaptiveSparkPlan).writeAndCommit's innertry / catch + abortJobsince the outer catch now handles it - avoiding double-calling ofabortJobfor write / commit failuresWhy are the changes needed?
INSERT OVERWRITEdeletes the output path before callingwrite(). WhenmaterializeAdaptiveSparkPlanthrows (AQE shuffle stage failure),writeAndCommitis never reached, sosetupJobnever recreates the path. The table path is permanently lost. The outertry / catchensuresabortJobcleans up thestaging dir (_temporary / .spark-staging-*)on any failure after setupJob.Does this PR introduce any user-facing change?
Yes. Previously, a failed
INSERT OVERWRITEwith AQE could permanently delete the table path. Now the path survives the failure.How was this patch tested?
Added regression test in
InsertSuitethat uses a failing UDF in a shuffle stage to trigger AQE failure duringmaterializeAdaptiveSparkPlan. Verifies the table path exists after the failed overwrite.Was this patch authored or co-authored using generative AI tooling?
Yes. Authored using Claude Opus 4.6.