[SPARK-57491][CORE] Prevent and detect stale push-based shuffle data from duplicate map task attempts#56559
[SPARK-57491][CORE] Prevent and detect stale push-based shuffle data from duplicate map task attempts#56559gaoyajun02 wants to merge 5 commits into
Conversation
…from duplicate map task attempts
cloud-fan
left a comment
There was a problem hiding this comment.
2 blocking, 5 non-blocking, 3 nits.
A well-motivated correctness fix filling a real gap in the existing checksum-mismatch path; two blocking items below.
Design / architecture (2)
- MapOutputTracker.scala:121:
stale*MapId*naming holds partitionId (== mapIndex), notMapStatus.mapId— a load-bearing invariant the names hide — see inline - (non-blocking, question) overlaps the existing
checksumMismatchIndices/rollback path and the merger's own stale-block handling; document the interaction or consider unifying
Correctness (4)
- MapOutputTrackerSuite.scala:640 (existing, unchanged line): the "SPARK-32921" test still binds the reply with
askSync[(Array[Byte], Array[Byte])], but the endpoint now returns a 3-tuple →ClassCastException. This test needs updating. - MapOutputTracker.scala:114:
staleMapIdsis never cleared except byunregisterShuffle— permanent fallback even after a clean recompute — see inline - TaskSetManager.scala:966: over-broad marking (winner re-delivery; attempt may never have pushed) — see inline
- PushBasedFetchHelper.scala:313:
@returnoverstates precision ("stale data" vs "marked partition present") — see inline
Suggestions (1)
- No tests exercise the reduce-side
checkStaleMapIdInMergedBlockfallback or the deferred-push skip-on-kill
Nits: 3 minor items (see inline comments).
Verification
Traced the marked-value invariant: ShuffleMapTask pushes with mapIndex = partitionId, RemoteBlockPushResolver records chunkTracker.add(mapIndex), and the driver marks tasks(index).partitionId. So the reduce-side bitmap.contains(id) works only because partitionId == mapIndex — never because it uses MapStatus.mapId. Detection is functionally correct today, but per-partition rather than per-attempt: both attempts share the mapIndex, so a marked block always falls back regardless of whether duplicate bytes are actually present (conservative and safe).
PR description suggestions
- Document: why the existing checksum-mismatch / rollback machinery does not cover this case (the loser's
MapStatusis never registered, so no checksum comparison happens).
| * detects that multiple task attempts for the same map output pushed data to the merger. | ||
| * @param staleMapId the mapId of the stale (redundant) attempt | ||
| */ | ||
| def markStalePushedMap(staleMapId: Int): Unit = withWriteLock { |
There was a problem hiding this comment.
This set holds the partitionId, which equals the mapIndex the merger records in its chunk bitmaps — not MapStatus.mapId. ShuffleMapTask pushes with mapIndex = partitionId, RemoteBlockPushResolver does chunkTracker.add(mapIndex), and the driver marks tasks(index).partitionId, so the reduce-side contains() check works only because partitionId == mapIndex. Naming all of this *MapId* (and logging mapId=${mapStatus.mapId} in TaskSetManager) hides that invariant: someone "fixing" the marked value to the real mapId would break the bitmap match and serve stale data as clean. Suggest renaming to staleMapIndexes / markStalePushedPartition and logging the partitionId you actually mark.
There was a problem hiding this comment.
Agreed rename plan:
The codebase already uses staleMapIndexes / markStalePushedPartition / getStaleMapIndexes in most places. The remaining inconsistencies are:
MapOutputTrackerWorker:staleMapIds→ should bestaleMapIndexes(for consistency withShuffleStatusandPushBasedFetchHelper)TaskSetManagerlog line 967:mapId=${mapStatus.mapId}→ should bepartitionId=$partitionId(log what we actually mark, not the MapStatus field)- Test names / comments: any remaining
*MapId*references to the stale set should use*PartitionId*or*MapIndex*
We'll clean these up to make the invariant obvious and prevent future "corrections" that break the bitmap match. Thanks for catching this.
| // partition already has a successful attempt registered. Any MapStatus arriving | ||
| // here is from a stale (redundant) attempt that also pushed data. | ||
| // Mark its mapId as stale so reducers can detect it in merged block chunks. | ||
| shuffleStatus.markStalePushedMap(partitionId) |
There was a problem hiding this comment.
This marks the partition stale unconditionally, without checking whether this attempt actually pushed. With the new deferred push, a killed attempt's completion listener skips the push, so it usually pushed nothing — yet it is still marked, forcing reducers to fall back. The info.finished call site is sharper: it can fire on a re-delivered winning attempt, marking the legitimate partition. The net effect is unnecessary fallback on most speculation/retry, including deterministic stages. Can the marking be gated on evidence that the attempt pushed (and skip the winner re-delivery case)?
There was a problem hiding this comment.
Thank you for the review. You raise an excellent point about the unconditional marking. After tracing through the code, here's my analysis:
- The info.finished path: winner re-delivery, not a stale push.
handleSuccessfulTask is serialized under TaskSchedulerImpl.synchronized (line 896), so two results for the same partition are never processed concurrently — they are strictly ordered. The info.finished guard exists to handle message retransmission of the same winning attempt, not a concurrent late arrival from a different attempt. Marking it as stale is incorrect: this is the canonical result arriving twice. - The killedByOtherAttempt path: push has occurredFor a task result to reach
handleSuccessfulTask, the executor must have sent it with state FINISHED (Executor.scala:1029). A task only reports FINISHED after runTask() completes successfully and the completion listener fires. At that point, ShuffleWriteProcessor's listener checks !context.isInterrupted() && !context.isFailed() — both are still false because the task finished normally. So initiateBlockPush() is always called before the result is sent back. The kill signal arrives later (or concurrently on a different thread) but cannot retroactively undo the already-submitted push job.
We have actually observed this in production(with this PR's
changes applied):
when the driver receives two tasks with the same mapIndex and rejects one
our debug logs confirm that the rejected task had already started pushing before its result arrived at the driver. This validates that the killedByOtherAttempt path always represents a real stale push.
Therefore, only the killedByOtherAttempt path legitimately needs markStalePushedPartition. The info.finished path should skip it entirely.
| * We record the stale mapIds here so the reduce side can check chunkBitmaps and fallback | ||
| * if stale data is present in a merged block. | ||
| */ | ||
| private[this] val staleMapIds = new java.util.HashSet[Int]() |
There was a problem hiding this comment.
staleMapIds is only ever added to — addMapOutput/updateMapOutput and the output-invalidation paths never clear it; only unregisterShuffle drops it. So a partition stays marked, and its merged blocks keep falling back to unmerged, for the whole life of the shuffle, even after a clean recompute re-registers a valid output. Clear the relevant id when its mapIndex is (re-)registered.
There was a problem hiding this comment.
We've carefully evaluated whether staleMapIndexes should be cleared in addMapOutput, and believe the current design is correct as-is. Here's the reasoning:
Core principle: staleMapIndexes cleanup should be tied to merged block lifecycle, not map output lifecycle.
The stale set protects against reading merged blocks that contain stale chunks. A partition's stale mark can only be safely removed when we know all corresponding merged blocks have been refreshed or discarded — which means all merger locations must be updated (either new shuffleMergeId, or explicit merge result invalidation).
Why not clearing in addMapOutput:
-
addMapOutputonly updates map-side metadata (mapStatuses) — it does not invalidate any existing merged blocks on mergers. The physical merged files on the external shuffle service may still contain stale chunks from duplicate pushes. -
Within the same
stageAttempt/shuffleMergeId, a late-arriving speculativeaddMapOutputcall is exactly the event whose MapStatus is the stale one — clearing the stale mark here would defeat layers 2/3.
Stage retry (full reset) — confirmed handled:
When unregisterAllMapAndMergeOutput is called (stage retry / rollback / barrier stage abort), it triggers:
removeMergeResultsByFilter(x => true)— all merge results clearedremoveShuffleMergerLocations()— all merger locations clearednewShuffleMergeState()→shuffleMergeId += 1— new merge IDincrementEpoch()— workers fetch freshShuffleStatuson nextgetStatusescall
The worker-side staleMapIndexes for this shuffleId is either cleared by unregisterShuffle or replaced entirely when workers re-fetch with the new epoch and get a fresh serialized snapshot where the driver-side staleMapIndexes set is empty (new ShuffleStatus).
Partial merge cleanup (single FetchFailed) — intentional keep:
For non-barrier stages, unregisterMergeResult only removes a specific <reduceId, bmAddress> merge entry without incrementing shuffleMergeId. Other reduces' merged blocks on the same mergers may still contain stale chunks, so keeping the stale mark and falling back is the safe choice.
Future optimization (not blocking):
If we want to avoid unnecessary fallbacks in the partial-cleanup case, we could track stale at merger-location granularity (e.g., staleMapIndexesByMerger[mergerAddress][partitionId]) and clear entries when that specific merger's merge results are all invalidated. But this adds complexity for marginal gain — the fallback path is correct and only costs performance, not correctness.
In summary: the current "only-add, clear-on-shuffle-end" semantics are intentional and match the merged block lifecycle. We're happy to add location-level tracking as a follow-up optimization if desired.
What changes were proposed in this pull request?
Fix push-based shuffle serving stale/incorrect data when multiple task attempts for the same map partition both push data to the external merger. Three layers of defense:
Why are the changes needed?
With push-based shuffle enabled, indeterminate stages can produce incorrect results when speculation or task retry causes two attempts of the same map partition to both push data to the external merger. The merged block ends up with interleaved data from both attempts, but only one attempt's MapStatus is committed on the driver. Downstream reducers silently read corrupted/duplicated data.
This is triggered by non-deterministic functions (rand(), UUID(), etc.) in shuffle keys or join conditions where row ordering differs between attempts. Even with deferred push (layer 1), there is a race window: if an original attempt succeeds and pushes, then gets killed by a speculative winner that also pushes, the merger receives data from both attempts but only one MapStatus is registered.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
Yes, co-authored with GLM-5V-Turbo.