Skip to content

[SPARK-57491][CORE] Prevent and detect stale push-based shuffle data from duplicate map task attempts#56559

Open
gaoyajun02 wants to merge 5 commits into
apache:masterfrom
gaoyajun02:SPARK-33235
Open

[SPARK-57491][CORE] Prevent and detect stale push-based shuffle data from duplicate map task attempts#56559
gaoyajun02 wants to merge 5 commits into
apache:masterfrom
gaoyajun02:SPARK-33235

Conversation

@gaoyajun02

@gaoyajun02 gaoyajun02 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Fix push-based shuffle serving stale/incorrect data when multiple task attempts for the same map partition both push data to the external merger. Three layers of defense:

  • ShuffleWriteProcessor — Defer shuffle block push to TaskCompletionListener so that killed/failed tasks never push. The listener checks context.isInterrupted() / context.isFailed() before initiating push, and logs which task was skipped and why.
  • TaskSetManager + MapOutputTracker — Track stale (duplicate) partitionIds on the driver side. When a speculative or retried ShuffleMapTask result arrives after another attempt for the same partition has already committed, TaskSetManager.detectStalePushIfShuffleTask marks the partition's ID as stale in ShuffleStatus. The stale set is serialized into serializedMapAndMergeStatus and propagated to executor-side MapOutputTrackerWorker alongside existing map output metadata.
  • PushBasedFetchHelper — Before reading any merged block on the reducer side, check chunk-level RoaringBitmaps for stale partitionIds via the new checkStaleMapIdInMergedBlock method. If a stale partitionId is found in any chunk bitmap, log a WARN and fall back to fetching original unmerged blocks. When the stale set is empty (the common case), this check short-circuits immediately with zero overhead.

Why are the changes needed?

With push-based shuffle enabled, indeterminate stages can produce incorrect results when speculation or task retry causes two attempts of the same map partition to both push data to the external merger. The merged block ends up with interleaved data from both attempts, but only one attempt's MapStatus is committed on the driver. Downstream reducers silently read corrupted/duplicated data.
This is triggered by non-deterministic functions (rand(), UUID(), etc.) in shuffle keys or join conditions where row ordering differs between attempts. Even with deferred push (layer 1), there is a race window: if an original attempt succeeds and pushes, then gets killed by a speculative winner that also pushes, the merger receives data from both attempts but only one MapStatus is registered.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • UT
  • Manual verification on internal cluster confirmed: speculative duplicate pushes are either prevented by deferred push or detected and fallen back from, with expected WARN log on fallback.

Was this patch authored or co-authored using generative AI tooling?

Yes, co-authored with GLM-5V-Turbo.

@gaoyajun02

Copy link
Copy Markdown
Contributor Author

PTAL @cloud-fan @Ngone51 @mridulm @otterc

@cloud-fan cloud-fan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 blocking, 5 non-blocking, 3 nits.
A well-motivated correctness fix filling a real gap in the existing checksum-mismatch path; two blocking items below.

Design / architecture (2)

  • MapOutputTracker.scala:121: stale*MapId* naming holds partitionId (== mapIndex), not MapStatus.mapId — a load-bearing invariant the names hide — see inline
  • (non-blocking, question) overlaps the existing checksumMismatchIndices/rollback path and the merger's own stale-block handling; document the interaction or consider unifying

Correctness (4)

  • MapOutputTrackerSuite.scala:640 (existing, unchanged line): the "SPARK-32921" test still binds the reply with askSync[(Array[Byte], Array[Byte])], but the endpoint now returns a 3-tuple → ClassCastException. This test needs updating.
  • MapOutputTracker.scala:114: staleMapIds is never cleared except by unregisterShuffle — permanent fallback even after a clean recompute — see inline
  • TaskSetManager.scala:966: over-broad marking (winner re-delivery; attempt may never have pushed) — see inline
  • PushBasedFetchHelper.scala:313: @return overstates precision ("stale data" vs "marked partition present") — see inline

Suggestions (1)

  • No tests exercise the reduce-side checkStaleMapIdInMergedBlock fallback or the deferred-push skip-on-kill

Nits: 3 minor items (see inline comments).

Verification

Traced the marked-value invariant: ShuffleMapTask pushes with mapIndex = partitionId, RemoteBlockPushResolver records chunkTracker.add(mapIndex), and the driver marks tasks(index).partitionId. So the reduce-side bitmap.contains(id) works only because partitionId == mapIndex — never because it uses MapStatus.mapId. Detection is functionally correct today, but per-partition rather than per-attempt: both attempts share the mapIndex, so a marked block always falls back regardless of whether duplicate bytes are actually present (conservative and safe).

PR description suggestions

  • Document: why the existing checksum-mismatch / rollback machinery does not cover this case (the loser's MapStatus is never registered, so no checksum comparison happens).

* detects that multiple task attempts for the same map output pushed data to the merger.
* @param staleMapId the mapId of the stale (redundant) attempt
*/
def markStalePushedMap(staleMapId: Int): Unit = withWriteLock {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This set holds the partitionId, which equals the mapIndex the merger records in its chunk bitmaps — not MapStatus.mapId. ShuffleMapTask pushes with mapIndex = partitionId, RemoteBlockPushResolver does chunkTracker.add(mapIndex), and the driver marks tasks(index).partitionId, so the reduce-side contains() check works only because partitionId == mapIndex. Naming all of this *MapId* (and logging mapId=${mapStatus.mapId} in TaskSetManager) hides that invariant: someone "fixing" the marked value to the real mapId would break the bitmap match and serve stale data as clean. Suggest renaming to staleMapIndexes / markStalePushedPartition and logging the partitionId you actually mark.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed rename plan:

The codebase already uses staleMapIndexes / markStalePushedPartition / getStaleMapIndexes in most places. The remaining inconsistencies are:

  1. MapOutputTrackerWorker: staleMapIds → should be staleMapIndexes (for consistency with ShuffleStatus and PushBasedFetchHelper)
  2. TaskSetManager log line 967: mapId=${mapStatus.mapId} → should be partitionId=$partitionId (log what we actually mark, not the MapStatus field)
  3. Test names / comments: any remaining *MapId* references to the stale set should use *PartitionId* or *MapIndex*

We'll clean these up to make the invariant obvious and prevent future "corrections" that break the bitmap match. Thanks for catching this.

// partition already has a successful attempt registered. Any MapStatus arriving
// here is from a stale (redundant) attempt that also pushed data.
// Mark its mapId as stale so reducers can detect it in merged block chunks.
shuffleStatus.markStalePushedMap(partitionId)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This marks the partition stale unconditionally, without checking whether this attempt actually pushed. With the new deferred push, a killed attempt's completion listener skips the push, so it usually pushed nothing — yet it is still marked, forcing reducers to fall back. The info.finished call site is sharper: it can fire on a re-delivered winning attempt, marking the legitimate partition. The net effect is unnecessary fallback on most speculation/retry, including deterministic stages. Can the marking be gated on evidence that the attempt pushed (and skip the winner re-delivery case)?

@gaoyajun02 gaoyajun02 Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review. You raise an excellent point about the unconditional marking. After tracing through the code, here's my analysis:

  1. The info.finished path: winner re-delivery, not a stale push.
    handleSuccessfulTask is serialized under TaskSchedulerImpl.synchronized (line 896), so two results for the same partition are never processed concurrently — they are strictly ordered. The info.finished guard exists to handle message retransmission of the same winning attempt, not a concurrent late arrival from a different attempt. Marking it as stale is incorrect: this is the canonical result arriving twice.
  2. The killedByOtherAttempt path: push has occurredFor a task result to reach
    handleSuccessfulTask, the executor must have sent it with state FINISHED (Executor.scala:1029). A task only reports FINISHED after runTask() completes successfully and the completion listener fires. At that point, ShuffleWriteProcessor's listener checks !context.isInterrupted() && !context.isFailed() — both are still false because the task finished normally. So initiateBlockPush() is always called before the result is sent back. The kill signal arrives later (or concurrently on a different thread) but cannot retroactively undo the already-submitted push job.
    We have actually observed this in production(with this PR's
    changes applied):
    when the driver receives two tasks with the same mapIndex and rejects one
Image

our debug logs confirm that the rejected task had already started pushing before its result arrived at the driver. This validates that the killedByOtherAttempt path always represents a real stale push.

Image

Therefore, only the killedByOtherAttempt path legitimately needs markStalePushedPartition. The info.finished path should skip it entirely.

* We record the stale mapIds here so the reduce side can check chunkBitmaps and fallback
* if stale data is present in a merged block.
*/
private[this] val staleMapIds = new java.util.HashSet[Int]()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

staleMapIds is only ever added to — addMapOutput/updateMapOutput and the output-invalidation paths never clear it; only unregisterShuffle drops it. So a partition stays marked, and its merged blocks keep falling back to unmerged, for the whole life of the shuffle, even after a clean recompute re-registers a valid output. Clear the relevant id when its mapIndex is (re-)registered.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've carefully evaluated whether staleMapIndexes should be cleared in addMapOutput, and believe the current design is correct as-is. Here's the reasoning:

Core principle: staleMapIndexes cleanup should be tied to merged block lifecycle, not map output lifecycle.

The stale set protects against reading merged blocks that contain stale chunks. A partition's stale mark can only be safely removed when we know all corresponding merged blocks have been refreshed or discarded — which means all merger locations must be updated (either new shuffleMergeId, or explicit merge result invalidation).

Why not clearing in addMapOutput:

  1. addMapOutput only updates map-side metadata (mapStatuses) — it does not invalidate any existing merged blocks on mergers. The physical merged files on the external shuffle service may still contain stale chunks from duplicate pushes.

  2. Within the same stageAttempt / shuffleMergeId, a late-arriving speculative addMapOutput call is exactly the event whose MapStatus is the stale one — clearing the stale mark here would defeat layers 2/3.

Stage retry (full reset) — confirmed handled:

When unregisterAllMapAndMergeOutput is called (stage retry / rollback / barrier stage abort), it triggers:

  • removeMergeResultsByFilter(x => true) — all merge results cleared
  • removeShuffleMergerLocations() — all merger locations cleared
  • newShuffleMergeState()shuffleMergeId += 1 — new merge ID
  • incrementEpoch() — workers fetch fresh ShuffleStatus on next getStatuses call

The worker-side staleMapIndexes for this shuffleId is either cleared by unregisterShuffle or replaced entirely when workers re-fetch with the new epoch and get a fresh serialized snapshot where the driver-side staleMapIndexes set is empty (new ShuffleStatus).

Partial merge cleanup (single FetchFailed) — intentional keep:

For non-barrier stages, unregisterMergeResult only removes a specific <reduceId, bmAddress> merge entry without incrementing shuffleMergeId. Other reduces' merged blocks on the same mergers may still contain stale chunks, so keeping the stale mark and falling back is the safe choice.

Future optimization (not blocking):

If we want to avoid unnecessary fallbacks in the partial-cleanup case, we could track stale at merger-location granularity (e.g., staleMapIndexesByMerger[mergerAddress][partitionId]) and clear entries when that specific merger's merge results are all invalidated. But this adds complexity for marginal gain — the fallback path is correct and only costs performance, not correctness.

In summary: the current "only-add, clear-on-shuffle-end" semantics are intentional and match the merged block lifecycle. We're happy to add location-level tracking as a follow-up optimization if desired.

Comment thread core/src/main/scala/org/apache/spark/storage/PushBasedFetchHelper.scala Outdated
Comment thread core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala Outdated
Comment thread core/src/main/scala/org/apache/spark/MapOutputTracker.scala Outdated
Comment thread core/src/main/scala/org/apache/spark/MapOutputTracker.scala Outdated
Comment thread core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala Outdated
gaoyajun02 and others added 4 commits June 18, 2026 10:40
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants