action rewinding: recover lost CAS blobs #1349
Open
sluongng wants to merge 1 commit into
Remote CAS loss could leave deferred builds unable to recover generated inputs or final outputs without a daemon restart. This matters because the build graph still contains producer actions that can recreate those blobs, but Buck was not always turning missing-CAS errors into graph rewinds. The existing retry covered some upload and forced materialization failures, but missed default final materialization, local executor input materialization, and upload passes with several generated inputs missing at once.

Track lost inputs as a typed batch so the build command can dirty all producer BuildKey nodes and the consumer in one rewind.

Canonicalize rewind keys through the registered action lookup before dirtying DICE. This makes dynamic_output redirects invalidate the producer action that can recreate the missing blob.

When a rewound action is replayed, bypass both Buck action-cache lookups and the remote executor cache lookup. Remote execution can otherwise return the same cached ActionResult, leaving the missing CAS blob absent and causing the consumer to hit the rewind cap.

When local materialization discovers an expired CAS entry, convert the materializer not-found error into the same typed context. Also treat default final materialization not-found errors as rewindable, since materializations = deferred still materializes requested outputs unless the stricter skip-final mode is selected.

When final materialization and final upload run together, upload can report a missing CAS blob before materialization cleans the shared queue. Because those branches run under try_compute2, the upload error can drop the materialization side before it removes queue_tracker entries. Clear those entries on the upload-side rewind path as well. Also clear the per-transaction materialization queue after committing a rewind, so the retry does not skip outputs that were queued before the DICE transaction was invalidated.

The tests use Buck remote-execution test hooks and a hybrid execution platform instead of external RE configuration. They cover remote generated inputs, directory leaves, worker-side missing input reports, local-only consumers, default final materialization, final upload with materialization, and a missing-input count above the repeated-rewind cap.
This pull request has been imported. If you are a Meta employee, you can view this in D109988939. (Because this pull request was imported automatically, there will not be any future comments.)
Remote CAS loss could leave deferred builds unable to recover generated
inputs or final outputs without a daemon restart. This matters because
the build graph still contains producer actions that can recreate those
blobs, but Buck was not always turning missing-CAS errors into graph
rewinds.
This is the Buck2 equivalent of Bazel's action-rewinding path. Bazel
added `--rewind_lost_inputs` in bazelbuild/bazel#25477, distinguishing
single-build action rewinding from build-level retries via
`--experimental_remote_cache_eviction_retries`. That Bazel work also
called out concurrency-sensitive cases around arbitrary `--jobs` values
and async cache uploads through `--remote_cache_async`.
This change tracks lost inputs as a typed batch so the build command can
dirty all producer BuildKey nodes and the consumer in one rewind.
Canonicalize rewind keys through the registered action lookup before
dirtying DICE. This makes dynamic_output redirects invalidate the
producer action that can recreate the missing blob.
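
A minimal sketch of the batching and canonicalization idea, not the code in
this change. `ActionKey`, `LostInputs`, `canonicalize`, and `keys_to_dirty`
are hypothetical names, and the redirect map only stands in for whatever the
registered action lookup returns for a dynamic_output redirect.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for a build key; not Buck2's real `BuildKey`.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct ActionKey(pub String);

/// All inputs found missing in one pass, carried as a single typed value.
#[derive(Debug)]
pub struct LostInputs {
    pub consumer: ActionKey,
    pub producers: Vec<ActionKey>,
}

/// Follow redirects until reaching a key that has no further redirect.
/// The hop limit guards against an accidental redirect cycle.
pub fn canonicalize(key: &ActionKey, redirects: &HashMap<ActionKey, ActionKey>) -> ActionKey {
    let mut current = key.clone();
    for _ in 0..=redirects.len() {
        match redirects.get(&current) {
            Some(next) => current = next.clone(),
            None => break,
        }
    }
    current
}

/// Compute the deduplicated set of keys to dirty in one rewind:
/// every canonical producer plus the consumer.
pub fn keys_to_dirty(
    lost: &LostInputs,
    redirects: &HashMap<ActionKey, ActionKey>,
) -> BTreeSet<ActionKey> {
    let mut keys: BTreeSet<ActionKey> = lost
        .producers
        .iter()
        .map(|p| canonicalize(p, redirects))
        .collect();
    keys.insert(canonicalize(&lost.consumer, redirects));
    keys
}
```

In this sketch, collecting into a set means a redirected key and its target
collapse into one entry, so each producer is dirtied once per rewind.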
When a rewound action is replayed, bypass both Buck action-cache lookups
and the remote executor cache lookup. Remote execution can otherwise
return the same cached ActionResult, leaving the missing CAS blob absent
and causing the consumer to hit the rewind cap.
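
The failure mode can be shown with a toy model. This is only a sketch under
stated assumptions: `FakeRemote` and `build_with_rewinds` are hypothetical, and
the `skip_cache_lookup` parameter merely mirrors the idea of telling the
executor not to answer from its action cache.

```rust
use std::collections::{HashMap, HashSet};

/// Toy remote executor: an action cache that can outlive the CAS blobs
/// its cached results point to.
pub struct FakeRemote {
    pub action_cache: HashMap<String, String>, // action name -> output digest
    pub cas: HashSet<String>,                  // digests currently present
    pub executions: u32,
}

impl FakeRemote {
    /// Run an action. A cache hit returns the stored digest without
    /// re-uploading anything; a real execution recreates the blob.
    pub fn run(&mut self, action: &str, skip_cache_lookup: bool) -> String {
        if !skip_cache_lookup {
            if let Some(digest) = self.action_cache.get(action) {
                return digest.clone();
            }
        }
        self.executions += 1;
        let digest = format!("digest-of-{}", action);
        self.cas.insert(digest.clone());
        self.action_cache.insert(action.to_string(), digest.clone());
        digest
    }
}

/// Retry the producer until its output is in the CAS or the cap is hit.
/// Returns the number of rewinds that were needed.
pub fn build_with_rewinds(
    remote: &mut FakeRemote,
    bypass_on_rewind: bool,
    cap: u32,
) -> Result<u32, String> {
    let mut rewinds = 0;
    loop {
        // Only replays of a rewound action bypass the cache.
        let skip = bypass_on_rewind && rewinds > 0;
        let digest = remote.run("producer", skip);
        if remote.cas.contains(&digest) {
            return Ok(rewinds);
        }
        if rewinds == cap {
            return Err(format!("rewind cap {} reached, {} still missing", cap, digest));
        }
        rewinds += 1;
    }
}
```

With a stale cache entry and no bypass, every replay is a cache hit and the
model exhausts the cap without executing anything; with the bypass, one rewind
re-executes the producer and restores the blob.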
When local materialization discovers an expired CAS entry, convert the
materializer not-found error into the same typed context. Also treat
default final materialization not-found errors as rewindable, since
materializations = deferred still materializes requested outputs unless
the stricter skip-final mode is selected.
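
As a rough illustration of the error conversion, assuming hypothetical types
throughout (`MaterializeError`, `LostCasBlob`, `BuildError`, and
`MaterializationMode` are not the real Buck2 definitions):

```rust
/// Hypothetical materializer error.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MaterializeError {
    CasNotFound { digest: String },
    Io(String),
}

/// Hypothetical typed context, shared with the upload path so that both
/// failure sources can trigger the same rewind.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LostCasBlob {
    pub digest: String,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum BuildError {
    Rewindable(LostCasBlob),
    Fatal(String),
}

impl From<MaterializeError> for BuildError {
    fn from(err: MaterializeError) -> Self {
        match err {
            // An expired CAS entry can be recreated by rerunning the producer.
            MaterializeError::CasNotFound { digest } => {
                BuildError::Rewindable(LostCasBlob { digest })
            }
            // Anything else is left as an ordinary failure.
            MaterializeError::Io(msg) => BuildError::Fatal(msg),
        }
    }
}

/// Hypothetical model of the two deferred modes described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MaterializationMode {
    Deferred,
    DeferredSkipFinal,
}

/// Final outputs are still materialized unless the skip-final mode is set,
/// so that is where a not-found error can surface at the end of a build.
pub fn materializes_final_outputs(mode: MaterializationMode) -> bool {
    matches!(mode, MaterializationMode::Deferred)
}
```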
When final materialization and final upload run together, upload can
report a missing CAS blob before materialization cleans the shared
queue. Because those branches run under try_compute2, the upload error
can drop the materialization side before it removes queue_tracker
entries. Clear those entries on the upload-side rewind path as well.
Also clear the per-transaction materialization queue after committing a
rewind, so the retry does not skip outputs that were queued before the
DICE transaction was invalidated.
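
Why a stale queue entry makes the retry skip an output can be sketched with a
small tracker. `QueueTracker` here is a hypothetical model with made-up method
names; it only assumes that an already-queued path is not queued again.

```rust
use std::collections::HashSet;

/// Hypothetical model of a shared "already queued" set.
#[derive(Default)]
pub struct QueueTracker {
    queued: HashSet<String>,
}

impl QueueTracker {
    /// Returns true if the path was newly queued, false if it was
    /// already present and is therefore skipped.
    pub fn try_queue(&mut self, path: &str) -> bool {
        self.queued.insert(path.to_string())
    }

    /// Normal cleanup once materialization of a path completes.
    pub fn finish(&mut self, path: &str) {
        self.queued.remove(path);
    }

    /// Cleanup for the rewind path: drop every entry so the retry
    /// starts from an empty queue.
    pub fn clear_for_rewind(&mut self) {
        self.queued.clear();
    }

    pub fn len(&self) -> usize {
        self.queued.len()
    }
}
```

In this model, if the materialization side is dropped before `finish` runs,
the entry stays behind and the retry's `try_queue` returns false; calling
`clear_for_rewind` first lets the retry queue the output again.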
The first version of this fix was developed and exercised against
BuildBuddy remote execution. For upstream, the tests were rewritten to
use Meta's remote-execution test hooks and a hybrid execution platform
instead of depending on external RE configuration. They cover remote
generated inputs, directory leaves, worker-side missing input reports,
local-only consumers, default final materialization, final upload with
materialization, and a missing-input count above the repeated-rewind
cap.
References:
- `--rewind_lost_inputs`: bazelbuild/bazel#25477
- `RewindingTest` with remote execution: bazelbuild/bazel#25412