Skip to content

Fix shard ownership bug when decommissioning indexer#6509

Open
nadav-govari wants to merge 1 commit into
mainfrom
nadav/decommissioning-bug
Open

Fix shard ownership bug when decommissioning indexer#6509
nadav-govari wants to merge 1 commit into
mainfrom
nadav/decommissioning-bug

Conversation

@nadav-govari

Copy link
Copy Markdown
Collaborator

Description

Indexing plan broadcast bug when an indexer retires causes crash loop of multiple indexers trying to acquire and index one shard, which slows down indexing and decommissioning dramatically and crashes indexing pipelines (but doesnt corrupt data or cause any data loss).

In detail

When an ingester signals that it’s entering the retiring state, it’s IngesterStatus is no longer Ready, excluding it from being included in the next indexing plan:

fn select_available_indexers_for_scheduling(&self) -> Vec<IndexerNodeInfo> {

The indexers used to build the plan are the ones then iterated over to send the new indexing plan to:

for (node_id, indexing_tasks) in new_physical_plan.indexing_tasks_per_indexer() {

The problem with this is that the decommissioning indexers are never notified of the new plan (since they’re not in the new plan). The new indexing plan reassigns the shards from this indexer to a different one, which calls AcquireShard with its own publish token, which updates the metastore. It then fetches and indexes the shard and publishes a split.

Since the decommissioning indexer doesn’t know that it’s supposed to stop indexing the shard, it continues indexing the shard as well, and tries to publish a split, but fails because the publish token is different. This triggers the killswitch and the pipeline restarts. This is where the bug is: since it never receives the new indexing plan, the IndexingService still has this shard in its assigned shards - this causes it to call AcquireShard itself, with a new publish token. So now it’s the owner of the shard. The new indexer tries to publish its next split, and now it fails with the same publish token problem, and it restarts and reacquires the shard for itself.

This ping pong happens until either the shard is completely indexed and reaches EOF, or until the decommissioning timeout is hit and the decommissioning indexer is force killed, at which point the new indexer takes over and continues indexing the shard as intended.

Luckily the publish token works completely as intended and prevents us from having contention or corrupted data. It just means decommissioning takes forever and indexing is much slower.

Fix: Send the new indexing plan to all nodes in the ingester pool, not just the nodes in the plan

Tests

Added unit test, added integration test.

There’s two additional future improvements that could aid here:

AcquireShards should be based on the most recent indexing plan

Currently, all of a retiring indexer’s shards are indexed on other indexers, and the retiring indexer sits idle. The indexing plan could be modified so that a decommissioning indexer is allowed to index its (and only its) local shards.

@nadav-govari nadav-govari requested a review from a team as a code owner June 12, 2026 21:27

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 75a150a9dc

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +420 to +423
let indexing_tasks = new_physical_plan
.indexer(indexer.node_id.as_str())
.unwrap_or(&[])
.to_vec();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep dropped indexers retryable after a failed empty plan

When a retiring indexer is omitted from new_physical_plan, this synthesizes an empty plan only for this single apply attempt. If that RPC returns an error (the code below just logs it), last_applied_physical_plan still does not contain the retiring node, and control_running_plan later compares only select_available_indexers_for_scheduling() (ready nodes when any exist), so it will not retry clearing that node's stale pipelines. In that transient failure scenario the decommissioning indexer can continue the publish-token ping-pong this change is meant to stop; include these empty assignments in the desired state or otherwise make the dropped-node apply retryable.

Useful? React with 👍 / 👎.

Comment on lines +419 to +423
for indexer in self.indexer_pool.values() {
let indexing_tasks = new_physical_plan
.indexer(indexer.node_id.as_str())
.unwrap_or(&[])
.to_vec();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid letting stale empty plans clobber newer assignments

Because each apply is spawned asynchronously and there is no per-node ordering/version check, sending an empty plan to every pool member can race with the next rebuild. For example, an indexer that is present but not ready during plan P1 receives this synthesized empty plan; if it becomes ready and gets tasks from P2 before the delayed P1 RPC arrives, the late empty P1 apply shuts down the newly assigned pipelines. Previously P1 would not send anything to nodes outside the plan, so stale applies could not clear assignments for nodes that were unplanned at the time; restrict empty plans to nodes that actually need draining or attach ordering to applied plans.

Useful? React with 👍 / 👎.

Comment on lines 427 to +428
let notify_on_drop = notify_on_drop.clone();
tokio::spawn({
let indexer = indexers
.iter()
.find(|indexer| indexer.node_id == node_id.as_str())
.expect("This should never happen as the plan was built from these indexers.")
.clone();
let indexing_tasks = indexing_tasks.clone();
async move {
if let Err(error) = indexer
.client
.clone()
.apply_indexing_plan(ApplyIndexingPlanRequest { indexing_tasks })
.await
{
warn!(
%error,
node_id=%indexer.node_id,
generation_id=indexer.generation_id,
"failed to apply indexing plan to indexer"
);
}
drop(notify_on_drop);
tokio::spawn(async move {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don’t block rebuild waiters on unrelated indexers

This now clones the rebuild completion guard for every indexer in the pool, not just the nodes in the plan, so any slow or unreachable non-ready indexer can hold the waiter open until its apply_indexing_plan RPC finishes or times out. That waiter is used by rebuild_plan_debounced before replying to operations such as creating an index with sources, so a retiring/initializing indexer that only needs an empty plan can add a full RPC timeout to unrelated control-plane requests. Limit the guarded fanout to planned nodes plus nodes that are known to need a drain, or drop the guard before best-effort empty-plan sends.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant