Fix shard ownership bug when decommissioning indexer#6509
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 75a150a9dc
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| let indexing_tasks = new_physical_plan | ||
| .indexer(indexer.node_id.as_str()) | ||
| .unwrap_or(&[]) | ||
| .to_vec(); |
There was a problem hiding this comment.
Keep dropped indexers retryable after a failed empty plan
When a retiring indexer is omitted from new_physical_plan, this synthesizes an empty plan only for this single apply attempt. If that RPC returns an error (the code below just logs it), last_applied_physical_plan still does not contain the retiring node, and control_running_plan later compares only select_available_indexers_for_scheduling() (ready nodes when any exist), so it will not retry clearing that node's stale pipelines. In that transient failure scenario the decommissioning indexer can continue the publish-token ping-pong this change is meant to stop; include these empty assignments in the desired state or otherwise make the dropped-node apply retryable.
Useful? React with 👍 / 👎.
| for indexer in self.indexer_pool.values() { | ||
| let indexing_tasks = new_physical_plan | ||
| .indexer(indexer.node_id.as_str()) | ||
| .unwrap_or(&[]) | ||
| .to_vec(); |
There was a problem hiding this comment.
Avoid letting stale empty plans clobber newer assignments
Because each apply is spawned asynchronously and there is no per-node ordering/version check, sending an empty plan to every pool member can race with the next rebuild. For example, an indexer that is present but not ready during plan P1 receives this synthesized empty plan; if it becomes ready and gets tasks from P2 before the delayed P1 RPC arrives, the late empty P1 apply shuts down the newly assigned pipelines. Previously P1 would not send anything to nodes outside the plan, so stale applies could not clear assignments for nodes that were unplanned at the time; restrict empty plans to nodes that actually need draining or attach ordering to applied plans.
Useful? React with 👍 / 👎.
| let notify_on_drop = notify_on_drop.clone(); | ||
| tokio::spawn({ | ||
| let indexer = indexers | ||
| .iter() | ||
| .find(|indexer| indexer.node_id == node_id.as_str()) | ||
| .expect("This should never happen as the plan was built from these indexers.") | ||
| .clone(); | ||
| let indexing_tasks = indexing_tasks.clone(); | ||
| async move { | ||
| if let Err(error) = indexer | ||
| .client | ||
| .clone() | ||
| .apply_indexing_plan(ApplyIndexingPlanRequest { indexing_tasks }) | ||
| .await | ||
| { | ||
| warn!( | ||
| %error, | ||
| node_id=%indexer.node_id, | ||
| generation_id=indexer.generation_id, | ||
| "failed to apply indexing plan to indexer" | ||
| ); | ||
| } | ||
| drop(notify_on_drop); | ||
| tokio::spawn(async move { |
There was a problem hiding this comment.
Don’t block rebuild waiters on unrelated indexers
This now clones the rebuild completion guard for every indexer in the pool, not just the nodes in the plan, so any slow or unreachable non-ready indexer can hold the waiter open until its apply_indexing_plan RPC finishes or times out. That waiter is used by rebuild_plan_debounced before replying to operations such as creating an index with sources, so a retiring/initializing indexer that only needs an empty plan can add a full RPC timeout to unrelated control-plane requests. Limit the guarded fanout to planned nodes plus nodes that are known to need a drain, or drop the guard before best-effort empty-plan sends.
Useful? React with 👍 / 👎.
Description
Indexing plan broadcast bug when an indexer retires causes crash loop of multiple indexers trying to acquire and index one shard, which slows down indexing and decommissioning dramatically and crashes indexing pipelines (but doesnt corrupt data or cause any data loss).
In detail
When an ingester signals that it’s entering the retiring state, it’s IngesterStatus is no longer Ready, excluding it from being included in the next indexing plan:
quickwit/quickwit/quickwit-control-plane/src/indexing_scheduler/mod.rs
Line 390 in f3880d5
The indexers used to build the plan are the ones then iterated over to send the new indexing plan to:
quickwit/quickwit/quickwit-control-plane/src/indexing_scheduler/mod.rs
Line 417 in f3880d5
The problem with this is that the decommissioning indexers are never notified of the new plan (since they’re not in the new plan). The new indexing plan reassigns the shards from this indexer to a different one, which calls AcquireShard with its own publish token, which updates the metastore. It then fetches and indexes the shard and publishes a split.
Since the decommissioning indexer doesn’t know that it’s supposed to stop indexing the shard, it continues indexing the shard as well, and tries to publish a split, but fails because the publish token is different. This triggers the killswitch and the pipeline restarts. This is where the bug is: since it never receives the new indexing plan, the IndexingService still has this shard in its assigned shards - this causes it to call AcquireShard itself, with a new publish token. So now it’s the owner of the shard. The new indexer tries to publish its next split, and now it fails with the same publish token problem, and it restarts and reacquires the shard for itself.
This ping pong happens until either the shard is completely indexed and reaches EOF, or until the decommissioning timeout is hit and the decommissioning indexer is force killed, at which point the new indexer takes over and continues indexing the shard as intended.
Luckily the publish token works completely as intended and prevents us from having contention or corrupted data. It just means decommissioning takes forever and indexing is much slower.
Fix: Send the new indexing plan to all nodes in the ingester pool, not just the nodes in the plan
Tests
Added unit test, added integration test.
There’s two additional future improvements that could aid here:
AcquireShards should be based on the most recent indexing plan
Currently, all of a retiring indexer’s shards are indexed on other indexers, and the retiring indexer sits idle. The indexing plan could be modified so that a decommissioning indexer is allowed to index its (and only its) local shards.