fix(hnsw): force single-thread build on the CDC/sync write path #24993

Merged
mergify[bot] merged 2 commits into matrixorigin:main from cpegeric:hnsw_flaky_ut
Jun 15, 2026

@cpegeric

Contributor

USearch #735 / MatrixOne #24849 (open): concurrent USearch add() can orphan HNSW graph nodes — the vector is stored (contains()==true) but never linked into the graph, so search() can never reach it, producing flaky recall@1. The create-index path (build.go) already forces a single build thread, but the CDC/sync path (HnswSync) still drove insertAllInParallel with ThreadsBuild (= NumCPU by default), issuing concurrent AddUnsafe() to the same index.

  • types.go: add GetConcurrencyForSingleThreadBuild (returns 1) with a one-line revert note for when usearch fixes the race. Scoped to HNSW — ivfflat (ivf_create.go) keeps GetConcurrencyForBuild and stays parallel.
  • sync.go: NewHnswSync routes ThreadsBuild through the new helper (both the sysvar and default branches), with an inline comment at each call site. The hnsw_threads_build / hnsw_max_index_capacity reads are untouched.
  • model.go: revert NewHnswModelForBuild to honor its nthread param again (commit 57a991e had hardcoded nthread:=1). With both callers now passing 1 (build.go forces 1; sync uses the helper) the hardcode is redundant, and removing it avoids a hidden override that would silently keep the model single-threaded even after the helper/build.go are reverted.

Net: AddUnsafe() is single-threaded on all build/write paths (create-index, sequentialUpdate, insertAllInParallel with nthread=1 -> one goroutine; each also ChangeThreadsAdd(1)). Verified: insertAllInParallel nthread=1 at runtime; TestSyncAddOneModel passes single-threaded with no hang.

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24977

What this PR does / why we need it:

Fix the bug by keeping ThreadsBuild at 1 on all build paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@XuPeng-SH XuPeng-SH left a comment

Contributor

Requesting changes for one missing regression guard around this workaround.

The code change itself looks directionally right, but this PR moves the single-thread safety guarantee from NewHnswModelForBuild(...) back out to the HNSW sync call sites. Given the underlying bug is silent recall corruption, I think we need a deterministic test that locks this invariant down.

Today sync_test.go sets hnsw_threads_build = 8, but it only checks that RunOnce succeeds; it does not assert that NewHnswSync actually clamps ThreadsBuild to 1, nor that models created on this path end up with NThread == 1.

Please add a focused regression test for the sync/CDC path before approval.
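One possible shape for such a guard, sketched with stand-in types: `hnswSyncSketch`, `hnswModelSketch`, `singleThreadBuild`, and `checkSingleThreadInvariant` are hypothetical names for illustration, not the real `HnswSync` or model types, and the real constructors take more arguments. In the repository this would presumably live in a `testing.T` test in sync_test.go; it is written as a plain program here so it is self-contained.

```go
package main

import "fmt"

// hnswSyncSketch and hnswModelSketch are stand-ins that keep only the
// fields the invariant is about.
type hnswSyncSketch struct{ ThreadsBuild int64 }
type hnswModelSketch struct{ NThread int64 }

// singleThreadBuild plays the role of the clamp helper: always 1.
func singleThreadBuild(requested int64) int64 { return 1 }

// newHnswSyncSketch routes the configured thread count through the clamp,
// whatever the system variable asked for.
func newHnswSyncSketch(sysvarThreads int64) *hnswSyncSketch {
	return &hnswSyncSketch{ThreadsBuild: singleThreadBuild(sysvarThreads)}
}

// newModel honors the thread count it is given, so the clamp at the call
// site is the only thing keeping the model single-threaded.
func (s *hnswSyncSketch) newModel() *hnswModelSketch {
	return &hnswModelSketch{NThread: s.ThreadsBuild}
}

// checkSingleThreadInvariant asserts both halves of the guarantee:
// the sync object is clamped, and models created from it inherit 1.
func checkSingleThreadInvariant(sysvarThreads int64) error {
	s := newHnswSyncSketch(sysvarThreads)
	if s.ThreadsBuild != 1 {
		return fmt.Errorf("ThreadsBuild = %d with sysvar %d, want 1", s.ThreadsBuild, sysvarThreads)
	}
	if m := s.newModel(); m.NThread != 1 {
		return fmt.Errorf("model NThread = %d with sysvar %d, want 1", m.NThread, sysvarThreads)
	}
	return nil
}

func main() {
	// 0 stands for "sysvar unset"; 8 matches the value the existing test sets.
	for _, requested := range []int64{0, 1, 8} {
		if err := checkSingleThreadInvariant(requested); err != nil {
			fmt.Println("FAIL:", err)
			return
		}
	}
	fmt.Println("ok")
}
```

Because the check inspects the configured values rather than observing a race, it is deterministic and would fail immediately if either call site stopped using the clamp.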

@mergify

mergify Bot commented Jun 15, 2026

Contributor

Merge Queue Status

  • Entered queue 2026-06-15 13:34 UTC · Rule: main
  • Checks skipped · PR is already up-to-date
  • Merged 2026-06-15 13:34 UTC · at 462cee232c213c3055b29b49335c4392afc97a8c · squash

This pull request spent 28 seconds in the queue, including 3 seconds running CI.

Required conditions to merge
  • #approved-reviews-by >= 1 [🛡 GitHub branch protection]
  • #review-threads-unresolved = 0 [🛡 GitHub branch protection]
  • github-review-decision = APPROVED [🛡 GitHub branch protection]
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-neutral = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-skipped = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-neutral = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-skipped = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-neutral = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-skipped = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI / SCA Test on Ubuntu/x86
    • check-neutral = Matrixone CI / SCA Test on Ubuntu/x86
    • check-skipped = Matrixone CI / SCA Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI / UT Test on Ubuntu/x86
    • check-neutral = Matrixone CI / UT Test on Ubuntu/x86
    • check-skipped = Matrixone CI / UT Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-neutral = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-skipped = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-neutral = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-skipped = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-neutral = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-skipped = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Utils CI / Coverage
    • check-neutral = Matrixone Utils CI / Coverage
    • check-skipped = Matrixone Utils CI / Coverage

Labels

kind/bug: Something isn't working
size/S: Denotes a PR that changes [10,99] lines

3 participants