fix(hnsw): force single-thread build on the CDC/sync write path#24993
Conversation
USearch matrixorigin#735 / MatrixOne matrixorigin#24849 (open): concurrent USearch add() can orphan HNSW graph nodes — the vector is stored (contains()==true) but never linked into the graph, so search() can never reach it, producing flaky recall@1. The create-index path (build.go) already forces a single build thread, but the CDC/sync path (HnswSync) still drove insertAllInParallel with ThreadsBuild (= NumCPU by default), issuing concurrent AddUnsafe() to the same index. - types.go: add GetConcurrencyForSingleThreadBuild (returns 1) with a one-line revert note for when usearch fixes the race. Scoped to HNSW — ivfflat (ivf_create.go) keeps GetConcurrencyForBuild and stays parallel. - sync.go: NewHnswSync routes ThreadsBuild through the new helper (both the sysvar and default branches), with an inline comment at each call site. The hnsw_threads_build / hnsw_max_index_capacity reads are untouched. - model.go: revert NewHnswModelForBuild to honor its nthread param again (commit 57a991e had hardcoded nthread:=1). With both callers now passing 1 (build.go forces 1; sync uses the helper) the hardcode is redundant, and removing it avoids a hidden override that would silently keep the model single-threaded even after the helper/build.go are reverted. Net: AddUnsafe() is single-threaded on all build/write paths (create-index, sequentialUpdate, insertAllInParallel with nthread=1 -> one goroutine; each also ChangeThreadsAdd(1)). Verified: insertAllInParallel nthread=1 at runtime; TestSyncAddOneModel passes single-threaded with no hang. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
XuPeng-SH
left a comment
There was a problem hiding this comment.
Requesting changes for one missing regression guard around this workaround.
The code change itself looks directionally right, but this PR moves the single-thread safety guarantee from NewHnswModelForBuild(...) back out to the HNSW sync call sites. Given the underlying bug is silent recall corruption, I think we need a deterministic test that locks this invariant down.
Today sync_test.go sets hnsw_threads_build = 8, but it only checks that RunOnce succeeds; it does not assert that NewHnswSync actually clamps ThreadsBuild to 1, nor that models created on this path end up with NThread == 1.
Please add a focused regression test for the sync/CDC path before approval.
Merge Queue Status
This pull request spent 28 seconds in the queue, including 3 seconds running CI. Required conditions to merge
|
USearch #735 / MatrixOne #24849 (open): concurrent USearch add() can orphan HNSW graph nodes — the vector is stored (contains()==true) but never linked into the graph, so search() can never reach it, producing flaky recall@1. The create-index path (build.go) already forces a single build thread, but the CDC/sync path (HnswSync) still drove insertAllInParallel with ThreadsBuild (= NumCPU by default), issuing concurrent AddUnsafe() to the same index.
Net: AddUnsafe() is single-threaded on all build/write paths (create-index, sequentialUpdate, insertAllInParallel with nthread=1 -> one goroutine; each also ChangeThreadsAdd(1)). Verified: insertAllInParallel nthread=1 at runtime; TestSyncAddOneModel passes single-threaded with no hang.
What type of PR is this?
Which issue(s) this PR fixes:
issue #24977
What this PR does / why we need it:
fix the bug to keep ThreadsBuild to 1 in all build path.