Fix corruption due to lock sharding issues by centralizing locking #5838

Merged
martin-frbg merged 2 commits into OpenMathLib:develop from ngoldbaum:fix-level3-thread-locks-2
Jun 15, 2026
@ngoldbaum ngoldbaum commented Jun 14, 2026

Contributor

Fixes #5836.

I used both Claude and Codex to work on this. I typed this PR description by hand.

Summary

Fixes the specific lock sharding issue described in #5836, as well as related issues caused by the current locking strategy, which allows overlapping calls into different kernels to run concurrently. The locks are function-local, so each compilation unit gets its own, separate mutex object.

I fixed that by moving the locking into its own compilation unit, with new internal helper functions for serializing calls. This also centralizes the platform-specific locking logic in the new file.
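To make the idea concrete, here is a minimal sketch of a centralized lock, assuming a pthreads build. The names `level3_lock_acquire`, `level3_lock_release`, `kernel_a`, `kernel_b`, and `run_mixed_kernels` are hypothetical and are not the helpers added by this PR. The point is only that both stand-in kernels go through one mutex defined in a single place, instead of each holding its own function-local lock.

```c
/* Hypothetical sketch, not the code from this PR: one lock shared by
 * every threaded kernel instead of one function-local lock per kernel. */
#include <assert.h>
#include <pthread.h>

/* Single process-wide mutex, defined in exactly one translation unit. */
static pthread_mutex_t level3_lock = PTHREAD_MUTEX_INITIALIZER;

void level3_lock_acquire(void) {
    int rc = pthread_mutex_lock(&level3_lock);
    assert(rc == 0);
    (void)rc;
}

void level3_lock_release(void) {
    int rc = pthread_mutex_unlock(&level3_lock);
    assert(rc == 0);
    (void)rc;
}

/* Stand-in for state that every kernel shares. */
static volatile long shared_scratch;

/* Two different "kernels" that both stage data in the shared state.
 * Each returns 1 if it read back what it wrote. */
static int kernel_a(long x) {
    level3_lock_acquire();
    shared_scratch = x;
    int ok = (shared_scratch == x);
    level3_lock_release();
    return ok;
}

static int kernel_b(long x) {
    level3_lock_acquire();
    shared_scratch = -x;
    int ok = (shared_scratch == -x);
    level3_lock_release();
    return ok;
}

typedef int (*kernel_fn)(long);

struct worker_args {
    kernel_fn kernel;
    long seed;
    int iterations;
    int failures;
};

static void *worker(void *p) {
    struct worker_args *a = p;
    for (int i = 0; i < a->iterations; i++)
        if (!a->kernel(a->seed + i))
            a->failures++;
    return NULL;
}

/* Runs both kernels from four threads at once; returns the number of
 * read-back failures observed across all threads. */
int run_mixed_kernels(int iterations) {
    enum { NTHREADS = 4 };
    pthread_t threads[NTHREADS];
    struct worker_args args[NTHREADS];
    int failures = 0;

    for (int t = 0; t < NTHREADS; t++) {
        args[t].kernel = (t % 2 == 0) ? kernel_a : kernel_b;
        args[t].seed = 1000000L * (t + 1);
        args[t].iterations = iterations;
        args[t].failures = 0;
        int rc = pthread_create(&threads[t], NULL, worker, &args[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(threads[t], NULL);
        failures += args[t].failures;
    }
    return failures;
}
```

If `kernel_a` and `kernel_b` each used their own private mutex instead, the two writes to `shared_scratch` could interleave, which is the kind of sharding this PR removes.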

This has the net effect of reducing throughput for workloads that make overlapping external calls into different threaded kernels, because those calls are now serialized consistently instead of using per-kernel lock state. I think the existing behavior is a bug rather than an intentional design choice but I wanted to raise that front and center.

It also fixes several other issues, more or less as a consequence of making the above change systematically:

  • There is a behavior change in the OpenMP gemm3m implementation. Currently, locking is skipped for OpenMP gemm3m. I think this is a bug but maybe it's intentional? It would be a behavior change regardless.
  • The old win32 path appears to have used a separate critical section per invocation, so it did not provide cross-call serialization. The new file initializes a process-wide lock once and uses that.
  • The old implementation caused substantial OpenMP oversubscription by duplicating parallel_section_left. Now there's only one version of this variable so there are far fewer OpenMP threads when one mixes kernels.
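As an illustration of the win32 item above, the following sketch shows one way to initialize a process-wide lock exactly once and reuse it on every call. It is an assumption about the general technique, not the code in this PR: the names `process_lock_acquire`, `process_lock_release`, and `process_lock_init_count` are hypothetical, and the choice of `InitOnceExecuteOnce` on Windows (with `pthread_once` as the non-Windows analogue) is mine.

```c
/* Hypothetical sketch: a process-wide lock that is initialized once,
 * in contrast to creating a fresh critical section on every call. */
#include <assert.h>

static int init_count; /* how many times the lock was initialized */

#ifdef _WIN32
#include <windows.h>

static INIT_ONCE lock_once = INIT_ONCE_STATIC_INIT;
static CRITICAL_SECTION process_lock;

static BOOL CALLBACK init_process_lock(PINIT_ONCE once, PVOID param,
                                       PVOID *context) {
    (void)once; (void)param; (void)context;
    InitializeCriticalSection(&process_lock);
    init_count++;
    return TRUE;
}

void process_lock_acquire(void) {
    BOOL ok = InitOnceExecuteOnce(&lock_once, init_process_lock, NULL, NULL);
    assert(ok);
    (void)ok;
    EnterCriticalSection(&process_lock);
}

void process_lock_release(void) {
    LeaveCriticalSection(&process_lock);
}

#else
#include <pthread.h>

static pthread_once_t lock_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t process_lock;

static void init_process_lock(void) {
    int rc = pthread_mutex_init(&process_lock, NULL);
    assert(rc == 0);
    (void)rc;
    init_count++;
}

void process_lock_acquire(void) {
    int rc = pthread_once(&lock_once, init_process_lock);
    assert(rc == 0);
    rc = pthread_mutex_lock(&process_lock);
    assert(rc == 0);
    (void)rc;
}

void process_lock_release(void) {
    int rc = pthread_mutex_unlock(&process_lock);
    assert(rc == 0);
    (void)rc;
}
#endif

/* Exposed only so a caller can check that initialization ran once. */
int process_lock_init_count(void) {
    return init_count;
}
```

A lock created inside each call can never be contended by a second caller, so it serializes nothing. Initializing once and sharing the same object is what lets separate calls exclude each other.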

Testing

To verify the correctness of the fix, I added a new multithreaded stress test based on the reproducer in #5836. I also enabled multithreaded stress testing for msys2 on a Windows host to test the Windows threading model.

Additionally there is a ThreadSanitizer (TSan) test run. I manually verified that the new tests trigger validation errors and/or TSan race reports. I also verified the new mixed DGEMM stress test (no TSan) fails with incorrect results on an unpatched develop build and passes with this change.
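For readers who want a feel for this kind of test, here is a rough sketch of a mixed-size GEMM stress harness. It is not the test added in this PR. The names `gemm_fn`, `reference_gemm`, and `run_gemm_stress` are hypothetical, and the harness takes the routine under test as a function pointer so that the sketch stays self-contained.

```c
/* Hypothetical sketch of a multithreaded GEMM stress harness: each thread
 * multiplies matrices of a different size and checks every result against
 * a naive reference computed once up front. */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Column-major C = A * B, with A m-by-k, B k-by-n, C m-by-n. */
typedef void (*gemm_fn)(int m, int n, int k,
                        const double *a, const double *b, double *c);

static void reference_gemm(int m, int n, int k,
                           const double *a, const double *b, double *c) {
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            for (int p = 0; p < k; p++)
                sum += a[i + p * m] * b[p + j * k];
            c[i + j * m] = sum;
        }
}

struct stress_args {
    gemm_fn gemm;
    int size;
    int iterations;
    int mismatches; /* -1 signals an allocation failure */
};

static void *stress_worker(void *p) {
    struct stress_args *s = p;
    int n = s->size;
    size_t count = (size_t)n * (size_t)n;
    double *a = malloc(count * sizeof *a);
    double *b = malloc(count * sizeof *b);
    double *got = malloc(count * sizeof *got);
    double *want = malloc(count * sizeof *want);

    if (!a || !b || !got || !want) {
        s->mismatches = -1;
    } else {
        /* Small integer entries keep the expected products exact. */
        for (size_t i = 0; i < count; i++) {
            a[i] = (double)(i % 7) - 3.0;
            b[i] = (double)(i % 5) - 2.0;
        }
        reference_gemm(n, n, n, a, b, want);
        for (int it = 0; it < s->iterations; it++) {
            s->gemm(n, n, n, a, b, got);
            for (size_t i = 0; i < count; i++) {
                double diff = got[i] - want[i];
                if (diff < 0.0) diff = -diff;
                if (diff > 1e-9) { s->mismatches++; break; }
            }
        }
    }
    free(a); free(b); free(got); free(want);
    return NULL;
}

/* Returns the total number of iterations that produced a wrong result,
 * or -1 if any thread failed to allocate its matrices. */
int run_gemm_stress(gemm_fn gemm, int nthreads, int iterations) {
    enum { MAX_THREADS = 16 };
    pthread_t threads[MAX_THREADS];
    struct stress_args args[MAX_THREADS];
    int total = 0;

    if (nthreads < 1) nthreads = 1;
    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;

    for (int t = 0; t < nthreads; t++) {
        args[t].gemm = gemm;
        args[t].size = 8 * (t + 1); /* a different size per thread */
        args[t].iterations = iterations;
        args[t].mismatches = 0;
        int rc = pthread_create(&threads[t], NULL, stress_worker, &args[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < nthreads; t++) {
        pthread_join(threads[t], NULL);
        if (args[t].mismatches < 0) total = -1;
        else if (total >= 0) total += args[t].mismatches;
    }
    return total;
}
```

To exercise OpenBLAS rather than the reference routine, one would pass a small wrapper that forwards to `cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, 1.0, a, m, b, k, 0.0, c, m)` and use sizes large enough to reach the threaded code path. The sizes used above are illustrative only.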

Right now there's only TSan testing for OPENBLAS_NUM_THREADS=2 and the pthreads backend. TSan detected data races with OPENBLAS_NUM_THREADS=4 and also with the OpenMP backend. I am intentionally leaving the TSan CI as-is in this PR. We'll need to look at other issues before setting up more thorough TSan CI.

@martin-frbg

Collaborator

Thank you - beat me to it. (The single CI failure in Jenkins is an internal docker error related to the use of sudo for preparing a cmake-based build on zarch - I've restarted that job now)

@martin-frbg martin-frbg added this to the 0.3.34 milestone Jun 15, 2026
@martin-frbg martin-frbg merged commit 9bdf051 into OpenMathLib:develop Jun 15, 2026
103 of 104 checks passed

Development

Successfully merging this pull request may close these issues.

gemm_driver templating breaks GEMM locking when different GEMMs happen concurrently

2 participants