fix(optim): untie batched constant MatMul for OpenVINO GPU#817

Open
xieofxie wants to merge 7 commits into main from hualxie/fix_ov_gpu

xieofxie commented Jun 5, 2026

Contributor

Problem

winml perf -m cross-encoder/nli-deberta-v3-small --task zero-shot-classification --ep openvino --device gpu fails to compile:

[GPU] Failed to select implementation for
name:matmul:/deberta/encoder/layer.5/attention/self/MatMul_1
type: gemm
... compile_graph.cpp:59  (shape_type == dynamic_shape || node->selected_impl != nullptr)

Root cause

OpenVINO GPU's oneDNN gemm cannot select an implementation for a batched (rank ≥ 3) MatMul where an operand is a compile-time constant. Verified by isolation against the real OV-GPU EP:

| Case | Result |
| --- | --- |
| 3D dynamic @ 3D dynamic (content q·kᵀ) | ✅ compiles |
| 3D dynamic @ 3D constant (position terms) | ❌ fails |
| 3D @ 2D constant | ✅ compiles |
| operand converted to runtime input | ✅ compiles |

For a static-shaped node selected_impl must be non-null; impl-selection returns nothing for the batched-constant gemm, so the assert fires. DeBERTa hits this because its disentangled-attention position key/query depend only on weights and fold to 3D constants during export (12 such MatMuls — 2 per layer). Disabling torch constant-folding doesn't help: OV folds the all-constant subgraph itself.

Fix

A new EP-gated surgery transform, untie-constant-batched-matmul, routes each constant operand through Add(const, zero) where zero is a data-dependent runtime [1] tensor (Cast → Reshape(-1) → Slice[0:1] → Sub). This makes the operand runtime-valued so OV's constant folder can't repack it into a gemm weight, while:

  • keeping the single batched MatMul (no perf regression — a 2D per-head decomposition also works but explodes into 144 tiny matmuls),
  • leaving numerics unchanged (+0).
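
The "+0" claim can be checked numerically outside ONNX. The sketch below mimics the Cast → Reshape(-1) → Slice[0:1] → Sub chain with NumPy. The function name `runtime_zero`, the sample shapes, and the `float32` dtype are illustrative assumptions, not the transform's actual code. It also assumes the first graph input is non-empty and that its first element is finite, since `x - x` is NaN otherwise.

```python
import numpy as np


def runtime_zero(first_input: np.ndarray) -> np.ndarray:
    """Emulate Cast -> Reshape(-1) -> Slice[0:1] -> Sub on a runtime input."""
    as_float = first_input.astype(np.float32)  # Cast
    flat = as_float.reshape(-1)                # Reshape(-1)
    head = flat[0:1]                           # Slice[0:1], shape [1]
    return head - head                         # Sub: a zero that depends on runtime data


# Hypothetical shapes: a [heads, seq, dim] position constant and int64 token ids.
const = np.random.default_rng(0).standard_normal((12, 8, 64)).astype(np.float32)
input_ids = np.array([[101, 2023, 102]], dtype=np.int64)

zero = runtime_zero(input_ids)
untied = const + zero  # Add(const, zero): the [1] tensor broadcasts over the 3D constant
```

Because `zero` has shape `[1]`, it broadcasts against a constant of any rank, so the same construction can serve every untied operand.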

Wired via autoconf: BatchedConstMatMulValidator detects the pattern and, gated to Intel IHV + GPU, emits a GraphOptimization opportunity the existing autoconf loop auto-applies. Pattern-based and architecture-agnostic (no model-name hardcoding). The detector doesn't re-fire after surgery, so autoconf converges.
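
A minimal sketch of the detection and convergence logic follows, using a toy graph representation rather than real ONNX protos. `Node`, `Graph`, `batched_const_matmuls`, `should_untie`, and the `"intel"` / `"gpu"` labels are hypothetical stand-ins, not the actual `BatchedConstMatMulValidator` API. Like the validator described here, the sketch only looks at initializers, so constants produced by `Constant` nodes are not detected.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    op_type: str
    inputs: list
    outputs: list


@dataclass
class Graph:
    nodes: list
    # name -> shape of each constant tensor (stand-in for ONNX initializers)
    initializers: dict = field(default_factory=dict)


def batched_const_matmuls(graph):
    """Return indices of MatMul nodes that have a constant operand of rank >= 3."""
    hits = []
    for index, node in enumerate(graph.nodes):
        if node.op_type != "MatMul":
            continue
        ranks = [len(graph.initializers[name]) for name in node.inputs
                 if name in graph.initializers]
        if any(rank >= 3 for rank in ranks):
            hits.append(index)
    return hits


def should_untie(graph, ihv, device):
    """Gate: only Intel + GPU targets get the opportunity (labels are assumed)."""
    on_target = ihv.casefold() == "intel" and device.casefold() == "gpu"
    return on_target and bool(batched_const_matmuls(graph))


# Before surgery: the MatMul consumes a 3D constant directly.
before = Graph(
    nodes=[Node("MatMul", ["q", "pos_const"], ["scores"])],
    initializers={"pos_const": (12, 8, 64)},
)

# After surgery: the MatMul operand is the output of Add, not an initializer.
after = Graph(
    nodes=[
        Node("Add", ["pos_const", "zero"], ["pos_untied"]),
        Node("MatMul", ["q", "pos_untied"], ["scores"]),
    ],
    initializers={"pos_const": (12, 8, 64)},
)
```

In this sketch `batched_const_matmuls(before)` finds the MatMul, while `batched_const_matmuls(after)` finds nothing because the operand is now produced by `Add`. That is the property that lets the autoconf loop converge.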

Two incidental bugs fixed:

  • Model-validator device filter was case-sensitive ("gpu" did not match "GPU") → made case-insensitive.
  • The first construction of zero used ReduceMin (no axes), which crashed the static analyzer's reduction input-generator → replaced with ubiquitous, analyzer-safe ops (Cast, Reshape, Slice, Sub).
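
For the device-filter fix, a case-insensitive membership check could look like the sketch below. The function name `device_enabled` is hypothetical. The "None means all devices" convention is taken from the `"enabled_devices": None, # All devices` registry entry quoted in the review thread.

```python
from typing import Optional, Sequence


def device_enabled(device: str, enabled_devices: Optional[Sequence[str]]) -> bool:
    """Return True if `device` passes the filter; None means all devices."""
    if enabled_devices is None:
        return True
    # Compare case-insensitively so "gpu" from a build config matches "GPU".
    return device.casefold() in {name.casefold() for name in enabled_devices}
```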

Verification

  • Original failing command now compiles on OV-GPU and benchmarks (~15.5 ms avg, ~64 samples/sec).
  • GPU output matches CPU reference (argmax matches; diff 6e-4, within normal fp16/fp32 tolerance).
  • Detector gates correctly (openvino+GPU on; NPU / CPU / DML off).
  • New unit tests pass; full optim + analyze unit suites (1923 tests) pass — no regressions.

Note: artifacts are cached by build-config hash, so an existing stale cache needs --rebuild / --ignore-cache to pick up the fix.

🤖 Generated with Claude Code

OpenVINO GPU's oneDNN gemm cannot select an implementation for a batched
(rank >= 3) MatMul where an operand is a compile-time constant; the same
gemm with a dynamic operand, and 2D constant gemm, both compile fine.
Transformer disentangled-attention position terms (e.g. DeBERTa) fold to
3D constants and fail to compile with:

  [GPU] Failed to select implementation for ... type: gemm
  (compile_graph.cpp:59 selected_impl == nullptr)

Add an EP-gated `untie-constant-batched-matmul` surgery that routes the
constant operand through Add(const, zero), where zero is a data-dependent
runtime [1] tensor (Cast -> Reshape(-1) -> Slice[0:1] -> Sub). This makes
the operand runtime-valued so OV's constant folder cannot repack it into a
gemm weight, while keeping the single batched MatMul (no perf regression)
and leaving numerics unchanged (+0).

Wired via autoconf: BatchedConstMatMulValidator detects the pattern and,
gated to Intel IHV + GPU, emits a GraphOptimization opportunity the
existing autoconf loop auto-applies. Pattern-based, architecture-agnostic.

Also makes the model-validator device filter case-insensitive so builds
that pass lowercase "gpu" are matched.
xieofxie requested a review from a team as a code owner June 5, 2026 06:59
xieofxie commented Jun 5, 2026

Contributor Author

xieofxie commented Jun 8, 2026

Contributor Author

Opened issue #835 to allow users to opt out.

DingmaomaoBJTU left a comment

Collaborator

Overall the design is sound: the surgery correctly routes constant operands through a data-dependent +0 to defeat OpenVINO GPU's constant folder, and the validator correctly gates on Intel IHV + GPU. A few correctness issues and documentation gaps are worth addressing before merge.

Comment thread src/winml/modelkit/optim/pipes/surgery.py Outdated
Comment thread src/winml/modelkit/optim/pipes/surgery.py Outdated
Comment thread src/winml/modelkit/optim/pipes/surgery.py Outdated
Comment thread src/winml/modelkit/optim/pipes/surgery.py
Comment thread tests/unit/optim/pipes/test_pipe_surgery.py
hualxie added 2 commits June 10, 2026 10:31
- Use loop index for the untied operand name instead of node.name, which is
  optional in ONNX and can be blank/duplicated (would collide and yield an
  invalid graph).
- Update docstring to describe the actual Cast/Reshape/Slice/Sub construction
  (was stale ReduceMin wording) and document the non-empty-first-input
  assumption.
- Split the Slice starts/ends/axes initializers into distinct named tensors.
- Note the Constant-node detection gap in the validator (shared with surgery).
- Add a test for two unnamed batched-const MatMuls (name-collision regression).
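
The name-collision bug in the first bullet can be illustrated with a small sketch. The name formats (`_untied` suffix, `untied_const_` prefix) and both helper functions are hypothetical; only the idea of keying on the loop index instead of `node.name` comes from the commit message.

```python
def untied_names_by_node_name(node_names):
    """Old approach: derive the tensor name from node.name (may be blank)."""
    return [f"{name}_untied" for name in node_names]


def untied_names_by_index(node_names):
    """New approach: derive the tensor name from the loop index (always unique)."""
    return [f"untied_const_{index}" for index, _ in enumerate(node_names)]


# Two batched-constant MatMul nodes that were exported without names.
unnamed = ["", ""]
by_name = untied_names_by_node_name(unnamed)   # both entries are "_untied"
by_index = untied_names_by_index(unnamed)      # "untied_const_0", "untied_const_1"
```

With blank node names the first scheme yields the same tensor name twice, which would define one value name twice in the graph. The index-based scheme stays unique regardless of how nodes are named.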
"class": PatternMatchingValidator,
"enabled_devices": None, # All devices
},
"batched_const_matmul": {

Collaborator

Please use static rules

clamp_min: float = -1e3
clamp_max: float = 1e3
remove_isnan_in_attention_mask: bool = False
untie_constant_batched_matmul: bool = False

Collaborator

Change to rewrite pipe?
