fix(optim): untie batched constant MatMul for OpenVINO GPU by xieofxie · Pull Request #817 · microsoft/winml-cli

xieofxie · 2026-06-05T06:59:11Z

Problem

`winml perf -m cross-encoder/nli-deberta-v3-small --task zero-shot-classification --ep openvino --device gpu` fails to compile:

[GPU] Failed to select implementation for
name:matmul:/deberta/encoder/layer.5/attention/self/MatMul_1
type: gemm
... compile_graph.cpp:59  (shape_type == dynamic_shape || node->selected_impl != nullptr)

Root cause

OpenVINO GPU's oneDNN gemm cannot select an implementation for a batched (rank ≥ 3) MatMul where an operand is a compile-time constant. Verified by isolation against the real OV-GPU EP:

| Case | Result |
| --- | --- |
| 3D dynamic @ 3D dynamic (content q·kᵀ) | ✅ compiles |
| 3D dynamic @ 3D constant (position terms) | ❌ fails |
| 3D @ 2D constant | ✅ compiles |
| operand converted to runtime input | ✅ compiles |

For a static-shaped node selected_impl must be non-null; impl-selection returns nothing for the batched-constant gemm, so the assert fires. DeBERTa hits this because its disentangled-attention position key/query depend only on weights and fold to 3D constants during export (12 such MatMuls — 2 per layer). Disabling torch constant-folding doesn't help: OV folds the all-constant subgraph itself.

Fix

A new EP-gated surgery transform, untie-constant-batched-matmul, routes each constant operand through Add(const, zero) where zero is a data-dependent runtime [1] tensor (Cast → Reshape(-1) → Slice[0:1] → Sub). This makes the operand runtime-valued so OV's constant folder can't repack it into a gemm weight, while:

- keeping the single batched MatMul (no perf regression; a 2D per-head decomposition also works but explodes into 144 tiny matmuls),
- leaving numerics unchanged (+0).
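To make the `+0` claim concrete, here is a minimal pure-Python sketch of what the Cast → Reshape(-1) → Slice[0:1] → Sub chain computes. It assumes Sub subtracts the sliced element from itself, which the description does not spell out. The function names and sample values are hypothetical, and nested lists stand in for tensors.

```python
def runtime_zero(first_input):
    """Mimic Cast -> Reshape(-1) -> Slice[0:1] -> Sub on a nested list."""

    def flatten(value):
        # Reshape(-1): walk the nested list in row-major order.
        if isinstance(value, (list, tuple)):
            for item in value:
                yield from flatten(item)
        else:
            yield float(value)  # Cast: e.g. integer token ids -> float

    flat = list(flatten(first_input))
    head = flat[0:1]              # Slice[0:1]: one element, or none if the input is empty
    return [h - h for h in head]  # Sub(head, head): assumed operands, yields [0.0]


def add_zero(const, zero):
    """Add(const, zero), broadcasting the one-element zero over a flat constant."""
    return [c + zero[0] for c in const]


input_ids = [[101, 2023, 102], [101, 2003, 102]]  # made-up token ids
position_key = [0.25, -1.5, 3.0]                   # made-up constant values

zero = runtime_zero(input_ids)
print(zero)                                          # [0.0]
print(add_zero(position_key, zero) == position_key)  # True
```

The result depends on the input at graph level but is always zero in value, as long as the first input is non-empty and its first element is finite (`inf - inf` would give NaN).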

Wired via autoconf: BatchedConstMatMulValidator detects the pattern and, gated to Intel IHV + GPU, emits a GraphOptimization opportunity the existing autoconf loop auto-applies. Pattern-based and architecture-agnostic (no model-name hardcoding). The detector doesn't re-fire after surgery, so autoconf converges.
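The detect, rewrite, re-detect cycle can be sketched on a toy graph. This is not the PR's code: `Node`, `Graph`, `find_batched_const_matmuls`, `should_untie`, `untie`, `autoconf`, the `"intel"` string, the `runtime_zero` tensor name, and the shapes are all hypothetical stand-ins. The sketch only treats initializers as constants (not `Constant` nodes) and assumes the zero tensor is built elsewhere.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    op_type: str
    inputs: list
    outputs: list
    name: str = ""  # optional in ONNX, so it may be empty


@dataclass
class Graph:
    nodes: list
    initializers: dict = field(default_factory=dict)  # name -> shape tuple


def find_batched_const_matmuls(graph):
    """Return (node_index, operand_index) for MatMul operands that are rank >= 3 initializers."""
    hits = []
    for i, node in enumerate(graph.nodes):
        if node.op_type != "MatMul":
            continue
        for j, name in enumerate(node.inputs):
            shape = graph.initializers.get(name)
            if shape is not None and len(shape) >= 3:
                hits.append((i, j))
    return hits


def should_untie(ihv, device):
    """Gate: only Intel IHV + GPU (compared case-insensitively)."""
    return ihv.casefold() == "intel" and device.casefold() == "gpu"


def untie(graph, zero_name="runtime_zero"):
    """Route each flagged operand through Add(const, zero); return the rewrite count."""
    hits = find_batched_const_matmuls(graph)
    operands_by_node = {}
    for i, j in hits:
        operands_by_node.setdefault(i, []).append(j)
    rewritten = []
    for i, node in enumerate(graph.nodes):
        for j in operands_by_node.get(i, []):
            untied = f"untied_{i}_{j}"  # index-based, independent of node.name
            rewritten.append(Node("Add", [node.inputs[j], zero_name], [untied]))
            node.inputs[j] = untied
        rewritten.append(node)
    graph.nodes = rewritten
    return len(hits)


def autoconf(graph, ihv, device, max_rounds=5):
    """Apply the surgery until the detector stops firing; return the rounds used."""
    rounds = 0
    while should_untie(ihv, device) and find_batched_const_matmuls(graph):
        if rounds == max_rounds:
            raise RuntimeError("autoconf did not converge")
        untie(graph)
        rounds += 1
    return rounds


graph = Graph(
    nodes=[Node("MatMul", ["q", "pos_key"], ["scores"])],
    initializers={"pos_key": (12, 64, 512)},  # made-up 3D constant
)
print(autoconf(graph, "Intel", "gpu"))    # 1
print([n.op_type for n in graph.nodes])   # ['Add', 'MatMul']
print(find_batched_const_matmuls(graph))  # []
```

After one round the MatMul reads the output of `Add`, which is not an initializer, so the detector returns nothing and the loop ends.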

Two incidental bugs fixed:

- Model-validator device filter was case-sensitive ("gpu" ≠ "GPU") → made case-insensitive.
- First construction used ReduceMin (no axes), which crashed the static analyzer's reduction input-generator → replaced with ubiquitous analyzer-safe ops.
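A case-insensitive device filter of the kind described in the first item might look like the sketch below. `device_enabled` is a hypothetical helper, not the validator's actual function. The `None` means "all devices" convention follows the `"enabled_devices": None` entry visible in the reviewed diff.

```python
from typing import Iterable, Optional


def device_enabled(device: str, enabled_devices: Optional[Iterable[str]]) -> bool:
    """Return True if `device` is allowed, ignoring letter case."""
    if enabled_devices is None:  # None is taken to mean "all devices"
        return True
    return device.casefold() in {d.casefold() for d in enabled_devices}


print(device_enabled("gpu", ["GPU"]))  # True: lowercase CLI value matches
print(device_enabled("npu", ["GPU"]))  # False
print(device_enabled("cpu", None))     # True
```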

Verification

- Original failing command now compiles on OV-GPU and benchmarks (~15.5 ms avg, ~64 samples/sec).
- GPU output matches CPU reference (argmax matches; diff 6e-4, normal fp16/fp32).
- Detector gates correctly (openvino+GPU on; NPU / CPU / DML off).
- New unit tests pass; full optim + analyze unit suites (1923 tests) pass with no regressions.

Note: artifacts are cached by build-config hash, so an existing stale cache needs --rebuild / --ignore-cache to pick up the fix.
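A minimal sketch of why a stale artifact survives the fix, assuming the cache key is derived only from the build configuration. The exact fields winml-cli hashes are not stated here, so `cache_key` and the config keys are hypothetical.

```python
import hashlib
import json


def cache_key(build_config: dict) -> str:
    """Hash the build configuration only; the tool's own code is not part of the key."""
    payload = json.dumps(build_config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


config = {
    "model": "cross-encoder/nli-deberta-v3-small",
    "ep": "openvino",
    "device": "gpu",
}

key_before_fix = cache_key(config)
key_after_fix = cache_key(dict(config))  # same config, newer tool version

print(key_before_fix == key_after_fix)  # True: the old artifact would be reused
```

Because the key is identical before and after upgrading, the previously built artifact is picked up unless the cache is bypassed with `--rebuild` / `--ignore-cache`.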

🤖 Generated with Claude Code

OpenVINO GPU's oneDNN gemm cannot select an implementation for a batched (rank >= 3) MatMul where an operand is a compile-time constant; the same gemm with a dynamic operand, and 2D constant gemm, both compile fine. Transformer disentangled-attention position terms (e.g. DeBERTa) fold to 3D constants and fail to compile with:

[GPU] Failed to select implementation for ... type: gemm (compile_graph.cpp:59 selected_impl == nullptr)

Add an EP-gated `untie-constant-batched-matmul` surgery that routes the constant operand through Add(const, zero), where zero is a data-dependent runtime [1] tensor (Cast -> Reshape(-1) -> Slice[0:1] -> Sub). This makes the operand runtime-valued so OV's constant folder cannot repack it into a gemm weight, while keeping the single batched MatMul (no perf regression) and leaving numerics unchanged (+0).

Wired via autoconf: BatchedConstMatMulValidator detects the pattern and, gated to Intel IHV + GPU, emits a GraphOptimization opportunity the existing autoconf loop auto-applies. Pattern-based, architecture-agnostic.

Also makes the model-validator device filter case-insensitive so builds that pass lowercase "gpu" are matched.

xieofxie · 2026-06-05T07:16:29Z

The source of the failing check is here: https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/src/graph/graph_optimizer/compile_graph.cpp

An upstream issue has also been created: openvinotoolkit/openvino#36272

xieofxie · 2026-06-08T08:39:36Z

Added an issue to allow users to opt out: #835

DingmaomaoBJTU

Overall the design is sound: the surgery correctly routes constant operands through a data-dependent +0 to defeat OpenVINO GPU's constant folder, and the validator correctly gates on Intel IHV + GPU. A few correctness issues and documentation gaps are worth addressing before merge.

- Use loop index for the untied operand name instead of node.name, which is optional in ONNX and can be blank/duplicated (would collide and yield an invalid graph).
- Update docstring to describe the actual Cast/Reshape/Slice/Sub construction (was stale ReduceMin wording) and document the non-empty-first-input assumption.
- Split the Slice starts/ends/axes initializers into distinct named tensors.
- Note the Constant-node detection gap in the validator (shared with surgery).
- Add a test for two unnamed batched-const MatMuls (name-collision regression).
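The name-collision point can be illustrated with a short sketch. Both naming schemes below are hypothetical and not the PR's actual format strings. They only show why deriving tensor names from an optional `node.name` breaks when two MatMuls are unnamed.

```python
def names_from_node_name(node_names):
    """Derive the untied tensor name from node.name (collides when names are blank)."""
    return [f"{name}_untied" for name in node_names]


def names_from_index(node_names):
    """Derive the untied tensor name from the loop index (always unique)."""
    return [f"untied_matmul_{i}" for i, _ in enumerate(node_names)]


unnamed = ["", ""]  # two MatMul nodes exported without names

print(names_from_node_name(unnamed))  # ['_untied', '_untied']
print(names_from_index(unnamed))      # ['untied_matmul_0', 'untied_matmul_1']
```

Two tensors sharing the name `_untied` would give the graph two producers for one value, which is what the proposed regression test with two unnamed MatMuls is meant to catch.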

DingmaomaoBJTU · 2026-06-10T02:51:45Z

            "class": PatternMatchingValidator,
            "enabled_devices": None,  # All devices
        },
+        "batched_const_matmul": {


Please use static rules

DingmaomaoBJTU · 2026-06-10T02:52:43Z

    clamp_min: float = -1e3
    clamp_max: float = 1e3
    remove_isnan_in_attention_mask: bool = False
+    untie_constant_batched_matmul: bool = False


Change this to a rewrite pipe?

xieofxie requested a review from a team as a code owner June 5, 2026 06:59

xieofxie mentioned this pull request Jun 8, 2026

feat: support opt out specific optimization method #835

Open

hualxie added 4 commits June 8, 2026 16:42

Merge remote-tracking branch 'origin/main' into hualxie/fix_ov_gpu

b82c105

update

5d04ce0

use EPName

d9f5ca7

sort

9dbfb6f

DingmaomaoBJTU reviewed Jun 10, 2026


hualxie added 2 commits June 10, 2026 10:31

Merge remote-tracking branch 'origin/main' into hualxie/fix_ov_gpu

1712a3f

DingmaomaoBJTU reviewed Jun 10, 2026


xieofxie mentioned this pull request Jun 10, 2026

feat: untie matmul rewrite #857

Open

xieofxie wants to merge 7 commits into main from hualxie/fix_ov_gpu
