
Sync with Microsoft ONNX Runtime - 16062026 #1139

Open
ai-fw-intg wants to merge 9 commits into ovep-develop from sync_msft_16062026

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

sushraja-msft and others added 9 commits June 15, 2026 08:16
One more reference to 6.33.0 that was missed.

CVE-2026-0994
Upgrade protobuf from 6.33.0 to 6.33.5 to fix the vulnerability.
## Summary

Adds a symmetric weight-only **MoE GEMV fast path** for single-token
(batch-1) decode of quantized MoE models such as GPT-OSS-20B, replacing
the CUTLASS grouped-GEMM path when the expanded row count is small. The
fast path now covers **INT4 and INT8** weights, **per-column and
block-wise (group size 32/64/128)** scales, and **FP16 and BF16**
activations. On an H200 (SM90) this improves end-to-end GPT-OSS-20B INT4
token-generation throughput by roughly **15% over the CUTLASS
grouped-GEMM baseline** and about **8% over the FasterTransformer kernel
used in ORT 1.26** across prompt lengths 128/1024/2048, while producing
bit-faithful output.

## Motivation

At batch-1 decode each token expands to `top_k` rows (4 for
GPT-OSS-20B), so the MoE FC1/FC2 GEMMs are extremely skinny. The CUTLASS
grouped GEMM is built for throughput at larger M and leaves the decode
path memory-bound and underutilized. A dedicated weight-only INT GEMV
with per-expert dispatch is a better fit for this regime and closes the
gap to (and surpasses) ORT 1.26 for GPT-OSS-20B.

This also adds block_size=32 support as requested in
microsoft#29035.

## Key Changes

### New MoE GEMV kernel

| File | Change |
|---|---|
| `contrib_ops/cuda/llm/moe_gemm/moe_gemv.h` | Public interface: `is_moe_gemv_supported` dispatch predicate and the symmetric INT launchers `launch_moe_gemv_int_symmetric<T, WeightType>` / `launch_moe_gemv_int_symmetric_interleaved_swiglu<T, WeightType>` (plus the original INT4 per-channel launchers). |
| `contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu` | Symmetric INT MoE GEMV (INT4 `uint4b_t` / INT8 `uint8_t`, FP16/BF16). One CTA per expanded row × N-tile; per-expert weight/scale/bias offsets via a direct row-to-expert map (prefix-offset scan fallback). Supports per-column (`group_size <= 0`) and block-wise (`group_size` 32/64/128) scales. FC1 has an interleaved SwiGLU-fused epilogue; a static top-k one-row finalize specialization is used for FC2 routing. |
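
To make the scale handling concrete, here is a minimal pure-Python reference of what a symmetric weight-only GEMV computes for each expanded row. It is a sketch, not the CUDA kernel: the function names, the unpacked `[N][K]` weight layout, the `[N][groups]` scale layout, and the implicit zero point of `2^(bits-1)` are all assumptions made for illustration.

```python
def moe_gemv_reference(x, q_weight, scales, group_size, bits=4):
    """Reference y[n] = sum_k x[k] * (q[n][k] - zero_point) * scale(n, k).

    Hypothetical layout: q_weight is [N][K] unpacked unsigned ints and
    scales is [N][1] (per-column) or [N][K // group_size] (block-wise).
    """
    zero_point = 1 << (bits - 1)  # assumption: 8 for INT4, 128 for INT8
    out = []
    for n, q_row in enumerate(q_weight):
        acc = 0.0
        for k, x_k in enumerate(x):
            if group_size <= 0:
                scale = scales[n][0]  # per-column: one scale per output column
            else:
                scale = scales[n][k // group_size]  # block-wise along K
            acc += x_k * (q_row[k] - zero_point) * scale
        out.append(acc)
    return out


def moe_gemv_rows(rows, row_to_expert, expert_weights, expert_scales,
                  group_size, bits=4):
    """Apply the reference GEMV to each expanded row using its expert's weights."""
    return [
        moe_gemv_reference(x, expert_weights[e], expert_scales[e], group_size, bits)
        for x, e in zip(rows, row_to_expert)
    ]
```

With `top_k = 4` at batch 1, `rows` has only four entries, which is why a per-row GEMV can be a better fit than a grouped GEMM in this regime.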

### Runner wiring and profiler

| File | Change |
|---|---|
| `contrib_ops/cuda/llm/moe_gemm/moe_kernels.cu` / `.h` | Branch FC1 and FC2 to the GEMV fast path when supported (carrying `group_size` through `QuantParams`); fall back to grouped GEMM otherwise. `ORT_DISABLE_MOE_GEMV=1` forces the grouped-GEMM path for A/B testing. |
| `contrib_ops/cuda/llm/moe_gemm/moe_gemm_profiler.cc` / `.h`, `moe_util_kernels.h` | Use M (token count) in the MoE GEMM profiler so tactic selection reflects the decode shape. |
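
The branch described above can be pictured as a small predicate. The sketch below is illustrative only: `use_moe_gemv` and the `max_expanded_rows` threshold are hypothetical, and the real gate lives in `is_moe_gemv_supported` with conditions not spelled out here. Only the environment variable name and the supported bit widths, group sizes, and activation types are taken from this description.

```python
import os


def use_moe_gemv(expanded_rows, bits, group_size, dtype,
                 env=None, max_expanded_rows=8):
    """Return True when the (hypothetical) gate would pick the GEMV fast path.

    max_expanded_rows is an assumed placeholder for "small expanded row
    count"; it is not the value used by the actual kernel.
    """
    env = os.environ if env is None else env
    if env.get("ORT_DISABLE_MOE_GEMV") == "1":
        return False  # forced grouped-GEMM path for A/B testing
    return (
        expanded_rows <= max_expanded_rows
        and bits in (4, 8)
        and (group_size <= 0 or group_size in (32, 64, 128))
        and dtype in ("fp16", "bf16")
    )
```

Passing `env` explicitly keeps the predicate easy to exercise in tests without mutating the process environment.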

### Backward-compatible SwiGLU fusion

| File | Change |
|---|---|
| `contrib_ops/cuda/moe/moe.cc`, `moe_quantization.cc` (+ `.h`) | Treat `swiglu_fusion == 0` with no separate FC3 as interleaved fusion (`== 1`). The published GPT-OSS-20B model (and any model exported by ORT < 1.27) hard-coded the interleaved layout and emits no `swiglu_fusion` attribute, so it defaults to 0; those weights are pre-fused into FC1 and must be treated as fusion mode 1. |
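
The compatibility rule reduces to a one-line remapping. The helper below is a hypothetical sketch (the name `effective_swiglu_fusion` and the `has_fc3` flag are not from the source), and it assumes the caller only applies it when the activation is SwiGLU.

```python
def effective_swiglu_fusion(swiglu_fusion, has_fc3):
    """Map the attribute value to the fusion mode actually used.

    Assumption: called only for SwiGLU activations. A value of 0 with no
    separate FC3 weights means a legacy model whose gate/linear weights
    are already interleaved into FC1, so it is treated as mode 1.
    """
    if swiglu_fusion == 0 and not has_fc3:
        return 1
    return swiglu_fusion
```

Models that do provide FC3, or that set the attribute explicitly, pass through unchanged.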

### Tests, profiling, and docs

| File | Change |
|---|---|
| `test/python/transformers/profile_qmoe_gemv.py` / `.sh` | Standalone QMoE GEMV profiling harness (NVTX-ranged, GEMV-vs-grouped-GEMM kernel comparison). |
| `test/python/transformers/test_qmoe_cuda.py`, `test_moe_cuda.py` | QMoE GEMV decode-latency/parity coverage across INT4/INT8, per-column/block-wise, FP16/BF16, plus import tidy-ups. |
| `docs/contrib_ops/cuda/qmoe_gemv_experiments.md` | Full experiment log: kernel-level sweeps, dispatch-gate tuning, block-wise and BF16 enablement, INT8 per-column root-cause + fix, and the GenAI end-to-end throughput comparison (CUTLASS baseline / GEMV final / FT baseline). |
| `docs/contrib_ops/cuda/moe_qmoe.md` | QMoE doc updates describing the GEMV fast path and its dispatch gate. |

## Results

### GPT-OSS-20B INT4 end-to-end (H200/SM90, batch 1)

Token-generation throughput (tps, higher is better):

| Prompt length | CUTLASS baseline (gemm) | GEMV (final) | FT (ORT 1.26) | GEMV vs CUTLASS | GEMV vs FT |
|---|---|---|---|---|---|
| 128 | 248.9 | 288.0 | 265.2 | +15.7% | +8.6% |
| 1024 | 237.8 | 272.2 | 252.9 | +14.5% | +7.6% |
| 2048 | 231.3 | 265.0 | 245.6 | +14.6% | +7.9% |
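
For readers who want to re-derive the percentage columns, the short check below recomputes them from the throughput numbers in the table. The helper name `gain_pct` is just for illustration.

```python
# (CUTLASS baseline, GEMV final, FT from ORT 1.26) in tokens/s, keyed by prompt length
THROUGHPUT = {
    128: (248.9, 288.0, 265.2),
    1024: (237.8, 272.2, 252.9),
    2048: (231.3, 265.0, 245.6),
}


def gain_pct(new, base):
    """Relative throughput gain of `new` over `base`, in percent (1 decimal)."""
    return round((new / base - 1.0) * 100.0, 1)


for prompt_len, (cutlass, gemv, ft) in THROUGHPUT.items():
    print(prompt_len, gain_pct(gemv, cutlass), gain_pct(gemv, ft))
```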

### Decode microbenchmark coverage (H200/SM90, batch 1)

Benchmark-loop latency in milliseconds, lower is better. `Enabled` is
the default GEMV build; `Fallback` sets `ORT_DISABLE_MOE_GEMV=1`
(grouped GEMM). Every case reported `has_invalid_output=false`.

| Case | Quant | DType | Enabled ms | Fallback ms | Speedup |
|---|---|---|---|---|---|
| `int8_per_column_m1_top2_1024x4096_e8` | INT8 per-column | FP16 | 0.0566 | 0.0816 | 1.44x |
| `int8_per_column_m1_top2_1024x4096_e8` | INT8 per-column | BF16 | 0.0578 | 0.0862 | 1.49x |
| `gpt_oss_20b_m1_top4_int8_2880x2880_e32` | INT8 per-column | FP16 | 0.0785 | 0.0947 | 1.21x |
| `gpt_oss_20b_m1_top4_int8_2880x2880_e32` | INT8 per-column | BF16 | 0.0785 | 0.0989 | 1.26x |

Block-wise INT4 (`block_size=64`, `1024x4096`, e8) routes to GEMV for
both FP16 and BF16 with FC1/FC2 kernel times within noise across dtypes
(FP16 4.53/6.95 us, BF16 4.62/7.01 us). See `qmoe_gemv_experiments.md`
for the full sweep.

## Correctness

- GEMV-enabled and `ORT_DISABLE_MOE_GEMV=1` (grouped GEMM) produce
identical, correct output for the GPT-OSS-20B sanity prompt; Nsight
traces confirm `moe_gemv_kernel` (FC2) and
`moe_gemv_interleaved_swiglu_kernel` (FC1) run for the decode shapes.
- INT4/INT8 SwiGLU parity cases pass for FP16 and BF16 (max absolute
difference ~1e-3 against the reference). Regression: `pytest -k
"TestSwigluQMoE or TestQMoEIntPrePackSmoke"` → `34 passed, 4 skipped`.
- Tested on GPT-OSS-20B with block_size=32; the generated results look
good.

## Testing Notes
- Build: `bash .env/cuda_130.sh --build` (CUDA 13.0,
`CMAKE_CUDA_ARCHITECTURES=89;90`). Use `--clean_moe` after dispatch-code
edits to avoid stale `moe_kernels.cu.o`.
- Kernel profiling:
`onnxruntime/test/python/transformers/profile_qmoe_gemv.sh`.
- Parity/latency: `pytest
onnxruntime/test/python/transformers/test_qmoe_cuda.py`.
- End-to-end: GenAI `benchmark_e2e.py` on GPT-OSS-20B INT4, batch 1,
prompt lengths 128/1024/2048. Run on an idle GPU — shared-GPU contention
corrupts the decode-latency measurement.
- A/B check: compare default vs `ORT_DISABLE_MOE_GEMV=1` for both
throughput and output parity.
Bumps [esbuild](https://github.com/evanw/esbuild) from 0.25.12 to
0.28.1.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/evanw/esbuild/releases">esbuild's
releases</a>.</em></p>
<blockquote>
<h2>v0.28.1</h2>
<ul>
<li>
<p>Disallow <code>\</code> in local development server HTTP requests (<a
href="https://github.com/evanw/esbuild/security/advisories/GHSA-g7r4-m6w7-qqqr">GHSA-g7r4-m6w7-qqqr</a>)</p>
<p>This release fixes a security issue where HTTP requests to esbuild's
local development server could traverse outside of the serve directory
on Windows using a <code>\</code> backslash character. It happened due
to the use of Go's <code>path.Clean()</code> function, which only
handles Unix-style <code>/</code> characters. HTTP requests with paths
containing <code>\</code> are no longer allowed.</p>
<p>Thanks to <a
href="https://github.com/dellalibera"><code>@​dellalibera</code></a> for
reporting this issue.</p>
</li>
<li>
<p>Add integrity checks to the Deno API (<a
href="https://github.com/evanw/esbuild/security/advisories/GHSA-gv7w-rqvm-qjhr">GHSA-gv7w-rqvm-qjhr</a>)</p>
<p>The previous release of esbuild added integrity checks to esbuild's
npm install script. This release also adds integrity checks to esbuild's
Deno install script. Now esbuild's Deno API will also fail with an error
if the downloaded esbuild binary contains something other than the
expected content.</p>
<p>Note that esbuild's Deno API installs from
<code>registry.npmjs.org</code> by default, but allows the
<code>NPM_CONFIG_REGISTRY</code> environment variable to override this
with a custom package registry. This change means that the esbuild
executable served by <code>NPM_CONFIG_REGISTRY</code> must now match the
expected content.</p>
<p>Thanks to <a
href="https://github.com/sondt99"><code>@​sondt99</code></a> for
reporting this issue.</p>
</li>
<li>
<p>Avoid inlining <code>using</code> and <code>await using</code>
declarations (<a
href="https://redirect.github.com/evanw/esbuild/issues/4482">#4482</a>)</p>
<p>Previously esbuild's minifier sometimes incorrectly inlined
<code>using</code> and <code>await using</code> declarations into
subsequent uses of that declaration, which then fails to dispose of the
resource correctly. This bug happened because inlining was done for
<code>let</code> and <code>const</code> declarations by avoiding doing
it for <code>var</code> declarations, which no longer worked when more
declaration types were added. Here's an example:</p>
<pre lang="js"><code>// Original code
{
  using x = new Resource()
  x.activate()
}

// Old output (with --minify)
new Resource().activate();

// New output (with --minify)
{using e=new Resource;e.activate()}
</code></pre>
</li>
<li>
<p>Fix module evaluation when an error is thrown (<a
href="https://redirect.github.com/evanw/esbuild/issues/4461">#4461</a>,
<a
href="https://redirect.github.com/evanw/esbuild/pull/4467">#4467</a>)</p>
<p>If an error is thrown during module evaluation, esbuild previously
didn't preserve the state of the module for subsequent module
references. This was observable if <code>import()</code> or
<code>require()</code> is used to import a module multiple times. The
thrown error is supposed to be thrown by every call to
<code>import()</code> or <code>require()</code>, not just the first.
With this release, esbuild will now throw the same error every time you
call <code>import()</code> or <code>require()</code> on a module that
throws during its evaluation.</p>
</li>
<li>
<p>Fix some edge cases around the <code>new</code> operator (<a
href="https://redirect.github.com/evanw/esbuild/issues/4477">#4477</a>)</p>
<p>Previously esbuild incorrectly printed certain edge cases involving
complex expressions inside the target of a <code>new</code> expression
(specifically an optional chain and/or a tagged template literal). The
generated code for the <code>new</code> target was not correctly wrapped
with parentheses, and either contained a syntax error or had different
semantics. These edge cases have been fixed so that they now correctly
wrap the <code>new</code> target in parentheses. Here is an example of
some affected code:</p>
<pre lang="js"><code>// Original code
new (foo()`bar`)()
new (foo()?.bar)()

// Old output
new foo()`bar`();
new (foo())?.bar();
</code></pre>
</li>
</ul>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/evanw/esbuild/blob/main/CHANGELOG-2025.md">esbuild's
changelog</a>.</em></p>
<blockquote>
<h1>Changelog: 2025</h1>
<p>This changelog documents all esbuild versions published in the year
2025 (versions 0.25.0 through 0.27.2).</p>
<h2>0.27.2</h2>
<ul>
<li>
<p>Allow import path specifiers starting with <code>#/</code> (<a
href="https://redirect.github.com/evanw/esbuild/pull/4361">#4361</a>)</p>
<p>Previously the specification for <code>package.json</code> disallowed
import path specifiers starting with <code>#/</code>, but this
restriction <a
href="https://redirect.github.com/nodejs/node/pull/60864">has recently
been relaxed</a> and support for it is being added across the JavaScript
ecosystem. One use case is using it for a wildcard pattern such as
mapping <code>#/*</code> to <code>./src/*</code> (previously you had to
use another character such as <code>#_*</code> instead, which was more
confusing). There is some more context in <a
href="https://redirect.github.com/nodejs/node/issues/49182">nodejs/node#49182</a>.</p>
<p>This change was contributed by <a
href="https://github.com/hybrist"><code>@​hybrist</code></a>.</p>
</li>
<li>
<p>Automatically add the <code>-webkit-mask</code> prefix (<a
href="https://redirect.github.com/evanw/esbuild/issues/4357">#4357</a>,
<a
href="https://redirect.github.com/evanw/esbuild/issues/4358">#4358</a>)</p>
<p>This release automatically adds the <code>-webkit-</code> vendor
prefix for the <a
href="https://developer.mozilla.org/en-US/docs/Web/CSS/Reference/Properties/mask"><code>mask</code></a>
CSS shorthand property:</p>
<pre lang="css"><code>/* Original code */
main {
  mask: url(x.png) center/5rem no-repeat
}

/* Old output (with --target=chrome110) */
main {
  mask: url(x.png) center/5rem no-repeat;
}

/* New output (with --target=chrome110) */
main {
  -webkit-mask: url(x.png) center/5rem no-repeat;
  mask: url(x.png) center/5rem no-repeat;
}
</code></pre>
<p>This change was contributed by <a
href="https://github.com/BPJEnnova"><code>@​BPJEnnova</code></a>.</p>
</li>
<li>
<p>Additional minification of <code>switch</code> statements (<a
href="https://redirect.github.com/evanw/esbuild/issues/4176">#4176</a>,
<a
href="https://redirect.github.com/evanw/esbuild/issues/4359">#4359</a>)</p>
<p>This release contains additional minification patterns for reducing
<code>switch</code> statements. Here is an example:</p>
<pre lang="js"><code>// Original code
switch (x) {
  case 0:
    foo()
    break
  case 1:
  default:
    bar()
}
</code></pre>
</li>
</ul>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/evanw/esbuild/commit/bb9db84c02433fbe37b3509f53f9f3e3cc48725e"><code>bb9db84</code></a>
publish 0.28.1 to npm</li>
<li><a
href="https://github.com/evanw/esbuild/commit/9ff053e53b8eeb990f59355dbea365277ac45ee2"><code>9ff053e</code></a>
security: add integrity checks to the Deno API</li>
<li><a
href="https://github.com/evanw/esbuild/commit/0a9bf2135b67c7e28989a5ba19f0f000805a5ab5"><code>0a9bf21</code></a>
enforce non-negative size in gzip parser</li>
<li><a
href="https://github.com/evanw/esbuild/commit/e2a1a7132058ee067fe736eac15f695861b8654e"><code>e2a1a71</code></a>
security: forbid <code>\\</code> in local dev server requests</li>
<li><a
href="https://github.com/evanw/esbuild/commit/83a2cbfc35809f4fd5152da59572d7bed7739d78"><code>83a2cbf</code></a>
fix <a
href="https://redirect.github.com/evanw/esbuild/issues/4482">#4482</a>:
don't inline <code>using</code> declarations</li>
<li><a
href="https://github.com/evanw/esbuild/commit/308ad745d824c77bc607603451b257d0f2fd9a38"><code>308ad74</code></a>
fix <a
href="https://redirect.github.com/evanw/esbuild/issues/4471">#4471</a>:
renaming of nested <code>var</code> declarations</li>
<li><a
href="https://github.com/evanw/esbuild/commit/f013f5f99a015bce92ec48d49181d4ad3177b29b"><code>f013f5f</code></a>
fix some typos</li>
<li><a
href="https://github.com/evanw/esbuild/commit/aafd6e48b1088336a5f5a17e930be7e840d43d8c"><code>aafd6e4</code></a>
chore: fix some minor issues in comments (<a
href="https://redirect.github.com/evanw/esbuild/issues/4462">#4462</a>)</li>
<li><a
href="https://github.com/evanw/esbuild/commit/15300c30b5e22f7cfcbed850c246d35095658386"><code>15300c3</code></a>
follow up: cjs evaluation fixes</li>
<li><a
href="https://github.com/evanw/esbuild/commit/1bda0c31d7697c0af44b3ab39b81e599e559a395"><code>1bda0c3</code></a>
fix <a
href="https://redirect.github.com/evanw/esbuild/issues/4461">#4461</a>,
fix <a
href="https://redirect.github.com/evanw/esbuild/issues/4467">#4467</a>:
esm evaluation fixes</li>
<li>Additional commits viewable in <a
href="https://github.com/evanw/esbuild/compare/v0.25.12...v0.28.1">compare
view</a></li>
</ul>
</details>
<details>
<summary>Maintainer changes</summary>
<p>This version was pushed to npm by <a
href="https://www.npmjs.com/~GitHub%20Actions">GitHub Actions</a>, a new
releaser for esbuild since your current version.</p>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=esbuild&package-manager=npm_and_yarn&previous-version=0.25.12&new-version=0.28.1)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)


Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Add support for 8-bit quantized weights (bits=8) in the MatMulNBits op
builder. Previously only 4-bit was supported. The 8-bit path uses uint8
block size and scales the dequantized values appropriately.
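
As a rough illustration of what block-wise dequantization for 4-bit and 8-bit weights looks like, here is a pure-Python sketch. It is not the op builder's code: the helper names are hypothetical, and the low-nibble-first packing and the default zero point of `2^(bits-1)` are assumptions based on the usual MatMulNBits convention.

```python
def unpack_quantized(packed, bits):
    """Unpack a bytes object into unsigned quantized values."""
    if bits == 8:
        return list(packed)  # one value per byte
    if bits == 4:
        values = []
        for byte in packed:
            values.append(byte & 0x0F)  # assumption: low nibble comes first
            values.append(byte >> 4)
        return values
    raise ValueError(f"unsupported bits: {bits}")


def dequantize_block(packed, scale, bits, zero_point=None):
    """Dequantize one block: (q - zero_point) * scale."""
    if zero_point is None:
        zero_point = 1 << (bits - 1)  # assumed default: 8 for 4-bit, 128 for 8-bit
    return [(q - zero_point) * scale for q in unpack_quantized(packed, bits)]
```

For example, `dequantize_block(bytes([0x98]), 0.5, 4)` unpacks the nibbles 8 and 9 and yields `[0.0, 0.5]` under these assumptions.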
### Description
Support casting to int64 from float32 via IEEE-754 bit decomposition.

- Introduce a new `float_to_int64` helper that emits the
truncated-toward-zero value in full int64 range.
- `to_` type now always allows int64, regardless of `enable_int64`;
casting *from* int64 stays gated by `enable_int64`.
- Adds `cast_op_test.cc` coverage for the newly introduced conversions.
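
The idea behind the bit decomposition can be sketched on the host in Python. This is an illustration of the technique, not the WGSL helper: the function name mirrors `float_to_int64` but the code is independent of it, and the handling of NaN (mapped to 0) and out-of-range values (saturated) is an assumption, since the description does not specify those cases.

```python
import struct

INT64_MIN = -(1 << 63)
INT64_MAX = (1 << 63) - 1


def float32_to_int64(x):
    """Truncate a float32 value toward zero using its IEEE-754 fields."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0xFF:  # NaN or infinity
        if mantissa != 0:
            return 0  # assumption: NaN maps to 0
        return INT64_MIN if sign else INT64_MAX  # assumption: saturate
    if exponent < 127:  # |x| < 1, including zeros and subnormals
        return 0
    significand = mantissa | (1 << 23)  # restore the implicit leading 1
    shift = (exponent - 127) - 23  # unbiased exponent minus mantissa width
    magnitude = significand << shift if shift >= 0 else significand >> -shift
    value = -magnitude if sign else magnitude
    return max(INT64_MIN, min(INT64_MAX, value))  # assumption: saturate
```

Shifting the integer significand, rather than converting through a 32-bit integer, is what lets the result cover the full int64 range.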

### Motivation and Context
While running the mask-generation vision encoder (`Xenova/sam-vit-base`)
on the WebGPU EP via Transformers.js, float32-to-int64 cast nodes fall
back to the CPU provider under the default session configuration,
because casting to int64 was previously gated behind the `enable_int64`
flag. The fallback introduces host memcpy and synchronization overhead.

Making cast-to-int64 correct across the full int64 range lets it run on
the WebGPU EP by default, keeping these nodes on-device and eliminating
the stalls.

### Performance Impact

Measured on the `vision_encoder.onnx` of `Xenova/sam-vit-base`
(mask-generation, SAM ViT-base vision encoder) on the WebGPU EP.

| Platform         | Latency reduction | Speedup |
|------------------|------------------:|--------:|
| Intel Wildcat Lake |            −22.8% |   1.30× |
| Intel Panther Lake |            −17.1% |   1.21× |

This change yields a 1.2–1.3× speedup on the SAM ViT-base vision encoder
under the default configuration.
…icrosoft#28757)

### Description

ResizeOpTest.ResizeOpNearestUpSample_RoundPreferCeil_HalfPixel_2x2to7x8
fails on Intel WebGPU devices:
```
error: The difference between cur_expected[i] and cur_actual[i] is 2, which exceeds tolerance, where
cur_expected[i] evaluates to 3,
cur_actual[i] evaluates to 1, and
tolerance evaluates to 0.00030999997397884727.
```

### Motivation and Context

The WebGPU shader logic used exact half-tie equality checks combined
with i32 truncation arithmetic. This approach is fragile under GPU
floating-point precision constraints and misbehaves on negative halfway
coordinates.

This change replaces the exact comparisons with a robust, epsilon-based
fractional tie check (<= 1e-6) for both ROUND_PREFER_CEIL and
ROUND_PREFER_FLOOR execution paths.
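
The tie handling can be sketched in Python as follows. This is an illustration of the epsilon-based approach, not the shader source: the function names are hypothetical, and only the `1e-6` tolerance and the two rounding modes come from the description above.

```python
import math

TIE_EPSILON = 1e-6  # tolerance for treating a fractional part as exactly 0.5


def _split(x):
    """Return (floor(x), fractional part in [0, 1)); works for negative x too."""
    lower = math.floor(x)
    return lower, x - lower


def round_prefer_ceil(x):
    lower, frac = _split(x)
    if abs(frac - 0.5) <= TIE_EPSILON:
        return lower + 1  # tie: round up
    return lower + 1 if frac > 0.5 else lower


def round_prefer_floor(x):
    lower, frac = _split(x)
    if abs(frac - 0.5) <= TIE_EPSILON:
        return lower  # tie: round down
    return lower + 1 if frac > 0.5 else lower
```

Using `floor` plus a fractional part, rather than integer truncation, keeps negative halfway coordinates such as `-0.5` consistent with positive ones.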

Also add regression coverage for round_prefer_floor in
resize_op_test.cc:
- ResizeOpNearestUpSample_RoundPreferFloor_HalfPixel_2x2to7x8
- ResizeOpNearestUpSample_RoundPreferFloor_HalfPixel_GH28291_Regression
…icrosoft#29030)

### Description
`GatherBlockQuantized` on WebGPU computes the dispatch size as
`ceil(output_size / 64)`, which is 0 when the gather output is empty
(e.g. an empty indices tensor), and `NormalizeDispatchGroupSize` rejects
the (0, 1, 1) dispatch. Return early for an empty output instead,
matching other WebGPU kernels.
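
The arithmetic behind the failure is easy to reproduce. The sketch below uses hypothetical helper names and models the early return as `None`; only the workgroup size of 64 and the `(0, 1, 1)` dispatch come from the description.

```python
WORKGROUP_SIZE = 64


def dispatch_group_count(output_size):
    """ceil(output_size / WORKGROUP_SIZE) using integer arithmetic."""
    return (output_size + WORKGROUP_SIZE - 1) // WORKGROUP_SIZE


def plan_dispatch(output_size):
    """Return the (x, y, z) dispatch size, or None when there is nothing to run."""
    if output_size == 0:
        return None  # early return: a (0, 1, 1) dispatch would be rejected
    return (dispatch_group_count(output_size), 1, 1)
```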

Adds `GatherBlockQuantizedOpTest.WebGpu_EmptyIndices_8Bits_Uint8`.

### Motivation and Context
Fixes microsoft#28772.
### Description

Bump version number after WebGPU plugin EP 0.2.0 release branch
creation.

### Motivation and Context

Preparing for WebGPU plugin EP 0.2.0 release.