Sync with Microsoft ONNX Runtime - 16062026 #1139
Open
ai-fw-intg wants to merge 9 commits into
One more reference to protobuf 6.33.0 that was missed. CVE-2026-0994: upgrade protobuf from 6.33.0 to 6.33.5 to fix the vulnerability.
## Summary

Adds a symmetric weight-only **MoE GEMV fast path** for single-token (batch-1) decode of quantized MoE models such as GPT-OSS-20B, replacing the CUTLASS grouped-GEMM path when the expanded row count is small. The fast path now covers **INT4 and INT8** weights, **per-column and block-wise (group size 32/64/128)** scales, and **FP16 and BF16** activations.

On an H200 (SM90) this improves end-to-end GPT-OSS-20B INT4 token-generation throughput by roughly **15% over the CUTLASS grouped-GEMM baseline** and about **8% over the FasterTransformer kernel used in ORT 1.26** across prompt lengths 128/1024/2048, while producing bit-faithful output.

## Motivation

At batch-1 decode each token expands to `top_k` rows (4 for GPT-OSS-20B), so the MoE FC1/FC2 GEMMs are extremely skinny. The CUTLASS grouped GEMM is built for throughput at larger M and leaves the decode path memory-bound and underutilized. A dedicated weight-only INT GEMV with per-expert dispatch is a better fit for this regime and closes the gap to (and surpasses) ORT 1.26 for GPT-OSS-20B.

This also adds block_size=32 support as requested in microsoft#29035.

## Key Changes

### New MoE GEMV kernel

| File | Change |
|---|---|
| `contrib_ops/cuda/llm/moe_gemm/moe_gemv.h` | Public interface: `is_moe_gemv_supported` dispatch predicate and the symmetric INT launchers `launch_moe_gemv_int_symmetric<T, WeightType>` / `launch_moe_gemv_int_symmetric_interleaved_swiglu<T, WeightType>` (plus the original INT4 per-channel launchers). |
| `contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu` | Symmetric INT MoE GEMV (INT4 `uint4b_t` / INT8 `uint8_t`, FP16/BF16). One CTA per expanded row × N-tile; per-expert weight/scale/bias offsets via a direct row-to-expert map (prefix-offset scan fallback). Supports per-column (`group_size <= 0`) and block-wise (`group_size` 64/128) scales. FC1 has an interleaved SwiGLU-fused epilogue; a static top-k one-row finalize specialization is used for FC2 routing. |

### Runner wiring and profiler

| File | Change |
|---|---|
| `contrib_ops/cuda/llm/moe_gemm/moe_kernels.cu` / `.h` | Branch FC1 and FC2 to the GEMV fast path when supported (carrying `group_size` through `QuantParams`); fall back to grouped GEMM otherwise. `ORT_DISABLE_MOE_GEMV=1` forces the grouped-GEMM path for A/B testing. |
| `contrib_ops/cuda/llm/moe_gemm/moe_gemm_profiler.cc` / `.h`, `moe_util_kernels.h` | Use M (token count) in the MoE GEMM profiler so tactic selection reflects the decode shape. |

### Backward-compatible SwiGLU fusion

| File | Change |
|---|---|
| `contrib_ops/cuda/moe/moe.cc`, `moe_quantization.cc` (+ `.h`) | Treat `swiglu_fusion == 0` with no separate FC3 as interleaved fusion (`== 1`). The published GPT-OSS-20B model (and any model exported by ORT < 1.27) hard-coded the interleaved layout and emits no `swiglu_fusion` attribute, so it defaults to 0; those weights are pre-fused into FC1 and must be treated as fusion mode 1. |

### Tests, profiling, and docs

| File | Change |
|---|---|
| `test/python/transformers/profile_qmoe_gemv.py` / `.sh` | Standalone QMoE GEMV profiling harness (NVTX-ranged, GEMV-vs-grouped-GEMM kernel comparison). |
| `test/python/transformers/test_qmoe_cuda.py`, `test_moe_cuda.py` | QMoE GEMV decode-latency/parity coverage across INT4/INT8, per-column/block-wise, FP16/BF16, plus import tidy-ups. |
| `docs/contrib_ops/cuda/qmoe_gemv_experiments.md` | Full experiment log: kernel-level sweeps, dispatch-gate tuning, block-wise and BF16 enablement, INT8 per-column root-cause + fix, and the GenAI end-to-end throughput comparison (CUTLASS baseline / GEMV final / FT baseline). |
| `docs/contrib_ops/cuda/moe_qmoe.md` | QMoE doc updates describing the GEMV fast path and its dispatch gate. |

## Results

### GPT-OSS-20B INT4 end-to-end (H200/SM90, batch 1)

Token-generation throughput (tps, higher is better):

| Prompt length | CUTLASS baseline (gemm) | GEMV (final) | FT (ORT 1.26) | GEMV vs CUTLASS | GEMV vs FT |
|---|---|---|---|---|---|
| 128 | 248.9 | 288.0 | 265.2 | +15.7% | +8.6% |
| 1024 | 237.8 | 272.2 | 252.9 | +14.5% | +7.6% |
| 2048 | 231.3 | 265.0 | 245.6 | +14.6% | +7.9% |

### Decode microbenchmark coverage (H200/SM90, batch 1)

Benchmark-loop latency in milliseconds, lower is better. `Enabled` is the default GEMV build; `Fallback` sets `ORT_DISABLE_MOE_GEMV=1` (grouped GEMM). Every case reported `has_invalid_output=false`.

| Case | Quant | DType | Enabled ms | Fallback ms | Speedup |
|---|---|---|---|---|---|
| `int8_per_column_m1_top2_1024x4096_e8` | INT8 per-column | FP16 | 0.0566 | 0.0816 | 1.44x |
| `int8_per_column_m1_top2_1024x4096_e8` | INT8 per-column | BF16 | 0.0578 | 0.0862 | 1.49x |
| `gpt_oss_20b_m1_top4_int8_2880x2880_e32` | INT8 per-column | FP16 | 0.0785 | 0.0947 | 1.21x |
| `gpt_oss_20b_m1_top4_int8_2880x2880_e32` | INT8 per-column | BF16 | 0.0785 | 0.0989 | 1.26x |

Block-wise INT4 (`block_size=64`, `1024x4096`, e8) routes to GEMV for both FP16 and BF16 with FC1/FC2 kernel times within noise across dtypes (FP16 4.53/6.95 us, BF16 4.62/7.01 us). See `qmoe_gemv_experiments.md` for the full sweep.

## Correctness

- GEMV-enabled and `ORT_DISABLE_MOE_GEMV=1` (grouped GEMM) produce identical, correct output for the GPT-OSS-20B sanity prompt; Nsight traces confirm `moe_gemv_kernel` (FC2) and `moe_gemv_interleaved_swiglu_kernel` (FC1) run for the decode shapes.
- INT4/INT8 SwiGLU parity cases pass for FP16 and BF16 (max absolute difference ~1e-3 against the reference). Regression: `pytest -k "TestSwigluQMoE or TestQMoEIntPrePackSmoke"` → `34 passed, 4 skipped`.
- Tested on GPT-OSS-20B with block_size=32; the generated results look good.

## Testing Notes

- Build: `bash .env/cuda_130.sh --build` (CUDA 13.0, `CMAKE_CUDA_ARCHITECTURES=89;90`). Use `--clean_moe` after dispatch-code edits to avoid stale `moe_kernels.cu.o`.
- Kernel profiling: `onnxruntime/test/python/transformers/profile_qmoe_gemv.sh`.
- Parity/latency: `pytest onnxruntime/test/python/transformers/test_qmoe_cuda.py`.
- End-to-end: GenAI `benchmark_e2e.py` on GPT-OSS-20B INT4, batch 1, prompt lengths 128/1024/2048. Run on an idle GPU; shared-GPU contention corrupts the decode-latency measurement.
- A/B check: compare default vs `ORT_DISABLE_MOE_GEMV=1` for both throughput and output parity.
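When repeating the A/B check, the percentage columns of the end-to-end table can be recomputed from the raw throughput numbers. The sketch below is only an illustration: `percent_gain` and `RESULTS` are hypothetical names, not part of the PR, and the values are copied from the GPT-OSS-20B INT4 table above.

```python
def percent_gain(new_tps: float, base_tps: float) -> float:
    """Relative throughput gain of new_tps over base_tps, in percent (1 decimal)."""
    return round((new_tps / base_tps - 1.0) * 100.0, 1)


# prompt length -> (CUTLASS baseline, GEMV final, FT / ORT 1.26), tokens per second
RESULTS = {
    128: (248.9, 288.0, 265.2),
    1024: (237.8, 272.2, 252.9),
    2048: (231.3, 265.0, 245.6),
}

for prompt_len, (cutlass_tps, gemv_tps, ft_tps) in RESULTS.items():
    vs_cutlass = percent_gain(gemv_tps, cutlass_tps)
    vs_ft = percent_gain(gemv_tps, ft_tps)
    print(f"prompt={prompt_len}: GEMV vs CUTLASS {vs_cutlass:+.1f}%, GEMV vs FT {vs_ft:+.1f}%")
```

The same helper applies to the latency table if the arguments are swapped (fallback latency divided by enabled latency), since lower latency is better there.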
Bumps [esbuild](https://github.com/evanw/esbuild) from 0.25.12 to 0.28.1. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/evanw/esbuild/releases">esbuild's releases</a>.</em></p> <blockquote> <h2>v0.28.1</h2> <ul> <li> <p>Disallow <code>\</code> in local development server HTTP requests (<a href="https://github.com/evanw/esbuild/security/advisories/GHSA-g7r4-m6w7-qqqr">GHSA-g7r4-m6w7-qqqr</a>)</p> <p>This release fixes a security issue where HTTP requests to esbuild's local development server could traverse outside of the serve directory on Windows using a <code>\</code> backslash character. It happened due to the use of Go's <code>path.Clean()</code> function, which only handles Unix-style <code>/</code> characters. HTTP requests with paths containing <code>\</code> are no longer allowed.</p> <p>Thanks to <a href="https://github.com/dellalibera"><code>@dellalibera</code></a> for reporting this issue.</p> </li> <li> <p>Add integrity checks to the Deno API (<a href="https://github.com/evanw/esbuild/security/advisories/GHSA-gv7w-rqvm-qjhr">GHSA-gv7w-rqvm-qjhr</a>)</p> <p>The previous release of esbuild added integrity checks to esbuild's npm install script. This release also adds integrity checks to esbuild's Deno install script. Now esbuild's Deno API will also fail with an error if the downloaded esbuild binary contains something other than the expected content.</p> <p>Note that esbuild's Deno API installs from <code>registry.npmjs.org</code> by default, but allows the <code>NPM_CONFIG_REGISTRY</code> environment variable to override this with a custom package registry. 
This change means that the esbuild executable served by <code>NPM_CONFIG_REGISTRY</code> must now match the expected content.</p> <p>Thanks to <a href="https://github.com/sondt99"><code>@sondt99</code></a> for reporting this issue.</p> </li> <li> <p>Avoid inlining <code>using</code> and <code>await using</code> declarations (<a href="https://redirect.github.com/evanw/esbuild/issues/4482">#4482</a>)</p> <p>Previously esbuild's minifier sometimes incorrectly inlined <code>using</code> and <code>await using</code> declarations into subsequent uses of that declaration, which then fails to dispose of the resource correctly. This bug happened because inlining was done for <code>let</code> and <code>const</code> declarations by avoiding doing it for <code>var</code> declarations, which no longer worked when more declaration types were added. Here's an example:</p> <pre lang="js"><code>// Original code { using x = new Resource() x.activate() } <p>// Old output (with --minify)<br /> new Resource().activate();</p> <p>// New output (with --minify)<br /> {using e=new Resource;e.activate()}<br /> </code></pre></p> </li> <li> <p>Fix module evaluation when an error is thrown (<a href="https://redirect.github.com/evanw/esbuild/issues/4461">#4461</a>, <a href="https://redirect.github.com/evanw/esbuild/pull/4467">#4467</a>)</p> <p>If an error is thrown during module evaluation, esbuild previously didn't preserve the state of the module for subsequent module references. This was observable if <code>import()</code> or <code>require()</code> is used to import a module multiple times. The thrown error is supposed to be thrown by every call to <code>import()</code> or <code>require()</code>, not just the first. 
With this release, esbuild will now throw the same error every time you call <code>import()</code> or <code>require()</code> on a module that throws during its evaluation.</p> </li> <li> <p>Fix some edge cases around the <code>new</code> operator (<a href="https://redirect.github.com/evanw/esbuild/issues/4477">#4477</a>)</p> <p>Previously esbuild incorrectly printed certain edge cases involving complex expressions inside the target of a <code>new</code> expression (specifically an optional chain and/or a tagged template literal). The generated code for the <code>new</code> target was not correctly wrapped with parentheses, and either contained a syntax error or had different semantics. These edge cases have been fixed so that they now correctly wrap the <code>new</code> target in parentheses. Here is an example of some affected code:</p> <pre lang="js"><code>// Original code new (foo()`bar`)() new (foo()?.bar)() <p>// Old output<br /> new foo()<code>bar</code>();<br /> new (foo())?.bar();</p> <p></code></pre></p> </li> </ul> <!-- raw HTML omitted --> </blockquote> <p>... (truncated)</p> </details> <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/evanw/esbuild/blob/main/CHANGELOG-2025.md">esbuild's changelog</a>.</em></p> <blockquote> <h1>Changelog: 2025</h1> <p>This changelog documents all esbuild versions published in the year 2025 (versions 0.25.0 through 0.27.2).</p> <h2>0.27.2</h2> <ul> <li> <p>Allow import path specifiers starting with <code>#/</code> (<a href="https://redirect.github.com/evanw/esbuild/pull/4361">#4361</a>)</p> <p>Previously the specification for <code>package.json</code> disallowed import path specifiers starting with <code>#/</code>, but this restriction <a href="https://redirect.github.com/nodejs/node/pull/60864">has recently been relaxed</a> and support for it is being added across the JavaScript ecosystem. 
One use case is using it for a wildcard pattern such as mapping <code>#/*</code> to <code>./src/*</code> (previously you had to use another character such as <code>#_*</code> instead, which was more confusing). There is some more context in <a href="https://redirect.github.com/nodejs/node/issues/49182">nodejs/node#49182</a>.</p> <p>This change was contributed by <a href="https://github.com/hybrist"><code>@hybrist</code></a>.</p> </li> <li> <p>Automatically add the <code>-webkit-mask</code> prefix (<a href="https://redirect.github.com/evanw/esbuild/issues/4357">#4357</a>, <a href="https://redirect.github.com/evanw/esbuild/issues/4358">#4358</a>)</p> <p>This release automatically adds the <code>-webkit-</code> vendor prefix for the <a href="https://developer.mozilla.org/en-US/docs/Web/CSS/Reference/Properties/mask"><code>mask</code></a> CSS shorthand property:</p> <pre lang="css"><code>/* Original code */ main { mask: url(x.png) center/5rem no-repeat } <p>/* Old output (with --target=chrome110) */<br /> main {<br /> mask: url(x.png) center/5rem no-repeat;<br /> }</p> <p>/* New output (with --target=chrome110) */<br /> main {<br /> -webkit-mask: url(x.png) center/5rem no-repeat;<br /> mask: url(x.png) center/5rem no-repeat;<br /> }<br /> </code></pre></p> <p>This change was contributed by <a href="https://github.com/BPJEnnova"><code>@BPJEnnova</code></a>.</p> </li> <li> <p>Additional minification of <code>switch</code> statements (<a href="https://redirect.github.com/evanw/esbuild/issues/4176">#4176</a>, <a href="https://redirect.github.com/evanw/esbuild/issues/4359">#4359</a>)</p> <p>This release contains additional minification patterns for reducing <code>switch</code> statements. Here is an example:</p> <pre lang="js"><code>// Original code switch (x) { case 0: foo() break case 1: default: bar() } </code></pre> </li> </ul> <!-- raw HTML omitted --> </blockquote> <p>... 
(truncated)</p> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/evanw/esbuild/commit/bb9db84c02433fbe37b3509f53f9f3e3cc48725e"><code>bb9db84</code></a> publish 0.28.1 to npm</li> <li><a href="https://github.com/evanw/esbuild/commit/9ff053e53b8eeb990f59355dbea365277ac45ee2"><code>9ff053e</code></a> security: add integrity checks to the Deno API</li> <li><a href="https://github.com/evanw/esbuild/commit/0a9bf2135b67c7e28989a5ba19f0f000805a5ab5"><code>0a9bf21</code></a> enforce non-negative size in gzip parser</li> <li><a href="https://github.com/evanw/esbuild/commit/e2a1a7132058ee067fe736eac15f695861b8654e"><code>e2a1a71</code></a> security: forbid <code>\\</code> in local dev server requests</li> <li><a href="https://github.com/evanw/esbuild/commit/83a2cbfc35809f4fd5152da59572d7bed7739d78"><code>83a2cbf</code></a> fix <a href="https://redirect.github.com/evanw/esbuild/issues/4482">#4482</a>: don't inline <code>using</code> declarations</li> <li><a href="https://github.com/evanw/esbuild/commit/308ad745d824c77bc607603451b257d0f2fd9a38"><code>308ad74</code></a> fix <a href="https://redirect.github.com/evanw/esbuild/issues/4471">#4471</a>: renaming of nested <code>var</code> declarations</li> <li><a href="https://github.com/evanw/esbuild/commit/f013f5f99a015bce92ec48d49181d4ad3177b29b"><code>f013f5f</code></a> fix some typos</li> <li><a href="https://github.com/evanw/esbuild/commit/aafd6e48b1088336a5f5a17e930be7e840d43d8c"><code>aafd6e4</code></a> chore: fix some minor issues in comments (<a href="https://redirect.github.com/evanw/esbuild/issues/4462">#4462</a>)</li> <li><a href="https://github.com/evanw/esbuild/commit/15300c30b5e22f7cfcbed850c246d35095658386"><code>15300c3</code></a> follow up: cjs evaluation fixes</li> <li><a href="https://github.com/evanw/esbuild/commit/1bda0c31d7697c0af44b3ab39b81e599e559a395"><code>1bda0c3</code></a> fix <a href="https://redirect.github.com/evanw/esbuild/issues/4461">#4461</a>, fix <a 
href="https://redirect.github.com/evanw/esbuild/issues/4467">#4467</a>: esm evaluation fixes</li> <li>Additional commits viewable in <a href="https://github.com/evanw/esbuild/compare/v0.25.12...v0.28.1">compare view</a></li> </ul> </details> <details> <summary>Maintainer changes</summary> <p>This version was pushed to npm by <a href="https://www.npmjs.com/~GitHub%20Actions">GitHub Actions</a>, a new releaser for esbuild since your current version.</p> </details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Add support for 8-bit quantized weights (bits=8) in the MatMulNBits op builder. Previously only 4-bit was supported. The 8-bit path uses uint8 block size and scales the dequantized values appropriately.
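As a rough illustration of what block-wise 8-bit dequantization looks like, here is a minimal Python sketch. It is not the op builder's code: the function name is hypothetical, and the default zero point of 128 is an assumption based on the usual `2^(bits-1)` convention for unsigned block-quantized weights.

```python
from typing import List, Sequence


def dequantize_blockwise_uint8(
    quantized: Sequence[int],
    scales: Sequence[float],
    block_size: int,
    zero_point: int = 128,  # assumed default: 2^(bits-1) for bits=8
) -> List[float]:
    """Dequantize uint8 weights where each run of block_size values shares one scale."""
    if len(quantized) != len(scales) * block_size:
        raise ValueError("expected exactly one scale per block of weights")
    values = []
    for index, q in enumerate(quantized):
        if not 0 <= q <= 255:
            raise ValueError("quantized values must fit in uint8")
        scale = scales[index // block_size]
        values.append((q - zero_point) * scale)
    return values


# Two blocks of two weights each, with scales 0.5 and 2.0.
print(dequantize_blockwise_uint8([128, 130, 126, 255], [0.5, 2.0], block_size=2))
```

The only structural difference from a 4-bit path in this sketch is that each weight occupies a full byte, so no nibble unpacking is needed before applying the scale.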
### Description

Support casting to int64 from float32 via IEEE-754 bit decomposition.

- Introduce a new `float_to_int64` helper that emits the truncated-toward-zero value in full int64 range.
- `to_` type now always allows int64, regardless of `enable_int64`; casting *from* int64 stays gated by `enable_int64`.
- Adds `cast_op_test.cc` coverage for the newly introduced conversions.

### Motivation and Context

While running the mask-generation vision encoder (`Xenova/sam-vit-base`) on the WebGPU EP via Transformers.js, float32-to-int64 cast nodes fall back to the CPU provider under the default session configuration, because casting to int64 was previously gated behind the `enable_int64` flag. The fallback introduces host memcpy and synchronization overhead. Making cast-to-int64 correct across the full int64 range lets it run on the WebGPU EP by default, keeping these nodes on-device and eliminating the stalls.

### Performance Impact

Measured on the `vision_encoder.onnx` of `Xenova/sam-vit-base` (mask-generation, SAM ViT-base vision encoder) on the WebGPU EP.

| Platform | Latency reduction | Speedup |
|------------------|------------------:|--------:|
| Intel Wildcat Lake | −22.8% | 1.30× |
| Intel Panther Lake | −17.1% | 1.21× |

This change yields a 1.2–1.3× speedup on the SAM ViT-base vision encoder under default configuration.
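The bit-decomposition idea behind `float_to_int64` can be sketched in plain Python. This is an illustration under stated assumptions, not the shader helper itself: the description does not say how the helper represents the 64-bit result or how it treats NaN, infinity, and out-of-range inputs, so the saturating behavior and the NaN-to-zero mapping below are assumptions.

```python
import struct

INT64_MIN = -(1 << 63)
INT64_MAX = (1 << 63) - 1


def float32_to_int64(x: float) -> int:
    """Truncate a float32 value toward zero using only its IEEE-754 bit fields."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF  # biased exponent (bias 127)
    fraction = bits & 0x7FFFFF      # 23 explicit mantissa bits

    if exponent == 0xFF:
        # Assumption: NaN maps to 0 and infinities saturate.
        if fraction != 0:
            return 0
        return INT64_MIN if sign else INT64_MAX
    if exponent < 127:
        return 0  # |x| < 1, including zeros and subnormals

    significand = fraction | 0x800000  # restore the implicit leading 1
    shift = exponent - 127 - 23        # position of the binary point
    if shift >= 0:
        magnitude = significand << shift
    else:
        magnitude = significand >> -shift  # drops fractional bits: truncation

    value = -magnitude if sign else magnitude
    # Assumption: values outside the int64 range saturate.
    return max(INT64_MIN, min(INT64_MAX, value))


print(float32_to_int64(-2.75), float32_to_int64(3e9), float32_to_int64(float(2**40)))
```

Because the magnitude is truncated before the sign is applied, negative inputs round toward zero rather than toward negative infinity, which matches the truncation semantics named in the description.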
…icrosoft#28757)

### Description

`ResizeOpTest.ResizeOpNearestUpSample_RoundPreferCeil_HalfPixel_2x2to7x8` fails on Intel WebGPU devices:

```
error: The difference between cur_expected[i] and cur_actual[i] is 2, which exceeds tolerance, where cur_expected[i] evaluates to 3, cur_actual[i] evaluates to 1, and tolerance evaluates to 0.00030999997397884727.
```

### Motivation and Context

The WebGPU shader logic used exact half-tie equality checks combined with i32 truncation arithmetic. This approach is fragile under GPU floating-point precision constraints and misbehaves on negative halfway coordinates.

This change replaces the exact comparisons with a robust, epsilon-based fractional tie check (`<= 1e-6`) for both `ROUND_PREFER_CEIL` and `ROUND_PREFER_FLOOR` execution paths.

Also add regression coverage for `round_prefer_floor` in `resize_op_test.cc`:

- `ResizeOpNearestUpSample_RoundPreferFloor_HalfPixel_2x2to7x8`
- `ResizeOpNearestUpSample_RoundPreferFloor_HalfPixel_GH28291_Regression`
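The epsilon-based tie check described above can be sketched as follows. The real change is in the WebGPU shader; this Python version only illustrates the technique, and the function names are hypothetical. The `1e-6` tolerance is the one quoted in the description.

```python
import math

TIE_EPSILON = 1e-6  # tolerance quoted in the description


def round_prefer_ceil(x: float) -> int:
    """Round to nearest; coordinates within TIE_EPSILON of a half round up."""
    base = math.floor(x)
    frac = x - base  # always in [0, 1), also for negative x
    if abs(frac - 0.5) <= TIE_EPSILON:
        return base + 1
    return base + 1 if frac > 0.5 else base


def round_prefer_floor(x: float) -> int:
    """Round to nearest; coordinates within TIE_EPSILON of a half round down."""
    base = math.floor(x)
    frac = x - base
    if abs(frac - 0.5) <= TIE_EPSILON:
        return base
    return base + 1 if frac > 0.5 else base


# A coordinate that should be exactly 2.5 but arrives slightly low still counts as a tie.
print(round_prefer_ceil(2.4999999), round_prefer_floor(2.5000001))
print(round_prefer_ceil(-1.5), round_prefer_floor(-1.5))
```

Using `floor` rather than integer truncation keeps the fractional part in `[0, 1)` for negative coordinates, which is where truncation-based arithmetic tends to go wrong.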
…icrosoft#29030)

### Description

`GatherBlockQuantized` on WebGPU computes the dispatch size as `ceil(output_size / 64)`, which is 0 when the gather output is empty (e.g. an empty indices tensor), and `NormalizeDispatchGroupSize` rejects the (0, 1, 1) dispatch. Return early for an empty output instead, matching other WebGPU kernels.

Adds `GatherBlockQuantizedOpTest.WebGpu_EmptyIndices_8Bits_Uint8`.

### Motivation and Context

Fixes microsoft#28772.
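The failure mode and the early return can be shown with a few lines of Python. This is a sketch of the arithmetic only; `plan_dispatch` is a hypothetical name and does not correspond to a function in the WebGPU EP.

```python
from typing import Optional, Tuple

WORKGROUP_SIZE = 64  # divisor from the description: ceil(output_size / 64)


def plan_dispatch(output_size: int) -> Optional[Tuple[int, int, int]]:
    """Return the (x, y, z) dispatch group counts, or None when there is nothing to run."""
    if output_size == 0:
        # Empty output (e.g. empty indices): skip the dispatch instead of
        # submitting an invalid (0, 1, 1) group size.
        return None
    groups = (output_size + WORKGROUP_SIZE - 1) // WORKGROUP_SIZE  # integer ceil
    return (groups, 1, 1)


print(plan_dispatch(0), plan_dispatch(1), plan_dispatch(64), plan_dispatch(65))
```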
### Description

Bump version number after WebGPU plugin EP 0.2.0 release branch creation.

### Motivation and Context

Preparing for WebGPU plugin EP 0.2.0 release.
Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.