Sync with Microsoft ONNX Runtime - 14062026 #1137
Open
ai-fw-intg wants to merge 21 commits into
### Description

Move the existing model package C API off the stable `OrtApi` onto the experimental name-based lookup mechanism added in microsoft#28746. Each model package function is registered individually in `include/onnxruntime/core/session/onnxruntime_experimental_c_api.inc` with the `OrtModelPackageApi_` prefix and the `_SinceV28` version suffix, following the lifecycle rules in `docs/design/Experimental_C_API.md`.

Headline changes:

- `OrtApi::GetModelPackageApi`, the `OrtModelPackageApi` struct, `OrtApis::GetModelPackageApi`, the `OrtModelPackageAPI` namespace, `onnxruntime/core/session/model_package_api.h`, and the C++ wrappers (`Ort::GetModelPackageApi`, `ORT_DEFINE_RELEASE_FROM_API_STRUCT(ModelPackage*)`, `ModelPackageOptions/Context/ComponentContext`) are removed.
- Opaque handle types (`OrtModelPackageOptions`, `OrtModelPackageContext`, `OrtModelPackageComponentContext`) move into `onnxruntime_experimental_c_api.h`.
- All 15 model package functions are registered in `onnxruntime_experimental_c_api.inc`. Impls move into `namespace OrtExperimentalApis` with `_SinceV28`-suffixed names in `model_package_api.cc`; bodies are unchanged.
- `experimental_c_api.cc` gains a forward-decl block (driven by the same `.inc` X-macro) so the auto-generated registration table can take the address of every entry, even those defined in `model_package_api.cc`.
- The Python bindings (`PyModelPackageContext` / `PyModelPackageOptions` / `PyModelPackageComponentContext` and their `onnxruntime.__init__` exports) are removed. Per the design doc we start the experimental API in C/C++ only.
- `onnxruntime/test/autoep/test_model_package.cc` switches to a local `ModelPackageFns` struct populated through the `Ort::Experimental::Get_OrtModelPackageApi_*_Fn(api)` typed accessors.

Consumer usage going forward, in C++:

```cpp
#include "onnxruntime_c_api.h"
#include "onnxruntime_experimental_c_api.h"

const OrtApi* ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);
if (auto* fn = Ort::Experimental::Get_OrtModelPackageApi_CreateModelPackageContext_SinceV28_Fn(ort)) {
  OrtModelPackageContext* ctx = nullptr;
  Ort::ThrowOnError(fn(ORT_TSTR("/path/to/pkg"), &ctx));
  // ...
}
```

### Motivation and Context

The model package API was added to the stable `OrtApi` in 1.27 but has not shipped in a release yet. Now that microsoft#28746 has landed the experimental C API framework, the right home for an iterating preview surface like model package is behind `OrtApi::GetExperimentalFunction`, not on the stable struct. Moving it to experimental:

- frees us to change signatures (each name is uniquely versioned) without breaking the stable ABI;
- gives consumers a clear "is this specific thing available?" contract instead of a struct that *looks* stable but isn't;
- lets the surface be promoted to stable cleanly later (move entries to `OrtApi`, drop the `_SinceV<N>` suffix, remove the experimental entries).

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icrosoft#28978)

## Summary

The CUDA QMoE INT4/INT8 grouped GEMM always dispatches to the Ampere (SM80) CUTLASS kernel, even on Hopper (SM90), because mixed int-weight + fp16/bf16 activation is not a valid Hopper TMA warp-specialized specialisation. This PR makes weight prepacking always emit the SM80 (column-interleaved) `fpA_intB` layout regardless of the runtime device SM, fixing silently-wrong output on Hopper, and centralizes the arch-clamping logic in a single shared helper. It also cleans up the related tests and tightens MoE parity tolerances that were too loose to catch the layout bug.

## Motivation

microsoft#28749 uses 90 for sm90 weight prepacking. On SM90, `isValidHopperMOESpecialisation<half_t, uint4b_t/uint8_t>()` is `false`, so the grouped MoE GEMM falls back to the SM80 kernel. The weight preprocessor, however, skips column interleaving for `arch == 90`, so an auto-detected (`force_arch=-1`) pack on an H200 produced the non-interleaved SM90 layout that the SM80 kernel cannot consume, yielding wrong results. The previous `PrePackIntExpertWeights` logic clamped to `sm_` (passing SM90 through), and the test that exercised the offline packer used auto-detect, so both could emit the wrong layout.

## Key Changes

| Area | Change |
|---|---|
| `fpA_intB_gemm_preprocessors{.h,_impl.cu}` | Extracted `get_arch_for_mixed_gemm_weight_preprocess(int arch)` as a shared, declared helper (clamps SM to the layout group: `<80→75`, `90→90`, else `80`). |
| `fpA_intB_gemm_preprocessors_impl.h` | `getLayoutDetailsForTransform` now routes through the shared helper instead of duplicating the arch-range logic. |
| `moe_quantization.cc` (`PrePackIntExpertWeights`) | Always packs INT4/INT8 expert weights for the SM80 layout (`get_arch_for_mixed_gemm_weight_preprocess(80)`) instead of clamping to the runtime `sm_`, since the SM80 kernel runs on every GPU. |
| `onnxruntime_pybind_quant.cc` (`PackWeightsForMixedGemm`) | Replaced the ad-hoc `{75,80,90}` allowlist with the shared helper, so `force_arch` is clamped consistently with the runtime dispatch (removes the now-unused `<set>` include). |
| `contrib_defs.cc` / `moe_quantization.h` | Updated `weights_prepacked` schema/field docs: layouts for `-1`/`1` are EP-determined; for the CUDA EP `-1` and `1` are equivalent today (both SM80), `1` reserved for a future Hopper-specific layout. |
| `test_qmoe_cuda.py` | Removed the dead, never-called `preprocess_weights_for_mixed_gemm` helper; the real path (`quant_dequant_blockwise`) already pins `sm=80`. |
| `test_moe_cuda.py` | Pinned the offline packer to `arch=80`, and tightened FP16 QMoE parity tolerance from `atol 3.0 (4-bit)` / `2.0 (8-bit)` to `0.5` now that the layout is correct. |
| `docs/` | Regenerated `ContribOperators.md` and updated `moe_qmoe.md` to match the new schema docs and SM80-always packing rationale. |

## Testing Notes

On an H200 (SM90), with the CUDA 12.x/13.x Python wheel:

```bash
python -m pytest onnxruntime/test/python/transformers/test_qmoe_cuda.py
python -m pytest onnxruntime/test/python/transformers/test_moe_cuda.py -k "PhiQMoE or qmoe"
```

- `test_qmoe_cuda.py` SwiGLU parity: SM80 layout → max diff ~0.001 (pass, tol 0.1); the prior SM90 layout produced max diff ~1.2 (fail), confirming the fix.
- `test_moe_cuda.py` `TestPhiQMoE` (4-bit and 8-bit, all batch/seq combinations): worst observed `max_diff` ≈ 0.375 with the fixed layout, comfortably under the new `atol=0.5`.
- `ruff check` passes on both edited test files.

---------

Co-authored-by: tlwu <tlwu@example.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
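The arch clamping described in the QMoE prepacking change above (`<80→75`, `90→90`, else `80`) can be sketched in a few lines. This is a Python illustration of the rule as the description states it, not the CUDA/C++ helper itself; `arch_for_qmoe_int_prepack` is a hypothetical name introduced here to show the "always pack for SM80" decision.

```python
def get_arch_for_mixed_gemm_weight_preprocess(arch: int) -> int:
    """Clamp an SM version to its weight-layout group, per the PR description."""
    if arch < 80:
        return 75
    if arch == 90:
        return 90
    return 80


def arch_for_qmoe_int_prepack(runtime_sm: int) -> int:
    """Hypothetical wrapper: pick the layout for INT4/INT8 expert weights.

    The runtime SM is deliberately ignored, because the description says the
    grouped GEMM for these weights always runs the SM80 kernel.
    """
    del runtime_sm
    return get_arch_for_mixed_gemm_weight_preprocess(80)


if __name__ == "__main__":
    for sm in (75, 86, 89, 90):
        print(sm, get_arch_for_mixed_gemm_weight_preprocess(sm), arch_for_qmoe_int_prepack(sm))
```

The point of the wrapper is that passing the runtime SM through the clamp would still yield `90` on Hopper, which is the layout the SM80 kernel cannot consume.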
… length (microsoft#28389)

## Summary

- Extend the FlashAttention decode path to work with any sequence length (not just seq_len=1), with causal masking and `use_seqlen_k` support for static KV cache
- Add m_tile optimization to process multiple Q rows per workgroup (m_tile=1/2/4), amortizing K/V loads
- Fuse the separate QKT and SplitVx shaders into a single QKV kernel using online softmax, eliminating the intermediate `qk` tensor (`B×H×seq×present_seq`) and reducing dispatch count from 3 to 2
- Route between prefill (FlashAttentionProgram) and split-reduce (fused QKV + VxReduce) paths based on sequence length

## Resolved Issues

**Whisper decoding prefill improved from 4.68ms to 1.09ms.** Whisper's decoder attention has a small sequence length but large total sequence length (seq_len=4, total_seq_len=1500). The default prefill shader (FlashAttentionProgram) has low parallelism in this case because each workgroup iterates serially over the full KV cache. The split-reduce path tiles the KV dimension across workgroups, achieving much higher GPU occupancy for this workload shape.

## Details

**Fused QKV kernel**: Each workgroup computes QK^T dot products, applies attention bias and causal mask, computes local softmax (per-tile max and sum), normalizes, and multiplies by V, all in one kernel. Per-tile metadata (max, sum) is written for the VxReduce shader to rescale partial outputs using online softmax: `output = Σ(partial_i × local_sum_i × exp(local_max_i - global_max)) / global_sum`.

**Path routing** (`use_split_reduce`): The split-reduce path is used when `sequence_length_ < 32`; otherwise the single-kernel FlashAttentionProgram prefill path is used. Microbenchmarks on Phi-4 (32 heads, head_size 128, GQA group 3) show split-reduce is 1.13×-2.07× faster than the fused prefill kernel across `sequence_length ∈ {16, 30, 31}` × `total_sequence_length ∈ {128, 500, 2000}`. The previous heuristic additionally gated on `total_sequence_length_ > 1000`, but that signal is 0 under graph capture (seqlen_k lives on the GPU) and the carve-out is unnecessary because split-reduce is uniformly faster for short Q.

## Test plan

- [x] 30/30 MHA unit tests pass
- [x] phi4-graph-prune produces correct output
- [x] whisper-tiny-int4 produces correct transcription
- [x] clang-format clean
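The split-reduce rescaling formula quoted in the FlashAttention description above can be checked numerically. The sketch below is plain Python rather than a WebGPU shader, uses a single query row with scalar values, and skips bias and causal masking. It also assumes `global_sum` is the sum of `local_sum_i × exp(local_max_i - global_max)` over tiles, which is the usual online-softmax definition but is not spelled out in the description.

```python
import math


def tile_partial(scores, values):
    """Per-tile work: local max, local sum, and the locally normalized output."""
    local_max = max(scores)
    weights = [math.exp(s - local_max) for s in scores]
    local_sum = sum(weights)
    partial = sum(w * v for w, v in zip(weights, values)) / local_sum
    return partial, local_max, local_sum


def reduce_tiles(tiles):
    """Combine per-tile partial outputs using the rescaling formula from the text."""
    global_max = max(local_max for _, local_max, _ in tiles)
    scales = [local_sum * math.exp(local_max - global_max) for _, local_max, local_sum in tiles]
    global_sum = sum(scales)  # assumed definition, see lead-in
    return sum(partial * scale for (partial, _, _), scale in zip(tiles, scales)) / global_sum


def reference_attention(scores, values):
    """Single-pass softmax(scores) . values, for comparison."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def use_split_reduce(sequence_length):
    """Routing rule quoted in the description."""
    return sequence_length < 32


if __name__ == "__main__":
    scores = [0.5, 2.0, -1.0, 3.0, 0.1, 1.5]
    values = [1.0, -2.0, 0.5, 4.0, 3.0, -1.0]
    tiles = [tile_partial(scores[i:i + 2], values[i:i + 2]) for i in range(0, len(scores), 2)]
    print(reduce_tiles(tiles), reference_attention(scores, values))
```

Because each tile only needs its own max and sum, tiles can be processed by independent workgroups and merged afterwards, which is where the extra parallelism for short Q and long KV comes from.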
This pull request introduces important safety checks to prevent out-of-bounds access in the logits processing code for transformers. The main updates ensure that token IDs are validated against the vocabulary size before being used, which improves robustness and prevents potential crashes.

**Safety and robustness improvements:**

* Added bounds checking for token IDs in the `RepetitionPenaltyLogitsProcessor<T>::Process` method to ensure only valid IDs are used when accessing `beam_token_scores`.
* Added bounds checking for token IDs in the `NoRepeatNGramLogitsProcessor<T>::Process` method to prevent out-of-bounds writes to `beam_token_scores`.
* Updated the `NextTokenScores::SetScore` method to return early if the provided `token_id` is out of bounds, replacing the previous assert with a safe check.
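The bounds checks listed above follow one pattern: validate the token ID against the vocabulary size before touching the score buffer. A minimal Python sketch of that pattern follows. The function names are illustrative, and the penalty formula (multiply negative scores, divide positive ones) is the common convention, assumed here rather than taken from the PR.

```python
def set_score(scores, token_id, value):
    """Write a score only when token_id is inside the vocabulary; otherwise do nothing."""
    if token_id < 0 or token_id >= len(scores):
        return
    scores[token_id] = value


def apply_repetition_penalty(scores, seen_token_ids, penalty):
    """Penalize already-generated tokens, skipping IDs outside the vocabulary."""
    vocab_size = len(scores)
    for token_id in set(seen_token_ids):
        if not 0 <= token_id < vocab_size:
            continue  # invalid ID: skip instead of reading or writing out of bounds
        score = scores[token_id]
        # Assumed convention: push the score away from being selected again.
        scores[token_id] = score * penalty if score < 0 else score / penalty


if __name__ == "__main__":
    scores = [1.0, -2.0, 4.0]
    apply_repetition_penalty(scores, [0, 1, 7, -1], penalty=2.0)
    set_score(scores, 5, 9.0)
    print(scores)
```

In Python a negative index would silently wrap around, so the explicit lower-bound check matters here just as it does for the raw buffer access in C++.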
…8703)

## Description

This PR adds Linux NPU discovery through sysfs accel devices.

Currently, `DeviceDiscovery::DiscoverDevicesForPlatform()` on Linux discovers CPU and GPU devices, but NPU discovery is still missing. As a result, plugin execution providers that filter devices by `OrtHardwareDeviceType_NPU` do not receive any NPU hardware devices on Linux, even when the NPU is present and exposed by the kernel.

This change scans `/sys/class/accel` for `accelN` devices and creates `OrtHardwareDevice` entries with:

- `type = OrtHardwareDeviceType_NPU`
- PCI `vendor_id`
- PCI `device_id`
- `accel_idx` metadata
- `pci_bus_id` metadata when available

This enables Linux systems with NPUs exposed through the accel subsystem, such as AMD Ryzen AI / XDNA devices, to be reported through ORT device discovery and made available to plugin EP factories.

## Changes

- Add Linux sysfs discovery for NPU devices under `/sys/class/accel`.
- Read NPU PCI vendor and device IDs from the underlying sysfs device path.
- Add NPU metadata including `accel_idx` and `pci_bus_id`.
- Include discovered NPU devices in `DeviceDiscovery::DiscoverDevicesForPlatform()`.
- Add a `kSysfsAccelPath` constant for the Linux accel sysfs path.

## Motivation

Linux plugin EPs that target NPUs rely on ORT passing `OrtHardwareDeviceType_NPU` devices into `GetSupportedDevices()`. Without Linux NPU discovery, those EPs cannot claim NPU devices and provider selection policies such as `PREFER_NPU` silently fall back to CPU.

Fixes microsoft#28660.
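The sysfs scan described in the NPU discovery change can be sketched as follows. This is a Python illustration, not the C++ code in the PR. It assumes each `accelN` entry has a `device` link to a PCI device directory containing `vendor` and `device` files with hex IDs, which is how PCI devices normally appear in sysfs, and that the PCI bus ID is the name of that resolved directory. The function name and the dictionary layout are hypothetical.

```python
import re
from pathlib import Path

ACCEL_NAME = re.compile(r"^accel(\d+)$")
PCI_ADDRESS = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def read_hex_id(path):
    """Read a sysfs ID file such as '0x1022'; return None if missing or malformed."""
    try:
        return int(path.read_text().strip(), 16)
    except (OSError, ValueError):
        return None


def discover_npus(sysfs_accel_path="/sys/class/accel"):
    """Return one dict per accelN entry that exposes PCI vendor and device IDs."""
    root = Path(sysfs_accel_path)
    if not root.is_dir():
        return []
    devices = []
    for entry in sorted(root.iterdir()):
        match = ACCEL_NAME.match(entry.name)
        if not match:
            continue
        device_dir = entry / "device"  # assumed link to the underlying PCI device
        vendor_id = read_hex_id(device_dir / "vendor")
        device_id = read_hex_id(device_dir / "device")
        if vendor_id is None or device_id is None:
            continue
        metadata = {"accel_idx": match.group(1)}
        bus_id = device_dir.resolve().name
        if PCI_ADDRESS.match(bus_id):  # only recorded when it looks like a PCI address
            metadata["pci_bus_id"] = bus_id
        devices.append({"type": "NPU", "vendor_id": vendor_id,
                        "device_id": device_id, "metadata": metadata})
    return devices


if __name__ == "__main__":
    print(discover_npus())
```

Taking the sysfs root as a parameter keeps the scan testable against a temporary directory on machines without an NPU.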
…RT version (microsoft#28794)

### Description

Adds a new telemetry event for inference failures that logs the EP versions and types along with the runtime error. Also adds the ORT version to the other telemetry events and the EP versions to the SessionCreation telemetry event.

### Motivation and Context

To better diagnose inference failures.

---------

Co-authored-by: Darshak Bhatti <dabhatti@micorsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description

Fix STFT frame pointer arithmetic for complex-valued input so frame starts are computed in input samples, not trailing real/imag components. Since the frame view pointer is `U*`, one pointer increment advances one full real or complex sample.

Also add validation that `frame_step` is positive and keep a defensive bounds check before creating non-owning tensor views.

Review feedback addressed: simplified the frame pointer arithmetic, fixed the swapped STFT input comments, documented the defensive bounds check, and added double-complex regression coverage. The new STFT validation/regression tests exclude `kDmlExecutionProvider` because these CPU STFT validation/regression paths do not consistently match DirectML behavior in Windows GPU CI.

### Motivation and Context

For complex input shaped `[batch_size, signal_length, 2]`, pointer increments already advance by one real/imag pair. Multiplying frame offsets by `signal_components == 2` again can advance past the valid frame start, allowing later frames to read across batches or beyond the input allocation.

### Testing

- `git diff --check -- onnxruntime/core/providers/cpu/signal/dft.cc onnxruntime/test/providers/cpu/signal/signal_ops_test.cc`
- `.\.venv\Scripts\python.exe tools\ci_build\build.py --config RelWithDebInfo --build --parallel --target onnxruntime_provider_test --build_dir build\Windows`
- `.\onnxruntime_provider_test.exe --gtest_filter="SignalOpsTest.STFTFloat:SignalOpsTest.STFTFrameStepMustBePositive:SignalOpsTest.STFTFloatComplexInputBatched:SignalOpsTest.STFTDoubleComplexInputBatched"` from `build\Windows\RelWithDebInfo\RelWithDebInfo`

---------

Co-authored-by: Gopalakrishnan Nallasamy <gnallasamy@microsoft.com>
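The STFT offset bug described above comes down to which unit the frame offset is counted in. The Python sketch below models a complex signal as a list of `complex` values, so one index step is one full sample, just as one `U*` increment is in the description. The function names are illustrative, and `buggy_frame_start` reproduces the double-scaling only to show how it overshoots.

```python
def frame_start(frame_idx, frame_step):
    """Frame start counted in samples; the view is already typed per sample."""
    return frame_idx * frame_step


def buggy_frame_start(frame_idx, frame_step, signal_components=2):
    """The faulty variant: scales by the real/imag component count a second time."""
    return frame_idx * frame_step * signal_components


def frame_view(signal, frame_idx, frame_step, window_size):
    """Return one frame after validating frame_step and the frame bounds."""
    if frame_step <= 0:
        raise ValueError("frame_step must be positive")
    start = frame_start(frame_idx, frame_step)
    if start + window_size > len(signal):
        raise ValueError("frame extends past the end of the signal")
    return signal[start:start + window_size]


if __name__ == "__main__":
    signal = [complex(i, -i) for i in range(8)]
    print(frame_view(signal, frame_idx=2, frame_step=2, window_size=4))
    print(frame_start(2, 2), buggy_frame_start(2, 2), len(signal))
```

With 8 samples, a step of 2 and a window of 4, frame 2 starts at sample 4 and fits exactly; the double-scaled offset would start at 8, which is already past the end of this batch entry.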
…icrosoft#28965)

### Description

When a QMoE model sets `weights_prepacked=0` (raw `[E, N, K/pack]` int weights) and the session has `session.disable_prepacking`, `PrePack()` never runs, so `packed_fc{1,2}_weights_` stay null and `int_weights_consumed_by_prepack` is false. The code then falls through to the raw initializer pointers, but those bytes are not in CUTLASS layout, so the runner consumes them as-if-prepacked and produces silently wrong output with no diagnostic.

Changes in `onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc` (`QMoE::ComputeInternal`):

- **Int path**: Added a defensive `INVALID_ARGUMENT` guard: when `is_int && !weights_prepacked_` but either prepack buffer is null, return a clear error instead of feeding non-CUTLASS bytes to the runner.
- **wfp4afp8 native path**: Same fall-through (`packed_fp4_fc{1,2}_weights_ ? ... : raw`) replaced with an explicit guard that errors when the repacked FP4 buffers were not produced.

Also added a focused regression test in `onnxruntime/test/contrib_ops/moe_test.cc` covering `quant_type='int'` with `weights_prepacked=0` and `session.disable_prepacking=1`, asserting that QMoE fails with an actionable error instead of producing output.

Merged the branch with the latest `main`.

### Motivation and Context

A prior fix removed the null-pointer crash on this path but left a misleading-success outcome that is newly user-reachable via the `weights_prepacked=0` contract, which is the exact silent-failure mode the offline-path work set out to eliminate. These guards convert that into a loud, actionable error. The wfp4afp8 branch shares the same fall-through and is hardened for consistency.

The added regression test ensures this fail-loudly behavior remains covered going forward, especially when prepacking is disabled at the session level.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
This pull request strengthens shape inference validation for several custom BERT-related ONNX operators by adding explicit rank checks for input tensors. These changes ensure that input tensors meet minimum rank requirements, improving error messaging and preventing incorrect shape propagation.

**Enhanced shape validation for custom ONNX operators:**

*RelativePositionBias and GatedRelativePositionBias:*

- Added checks to ensure `bias_table` (for `RelativePositionBias`) and `token_offset` (for `GatedRelativePositionBias`) inputs have rank ≥ 2, with clear error messages if not.

*CausalConvWithState:*

- Added checks to ensure both `input` and `weight` tensors have rank ≥ 2, failing shape inference with descriptive errors if violated.

*LinearAttention:*

- Added checks to ensure `query` and `value` tensors have rank ≥ 3 for both output and state shape inference, with early returns or errors if requirements are not met.

*SkipLayerNormalization:*

- Added a check to ensure the `input` tensor has rank ≥ 1, improving error reporting for invalid input shapes.
…soft#28980)

### Description

Optimizes the CUDA QMoE router top-k (`LaunchSoftmaxTopK`) for small-batch / autoregressive decode by replacing the old one-thread-per-row hot path with parallel CUB and warp-level top-k kernels. The dispatch now uses the fastest specialized path for common MoE expert counts while preserving the existing softmax normalization and deterministic lower-index tie-breaking semantics.

This PR also factors the warp-level top-k sorting code into a reusable CUDA helper header and adds direct CUDA-internal tests so the new routing paths are covered independently of higher-level QMoE tests.

### Motivation and Context

The previous router path launched a 256-thread block per row but did all top-k work in a single thread. In decode scenarios such as `num_rows == 1`, that made the router latency-bound on a serial scan of all expert logits and turned `SoftmaxTopKKernel` into a major MoE decode bottleneck.

For a Qwen3-style MoE workload with 256 experts, top-8 routing, and 40 MoE layers, the original router accounted for roughly 50% of decode GPU time. Moving the work to block/warp-parallel kernels removes that bottleneck while keeping the same output ordering and scaling behavior.

### Key Changes

| Area | Change |
|---|---|
| QMoE router dispatch | Adds `DispatchSoftmaxTopK` routing for `k <= 64` and `num_experts <= 1024`, with a fallback to the original scalar kernel for larger or uncommon shapes. |
| Tiny expert counts | Adds `SoftmaxTopKWarpBitonicKernel` for `num_experts <= 32`, using one warp per row and in-register bitonic sorting via warp shuffles. |
| Small expert counts | Adds `SoftmaxTopKWarpMergeKernel` for `32 < num_experts <= 64`, using a single warp and CUB warp merge sort. |
| Larger common MoE counts | Uses `SoftmaxTopKMergeKernel` with CUB block merge sort for `num_experts <= 128`, `256`, `512`, and `1024`. |
| Reusable top-k helpers | Adds `onnxruntime/core/providers/cuda/cu_inc/topk_warp_sort.cuh` with reusable warp bitonic and warp merge sort helpers. |
| Stable tie-breaking | Packs `(score, index)` into a `uint64_t` stable sort key for the CUB merge paths, matching onnxruntime-genai's lower-index tie-breaking and avoiding compound comparators. |
| Softmax cleanup | Factors shared softmax scale, safe reciprocal, top-k normalization, warp reduction, and CUB block reduction helpers to keep the optimized kernels consistent. |
| Tests | Adds CUDA-internal `SoftmaxTopK_*` tests covering warp bitonic, warp merge, block merge, stable ties, normalization, `float`, `half`, and `bfloat16`. |

### Performance

H200 measurements for the target QMoE decode scenario showed the router cost dropping from roughly `5.56 ms/token` to `0.17 ms/token`, improving end-to-end Qwen3.6-35B-A3B INT4 decode throughput from about `80 tok/s` to `113 tok/s`.

Additional profiling of the `32 < num_experts <= 64` warp merge path showed the packed `uint64_t` stable sort key is consistently faster than a `{float, int}` struct comparator on H200:

| Experts | Sort-only packed/struct | Full softmax+top-k packed/struct |
|---:|---:|---:|
| 33 | 0.680x | 0.704x |
| 48 | 0.672x | 0.695x |
| 64 | 0.673x | 0.696x |

### Testing

- `lintrunner -a`
- `ninja onnxruntime_providers_cuda_ut`
- `ninja onnxruntime_provider_test`
- `GTEST_FILTER='CUDA_EP_Unittest.SoftmaxTopK_*' ./onnxruntime_provider_test --gtest_filter='CUDA_EP_Unittest.All'`
- `onnxruntime/test/python/transformers/test_qmoe_cuda.py -k parity` (`44 passed`)
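The "stable tie-breaking" row above packs `(score, index)` into a single 64-bit key so that one integer comparison orders by score first and by lower index on ties. The exact bit layout used by the PR is not shown in the description, so the Python sketch below is one common construction under that assumption: map the float32 bits to an order-preserving unsigned value for the high word, and store the inverted index in the low word.

```python
import struct


def float_to_ordered_u32(x):
    """Map a float32 to a uint32 whose unsigned order matches the float order."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if bits & 0x80000000:
        return bits ^ 0xFFFFFFFF  # negative: invert all bits
    return bits | 0x80000000      # non-negative: set the top bit


def pack_key(score, index):
    """High 32 bits: ordered score. Low 32 bits: inverted index, so lower index wins ties."""
    return (float_to_ordered_u32(score) << 32) | (0xFFFFFFFF - index)


def top_k(scores, k):
    """Indices of the k largest scores, ties broken toward the lower index."""
    keys = sorted((pack_key(s, i) for i, s in enumerate(scores)), reverse=True)
    return [0xFFFFFFFF - (key & 0xFFFFFFFF) for key in keys[:k]]


if __name__ == "__main__":
    print(top_k([0.1, 0.7, 0.7, 0.2, 0.7], 3))
```

Because the key is a plain integer, any sort, stable or not, yields a deterministic order, which is what lets a generic merge sort be used without a compound `{float, int}` comparator.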
…ft#29022)

Bumps [shell-quote](https://github.com/ljharb/shell-quote) from 1.8.3 to 1.8.4.

Changelog, sourced from [shell-quote's changelog](https://github.com/ljharb/shell-quote/blob/main/CHANGELOG.md):

[v1.8.4](https://github.com/ljharb/shell-quote/compare/v1.8.3...v1.8.4) - 2026-05-22

Commits:

- [`ff166e2`](https://github.com/ljharb/shell-quote/commit/ff166e2b63eb5f932bd131a8886a99e9afdf45ae) v1.8.4
- [`4378a6e`](https://github.com/ljharb/shell-quote/commit/4378a6e613db5948168684864e49b42b83134d2d) [Fix] `quote`: validate object-token shapes
- [`22ebec0`](https://github.com/ljharb/shell-quote/commit/22ebec04349065a45ad8afc8cc8d53c4624634a6) [Dev Deps] update `@ljharb/eslint-config`, `auto-changelog`, `eslint`, `npmignore`
- [`9f3caa3`](https://github.com/ljharb/shell-quote/commit/9f3caa31900cc6ee64858b31134144c648ce206d) [Tests] increase coverage
- [`3344a04`](https://github.com/ljharb/shell-quote/commit/3344a047dd1e95f71c4ca27522cbfd05c56277e0) [readme] replace runkit CI badge with shields.io check-runs badge
- [`699c511`](https://github.com/ljharb/shell-quote/commit/699c5113d135f4d4591574bebf173334ffa453d4) [Dev Deps] update `@ljharb/eslint-config`
- See full diff in [compare view](https://github.com/ljharb/shell-quote/compare/v1.8.3...v1.8.4)

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…28982)

Closes part of microsoft#27729.

The CUDA EP registered Softplus and Softsign only for opset 1 (HFD). Opset 22 extended the type constraint to include BFloat16, but the CUDA EP was never updated. The BFloat16 compute kernels were already instantiated via `UNARY_ACTIVATION_OP_HFD_WITH_BF16`; only the EP registration was missing.

Changes:

- Cap opset-1 kernels as versioned [1, 21] (Softplus, Softsign)
- Register opset-22 kernels with BFloat16 support (HFDX)
- Add opset-22 tests: float (CPU+CUDA) and BFloat16 (CUDA, sm≥530)
- Update `docs/OperatorKernels.md`
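The effect of capping the old registration at [1, 21] and adding an opset-22 entry can be pictured as a lookup over versioned ranges. The Python sketch below is a toy model, not the ORT kernel registry. It assumes `HFD` stands for half/float/double and `HFDX` adds BFloat16, which is how the description reads; the table layout and `find_kernel` are hypothetical.

```python
HFD = frozenset({"float16", "float", "double"})
HFDX = HFD | {"bfloat16"}

# (op, since_version, end_version or None for open-ended, supported types)
REGISTRATIONS = [
    ("Softplus", 1, 21, HFD),
    ("Softplus", 22, None, HFDX),
    ("Softsign", 1, 21, HFD),
    ("Softsign", 22, None, HFDX),
]


def find_kernel(op, opset, dtype):
    """Return the (since, end) range of the registration that serves this node, or None."""
    for name, since, end, types in REGISTRATIONS:
        in_range = since <= opset and (end is None or opset <= end)
        if name == op and in_range and dtype in types:
            return since, end
    return None


if __name__ == "__main__":
    print(find_kernel("Softplus", 13, "float"))
    print(find_kernel("Softplus", 22, "bfloat16"))
    print(find_kernel("Softplus", 13, "bfloat16"))
```

Closing the first range at 21 keeps the two entries from overlapping, so an opset-22 node always resolves to the BFloat16-capable registration.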
…soft#28966)

This pull request expands support for additional ONNX opset versions in the attention fusion optimization code, making the optimizer compatible with newer and more diverse ONNX models. The changes primarily update the accepted opset versions for various operators such as `Transpose`, `Reshape`, `Squeeze`, `Unsqueeze`, `Shape`, and others across multiple functions. This ensures broader model compatibility and improves the robustness of the fusion logic.

**Expanded opset version support for attention fusion:**

* Updated accepted opset versions for key operators (`Transpose`, `Reshape`, `Squeeze`, `Unsqueeze`, `Shape`, `Add`, `Mul`, `Sub`, `Div`, `Cast`, etc.) in the main attention fusion logic (`attention_fusion.cc`), allowing matching and fusion of newer ONNX models using these operators at opsets up to 25.

**Helper and mask subgraph matching improvements:**

* Broadened opset version checks for subgraph matching in helper functions, including those for Gemm subgraphs, unidirectional mask subgraphs, input mask subgraphs, and past subgraph matching, to support additional opset versions and operator variants.

These changes collectively future-proof the attention fusion optimizer for a wider range of ONNX models and operator versions, reducing the likelihood of unsupported patterns during optimization.
### Description

Add documentation for `OrtErrorCode` enum and its values.

### Motivation and Context

Provide some documentation about what the error codes mean.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This pull request introduces a new "name-based layer assignment" feature for ONNX Runtime, allowing device assignment of model nodes based on substring matching of node names, as an alternative to annotation-based matching. The implementation ensures that name-based and annotation-based assignment modes are mutually exclusive, and updates documentation, configuration keys, and core logic to support this new capability.

**Key changes:**

#### Name-based Layer Assignment Feature

- Added support for a new session option, `session.name_based_layer_assignment`, which enables device assignment using substring matching against node names (rather than metadata annotations). The longest matching pattern wins, and all patterns are treated as substrings (the `=` exact-match qualifier is disallowed in this mode).
- Implemented the `SubstringMatcher` class and integrated it into the `LayeringIndex` logic, enabling efficient substring-based rule matching for node assignment. The matcher sorts patterns by length and returns the first (longest) match.

#### Mutual Exclusivity and API Changes

- Enforced mutual exclusivity between annotation-based (`session.layer_assignment_settings`) and name-based (`session.name_based_layer_assignment`) assignment options. Attempting to set both now results in an error, with clear messaging.
- Updated the `LayeringIndex::Create` API and related logic to accept both configuration strings, select the active mode, and construct the appropriate matcher.

#### Documentation Updates

- Expanded the documentation to describe the new name-based assignment feature, provide usage examples, highlight best practices for pattern writing, and explain the mutual exclusivity and lack of subgraph inheritance in name-based mode.

#### Core Logic and Maintenance

- Refactored the `LayeringIndex` and related methods to support both matching modes, updating node assignment and update logic to branch appropriately based on the selected mode.
- Added and documented the new configuration key `kOrtSessionOptionsNameBasedLayerAssignment` in the public API headers.

These changes provide a more flexible and user-friendly way to partition models across devices, especially for models with structured node names but lacking explicit annotations.
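The longest-match rule for name-based layer assignment can be sketched as below. This is a Python illustration of the behavior the description attributes to `SubstringMatcher` (sort patterns by length, return the first match), not the C++ class. The rule representation as `(pattern, device)` pairs, the device labels, and `select_mode` are assumptions made for the example; the real configuration string format is not shown in the description.

```python
class SubstringMatcher:
    """Match node names against substring patterns; the longest pattern wins."""

    def __init__(self, rules):
        # rules: iterable of (pattern, device) pairs. Sorting longest-first means
        # the first hit during a scan is also the longest matching pattern.
        self._rules = sorted(rules, key=lambda rule: len(rule[0]), reverse=True)

    def match(self, node_name):
        for pattern, device in self._rules:
            if pattern in node_name:
                return device
        return None  # no rule applies; the caller falls back to default assignment


def select_mode(annotation_settings, name_based_settings):
    """Hypothetical helper: the two assignment modes are mutually exclusive."""
    if annotation_settings and name_based_settings:
        raise ValueError("annotation-based and name-based layer assignment cannot both be set")
    if name_based_settings:
        return "name"
    return "annotation" if annotation_settings else None


if __name__ == "__main__":
    matcher = SubstringMatcher([("layers.", "gpu"), ("layers.0.attn", "npu"), ("embed", "cpu")])
    for name in ("/model/layers.0.attn/MatMul", "/model/layers.7.mlp/Gemm", "/lm_head/MatMul"):
        print(name, "->", matcher.match(name))
```

A broad pattern such as `layers.` can therefore act as a default, with longer and more specific patterns overriding it for individual blocks.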
… convolutions (microsoft#28565)

### Description

A path for MLAS to support NHWC convolutions without the need for transposes was added in microsoft#26834. This PR expands those changes to also support depthwise convolutions via the same pathway.

### What changed:

- The shared NHWC capability gate in onnxruntime/core/mlas/lib/convolve.cpp:1348 stopped requiring GroupCount == 1. It now allows GroupCount > 1 only when the op is true depthwise, meaning filters_per_group == 1.
- The NHWC transformer in onnxruntime/core/optimizer/nhwc_transformer.cc:162 was updated to pass the real group value and compute filter_count per group instead of hard-coding group 1. That is what lets grouped depthwise Conv/FusedConv nodes get rewritten to com.microsoft.NhwcFusedConv.
- The KleidiAI execution path in onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp:553 learned how to handle grouped NHWC tensors by:
  - gathering one group’s channels out of interleaved NHWC input into a temporary contiguous buffer,
  - running the existing per-group kernel,
  - scattering that group’s output channels back into interleaved NHWC output.
- Tests were added for a working NHWC depthwise case in onnxruntime/test/contrib_ops/fused_conv_test.cc:466, and transformer tests were updated to verify both the new positive case and the expected skip cases in onnxruntime/test/optimizer/nhwc_transformer_test.cc:416.

Performance benchmark tests were added to allow comparison between the new NHWC path and the old NCHW default.

Sample output:

```
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark    Time    CPU    Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SCONV_NCHW/KleidiAiNhwcComparison_NchwBaseline/Rank:2/N:1/G:1/Cpg:64/Fpg:64/I:56/56/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time             508509 ns    508507 ns    1374
SCONV_NCHW/KleidiAiNhwcComparison_NchwBaseline/Rank:2/N:1/G:1/Cpg:128/Fpg:128/I:28/28/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time           700573 ns    700386 ns    997
SCONV_NCHW/KleidiAiNhwcComparison_NchwBaseline/Rank:2/N:1/G:64/Cpg:1/Fpg:1/I:56/56/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time              6471094 ns   6471114 ns   132
SCONV_NCHW/KleidiAiNhwcComparison_NchwBaseline/Rank:2/N:1/G:72/Cpg:1/Fpg:1/I:48/80/K:3/3/P:1/1/1/1/S:2/2/D:1/1/real_time              3768969 ns   3767797 ns   217
SCONV_NHWC_KLEIDIAI/KleidiAiNhwcComparison_NhwcFastPath/Rank:2/N:1/G:1/Cpg:64/Fpg:64/I:56/56/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time    414198 ns    414197 ns    1688
SCONV_NHWC_KLEIDIAI/KleidiAiNhwcComparison_NhwcFastPath/Rank:2/N:1/G:1/Cpg:128/Fpg:128/I:28/28/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time  652454 ns    652454 ns    1074
SCONV_NHWC_KLEIDIAI/KleidiAiNhwcComparison_NhwcFastPath/Rank:2/N:1/G:64/Cpg:1/Fpg:1/I:56/56/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time     6032947 ns   6032940 ns   117
SCONV_NHWC_KLEIDIAI/KleidiAiNhwcComparison_NhwcFastPath/Rank:2/N:1/G:72/Cpg:1/Fpg:1/I:48/80/K:3/3/P:1/1/1/1/S:2/2/D:1/1/real_time     3022041 ns   3018352 ns   227
```

---------

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
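The gather and scatter steps described for grouped NHWC tensors can be sketched as below. This is a simplified illustration under stated assumptions, not the KleidiAI code from the PR: the helper names `GatherGroup` and `ScatterGroup` are hypothetical, `pixels` stands for N*H*W, and the per-group kernel that would run between the two calls is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers. `pixels` is N*H*W and `channels` is the full NHWC channel count.

// Copies the channels of one group out of interleaved NHWC data into a contiguous buffer.
std::vector<float> GatherGroup(const std::vector<float>& nhwc, std::size_t pixels,
                               std::size_t channels, std::size_t group,
                               std::size_t channels_per_group) {
  assert(nhwc.size() == pixels * channels);
  assert((group + 1) * channels_per_group <= channels);
  std::vector<float> packed(pixels * channels_per_group);
  for (std::size_t p = 0; p < pixels; ++p) {
    for (std::size_t c = 0; c < channels_per_group; ++c) {
      packed[p * channels_per_group + c] = nhwc[p * channels + group * channels_per_group + c];
    }
  }
  return packed;
}

// Writes one group's contiguous channels back into their interleaved NHWC positions.
void ScatterGroup(const std::vector<float>& packed, std::size_t pixels, std::size_t channels,
                  std::size_t group, std::size_t channels_per_group, std::vector<float>& nhwc) {
  assert(packed.size() == pixels * channels_per_group);
  assert(nhwc.size() == pixels * channels);
  assert((group + 1) * channels_per_group <= channels);
  for (std::size_t p = 0; p < pixels; ++p) {
    for (std::size_t c = 0; c < channels_per_group; ++c) {
      nhwc[p * channels + group * channels_per_group + c] = packed[p * channels_per_group + c];
    }
  }
}
```

For a true depthwise convolution both the input and output channel counts per group are 1, so each gather pulls a single strided channel into a dense plane.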
…lanned device (microsoft#29013)

### Description

`SaveInitializedTensors` only used a user-supplied initializer `OrtValue` (from `AddExternalInitializers`) in place when the initializer was planned on **CPU**. For any non-CPU (e.g. CUDA) initializer it **always allocated a fresh device tensor and copied into it**, even when the supplied `OrtValue` already lived on the planned device.

This change uses the supplied `OrtValue` directly when its tensor's device matches the planned device, mirroring the existing CPU case and the `AddInitializer` (`initializers_to_share_map`) no-copy path:

```cpp
const auto& graph_value_device = ort_value_from_graph.Get<Tensor>().Location().device;
if (memory_info.device == default_cpu_device || graph_value_device == memory_info.device) {
  // Planned on CPU, or the supplied initializer already lives on the planned device:
  // use it in place (no per-session allocation/copy; enables cross-session sharing).
  ort_value = std::move(ort_value_from_graph);
} else {
  // existing allocate-on-device + CopyTensorFromCPUToDevice fallback (true cross-device case)
}
```

### Motivation

Two benefits:

1. **Avoids a redundant per-session device allocation + device copy** for every externally-supplied initializer that is already on the target device.
2. **Enables cross-session device-memory sharing.** Supplying the *same* device `OrtValue` to multiple sessions (e.g. a large token embedding + `lm_head` shared between a main decoder and an auxiliary speculative-decoding / multi-token-prediction head) now keeps a single device buffer instead of one copy per session. For a large-vocab model this saves ~2 GB of VRAM.

This brings `AddExternalInitializers` in line with `AddInitializer`, which already uses the supplied `OrtValue` in place when its device matches the planned device.

Fixes microsoft#29009.

### Behavior / compatibility

- The CPU path is unchanged (`memory_info.device == default_cpu_device` still short-circuits first).
- The true cross-device case (supplied tensor on a different device than planned) still falls back to allocate + `CopyTensorFromCPUToDevice`, so existing behavior is preserved there.
- No public API change.

### Testing

- Existing `TestExternalInitializersInjection` (CPU) continues to pass (CPU path untouched).
- Validated end-to-end with ONNX Runtime GenAI on CUDA: sharing a fp16 embedding + `lm_head` (1017 MB each) between two sessions that load separate graphs drops the second model's device footprint by ~2145 MB (≈2 GB), with identical inference output vs the non-shared baseline.

> Note: a device-level unit test that asserts the shared buffer is reused (no copy) needs internal
> session-state access plus a GPU EP harness; happy to add one under `test/providers/cuda` if
> reviewers prefer.
### Description

The ADO pipeline `Nodejs_Packaging` stage now uses CFS.

### Motivation and Context

Without CFS, the stage is network isolated and fails.
This pull request strengthens the validation logic for the `block_row_indices` input in the SparseAttention operator and adds a corresponding unit test to ensure that zero-dimension cases are properly rejected.

Validation improvements:

* Updated the check in `Status CheckInputs` (in `sparse_attention_helper.h`) to require that the first dimension of `block_row_indices` is greater than zero, preventing invalid zero-dimension input.

Test coverage:

* Added a new unit test `RejectsZeroDimBlockRowIndices` in `sparse_attention_op_test.cc` to verify that the operator correctly rejects inputs where `block_row_indices` has a zero in its first dimension.
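The kind of first-dimension check described above might look roughly like the sketch below. It is not the code in `sparse_attention_helper.h`: the function name `CheckBlockRowIndicesDims`, the string-based error return, and the error messages are assumptions made for illustration.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical shape check; returns an empty string on success, otherwise an error message.
std::string CheckBlockRowIndicesDims(const std::vector<std::int64_t>& dims) {
  if (dims.empty()) {
    return "block_row_indices must have at least one dimension";
  }
  // Reject a zero (or negative) first dimension before any indexing relies on it.
  if (dims[0] <= 0) {
    return "block_row_indices dim 0 must be greater than zero";
  }
  return std::string();
}
```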
… callbacks (microsoft#28824)

## Description

Lowers the minimum supported ONNX Runtime version for the standalone CUDA plugin EP from **1.26.0** to **1.24.4**, so the plugin binary (built against the latest ORT headers) can be loaded by older ORT runtimes. The plugin negotiates the API version at load time and only advertises EP callbacks the negotiated runtime actually supports, so newer features degrade gracefully on older runtimes instead of crashing.

## Motivation

The plugin is shipped as a separate package and is intended to run against a range of base `onnxruntime` runtimes. The previous hard floor of 1.26.0 was stricter than necessary: an audit of the `\since` annotations shows the plugin only calls APIs introduced in 1.24 or earlier (apart from the optional EP profiler, which is now version-gated). 1.24.4 is also the floor already used by the WebGPU plugin EP, so this aligns the two.

## Key Changes

| Area | Change |
|---|---|
| `plugin-ep-cuda/MIN_ONNXRUNTIME_VERSION` | `1.26.0` → `1.24.4` (single source of truth for the floor) |
| `cmake/onnxruntime_providers_cuda_plugin.cmake` | Reads `MIN_ONNXRUNTIME_VERSION` and bakes it into the DLL as the `ORT_PLUGIN_EP_MIN_ORT_VERSION` compile definition |
| `cuda_plugin_ep.cc` | `CreateEpFactories()` negotiates the runtime API version via `onnxruntime::ep::ApiInit(...)` instead of hard-coding `GetApi(26)` |
| `cuda_plugin_utils.h` | Adds `CudaPluginEpOrtVersionSupported() = min(CurrentOrtApiVersion(), ORT_API_VERSION)`; removes the hard-coded min-version constant |
| 13 callback structs | Report `ort_version_supported`/`version` = `CudaPluginEpOrtVersionSupported()` |
| `cuda_ep.cc` | **Defensive capability gating**: installs each newer `OrtEp` callback only when the negotiated runtime is new enough: `Sync`/`CreateProfiler` require ≥1.25, graph-capture set + `GetAvailableResource` require ≥1.26; otherwise left null |
| `plugin-linux-cuda-test-stage.yml` | Adds a CI step that installs the floor (`MIN_ONNXRUNTIME_VERSION`) base `onnxruntime` and runs the plugin test against it, catching any accidental dependency on a newer API |
| Docs | New §2.6 "API Version Audit and Defensive Capability Gating" in the design doc; QUICK_START min-version test recipe |

## API Version Audit

| API surface | Newest `\since` used |
|---|---|
| `OrtApi` direct calls | 1.23 |
| `OrtEpApi` direct calls | 1.24 |
| EP profiler API (only with `ENABLE_CUDA_PROFILING`) | 1.25 |

Apart from the optional EP profiler, every API the plugin calls is `\since 1.24` or older, justifying the 1.24.4 floor. The profiler's three `\since 1.25` functions are made unreachable on older runtimes by gating the `CreateProfiler` callback.

## Testing Notes

- Incremental build on CUDA 12.8 / SM90: clean, plugin `.so` relinked.
- `test_cuda_plugin_ep.py` against the latest runtime (1.28): **87/87 tests pass**.
- Plugin (built against latest headers) loaded into `onnxruntime==1.24.4`: registers, enumerates all 8 GPUs, and runs inference correctly with the newer callbacks left null.
- `lintrunner` clean on changed files.
- New CI step validates the plugin against the declared floor automatically.
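The version negotiation and defensive capability gating described above can be illustrated with a small sketch. It assumes API versions are plain integers matching the minor release (24 for 1.24, and so on); the `EpCallbacks` struct, the `Callback` type, `Noop`, and both function names are hypothetical stand-ins, not the plugin's actual `OrtEp` callbacks or helpers.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for an EP callback; the real OrtEp callbacks have other signatures.
using Callback = void (*)();
inline void Noop() {}

// Hypothetical callback table; a null entry means "not advertised to the runtime".
struct EpCallbacks {
  Callback sync = nullptr;
  Callback create_profiler = nullptr;
  Callback graph_capture = nullptr;
  Callback get_available_resource = nullptr;
};

// The plugin can only rely on what both the runtime and its own headers know about.
std::uint32_t NegotiatedApiVersion(std::uint32_t runtime_version, std::uint32_t header_version) {
  return std::min(runtime_version, header_version);
}

// Installs each newer callback only when the negotiated version is high enough.
EpCallbacks SelectCallbacks(std::uint32_t negotiated_version) {
  EpCallbacks callbacks;
  if (negotiated_version >= 25) {
    callbacks.sync = &Noop;
    callbacks.create_profiler = &Noop;
  }
  if (negotiated_version >= 26) {
    callbacks.graph_capture = &Noop;
    callbacks.get_available_resource = &Noop;
  }
  return callbacks;
}
```

Leaving an entry null, rather than installing a stub that fails at call time, lets an older runtime skip the feature entirely.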
Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.