Sync with Microsoft ONNX Runtime - 14062026 #1137

Open
ai-fw-intg wants to merge 21 commits into ovep-develop from sync_msft_14062026

@ai-fw-intg

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

jambayk and others added 21 commits June 10, 2026 17:13
### Description

Move the existing model package C API off the stable `OrtApi` onto the
experimental name-based lookup mechanism added in microsoft#28746. Each model
package function is registered individually in
`include/onnxruntime/core/session/onnxruntime_experimental_c_api.inc`
with the `OrtModelPackageApi_` prefix and the `_SinceV28` version
suffix, following the lifecycle rules in
`docs/design/Experimental_C_API.md`.

Headline changes:

- `OrtApi::GetModelPackageApi`, the `OrtModelPackageApi` struct,
`OrtApis::GetModelPackageApi`, the `OrtModelPackageAPI` namespace,
`onnxruntime/core/session/model_package_api.h`, and the C++ wrappers
(`Ort::GetModelPackageApi`,
`ORT_DEFINE_RELEASE_FROM_API_STRUCT(ModelPackage*)`,
`ModelPackageOptions/Context/ComponentContext`) are removed.
- Opaque handle types (`OrtModelPackageOptions`,
`OrtModelPackageContext`, `OrtModelPackageComponentContext`) move into
`onnxruntime_experimental_c_api.h`.
- All 15 model package functions are registered in
`onnxruntime_experimental_c_api.inc`. Impls move into `namespace
OrtExperimentalApis` with `_SinceV28`-suffixed names in
`model_package_api.cc`; bodies are unchanged.
- `experimental_c_api.cc` gains a forward-decl block (driven by the same
`.inc` X-macro) so the auto-generated registration table can take the
address of every entry, even those defined in `model_package_api.cc`.
- The Python bindings (`PyModelPackageContext` / `PyModelPackageOptions`
/ `PyModelPackageComponentContext` and their `onnxruntime.__init__`
exports) are removed. Per the design doc we start the experimental API
in C/C++ only.
- `onnxruntime/test/autoep/test_model_package.cc` switches to a local
`ModelPackageFns` struct populated through the
`Ort::Experimental::Get_OrtModelPackageApi_*_Fn(api)` typed accessors.
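
The X-macro arrangement described in the list above (one entry list driving both the forward declarations and a name-keyed lookup table) can be sketched roughly as follows. This is an illustration under stated assumptions, not the ORT implementation: `DEMO_EXPERIMENTAL_FUNCTIONS`, the `DemoApi_*_SinceV1` functions, `LookupExperimental`, and `GetAddFn` are all hypothetical names, and the entry list is defined inline where the real code reads it from a `.inc` file.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for the `.inc` file: one X(...) entry per function.
// Every name carries its own version suffix, so a signature change gets a new name.
#define DEMO_EXPERIMENTAL_FUNCTIONS(X)        \
  X(int, DemoApi_Add_SinceV1, (int a, int b)) \
  X(int, DemoApi_Negate_SinceV1, (int a))

// Pass 1: forward-declare every entry so the table below can take its address
// even if the definition lives in another translation unit.
namespace demo_impl {
#define X(ret, name, params) ret name params;
DEMO_EXPERIMENTAL_FUNCTIONS(X)
#undef X
}  // namespace demo_impl

using GenericFn = void (*)();
struct Entry {
  const char* name;
  GenericFn fn;
};

// Pass 2: generate the registration table from the same list.
#define X(ret, name, params) {#name, reinterpret_cast<GenericFn>(&demo_impl::name)},
const Entry kTable[] = {DEMO_EXPERIMENTAL_FUNCTIONS(X)};
#undef X

// Name-based lookup: returns nullptr when the exact versioned name is unknown.
GenericFn LookupExperimental(const char* name) {
  assert(name != nullptr);
  for (const Entry& e : kTable) {
    if (std::strcmp(e.name, name) == 0) return e.fn;
  }
  return nullptr;
}

// Typed accessor: casts back to the real signature before the pointer is called.
using AddFn = int (*)(int, int);
AddFn GetAddFn() { return reinterpret_cast<AddFn>(LookupExperimental("DemoApi_Add_SinceV1")); }

// Definitions (these could equally sit in a separate .cc file).
namespace demo_impl {
int DemoApi_Add_SinceV1(int a, int b) { return a + b; }
int DemoApi_Negate_SinceV1(int a) { return -a; }
}  // namespace demo_impl
```

A caller checks the accessor result for `nullptr` before use, which is the "is this specific thing available?" contract: an unknown or retired name simply fails the lookup.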

Consumer usage going forward, in C++:

```cpp
#include "onnxruntime_c_api.h"
#include "onnxruntime_experimental_c_api.h"

const OrtApi* ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);

if (auto* fn = Ort::Experimental::Get_OrtModelPackageApi_CreateModelPackageContext_SinceV28_Fn(ort)) {
  OrtModelPackageContext* ctx = nullptr;
  Ort::ThrowOnError(fn(ORT_TSTR("/path/to/pkg"), &ctx));
  // ...
}
```

### Motivation and Context

The model package API was added to the stable `OrtApi` in 1.27 but has
not shipped in a release yet. Now that microsoft#28746 has landed the
experimental C API framework, the right home for an iterating preview
surface like model package is behind `OrtApi::GetExperimentalFunction`,
not on the stable struct.

Moving it to experimental:

- frees us to change signatures (each name is uniquely versioned)
without breaking the stable ABI;
- gives consumers a clear "is this specific thing available?" contract
instead of a struct that *looks* stable but isn't;
- lets the surface be promoted to stable cleanly later (move entries to
`OrtApi`, drop the `_SinceV<N>` suffix, remove the experimental
entries).

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icrosoft#28978)

## Summary

The CUDA QMoE INT4/INT8 grouped GEMM always dispatches to the Ampere
(SM80) CUTLASS kernel — even on Hopper (SM90) — because mixed int-weight
+ fp16/bf16 activation is not a valid Hopper TMA warp-specialized
specialisation. This PR makes weight prepacking always emit the SM80
(column-interleaved) `fpA_intB` layout regardless of the runtime device
SM, fixing silently-wrong output on Hopper, and centralizes the
arch-clamping logic in a single shared helper. It also cleans up the
related tests and tightens MoE parity tolerances that were too loose to
catch the layout bug.

## Motivation

microsoft#28749 uses arch `90` for weight prepacking on SM90.

On SM90, `isValidHopperMOESpecialisation<half_t, uint4b_t/uint8_t>()` is
`false`, so the grouped MoE GEMM falls back to the SM80 kernel. The
weight preprocessor, however, skips column interleaving for `arch ==
90`, so an auto-detected (`force_arch=-1`) pack on an H200 produced the
non-interleaved SM90 layout that the SM80 kernel cannot consume —
yielding wrong results. The previous `PrePackIntExpertWeights` logic
clamped to `sm_` (passing SM90 through), and the test that exercised the
offline packer used auto-detect, so both could emit the wrong layout.

## Key Changes

| Area | Change |
|---|---|
| `fpA_intB_gemm_preprocessors{.h,_impl.cu}` | Extracted
`get_arch_for_mixed_gemm_weight_preprocess(int arch)` as a shared,
declared helper (clamps SM to the layout group: `<80→75`, `90→90`, else
`80`). |
| `fpA_intB_gemm_preprocessors_impl.h` | `getLayoutDetailsForTransform`
now routes through the shared helper instead of duplicating the
arch-range logic. |
| `moe_quantization.cc` (`PrePackIntExpertWeights`) | Always packs
INT4/INT8 expert weights for the SM80 layout
(`get_arch_for_mixed_gemm_weight_preprocess(80)`) instead of clamping to
the runtime `sm_`, since the SM80 kernel runs on every GPU. |
| `onnxruntime_pybind_quant.cc` (`PackWeightsForMixedGemm`) | Replaced
the ad-hoc `{75,80,90}` allowlist with the shared helper, so
`force_arch` is clamped consistently with the runtime dispatch (removes
the now-unused `<set>` include). |
| `contrib_defs.cc` / `moe_quantization.h` | Updated `weights_prepacked`
schema/field docs: layouts for `-1`/`1` are EP-determined; for the CUDA
EP `-1` and `1` are equivalent today (both SM80), `1` reserved for a
future Hopper-specific layout. |
| `test_qmoe_cuda.py` | Removed the dead, never-called
`preprocess_weights_for_mixed_gemm` helper; the real path
(`quant_dequant_blockwise`) already pins `sm=80`. |
| `test_moe_cuda.py` | Pinned the offline packer to `arch=80`, and
tightened FP16 QMoE parity tolerance from `atol 3.0 (4-bit)` / `2.0
(8-bit)` to `0.5` now that the layout is correct. |
| `docs/` | Regenerated `ContribOperators.md` and updated `moe_qmoe.md`
to match the new schema docs and SM80-always packing rationale. |
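
The clamping rule quoted in the table (`<80→75`, `90→90`, else `80`) is small enough to write out. The sketch below only mirrors that stated mapping; `ClampArchToLayoutGroup` and `QMoeIntPackArch` are hypothetical names, not the signatures of the real helpers.

```cpp
#include <cassert>

// Hypothetical name. Maps an SM version to its weight-layout group, following
// the mapping stated in the PR text: below 80 -> 75, exactly 90 -> 90, else -> 80.
constexpr int ClampArchToLayoutGroup(int arch) {
  if (arch < 80) return 75;
  if (arch == 90) return 90;
  return 80;
}

// Per the PR text, INT4/INT8 QMoE prepacking no longer feeds the runtime SM
// into the clamp; it always requests the SM80 layout, because the SM80 kernel
// is the one that actually runs.
constexpr int QMoeIntPackArch(int /*runtime_sm*/) { return ClampArchToLayoutGroup(80); }

static_assert(ClampArchToLayoutGroup(90) == 90, "the general helper still passes SM90 through");
static_assert(QMoeIntPackArch(90) == 80, "QMoE int packing ignores the runtime SM");
```

Under this reading, the Hopper bug corresponds to calling the first function with the runtime SM where the second behaviour was needed.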

## Testing Notes

On an H200 (SM90), with the CUDA 12.x/13.x Python wheel:

```bash
python -m pytest onnxruntime/test/python/transformers/test_qmoe_cuda.py
python -m pytest onnxruntime/test/python/transformers/test_moe_cuda.py -k "PhiQMoE or qmoe"
```

- `test_qmoe_cuda.py` SwiGLU parity: SM80 layout → max diff ~0.001
(pass, tol 0.1); the prior SM90 layout produced max diff ~1.2 (fail),
confirming the fix.
- `test_moe_cuda.py` `TestPhiQMoE` (4-bit and 8-bit, all batch/seq
combinations): worst observed `max_diff` ≈ 0.375 with the fixed layout,
comfortably under the new `atol=0.5`.
- `ruff check` passes on both edited test files.

---------

Co-authored-by: tlwu <tlwu@example.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
… length (microsoft#28389)

## Summary

- Extend the FlashAttention decode path to work with any sequence length
(not just seq_len=1), with causal masking and `use_seqlen_k` support for
static KV cache
- Add m_tile optimization to process multiple Q rows per workgroup
(m_tile=1/2/4), amortizing K/V loads
- Fuse the separate QKT and SplitVx shaders into a single QKV kernel
using online softmax, eliminating the intermediate `qk` tensor
(`B×H×seq×present_seq`) and reducing dispatch count from 3 to 2
- Route between prefill (FlashAttentionProgram) and split-reduce (fused
QKV + VxReduce) paths based on sequence length

## Resolved Issues

**Whisper decoding prefill improved from 4.68ms to 1.09ms.** Whisper's
decoder attention has a small sequence length but large total sequence
length (seq_len=4, total_seq_len=1500). The default prefill shader
(FlashAttentionProgram) has low parallelism in this case because each
workgroup iterates serially over the full KV cache. The split-reduce
path tiles the KV dimension across workgroups, achieving much higher GPU
occupancy for this workload shape.

## Details

**Fused QKV kernel**: Each workgroup computes QK^T dot products, applies
attention bias and causal mask, computes local softmax (per-tile max and
sum), normalizes, and multiplies by V — all in one kernel. Per-tile
metadata (max, sum) is written for the VxReduce shader to rescale
partial outputs using online softmax: `output = Σ(partial_i ×
local_sum_i × exp(local_max_i - global_max)) / global_sum`.
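
The rescaling formula above can be checked on the CPU with a scalar model. The sketch below is a reference model only, not the WebGPU shader: it assumes a single query row, scalar values (head size 1), and that `global_sum` means `Σ local_sum_i × exp(local_max_i - global_max)`, which the text does not spell out. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct TilePartial {
  double local_max;  // max score within the tile
  double local_sum;  // sum of exp(score - local_max) within the tile
  double partial;    // tile output, already normalized by local_sum
};

// Per-tile pass: local softmax over scores[begin, end) multiplied by the values.
TilePartial ComputeTile(const std::vector<double>& scores, const std::vector<double>& values,
                        std::size_t begin, std::size_t end) {
  assert(begin < end && end <= scores.size() && scores.size() == values.size());
  double m = scores[begin];
  for (std::size_t i = begin; i < end; ++i) m = std::max(m, scores[i]);
  double s = 0.0, acc = 0.0;
  for (std::size_t i = begin; i < end; ++i) {
    const double e = std::exp(scores[i] - m);
    s += e;
    acc += e * values[i];
  }
  return {m, s, acc / s};
}

// Reduce pass: rescale every tile to the global max, then renormalize.
double ReduceTiles(const std::vector<TilePartial>& tiles) {
  assert(!tiles.empty());
  double global_max = tiles[0].local_max;
  for (const TilePartial& t : tiles) global_max = std::max(global_max, t.local_max);
  double global_sum = 0.0, out = 0.0;
  for (const TilePartial& t : tiles) {
    const double w = t.local_sum * std::exp(t.local_max - global_max);
    global_sum += w;
    out += t.partial * w;
  }
  return out / global_sum;
}

// Split the KV range into tiles of tile_size, then reduce.
double SplitReduce(const std::vector<double>& scores, const std::vector<double>& values,
                   std::size_t tile_size) {
  assert(tile_size > 0);
  std::vector<TilePartial> tiles;
  for (std::size_t b = 0; b < scores.size(); b += tile_size) {
    tiles.push_back(ComputeTile(scores, values, b, std::min(b + tile_size, scores.size())));
  }
  return ReduceTiles(tiles);
}

// Reference: one tile covering everything is an ordinary softmax-weighted sum.
double Direct(const std::vector<double>& scores, const std::vector<double>& values) {
  return ReduceTiles({ComputeTile(scores, values, 0, scores.size())});
}
```

`SplitReduce` agrees with `Direct` up to rounding for any tile size, which is why the KV dimension can be tiled across workgroups without materializing the full `qk` tensor.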

**Path routing** (`use_split_reduce`): The split-reduce path is used
when `sequence_length_ < 32`; otherwise the single-kernel
FlashAttentionProgram prefill path is used. Microbenchmarks on Phi-4 (32
heads, head_size 128, GQA group 3) show split-reduce is 1.13×-2.07×
faster than the fused prefill kernel across `sequence_length ∈ {16, 30,
31}` × `total_sequence_length ∈ {128, 500, 2000}`. The previous
heuristic additionally gated on `total_sequence_length_ > 1000`, but
that signal is 0 under graph capture (seqlen_k lives on the GPU) and the
carve-out is unnecessary because split-reduce is uniformly faster for
short Q.

## Test plan

- [x] 30/30 MHA unit tests pass
- [x] phi4-graph-prune produces correct output
- [x] whisper-tiny-int4 produces correct transcription
- [x] clang-format clean

This pull request introduces important safety checks to prevent
out-of-bounds access in the logits processing code for transformers. The
main updates ensure that token IDs are validated against the vocabulary
size before being used, which improves robustness and prevents potential
crashes.

**Safety and robustness improvements:**

* Added bounds checking for token IDs in the
`RepetitionPenaltyLogitsProcessor<T>::Process` method to ensure only
valid IDs are used when accessing `beam_token_scores`.
* Added bounds checking for token IDs in the
`NoRepeatNGramLogitsProcessor<T>::Process` method to prevent
out-of-bounds writes to `beam_token_scores`.
* Updated the `NextTokenScores::SetScore` method to return early if the
provided `token_id` is out of bounds, replacing the previous assert with
a safe check.
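
The kind of guard described in the list above can be illustrated with a standalone repetition-penalty loop. This is a hedged sketch, not the ORT code: `ApplyRepetitionPenalty` is a hypothetical free function, the penalty convention (multiply negative scores, divide non-negative ones) is the commonly used one and is assumed here, and the token IDs are assumed to be already de-duplicated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: penalize previously seen tokens in one beam's scores.
// Token IDs outside [0, vocab_size) are skipped instead of being used as an index.
void ApplyRepetitionPenalty(std::vector<float>& beam_token_scores,
                            const std::vector<std::int32_t>& seen_token_ids,
                            float penalty) {
  assert(penalty > 0.0f);
  const auto vocab_size = static_cast<std::int64_t>(beam_token_scores.size());
  for (std::int32_t id : seen_token_ids) {
    if (id < 0 || id >= vocab_size) continue;  // bounds check: never index out of range
    float& score = beam_token_scores[static_cast<std::size_t>(id)];
    // Assumed convention: push the score toward "less likely" in both sign cases.
    score = score < 0.0f ? score * penalty : score / penalty;
  }
}
```

With scores `{2, -1, 4}`, seen IDs `{0, 1, 7, -3}`, and a penalty of 2, only the first two entries change; the out-of-range IDs 7 and -3 are ignored rather than read or written.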
…8703)

## Description

This PR adds Linux NPU discovery through sysfs accel devices.

Currently, `DeviceDiscovery::DiscoverDevicesForPlatform()` on Linux
discovers CPU and GPU devices, but NPU discovery is still missing. As a
result, plugin execution providers that filter devices by
`OrtHardwareDeviceType_NPU` do not receive any NPU hardware devices on
Linux, even when the NPU is present and exposed by the kernel.

This change scans `/sys/class/accel` for `accelN` devices and creates
`OrtHardwareDevice` entries with:

- `type = OrtHardwareDeviceType_NPU`
- PCI `vendor_id`
- PCI `device_id`
- `accel_idx` metadata
- `pci_bus_id` metadata when available

This enables Linux systems with NPUs exposed through the accel
subsystem, such as AMD Ryzen AI / XDNA devices, to be reported through
ORT device discovery and made available to plugin EP factories.

## Changes

- Add Linux sysfs discovery for NPU devices under `/sys/class/accel`.
- Read NPU PCI vendor and device IDs from the underlying sysfs device
path.
- Add NPU metadata including `accel_idx` and `pci_bus_id`.
- Include discovered NPU devices in
`DeviceDiscovery::DiscoverDevicesForPlatform()`.
- Add a `kSysfsAccelPath` constant for the Linux accel sysfs path.
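
A rough sketch of the scan described above, using `std::filesystem`. It assumes each `accelN` entry exposes the usual PCI sysfs attributes at `device/vendor` and `device/device` as hexadecimal text such as `0x1022`; the PR text only says the IDs come from "the underlying sysfs device path", so that layout is an assumption. `AccelDevice`, `ParseAccelIndex`, `ParseHexId`, and `ScanAccel` are hypothetical names.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <exception>
#include <filesystem>
#include <fstream>
#include <optional>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Returns N for a directory entry named "accelN", nullopt for anything else.
std::optional<int> ParseAccelIndex(const std::string& name) {
  const std::string prefix = "accel";
  if (name.size() <= prefix.size() || name.compare(0, prefix.size(), prefix) != 0) {
    return std::nullopt;
  }
  int idx = 0;
  for (std::size_t i = prefix.size(); i < name.size(); ++i) {
    if (!std::isdigit(static_cast<unsigned char>(name[i]))) return std::nullopt;
    idx = idx * 10 + (name[i] - '0');
  }
  return idx;
}

// Parses sysfs ID text such as "0x1022\n"; nullopt if it is not a hex number.
std::optional<std::uint32_t> ParseHexId(const std::string& text) {
  try {
    return static_cast<std::uint32_t>(std::stoul(text, nullptr, 16));
  } catch (const std::exception&) {
    return std::nullopt;
  }
}

std::optional<std::uint32_t> ReadHexFile(const fs::path& p) {
  std::ifstream in(p);
  std::string text;
  if (!in || !std::getline(in, text)) return std::nullopt;
  return ParseHexId(text);
}

struct AccelDevice {
  int accel_idx;
  std::uint32_t vendor_id;
  std::uint32_t device_id;
};

// Enumerates accelN entries under root; a missing directory yields an empty list.
std::vector<AccelDevice> ScanAccel(const fs::path& root = "/sys/class/accel") {
  std::vector<AccelDevice> found;
  std::error_code ec;
  for (fs::directory_iterator it(root, ec), end; !ec && it != end; it.increment(ec)) {
    const auto idx = ParseAccelIndex(it->path().filename().string());
    if (!idx) continue;
    const auto vendor = ReadHexFile(it->path() / "device" / "vendor");  // assumed attribute path
    const auto device = ReadHexFile(it->path() / "device" / "device");  // assumed attribute path
    if (vendor && device) found.push_back({*idx, *vendor, *device});
  }
  return found;
}
```

Using the `std::error_code` overloads keeps the scan non-throwing on machines without an accel subsystem, which matches the expectation that discovery should quietly report no NPUs there.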

## Motivation

Linux plugin EPs that target NPUs rely on ORT passing
`OrtHardwareDeviceType_NPU` devices into `GetSupportedDevices()`.
Without Linux NPU discovery, those EPs cannot claim NPU devices and
provider selection policies such as `PREFER_NPU` silently fall back to
CPU.

Fixes microsoft#28660.
…RT version (microsoft#28794)

### Description
- Adds a new telemetry event for inference failures that logs EP versions
and types along with the runtime error.
- Adds the ORT version to the other telemetry events.
- Adds EP versions to the SessionCreation telemetry event.



### Motivation and Context
To better diagnose inference failures.

---------

Co-authored-by: Darshak Bhatti <dabhatti@micorsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
Fix STFT frame pointer arithmetic for complex-valued input so frame
starts are computed in input samples, not trailing real/imag components.
Since the frame view pointer is `U*`, one pointer increment advances one
full real or complex sample.

Also add validation that `frame_step` is positive and keep a defensive
bounds check before creating non-owning tensor views.

Review feedback addressed: simplified the frame pointer arithmetic,
fixed the swapped STFT input comments, documented the defensive bounds
check, and added double-complex regression coverage. The new STFT
validation/regression tests exclude `kDmlExecutionProvider` because
these CPU STFT validation/regression paths do not consistently match
DirectML behavior in Windows GPU CI.

### Motivation and Context
For complex input shaped `[batch_size, signal_length, 2]`, pointer
increments already advance by one real/imag pair. Multiplying frame
offsets by `signal_components == 2` again can advance past the valid
frame start, allowing later frames to read across batches or beyond the
input allocation.
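
The offset arithmetic can be made concrete with a small model in which the element type `U` is `std::complex<float>`, so one element already spans a real/imag pair. This is a sketch under that assumption; `FrameStartInSamples` and `FrameView` are hypothetical names, and the bounds check is only one plausible form of the defensive check mentioned above.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Offset of a frame's first sample, counted in elements of U. For complex
// input U is a complex type, so no extra factor of 2 is applied.
constexpr std::size_t FrameStartInSamples(std::size_t batch, std::size_t signal_length,
                                          std::size_t frame_index, std::size_t frame_step) {
  return batch * signal_length + frame_index * frame_step;
}

// Returns a non-owning pointer to the frame, or nullptr when the arguments are
// invalid or the frame would leave its own batch.
template <typename U>
const U* FrameView(const std::vector<U>& input, std::size_t batch, std::size_t signal_length,
                   std::size_t frame_index, std::size_t frame_step, std::size_t frame_length) {
  if (frame_step == 0) return nullptr;  // frame_step must be positive
  const std::size_t start = FrameStartInSamples(batch, signal_length, frame_index, frame_step);
  const std::size_t batch_end = (batch + 1) * signal_length;
  if (batch_end > input.size() || start + frame_length > batch_end) return nullptr;
  return input.data() + start;
}
```

For two batches of eight complex samples with `frame_step = 2`, frame 2 of batch 1 starts at element 12. Doubling that offset gives 24, which is past the 16-element buffer, the kind of overrun the fix removes.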

### Testing
- `git diff --check -- onnxruntime/core/providers/cpu/signal/dft.cc
onnxruntime/test/providers/cpu/signal/signal_ops_test.cc`
- `.\.venv\Scripts\python.exe tools\ci_build\build.py --config
RelWithDebInfo --build --parallel --target onnxruntime_provider_test
--build_dir build\Windows`
- `.\onnxruntime_provider_test.exe
--gtest_filter="SignalOpsTest.STFTFloat:SignalOpsTest.STFTFrameStepMustBePositive:SignalOpsTest.STFTFloatComplexInputBatched:SignalOpsTest.STFTDoubleComplexInputBatched"`
from `build\Windows\RelWithDebInfo\RelWithDebInfo`

---------

Co-authored-by: Gopalakrishnan Nallasamy <gnallasamy@microsoft.com>
…icrosoft#28965)

### Description

When a QMoE model sets `weights_prepacked=0` (raw `[E, N, K/pack]` int
weights) and the session has `session.disable_prepacking`, `PrePack()`
never runs, so `packed_fc{1,2}_weights_` stay null and
`int_weights_consumed_by_prepack` is false. The code then falls through
to the raw initializer pointers — but those bytes are not in CUTLASS
layout, so the runner consumes them as-if-prepacked and produces
silently wrong output with no diagnostic.

Changes in `onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc`
(`QMoE::ComputeInternal`):

- **Int path**: Added a defensive `INVALID_ARGUMENT` guard — when
`is_int && !weights_prepacked_` but either prepack buffer is null,
return a clear error instead of feeding non-CUTLASS bytes to the runner.
- **wfp4afp8 native path**: Same fall-through
(`packed_fp4_fc{1,2}_weights_ ? ... : raw`) replaced with an explicit
guard that errors when the repacked FP4 buffers were not produced.

Also added a focused regression test in
`onnxruntime/test/contrib_ops/moe_test.cc` covering `quant_type='int'`
with `weights_prepacked=0` and `session.disable_prepacking=1`, asserting
that QMoE fails with an actionable error instead of producing output.

Merged the branch with the latest `main`.


### Motivation and Context

A prior fix removed the null-pointer crash on this path but left a
misleading-success outcome that is newly user-reachable via the
`weights_prepacked=0` contract — the exact silent-failure mode the
offline-path work set out to eliminate. These guards convert that into a
loud, actionable error. The wfp4afp8 branch shares the same fall-through
and is hardened for consistency.

The added regression test ensures this fail-loudly behavior remains
covered going forward, especially when prepacking is disabled at the
session level.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
This pull request strengthens shape inference validation for several
custom BERT-related ONNX operators by adding explicit rank checks for
input tensors. These changes ensure that input tensors meet minimum rank
requirements, improving error messaging and preventing incorrect shape
propagation.

**Enhanced shape validation for custom ONNX operators:**

*RelativePositionBias and GatedRelativePositionBias:*
- Added checks to ensure `bias_table` (for `RelativePositionBias`) and
`token_offset` (for `GatedRelativePositionBias`) inputs have rank ≥ 2,
with clear error messages if not.
[[1]](diffhunk://#diff-8bf31275168b1e4a2aecd6760acf0ef92347134b003dfdc687c5d3cec4a178ecR1959-R1965)
[[2]](diffhunk://#diff-8bf31275168b1e4a2aecd6760acf0ef92347134b003dfdc687c5d3cec4a178ecR2228-R2230)

*CausalConvWithState:*
- Added checks to ensure both `input` and `weight` tensors have rank ≥
2, failing shape inference with descriptive errors if violated.

*LinearAttention:*
- Added checks to ensure `query` and `value` tensors have rank ≥ 3 for
both output and state shape inference, with early returns or errors if
requirements are not met.
[[1]](diffhunk://#diff-8bf31275168b1e4a2aecd6760acf0ef92347134b003dfdc687c5d3cec4a178ecR2460-R2465)
[[2]](diffhunk://#diff-8bf31275168b1e4a2aecd6760acf0ef92347134b003dfdc687c5d3cec4a178ecR2483-R2486)

*SkipLayerNormalization:*
- Added a check to ensure the `input` tensor has rank ≥ 1, improving
error reporting for invalid input shapes.
…soft#28980)

### Description

Optimizes the CUDA QMoE router top-k (`LaunchSoftmaxTopK`) for
small-batch / autoregressive decode by replacing the old
one-thread-per-row hot path with parallel CUB and warp-level top-k
kernels. The dispatch now uses the fastest specialized path for common
MoE expert counts while preserving the existing softmax normalization
and deterministic lower-index tie-breaking semantics.

This PR also factors the warp-level top-k sorting code into a reusable
CUDA helper header and adds direct CUDA-internal tests so the new
routing paths are covered independently of higher-level QMoE tests.

### Motivation and Context

The previous router path launched a 256-thread block per row but did all
top-k work in a single thread. In decode scenarios such as `num_rows ==
1`, that made the router latency-bound on a serial scan of all expert
logits and turned `SoftmaxTopKKernel` into a major MoE decode
bottleneck.

For a Qwen3-style MoE workload with 256 experts, top-8 routing, and 40
MoE layers, the original router accounted for roughly 50% of decode GPU
time. Moving the work to block/warp-parallel kernels removes that
bottleneck while keeping the same output ordering and scaling behavior.

### Key Changes

| Area | Change |
|---|---|
| QMoE router dispatch | Adds `DispatchSoftmaxTopK` routing for `k <=
64` and `num_experts <= 1024`, with a fallback to the original scalar
kernel for larger or uncommon shapes. |
| Tiny expert counts | Adds `SoftmaxTopKWarpBitonicKernel` for
`num_experts <= 32`, using one warp per row and in-register bitonic
sorting via warp shuffles. |
| Small expert counts | Adds `SoftmaxTopKWarpMergeKernel` for `32 <
num_experts <= 64`, using a single warp and CUB warp merge sort. |
| Larger common MoE counts | Uses `SoftmaxTopKMergeKernel` with CUB
block merge sort for `num_experts <= 128`, `256`, `512`, and `1024`. |
| Reusable top-k helpers | Adds
`onnxruntime/core/providers/cuda/cu_inc/topk_warp_sort.cuh` with
reusable warp bitonic and warp merge sort helpers. |
| Stable tie-breaking | Packs `(score, index)` into a `uint64_t` stable
sort key for the CUB merge paths, matching onnxruntime-genai's
lower-index tie-breaking and avoiding compound comparators. |
| Softmax cleanup | Factors shared softmax scale, safe reciprocal, top-k
normalization, warp reduction, and CUB block reduction helpers to keep
the optimized kernels consistent. |
| Tests | Adds CUDA-internal `SoftmaxTopK_*` tests covering warp
bitonic, warp merge, block merge, stable ties, normalization, `float`,
`half`, and `bfloat16`. |
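
One way to pack `(score, index)` into a single `uint64_t` key, as the "Stable tie-breaking" row describes, is sketched below on the host side. The exact bit layout used by the CUDA kernels is not given in the text, so this layout is an assumption: an order-preserving transform of the float bits in the high word and the inverted index in the low word, sorted in descending order. NaN scores are not handled, and `-0.0f` orders just below `+0.0f`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// Maps a float to a uint32_t whose unsigned order matches the float order:
// negatives get all bits flipped, non-negatives get the sign bit set.
std::uint32_t OrderedBits(float x) {
  std::uint32_t u;
  std::memcpy(&u, &x, sizeof(u));
  return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

// High word: score. Low word: inverted index, so that among equal scores the
// lower index produces the larger key and comes first in a descending sort.
std::uint64_t PackKey(float score, std::uint32_t index) {
  return (static_cast<std::uint64_t>(OrderedBits(score)) << 32) |
         static_cast<std::uint64_t>(static_cast<std::uint32_t>(~index));
}

std::uint32_t UnpackIndex(std::uint64_t key) { return ~static_cast<std::uint32_t>(key); }

// Top-k expert indices, highest score first, ties broken by lower index.
std::vector<std::uint32_t> TopKIndices(const std::vector<float>& scores, std::size_t k) {
  std::vector<std::uint64_t> keys(scores.size());
  for (std::size_t i = 0; i < scores.size(); ++i) {
    keys[i] = PackKey(scores[i], static_cast<std::uint32_t>(i));
  }
  k = std::min(k, keys.size());
  std::partial_sort(keys.begin(), keys.begin() + static_cast<std::ptrdiff_t>(k), keys.end(),
                    std::greater<std::uint64_t>());
  std::vector<std::uint32_t> out(k);
  for (std::size_t i = 0; i < k; ++i) out[i] = UnpackIndex(keys[i]);
  return out;
}
```

Because every key is unique, any plain integer sort yields a deterministic order without a compound comparator, which is the property the merge-sort paths rely on.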

### Performance

H200 measurements for the target QMoE decode scenario showed the router
cost dropping from roughly `5.56 ms/token` to `0.17 ms/token`, improving
end-to-end Qwen3.6-35B-A3B INT4 decode throughput from about `80 tok/s`
to `113 tok/s`.

Additional profiling of the `32 < num_experts <= 64` warp merge path
showed the packed `uint64_t` stable sort key is consistently faster than
a `{float, int}` struct comparator on H200:

| Experts | Sort-only packed/struct | Full softmax+top-k packed/struct |
|---:|---:|---:|
| 33 | 0.680x | 0.704x |
| 48 | 0.672x | 0.695x |
| 64 | 0.673x | 0.696x |

### Testing

- `lintrunner -a`
- `ninja onnxruntime_providers_cuda_ut`
- `ninja onnxruntime_provider_test`
- `GTEST_FILTER='CUDA_EP_Unittest.SoftmaxTopK_*'
./onnxruntime_provider_test --gtest_filter='CUDA_EP_Unittest.All'`
- `onnxruntime/test/python/transformers/test_qmoe_cuda.py -k parity`
(`44 passed`)
…ft#29022)

Bumps [shell-quote](https://github.com/ljharb/shell-quote) from 1.8.3 to
1.8.4.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/ljharb/shell-quote/blob/main/CHANGELOG.md">shell-quote's
changelog</a>.</em></p>
<blockquote>
<h2><a
href="https://github.com/ljharb/shell-quote/compare/v1.8.3...v1.8.4">v1.8.4</a>
- 2026-05-22</h2>
<h3>Commits</h3>
<ul>
<li>[Fix] <code>quote</code>: validate object-token shapes <a
href="https://github.com/ljharb/shell-quote/commit/4378a6e613db5948168684864e49b42b83134d2d"><code>4378a6e</code></a></li>
<li>[Dev Deps] update <code>@ljharb/eslint-config</code>,
<code>auto-changelog</code>, <code>eslint</code>, <code>npmignore</code>
<a
href="https://github.com/ljharb/shell-quote/commit/22ebec04349065a45ad8afc8cc8d53c4624634a6"><code>22ebec0</code></a></li>
<li>[Tests] increase coverage <a
href="https://github.com/ljharb/shell-quote/commit/9f3caa31900cc6ee64858b31134144c648ce206d"><code>9f3caa3</code></a></li>
<li>[readme] replace runkit CI badge with shields.io check-runs badge <a
href="https://github.com/ljharb/shell-quote/commit/3344a047dd1e95f71c4ca27522cbfd05c56277e0"><code>3344a04</code></a></li>
<li>[Dev Deps] update <code>@ljharb/eslint-config</code> <a
href="https://github.com/ljharb/shell-quote/commit/699c5113d135f4d4591574bebf173334ffa453d4"><code>699c511</code></a></li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/ljharb/shell-quote/commit/ff166e2b63eb5f932bd131a8886a99e9afdf45ae"><code>ff166e2</code></a>
v1.8.4</li>
<li><a
href="https://github.com/ljharb/shell-quote/commit/4378a6e613db5948168684864e49b42b83134d2d"><code>4378a6e</code></a>
[Fix] <code>quote</code>: validate object-token shapes</li>
<li><a
href="https://github.com/ljharb/shell-quote/commit/22ebec04349065a45ad8afc8cc8d53c4624634a6"><code>22ebec0</code></a>
[Dev Deps] update <code>@ljharb/eslint-config</code>,
<code>auto-changelog</code>, <code>eslint</code>, `npmig...</li>
<li><a
href="https://github.com/ljharb/shell-quote/commit/9f3caa31900cc6ee64858b31134144c648ce206d"><code>9f3caa3</code></a>
[Tests] increase coverage</li>
<li><a
href="https://github.com/ljharb/shell-quote/commit/3344a047dd1e95f71c4ca27522cbfd05c56277e0"><code>3344a04</code></a>
[readme] replace runkit CI badge with shields.io check-runs badge</li>
<li><a
href="https://github.com/ljharb/shell-quote/commit/699c5113d135f4d4591574bebf173334ffa453d4"><code>699c511</code></a>
[Dev Deps] update <code>@ljharb/eslint-config</code></li>
<li>See full diff in <a
href="https://github.com/ljharb/shell-quote/compare/v1.8.3...v1.8.4">compare
view</a></li>
</ul>
</details>
<br />


Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…28982)

Closes part of microsoft#27729.

The CUDA EP registered Softplus and Softsign only for opset 1 (HFD).
Opset 22 extended the type constraint to include BFloat16, but the CUDA
EP was never updated. The BFloat16 compute kernels were already
instantiated via `UNARY_ACTIVATION_OP_HFD_WITH_BF16`; only the EP
registration was missing.

Changes:
- Cap opset-1 kernels as versioned [1, 21] (Softplus, Softsign)
- Register opset-22 kernels with BFloat16 support (HFDX)
- Add opset-22 tests: float (CPU+CUDA) and BFloat16 (CUDA, sm≥530)
- Update `docs/OperatorKernels.md`
…soft#28966)

This pull request expands support for additional ONNX opset versions in
the attention fusion optimization code, making the optimizer compatible
with newer and more diverse ONNX models. The changes primarily update
the accepted opset versions for various operators such as `Transpose`,
`Reshape`, `Squeeze`, `Unsqueeze`, `Shape`, and others across multiple
functions. This ensures broader model compatibility and improves the
robustness of the fusion logic.

**Expanded opset version support for attention fusion:**

* Updated accepted opset versions for key operators (`Transpose`,
`Reshape`, `Squeeze`, `Unsqueeze`, `Shape`, `Add`, `Mul`, `Sub`, `Div`,
`Cast`, etc.) in the main attention fusion logic
(`attention_fusion.cc`), allowing matching and fusion of newer ONNX
models using these operators at opsets up to 25.
[[1]](diffhunk://#diff-2d859229c1824649bd6a37eaefa52306394bc6c3aa341d6deff1d4f2fb9902f3L352-R367)
[[2]](diffhunk://#diff-2d859229c1824649bd6a37eaefa52306394bc6c3aa341d6deff1d4f2fb9902f3L382-R384)
[[3]](diffhunk://#diff-2d859229c1824649bd6a37eaefa52306394bc6c3aa341d6deff1d4f2fb9902f3L394-R395)
[[4]](diffhunk://#diff-2d859229c1824649bd6a37eaefa52306394bc6c3aa341d6deff1d4f2fb9902f3L405-R405)
[[5]](diffhunk://#diff-2d859229c1824649bd6a37eaefa52306394bc6c3aa341d6deff1d4f2fb9902f3L463-R471)
[[6]](diffhunk://#diff-2d859229c1824649bd6a37eaefa52306394bc6c3aa341d6deff1d4f2fb9902f3L500-R500)
[[7]](diffhunk://#diff-2d859229c1824649bd6a37eaefa52306394bc6c3aa341d6deff1d4f2fb9902f3L514-R514)
[[8]](diffhunk://#diff-2d859229c1824649bd6a37eaefa52306394bc6c3aa341d6deff1d4f2fb9902f3L923-R927)
[[9]](diffhunk://#diff-2d859229c1824649bd6a37eaefa52306394bc6c3aa341d6deff1d4f2fb9902f3L956-R958)
[[10]](diffhunk://#diff-2d859229c1824649bd6a37eaefa52306394bc6c3aa341d6deff1d4f2fb9902f3L1073-R1074)
[[11]](diffhunk://#diff-2d859229c1824649bd6a37eaefa52306394bc6c3aa341d6deff1d4f2fb9902f3L1166-R1166)
[[12]](diffhunk://#diff-2d859229c1824649bd6a37eaefa52306394bc6c3aa341d6deff1d4f2fb9902f3L1268-R1275)

**Helper and mask subgraph matching improvements:**

* Broadened opset version checks for subgraph matching in helper
functions, including those for Gemm subgraphs, unidirectional mask
subgraphs, input mask subgraphs, and past subgraph matching, to support
additional opset versions and operator variants.
[[1]](diffhunk://#diff-97696a1ea660259af1c02da793abf7a807de115421a0ec32f1e36f39371e4e16L77-R84)
[[2]](diffhunk://#diff-97696a1ea660259af1c02da793abf7a807de115421a0ec32f1e36f39371e4e16L169-R171)
[[3]](diffhunk://#diff-97696a1ea660259af1c02da793abf7a807de115421a0ec32f1e36f39371e4e16L378-R379)
[[4]](diffhunk://#diff-97696a1ea660259af1c02da793abf7a807de115421a0ec32f1e36f39371e4e16L395-R402)
[[5]](diffhunk://#diff-97696a1ea660259af1c02da793abf7a807de115421a0ec32f1e36f39371e4e16L457-R458)
[[6]](diffhunk://#diff-97696a1ea660259af1c02da793abf7a807de115421a0ec32f1e36f39371e4e16L485-R487)
[[7]](diffhunk://#diff-97696a1ea660259af1c02da793abf7a807de115421a0ec32f1e36f39371e4e16L635-R637)
[[8]](diffhunk://#diff-97696a1ea660259af1c02da793abf7a807de115421a0ec32f1e36f39371e4e16L769-R769)
[[9]](diffhunk://#diff-97696a1ea660259af1c02da793abf7a807de115421a0ec32f1e36f39371e4e16L794-R796)
[[10]](diffhunk://#diff-97696a1ea660259af1c02da793abf7a807de115421a0ec32f1e36f39371e4e16L812-R814)
[[11]](diffhunk://#diff-97696a1ea660259af1c02da793abf7a807de115421a0ec32f1e36f39371e4e16L890-R890)

These changes collectively future-proof the attention fusion optimizer
for a wider range of ONNX models and operator versions, reducing the
likelihood of unsupported patterns during optimization.
### Description

Add documentation for `OrtErrorCode` enum and its values.

### Motivation and Context

Provide some documentation about what the error codes mean.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This pull request introduces a new "name-based layer assignment" feature
for ONNX Runtime, allowing device assignment of model nodes based on
substring matching of node names, as an alternative to annotation-based
matching. The implementation ensures that name-based and
annotation-based assignment modes are mutually exclusive, and updates
documentation, configuration keys, and core logic to support this new
capability.

**Key changes:**

#### Name-based Layer Assignment Feature

- Added support for a new session option,
`session.name_based_layer_assignment`, which enables device assignment
using substring matching against node names (rather than metadata
annotations). The longest matching pattern wins, and all patterns are
treated as substrings (the `=` exact-match qualifier is disallowed in
this mode).
[[1]](diffhunk://#diff-10b3051b9e36eccfc7ca0f2d44ce78a9980ca573cde0f931ffd1456da2c681daR181-R218)
[[2]](diffhunk://#diff-62d211e77c575a2fec6492c9fbfe25743fac9d6d72be7c007e7f6eb8dbecc7e7R423-R437)

- Implemented the `SubstringMatcher` class and integrated it into the
`LayeringIndex` logic, enabling efficient substring-based rule matching
for node assignment. The matcher sorts patterns by length and returns
the first (longest) match.
[[1]](diffhunk://#diff-a8f614056d63b5b3325eea1d855afc96550c977c16d8fdba641012a79194b7b5R604-R627)
[[2]](diffhunk://#diff-b64b395fbb0afa67dcf493f97493f3df353486c3c8344df89f25df057ca3840fR88-R116)
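
The "sort patterns by length and return the first (longest) match" behaviour can be sketched as below. This is an illustration, not the ORT `SubstringMatcher`: the class name, the `(pattern, device)` rule representation, and the device labels are hypothetical, and the real configuration string syntax is not reproduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of longest-substring rule matching on node names.
class LongestSubstringMatcher {
 public:
  // Each rule is (pattern, device); every pattern is treated as a plain substring.
  explicit LongestSubstringMatcher(std::vector<std::pair<std::string, std::string>> rules)
      : rules_(std::move(rules)) {
    // Longest pattern first, so the first hit during a scan is the longest match.
    std::stable_sort(rules_.begin(), rules_.end(), [](const auto& a, const auto& b) {
      return a.first.size() > b.first.size();
    });
  }

  // Returns the device of the longest pattern contained in node_name, if any.
  std::optional<std::string> Match(const std::string& node_name) const {
    for (const auto& [pattern, device] : rules_) {
      if (!pattern.empty() && node_name.find(pattern) != std::string::npos) return device;
    }
    return std::nullopt;
  }

 private:
  std::vector<std::pair<std::string, std::string>> rules_;
};
```

With the rules `layers.1 → cpu` and `layers.1.attn → gpu`, a node named `/model/layers.1.attn/MatMul` resolves to `gpu` because the longer pattern is tried first, while `/model/layers.1.mlp/MatMul` falls back to `cpu`.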

#### Mutual Exclusivity and API Changes

- Enforced mutual exclusivity between annotation-based
(`session.layer_assignment_settings`) and name-based
(`session.name_based_layer_assignment`) assignment options. Attempting
to set both now results in an error, with clear messaging.
[[1]](diffhunk://#diff-a8f614056d63b5b3325eea1d855afc96550c977c16d8fdba641012a79194b7b5L338-L359)
[[2]](diffhunk://#diff-b64b395fbb0afa67dcf493f97493f3df353486c3c8344df89f25df057ca3840fR159-R177)

- Updated the `LayeringIndex::Create` API and related logic to accept
both configuration strings, select the active mode, and construct the
appropriate matcher.
[[1]](diffhunk://#diff-a8f614056d63b5b3325eea1d855afc96550c977c16d8fdba641012a79194b7b5L338-L359)
[[2]](diffhunk://#diff-b64b395fbb0afa67dcf493f97493f3df353486c3c8344df89f25df057ca3840fR186)
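
A rough sketch of the mode selection described above, assuming both settings arrive as plain strings and that an empty string means "not set". `SelectLayeringMode` and `LayeringMode` are hypothetical names, not the `LayeringIndex::Create` signature. ONNX Runtime generally reports errors through `Status`; an exception is used here only to keep the example short.

```cpp
#include <stdexcept>
#include <string>

enum class LayeringMode { kNone, kAnnotationBased, kNameBased };

// Hypothetical helper: picks the active mode from the two config strings and
// rejects the case where both are set. Empty string is assumed to mean "unset".
inline LayeringMode SelectLayeringMode(const std::string& annotation_config,
                                       const std::string& name_based_config) {
  const bool has_annotation = !annotation_config.empty();
  const bool has_name_based = !name_based_config.empty();
  if (has_annotation && has_name_based) {
    throw std::invalid_argument(
        "session.layer_assignment_settings and session.name_based_layer_assignment "
        "are mutually exclusive; set only one of them");
  }
  if (has_annotation) return LayeringMode::kAnnotationBased;
  if (has_name_based) return LayeringMode::kNameBased;
  return LayeringMode::kNone;
}
```

Resolving the mode once, up front, lets the rest of the index branch on a single enum instead of re-checking both strings at every node.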

#### Documentation Updates

- Expanded the documentation to describe the new name-based assignment
feature, provide usage examples, highlight best practices for pattern
writing, and explain the mutual exclusivity and lack of subgraph
inheritance in name-based mode.
[[1]](diffhunk://#diff-10b3051b9e36eccfc7ca0f2d44ce78a9980ca573cde0f931ffd1456da2c681daR181-R218)
[[2]](diffhunk://#diff-10b3051b9e36eccfc7ca0f2d44ce78a9980ca573cde0f931ffd1456da2c681daL295-R357)

#### Core Logic and Maintenance

- Refactored the `LayeringIndex` and related methods to support both
matching modes, updating node assignment and update logic to branch
appropriately based on the selected mode.
[[1]](diffhunk://#diff-a8f614056d63b5b3325eea1d855afc96550c977c16d8fdba641012a79194b7b5L426-R485)
[[2]](diffhunk://#diff-a8f614056d63b5b3325eea1d855afc96550c977c16d8fdba641012a79194b7b5R537-R548)

- Added and documented the new configuration key
`kOrtSessionOptionsNameBasedLayerAssignment` in the public API headers.

These changes provide a more flexible and user-friendly way to partition
models across devices, especially for models with structured node names
but lacking explicit annotations.
… convolutions (microsoft#28565)

### Description
A path for MLAS to support NHWC convolutions without transposes was
added in microsoft#26834.
This PR extends that path to also support depthwise convolutions.



### What changed:

- The shared NHWC capability gate in
  onnxruntime/core/mlas/lib/convolve.cpp:1348 stopped requiring
  GroupCount == 1. It now allows GroupCount > 1 only when the op is true
  depthwise, meaning filters_per_group == 1.
- The NHWC transformer in
  onnxruntime/core/optimizer/nhwc_transformer.cc:162 was updated to pass
  the real group value and compute filter_count per group instead of
  hard-coding group 1. That is what lets grouped depthwise
  Conv/FusedConv nodes get rewritten to com.microsoft.NhwcFusedConv.
- The KleidiAI execution path in
  onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp:553 learned
  how to handle grouped NHWC tensors by:
  - gathering one group’s channels out of interleaved NHWC input into a
    temporary contiguous buffer,
  - running the existing per-group kernel,
  - scattering that group’s output channels back into interleaved NHWC
    output.
- Tests were added for a working NHWC depthwise case in
  onnxruntime/test/contrib_ops/fused_conv_test.cc:466, and transformer
  tests were updated to verify both the new positive case and the
  expected skip cases in
  onnxruntime/test/optimizer/nhwc_transformer_test.cc:416.
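
The gather/scatter steps and the group condition above could look roughly like this. It is a simplified sketch over `std::vector<float>` with the tensor flattened to `[pixels, total_channels]` (pixels = N*H*W); the function names are hypothetical and the real KleidiAI path works on raw buffers with its own packing. `NhwcPathAllowsGroups` models only the group part of the gate, not the other conditions the real check applies.

```cpp
#include <cstddef>
#include <vector>

// Group part of the gate as described: ungrouped convs pass, grouped convs
// pass only when depthwise (filters_per_group == 1).
inline bool NhwcPathAllowsGroups(std::size_t group_count, std::size_t filters_per_group) {
  return group_count == 1 || filters_per_group == 1;
}

// Copies one group's channels out of interleaved NHWC data into a contiguous
// [pixels, channels_per_group] buffer.
inline std::vector<float> GatherGroup(const std::vector<float>& nhwc, std::size_t pixels,
                                      std::size_t total_channels,
                                      std::size_t channels_per_group, std::size_t group) {
  std::vector<float> packed(pixels * channels_per_group);
  const std::size_t offset = group * channels_per_group;
  for (std::size_t p = 0; p < pixels; ++p) {
    for (std::size_t c = 0; c < channels_per_group; ++c) {
      packed[p * channels_per_group + c] = nhwc[p * total_channels + offset + c];
    }
  }
  return packed;
}

// Writes one group's contiguous result back into its channel slot of the
// interleaved NHWC output; other groups' channels are left untouched.
inline void ScatterGroup(const std::vector<float>& packed, std::size_t pixels,
                         std::size_t total_channels, std::size_t channels_per_group,
                         std::size_t group, std::vector<float>& nhwc) {
  const std::size_t offset = group * channels_per_group;
  for (std::size_t p = 0; p < pixels; ++p) {
    for (std::size_t c = 0; c < channels_per_group; ++c) {
      nhwc[p * total_channels + offset + c] = packed[p * channels_per_group + c];
    }
  }
}
```

The temporary buffer costs one extra copy in and out per group, which is the trade-off for reusing the existing per-group kernel unchanged.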

Added performance benchmark tests to allow for comparison between the
new NHWC path and the old NCHW default.
Sample output:
```
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                                                     Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SCONV_NCHW/KleidiAiNhwcComparison_NchwBaseline/Rank:2/N:1/G:1/Cpg:64/Fpg:64/I:56/56/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time                508509 ns       508507 ns         1374
SCONV_NCHW/KleidiAiNhwcComparison_NchwBaseline/Rank:2/N:1/G:1/Cpg:128/Fpg:128/I:28/28/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time              700573 ns       700386 ns          997
SCONV_NCHW/KleidiAiNhwcComparison_NchwBaseline/Rank:2/N:1/G:64/Cpg:1/Fpg:1/I:56/56/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time                6471094 ns      6471114 ns          132
SCONV_NCHW/KleidiAiNhwcComparison_NchwBaseline/Rank:2/N:1/G:72/Cpg:1/Fpg:1/I:48/80/K:3/3/P:1/1/1/1/S:2/2/D:1/1/real_time                3768969 ns      3767797 ns          217
SCONV_NHWC_KLEIDIAI/KleidiAiNhwcComparison_NhwcFastPath/Rank:2/N:1/G:1/Cpg:64/Fpg:64/I:56/56/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time       414198 ns       414197 ns         1688
SCONV_NHWC_KLEIDIAI/KleidiAiNhwcComparison_NhwcFastPath/Rank:2/N:1/G:1/Cpg:128/Fpg:128/I:28/28/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time     652454 ns       652454 ns         1074
SCONV_NHWC_KLEIDIAI/KleidiAiNhwcComparison_NhwcFastPath/Rank:2/N:1/G:64/Cpg:1/Fpg:1/I:56/56/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time       6032947 ns      6032940 ns          117
SCONV_NHWC_KLEIDIAI/KleidiAiNhwcComparison_NhwcFastPath/Rank:2/N:1/G:72/Cpg:1/Fpg:1/I:48/80/K:3/3/P:1/1/1/1/S:2/2/D:1/1/real_time       3022041 ns      3018352 ns          227
```

---------

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
…lanned device (microsoft#29013)

### Description

`SaveInitializedTensors` only used a user-supplied initializer
`OrtValue` (from
`AddExternalInitializers`) in place when the initializer was planned on
**CPU**. For any non-CPU
(e.g. CUDA) initializer it **always allocated a fresh device tensor and
copied into it**, even when
the supplied `OrtValue` already lived on the planned device.

This change uses the supplied `OrtValue` directly when its tensor's
device matches the planned
device, mirroring the existing CPU case and the `AddInitializer`
(`initializers_to_share_map`)
no-copy path:

```cpp
const auto& graph_value_device = ort_value_from_graph.Get<Tensor>().Location().device;
if (memory_info.device == default_cpu_device || graph_value_device == memory_info.device) {
  // Planned on CPU, or the supplied initializer already lives on the planned device:
  // use it in place (no per-session allocation/copy; enables cross-session sharing).
  ort_value = std::move(ort_value_from_graph);
} else {
  // existing allocate-on-device + CopyTensorFromCPUToDevice fallback (true cross-device case)
}
```

### Motivation

Two benefits:

1. **Avoids a redundant per-session device allocation + device copy**
for every externally-supplied
   initializer that is already on the target device.
2. **Enables cross-session device-memory sharing.** Supplying the *same*
device `OrtValue` to
multiple sessions (e.g. a large token embedding + `lm_head` shared
between a main decoder and an
auxiliary speculative-decoding / multi-token-prediction head) now keeps
a single device buffer
instead of one copy per session. For a large-vocab model this saves ~2
GB of VRAM.

This brings `AddExternalInitializers` in line with `AddInitializer`,
which already uses the supplied
`OrtValue` in place when its device matches the planned device.

Fixes microsoft#29009.

### Behavior / compatibility

- The CPU path is unchanged (`memory_info.device == default_cpu_device`
still short-circuits first).
- The true cross-device case (supplied tensor on a different device than
planned) still falls back to
allocate + `CopyTensorFromCPUToDevice`, so existing behavior is
preserved there.
- No public API change.
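
The three cases above can be modelled in isolation as a small decision function. This is a toy sketch: `Device` is a simplified, hypothetical stand-in for ORT's device descriptor, and `PlaceInitializer` is not a function in the codebase; it only mirrors the condition shown in the snippet earlier.

```cpp
#include <string>

// Hypothetical, simplified device descriptor (ORT's real type carries more fields).
struct Device {
  std::string type;  // e.g. "cpu" or "cuda"
  int id = 0;
  bool operator==(const Device& other) const { return type == other.type && id == other.id; }
};

enum class InitializerPlacement { kUseInPlace, kAllocateAndCopy };

// Mirrors the condition in the change: use the supplied value in place when the
// initializer is planned on CPU or already lives on the planned device;
// otherwise fall back to allocate + copy.
inline InitializerPlacement PlaceInitializer(const Device& supplied, const Device& planned) {
  const Device cpu{"cpu", 0};
  if (planned == cpu || supplied == planned) {
    return InitializerPlacement::kUseInPlace;
  }
  return InitializerPlacement::kAllocateAndCopy;
}
```

Because the CPU check comes first, the pre-existing CPU behavior is reached before the new device comparison is ever evaluated.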

### Testing

- Existing `TestExternalInitializersInjection` (CPU) continues to pass
(CPU path untouched).
- Validated end-to-end with ONNX Runtime GenAI on CUDA: sharing a fp16
embedding + `lm_head`
(1017 MB each) between two sessions that load separate graphs drops the
second model's device
footprint by ~2145 MB (≈2 GB), with identical inference output vs the
non-shared baseline.

> Note: a device-level unit test that asserts the shared buffer is
reused (no copy) needs internal
> session-state access plus a GPU EP harness; happy to add one under
`test/providers/cuda` if
> reviewers prefer.
### Description

ADO pipeline `Nodejs_Packaging` stage now uses CFS.


### Motivation and Context

Without CFS, the stage is network isolated and fails.
This pull request strengthens the validation logic for the
`block_row_indices` input in the SparseAttention operator and adds a
corresponding unit test to ensure that zero-dimension cases are properly
rejected.

Validation improvements:

* Updated the check in `Status CheckInputs` (in
`sparse_attention_helper.h`) to require that the first dimension of
`block_row_indices` is greater than zero, preventing invalid
zero-dimension input.

Test coverage:

* Added a new unit test `RejectsZeroDimBlockRowIndices` in
`sparse_attention_op_test.cc` to verify that the operator correctly
rejects inputs where `block_row_indices` has a zero in its first
dimension.
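
A minimal sketch of the kind of shape check described, using a plain vector of dimensions. `CheckBlockRowIndicesShape` is a hypothetical helper, not the code in `sparse_attention_helper.h`; the 2D rank requirement and the error strings are assumptions for illustration, and the real code returns a `Status` rather than a string.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical shape check. Returns an empty string on success, otherwise an
// error message. The rank-2 requirement is an assumption of this sketch.
inline std::string CheckBlockRowIndicesShape(const std::vector<std::int64_t>& dims) {
  if (dims.size() != 2) {
    return "block_row_indices must be a 2D tensor";
  }
  // The strengthened check: a zero (or negative) first dimension is rejected.
  if (dims[0] <= 0) {
    return "block_row_indices first dimension must be greater than zero";
  }
  return "";
}
```

Rejecting the zero-sized case at validation time keeps later code from indexing into an empty layout.
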
… callbacks (microsoft#28824)

## Description

Lowers the minimum supported ONNX Runtime runtime version for the
standalone CUDA plugin EP from **1.26.0** to **1.24.4**, so the plugin
binary (built against the latest ORT headers) can be loaded by older ORT
runtimes. The plugin negotiates the API version at load time and only
advertises EP callbacks the negotiated runtime actually supports, so
newer features degrade gracefully on older runtimes instead of crashing.

## Motivation

The plugin is shipped as a separate package and is intended to run
against a range of base `onnxruntime` runtimes. The previous hard floor
of 1.26.0 was stricter than necessary: an audit of the `\since`
annotations shows the plugin only calls APIs introduced in 1.24 or
earlier (apart from the optional EP profiler, which is now
version-gated). 1.24.4 is also the floor already used by the WebGPU
plugin EP, so this aligns the two.

## Key Changes

| Area | Change |
|---|---|
| `plugin-ep-cuda/MIN_ONNXRUNTIME_VERSION` | `1.26.0` → `1.24.4` (single source of truth for the floor) |
| `cmake/onnxruntime_providers_cuda_plugin.cmake` | Reads `MIN_ONNXRUNTIME_VERSION` and bakes it into the DLL as the `ORT_PLUGIN_EP_MIN_ORT_VERSION` compile definition |
| `cuda_plugin_ep.cc` | `CreateEpFactories()` negotiates the runtime API version via `onnxruntime::ep::ApiInit(...)` instead of hard-coding `GetApi(26)` |
| `cuda_plugin_utils.h` | Adds `CudaPluginEpOrtVersionSupported() = min(CurrentOrtApiVersion(), ORT_API_VERSION)`; removes the hard-coded min-version constant |
| 13 callback structs | Report `ort_version_supported`/`version` = `CudaPluginEpOrtVersionSupported()` |
| `cuda_ep.cc` | **Defensive capability gating**: installs each newer `OrtEp` callback only when the negotiated runtime is new enough: `Sync`/`CreateProfiler` require ≥1.25, graph-capture set + `GetAvailableResource` require ≥1.26; otherwise left null |
| `plugin-linux-cuda-test-stage.yml` | Adds a CI step that installs the floor (`MIN_ONNXRUNTIME_VERSION`) base `onnxruntime` and runs the plugin test against it, catching any accidental dependency on a newer API |
| Docs | New §2.6 "API Version Audit and Defensive Capability Gating" in the design doc; QUICK_START min-version test recipe |

## API Version Audit

| API surface | Newest `\since` used |
|---|---|
| `OrtApi` direct calls | 1.23 |
| `OrtEpApi` direct calls | 1.24 |
| EP profiler API (only with `ENABLE_CUDA_PROFILING`) | 1.25 |

Apart from the optional EP profiler, every API the plugin calls is
`\since 1.24` or older, justifying the 1.24.4 floor. The profiler's
three `\since 1.25` functions are made unreachable on older runtimes by
gating the `CreateProfiler` callback.
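
The negotiation and gating pattern can be illustrated with a small stand-alone model. Everything here is hypothetical: `EpCallbacks` is not the real `OrtEp` struct, the callbacks are no-op placeholders, and the sketch assumes ORT 1.x maps to API version x (consistent with the `GetApi(26)` call mentioned for 1.26) and that the plugin was compiled against version 28 headers.

```cpp
#include <algorithm>
#include <cstdint>

// Assumption: the plugin was built against headers for API version 28.
constexpr std::uint32_t kCompiledApiVersion = 28;

// Mirrors min(CurrentOrtApiVersion(), ORT_API_VERSION): never claim more than
// either the runtime or the compiled headers support.
inline std::uint32_t NegotiatedVersion(std::uint32_t runtime_api_version) {
  return std::min(runtime_api_version, kCompiledApiVersion);
}

// Hypothetical callback table; a null pointer means "not offered to the runtime".
struct EpCallbacks {
  void (*sync)() = nullptr;
  void (*create_profiler)() = nullptr;
  void (*graph_capture)() = nullptr;
  void (*get_available_resource)() = nullptr;
};

inline void NoOp() {}  // placeholder for a real callback implementation

// Installs each newer callback only when the negotiated version is high enough.
inline EpCallbacks InstallCallbacks(std::uint32_t negotiated) {
  EpCallbacks callbacks;
  if (negotiated >= 25) {
    callbacks.sync = &NoOp;
    callbacks.create_profiler = &NoOp;
  }
  if (negotiated >= 26) {
    callbacks.graph_capture = &NoOp;
    callbacks.get_available_resource = &NoOp;
  }
  return callbacks;
}
```

On a 1.24 runtime every gated pointer stays null, so the runtime never reaches code that depends on APIs it does not provide.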

## Testing Notes

- Incremental build on CUDA 12.8 / SM90 — clean, plugin `.so` relinked.
- `test_cuda_plugin_ep.py` against the latest runtime (1.28): **87/87
tests pass**.
- Plugin (built against latest headers) loaded into
`onnxruntime==1.24.4`: registers, enumerates all 8 GPUs, and runs
inference correctly with the newer callbacks left null.
- `lintrunner` clean on changed files.
- New CI step validates the plugin against the declared floor
automatically.