docs: comprehensive documentation audit and improvements #828

Open
DingmaomaoBJTU wants to merge 145 commits into main from docs/draft

Summary

Full documentation audit of the docs/ site with factual corrections, new pages, and structural improvements.

New Pages

  • getting-started/agent-skill.md — Copilot Coding Agent skill integration
  • tutorials/build-from-onnx.md — Bring Your Own ONNX Model tutorial
  • samples/clip-composite.md — CLIP composite model sample
  • reference/python-api.md — Python API reference
  • reference/output-layout.md — Build output directory structure
  • reference/supported-models.md — Supported models and EP compatibility table
  • contributing.md — Simplified, references repo CONTRIBUTING.md
  • troubleshooting.md — Restructured by component (Compile / Analyze / Build-Cache)

Factual Corrections

  • Compile validation: corrected from "random inputs + numerical comparison" to all-ones dummy inputs + NaN/Inf check
  • Removed --debug from global flags (hidden flag)
  • Fixed -p/--precision scope (only config and quantize)
  • Added ONNX file input entry to pipeline diagram
  • Fixed EP table: added CUDA/NvTensorRTRTX/MIGraphX, corrected QNN note to "bundled in ORT"
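
The corrected compile-validation description (all-ones dummy inputs followed by a NaN/Inf check) can be illustrated with a minimal sketch. This is not winml-cli code: `make_ones_input`, `has_non_finite`, `validate`, and the stand-in model callables are hypothetical names, and flat Python lists stand in for tensors.

```python
import math
from typing import Callable, List, Sequence


def make_ones_input(shape: Sequence[int]) -> List[float]:
    """Flat all-ones buffer with prod(shape) elements (hypothetical helper)."""
    return [1.0] * math.prod(shape)


def has_non_finite(values: Sequence[float]) -> bool:
    """True if any value is NaN or +/-Inf."""
    return any(not math.isfinite(v) for v in values)


def validate(run_model: Callable[[List[float]], List[float]],
             shape: Sequence[int]) -> bool:
    """Run once on all-ones input; pass only if every output is finite.

    Note what this does NOT do: it does not compare outputs against a
    reference model, which is the claim the docs previously made.
    """
    outputs = run_model(make_ones_input(shape))
    return not has_non_finite(outputs)


# Stand-in "models" (assumptions for illustration only).
def healthy(xs: List[float]) -> List[float]:
    return [x * 0.5 for x in xs]


def broken(xs: List[float]) -> List[float]:
    return [float("inf")] + xs[1:]


print(validate(healthy, (1, 3, 2, 2)))  # True
print(validate(broken, (1, 3, 2, 2)))   # False
```

A check of this kind catches numerically broken artifacts cheaply, but it says nothing about accuracy, which is why the wording change matters.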

Structural Improvements

  • Compiler backends: all compile pages now distinguish --compiler ort (default) vs --compiler qairt
  • auto field documentation in build config schema
  • CI/CD reproducibility tips for winml_build_config.json
  • EP alias column and auto/all special values
  • Quickstart reordered: inspect → export
  • How-it-works: replaced mermaid with SVG, added Analyze section
  • Mike version control plugin + CI workflow for multi-version docs
  • Nav reorder: Datatype and Quantization before EP and Device

Preview

Build with uv run mkdocs serve to preview locally.

tezheng and others added 30 commits May 27, 2026 00:08
Adds a complete MkDocs Material documentation site for the winml-cli
project, served from /docs and built locally and via GitHub Actions
(manual dispatch).

Site infrastructure:
- mkdocs.yml with Material theme, mermaid superfences, tabbed code,
  light/dark palette toggle
- pyproject.toml dev deps: mkdocs-material, mkdocs-jupyter,
  pymdown-extensions
- .github/workflows/docs.yml (workflow_dispatch only)
- .gitignore exception for docs/superpowers/specs/

User-facing chapters:

Home — tagline + Goals/Promises bullets sourced from the MVP
transcript; describes the toolkit's three workflows (primitives,
pipeline, one-command) plus the EP × Device coverage promise

Getting Started (3 pages):
- Installation — Win 11 24H2 + Copilot+PC + Python 3.10 + uv + git
  prereqs table; 'No NPU?' callout pointing at --device auto with
  the winml eval caveat
- Quickstart — 5-minute export + inspect with
  'winml sys --list-device --list-ep' verify step
- End-to-End Tour — universal --device auto walkthrough that works
  on Copilot+ PC NPU, DirectML GPU, or CPU; tabbed example outputs
  for sys and perf so each reader sees their own machine

Concepts (12 pages in two sub-groups):
- Fundamentals (5): How winml-cli works, Graph and IR, Weight and
  Activation, EP and Device (with the full 7-EP × Device matrix),
  Datatype and Quantization (8-precision family from _KNOWN_PRECISIONS
  with w4a16 marked 'Planned — not yet supported')
- WinML CLI (7 workflow-concept pages): Primitives and pipeline,
  Load and export, Analyze and optimize, Compile and EPContext,
  Perf and monitoring, Eval and datasets, Config and build (with
  the full WinMLBuildConfig schema inline)

Commands (13 pages):
- Overview with the four user-intent groups (Discover / Configure /
  Build / Measure)
- Per-command reference for: sys, inspect, hub, analyze, config,
  optimize, export, quantize, compile, build, perf, eval

Samples (3 pages):
- ConvNeXt — Primitives Walkthrough (CPU/GPU/NPU device comparison)
- BERT — Config + Build + Perf (workflow demonstration)
- Qwen3 — Composite Models (placeholder for the in-progress feature)

Tutorials (2 pages):
- Overview
- ConvNeXt on NPU — 2200-word linear walkthrough with both QNN and
  OpenVINO compile paths shown via tabbed blocks, plus the
  'winml build' one-shot variant

P2 stubs preserved in nav: Reference, Troubleshooting, Contributing

Source-grounding:
- Every flag mentioned in user-facing docs is verified against
  src/winml/modelkit/
- Non-functional flags (--torch-module, --dynamo on export;
  --no-quant on compile) are explicitly marked
- All URLs target the canonical microsoft/winml-cli destination
- mkdocs build --strict passes with zero warnings

Internal artifacts kept under docs/superpowers/ for reference:
- Spec and plan files for the v1 and v2 design iterations
- 2026-05-26-v3-known-issues.md — fact-checked review findings

Existing internal docs (docs/design/, docs/naming-convention.md,
docs/pytest-best-practices.md) are unchanged and excluded from the
user-facing nav via exclude_docs in mkdocs.yml.
…he site

Adds a contributor-facing README at docs/README.md covering:
- uv-based dev setup
- mkdocs serve / build --strict workflow
- gh-deploy publish (local one-shot)
- .github/workflows/docs.yml CI workflow (currently workflow_dispatch only)
- Authoring conventions (winml-cli name, flag verification, admonitions,
  tabbed code blocks)
- Excluded paths reference

Updates mkdocs.yml exclude_docs to include /README.md so the new file
doesn't collide with docs/index.md as the chapter index.
…source

Six parallel review agents fact-checked all 34 user-facing doc files
against microsoft/winml-cli @ 5e25579. Output: one issue file per
source doc at docs/superpowers/2026-05-27-doc-issues/.

A validator agent then cross-checked every Critical and Important
claim and produced the consolidated, false-positive-filtered list at
docs/superpowers/2026-05-27-validated-issues.md.

Summary: 25 Critical + 22 Important kept; 6 rejected as false
positives. Major theme: docs were authored against feat/mvp source
where some symbols and defaults differ from main (e.g., _KNOWN_PRECISIONS
in _options.py vs _NAMED_PRECISIONS in precision.py; winml hub vs
winml catalog; many flag defaults flipped to 'auto'; DML/CPU no
longer produce _ctx.onnx artifacts).

Next step: per-file fix agents will apply the validated list.
…eview

5 parallel fix agents applied the validated-issues list. Net: 25 Critical
+ 22 Important defects resolved across 20 doc files + mkdocs.yml.

Major fixes by area:

Concepts (4 pages):
- quantization.md: NPU auto-precision corrected to w8a16 (was int8);
  w4a16 description corrected (rejected at validation, not 'recognized
  but raises at quantization'); _KNOWN_PRECISIONS/_options.py references
  replaced with the actual _NAMED_PRECISIONS/precision.py
- compile-and-epcontext.md: removed non-existent --no-quant flag mention
- config-and-build.md: JSON 'compile' section flattened to use
  execution_provider (not nested ep_config.provider); table expanded to
  the actual 7 sub-configs (added eval, auto)
- perf-and-monitoring.md: --device documented as accepting auto;
  output path corrected to ~/.cache/winml/perf/<slug>/<timestamp>.json;
  --monitor not NPU-specific; --op-tracing marked hidden

Commands (11 pages):
- overview.md: winml hub renamed to winml catalog throughout;
  _options.py reference replaced with cli.py
- hub.md: H1 and all invocations changed to 'winml catalog'; removed
  non-existent --model/-m flag; rewrote 'How it works' (no per-EP latency
  / accuracy-verdict columns exist); added --ep/--device filter flags
- build.md: --config marked optional (was required); --random-init and
  --qnn-sdk-root removed (don't exist); --no-compile/--compile toggle
  pair documented; --trust-remote-code added; --max-optim-iterations
  default corrected to None
- compile.md: --device default corrected to auto; --no-quant flag
  removed (doesn't exist on compile)
- config.md: --no-compile/--compile framing corrected (compile is
  EXCLUDED by default; users need --compile to include)
- eval.md: --device includes auto (default auto, not cpu); -n short
  alias removed; class reference replaced with actual evaluate function
- analyze.md: --device default corrected to auto; --ep default to
  auto; --run-unknown-op default to False; -m/-v/-q/-c flags added
- optimize.md: --preset/-p flag and entire Built-in presets table
  removed (flag doesn't exist); --verbose added; 'Configuration
  precedence' reduced from 4 levels to 3
- inspect.md: --list-tasks, --model-type, --model-class, --verbose
  flags added
- perf.md: --compare-devices removed (not registered at all); output
  path corrected; --op-tracing marked hidden
- sys.md: --verbose/-v added to flag table

Samples / Tutorials / Getting Started (5 pages):
- installation.md: Python 3.10 corrected to 3.11; 'No NPU?' callout
  no longer claims winml eval rejects auto (it accepts auto on main)
- end-to-end.md: dropped incorrect _ctx.onnx CPU/DML artifacts;
  QNNExecutionProvider mapped to NPU/GPU (not just NPU)
- convnext-primitives.md: CPU/GPU compile clarified (no _ctx.onnx
  produced; uses convnext_int8.onnx directly); winml eval auto reverted
- bert-config-build.md: build final artifact corrected to model.onnx
  (was bert-base-uncased_ctx.onnx)
- npu-convnext.md: Python 3.10 -> 3.11; OpenVINO artifact filename
  corrected to use device string (_npu_ctx.onnx not _openvino_ctx.onnx);
  CPU compile tab dropped (CPU doesn't produce _ctx.onnx)

mkdocs.yml: nav label 'hub' renamed to 'catalog' to match the actual
command name on microsoft/winml-cli main.
…meration)

The opening paragraph re-stated the project tagline (already on the
home page one click above) and enumerated 4 EPs (QNN, OpenVINO, DML,
ONNX Runtime) — which goes stale; the canonical list in
concepts/eps-and-devices.md has 7. Removing the paragraph; the page
now starts with the Prereqs table. Matches the convention used by
quickstart.md and end-to-end.md (neither re-states the tagline).
## Summary

- Rewrote `docs/concepts/analyze-and-optimize.md` with source-verified
content: SupportLevel classification table, lint vs autoconf outputs,
analysis modes, optimizer pipe architecture (4 pipes, 43 capabilities, 5
rewrite groups / 12 rules), and autoconf loop SVG diagram
- Updated `docs/commands/analyze.md` with corrected EP aliases,
exit-code table, and additional CLI examples
- Renamed `hub.md` → `catalog.md` and updated all cross-references
(inspect, overview, sys, mkdocs.yml)
- Fixed `check-yaml` pre-commit hook to support `!!python/name` tags in
mkdocs.yml (`--unsafe`)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Zhipeng Wang <zhiwang@microsoft.com>
Co-authored-by: Qiong Wu (qiowu) <qiowu@microsoft.com>
Co-authored-by: hualxie <hualxie@microsoft.com>
Co-authored-by: Charles Zhang <zhangchao@microsoft.com>
Co-authored-by: Zhenchao Ni <zhenni@microsoft.com>
## Summary

- Drop the `WindowsAppRuntimeVersion` class, attribute, property, and
`windowsAppRuntimeVersion` field in `SysInfo.to_dict()` from
`src/winml/modelkit/sysinfo/sysinfo.py`.
- Remove the now-unused `import re`.

Nothing else in the codebase referenced these symbols. Integration
`runtime_checker` fixtures still contain the field inside their stored
`sys_info` blob, but the test helper ignores `sys_info` during
comparison, and the field will disappear naturally next time those
fixtures are regenerated.
…763)

## Summary
- **VitisAI EP ordering**: Move `VitisAIExecutionProvider` to end of
`EP_SUPPORTED_DEVICES` so it appears last in `analyze --ep all` output,
since it is not yet fully supported.
- **Catalog table width**: Set `expand=False` on both `Table` and
`Panel` in `_build_list_renderable` so the catalog table fits its
content width instead of stretching to the full terminal width.
…tection (#779)

Also update scripts/e2e_eval/run_pytorch_baseline.py to include pytorch
model latency

---------

Co-authored-by: hualxie <hualxie@microsoft.com>
## Summary

- Reorganized README into 5 sections: Title + Description, Features /
Scope, Getting Started, Commands, Contributing + License
- Updated status badge to `preview`, rewrote description and Features (✅
bullets)
- Scope section: added supported EPs, built-in model catalog reference,
accepted inputs; removed verbose LLM/not-supported block
- Getting Started: consolidated Prerequisites + Installation + Quick
Start; added Config-Build Pipeline and Step-by-step through primitive
commands walkthroughs
- Commands: BYOM workflow with pipeline diagram, command table +
collapsible details, comparison table (Config-Driven first)
- Reference tables at end: Supported Hardware, Supported Tasks,
Supported Model Types, Built-in Models

---------

Co-authored-by: Qiong Wu (qiowu) <qiowu@microsoft.com>
Co-authored-by: Zhipeng Wang <zhiwang@microsoft.com>
## Summary

- Removed the duplicated `WinML CLI (Python wheel) | [Releases]` row in
the Prerequisites table.
- Updated the install step from `uv pip install
winml_cli-<version>-py3-none-any.whl` to `pip install winml-cli`.
- Updated the Prerequisites entry to point at PyPI instead of GitHub
Releases, keeping the table and install instructions consistent.
## Summary

- Adds `resolve_check_device_ep` helper that validates a (device, EP)
combination without requiring the device/EP to actually exist on the
system. Closes #765.
- `commands/config.py` and `config/build.py` now use
`resolve_check_device_ep` instead of `resolve_device` so `winml config`
no longer hard-fails on hosts where the requested EP isn't installed.
- When `device=auto` or `ep=None`, the helper delegates to the existing
`resolve_device` + `resolve_eps` flow (system-aware behavior preserved).
When both `device` and `ep` are explicit, it only validates against the
static `EP_SUPPORTED_DEVICES` mapping.
- CLI cleanup: `-m/--model`, `-c/--config`, `--device` for the config
command now use the shared `cli_utils.*_option` decorators.
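
The two code paths described above (delegation versus static-only validation) might look roughly like the following sketch. The table contents, the `resolve_on_system` placeholder, the argument order, and the error messages are all assumptions; only the name `resolve_check_device_ep`, the `EP_SUPPORTED_DEVICES` mapping, and the case-insensitive behaviour come from the PR text.

```python
from typing import Optional, Tuple

# Hypothetical excerpt; the real EP_SUPPORTED_DEVICES lives in winml-cli.
EP_SUPPORTED_DEVICES = {
    "QNNExecutionProvider": ("npu", "gpu"),
    "CPUExecutionProvider": ("cpu",),
}


def resolve_on_system(device: str, ep: Optional[str]) -> Tuple[str, str]:
    """Placeholder for the system-aware resolve_device + resolve_eps flow."""
    raise NotImplementedError("system probing is out of scope for this sketch")


def resolve_check_device_ep(device: str, ep: Optional[str]) -> Tuple[str, str]:
    # Path 1: anything automatic still needs to look at the real machine.
    if device.lower() == "auto" or ep is None:
        return resolve_on_system(device, ep)

    # Path 2: both values explicit, so validate against the static table
    # only; the EP does not have to be installed on this host.
    by_lower = {name.lower(): name for name in EP_SUPPORTED_DEVICES}
    canonical = by_lower.get(ep.lower())
    if canonical is None:
        raise ValueError(f"Unknown EP: {ep}")
    dev = device.lower()
    if dev not in EP_SUPPORTED_DEVICES[canonical]:
        raise ValueError(f"{canonical} does not support device '{device}'")
    return dev, canonical


print(resolve_check_device_ep("NPU", "qnnexecutionprovider"))
```

Keeping the static path free of system probing is what lets `winml config` run on a host where the requested EP is not installed.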

## Tests

- New `TestResolveCheckDeviceEp` class in
`tests/unit/sysinfo/test_device.py` covering both code paths (delegation
and static-only) plus error cases (unknown EP, unsupported device,
case-insensitivity).
- Existing config-test mocks updated from `resolve_device` to
`resolve_check_device_ep` (`tests/unit/config/conftest.py`,
`tests/unit/config/test_build.py`,
`tests/unit/config/test_build_onnx.py`,
`tests/unit/commands/test_config_cli.py`) so the lazy import in
`config/build.py` is intercepted.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: hualxie <hualxie@microsoft.com>
…gs) (#785)

Adds curated recipe configs for the 12 builtin models — those that pass
fp16 eval on all 9 (EP, device) buckets.
## Summary

Fixes `scripts/e2e_eval/run_eval.py` crashing on VitisAI EP (AMD Ryzen
AI NPU) and a latent bug in `winml build` that prevented the script's
`--no-quant` workaround from actually taking effect.

The crash: VitisAI ships its own internal quantizer and runs it at
session-create time. Layering winml's generic QDQ quantization pass on
top produces a model VitisAI cannot consume, which manifests as
`DpuKernelRunner.cpp:1920 DPU timeout` during `winml perf`. The fix is
to tell winml to skip its own quantization when the selected EP
quantizes natively.

## Changes

### `src/winml/modelkit/commands/build.py` — root-cause fix (1 line)

When `--device` was passed to `winml build`, the internal
`_patch_device` helper unconditionally re-populated `cfg.quant` with the
device's default quantization config, silently undoing any prior
`--no-quant`. The condition now respects `no_quant`:

```python
if no_quant or resolved_quant is None:
    cfg.quant = None
```

Without this, `winml build … --device npu --no-quant` still produced a
`_quantized.onnx` artifact.
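
To make the effect of that condition concrete, here is a small self-contained sketch. `BuildConfig`, `patch_device`, and the example quantization dict are hypothetical stand-ins; only the `if no_quant or resolved_quant is None` condition is taken from the PR, and the `else` branch is an assumption about the surrounding code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BuildConfig:
    """Hypothetical stand-in for the real build config object."""
    quant: Optional[dict] = None


def patch_device(cfg: BuildConfig,
                 resolved_quant: Optional[dict],
                 no_quant: bool) -> None:
    # Condition from the PR: an explicit --no-quant wins over the
    # device's default quantization config.
    if no_quant or resolved_quant is None:
        cfg.quant = None
    else:
        # Assumed behaviour for the remaining case.
        cfg.quant = resolved_quant


device_default = {"precision": "w8a16"}  # illustrative value only

cfg = BuildConfig()
patch_device(cfg, device_default, no_quant=True)
print(cfg.quant)  # None

cfg = BuildConfig()
patch_device(cfg, device_default, no_quant=False)
print(cfg.quant)  # {'precision': 'w8a16'}
```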

### `scripts/e2e_eval/run_eval.py` — script wiring

- New canonical-name set `_NATIVE_QUANT_EPS =
{"VitisAIExecutionProvider"}` plus a helper `_ep_quantizes_natively(ep)`
that funnels both canonical names and user aliases (e.g. `vitisai`)
through `winml.modelkit.utils.constants.normalize_ep_name`. No hardcoded
aliases.
- `_resolve_precision(...)` gained an `ep` parameter; for native-quant
EPs it returns `None` so no precision flag is sent.
- `_run_build` now passes `--no-quant` to **both** `winml config` (so
the persisted `build_config.json` has `quant: null` up-front) and `winml
build` (defense in depth) when the EP quantizes natively.
- Call sites in `run_model` and `main` updated to thread `ep` through
`_resolve_precision`.
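
A rough sketch of how the native-quantization check and the precision resolution could fit together follows. The alias table and the simplified `normalize_ep_name` are placeholders for the real `winml.modelkit.utils.constants.normalize_ep_name`, and the `resolve_precision` signature is an assumption; the script itself hardcodes no aliases.

```python
from typing import Optional

_NATIVE_QUANT_EPS = {"VitisAIExecutionProvider"}

# Placeholder alias table; in winml-cli this knowledge lives inside
# normalize_ep_name, not in the script.
_ALIASES = {"vitisai": "VitisAIExecutionProvider"}


def normalize_ep_name(ep: str) -> str:
    """Simplified stand-in: map a known alias to its canonical name."""
    return _ALIASES.get(ep.lower(), ep)


def ep_quantizes_natively(ep: Optional[str]) -> bool:
    """True when the EP runs its own quantizer at session-create time."""
    return ep is not None and normalize_ep_name(ep) in _NATIVE_QUANT_EPS


def resolve_precision(requested: Optional[str], ep: Optional[str]) -> Optional[str]:
    """Return None for native-quant EPs so no precision flag is sent."""
    if ep_quantizes_natively(ep):
        return None
    return requested


print(resolve_precision("w8a16", "vitisai"))               # None
print(resolve_precision("w8a16", "QNNExecutionProvider"))  # w8a16
```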

## Why the earlier commits in this branch weren't enough

The first attempt (`fix(run_eval): skip quantize when VitisAI EP is
selected`) wired `--no-quant` only into `winml build`. That didn't take
effect because of the `_patch_device` bug above. The second attempt
(`fix(vitisai): resolve auto-precision to w8a8 for VitisAI NPU`) tried
to switch precision instead of skipping — also wrong, since VitisAI
wants an fp32 input and quantizes it itself. The final state keeps the
script clean (`--no-quant`, no precision override) and fixes the actual
`winml build` bug.

## Verification

Manual end-to-end on AMD Ryzen AI (VitisAI NPU), with a clean
`~/.cache/winml/artifacts/...` and output dir:

```pwsh
uv run --no-sync python scripts/e2e_eval/run_eval.py `
  --hf-model facebook/convnext-tiny-224 `
  --task image-classification `
  --device npu --ep vitisai `
  --eval-type perf --no-report --verbose --timeout 1800 `
  --output-dir e2e-test\vitisai_npu
```

Before: `winml perf` crashed with `DpuKernelRunner.cpp:1920 DPU
timeout`.
After:
- Cached `imgcls_*_winml_build_config.json` has `"quant": null`.
- No `_quantized.onnx` artifact produced.
- Perf step: **PASS** in ~120 s.
…771)

## Summary

Closes #546.

`winml inspect --task bogus-task` was leaking optimum's internal
`TasksManager` class name and pointing users to optimum docs:

> Error: Inspection error: Task 'bogus-task' not supported by
TasksManager. Check optimum documentation for supported tasks.

Now the value is validated at Click parse time against the hand-coded
`KNOWN_TASKS` set, before any heavy imports:

```
$ winml inspect -m microsoft/resnet-50 --task bogus-task
Usage: winml inspect [OPTIONS]
Try 'winml inspect --help' for help.

Error: Invalid task 'bogus-task'. Valid: audio-classification, audio-frame-classification, audio-xvector, automatic-speech-recognition, depth-estimation, ... (35 total). See 'winml inspect --list-tasks' for the full list.
```

- Exit code 2 (Click UsageError)
- No third-party class names; no optimum-docs pointer
- Callback imports only `..loader.task.KNOWN_TASKS` — avoids the ~10s
optimum/transformers cold start, so the fail-fast stays fast
- `--list-tasks` and valid `--task` paths unchanged
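
The fail-fast validation can be approximated without Click, as in the sketch below. `format_invalid_task` and `validate_task` are hypothetical names, the `KNOWN_TASKS` set shown is a small illustrative subset, and `ValueError` stands in for the Click usage error that the real callback raises.

```python
from typing import Iterable, Optional

# Illustrative subset; the real KNOWN_TASKS set has 35 entries.
KNOWN_TASKS = frozenset({
    "audio-classification",
    "audio-frame-classification",
    "audio-xvector",
    "automatic-speech-recognition",
    "depth-estimation",
    "feature-extraction",
    "image-classification",
})


def format_invalid_task(task: str, known: Iterable[str], preview: int = 5) -> str:
    """Build an error message that previews the first few valid tasks."""
    ordered = sorted(known)
    shown = ", ".join(ordered[:preview])
    suffix = f", ... ({len(ordered)} total)" if len(ordered) > preview else ""
    return (
        f"Invalid task '{task}'. Valid: {shown}{suffix}. "
        "See 'winml inspect --list-tasks' for the full list."
    )


def validate_task(task: Optional[str], known: Iterable[str] = KNOWN_TASKS) -> Optional[str]:
    """Reject unknown tasks before any heavy import happens."""
    known_set = set(known)
    if task is not None and task not in known_set:
        raise ValueError(format_invalid_task(task, known_set))
    return task


try:
    validate_task("bogus-task")
except ValueError as exc:
    print(exc)
```

Because the check only needs a plain set of strings, it can run before optimum or transformers are imported, which is the point of the change.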

Co-authored-by: Ziyuan Guo (WE TEAM) <ziyuanguo@microsoft.com>
…#772)

## Summary

Fixes #541.

`winml catalog` was the only command where `-t` did NOT mean `--task`:

| Command   | `-t` means       |
|-----------|------------------|
| `inspect` | `--task`         |
| `export`  | `--task`         |
| `config`  | `--task`         |
| `catalog` | `--model-type` (inconsistent) |

A user who has memorized `-t` to mean `--task` in 3 commands would type
`-t image-classification` against `winml catalog` and silently get
`--model-type=image-classification` (no such model type) instead.

## Change

In `src/winml/modelkit/commands/catalog.py`:
- Dropped the `-t` short from `--model-type` (no short alias now).
- Moved `-t` to `--task` (replacing the previous `-k`).

`--model-type` is still fully supported via its long form.

Adds a regression guard test (`test_model_type_has_no_short_flag`) that
checks both the `--help` output AND that passing a model_type via `-t`
is interpreted as a task. All 115 catalog tests pass.

Co-authored-by: Ziyuan Guo (WE TEAM) <ziyuanguo@microsoft.com>
**Skips compilation-related cases**

Some models fail to compile with the VitisAI Execution Provider. The
failure is an "Access Violation" error that crashes the Python process,
which looks like an EP-side problem. To unblock our e2e tests, I have
skipped these cases for VitisAI.

**Skips NPU usage assertion for small models**

A small mock model can run so fast that the measured NPU usage is zero.
However, our assertion logic still expects some NPU usage, which makes
the e2e test unstable. Since the real-model e2e test cases already carry
this assertion, I skip it for the small model only.

**Skips eval metric value range assertion**

The eval e2e test uses only 10 samples because its goal is to check that
the eval pipeline works, not to truly evaluate a model. The assertion
logic has a metric range, but that range was calculated on a QNN device
and may not hold for other devices. Using the same range everywhere can
make the e2e test unstable. Therefore, I assert the metric range only for
QNN. For other devices, I only assert that the metric value is available.
uv run
~\ModelKit\examples\microsoft-swin-large-patch4-window7-224\example.py
--onnx
~\.cache\winml\artifacts\microsoft_swin-large-patch4-window7-224\imgcls_ec485f4653d962b9_quantized.onnx
True label: house finch, linnet, Carpodacus mexicanus (synset=n01532829,
id=12)

Top 5 predictions:
  1. house finch, linnet, Carpodacus mexicanus (0.9127)
  2. brambling, Fringilla montifringilla (0.0122)
  3. goldfinch, Carduelis carduelis (0.0028)
  4. chickadee (0.0013)
  5. junco, snowbird (0.0013)

Verdict (top-1): PASS

Annotated image written to prediction.png

---------

Co-authored-by: hualxie <hualxie@microsoft.com>
…ng (#790)

## Summary

timm checkpoints load through transformers' generic `TimmWrapper`
(`model_type="timm_wrapper"`) and previously failed in **every** `winml`
command with *"Cannot detect task: config has no 'architectures'
field"*. Two gaps:

1. **Task/class detection**: timm repos load as `TimmWrapperConfig`
with `architectures=None`, so auto-detection could not resolve a task or
class.
2. **OnnxConfig location**: Optimum registers timm's config
(`TimmDefaultOnnxConfig`) only under `library_name="timm"`, but every
`winml` lookup defaults to `transformers`.

`timm_wrapper` is transformers' generic bridge for the whole timm
library — not a model architecture — so it is resolved at the **shared
resolution layer**, not as a per-model config. Only the library is
recorded; the task is derived from Optimum.

## Changes (no `models/hf/` entry)

- **`loader/task.py`** — `WRAPPED_LIBRARY_MODEL_TYPES` (`model_type ->
optimum_library`) + `resolve_optimum_library()`. When a config has no
`architectures`, `_detect_task_and_class_from_config` derives the task
from Optimum's task list for the library
(`get_supported_tasks("timm_wrapper", "timm")` ->
`["image-classification"]`) and the class from
`get_model_class_for_task` (generic `AutoModelForImageClassification`,
which transformers dispatches to `TimmWrapper` at load). The task is not
hardcoded; the branch imports `optimum.exporters.onnx.model_configs`
first to populate Optimum's registry (scoped so normal model loading
never pays for it).
- **`export/io.py`** — `_get_onnx_config` routes the library via
`resolve_optimum_library`, so `timm_wrapper` resolves Optimum's
`TimmDefaultOnnxConfig` from every call site
(config/build/export/inspect) with no `--library` flag.
- **`commands/inspect.py`** + **`inspect/resolver.py`** — route both the
CLI inspect path and the public `inspect_model` path the same way:
library routing for the OnnxConfig lookup, plus wrapped-library task
detection so the task is not mislabeled.
- Tests: `resolve_optimum_library` + wrapped-library architectures
fallback with task derivation (loader); timm library routing for
`resolve_io_specs` / `_get_onnx_config` (export); public inspect path
`detect_task` / `resolve_exporter` for timm (inspect).

## Validation

**Functional (end-to-end)** on a timm image-classification model:

| Command | Before | After |
|---|---|---|
| `winml config` | exit 2: *no 'architectures' field* | task=image-classification, 1 input |
| `winml export` | exit 2: same | `model.onnx` (pixel_values to logits) |
| `winml inspect` | exit 1: same | `AutoModelForImageClassification` + `TimmDefaultOnnxConfig`, full I/O table |

`config` -> `export` -> `optimize` -> `model.onnx` validated end-to-end
for multiple timm CNN classifiers. Also resolves on a timm ViT backbone
(`num_labels=0`) -> task=image-classification, matching Optimum's own
`infer_task_from_model`, so it generalizes across timm architectures
(CNN + ViT).

**No impact on existing models** — scanned all 439 entries / 401 unique
models in `scripts/e2e_eval/testsets/models_all.json`: **0** are
`timm_wrapper` (by JSON metadata and by loaded config; 330 loadable).
Since `timm_wrapper` is the only trigger of the new branch, no existing
model changes behavior. (71 fail to load a config — custom/GGUF/tabular
types that fail at `AutoConfig` regardless; 7 have empty `architectures`
but are not timm — a pre-existing "Cannot detect task", identical before
and after the PR.)

**No overhead for normal (non-timm) models** — `winml config` on a
standard non-timm model: this branch vs base, min ~12.6s vs ~12.5s
(within run-to-run noise). Non-timm configs have `architectures`, so
they skip the new branch; the only added cost is one dict lookup.

**Unit tests** — `tests/unit/loader` + `tests/unit/export` +
`tests/unit/inspect`: green.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Yi Ren <reny@microsoft.com>
## Fix model-task inconsistency for vision feature-extraction models

Fixes #777, #778, #782.

### Principle

`winml inspect` is the source of truth for valid `(model_id, task)`
pairs. Both `feature-extraction` and `image-feature-extraction` are
valid ways to address an image-embedding model like
`facebook/dinov2-base`. Downstream commands must accept whichever name
`winml inspect` accepts, then use `(model_id, task)` to locate the
concrete class to act on.

### Root cause

Optimum's `TasksManager.get_exporter_config_constructor` only knows
canonical Optimum task names. Several call sites passed the raw
user-supplied task straight through, so HF aliases like
`image-feature-extraction` were rejected with "Unsupported". The
evaluator additionally needs to know which HF pipeline name to dispatch
on, which the canonical Optimum task name doesn't carry by itself for
bimodal tasks like `feature-extraction`.

### Fix

- **Inspect / export / HTP exporter**: normalize via
`_map_task_synonym(task)` (in `export/io.py`) before any `TasksManager`
lookup because it requires normalized task input. This is a single
function reused at each `TasksManager` boundary — no new global table.
- **Quantize**: `_resolve_dataset_class(task, io_config)` in
`datasets/__init__.py` dispatches to `TextDataset` / `ImageDataset`
based on the actual ONNX input names. No `AutoConfig.from_pretrained`
round-trip. Bimodal io_configs fall back to `RandomDataset` with a
warning.
- **Evaluate**: Because HF pipelines and the `evaluate` library have their
own task-name conventions, `to_hf_pipeline_task(task, model_id)` in
`eval/evaluate.py` translates the task to the HF pipeline name that the
underlying `evaluate` library expects. It uses `OnnxConfig.inputs` (no
weights loaded) to pick the modality. Bimodal models (e.g. CLIP combined:
both `pixel_values` and `input_ids`) keep the task unchanged via a `len(hits)
== 1` guard, preserving the explicit user task.
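
The modality-based dispatch used by the Quantize and Evaluate fixes can be sketched as follows. All names here (`detect_modalities`, `pick_pipeline_task`, `pick_dataset_kind`, the marker and mapping tables) are hypothetical, and the functions take input names directly rather than a `model_id` or `io_config`, so this shows the idea and not the real signatures.

```python
from typing import Iterable, List

# Assumed marker inputs per modality; real models may use other names.
_MODALITY_MARKERS = {
    "image": {"pixel_values"},
    "text": {"input_ids"},
}

# Assumed mapping from (task, sole modality) to an HF pipeline task.
_PIPELINE_TASKS = {
    ("feature-extraction", "image"): "image-feature-extraction",
}

_DATASET_KINDS = {"image": "ImageDataset", "text": "TextDataset"}


def detect_modalities(input_names: Iterable[str]) -> List[str]:
    """List every modality whose marker input appears in the model inputs."""
    names = set(input_names)
    return [m for m, markers in _MODALITY_MARKERS.items() if markers & names]


def pick_pipeline_task(task: str, input_names: Iterable[str]) -> str:
    """Translate only when exactly one modality is present."""
    hits = detect_modalities(input_names)
    if len(hits) == 1:
        return _PIPELINE_TASKS.get((task, hits[0]), task)
    return task  # bimodal or unknown: keep the explicit user task


def pick_dataset_kind(input_names: Iterable[str]) -> str:
    """Choose a calibration dataset from the actual input names."""
    hits = detect_modalities(input_names)
    if len(hits) == 1:
        return _DATASET_KINDS[hits[0]]
    return "RandomDataset"  # the real code also emits a warning here


print(pick_pipeline_task("feature-extraction", ["pixel_values"]))
print(pick_pipeline_task("feature-extraction", ["pixel_values", "input_ids"]))
print(pick_dataset_kind(["input_ids", "attention_mask"]))
```

The `len(hits) == 1` guard is what keeps a combined CLIP model on `feature-extraction` instead of silently rerouting it to the image evaluator.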

### Validation

`facebook/dinov2-base`:

| Command | Before | After |
|---|---|---|
| `winml inspect -m facebook/dinov2-base --task image-feature-extraction` | "Unsupported" | Resolves via `Dinov2OnnxConfig` |
| `winml export -m facebook/dinov2-base -t image-feature-extraction` | KeyError on TasksManager | Valid ONNX with `last_hidden_state` |
| `winml eval -m facebook/dinov2-base --task feature-extraction` | `RuntimeError: Failed to create feature-extraction dataset` | kNN metrics on mini-imagenet |
| `winml quantize <onnx> --task feature-extraction -m facebook/dinov2-small` | Fails because `TextDataset` is used | Routes to `ImageDataset` |

`openai/clip-vit-base-patch32` (bimodal, regression check):

- `winml eval -m openai/clip-vit-base-patch32 --task feature-extraction`
→ stays `feature-extraction` (text STS evaluator); not silently rerouted
to image.
- `winml eval -m openai/clip-vit-base-patch32` (auto-detect) → resolves
to `feature-extraction` (text).

### Tests

Unit:
- `tests/unit/eval/test_eval.py::TestResolveTask` — auto-detect,
explicit task, bimodal guard, HF pipeline translation.
- test_random_dataset.py — `TASK_DATASET_MAPPING` covers all registered
tasks, including bimodal dict-of-dict.

E2E (`-m e2e`, dinov2 chosen because it isn't in `MODEL_BUILD_CONFIGS`
and so actually exercises the `TasksManager` path):
- `tests/e2e/test_inspect_e2e.py::TestInspectDinoV2` — both
`image-feature-extraction` and `feature-extraction` resolve.
- `tests/e2e/test_export_e2e.py::TestExportDinoV2::test_image_feature_extraction`.
- `tests/e2e/test_eval_e2e.py::TestEvalPerTask::test_image_feature_extraction`
parameterized over both task names.
- `tests/e2e/test_quantize_e2e.py::test_feature_extraction_with_pixel_values_uses_image_dataset`.
Table for stub

```
  ┌──────────────┬──────────┬────────────────────────────────┬────────────────────────────────────────────┐
  │     Lib      │ py.typed │            Reality             │              Override status               │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ torch        │ yes      │ Has inline types (v2.11)       │ Override is a no-op — mypy uses real types │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ torchvision  │ no       │ No types, no community stubs   │ Genuinely needed                           │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ onnx         │ yes      │ Has inline types (v1.18)       │ Override is a no-op                        │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ onnxruntime  │ no       │ Untyped; no community stubs    │ Genuinely needed                           │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ transformers │ yes      │ Inline types but partial/loose │ Override is a no-op — types ARE used       │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ datasets     │ no       │ Untyped                        │ Genuinely needed                           │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ optimum      │ no       │ Untyped                        │ Genuinely needed                           │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ timm         │ yes      │ Has inline types (v1.0.26)     │ Override is a no-op                        │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ onnxscript   │ yes      │ Has inline types (v0.7)        │ Override is a no-op                        │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ snakemd      │ no       │ Untyped                        │ Genuinely needed                           │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ openvino     │ n/a      │ Not installed locally          │ n/a                                        │
  └──────────────┴──────────┴────────────────────────────────┴────────────────────────────────────────────┘
```

plotext added to ignore_missing_imports (no community stubs, untyped
library)

---------

Co-authored-by: Hualiang Xie <hualxie@microsoft.com>
…m_task (#801)

## What

PR1 of #800. Relocate `map_task_synonym` ->
`loader/task.py::to_optimum_task` to establish a single WinML->Optimum
task-collapse boundary.

## Changes

- `loader/task.py`: add `to_optimum_task` + `TASK_SYNONYM_EXTENSIONS`
(moved from `export/io.py`); exported via `loader/__init__.py`.
- `export/io.py`: local implementation removed; `map_task_synonym` kept
as a backward-compatible alias (`= to_optimum_task`); internal use
repointed.
- Optimum-boundary call sites repointed to `to_optimum_task`:
`commands/inspect.py`, `export/htp/exporter.py`, `inspect/resolver.py`.
- `commands/build.py`: `TASK_SYNONYM_EXTENSIONS` now imported from
`loader`.
- New `tests/unit/loader/test_task_boundary.py` pins the collapse
contract.

## Behavior

No behavior change. `map_task_synonym` stays importable from
`export.io`; the collapse semantics (`image-feature-extraction` ->
`feature-extraction`, WinML extensions preserved) are byte-identical.
Existing synonym and #777/#782 regression tests stay green.

Sets up PR2 (#800), which adds the modality-aware `detect_task` and
relies on this single collapse boundary.
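
A minimal sketch of the relocation pattern described above, assuming a simple lookup table. The table contents, the type of `TASK_SYNONYM_EXTENSIONS`, and the extension task name are illustrative assumptions, not the actual contents of `loader/task.py`:

```python
# Hypothetical stand-in for loader/task.py. Only the
# image-feature-extraction -> feature-extraction collapse is taken from the
# PR text; everything else here is an assumption for illustration.
TASK_SYNONYMS = {"image-feature-extraction": "feature-extraction"}

# WinML-specific task names that must pass through unchanged (made-up entry).
TASK_SYNONYM_EXTENSIONS = {"winml-example-task"}


def to_optimum_task(task: str) -> str:
    """Collapse a WinML task name to the name Optimum expects."""
    if task in TASK_SYNONYM_EXTENSIONS:
        return task  # WinML extensions are preserved
    return TASK_SYNONYMS.get(task, task)


# What export/io.py would keep for backward compatibility: the old name is
# bound to the same function object, so behavior cannot drift.
map_task_synonym = to_optimum_task
```

Binding the old name to the same function object, rather than wrapping it, is what keeps the two entry points identical.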
…#793)

Fixes #566.

## Problem

- Top-level group declared ``-v/--verbose`` (count) and ``-q/--quiet``,
but 12 of 13 subcommands redeclared ``--verbose`` as ``is_flag=True``,
so ``winml export -vv …`` errored with ``extra argument``.
- No subcommand exposed ``-q/--quiet``, so ``winml export --quiet …``
failed with ``no such option``.
- Each command wired logging differently; DEBUG/INFO lines interleaved
with Rich tables on stdout, breaking ``cmd > out 2> log.txt``.

## Changes

- ``utils/cli.py``: ``verbosity_options`` decorator (``-v`` count,
``-q`` flag) + new ``resolve_verbosity(ctx, verbose, quiet)`` helper
that merges top-level and subcommand-level values (max of verbose, OR of
quiet). Honors the legacy ``ctx.obj["debug"]`` so tests that bypass
``main()`` still raise the verbosity floor.
- ``utils/logging.py``: format ``[%(asctime)s %(levelname)-7s %(name)s]
%(message)s`` with ``datefmt=%H:%M:%S``, ``stream=sys.stderr``.
Idempotent — re-creates the WinML handler bound to the current
``sys.stderr`` on each call so Click ``CliRunner`` stream redirection
keeps working, and leaves non-WinML handlers (notably pytest ``caplog``)
intact.
- ``cli.py``: top-level group uses ``@verbosity_options`` (replaces
inline declarations); ``--debug`` alias preserved.
- 12 subcommands (``build``, ``compile``, ``config``, ``eval``,
``export``, ``inspect``, ``optimize``, ``perf``, ``quantize``, ``sys``,
plus ``analyze`` cleanup): replace ad-hoc ``--verbose``
(``is_flag=True``) with ``@cli_utils.verbosity_options``, add ``quiet:
bool`` param, call ``configure_logging(verbosity=verbose, quiet=quiet)``
after ``resolve_verbosity``. Removes the legacy ``if
ctx.obj.get("debug"): verbose = True`` blocks (folded into the
helper).
- ``serve/app.py``: pre-existing latent bug — module-level
``logging.getLogger("winml.modelkit").setLevel(INFO)`` ran at import,
which muted DEBUG capture in unrelated tests that got collected
alongside the serve test module. Split into ``_attach_log_handler()``
(idempotent, called from ``_register_routes``) and a paired
``_ensure_log_capture_level`` / ``_restore_log_capture_level`` invoked
from the production lifespan. Tests that build the app via
``_register_routes`` + a mock lifespan no longer leak global logger
state.

## Behavior

Both flag positions work. When a flag is passed at both levels, the
values are merged rather than one overriding the other (max of the
``-v`` counts, OR of ``--quiet``):

```text
winml -v export -m … -o …            # top-level: works
winml export -vv -m … -o …            # subcommand: now works (was: extra argument)
winml --quiet export -m … -o …        # top-level: works
winml export --quiet -m … -o …        # subcommand: now works (was: no such option)
winml inspect -vv -m … --format json > out 2> log.txt   # clean stdout/stderr split
```
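
The merge rule can be sketched as a pure function. This is an illustration only: the real helper is ``resolve_verbosity(ctx, verbose, quiet)`` and reads the top-level values from the Click context, whereas the function name, parameters, and the debug floor of 1 below are assumptions:

```python
def merge_verbosity(
    top_verbose: int,
    top_quiet: bool,
    sub_verbose: int,
    sub_quiet: bool,
    legacy_debug: bool = False,
) -> tuple[int, bool]:
    """Combine top-level and subcommand-level verbosity flags."""
    # The higher -v count wins regardless of where it was passed.
    verbose = max(top_verbose, sub_verbose)
    # A legacy debug flag raises the floor (assumed here to be one level).
    if legacy_debug:
        verbose = max(verbose, 1)
    # --quiet at either level is enough.
    quiet = top_quiet or sub_quiet
    return verbose, quiet
```

With this rule, ``winml -v export -vv`` resolves to a verbosity of 2, and ``--quiet`` at either position is honored.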

## Tests

- ``tests/cli/`` (23): pass
- ``tests/unit/`` (5061 collected): **5058 pass**, 3 fail — all 3
pre-existing on main and unrelated to this change:
  - ``test_winml_session.py::TestOpenVINODeviceRouting::test_compile_openvino_cpu_device_succeeds``
  - ``test_winml_session.py::TestOpenVINODeviceRouting::test_compile_openvino_cpu_provider_not_npu``
    (both env: no OpenVINO EP installed)
  - ``test_config_utils.py::TestMergeConfigNoneHandling::test_none_to_value_transition``
    (test isolation, passes alone)

---------

Co-authored-by: hualxie <hualxie@microsoft.com>
## Summary
- Replace hardcoded 4-EP list in `analyze_from_proto(ep=None)` with
dynamic lookup from `EP_SUPPORTED_DEVICES`, filtered by target device
- Remove `max_length=4` constraint on `AnalysisOutput.results` to
support more than 4 EPs per device
- Change uniqueness validator from IHV type to EP type (multiple EPs can
share the same IHV, e.g. CUDA and DML both map to MICROSOFT)

**Before:** `analyzer.analyze(ep=None)` always analyzed QNN, OpenVINO,
VitisAI, NvTensorRTRTX regardless of device — NvTensorRTRTX was analyzed
on NPU even though it only supports GPU.

**After:** EP list is derived from `EP_SUPPORTED_DEVICES` filtered by
the target device, matching the CLI `--ep all` behavior exactly.
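
A rough sketch of the device-filtered lookup, assuming `EP_SUPPORTED_DEVICES` maps each EP name to a set of device names. The real structure and contents may differ. Only the fact that NvTensorRTRTX is GPU-only comes from the description above, and the other entries are placeholders:

```python
# Illustrative mapping; the actual EP_SUPPORTED_DEVICES table is not shown
# in this PR. Only the NvTensorRTRTX entry is grounded in the text.
EP_SUPPORTED_DEVICES = {
    "QNN": {"NPU"},
    "OpenVINO": {"CPU", "GPU", "NPU"},
    "VitisAI": {"NPU"},
    "NvTensorRTRTX": {"GPU"},
}


def eps_for_device(device: str) -> list[str]:
    """Return the EPs that support the given device, in table order."""
    return [
        ep for ep, devices in EP_SUPPORTED_DEVICES.items() if device in devices
    ]
```

Because the list is derived from the table, adding an EP to the table extends the default analysis without touching the analyzer.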
Resolves #326.

Adds `WinMLDepthEstimationEvaluator` and `DepthMetric` (Absolute
Relative error, RMSE, delta-1) following the NYU/KITTI evaluation
protocol. HuggingFace `evaluate` doesn't ship a depth-estimation
evaluator, so the metric loop is implemented manually.

### Background

Depth-estimation models fall into a few groups, and the same input image
gives wildly different prediction scales depending on which group the
model belongs to.

- Metric-depth models (ZoeDepth, DepthPro) predict depth in meters
directly.
- Relative-depth models (Depth-Anything, Marigold) predict depth up to
an unknown scale and shift.
- Disparity models (DPT, MiDaS) predict `1 / depth` (inverse depth) up
to scale and shift.

Comparing predictions against the NYU ground truth therefore requires
(1) optionally inverting disparity into depth and (2) aligning the
prediction to the ground truth before computing metrics. This is what
AbsRel/RMSE/delta-1 benchmarks in the literature do, and what this PR
adds as user-selectable options.

### Options

Two `columns_mapping` keys, both overridable via `--column`, and both
visible in `winml eval --schema --task depth-estimation`.

`align` controls how each prediction is rescaled against the ground
truth depth map before metrics are computed:

- `affine` (default): per-image least-squares fit of `pred_aligned = s *
pred + t`, where `s` is a scalar scale and `t` is a scalar shift, solved
on the valid pixels (those passing the depth range mask). Suitable for
relative-depth and disparity models.
- `median`: scale-only alignment, `pred_aligned = (median(gt) /
median(pred)) * pred`. No shift. Cheaper but less accurate when the
model has a non-zero offset.
- `none`: use the prediction as-is. Suitable for metric-depth models
that already output meters.
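
The three `align` modes can be sketched as follows. This is an illustration over flat lists of already-masked pixel values using only the standard library. The function name is hypothetical, and the evaluator's actual implementation (most likely array-based) is not shown in this PR:

```python
from statistics import mean, median


def align_prediction(pred: list[float], gt: list[float], mode: str = "affine") -> list[float]:
    """Rescale valid-pixel predictions against ground truth."""
    if mode == "none":
        return list(pred)
    if mode == "median":
        # Scale-only alignment: match the medians, no shift.
        scale = median(gt) / median(pred)
        return [scale * p for p in pred]
    if mode == "affine":
        # Closed-form least squares for gt ~ s * pred + t.
        p_mean, g_mean = mean(pred), mean(gt)
        var = sum((p - p_mean) ** 2 for p in pred)
        cov = sum((p - p_mean) * (g - g_mean) for p, g in zip(pred, gt))
        # A constant prediction has no usable scale; fall back to a pure shift.
        s = cov / var if var > 0 else 0.0
        t = g_mean - s * p_mean
        return [s * p + t for p in pred]
    raise ValueError(f"unknown align mode: {mode!r}")
```

If the prediction is an exact affine function of the ground truth, `affine` recovers it exactly, while `median` leaves a residual whenever the true shift is non-zero.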

`depth_kind` indicates what the model outputs:

- `depth` (default): prediction is interpreted as depth.
- `disparity`: prediction is interpreted as inverse depth, so it is
inverted (`pred := 1 / pred`) before alignment. Needed for
DPT/MiDaS-style outputs.

The depth range used for the valid-pixel mask is also overridable:
`min_depth` (default 1e-3, NYU convention) and `max_depth` (default 10.0
meters, NYU convention). Only pixels with `min_depth <= gt <= max_depth`
contribute to the metrics.
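
A minimal sketch of disparity inversion, the valid-pixel mask, and the three metrics, using their standard definitions from the depth-estimation literature (delta-1 uses the customary 1.25 threshold). It assumes strictly positive predictions, skips the alignment step (equivalent to `align=none`), and the function name is hypothetical:

```python
import math


def depth_metrics(
    pred: list[float],
    gt: list[float],
    depth_kind: str = "depth",
    min_depth: float = 1e-3,
    max_depth: float = 10.0,
) -> dict[str, float]:
    """Compute AbsRel, RMSE, and delta-1 over pixels with valid ground truth."""
    if depth_kind == "disparity":
        # Inverse depth -> depth; assumes no zero-valued predictions.
        pred = [1.0 / p for p in pred]

    # Keep only pixels whose ground truth lies inside the depth range.
    valid = [(p, g) for p, g in zip(pred, gt) if min_depth <= g <= max_depth]
    if not valid:
        raise ValueError("no valid pixels inside the depth range")

    n = len(valid)
    abs_rel = sum(abs(p - g) / g for p, g in valid) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in valid) / n)
    # Fraction of pixels whose prediction is within a factor of 1.25 of gt.
    delta1 = sum(1 for p, g in valid if max(p / g, g / p) < 1.25) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}
```

Lower is better for AbsRel and RMSE, and higher is better for delta-1.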

### Default dataset and testset

Default dataset is `sayakpaul/nyu_depth_v2`. All 11 depth-estimation
entries from `models_all.json` are added to `models_with_acc.json`, with
per-model overrides only where the defaults don't match the model
family:

- `Intel/zoedepth-nyu-kitti` and `apple/DepthPro-hf` set `align=none`
(metric-depth).
- `Intel/dpt-hybrid-midas` and `Intel/dpt-large` set
`depth_kind=disparity`.
- The remaining 7 entries (Depth-Anything family, Marigold, etc.) rely
on the defaults (`align=affine`, `depth_kind=depth`).

### Tests

Unit tests cover the new evaluator and the metric, including the
affine-fit path and the disparity inversion path. The slow/network
integration test runs the full pipeline end-to-end on Depth-Anything V2,
ZoeDepth, and DPT.
DingmaomaoBJTU and others added 30 commits June 9, 2026 19:13
- JSON key 'avg' -> 'mean' (matches actual output)
- Add missing JSON fields: task, precision, timestamp, std, warmup_mean, batches_per_sec
- Fix terminal label 'Precision' -> 'Model Precision'
- Add missing 'Task:' line in terminal example
- Remove false claim about --module using ONNX hierarchy tags
  (it uses torchinfo to discover PyTorch submodules, not ONNX metadata)
- Remove 'per-operator timings' from intro (op-tracing not ready)
- Add model_info block to JSON example (always emitted)
- Soften --monitor 'no effect' to acknowledge small system overhead
- Change 'not executing' to 'strong signal to investigate'
- Add 'monitor' field to NPU JSON example
- Fix 'on-chip memory' -> 'dedicated adapter memory'
- Note that JSON always includes device_memory even for CPU (zeroed)
Fix docs for eval, compile and quantize