docs: comprehensive documentation audit and improvements by DingmaomaoBJTU · Pull Request #828 · microsoft/winml-cli

DingmaomaoBJTU · 2026-06-08T06:46:00Z

Summary

Full documentation audit of the docs/ site with factual corrections, new pages, and structural improvements.

New Pages

getting-started/agent-skill.md — Copilot Coding Agent skill integration
tutorials/build-from-onnx.md — Bring Your Own ONNX Model tutorial
samples/clip-composite.md — CLIP composite model sample
reference/python-api.md — Python API reference
reference/output-layout.md — Build output directory structure
reference/supported-models.md — Supported models and EP compatibility table
contributing.md — Simplified, references repo CONTRIBUTING.md
troubleshooting.md — Restructured by component (Compile / Analyze / Build-Cache)

Factual Corrections

Compile validation: corrected from "random inputs + numerical comparison" to all-ones dummy inputs + NaN/Inf check
Removed --debug from global flags (hidden flag)
Fixed -p/--precision scope (only config and quantize)
Added ONNX file input entry to pipeline diagram
Fixed EP table: added CUDA/NvTensorRTRTX/MIGraphX, corrected QNN note to "bundled in ORT"
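
The corrected compile-validation description (all-ones dummy inputs followed by a NaN/Inf check, with no numerical comparison against a reference) can be sketched roughly as below. This is a stdlib-only illustration under stated assumptions, not the winml-cli implementation: `make_all_ones_inputs`, `outputs_are_finite`, `validate`, and the flattened-list tensor representation are all hypothetical names chosen for the example.

```python
import math
from typing import Callable, Dict, List, Sequence

Tensor = List[float]  # flattened tensor; an illustrative stand-in for real arrays


def make_all_ones_inputs(shapes: Dict[str, Sequence[int]]) -> Dict[str, Tensor]:
    """Build a flattened all-ones tensor for every named input."""
    return {name: [1.0] * math.prod(shape) for name, shape in shapes.items()}


def outputs_are_finite(outputs: Dict[str, Tensor]) -> bool:
    """Return True only if no output value is NaN or +/-Inf."""
    return all(math.isfinite(v) for values in outputs.values() for v in values)


def validate(
    run: Callable[[Dict[str, Tensor]], Dict[str, Tensor]],
    shapes: Dict[str, Sequence[int]],
) -> bool:
    """Run the model once on all-ones inputs and check the outputs are finite."""
    return outputs_are_finite(run(make_all_ones_inputs(shapes)))
```

Here `run` stands in for whatever executes the compiled model; the check only guards against non-finite outputs and says nothing about accuracy.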

Structural Improvements

Compiler backends: all compile pages now distinguish --compiler ort (default) vs --compiler qairt
auto field documentation in build config schema
CI/CD reproducibility tips for winml_build_config.json
EP alias column and auto/all special values
Quickstart reordered: inspect → export
How-it-works: replaced mermaid with SVG, added Analyze section
Mike version control plugin + CI workflow for multi-version docs
Nav reorder: Datatype and Quantization before EP and Device

Preview

Build with uv run mkdocs serve to preview locally.

Adds a complete MkDocs Material documentation site for the winml-cli project, served from /docs and built locally and via GitHub Actions (manual dispatch).

Site infrastructure:
- mkdocs.yml with Material theme, mermaid superfences, tabbed code, light/dark palette toggle
- pyproject.toml dev deps: mkdocs-material, mkdocs-jupyter, pymdown-extensions
- .github/workflows/docs.yml (workflow_dispatch only)
- .gitignore exception for docs/superpowers/specs/

User-facing chapters:

Home: tagline + Goals/Promises bullets sourced from the MVP transcript; describes the toolkit's three workflows (primitives, pipeline, one-command) plus the EP × Device coverage promise

Getting Started (3 pages):
- Installation: Win 11 24H2 + Copilot+PC + Python 3.10 + uv + git prereqs table; 'No NPU?' callout pointing at --device auto with the winml eval caveat
- Quickstart: 5-minute export + inspect with 'winml sys --list-device --list-ep' verify step
- End-to-End Tour: universal --device auto walkthrough that works on Copilot+ PC NPU, DirectML GPU, or CPU; tabbed example outputs for sys and perf so each reader sees their own machine

Concepts (12 pages in two sub-groups):
- Fundamentals (5): How winml-cli works, Graph and IR, Weight and Activation, EP and Device (with the full 7-EP × Device matrix), Datatype and Quantization (8-precision family from _KNOWN_PRECISIONS with w4a16 marked 'Planned — not yet supported')
- WinML CLI (7 workflow-concept pages): Primitives and pipeline, Load and export, Analyze and optimize, Compile and EPContext, Perf and monitoring, Eval and datasets, Config and build (with the full WinMLBuildConfig schema inline)

Commands (13 pages):
- Overview with the four user-intent groups (Discover / Configure / Build / Measure)
- Per-command reference for: sys, inspect, hub, analyze, config, optimize, export, quantize, compile, build, perf, eval

Samples (3 pages):
- ConvNeXt: Primitives Walkthrough (CPU/GPU/NPU device comparison)
- BERT: Config + Build + Perf (workflow demonstration)
- Qwen3: Composite Models (placeholder for the in-progress feature)

Tutorials (2 pages):
- Overview
- ConvNeXt on NPU: 2200-word linear walkthrough with both QNN and OpenVINO compile paths shown via tabbed blocks, plus the 'winml build' one-shot variant

P2 stubs preserved in nav: Reference, Troubleshooting, Contributing

Source-grounding:
- Every flag mentioned in user-facing docs is verified against src/winml/modelkit/
- Non-functional flags (--torch-module, --dynamo on export; --no-quant on compile) are explicitly marked
- All URLs target the canonical microsoft/winml-cli destination
- mkdocs build --strict passes with zero warnings

Internal artifacts kept under docs/superpowers/ for reference:
- Spec and plan files for the v1 and v2 design iterations
- 2026-05-26-v3-known-issues.md: fact-checked review findings

Existing internal docs (docs/design/, docs/naming-convention.md, docs/pytest-best-practices.md) are unchanged and excluded from the user-facing nav via exclude_docs in mkdocs.yml.

…he site Adds a contributor-facing README at docs/README.md covering: - uv-based dev setup - mkdocs serve / build --strict workflow - gh-deploy publish (local one-shot) - .github/workflows/docs.yml CI workflow (currently workflow_dispatch only) - Authoring conventions (winml-cli name, flag verification, admonitions, tabbed code blocks) - Excluded paths reference Updates mkdocs.yml exclude_docs to include /README.md so the new file doesn't collide with docs/index.md as the chapter index.

…source Six parallel review agents fact-checked all 34 user-facing doc files against microsoft/winml-cli @ 5e25579. Output: one issue file per source doc at docs/superpowers/2026-05-27-doc-issues/. A validator agent then cross-checked every Critical and Important claim and produced the consolidated, false-positive-filtered list at docs/superpowers/2026-05-27-validated-issues.md. Summary: 25 Critical + 22 Important kept; 6 rejected as false positives. Major theme: docs were authored against feat/mvp source where some symbols and defaults differ from main (e.g., _KNOWN_PRECISIONS in _options.py vs _NAMED_PRECISIONS in precision.py; winml hub vs winml catalog; many flag defaults flipped to 'auto'; DML/CPU no longer produce _ctx.onnx artifacts). Next step: per-file fix agents will apply the validated list.

…eview 5 parallel fix agents applied the validated-issues list. Net: 25 Critical + 22 Important defects resolved across 20 doc files + mkdocs.yml. Major fixes by area: Concepts (4 pages): - quantization.md: NPU auto-precision corrected to w8a16 (was int8); w4a16 description corrected (rejected at validation, not 'recognized but raises at quantization'); _KNOWN_PRECISIONS/_options.py references replaced with the actual _NAMED_PRECISIONS/precision.py - compile-and-epcontext.md: removed non-existent --no-quant flag mention - config-and-build.md: JSON 'compile' section flattened to use execution_provider (not nested ep_config.provider); table expanded to the actual 7 sub-configs (added eval, auto) - perf-and-monitoring.md: --device documented as accepting auto; output path corrected to ~/.cache/winml/perf/<slug>/<timestamp>.json; --monitor not NPU-specific; --op-tracing marked hidden Commands (11 pages): - overview.md: winml hub renamed to winml catalog throughout; _options.py reference replaced with cli.py - hub.md: H1 and all invocations changed to 'winml catalog'; removed non-existent --model/-m flag; rewrote 'How it works' (no per-EP latency / accuracy-verdict columns exist); added --ep/--device filter flags - build.md: --config marked optional (was required); --random-init and --qnn-sdk-root removed (don't exist); --no-compile/--compile toggle pair documented; --trust-remote-code added; --max-optim-iterations default corrected to None - compile.md: --device default corrected to auto; --no-quant flag removed (doesn't exist on compile) - config.md: --no-compile/--compile framing corrected (compile is EXCLUDED by default; users need --compile to include) - eval.md: --device includes auto (default auto, not cpu); -n short alias removed; class reference replaced with actual evaluate function - analyze.md: --device default corrected to auto; --ep default to auto; --run-unknown-op default to False; -m/-v/-q/-c flags added - optimize.md: --preset/-p flag and entire 
Built-in presets table removed (flag doesn't exist); --verbose added; 'Configuration precedence' reduced from 4 levels to 3 - inspect.md: --list-tasks, --model-type, --model-class, --verbose flags added - perf.md: --compare-devices removed (not registered at all); output path corrected; --op-tracing marked hidden - sys.md: --verbose/-v added to flag table Samples / Tutorials / Getting Started (5 pages): - installation.md: Python 3.10 corrected to 3.11; 'No NPU?' callout no longer claims winml eval rejects auto (it accepts auto on main) - end-to-end.md: dropped incorrect _ctx.onnx CPU/DML artifacts; QNNExecutionProvider mapped to NPU/GPU (not just NPU) - convnext-primitives.md: CPU/GPU compile clarified (no _ctx.onnx produced; uses convnext_int8.onnx directly); winml eval auto reverted - bert-config-build.md: build final artifact corrected to model.onnx (was bert-base-uncased_ctx.onnx) - npu-convnext.md: Python 3.10 -> 3.11; OpenVINO artifact filename corrected to use device string (_npu_ctx.onnx not _openvino_ctx.onnx); CPU compile tab dropped (CPU doesn't produce _ctx.onnx) mkdocs.yml: nav label 'hub' renamed to 'catalog' to match the actual command name on microsoft/winml-cli main.

…meration) The opening paragraph re-stated the project tagline (already on the home page one click above) and enumerated 4 EPs (QNN, OpenVINO, DML, ONNX Runtime) — which goes stale; the canonical list in concepts/eps-and-devices.md has 7. Removing the paragraph; the page now starts with the Prereqs table. Matches the convention used by quickstart.md and end-to-end.md (neither re-states the tagline).

## Summary - Rewrote `docs/concepts/analyze-and-optimize.md` with source-verified content: SupportLevel classification table, lint vs autoconf outputs, analysis modes, optimizer pipe architecture (4 pipes, 43 capabilities, 5 rewrite groups / 12 rules), and autoconf loop SVG diagram - Updated `docs/commands/analyze.md` with corrected EP aliases, exit-code table, and additional CLI examples - Renamed `hub.md` → `catalog.md` and updated all cross-references (inspect, overview, sys, mkdocs.yml) - Fixed `check-yaml` pre-commit hook to support `!!python/name` tags in mkdocs.yml (`--unsafe`) 🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Zhipeng Wang <zhiwang@microsoft.com> Co-authored-by: Qiong Wu (qiowu) <qiowu@microsoft.com> Co-authored-by: hualxie <hualxie@microsoft.com> Co-authored-by: Charles Zhang <zhangchao@microsoft.com> Co-authored-by: Zhenchao Ni <zhenni@microsoft.com>

only unit test _skip_winml_ep_init

## Summary - Drop the `WindowsAppRuntimeVersion` class, attribute, property, and `windowsAppRuntimeVersion` field in `SysInfo.to_dict()` from `src/winml/modelkit/sysinfo/sysinfo.py`. - Remove the now-unused `import re`. Nothing else in the codebase referenced these symbols. Integration `runtime_checker` fixtures still contain the field inside their stored `sys_info` blob, but the test helper ignores `sys_info` during comparison, and the field will disappear naturally next time those fixtures are regenerated.

…763) ## Summary - **VitisAI EP ordering**: Move `VitisAIExecutionProvider` to end of `EP_SUPPORTED_DEVICES` so it appears last in `analyze --ep all` output, since it is not yet fully supported. - **Catalog table width**: Set `expand=False` on both `Table` and `Panel` in `_build_list_renderable` so the catalog table fits its content width instead of stretching to the full terminal width.

…tection (#779) Also update scripts/e2e_eval/run_pytorch_baseline.py to include pytorch model latency --------- Co-authored-by: hualxie <hualxie@microsoft.com>

## Summary - Reorganized README into 5 sections: Title + Description, Features / Scope, Getting Started, Commands, Contributing + License - Updated status badge to `preview`, rewrote description and Features (✅ bullets) - Scope section: added supported EPs, built-in model catalog reference, accepted inputs; removed verbose LLM/not-supported block - Getting Started: consolidated Prerequisites + Installation + Quick Start; added Config-Build Pipeline and Step-by-step through primitive commands walkthroughs - Commands: BYOM workflow with pipeline diagram, command table + collapsible details, comparison table (Config-Driven first) - Reference tables at end: Supported Hardware, Supported Tasks, Supported Model Types, Built-in Models --------- Co-authored-by: Qiong Wu (qiowu) <qiowu@microsoft.com> Co-authored-by: Zhipeng Wang <zhiwang@microsoft.com>

## Summary - Removed the duplicated `WinML CLI (Python wheel) | [Releases]` row in the Prerequisites table. - Updated the install step from `uv pip install winml_cli-<version>-py3-none-any.whl` to `pip install winml-cli`. - Updated the Prerequisites entry to point at PyPI instead of GitHub Releases, keeping the table and install instructions consistent.

## Summary - Adds `resolve_check_device_ep` helper that validates a (device, EP) combination without requiring the device/EP to actually exist on the system. Closes #765. - `commands/config.py` and `config/build.py` now use `resolve_check_device_ep` instead of `resolve_device` so `winml config` no longer hard-fails on hosts where the requested EP isn't installed. - When `device=auto` or `ep=None`, the helper delegates to the existing `resolve_device` + `resolve_eps` flow (system-aware behavior preserved). When both `device` and `ep` are explicit, it only validates against the static `EP_SUPPORTED_DEVICES` mapping. - CLI cleanup: `-m/--model`, `-c/--config`, `--device` for the config command now use the shared `cli_utils.*_option` decorators. ## Tests - New `TestResolveCheckDeviceEp` class in `tests/unit/sysinfo/test_device.py` covering both code paths (delegation and static-only) plus error cases (unknown EP, unsupported device, case-insensitivity). - Existing config-test mocks updated from `resolve_device` to `resolve_check_device_ep` (`tests/unit/config/conftest.py`, `tests/unit/config/test_build.py`, `tests/unit/config/test_build_onnx.py`, `tests/unit/commands/test_config_cli.py`) so the lazy import in `config/build.py` is intercepted. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: hualxie <hualxie@microsoft.com>
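
The two code paths described above (delegate to the system-aware resolver when `device=auto` or `ep=None`; otherwise validate only against the static mapping) might look roughly like the sketch below. It is an assumption-laden illustration: the table contents, the `system_resolver` callback parameter, the return type, and the error messages are hypothetical and do not come from the winml-cli source.

```python
from typing import Callable, Dict, Optional, Set, Tuple

# Hypothetical static table for illustration; the real EP_SUPPORTED_DEVICES differs.
EP_SUPPORTED_DEVICES: Dict[str, Set[str]] = {
    "QNNExecutionProvider": {"npu", "gpu"},
    "CPUExecutionProvider": {"cpu"},
}


def resolve_check_device_ep(
    device: str,
    ep: Optional[str],
    system_resolver: Callable[[str, Optional[str]], Tuple[str, str]],
) -> Tuple[str, str]:
    """Validate a (device, EP) pair without requiring the hardware to exist."""
    # Path 1: anything left to auto-detection keeps the system-aware behavior.
    if device.lower() == "auto" or ep is None:
        return system_resolver(device, ep)

    # Path 2: both values are explicit, so only the static table is consulted.
    canonical = {name.lower(): name for name in EP_SUPPORTED_DEVICES}
    ep_name = canonical.get(ep.lower())  # case-insensitive EP lookup
    if ep_name is None:
        raise ValueError(f"Unknown EP: {ep!r}")
    dev = device.lower()
    if dev not in EP_SUPPORTED_DEVICES[ep_name]:
        raise ValueError(f"{ep_name} does not support device {device!r}")
    return dev, ep_name
```

Passing the system resolver in as a callback keeps the sketch self-contained; it also mirrors why the static path can run on a host where the requested EP is not installed.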

Co-authored-by: hualxie <hualxie@microsoft.com>

…gs) (#785) Adds curated recipe configs for the 12 builtin models — those that pass fp16 eval on all 9 (EP, device) buckets.

## Summary Fixes `scripts/e2e_eval/run_eval.py` crashing on VitisAI EP (AMD Ryzen AI NPU) and a latent bug in `winml build` that prevented the script's `--no-quant` workaround from actually taking effect. The crash: VitisAI ships its own internal quantizer and runs it at session-create time. Layering winml's generic QDQ quantization pass on top produces a model VitisAI cannot consume, which manifests as `DpuKernelRunner.cpp:1920 DPU timeout` during `winml perf`. The fix is to tell winml to skip its own quantization when the selected EP quantizes natively. ## Changes ### `src/winml/modelkit/commands/build.py` — root-cause fix (1 line) When `--device` was passed to `winml build`, the internal `_patch_device` helper unconditionally re-populated `cfg.quant` with the device's default quantization config, silently undoing any prior `--no-quant`. The condition now respects `no_quant`: ```python if no_quant or resolved_quant is None: cfg.quant = None ``` Without this, `winml build … --device npu --no-quant` still produced a `_quantized.onnx` artifact. ### `scripts/e2e_eval/run_eval.py` — script wiring - New canonical-name set `_NATIVE_QUANT_EPS = {"VitisAIExecutionProvider"}` plus a helper `_ep_quantizes_natively(ep)` that funnels both canonical names and user aliases (e.g. `vitisai`) through `winml.modelkit.utils.constants.normalize_ep_name`. No hardcoded aliases. - `_resolve_precision(...)` gained an `ep` parameter; for native-quant EPs it returns `None` so no precision flag is sent. - `_run_build` now passes `--no-quant` to **both** `winml config` (so the persisted `build_config.json` has `quant: null` up-front) and `winml build` (defense in depth) when the EP quantizes natively. - Call sites in `run_model` and `main` updated to thread `ep` through `_resolve_precision`. ## Why the earlier commits in this branch weren't enough The first attempt (`fix(run_eval): skip quantize when VitisAI EP is selected`) wired `--no-quant` only into `winml build`. 
That didn't take effect because of the `_patch_device` bug above. The second attempt (`fix(vitisai): resolve auto-precision to w8a8 for VitisAI NPU`) tried to switch precision instead of skipping — also wrong, since VitisAI wants an fp32 input and quantizes it itself. The final state keeps the script clean (`--no-quant`, no precision override) and fixes the actual `winml build` bug. ## Verification Manual end-to-end on AMD Ryzen AI (VitisAI NPU), with a clean `~/.cache/winml/artifacts/...` and output dir: ```pwsh uv run --no-sync python scripts/e2e_eval/run_eval.py ` --hf-model facebook/convnext-tiny-224 ` --task image-classification ` --device npu --ep vitisai ` --eval-type perf --no-report --verbose --timeout 1800 ` --output-dir e2e-test\vitisai_npu ``` Before: `winml perf` crashed with `DpuKernelRunner.cpp:1920 DPU timeout`. After: - Cached `imgcls_*_winml_build_config.json` has `"quant": null`. - No `_quantized.onnx` artifact produced. - Perf step: **PASS** in ~120 s.
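
To make the root-cause fix concrete, here is a minimal sketch of the corrected condition together with a native-quantization check. Only the `if no_quant or resolved_quant is None` condition and the `_NATIVE_QUANT_EPS` set are taken from the description above; the `BuildConfig` dataclass, the `else` branch, and the `normalize_ep_name` stand-in (including its alias table) are hypothetical and exist only so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set

_NATIVE_QUANT_EPS: Set[str] = {"VitisAIExecutionProvider"}

# Hypothetical stand-in for winml.modelkit.utils.constants.normalize_ep_name.
_SKETCH_ALIASES: Dict[str, str] = {"vitisai": "VitisAIExecutionProvider"}


def normalize_ep_name(ep: str) -> str:
    return _SKETCH_ALIASES.get(ep.lower(), ep)


def ep_quantizes_natively(ep: Optional[str]) -> bool:
    """True when the EP runs its own quantizer, so winml should skip its pass."""
    return ep is not None and normalize_ep_name(ep) in _NATIVE_QUANT_EPS


@dataclass
class BuildConfig:  # illustrative; not the real config class
    quant: Optional[dict] = None


def patch_device(cfg: BuildConfig, resolved_quant: Optional[dict], no_quant: bool) -> None:
    """Apply the device's default quant config unless --no-quant was given."""
    if no_quant or resolved_quant is None:
        cfg.quant = None  # the fix: --no-quant is no longer silently undone
    else:
        cfg.quant = resolved_quant  # assumed behavior of the non-skipped path
```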

…771)

## Summary

Closes #546. `winml inspect --task bogus-task` was leaking optimum's internal `TasksManager` class name and pointing users to optimum docs:

> Error: Inspection error: Task 'bogus-task' not supported by TasksManager. Check optimum documentation for supported tasks.

Now the value is validated at Click parse time against the hand-coded `KNOWN_TASKS` set, before any heavy imports:

```
$ winml inspect -m microsoft/resnet-50 --task bogus-task
Usage: winml inspect [OPTIONS]
Try 'winml inspect --help' for help.

Error: Invalid task 'bogus-task'. Valid: audio-classification, audio-frame-classification, audio-xvector, automatic-speech-recognition, depth-estimation, ... (35 total). See 'winml inspect --list-tasks' for the full list.
```

- Exit code 2 (Click UsageError)
- No third-party class names; no optimum-docs pointer
- Callback imports only `..loader.task.KNOWN_TASKS`, which avoids the ~10s optimum/transformers cold start, so the fail-fast stays fast
- `--list-tasks` and valid `--task` paths unchanged

Co-authored-by: Ziyuan Guo (WE TEAM) <ziyuanguo@microsoft.com>
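
The fail-fast validation described above could be approximated as follows. This sketch deliberately uses only the standard library and raises `ValueError`; in the actual change the check runs as a Click callback and surfaces as a Click usage error. The function name `validate_task`, the `preview` parameter, and the three-task set used in the usage note are hypothetical.

```python
from typing import Iterable


def validate_task(value: str, known_tasks: Iterable[str], preview: int = 5) -> str:
    """Return the task unchanged if known; otherwise raise with a short hint."""
    tasks = sorted(known_tasks)
    if value in tasks:
        return value
    shown = ", ".join(tasks[:preview])
    # Only advertise a total when the preview list was truncated.
    suffix = f", ... ({len(tasks)} total)" if len(tasks) > preview else ""
    raise ValueError(
        f"Invalid task {value!r}. Valid: {shown}{suffix}. "
        "See 'winml inspect --list-tasks' for the full list."
    )
```

Because the function touches nothing but a plain set of strings, it can run before any heavy third-party import, which is the property the change relies on.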

…#772)

## Summary

Fixes #541. `winml catalog` was the only command where `-t` did NOT mean `--task`:

| Command | `-t` means |
|-----------|------------------|
| `inspect` | `--task` |
| `export` | `--task` |
| `config` | `--task` |
| `catalog` | `--model-type` (inconsistent) |

A user who has memorized `-t` to mean `--task` in 3 commands would type `-t image-classification` against `winml catalog` and silently get `--model-type=image-classification` (no such model type) instead.

## Change

In `src/winml/modelkit/commands/catalog.py`:

- Dropped the `-t` short from `--model-type` (no short alias now).
- Moved `-t` to `--task` (replacing the previous `-k`).

`--model-type` is still fully supported via its long form.

Adds a regression guard test (`test_model_type_has_no_short_flag`) that checks both the `--help` output AND that passing a model_type via `-t` is interpreted as a task. All 115 catalog tests pass.

Co-authored-by: Ziyuan Guo (WE TEAM) <ziyuanguo@microsoft.com>

**Skips compilation-related cases** Some models fail to compile with the VitisAI Execution Provider. The failure is an "Access Violation" error that crashes the Python process, which appears to be an EP-side problem. To unblock our e2e tests, I have skipped these cases for VitisAI.

**Skips NPU usage assertion for small models** A small mock model can run so fast that the measured NPU usage is zero, yet our assertion logic still expects some NPU usage, which makes the e2e test unstable. Since the real-model e2e test cases already carry this assertion, I skip it for small models only.

**Skips eval metric value range assertion** The eval e2e test uses only 10 samples, because the goal is to confirm that the eval pipeline works rather than to truly evaluate a model. The assertion logic includes a metric range, but that range was calculated on a QNN device and may not hold for other devices; reusing it could make the e2e test unstable. Therefore, I only assert the metric range for QNN. For other devices, I just assert that the metric value is available.
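
The relaxed metric assertion can be illustrated with a small helper. This is a sketch only: `check_eval_metric`, its parameters, and the `"qnn"` string comparison are hypothetical, and "available" is interpreted here as "not `None`", which is an assumption about the real test.

```python
from typing import Optional, Tuple


def check_eval_metric(value: Optional[float], ep: str, qnn_range: Tuple[float, float]) -> None:
    """Assert the metric exists; enforce the calibrated range only on QNN."""
    assert value is not None, "metric value is missing"
    if ep.lower() == "qnn":
        low, high = qnn_range
        # The range was calibrated on QNN, so it is only meaningful there.
        assert low <= value <= high, f"metric {value} outside [{low}, {high}]"
```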

uv run ~\ModelKit\examples\microsoft-swin-large-patch4-window7-224\example.py --onnx ~\.cache\winml\artifacts\microsoft_swin-large-patch4-window7-224\imgcls_ec485f4653d962b9_quantized.onnx True label: house finch, linnet, Carpodacus mexicanus (synset=n01532829, id=12) Top 5 predictions: 1. house finch, linnet, Carpodacus mexicanus (0.9127) 2. brambling, Fringilla montifringilla (0.0122) 3. goldfinch, Carduelis carduelis (0.0028) 4. chickadee (0.0013) 5. junco, snowbird (0.0013) Verdict (top-1): PASS Annotated image written to prediction.png --------- Co-authored-by: hualxie <hualxie@microsoft.com>

…ng (#790)

## Summary

timm checkpoints load through transformers' generic `TimmWrapper` (`model_type="timm_wrapper"`) and previously failed in **every** `winml` command with *"Cannot detect task: config has no 'architectures' field"*. Two gaps:

1. **Task/class detection**: timm repos load as `TimmWrapperConfig` with `architectures=None`, so auto-detection could not resolve a task or class.
2. **OnnxConfig location**: Optimum registers timm's config (`TimmDefaultOnnxConfig`) only under `library_name="timm"`, but every `winml` lookup defaults to `transformers`.

`timm_wrapper` is transformers' generic bridge for the whole timm library, not a model architecture, so it is resolved at the **shared resolution layer**, not as a per-model config. Only the library is recorded; the task is derived from Optimum.

## Changes (no `models/hf/` entry)

- **`loader/task.py`**: `WRAPPED_LIBRARY_MODEL_TYPES` (`model_type -> optimum_library`) + `resolve_optimum_library()`. When a config has no `architectures`, `_detect_task_and_class_from_config` derives the task from Optimum's task list for the library (`get_supported_tasks("timm_wrapper", "timm")` -> `["image-classification"]`) and the class from `get_model_class_for_task` (generic `AutoModelForImageClassification`, which transformers dispatches to `TimmWrapper` at load). The task is not hardcoded; the branch imports `optimum.exporters.onnx.model_configs` first to populate Optimum's registry (scoped so normal model loading never pays for it).
- **`export/io.py`**: `_get_onnx_config` routes the library via `resolve_optimum_library`, so `timm_wrapper` resolves Optimum's `TimmDefaultOnnxConfig` from every call site (config/build/export/inspect) with no `--library` flag.
- **`commands/inspect.py`** + **`inspect/resolver.py`**: route both the CLI inspect path and the public `inspect_model` path the same way: library routing for the OnnxConfig lookup, plus wrapped-library task detection so the task is not mislabeled.
- Tests: `resolve_optimum_library` + wrapped-library architectures fallback with task derivation (loader); timm library routing for `resolve_io_specs` / `_get_onnx_config` (export); public inspect path `detect_task` / `resolve_exporter` for timm (inspect).

## Validation

**Functional (end-to-end)** on a timm image-classification model:

| Command | Before | After |
|---|---|---|
| `winml config` | exit 2: *no 'architectures' field* | task=image-classification, 1 input |
| `winml export` | exit 2: same | `model.onnx` (pixel_values to logits) |
| `winml inspect` | exit 1: same | `AutoModelForImageClassification` + `TimmDefaultOnnxConfig`, full I/O table |

`config` -> `export` -> `optimize` -> `model.onnx` validated end-to-end for multiple timm CNN classifiers. Also resolves on a timm ViT backbone (`num_labels=0`) -> task=image-classification, matching Optimum's own `infer_task_from_model`, so it generalizes across timm architectures (CNN + ViT).

**No impact on existing models**: scanned all 439 entries / 401 unique models in `scripts/e2e_eval/testsets/models_all.json`: **0** are `timm_wrapper` (by JSON metadata and by loaded config; 330 loadable). Since `timm_wrapper` is the only trigger of the new branch, no existing model changes behavior. (71 fail to load a config: custom/GGUF/tabular types that fail at `AutoConfig` regardless; 7 have empty `architectures` but are not timm, a pre-existing "Cannot detect task", identical before and after the PR.)

**No overhead for normal (non-timm) models**: `winml config` on a standard non-timm model, this branch vs base: min ~12.6s vs ~12.5s (within run-to-run noise). Non-timm configs have `architectures`, so they skip the new branch; the only added cost is one dict lookup.

**Unit tests**: `tests/unit/loader` + `tests/unit/export` + `tests/unit/inspect`: green.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Yi Ren <reny@microsoft.com>
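
The library-routing idea (a `model_type -> optimum_library` table consulted before every Optimum lookup, defaulting to `transformers`) reduces to a single dictionary lookup. The sketch below shows that shape; the exact signature of `resolve_optimum_library` is not given in the description, so the parameters and default argument here are assumptions.

```python
from typing import Dict, Optional

# model_type -> Optimum library name; timm_wrapper is the only entry described.
WRAPPED_LIBRARY_MODEL_TYPES: Dict[str, str] = {"timm_wrapper": "timm"}


def resolve_optimum_library(model_type: Optional[str], default: str = "transformers") -> str:
    """Pick the Optimum library to query for a given config model_type."""
    if model_type is None:
        return default
    return WRAPPED_LIBRARY_MODEL_TYPES.get(model_type, default)
```

Any model type that is not a wrapper bridge falls through to the default, which matches the "one dict lookup" overhead claim for non-timm models.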

## Fix model-task inconsistency for vision feature-extraction models Fixes #777, #778, #782. ### Principle `winml inspect` is the source of truth for valid `(model_id, task)` pairs. Both `feature-extraction` and `image-feature-extraction` are valid ways to address an image-embedding model like `facebook/dinov2-base`. Downstream commands must accept whichever name `winml inspect` accepts, then use `(model_id, task)` to locate the concrete class to act on. ### Root cause Optimum's `TasksManager.get_exporter_config_constructor` only knows canonical Optimum task names. Several call sites passed the raw user-supplied task straight through, so HF aliases like `image-feature-extraction` were rejected with "Unsupported". The evaluator additionally needs to know which HF pipeline name to dispatch on, which the canonical Optimum task name doesn't carry by itself for bimodal tasks like `feature-extraction`. ### Fix - **Inspect / export / HTP exporter**: normalize via `_map_task_synonym(task)` (in `export/io.py`) before any `TasksManager` lookup because it requires normalized task input. This is a single function reused at each `TasksManager` boundary — no new global table. - **Quantize**: `_resolve_dataset_class(task, io_config)` in `datasets/__init__.py` dispatches to `TextDataset` / `ImageDataset` based on the actual ONNX input names. No `AutoConfig.from_pretrained` round-trip. Bimodal io_configs fall back to `RandomDataset` with a warning. - **Evaluate**: Because HF pipeline and evaluate library have their task name convention, `to_hf_pipeline_task(task, model_id)` in `eval/evaluate.py` translates to the HF pipeline name the underlying `evaluate` library expects. Uses `OnnxConfig.inputs` (no weights loaded) to pick the modality. Bimodal models (e.g. CLIP combined: both `pixel_values` and `input_ids`) keep the task unchanged via a `len(hits) == 1` guard, preserving the explicit user task. 
### Validation `facebook/dinov2-base`: | Command | Before | After | |---|---|---| | `winml inspect -m facebook/dinov2-base --task image-feature-extraction` | "Unsupported" | Resolves via `Dinov2OnnxConfig` | | `winml export -m facebook/dinov2-base -t image-feature-extraction` | KeyError on TasksManager | Valid ONNX with `last_hidden_state` | | `winml eval -m facebook/dinov2-base --task feature-extraction` | `RuntimeError: Failed to create feature-extraction dataset` | kNN metrics on mini-imagenet | | `winml quantize <onnx> --task feature-extraction -m facebook/dinov2-small` | Failure by using TextDataset | Routes to `ImageDataset` | `openai/clip-vit-base-patch32` (bimodal, regression check): - `winml eval -m openai/clip-vit-base-patch32 --task feature-extraction` → stays `feature-extraction` (text STS evaluator); not silently rerouted to image. - `winml eval -m openai/clip-vit-base-patch32` (auto-detect) → resolves to `feature-extraction` (text). ### Tests Unit: - `tests/unit/eval/test_eval.py::TestResolveTask` — auto-detect, explicit task, bimodal guard, HF pipeline translation. - test_random_dataset.py — `TASK_DATASET_MAPPING` covers all registered tasks, including bimodal dict-of-dict. E2E (`-m e2e`, dinov2 chosen because it isn't in `MODEL_BUILD_CONFIGS` and so actually exercises the `TasksManager` path): - `tests/e2e/test_inspect_e2e.py::TestInspectDinoV2` — both `image-feature-extraction` and `feature-extraction` resolve. - `tests/e2e/test_export_e2e.py::TestExportDinoV2::test_image_feature_extraction`. - `tests/e2e/test_eval_e2e.py::TestEvalPerTask::test_image_feature_extraction` parameterized over both task names. - `tests/e2e/test_quantize_e2e.py::test_feature_extraction_with_pixel_values_uses_image_dataset`.
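
The modality dispatch and the `len(hits) == 1` bimodal guard can be sketched as below. This is an illustration under assumptions, not the code from `eval/evaluate.py` or `datasets/__init__.py`: the real `to_hf_pipeline_task` takes `(task, model_id)` and reads `OnnxConfig.inputs`, whereas this version receives the input names directly, and the dataset helper returns class names as strings. Treating "neither input present" as `RandomDataset` is also an assumption.

```python
from typing import Dict, Iterable

# ONNX input name -> HF pipeline task for embedding-style models (illustrative).
_INPUT_TO_PIPELINE: Dict[str, str] = {
    "pixel_values": "image-feature-extraction",
    "input_ids": "feature-extraction",
}
_EMBEDDING_TASKS = {"feature-extraction", "image-feature-extraction"}


def to_hf_pipeline_task(task: str, input_names: Iterable[str]) -> str:
    """Translate an embedding task to the HF pipeline name for its modality."""
    if task not in _EMBEDDING_TASKS:
        return task
    names = set(input_names)
    hits = [pipe for key, pipe in _INPUT_TO_PIPELINE.items() if key in names]
    # Bimodal guard: with both (or neither) modality present, keep the user's task.
    return hits[0] if len(hits) == 1 else task


def resolve_dataset_kind(input_names: Iterable[str]) -> str:
    """Choose a calibration dataset from the actual ONNX input names."""
    names = set(input_names)
    has_image, has_text = "pixel_values" in names, "input_ids" in names
    if has_image and not has_text:
        return "ImageDataset"
    if has_text and not has_image:
        return "TextDataset"
    return "RandomDataset"  # bimodal fallback described above; a warning would go here
```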

Table for stub

| Lib | py.typed | Reality | Override status |
|---|---|---|---|
| torch | yes | Has inline types (v2.11) | Override is a no-op; mypy uses real types |
| torchvision | no | No types, no community stubs | Genuinely needed |
| onnx | yes | Has inline types (v1.18) | Override is a no-op |
| onnxruntime | no | Untyped; no community stubs | Genuinely needed |
| transformers | yes | Inline types but partial/loose | Override is a no-op; types ARE used |
| datasets | no | Untyped | Genuinely needed |
| optimum | no | Untyped | Genuinely needed |
| timm | yes | Has inline types (v1.0.26) | Override is a no-op |
| onnxscript | yes | Has inline types (v0.7) | Override is a no-op |
| snakemd | no | Untyped | Genuinely needed |
| openvino | n/a | Not installed locally | n/a |

plotext added to ignore_missing_imports (no community stubs, untyped library)

---------

Co-authored-by: Hualiang Xie <hualxie@microsoft.com>

…m_task (#801) ## What PR1 of #800. Relocate `map_task_synonym` -> `loader/task.py::to_optimum_task` to establish a single WinML->Optimum task-collapse boundary. ## Changes - `loader/task.py`: add `to_optimum_task` + `TASK_SYNONYM_EXTENSIONS` (moved from `export/io.py`); exported via `loader/__init__.py`. - `export/io.py`: local implementation removed; `map_task_synonym` kept as a backward-compatible alias (`= to_optimum_task`); internal use repointed. - Optimum-boundary call sites repointed to `to_optimum_task`: `commands/inspect.py`, `export/htp/exporter.py`, `inspect/resolver.py`. - `commands/build.py`: `TASK_SYNONYM_EXTENSIONS` now imported from `loader`. - New `tests/unit/loader/test_task_boundary.py` pins the collapse contract. ## Behavior No behavior change. `map_task_synonym` stays importable from `export.io`; the collapse semantics (`image-feature-extraction` -> `feature-extraction`, WinML extensions preserved) are byte-identical. Existing synonym and #777/#782 regression tests stay green. Sets up PR2 (#800), which adds the modality-aware `detect_task` and relies on this single collapse boundary.
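
As a rough picture of the single collapse boundary, the sketch below maps an HF alias onto its canonical Optimum task, passes WinML extension tasks through unchanged, and keeps the old name as an alias. Only the `image-feature-extraction` -> `feature-extraction` collapse and the names `to_optimum_task`, `TASK_SYNONYM_EXTENSIONS`, and `map_task_synonym` come from the description; the `_SKETCH_SYNONYMS` table, the type of `TASK_SYNONYM_EXTENSIONS`, and its placeholder entry are hypothetical.

```python
from typing import Dict, FrozenSet

# Hypothetical synonym table; the real collapse logic may delegate elsewhere.
_SKETCH_SYNONYMS: Dict[str, str] = {"image-feature-extraction": "feature-extraction"}

# Placeholder entry standing in for WinML-specific task names.
TASK_SYNONYM_EXTENSIONS: FrozenSet[str] = frozenset({"hypothetical-winml-task"})


def to_optimum_task(task: str) -> str:
    """Collapse a WinML/HF task name to the name Optimum's lookups expect."""
    if task in TASK_SYNONYM_EXTENSIONS:
        return task  # WinML extensions are preserved as-is
    return _SKETCH_SYNONYMS.get(task, task)


# Backward-compatible alias, mirroring what stays importable from export.io.
map_task_synonym = to_optimum_task
```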

…#793)

Fixes #566.

## Problem

- Top-level group declared ``-v/--verbose`` (count) and ``-q/--quiet``, but 12 of 13 subcommands redeclared ``--verbose`` as ``is_flag=True``, so ``winml export -vv …`` errored with ``extra argument``.
- No subcommand exposed ``-q/--quiet``, so ``winml export --quiet …`` failed with ``no such option``.
- Each command wired logging differently; DEBUG/INFO lines interleaved with Rich tables on stdout, breaking ``cmd > out 2> log.txt``.

## Changes

- ``utils/cli.py``: ``verbosity_options`` decorator (``-v`` count, ``-q`` flag) + new ``resolve_verbosity(ctx, verbose, quiet)`` helper that merges top-level and subcommand-level values (max of verbose, OR of quiet). Honors the legacy ``ctx.obj["debug"]`` so tests that bypass ``main()`` still raise the verbosity floor.
- ``utils/logging.py``: format ``[%(asctime)s %(levelname)-7s %(name)s] %(message)s`` with ``datefmt=%H:%M:%S``, ``stream=sys.stderr``. Idempotent: re-creates the WinML handler bound to the current ``sys.stderr`` on each call so Click ``CliRunner`` stream redirection keeps working, and leaves non-WinML handlers (notably pytest ``caplog``) intact.
- ``cli.py``: top-level group uses ``@verbosity_options`` (replaces inline declarations); ``--debug`` alias preserved.
- 12 subcommands (``build``, ``compile``, ``config``, ``eval``, ``export``, ``inspect``, ``optimize``, ``perf``, ``quantize``, ``sys``, plus ``analyze`` cleanup): replace ad-hoc ``--verbose`` (``is_flag=True``) with ``@cli_utils.verbosity_options``, add ``quiet: bool`` param, call ``configure_logging(verbosity=verbose, quiet=quiet)`` after ``resolve_verbosity``. Removes the legacy ``if ctx.obj.get("debug"): verbose = True`` blocks (folded into the helper).
- ``serve/app.py``: pre-existing latent bug. Module-level ``logging.getLogger("winml.modelkit").setLevel(INFO)`` ran at import, which muted DEBUG capture in unrelated tests that got collected alongside the serve test module. Split into ``_attach_log_handler()`` (idempotent, called from ``_register_routes``) and a paired ``_ensure_log_capture_level`` / ``_restore_log_capture_level`` invoked from the production lifespan. Tests that build the app via ``_register_routes`` + a mock lifespan no longer leak global logger state.

## Behavior

Both flag positions work; subcommand value wins when both are passed (max/OR merge):

```text
winml -v export -m … -o …                                # top-level: works
winml export -vv -m … -o …                               # subcommand: now works (was: extra argument)
winml --quiet export -m … -o …                           # top-level: works
winml export --quiet -m … -o …                           # subcommand: now works (was: no such option)
winml inspect -vv -m … --format json > out 2> log.txt    # clean stdout/stderr split
```

## Tests

- ``tests/cli/`` (23): pass
- ``tests/unit/`` (5061 collected): **5058 pass**, 3 fail. All 3 are pre-existing on main and unrelated to this change:
  - ``test_winml_session.py::TestOpenVINODeviceRouting::test_compile_openvino_cpu_device_succeeds``
  - ``test_winml_session.py::TestOpenVINODeviceRouting::test_compile_openvino_cpu_provider_not_npu`` (both env: no OpenVINO EP installed)
  - ``test_config_utils.py::TestMergeConfigNoneHandling::test_none_to_value_transition`` (test isolation, passes alone)

---------

Co-authored-by: hualxie <hualxie@microsoft.com>
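The merge rule and logging setup described in this commit can be sketched as follows. This is a minimal illustration, not the code in `utils/cli.py` or `utils/logging.py`: the names `merge_verbosity`, `level_for`, `configure`, and the logger name `winml_sketch` are hypothetical, and the mapping from verbosity count to log level is an assumption. Only the max/OR merge, the format string, the date format, and the stderr stream come from the commit message.

```python
import logging
import sys

# Format and date format as quoted in the commit message.
LOG_FORMAT = "[%(asctime)s %(levelname)-7s %(name)s] %(message)s"
DATE_FORMAT = "%H:%M:%S"


def merge_verbosity(top_verbose, top_quiet, sub_verbose, sub_quiet):
    """Merge top-level and subcommand flags: max of verbose, OR of quiet."""
    return max(top_verbose, sub_verbose), top_quiet or sub_quiet


def level_for(verbose, quiet):
    """Hypothetical mapping from flags to a logging level (an assumption)."""
    if quiet:
        return logging.ERROR
    if verbose >= 2:
        return logging.DEBUG
    if verbose == 1:
        return logging.INFO
    return logging.WARNING


def configure(verbose, quiet, name="winml_sketch"):
    """Idempotent setup: replace only the handler this sketch installed."""
    logger = logging.getLogger(name)
    for handler in list(logger.handlers):
        if getattr(handler, "_sketch_handler", False):
            logger.removeHandler(handler)
    # Bind to the current sys.stderr so stream redirection is picked up.
    handler = logging.StreamHandler(sys.stderr)
    handler._sketch_handler = True
    handler.setFormatter(logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT))
    logger.addHandler(handler)
    logger.setLevel(level_for(verbose, quiet))
    return logger
```

Calling `configure` repeatedly leaves exactly one sketch handler attached, and handlers installed by other code (for example a test harness) are not touched, which mirrors the idempotency described above.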

## Summary

- Replace hardcoded 4-EP list in `analyze_from_proto(ep=None)` with dynamic lookup from `EP_SUPPORTED_DEVICES`, filtered by target device
- Remove `max_length=4` constraint on `AnalysisOutput.results` to support more than 4 EPs per device
- Change uniqueness validator from IHV type to EP type (multiple EPs can share the same IHV, e.g. CUDA and DML both map to MICROSOFT)

**Before:** `analyzer.analyze(ep=None)` always analyzed QNN, OpenVINO, VitisAI, NvTensorRTRTX regardless of device, so NvTensorRTRTX was analyzed on NPU even though it only supports GPU.

**After:** EP list is derived from `EP_SUPPORTED_DEVICES` filtered by the target device, matching the CLI `--ep all` behavior exactly.
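A rough sketch of the device-filtered lookup and the EP-level uniqueness check described above. The table contents below are illustrative placeholders, not the real `EP_SUPPORTED_DEVICES` from winml-cli, and the helper names `eps_for_device` and `check_unique_eps` are hypothetical.

```python
# Illustrative table only; the real EP_SUPPORTED_DEVICES may list different
# EPs and devices. Only "NvTensorRTRTX is GPU-only" is taken from the summary.
EP_SUPPORTED_DEVICES = {
    "QNN": {"NPU"},
    "OpenVINO": {"CPU", "GPU", "NPU"},
    "VitisAI": {"NPU"},
    "NvTensorRTRTX": {"GPU"},
    "DML": {"GPU"},
    "CUDA": {"GPU"},
}


def eps_for_device(device, table=EP_SUPPORTED_DEVICES):
    """Return the EPs whose supported-device set contains `device`."""
    return [ep for ep, devices in table.items() if device in devices]


def check_unique_eps(ep_names):
    """Reject duplicate EP names; EPs sharing an IHV are allowed."""
    seen = set()
    for ep in ep_names:
        if ep in seen:
            raise ValueError(f"duplicate EP in results: {ep}")
        seen.add(ep)
    return list(ep_names)
```

With this placeholder table, `eps_for_device("NPU")` no longer includes `NvTensorRTRTX`, and a GPU result list may hold both `DML` and `CUDA` because uniqueness is checked per EP rather than per IHV.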

Resolves #326.

Adds `WinMLDepthEstimationEvaluator` and `DepthMetric` (Absolute Relative error, RMSE, delta-1) following the NYU/KITTI evaluation protocol. HuggingFace `evaluate` doesn't ship a depth-estimation evaluator, so the metric loop is implemented manually.

### Background

Depth-estimation models fall into a few groups, and the same input image gives wildly different prediction scales depending on which group the model belongs to.

- Metric-depth models (ZoeDepth, DepthPro) predict depth in meters directly.
- Relative-depth models (Depth-Anything, Marigold) predict depth up to an unknown scale and shift.
- Disparity models (DPT, MiDaS) predict `1 / depth` (inverse depth) up to scale and shift.

Comparing predictions against the NYU ground truth therefore requires (1) optionally inverting disparity into depth and (2) aligning the prediction to the ground truth before computing metrics. This is what AbsRel/RMSE/delta-1 benchmarks in the literature do, and what this PR adds as user-selectable options.

### Options

Two `columns_mapping` keys, both overridable via `--column`, and both visible in `winml eval --schema --task depth-estimation`.

`align` controls how each prediction is rescaled against the ground truth depth map before metrics are computed:

- `affine` (default): per-image least-squares fit of `pred_aligned = s * pred + t`, where `s` is a scalar scale and `t` is a scalar shift, solved on the valid pixels (those passing the depth range mask). Suitable for relative-depth and disparity models.
- `median`: scale-only alignment, `pred_aligned = (median(gt) / median(pred)) * pred`. No shift. Cheaper but less accurate when the model has a non-zero offset.
- `none`: use the prediction as-is. Suitable for metric-depth models that already output meters.

`depth_kind` indicates what the model outputs:

- `depth` (default): prediction is interpreted as depth.
- `disparity`: prediction is interpreted as inverse depth, so it is inverted (`pred := 1 / pred`) before alignment. Needed for DPT/MiDaS-style outputs.

The depth range used for the valid-pixel mask is also overridable: `min_depth` (default 1e-3, NYU convention) and `max_depth` (default 10.0 meters, NYU convention). Only pixels with `min_depth <= gt <= max_depth` contribute to the metrics.

### Default dataset and testset

Default dataset is `sayakpaul/nyu_depth_v2`. All 11 depth-estimation entries from `models_all.json` are added to `models_with_acc.json`, with per-model overrides only where the defaults don't match the model family:

- `Intel/zoedepth-nyu-kitti` and `apple/DepthPro-hf` set `align=none` (metric-depth).
- `Intel/dpt-hybrid-midas` and `Intel/dpt-large` set `depth_kind=disparity`.
- The remaining 7 entries (Depth-Anything family, Marigold, etc.) rely on the defaults (`align=affine`, `depth_kind=depth`).

### Tests

Unit tests cover the new evaluator and the metric, including the affine-fit path and the disparity inversion path. The slow/network integration test runs the full pipeline end-to-end on Depth-Anything V2, ZoeDepth, and DPT.
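The masking, inversion, alignment, and metric steps described above can be sketched in plain Python over flat lists of pixel values. This is an illustration under stated assumptions, not the `DepthMetric` implementation: the function names are hypothetical, delta-1 is taken as the usual fraction of pixels with `max(pred/gt, gt/pred) < 1.25`, disparity values are assumed positive, and clamping aligned predictions to `min_depth` is an added guard that the PR text does not mention.

```python
import math
from statistics import mean, median


def align_prediction(pred, gt, mode="affine"):
    """Rescale `pred` against `gt` (both lists of valid-pixel values)."""
    if mode == "none":
        return list(pred)
    if mode == "median":
        scale = median(gt) / median(pred)
        return [scale * p for p in pred]
    if mode == "affine":
        # Closed-form least squares for gt ~ s * pred + t.
        mean_p, mean_g = mean(pred), mean(gt)
        var_p = sum((p - mean_p) ** 2 for p in pred)
        if var_p == 0.0:
            return [mean_g] * len(pred)  # constant prediction: shift only
        cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
        s = cov / var_p
        t = mean_g - s * mean_p
        return [s * p + t for p in pred]
    raise ValueError(f"unknown align mode: {mode!r}")


def depth_metrics(pred, gt, align="affine", depth_kind="depth",
                  min_depth=1e-3, max_depth=10.0):
    """Compute AbsRel, RMSE and delta-1 on pixels inside the depth range."""
    valid = [(p, g) for p, g in zip(pred, gt) if min_depth <= g <= max_depth]
    if not valid:
        raise ValueError("no valid pixels in the depth range")
    p_valid = [p for p, _ in valid]
    g_valid = [g for _, g in valid]
    if depth_kind == "disparity":
        p_valid = [1.0 / p for p in p_valid]  # invert before alignment
    aligned = align_prediction(p_valid, g_valid, align)
    # Assumption: clamp so the ratio in delta-1 never divides by <= 0.
    aligned = [max(a, min_depth) for a in aligned]
    pairs = list(zip(aligned, g_valid))
    abs_rel = mean(abs(a - g) / g for a, g in pairs)
    rmse = math.sqrt(mean((a - g) ** 2 for a, g in pairs))
    delta1 = mean(1.0 if max(a / g, g / a) < 1.25 else 0.0 for a, g in pairs)
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}
```

For a prediction that differs from the ground truth only by a scale and shift, `align="affine"` recovers the ground truth exactly, so AbsRel and RMSE are zero. The same input scored with `align="none"` is penalized for the raw offset, which is why the metric-depth entries override the default.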

…t/winml-cli into docs/draft

…xample

- JSON key 'avg' -> 'mean' (matches actual output)
- Add missing JSON fields: task, precision, timestamp, std, warmup_mean, batches_per_sec
- Fix terminal label 'Precision' -> 'Model Precision'
- Add missing 'Task:' line in terminal example
- Remove false claim about --module using ONNX hierarchy tags (it uses torchinfo to discover PyTorch submodules, not ONNX metadata)
- Remove 'per-operator timings' from intro (op-tracing not ready)

- Add model_info block to JSON example (always emitted)
- Soften --monitor 'no effect' to acknowledge small system overhead
- Change 'not executing' to 'strong signal to investigate'
- Add 'monitor' field to NPU JSON example
- Fix 'on-chip memory' -> 'dedicated adapter memory'
- Note that JSON always includes device_memory even for CPU (zeroed)

Fix docs for eval, compile and quantize

tezheng and others added 30 commits May 27, 2026 00:08

fix integration test(only unit test _skip_winml_ep_init) (#760)

a30d5a9

only unit test _skip_winml_ep_init

example: add readme and example.py for microsoft/table-transformer-de…

5bdb1fb

…tection (#779)

Also update scripts/e2e_eval/run_pytorch_baseline.py to include pytorch model latency

---------

Co-authored-by: hualxie <hualxie@microsoft.com>

chore: enable checking types and fix analyze folder (#768)

9ec0345

Co-authored-by: hualxie <hualxie@microsoft.com>

examples: add 12 builtin model recipes (fp16 + w8a8 + w8a16, 36 confi…

80bc7e0

…gs) (#785) Adds curated recipe configs for the 12 builtin models — those that pass fp16 eval on all 9 (EP, device) buckets.

Validate model task in config. (#723)

7152b82

Fix integration tests. (#773)

2d967c1

DingmaomaoBJTU and others added 30 commits June 9, 2026 19:13

docs: restore --hierarchy/--no-hierarchy pairs per merged PR #844

744720d

adjust config-and-build.md order

32644c2

docs: update quickstart and index landing page

c846a7d

Merge branches 'main' and 'docs/draft' of https://github.com/microsof…

6be9dc1

…t/winml-cli into docs/draft

docs: update supported-models to match actual catalog data

9be4f9f

docs: remove size column from supported-models

2f383b2

docs: add UI quickstart and update index landing page

375d465

docs: remove End-to-End Tour page and update references

cb1215c

docs: add UI Quickstart to nav

e40ad9c

docs: remove ConvNeXt primitives page and fix all references

6c8f2c3

docs: rename ConvNeXt tutorial, remove site logo icon

570f9b4

docs: expand What you learned section in BERT sample

bcdd421

docs: add repo access link to index and tutorials pages

621ff23

docs: rename site to Windows ML CLI, hide logo icon

e4ecb6c

docs: expand hierarchy tagging section in load-and-export

5484ab6

docs: add concrete tag examples, mermaid diagram, and real export data

dfad05f

docs: fix inaccuracies in load-and-export tagging section

6d4f4d4

docs: move Load and export before Primitives in nav

191a8e9

docs: remove unimplemented optimizer scoping claim

cbe63fc

docs: enrich perf-and-monitoring with real output, flag table, JSON e…

facf95e

…xample

docs: fix perf output table rendering

94b6ac0

docs: remove per-operator tracing section (not ready)

29c8da6

Merge remote-tracking branch 'origin/main' into docs/draft

49c8cde

docs: add memory measurement details to perf monitoring section

6090cf8

docs(perf): separate live monitoring and memory metrics sections

e692180

docs(perf): fix hw_monitor JSON to match actual output

ddbdf04

docs(perf): add per-device metrics breakdown (CPU/GPU/NPU)

e43d8c3

Fix docs for eval, compile and quantize (#874)

559cd77

Fix docs for eval, compile and quantize


docs: comprehensive documentation audit and improvements #828

DingmaomaoBJTU wants to merge 145 commits into main from docs/draft

DingmaomaoBJTU commented Jun 8, 2026


14 participants
