Add `--memory` flag to `winml perf` for per-phase memory measurement by DingmaomaoBJTU · Pull Request #861 · microsoft/winml-cli

DingmaomaoBJTU · 2026-06-10T08:13:54Z

Summary

Adds process and device memory measurement to winml perf. Measurement is enabled by default and can be toggled with --memory/--no-memory.

Motivation

Users need to know: "Can this model fit on my device?" This requires two numbers:

Process peak memory — how much physical RAM the whole pipeline consumes
Device memory — how much NPU/GPU dedicated VRAM is allocated

Output

Memory:      425.7 MB (process peak) | 54.2 MB (device)

One line, two numbers: everything needed to decide whether a model can be deployed. The full per-phase breakdown is still available in the JSON output for advanced analysis.

Implementation

session/monitor/memory_tracker.py (new): Pure ctypes K32GetProcessMemoryInfo for process memory + single-shot PDH query for device memory. Zero new dependencies.
commands/perf.py: Integrates MemoryTracker into PerfBenchmark.run() with snapshots at phase boundaries (baseline → load → compile → inference). Extends BenchmarkResult with memory_profile field.
--memory (default: True): Snapshots are taken between phases, not during the iteration loop, so there is zero overhead on latency measurements. Disable with --no-memory.
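For readers unfamiliar with the ctypes approach, here is a minimal sketch of a process memory query through K32GetProcessMemoryInfo. It is not the code in memory_tracker.py: the class name `ProcessMemoryCountersEx`, the function name `query_process_memory`, and the `private_bytes_mb` key are assumptions made for illustration. The structure layout follows the Win32 PROCESS_MEMORY_COUNTERS_EX definition, and the function returns None on non-Windows platforms.

```python
import ctypes
import sys
from typing import Dict, Optional


class ProcessMemoryCountersEx(ctypes.Structure):
    """Mirror of the Win32 PROCESS_MEMORY_COUNTERS_EX structure."""

    _fields_ = [
        ("cb", ctypes.c_uint32),
        ("PageFaultCount", ctypes.c_uint32),
        ("PeakWorkingSetSize", ctypes.c_size_t),
        ("WorkingSetSize", ctypes.c_size_t),
        ("QuotaPeakPagedPoolUsage", ctypes.c_size_t),
        ("QuotaPagedPoolUsage", ctypes.c_size_t),
        ("QuotaPeakNonPagedPoolUsage", ctypes.c_size_t),
        ("QuotaNonPagedPoolUsage", ctypes.c_size_t),
        ("PagefileUsage", ctypes.c_size_t),
        ("PeakPagefileUsage", ctypes.c_size_t),
        ("PrivateUsage", ctypes.c_size_t),
    ]


def query_process_memory() -> Optional[Dict[str, float]]:
    """Return memory counters in MB for the current process, or None if unavailable."""
    if sys.platform != "win32":
        return None  # hypothetical fallback; the PR mentions separate Linux query functions

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetCurrentProcess.restype = ctypes.c_void_p
    kernel32.K32GetProcessMemoryInfo.argtypes = [
        ctypes.c_void_p,
        ctypes.POINTER(ProcessMemoryCountersEx),
        ctypes.c_uint32,
    ]
    kernel32.K32GetProcessMemoryInfo.restype = ctypes.c_int

    counters = ProcessMemoryCountersEx()
    counters.cb = ctypes.sizeof(counters)
    ok = kernel32.K32GetProcessMemoryInfo(
        kernel32.GetCurrentProcess(), ctypes.byref(counters), counters.cb
    )
    if not ok:
        return None

    mb = 1024 * 1024
    return {
        "working_set_mb": counters.WorkingSetSize / mb,
        "peak_working_set_mb": counters.PeakWorkingSetSize / mb,
        "private_bytes_mb": counters.PrivateUsage / mb,  # key name is an assumption
    }
```

Because the call reads counters the OS already maintains, a single query is cheap, which is consistent with taking snapshots only at phase boundaries.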

JSON output

{
  "memory": {
    "baseline": {"working_set_mb": 125.3, ...},
    "post_load": {"working_set_mb": 389.1, ...},
    "post_compile": {"working_set_mb": 405.8, ..., "device_local_mb": 54.2},
    "post_inference": {"working_set_mb": 412.4, ..., "device_local_mb": 54.2},
    "peak_working_set_mb": 425.7,
    "peak_device_local_mb": 54.2,
    "total_delta_working_set_mb": 287.1
  }
}
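As a rough illustration of how the fields relate, the sketch below computes per-phase working-set deltas from a trimmed copy of the sample above (the elided `...` fields are left out). The sample numbers are consistent with `total_delta_working_set_mb` being post-inference minus baseline, but that relation is inferred from the example values, not confirmed by the PR. The helper `phase_deltas` is hypothetical.

```python
import json

# Trimmed copy of the sample report; elided fields are omitted.
report = json.loads("""
{
  "memory": {
    "baseline": {"working_set_mb": 125.3},
    "post_load": {"working_set_mb": 389.1},
    "post_compile": {"working_set_mb": 405.8, "device_local_mb": 54.2},
    "post_inference": {"working_set_mb": 412.4, "device_local_mb": 54.2},
    "peak_working_set_mb": 425.7,
    "peak_device_local_mb": 54.2,
    "total_delta_working_set_mb": 287.1
  }
}
""")

PHASES = ["baseline", "post_load", "post_compile", "post_inference"]


def phase_deltas(memory):
    """Working-set growth (MB) attributed to each phase after the baseline."""
    values = [memory[phase]["working_set_mb"] for phase in PHASES]
    return {
        phase: round(after - before, 1)
        for phase, before, after in zip(PHASES[1:], values, values[1:])
    }


memory = report["memory"]
deltas = phase_deltas(memory)
total = round(
    memory["post_inference"]["working_set_mb"] - memory["baseline"]["working_set_mb"], 1
)
print(deltas)  # per-phase growth in MB
print(total)   # matches total_delta_working_set_mb in the sample
```

In this sample, most of the growth happens during model load, and the peak (425.7 MB) is higher than any phase-boundary snapshot, which suggests a transient allocation between snapshots.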

Testing

15 new unit tests for memory_tracker.py
All 24 existing perf CLI tests pass unchanged

Implement per-phase memory tracking that captures Working Set, Private Bytes, and device (NPU/GPU) memory at each benchmark phase boundary: baseline, post-load, post-compile, and post-inference.

Key design decisions:
- Default enabled (--no-memory to disable) since snapshots are taken between phases and add zero overhead to latency measurements
- Pure ctypes implementation (K32GetProcessMemoryInfo) with no new dependencies
- Device memory via single-shot PDH query reusing existing adapter resolution logic
- Console output shows a table with per-phase deltas and peak summary
- JSON output includes full memory profile under 'memory' key

New files:
- session/monitor/memory_tracker.py: MemoryTracker, MemorySnapshot, MemoryProfile dataclasses and Windows/Linux memory query functions
- tests/unit/session/monitor/test_memory_tracker.py: unit tests
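The phase-boundary design can be sketched as follows. This is an illustration under stated assumptions, not the MemoryTracker from the PR: `Snapshot`, `PhaseTracker`, and the injected `read_working_set_mb` callable are hypothetical names, and the readings are faked with the sample values from the PR description so the example runs on any platform.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Snapshot:
    phase: str
    working_set_mb: float


@dataclass
class PhaseTracker:
    """Records one memory reading per phase boundary (hypothetical sketch)."""

    read_working_set_mb: Callable[[], float]
    snapshots: List[Snapshot] = field(default_factory=list)

    def snapshot(self, phase: str) -> Snapshot:
        snap = Snapshot(phase, self.read_working_set_mb())
        self.snapshots.append(snap)
        return snap

    def deltas(self) -> Dict[str, float]:
        """Growth in MB between each snapshot and the one before it."""
        pairs = zip(self.snapshots, self.snapshots[1:])
        return {
            cur.phase: round(cur.working_set_mb - prev.working_set_mb, 1)
            for prev, cur in pairs
        }


# Fake readings stand in for a real OS query.
readings = iter([125.3, 389.1, 405.8, 412.4])
tracker = PhaseTracker(read_working_set_mb=lambda: next(readings))

tracker.snapshot("baseline")
# ... load the model here ...
tracker.snapshot("post_load")
# ... compile the session here ...
tracker.snapshot("post_compile")
# ... run the timed iteration loop here; no memory queries inside it ...
tracker.snapshot("post_inference")

print(tracker.deltas())
```

Since every `snapshot` call sits outside the timed loop, the memory queries cannot add to the measured per-iteration latency.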

xieofxie · 2026-06-10T08:41:54Z

I think we could just enable it; it doesn't take much time.

xieofxie · 2026-06-10T08:42:05Z

Also consider adding this to --monitor

DingmaomaoBJTU requested a review from a team as a code owner June 10, 2026 08:13

github-advanced-security AI found potential problems Jun 10, 2026

Comment thread tests/unit/session/monitor/test_memory_tracker.py Fixed

DingmaomaoBJTU added 3 commits June 10, 2026 16:23

Simplify memory output to single line with peak process + device memory

e50a39c

Remove unnecessary del statement (CodeQL)

a8bef09

Use post-inference working set instead of process peak for memory display

1b7d016

Shows the actual memory footprint during inference (steady state) rather than the process lifetime peak which may include transient allocations from model loading or compilation.

DingmaomaoBJTU added 2 commits June 10, 2026 18:32

Merge remote-tracking branch 'origin/main' into dingmaomaobjtu/perf-memory-measurement

753bec6

Fix CI: add missing memory_profile mock in TestPerfFormatJson

125c455

The test from PR #855 mocks BenchmarkResult but did not set memory_profile=None, causing MagicMock comparison failure in display_console_report.
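The failure mode described in this commit can be reproduced in isolation. The sketch below is not the real display_console_report: `format_memory_line` and the `working_set_mb` attribute access are hypothetical stand-ins. It only shows why an unset attribute on a MagicMock breaks a comparison, and why setting `memory_profile=None` avoids it.

```python
from unittest.mock import MagicMock


def format_memory_line(result):
    """Hypothetical stand-in for the memory part of a console report."""
    profile = result.memory_profile
    if profile is None:
        return None
    # On a bare MagicMock, this attribute is another MagicMock, and
    # comparing it with a number raises TypeError.
    if profile.working_set_mb > 0:
        return f"Memory: {profile.working_set_mb:.1f} MB"
    return None


# Unset attribute: auto-created MagicMock, so the None check is skipped.
broken = MagicMock()
try:
    format_memory_line(broken)
    raised = False
except TypeError:
    raised = True

# Explicit None: the report skips the memory line entirely.
fixed = MagicMock(memory_profile=None)
print(raised, format_memory_line(fixed))
```

Tests that mock BenchmarkResult therefore need to set any newly added optional field explicitly, or build the mock with a spec that constrains its attributes.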

DingmaomaoBJTU wants to merge 6 commits into main from dingmaomaobjtu/perf-memory-measurement

3 participants
