Add --memory flag to winml perf for per-phase memory measurement #861

Open
DingmaomaoBJTU wants to merge 6 commits into main from dingmaomaobjtu/perf-memory-measurement

@DingmaomaoBJTU DingmaomaoBJTU commented Jun 10, 2026

Collaborator

Summary

Adds process and device memory measurement to winml perf, enabled by default via --memory/--no-memory.

Motivation

Users need to know: "Can this model fit on my device?" This requires two numbers:

  • Process peak memory — how much physical RAM the whole pipeline consumes
  • Device memory — how much NPU/GPU dedicated VRAM is allocated

Output

Memory:      425.7 MB (process peak) | 54.2 MB (device)

One line, two numbers — everything needed to decide if a model can be deployed. Full per-phase breakdown is still available in the JSON output for advanced analysis.

Implementation

  • session/monitor/memory_tracker.py (new): Pure ctypes K32GetProcessMemoryInfo for process memory + single-shot PDH query for device memory. Zero new dependencies.
  • commands/perf.py: Integrates MemoryTracker into PerfBenchmark.run() with snapshots at phase boundaries (baseline → load → compile → inference). Extends BenchmarkResult with memory_profile field.
  • --memory (default: True): Snapshots are taken between phases, not during the iteration loop, so there is zero overhead on latency measurements. Disable with --no-memory.
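A minimal sketch of how a ctypes-only process memory query can look, for readers unfamiliar with the approach. It is not the code from memory_tracker.py: the function name `query_process_memory_mb`, the `private_bytes_mb` key, and the 1024 × 1024 bytes-per-MB convention are assumptions made for illustration. The structure layout follows the Win32 `PROCESS_MEMORY_COUNTERS_EX` definition, and the function returns `None` on non-Windows platforms.

```python
import ctypes
import sys
from typing import Dict, Optional

BYTES_PER_MB = 1024 * 1024  # assumption: the PR may use a different MB convention


class PROCESS_MEMORY_COUNTERS_EX(ctypes.Structure):
    """Mirrors the Win32 PROCESS_MEMORY_COUNTERS_EX layout (DWORD and SIZE_T fields)."""

    _fields_ = [
        ("cb", ctypes.c_uint32),
        ("PageFaultCount", ctypes.c_uint32),
        ("PeakWorkingSetSize", ctypes.c_size_t),
        ("WorkingSetSize", ctypes.c_size_t),
        ("QuotaPeakPagedPoolUsage", ctypes.c_size_t),
        ("QuotaPagedPoolUsage", ctypes.c_size_t),
        ("QuotaPeakNonPagedPoolUsage", ctypes.c_size_t),
        ("QuotaNonPagedPoolUsage", ctypes.c_size_t),
        ("PagefileUsage", ctypes.c_size_t),
        ("PeakPagefileUsage", ctypes.c_size_t),
        ("PrivateUsage", ctypes.c_size_t),
    ]


def query_process_memory_mb() -> Optional[Dict[str, float]]:
    """Return memory counters for the current process, or None off Windows."""
    if sys.platform != "win32":
        return None

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetCurrentProcess.restype = ctypes.c_void_p
    get_info = kernel32.K32GetProcessMemoryInfo
    get_info.argtypes = [
        ctypes.c_void_p,
        ctypes.POINTER(PROCESS_MEMORY_COUNTERS_EX),
        ctypes.c_uint32,
    ]
    get_info.restype = ctypes.c_int

    counters = PROCESS_MEMORY_COUNTERS_EX()
    counters.cb = ctypes.sizeof(counters)
    handle = kernel32.GetCurrentProcess()  # pseudo-handle, no close needed
    if not get_info(handle, ctypes.byref(counters), counters.cb):
        raise ctypes.WinError(ctypes.get_last_error())

    return {
        "working_set_mb": counters.WorkingSetSize / BYTES_PER_MB,
        "peak_working_set_mb": counters.PeakWorkingSetSize / BYTES_PER_MB,
        "private_bytes_mb": counters.PrivateUsage / BYTES_PER_MB,  # hypothetical key
    }
```

Declaring `argtypes` and `restype` explicitly keeps the process handle from being truncated to 32 bits on 64-bit Python.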

JSON output

{
  "memory": {
    "baseline": {"working_set_mb": 125.3, ...},
    "post_load": {"working_set_mb": 389.1, ...},
    "post_compile": {"working_set_mb": 405.8, ..., "device_local_mb": 54.2},
    "post_inference": {"working_set_mb": 412.4, ..., "device_local_mb": 54.2},
    "peak_working_set_mb": 425.7,
    "peak_device_local_mb": 54.2,
    "total_delta_working_set_mb": 287.1
  }
}
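The summary fields above can be related to the per-phase snapshots with a small sketch. In the sample, total_delta_working_set_mb equals post_inference minus baseline (412.4 − 125.3 = 287.1), and peak_device_local_mb matches the largest device reading. The peak_working_set_mb value (425.7) is higher than every snapshot, so it presumably comes from an OS-reported peak counter rather than from the snapshots; that is an inference, and the sketch leaves it out. `PhaseSnapshot` and `summarize` are hypothetical names, not the PR's MemorySnapshot or MemoryProfile.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PhaseSnapshot:
    """Hypothetical stand-in for one per-phase entry in the JSON output."""

    working_set_mb: float
    device_local_mb: Optional[float] = None  # absent before the device is in use


def summarize(snapshots: Dict[str, PhaseSnapshot]) -> Dict[str, Optional[float]]:
    """Derive the delta and device-peak fields from the phase snapshots."""
    first = snapshots["baseline"]
    last = snapshots["post_inference"]
    device_readings = [
        s.device_local_mb for s in snapshots.values() if s.device_local_mb is not None
    ]
    return {
        "total_delta_working_set_mb": round(
            last.working_set_mb - first.working_set_mb, 1
        ),
        "peak_device_local_mb": max(device_readings) if device_readings else None,
    }


# Values copied from the sample JSON above.
snapshots = {
    "baseline": PhaseSnapshot(125.3),
    "post_load": PhaseSnapshot(389.1),
    "post_compile": PhaseSnapshot(405.8, 54.2),
    "post_inference": PhaseSnapshot(412.4, 54.2),
}
summary = summarize(snapshots)
print(summary)
```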

Testing

  • 15 new unit tests for memory_tracker.py
  • All 24 existing perf CLI tests pass unchanged

Implement per-phase memory tracking that captures Working Set, Private
Bytes, and device (NPU/GPU) memory at each benchmark phase boundary:
baseline, post-load, post-compile, and post-inference.

Key design decisions:
- Default enabled (--no-memory to disable) since snapshots are taken
  between phases and add zero overhead to latency measurements
- Pure ctypes implementation (K32GetProcessMemoryInfo) with no new
  dependencies
- Device memory via single-shot PDH query reusing existing adapter
  resolution logic
- Console output shows a table with per-phase deltas and peak summary
- JSON output includes full memory profile under 'memory' key
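The first design decision can be illustrated with a short sketch of where snapshots sit relative to the timed loop. This is an assumed structure, not the actual PerfBenchmark.run(): `run_phases` and its callable parameters are hypothetical. The point is that every snapshot call falls outside the `perf_counter` window, so it cannot inflate the recorded latencies.

```python
import time
from typing import Any, Callable, Dict, List, Tuple


def run_phases(
    load: Callable[[], None],
    compile_model: Callable[[], None],
    infer: Callable[[], None],
    take_snapshot: Callable[[], Any],
    iterations: int = 5,
) -> Tuple[Dict[str, Any], List[float]]:
    """Run load, compile, and inference, snapshotting only at phase boundaries."""
    snapshots: Dict[str, Any] = {"baseline": take_snapshot()}

    load()
    snapshots["post_load"] = take_snapshot()

    compile_model()
    snapshots["post_compile"] = take_snapshot()

    latencies: List[float] = []
    for _ in range(iterations):
        start = time.perf_counter()
        infer()  # only the inference call is timed
        latencies.append(time.perf_counter() - start)

    # Taken once after the loop, never inside a timed iteration.
    snapshots["post_inference"] = take_snapshot()
    return snapshots, latencies
```

The cost of this layout is that a transient spike inside the loop is invisible to the snapshots and is only caught if the OS also reports a peak counter.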

New files:
- session/monitor/memory_tracker.py: MemoryTracker, MemorySnapshot,
  MemoryProfile dataclasses and Windows/Linux memory query functions
- tests/unit/session/monitor/test_memory_tracker.py: unit tests
@DingmaomaoBJTU DingmaomaoBJTU requested a review from a team as a code owner June 10, 2026 08:13
Comment thread tests/unit/session/monitor/test_memory_tracker.py Fixed
…play

Shows the actual memory footprint during inference (steady state) rather
than the process lifetime peak which may include transient allocations
from model loading or compilation.
@xieofxie

Contributor

I think we could just enable it; it doesn't take much time.

xieofxie commented Jun 10, 2026

Contributor

Also consider adding this to --monitor

The test from PR #855 mocks BenchmarkResult but does not set
memory_profile=None, causing a MagicMock comparison failure in
display_console_report.
3 participants