mem: NUMA-style page-migration acceptance example#433

Merged: syifan merged 2 commits into main from page-migration-acceptance-example on Jun 25, 2026
@syifan syifan commented Jun 25, 2026

What

A multi-CPU page-migration acceptance test at mem/acceptancetests/pagemigration, plus a latent datamover bug fix it depends on.

NUMA model

Several memaccessagents sit on a shared hierarchy whose physical address space is split across two memory devices. The L2 routes each physical address to the owning device's controller, so a remote access works over the interconnect without any migration. Migration is therefore a transparent performance optimization, not a correctness requirement (the test passes with -migrate=false). Each page reserves a home slot on both devices at the same in-device offset, so migrating a page is a PAddr/DeviceID flip that needs no frame allocator.
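To make the home-slot idea concrete, here is a minimal sketch. It is not the code in this PR: `pageSize`, `deviceSpan`, `Page`, `ownerOf`, and `Migrate` are hypothetical names, and the sketch assumes each device owns one contiguous range of the physical address space.

```go
package main

import "fmt"

// Hypothetical layout: two memory devices, each owning a contiguous
// range of the physical address space. None of these names come from the PR.
const (
	pageSize   = 4096
	deviceSpan = 1 << 20 // bytes of physical address space per device (assumed)
	numDevices = 2
)

// Page records where a page currently lives.
type Page struct {
	Offset   uint64 // in-device offset of the home slot, identical on both devices
	DeviceID int    // device that currently holds the page
}

// PAddr returns the physical address of the page on its current device.
func (p Page) PAddr() uint64 {
	return uint64(p.DeviceID)*deviceSpan + p.Offset
}

// ownerOf mimics the routing decision: which device owns a physical address.
func ownerOf(pAddr uint64) int {
	return int(pAddr / deviceSpan)
}

// Migrate flips the page to the other device. No frame allocator is needed
// because the slot at the same offset is already reserved there.
func (p *Page) Migrate() {
	p.DeviceID = (p.DeviceID + 1) % numDevices
}

func main() {
	p := Page{Offset: 3 * pageSize, DeviceID: 0}
	fmt.Printf("before: paddr=%#x owner=%d\n", p.PAddr(), ownerOf(p.PAddr()))
	p.Migrate()
	fmt.Printf("after:  paddr=%#x owner=%d\n", p.PAddr(), ownerOf(p.PAddr()))
}
```

Because the in-device offset never changes, the only state a migration has to update in this model is the device ID, and the physical address follows from it.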

Migration controller

A periodic (round-robin) controller relocates a page between devices, driving the existing memcontrolprotocol control ports:

drain ROBs → pause AT/L1$/L2$/L1TLB/L2TLB/MMU → flush write-back L2
  → data-mover copy (src→dst) → repoint page table → invalidate caches+TLBs → resume

The ordering is the correctness argument: drain quiesces in-flight writes, flush makes memory authoritative before the copy, invalidate drops stale mappings/lines after the repoint. The agents' value-checks are the oracle — any non-transparent migration yields a read mismatch.
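The ordering argument can be illustrated with a deliberately tiny model: one page, one write-back cache line, and two devices. This is only a sketch under those assumptions. `toySystem` and its methods are hypothetical and do not use the memcontrolprotocol ports; drain, pause, and resume appear only as comments because nothing is in flight in a single-threaded toy.

```go
package main

import "fmt"

// toySystem models one page, one write-back cache line, and two memory
// devices. All names here are hypothetical.
type toySystem struct {
	mem    [2]byte // the page's home slot on each device
	device int     // page-table entry: device that currently owns the page
	cached byte    // cached copy of the page
	valid  bool    // the cache line holds the page
	dirty  bool    // the cached copy is newer than memory
}

// write stores to the cache only, as a write-back cache would.
func (s *toySystem) write(v byte) {
	s.cached, s.valid, s.dirty = v, true, true
}

// read hits in the cache if possible, otherwise fetches from the owning device.
func (s *toySystem) read() byte {
	if !s.valid {
		s.cached, s.valid, s.dirty = s.mem[s.device], true, false
	}
	return s.cached
}

// migrate follows the ordering described above. Passing flush=false skips
// the flush step to show what goes wrong without it.
func (s *toySystem) migrate(flush bool) {
	// drain ROBs, pause components (no-ops in this toy)
	src, dst := s.device, 1-s.device
	if flush && s.valid && s.dirty {
		s.mem[src] = s.cached // flush: memory becomes authoritative
		s.dirty = false
	}
	s.mem[dst] = s.mem[src]         // data-mover copy (src -> dst)
	s.device = dst                  // repoint the page table
	s.valid, s.dirty = false, false // invalidate the stale line
	// resume
}

// run writes a value, migrates, and reports whether the read-back matches,
// which plays the role of the agents' value check.
func run(flush bool) bool {
	s := &toySystem{}
	s.write(42)
	s.migrate(flush)
	return s.read() == 42
}

func main() {
	fmt.Println("with flush, read matches:", run(true))
	fmt.Println("without flush, read matches:", run(false))
}
```

With the flush, the dirty value reaches the source slot before the copy and the read-back matches. Without it, the copy moves stale memory and the invalidate discards the only up-to-date copy, so the value check fails.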

Validation

  • Passes serial + parallel engines, 2/4/8 agents, various sizes and intervals (up to ~80 migrations/run).
  • Negative control: disabling the flush step makes the oracle panic (Mismatch when read), confirming the test actually catches migration corruption.
  • Registered in mem/acceptance_test.py (baseline-no-migration + migration matrix, incl. 4-agent and parallel).

datamover fix (included)

readFromSrc compared the absolute read address against the relative buffer window, so a move whose SrcAddress was at or beyond BufferSize issued zero reads and hung. Every existing data-mover test used SrcAddress=0, so the bug was latent. The comparison now happens in transaction-relative space, and a regression test covers a non-zero source address. (Happy to split this into its own PR if you'd prefer; the example just needs it to land first.)
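The shape of this bug can be sketched as follows. This is not the actual datamover code: the `transfer` struct, its fields, and both `canRead*` methods are hypothetical stand-ins that only reproduce the absolute-versus-relative comparison described above.

```go
package main

import "fmt"

// transfer is a hypothetical stand-in for one data-mover transaction.
type transfer struct {
	srcAddress uint64 // absolute source address of the move
	byteSize   uint64 // total bytes to move
	nextRead   uint64 // absolute address of the next read to issue
	bufOffset  uint64 // start of the buffer window, relative to srcAddress
	bufferSize uint64 // bytes the buffer can hold
}

// canReadBuggy compares an absolute address against a relative window.
// It only behaves when srcAddress happens to be 0.
func (t transfer) canReadBuggy() bool {
	return t.nextRead < t.bufOffset+t.bufferSize
}

// canReadFixed converts the read address to transaction-relative space
// before comparing it with the window and the transfer size.
func (t transfer) canReadFixed() bool {
	rel := t.nextRead - t.srcAddress
	return rel < t.byteSize && rel < t.bufOffset+t.bufferSize
}

func main() {
	for _, src := range []uint64{0, 4096} {
		t := transfer{srcAddress: src, byteSize: 1024, nextRead: src, bufferSize: 256}
		fmt.Printf("src=%d buggy=%v fixed=%v\n", src, t.canReadBuggy(), t.canReadFixed())
	}
}
```

In this model the two checks agree when the source address is 0, which is why a test suite that only ever moves from address 0 would not notice the difference. With a source address at or beyond the buffer size, the buggy check never allows the first read.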

Notes / follow-ups

  • The controller currently ticks each cycle during the idle countdown and stops by inspecting agent state. A cleaner approach (a "background event" engine property so the engine terminates when only heartbeat events remain, plus future-scheduling instead of per-cycle ticks) is planned as a separate core-engine change; this PR is the working baseline.

go build ./..., golangci-lint, and the data-mover tests pass.

🤖 Generated with Claude Code

syifan and others added 2 commits June 25, 2026 08:38
Add a multi-CPU page-migration acceptance test at
mem/acceptancetests/pagemigration. Several memory-access agents sit on a
shared hierarchy whose physical address space is split across two memory
devices; the L2 routes each address to the owning device, so a remote
access works without migration (migration is a transparent optimization).

A migration controller periodically relocates a page between devices with
the sequence drain ROBs -> pause the rest -> flush the write-back L2 ->
copy via the data mover -> repoint the page table -> invalidate caches and
TLBs -> resume. The agents' value checks are the oracle: any
non-transparent migration produces a read mismatch. Verified across
serial/parallel engines and 2/4/8 agents; disabling the flush step makes
the oracle fail, confirming the test has teeth.

Also fix a latent bug in mem/datamover: readFromSrc compared the absolute
read address against the relative buffer window, so a move whose
SrcAddress was at or beyond BufferSize issued no reads and hung. Every
existing data-mover test used SrcAddress=0, so it never surfaced. Fixed to
compare in transaction-relative space and add a regression test covering a
non-zero source address.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@syifan syifan merged commit 98dd19f into main Jun 25, 2026
4 checks passed
@syifan syifan deleted the page-migration-acceptance-example branch June 25, 2026 18:34