Skip to content

Tracing instrumentation on akita beta.7: fix task leaks + add blocked-time milestones#279

Merged
syifan merged 21 commits into
mainfrom
tracing-instrumentation-beta7
Jun 25, 2026
Merged

Tracing instrumentation on akita beta.7: fix task leaks + add blocked-time milestones#279
syifan merged 21 commits into
mainfrom
tracing-instrumentation-beta7

Conversation

@syifan

@syifan syifan commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Brings MGPUSim's own timing components up to the akita v5 task-tracing
convention. The reused mem/ hierarchy was already well-instrumented inside
akita, but MGPUSim's compute/control components (CU, CP, DMA, dispatcher, RDMA)
had zero milestones/tags and several tracing-task leaks. This PR fixes
the leaks and adds blocked-time attribution where a task spans the wait.

Step 0 — bump akita to v5.0.0-beta.7

Zero source changes required. beta.7 adds the reset helpers
(EndReqInOnReset/EndTaskOnReset), the buffer-admission registry
(MsgIDAtIncomingBuffer), MilestoneKindWork, auto-attached incoming+outgoing
buffer port tracing, and tracingtest.LeakRecorder.

Step 1 — stop tracing-task leaks (correctness)

  • CU pipeline flush dropped the tasks of the work it discarded. Now
    endInflightTracingTasks ends the inst task of every running wavefront
    (covering all sub-units + the scheduler's internalExecuting), SIMDUnit.Flush
    ends the per-SIMD pipeline tasks, populateShadowBuffers ends the
    shadow-resend req_out originals, and the sampled-run path ends all
    wavefront tasks in a work-group (not just the one whose completion event fired).
  • CP FlushReq req_in was opened but never completed — now completed on the
    flush-done response.

Step 2 — blocked-time milestones

Admission waits hang on the incoming-buffer task; processing waits on the
req_in / inst task.

Component Milestone Kind
DMA processing slot / ToMem hardware_resource / data
CP dispatcher / ToDMA hardware_resource / network_busy
RDMA remote (round trip) data
CU vmem-inflightcoalescevmem/smem hardware_resource → work → data
CU s_waitcnt / s_endpgm data

A vector-memory instruction now tiles cleanly into vmem-inflight
(in-flight slot, hardware_resource) → coalesce (the coalescing /
transaction-issue work, work, backed by a pipeline child subtask) →
data (the memory round trip, aligned with the req_out transaction bars).

Step 5 — documentation, not change

Two flagged "cross-domain parentage" issues are correct by design given the
shared vis DBTracer (SIMD pipeline tasks must stay on the SIMD-unit domain for
the per-SIMD busy tracer; the dispatcher's MapWGReq parent resolves in the one
trace DB). Documented at each site so a future audit doesn't "fix" and regress.

Not in scope (per request): tags, and full control-plane instrumentation beyond
the FlushReq leak. CU pre-issue stalls (CanAcceptWave, !CanSend) are left
uninstrumented — those waits precede the instruction's task.

amd/timing/TRACING.md documents the lifecycle, milestone table, and the
intentional gaps.

Verification

  • go build ./..., go vet ./... clean; golangci-lint 0 issues on touched packages.
  • All amd/timing/... + amd/driver/... tests pass, including new
    LeakRecorder leak tests and milestone-recording tests per component.
  • End-to-end: a traced FIR run produces every new milestone, all 136 VMem
    inst tasks carry the full vmem-inflight→coalesce→data tiling with 0 trailing
    gaps and 136 pipeline subtasks paired 1:1 with the work milestones (0 leaked).

🤖 Generated with Claude Code

syifan and others added 10 commits June 24, 2026 13:06
Brings the newer tracing convention: reset-teardown helpers
(EndReqInOnReset/EndTaskOnReset), buffer-admission tracing
(MsgIDAt{Incoming,Outgoing}Buffer), MilestoneKindWork, auto-attached
incoming+outgoing buffer port tracers, and the tracingtest.LeakRecorder
helper. No source changes required; all timing tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A CU pipeline flush dropped the tracing tasks of the work it discarded,
leaking them as started-never-ended (and growing the consumer tracers'
in-flight maps). Four leaks, all now closed:

- inst tasks held in the sub-unit pipeline stages and the scheduler's
  internalExecuting slice. endInflightTracingTasks() ends the inst task
  of every WfRunning wavefront before the flush resets them to WfReady;
  this covers all sub-units uniformly. Mem-in-flight wavefronts are
  WfReady (preserved in the shadow buffers, re-sent on restart) and are
  correctly skipped.
- the per-SIMD "pipeline" tasks of the instructions dropped from the
  SIMD pipeline (SIMDUnit.Flush). These stay on the SIMD-unit domain so
  the per-SIMD busy-time tracer can still measure them.
- the req_out tasks of the shadowed memory accesses: the shadow re-send
  regenerates the message ID without re-initiating, so the originals are
  never finalized. populateShadowBuffers ends them.
- the sampled-run path opened a "wavefront" task per wavefront but
  handleWfCompletionEvent ended only the one whose completion event
  fired, leaking the other N-1 in the work-group. It now ends all.

Uses the beta.7 EndTaskOnReset helper (no-op when a task was never
started, so over-calling is safe). Tested with tracingtest.LeakRecorder.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Step 1 leak: cpMiddleware.processFlushReq opened a req_in for the
driver's FlushReq but nothing ever completed it (the flush runs entirely
on the control path). ctrlMiddleware now completes it when the flush-done
response is sent to the driver.

Step 2 milestones (blocked-time attribution):
- DMA: a "data" milestone on the memcopy req_in when all bottom
  reads/writes have returned (the wait on the memory system), and a
  "hardware_resource: processing slot" admission milestone on the
  incoming-buffer task when a concurrency slot frees.
- CP: a "hardware_resource: dispatcher" admission milestone on the
  LaunchKernelReq buffer task (waited for a free dispatcher) and a
  "network_busy: ToDMA" admission milestone on the memcopy buffer task
  (waited for the downstream port to drain).

Admission waits hang on the incoming-buffer task via the beta.7
MsgIDAtIncomingBuffer registry; processing waits hang on the req_in.
Added a milestone-recording test for the DMA data milestone.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Record a 'data' milestone on the forwarding req_in task when the
forwarded request's response returns, attributing the time the
transaction spent waiting on the round trip to the other side. Added in
the shared endTransaction path, so both the inside->outside and
outside->inside directions are covered. Milestone-recording test added.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Attribute blocked time on the wavefront's instruction tasks where a task
already spans the wait:
- hardware_resource 'vmem-inflight' when the in-flight vector-memory
  budget admits an instruction's transactions.
- data 's_waitcnt' when an S_WAITCNT's outstanding accesses drain to the
  requested counts.
- data 's_endpgm' when a wavefront's outstanding accesses all return so
  S_ENDPGM can retire.

The pre-issue structural-hazard (CanAcceptWave) and port-send (!CanSend)
stalls are intentionally left uninstrumented: those waits happen before
the instruction's task exists, so there is no task to attribute them to.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The audit flagged two 'cross-domain parentage' issues; both are correct
by design given mgpusim's shared vis DBTracer, and 'fixing' either would
regress. Document why, at each site:

- SIMD 'pipeline' tasks live on the SIMD-unit domain so report.go's
  per-SIMD BusyTimeTracer (filtered to Kind==pipeline) can measure them;
  moving them to cu.comp would collapse that metric.
- the dispatcher's MapWGReq req_out (domain d) is parented to the
  LaunchKernelReq req_in (domain d.cp); the same DBTracer instance is on
  both domains, so the link resolves in one trace database.

No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- amd/timing/TRACING.md documents the task lifecycle, the milestone
  table, the reset teardown, and the intentional gaps/limitations.
- extract SchedulerImpl.markMemDrained and a buildTracedRDMA test helper
  to keep evalSEndPgm and the RDMA test under the funlen limit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A vector/scalar memory instruction's inst task ended when its last
response returned, but nothing marked that return, so the task bar showed
a long unexplained gap between the in-flight admission (or issue) and the
task end — the time the request was out to memory.

Emit a "data" milestone (vmem/smem) at each memory-response-return site
(scalar load, vector load, vector store), on the instruction's task, the
moment before it ends. The milestone lands at the task end, so the memory
round trip is now attributed and the lifetime tiles with no trailing gap.

Verified on a traced FIR run: all 136 VMem inst tasks now carry the data
milestone and no mem inst task has a trailing gap > 1 cycle.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The vector-memory data milestone made the data wait appear to start at
the in-flight admission, but the transactions are not sent until several
cycles later (coalescing penalty + transaction pipeline). The interval
between admission and first send — during which no request is outstanding
— was mis-attributed to the data wait, so the data bar ran ahead of the
child req_out subtask bars.

Emit a "coalesce" milestone (hardware_resource) when the first transaction
is sent. The vmem inst task now tiles as vmem-inflight -> coalesce ->
data, and the data interval lines up exactly with the transaction
subtasks. Verified on a traced FIR run: coalesce lands at the first
subtask StartTime and data at the last subtask EndTime.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The coalescing / transaction-issue phase is the vector-memory unit doing
work (coalescing the accesses and traversing its transaction pipeline),
not blocking on a hardware resource. Reclassify the coalesce milestone
from hardware_resource to work.

Per the "work ⇒ subtask" coverage rule, a work milestone must be backed
by a child subtask that shows what the work was. Open a "pipeline"
subtask when the instruction's transactions are admitted and close it
when the first one is sent; the work milestone lands at that close. The
subtask is torn down on flush so it cannot leak.

Verified on a traced FIR run: each vmem inst now tiles as
vmem-inflight (hardware_resource) -> coalesce (work, with a pipeline
subtask spanning it) -> data, 136 work milestones paired 1:1 with 136
pipeline subtasks, 0 unended.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2810015eac

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread amd/timing/cu/scheduler.go Outdated

// The wavefront's outstanding memory accesses have all returned, so
// S_ENDPGM can retire: mark the end of that drain wait.
s.markMemDrained(wf, "s_endpgm")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Make s_endpgm milestone idempotent

When this S_ENDPGM is the last wavefront in the work-group and the CU's ACE port is full, sendWGCompletionMessage returns false below and the wavefront stays in internalExecuting, so the next tick re-enters this path and emits another s_endpgm data milestone for the same instruction. In congested runs this records one milestone per retry and corrupts the blocked-time attribution for the instruction; emit the milestone only once, or only on a path that cannot return false and retry.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 5917870. Moved the s_endpgm milestone emission out of the post-drain-check spot and into a single point after the work-group is disposed of (the branches are now an if/else-if), so the ACE-port-full retry returns before it and can't re-emit. As a bonus the milestone now lands at retirement (the inst-task end) with no trailing gap. Added a test that drives the ACE-full retry then the completing tick and asserts exactly one emission.

Addresses Codex review on #279. evalSEndPgm emitted the s_endpgm data
milestone right after the outstanding-memory check, but when S_ENDPGM is
the last wavefront and the ACE port is full, sendWGCompletionMessage
returns false and the wavefront stays in internalExecuting, so the next
tick re-entered the path and re-emitted the milestone — one per retry in
congested runs. (The vis DBTracer dedups by (Kind,What), but non-deduping
consumers, including the milestone-recording tests, would see duplicates.)

Restructure the completion branches into an if/else-if and emit the
milestone once, after the work-group is disposed of, on a path that can
no longer return false and retry. As a bonus the milestone now lands at
retirement (the inst-task end) with no trailing gap. New test drives the
ACE-full retry then the completing tick and asserts exactly one emission.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5917870d17

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread amd/timing/cu/scheduler.go Outdated
// S_ENDPGM has retired: mark the end of the drain wait exactly once, here,
// where the path can no longer return false and retry. An ACE-port-full
// retry returns above before reaching this point, so it does not re-emit.
s.markMemDrained(wf, "s_endpgm")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Emit s_endpgm before closing the inst task

When S_ENDPGM retires while other wavefronts in the work-group are still executing, this common milestone runs after the branch above has already called logInstTask(..., true) and closed the current instruction task; the barrier path has the same problem via passBarrier/setAllWfStateToReady. In those multi-wavefront cases the new s_endpgm data milestone is emitted after its inst task has ended, so the drain wait is dropped or attached outside the task instead of attributing the instruction lifetime. Move the milestone before any path can end the instruction task.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ee715a6. Decide the retry up front now — compute allCompleted and, on the work-group-completing path, return early if the ACE port is full (the only path that can retry) before emitting — then emit the milestone once, before the if/else that closes the inst task. Keeps the single-emission idempotency from the prior fix. Added a test that drives the executing branch and asserts the s_endpgm milestone is recorded before the inst task's EndTask.

Addresses the follow-up Codex review on #279. The previous fix emitted
the s_endpgm milestone after the if/else, but the barrier and executing
branches close the instruction's inst task within the branch (via
passBarrier -> setAllWfStateToReady and logInstTask), so the milestone
landed on an already-ended task.

Decide the retry up front: compute allCompleted, and on the
work-group-completing path return early if the ACE port is full (the only
path that can retry) — before emitting. Then emit the milestone once,
before the if/else that ends the task. Keeps the idempotency from the
prior fix and restores correct attribution within the inst lifetime.

New test drives the executing branch and asserts the milestone is
recorded before the inst task's EndTask.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ee715a6c76

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread amd/timing/cu/vectormemoryunit.go Outdated
// data wait, which only begins once a request is actually outstanding —
// the subtask bar and the work milestone span exactly the interval
// before the req_out subtasks start.
u.endIssueSubtask(info.Inst)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don't attribute VMem port backpressure to coalesce work

In runs where the vector memory port is back-pressured after a transaction has reached postTransactionPipelineBuffer, sendRequest breaks on the !CanSend() check above and leaves the issue subtask open; when the send finally succeeds this line emits the coalesce work milestone, so all of those port-full cycles are recorded as productive coalescing/issue work instead of a port/network wait. This skews the blocked-time attribution for VMem instructions under memory-network congestion.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in b1d8562. Moved the issue-subtask close (the coalesce work milestone) to when the transaction reaches the head of the send buffer — before the !CanSend() check — so the work interval ends when the coalescing/issue work is actually done. Port-backpressure cycles (waiting to send) are no longer counted as work. It stays idempotent via the map, so a port-full retry on the same transaction doesn't re-emit. Added a test that holds the port full (work still fires once, transaction unsent) then frees it (transaction sent, no second work milestone).

Addresses Codex review on #279. sendRequest closed the issue subtask and
emitted the coalesce work milestone at the actual send, but the !CanSend
break above means a port-back-pressured transaction sits in
postTransactionPipelineBuffer with the subtask still open — so those
port-full cycles were recorded as coalescing/issue work instead of a
wait, skewing VMem attribution under memory-network congestion.

Close the issue subtask (emit the coalesce work milestone) when the
transaction reaches the head of the send buffer, before the CanSend
check — the issue work is done at that point. It stays idempotent via the
map, so a port-full retry on the same transaction does not re-emit; the
work interval now ends when the work is done, not when the port frees.

New test holds the port full, asserts the work milestone still fires once
(transaction unsent), then frees the port and asserts the transaction is
sent with no second work milestone.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b1d85621bf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// ID — would never be finalized. End them here. EndTaskOnReset is a no-op
// for an access that was buffered but never sent (no req_out was opened).
for i := 0; i < len(cu.InFlightInstFetch); i++ {
tracing.EndTaskOnReset(cu.comp, cu.InFlightInstFetch[i].Req.ID)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Re-open req_out tasks when shadow requests are resent

In cache/shootdown flushes that hit an in-flight CU fetch/scalar/vector memory access, this closes the old req_out task, but the shadow resend paths (send*ShadowBufferAccesses) assign a fresh message ID and send the request without calling TraceReqInitiate; the eventual TraceReqFinalize uses that fresh ID, so there is no matching req_out task and the post-flush memory round trip disappears from the trace. Please start a replacement req_out task when the shadow request is actually re-sent, or keep the original ID/task paired with the final response.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 6cad3a8. The shadow resend now re-initiates a replacement req_out task on each path (send{Scalar,Vector,InstFetch}ShadowBufferAccesses) for the freshly-generated message ID, parented to the still-open inst task (the fetch task for fetches). The original task is still ended at flush so there's no leak, and the resend's task is what TraceReqFinalize closes — the post-flush round trip is traced again. Added a test asserting the resend opens a req_out for the new ID; the existing restart test now sets Inst on the shadow mem accesses (always present in production).

syifan and others added 2 commits June 25, 2026 14:58
Mirror of #281 so this PR's CI runs on free/large runners while the
self-hosted fleet is busy with calibration. Drop on rebase once #281 lands.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Addresses Codex review on #279. populateShadowBuffers ends an in-flight
mem access's req_out task at flush, but the shadow resend assigned a
fresh message ID and sent without TraceReqInitiate, so the post-flush
round trip had no req_out task and TraceReqFinalize matched nothing —
the memory round trip disappeared from the trace.

Re-initiate a replacement req_out on each resend path
(send{Scalar,Vector,InstFetch}ShadowBufferAccesses), parented to the
still-open inst task (fetch: the fetch task). The old task is still ended
at flush, so there is no leak; the resend gets its own task that
TraceReqFinalize closes.

Tests: new TestShadowResendOpensReplacementReqOut asserts the resend
opens a req_out for the fresh ID; the existing restart test now sets Inst
on the shadow mem accesses (always present in production), which the
resend parents on.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6cad3a8155

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// End any open issue subtasks so they do not leak as started-never-ended
// (their parent inst tasks are ended by ComputeUnit.endInflightTracingTasks).
for _, id := range u.issueTaskIDs {
tracing.EndTaskOnReset(u.cu.comp, id)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve coalesce milestones across vector flush

When a flush happens after executeFlatLoad/executeFlatStore has admitted the transactions but before the first one reaches sendRequest, the wavefront has already been set back to WfReady, so ComputeUnit.endInflightTracingTasks leaves the parent memory-inst task open for the shadow response. This reset-ends and forgets the open coalesce subtask, while the shadow resend path sends the request directly without calling endIssueSubtask, so flushed vector-memory instructions lose their coalesce work milestone and the remaining vmem-inflightvmem interval is attributed as data wait instead of coalescing/issue work plus the post-flush round trip.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d207a8b. VectorMemoryUnit.Flush now emits the coalesce work milestone (and ends the subtask) for each still-open issue subtask, so an instruction flushed between admission and its first send keeps its coalesce work attribution instead of folding the vmem-inflight→vmem interval into the data wait. The parent inst task is still open at that point (WfReady), so the milestone lands on it correctly. New test opens an issue subtask, flushes, and asserts the work milestone is recorded and the map cleared.

Addresses Codex review on #279. When a flush hits a vector-memory
instruction after executeFlatLoad/Store admitted its transactions but
before the first reaches sendRequest, the wavefront is already WfReady so
endInflightTracingTasks leaves the parent inst task open for the shadow
response — but VectorMemoryUnit.Flush only EndTaskOnReset'd the issue
subtask without emitting the coalesce work milestone, and the shadow
resend bypasses the unit (never calls endIssueSubtask). The instruction
thus lost its coalesce work milestone and the vmem-inflight->vmem
interval folded into the data wait.

Flush now emits the coalesce work milestone (and ends the subtask) for
each still-open issue subtask, so the issue work is attributed even when
a flush interrupts it. New test opens an issue subtask, flushes, and
asserts the work milestone is recorded and the subtask cleared.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d207a8b5c6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +212 to +216
tracing.AddMilestone(m.comp, tracing.Milestone{
TaskID: trans.RecvTaskID,
Kind: tracing.MilestoneKindData,
What: "remote",
})

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge End remote wait before egress-port backpressure

Because endTransaction is only reached after processIncomingRsp/processFromL2 have already passed their egress-port CanSend() checks, a returned RDMA response sitting behind a full RDMARequestInside or RDMADataOutside port does not emit this remote data milestone until that port frees. In RDMA egress congestion this charges output-port backpressure to the remote/data interval instead of ending the remote round trip when the response arrives; emit the milestone before the egress wait, with a guard so retries do not duplicate it.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f566ea5. The remote milestone now fires when the response first arrives (right after it's matched to its transaction), before the egress-port CanSend check, so egress backpressure isn't charged to the remote round-trip interval. Guarded by a new transaction.RemoteMilestoneEmitted flag so egress-full retries don't duplicate it; endTransaction no longer emits it. New test saturates the egress port and asserts the milestone fires exactly once, at arrival.

syifan added a commit that referenced this pull request Jun 25, 2026
…iew)

Addresses Codex review on #279. The "remote" data milestone was emitted
in endTransaction, which runs only after processIncomingRsp/processFromL2
pass their egress-port CanSend check — so a response stuck behind a full
RDMARequestInside/RDMADataOutside port did not end the remote round-trip
interval until the port drained, charging egress backpressure to the
remote/data interval.

Emit the milestone when the response first arrives (right after matching
it to its transaction), before the egress wait, guarded by a new
transaction.RemoteMilestoneEmitted flag so egress-full retries do not
duplicate it. New test saturates the egress port and asserts the
milestone fires exactly once, at arrival.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…iew)

Addresses Codex review on #279. The "remote" data milestone was emitted
in endTransaction, which runs only after processIncomingRsp/processFromL2
pass their egress-port CanSend check — so a response stuck behind a full
RDMARequestInside/RDMADataOutside port did not end the remote round-trip
interval until the port drained, charging egress backpressure to the
remote/data interval.

Emit the milestone when the response first arrives (right after matching
it to its transaction), before the egress wait, guarded by a new
transaction.RemoteMilestoneEmitted flag so egress-full retries do not
duplicate it. New test saturates the egress port and asserts the
milestone fires exactly once, at arrival.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@syifan syifan force-pushed the tracing-instrumentation-beta7 branch from f566ea5 to 094af95 Compare June 25, 2026 20:49

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 094af95335

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

wf.OutstandingScalarMemAccess--
}

cu.markInstDataReturned(info.Inst, "vmem")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Wait for actual last coalesced response

When coalesced memory responses arrive out of order (a case the existing CPI stack tracer already accounts for), CanWaitForCoalesce == false only marks the last request generated, not the last response received. Emitting the new vmem data milestone here can close the data-wait interval before earlier sibling transactions for the same instruction have returned, so those late responses are omitted or misattributed; gate the milestone on there being no remaining in-flight accesses for info.Inst instead.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f7720b7. The vmem/smem data milestone and the inst-task end are now gated on there being no remaining in-flight access for the instruction (hasInFlight{Vector,Scalar}MemFor), so out-of-order coalesced responses no longer close the data wait early. The outstanding-count decrement stays on the last-generated response (the S_WAITCNT accounting is unchanged). Applied to all three mem-return handlers. New test returns the last-generated transaction first and asserts the data milestone only fires once the sibling also returns.

syifan added a commit that referenced this pull request Jun 25, 2026
…ne (review)

Addresses Codex review on #279. Coalesced memory responses can return
out of order, so CanWaitForCoalesce==false (isLastRead) marks the last
request *generated*, not the last *received*. Emitting the vmem/smem data
milestone and ending the inst task there could close the data wait while
earlier sibling transactions for the same instruction were still in
flight, dropping or misattributing those late responses.

Keep the outstanding-count decrement on the last-generated response (the
S_WAITCNT accounting), but gate the data milestone and logInstTask on
there being no remaining in-flight access for the instruction
(hasInFlight{Vector,Scalar}MemFor), across all three mem-return handlers.

New test returns the last-generated transaction first and asserts the
data milestone only fires once its sibling also returns.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f7720b7296

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// hasInFlightVectorMemFor reports whether any vector-memory transaction for the
// given instruction is still in flight.
func (cu *ComputeUnit) hasInFlightVectorMemFor(inst *wavefront.Inst) bool {
for _, info := range cu.InFlightVectorMemAccess {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Include shadow-buffered siblings in last-response checks

Fresh issue beyond the earlier out-of-order response case: during CU shadow resend after a flush, siblings that have not been re-sent yet live in shadowInFlightVectorMemAccess, not InFlightVectorMemAccess. Because processInput still accepts memory responses while IsSendingOutShadowBufferReqs is true and sendVectorShadowBufferAccesses only sends one shadow entry per tick, a fast response for the first resent transaction can make this helper return false, causing the vmem milestone and parent inst task to close before later coalesced siblings are even re-sent; the scalar helper below has the same shadow-buffer gap.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4404b52. hasInFlight{Vector,Scalar}MemFor now also scan shadowInFlight{Vector,Scalar}MemAccess, so the data milestone and inst-task end wait for siblings parked in the shadow buffer during a flush resend — not just those in InFlight*. A fast response to the first resent transaction can no longer close the interval before the others are re-sent. New test parks a sibling in shadowInFlightVectorMemAccess and asserts the resent transaction's response doesn't fire the milestone.

…ne (review)

Addresses Codex review on #279. Coalesced memory responses can return
out of order, so CanWaitForCoalesce==false (isLastRead) marks the last
request *generated*, not the last *received*. Emitting the vmem/smem data
milestone and ending the inst task there could close the data wait while
earlier sibling transactions for the same instruction were still in
flight, dropping or misattributing those late responses.

Keep the outstanding-count decrement on the last-generated response (the
S_WAITCNT accounting), but gate the data milestone and logInstTask on
there being no remaining in-flight access for the instruction
(hasInFlight{Vector,Scalar}MemFor), across all three mem-return handlers.

New test returns the last-generated transaction first and asserts the
data milestone only fires once its sibling also returns.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@syifan syifan force-pushed the tracing-instrumentation-beta7 branch from f7720b7 to 8995e85 Compare June 25, 2026 21:10
Addresses Codex review on #279. hasInFlight{Vector,Scalar}MemFor scanned
only InFlight{Vector,Scalar}MemAccess, but during a flush resend the
coalesced siblings not yet re-sent live in shadowInFlight*. Since input
is still processed while IsSendingOutShadowBufferReqs is true and the
resend sends one shadow entry per tick, a fast response to the first
resent transaction could make the helper return false and close the vmem
data milestone and parent inst task before the other siblings are even
re-sent.

Scan the shadow buffers too, so the milestone and inst-task end wait for
every sibling — in-flight or shadow-parked. New test parks a sibling in
shadowInFlightVectorMemAccess and asserts the response to the resent
transaction does not fire the milestone.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4404b527e3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

gcn3_emu_test:
name: GCN3 Emulation Test
runs-on: self-hosted
runs-on: Github-Large-1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use a live runner for acceptance jobs

In the acceptance-test section I checked, the GCN3/CDNA3 emulation and timing jobs now target Github-Large-1; the repo notes that this label is offline and previously left CI runs queued indefinitely (roadmap.md lines 199-210), while spec.md line 31 directs non-trivial jobs to the self-hosted Marin runners. On PRs this makes gcn3_emu_test sit queued and prevents the downstream timing/CDNA3 jobs from running, so the workflow no longer provides those validations.

Useful? React with 👍 / 👎.

@syifan syifan merged commit c244972 into main Jun 25, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant