Tracing instrumentation on akita beta.7: fix task leaks + add blocked-time milestones#279
Conversation
Brings the newer tracing convention: reset-teardown helpers
(EndReqInOnReset/EndTaskOnReset), buffer-admission tracing
(MsgIDAt{Incoming,Outgoing}Buffer), MilestoneKindWork, auto-attached
incoming+outgoing buffer port tracers, and the tracingtest.LeakRecorder
helper. No source changes required; all timing tests pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A CU pipeline flush dropped the tracing tasks of the work it discarded, leaking them as started-never-ended (and growing the consumer tracers' in-flight maps). Four leaks, all now closed: - inst tasks held in the sub-unit pipeline stages and the scheduler's internalExecuting slice. endInflightTracingTasks() ends the inst task of every WfRunning wavefront before the flush resets them to WfReady; this covers all sub-units uniformly. Mem-in-flight wavefronts are WfReady (preserved in the shadow buffers, re-sent on restart) and are correctly skipped. - the per-SIMD "pipeline" tasks of the instructions dropped from the SIMD pipeline (SIMDUnit.Flush). These stay on the SIMD-unit domain so the per-SIMD busy-time tracer can still measure them. - the req_out tasks of the shadowed memory accesses: the shadow re-send regenerates the message ID without re-initiating, so the originals are never finalized. populateShadowBuffers ends them. - the sampled-run path opened a "wavefront" task per wavefront but handleWfCompletionEvent ended only the one whose completion event fired, leaking the other N-1 in the work-group. It now ends all. Uses the beta.7 EndTaskOnReset helper (no-op when a task was never started, so over-calling is safe). Tested with tracingtest.LeakRecorder. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Step 1 leak: cpMiddleware.processFlushReq opened a req_in for the driver's FlushReq but nothing ever completed it (the flush runs entirely on the control path). ctrlMiddleware now completes it when the flush-done response is sent to the driver. Step 2 milestones (blocked-time attribution): - DMA: a "data" milestone on the memcopy req_in when all bottom reads/writes have returned (the wait on the memory system), and a "hardware_resource: processing slot" admission milestone on the incoming-buffer task when a concurrency slot frees. - CP: a "hardware_resource: dispatcher" admission milestone on the LaunchKernelReq buffer task (waited for a free dispatcher) and a "network_busy: ToDMA" admission milestone on the memcopy buffer task (waited for the downstream port to drain). Admission waits hang on the incoming-buffer task via the beta.7 MsgIDAtIncomingBuffer registry; processing waits hang on the req_in. Added a milestone-recording test for the DMA data milestone. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Record a 'data' milestone on the forwarding req_in task when the forwarded request's response returns, attributing the time the transaction spent waiting on the round trip to the other side. Added in the shared endTransaction path, so both the inside->outside and outside->inside directions are covered. Milestone-recording test added. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Attribute blocked time on the wavefront's instruction tasks where a task already spans the wait: - hardware_resource 'vmem-inflight' when the in-flight vector-memory budget admits an instruction's transactions. - data 's_waitcnt' when an S_WAITCNT's outstanding accesses drain to the requested counts. - data 's_endpgm' when a wavefront's outstanding accesses all return so S_ENDPGM can retire. The pre-issue structural-hazard (CanAcceptWave) and port-send (!CanSend) stalls are intentionally left uninstrumented: those waits happen before the instruction's task exists, so there is no task to attribute them to. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The audit flagged two 'cross-domain parentage' issues; both are correct by design given mgpusim's shared vis DBTracer, and 'fixing' either would regress. Document why, at each site: - SIMD 'pipeline' tasks live on the SIMD-unit domain so report.go's per-SIMD BusyTimeTracer (filtered to Kind==pipeline) can measure them; moving them to cu.comp would collapse that metric. - the dispatcher's MapWGReq req_out (domain d) is parented to the LaunchKernelReq req_in (domain d.cp); the same DBTracer instance is on both domains, so the link resolves in one trace database. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- amd/timing/TRACING.md documents the task lifecycle, the milestone table, the reset teardown, and the intentional gaps/limitations. - extract SchedulerImpl.markMemDrained and a buildTracedRDMA test helper to keep evalSEndPgm and the RDMA test under the funlen limit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A vector/scalar memory instruction's inst task ended when its last response returned, but nothing marked that return, so the task bar showed a long unexplained gap between the in-flight admission (or issue) and the task end — the time the request was out to memory. Emit a "data" milestone (vmem/smem) at each memory-response-return site (scalar load, vector load, vector store), on the instruction's task, the moment before it ends. The milestone lands at the task end, so the memory round trip is now attributed and the lifetime tiles with no trailing gap. Verified on a traced FIR run: all 136 VMem inst tasks now carry the data milestone and no mem inst task has a trailing gap > 1 cycle. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The vector-memory data milestone made the data wait appear to start at the in-flight admission, but the transactions are not sent until several cycles later (coalescing penalty + transaction pipeline). The interval between admission and first send — during which no request is outstanding — was mis-attributed to the data wait, so the data bar ran ahead of the child req_out subtask bars. Emit a "coalesce" milestone (hardware_resource) when the first transaction is sent. The vmem inst task now tiles as vmem-inflight -> coalesce -> data, and the data interval lines up exactly with the transaction subtasks. Verified on a traced FIR run: coalesce lands at the first subtask StartTime and data at the last subtask EndTime. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The coalescing / transaction-issue phase is the vector-memory unit doing work (coalescing the accesses and traversing its transaction pipeline), not blocking on a hardware resource. Reclassify the coalesce milestone from hardware_resource to work. Per the "work ⇒ subtask" coverage rule, a work milestone must be backed by a child subtask that shows what the work was. Open a "pipeline" subtask when the instruction's transactions are admitted and close it when the first one is sent; the work milestone lands at that close. The subtask is torn down on flush so it cannot leak. Verified on a traced FIR run: each vmem inst now tiles as vmem-inflight (hardware_resource) -> coalesce (work, with a pipeline subtask spanning it) -> data, 136 work milestones paired 1:1 with 136 pipeline subtasks, 0 unended. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2810015eac
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| // The wavefront's outstanding memory accesses have all returned, so | ||
| // S_ENDPGM can retire: mark the end of that drain wait. | ||
| s.markMemDrained(wf, "s_endpgm") |
There was a problem hiding this comment.
Make s_endpgm milestone idempotent
When this S_ENDPGM is the last wavefront in the work-group and the CU's ACE port is full, sendWGCompletionMessage returns false below and the wavefront stays in internalExecuting, so the next tick re-enters this path and emits another s_endpgm data milestone for the same instruction. In congested runs this records one milestone per retry and corrupts the blocked-time attribution for the instruction; emit the milestone only once, or only on a path that cannot return false and retry.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — fixed in 5917870. Moved the s_endpgm milestone emission out of the post-drain-check spot and into a single point after the work-group is disposed of (the branches are now an if/else-if), so the ACE-port-full retry returns before it and can't re-emit. As a bonus the milestone now lands at retirement (the inst-task end) with no trailing gap. Added a test that drives the ACE-full retry then the completing tick and asserts exactly one emission.
Addresses Codex review on #279. evalSEndPgm emitted the s_endpgm data milestone right after the outstanding-memory check, but when S_ENDPGM is the last wavefront and the ACE port is full, sendWGCompletionMessage returns false and the wavefront stays in internalExecuting, so the next tick re-entered the path and re-emitted the milestone — one per retry in congested runs. (The vis DBTracer dedups by (Kind,What), but non-deduping consumers, including the milestone-recording tests, would see duplicates.) Restructure the completion branches into an if/else-if and emit the milestone once, after the work-group is disposed of, on a path that can no longer return false and retry. As a bonus the milestone now lands at retirement (the inst-task end) with no trailing gap. New test drives the ACE-full retry then the completing tick and asserts exactly one emission. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5917870d17
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // S_ENDPGM has retired: mark the end of the drain wait exactly once, here, | ||
| // where the path can no longer return false and retry. An ACE-port-full | ||
| // retry returns above before reaching this point, so it does not re-emit. | ||
| s.markMemDrained(wf, "s_endpgm") |
There was a problem hiding this comment.
Emit s_endpgm before closing the inst task
When S_ENDPGM retires while other wavefronts in the work-group are still executing, this common milestone runs after the branch above has already called logInstTask(..., true) and closed the current instruction task; the barrier path has the same problem via passBarrier/setAllWfStateToReady. In those multi-wavefront cases the new s_endpgm data milestone is emitted after its inst task has ended, so the drain wait is dropped or attached outside the task instead of attributing the instruction lifetime. Move the milestone before any path can end the instruction task.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in ee715a6. Decide the retry up front now — compute allCompleted and, on the work-group-completing path, return early if the ACE port is full (the only path that can retry) before emitting — then emit the milestone once, before the if/else that closes the inst task. Keeps the single-emission idempotency from the prior fix. Added a test that drives the executing branch and asserts the s_endpgm milestone is recorded before the inst task's EndTask.
Addresses the follow-up Codex review on #279. The previous fix emitted the s_endpgm milestone after the if/else, but the barrier and executing branches close the instruction's inst task within the branch (via passBarrier -> setAllWfStateToReady and logInstTask), so the milestone landed on an already-ended task. Decide the retry up front: compute allCompleted, and on the work-group-completing path return early if the ACE port is full (the only path that can retry) — before emitting. Then emit the milestone once, before the if/else that ends the task. Keeps the idempotency from the prior fix and restores correct attribution within the inst lifetime. New test drives the executing branch and asserts the milestone is recorded before the inst task's EndTask. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ee715a6c76
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // data wait, which only begins once a request is actually outstanding — | ||
| // the subtask bar and the work milestone span exactly the interval | ||
| // before the req_out subtasks start. | ||
| u.endIssueSubtask(info.Inst) |
There was a problem hiding this comment.
Don't attribute VMem port backpressure to coalesce work
In runs where the vector memory port is back-pressured after a transaction has reached postTransactionPipelineBuffer, sendRequest breaks on the !CanSend() check above and leaves the issue subtask open; when the send finally succeeds this line emits the coalesce work milestone, so all of those port-full cycles are recorded as productive coalescing/issue work instead of a port/network wait. This skews the blocked-time attribution for VMem instructions under memory-network congestion.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in b1d8562. Moved the issue-subtask close (the coalesce work milestone) to when the transaction reaches the head of the send buffer — before the !CanSend() check — so the work interval ends when the coalescing/issue work is actually done. Port-backpressure cycles (waiting to send) are no longer counted as work. It stays idempotent via the map, so a port-full retry on the same transaction doesn't re-emit. Added a test that holds the port full (work still fires once, transaction unsent) then frees it (transaction sent, no second work milestone).
Addresses Codex review on #279. sendRequest closed the issue subtask and emitted the coalesce work milestone at the actual send, but the !CanSend break above means a port-back-pressured transaction sits in postTransactionPipelineBuffer with the subtask still open — so those port-full cycles were recorded as coalescing/issue work instead of a wait, skewing VMem attribution under memory-network congestion. Close the issue subtask (emit the coalesce work milestone) when the transaction reaches the head of the send buffer, before the CanSend check — the issue work is done at that point. It stays idempotent via the map, so a port-full retry on the same transaction does not re-emit; the work interval now ends when the work is done, not when the port frees. New test holds the port full, asserts the work milestone still fires once (transaction unsent), then frees the port and asserts the transaction is sent with no second work milestone. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b1d85621bf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // ID — would never be finalized. End them here. EndTaskOnReset is a no-op | ||
| // for an access that was buffered but never sent (no req_out was opened). | ||
| for i := 0; i < len(cu.InFlightInstFetch); i++ { | ||
| tracing.EndTaskOnReset(cu.comp, cu.InFlightInstFetch[i].Req.ID) |
There was a problem hiding this comment.
Re-open req_out tasks when shadow requests are resent
In cache/shootdown flushes that hit an in-flight CU fetch/scalar/vector memory access, this closes the old req_out task, but the shadow resend paths (send*ShadowBufferAccesses) assign a fresh message ID and send the request without calling TraceReqInitiate; the eventual TraceReqFinalize uses that fresh ID, so there is no matching req_out task and the post-flush memory round trip disappears from the trace. Please start a replacement req_out task when the shadow request is actually re-sent, or keep the original ID/task paired with the final response.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 6cad3a8. The shadow resend now re-initiates a replacement req_out task on each path (send{Scalar,Vector,InstFetch}ShadowBufferAccesses) for the freshly-generated message ID, parented to the still-open inst task (the fetch task for fetches). The original task is still ended at flush so there's no leak, and the resend's task is what TraceReqFinalize closes — the post-flush round trip is traced again. Added a test asserting the resend opens a req_out for the new ID; the existing restart test now sets Inst on the shadow mem accesses (always present in production).
Addresses Codex review on #279. populateShadowBuffers ends an in-flight mem access's req_out task at flush, but the shadow resend assigned a fresh message ID and sent without TraceReqInitiate, so the post-flush round trip had no req_out task and TraceReqFinalize matched nothing — the memory round trip disappeared from the trace. Re-initiate a replacement req_out on each resend path (send{Scalar,Vector,InstFetch}ShadowBufferAccesses), parented to the still-open inst task (fetch: the fetch task). The old task is still ended at flush, so there is no leak; the resend gets its own task that TraceReqFinalize closes. Tests: new TestShadowResendOpensReplacementReqOut asserts the resend opens a req_out for the fresh ID; the existing restart test now sets Inst on the shadow mem accesses (always present in production), which the resend parents on. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6cad3a8155
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // End any open issue subtasks so they do not leak as started-never-ended | ||
| // (their parent inst tasks are ended by ComputeUnit.endInflightTracingTasks). | ||
| for _, id := range u.issueTaskIDs { | ||
| tracing.EndTaskOnReset(u.cu.comp, id) |
There was a problem hiding this comment.
Preserve coalesce milestones across vector flush
When a flush happens after executeFlatLoad/executeFlatStore has admitted the transactions but before the first one reaches sendRequest, the wavefront has already been set back to WfReady, so ComputeUnit.endInflightTracingTasks leaves the parent memory-inst task open for the shadow response. This reset-ends and forgets the open coalesce subtask, while the shadow resend path sends the request directly without calling endIssueSubtask, so flushed vector-memory instructions lose their coalesce work milestone and the remaining vmem-inflight→vmem interval is attributed as data wait instead of coalescing/issue work plus the post-flush round trip.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in d207a8b. VectorMemoryUnit.Flush now emits the coalesce work milestone (and ends the subtask) for each still-open issue subtask, so an instruction flushed between admission and its first send keeps its coalesce work attribution instead of folding the vmem-inflight→vmem interval into the data wait. The parent inst task is still open at that point (WfReady), so the milestone lands on it correctly. New test opens an issue subtask, flushes, and asserts the work milestone is recorded and the map cleared.
Addresses Codex review on #279. When a flush hits a vector-memory instruction after executeFlatLoad/Store admitted its transactions but before the first reaches sendRequest, the wavefront is already WfReady so endInflightTracingTasks leaves the parent inst task open for the shadow response — but VectorMemoryUnit.Flush only EndTaskOnReset'd the issue subtask without emitting the coalesce work milestone, and the shadow resend bypasses the unit (never calls endIssueSubtask). The instruction thus lost its coalesce work milestone and the vmem-inflight->vmem interval folded into the data wait. Flush now emits the coalesce work milestone (and ends the subtask) for each still-open issue subtask, so the issue work is attributed even when a flush interrupts it. New test opens an issue subtask, flushes, and asserts the work milestone is recorded and the subtask cleared. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d207a8b5c6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| tracing.AddMilestone(m.comp, tracing.Milestone{ | ||
| TaskID: trans.RecvTaskID, | ||
| Kind: tracing.MilestoneKindData, | ||
| What: "remote", | ||
| }) |
There was a problem hiding this comment.
End remote wait before egress-port backpressure
Because endTransaction is only reached after processIncomingRsp/processFromL2 have already passed their egress-port CanSend() checks, a returned RDMA response sitting behind a full RDMARequestInside or RDMADataOutside port does not emit this remote data milestone until that port frees. In RDMA egress congestion this charges output-port backpressure to the remote/data interval instead of ending the remote round trip when the response arrives; emit the milestone before the egress wait, with a guard so retries do not duplicate it.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in f566ea5. The remote milestone now fires when the response first arrives (right after it's matched to its transaction), before the egress-port CanSend check, so egress backpressure isn't charged to the remote round-trip interval. Guarded by a new transaction.RemoteMilestoneEmitted flag so egress-full retries don't duplicate it; endTransaction no longer emits it. New test saturates the egress port and asserts the milestone fires exactly once, at arrival.
…iew) Addresses Codex review on #279. The "remote" data milestone was emitted in endTransaction, which runs only after processIncomingRsp/processFromL2 pass their egress-port CanSend check — so a response stuck behind a full RDMARequestInside/RDMADataOutside port did not end the remote round-trip interval until the port drained, charging egress backpressure to the remote/data interval. Emit the milestone when the response first arrives (right after matching it to its transaction), before the egress wait, guarded by a new transaction.RemoteMilestoneEmitted flag so egress-full retries do not duplicate it. New test saturates the egress port and asserts the milestone fires exactly once, at arrival. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…iew) Addresses Codex review on #279. The "remote" data milestone was emitted in endTransaction, which runs only after processIncomingRsp/processFromL2 pass their egress-port CanSend check — so a response stuck behind a full RDMARequestInside/RDMADataOutside port did not end the remote round-trip interval until the port drained, charging egress backpressure to the remote/data interval. Emit the milestone when the response first arrives (right after matching it to its transaction), before the egress wait, guarded by a new transaction.RemoteMilestoneEmitted flag so egress-full retries do not duplicate it. New test saturates the egress port and asserts the milestone fires exactly once, at arrival. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
f566ea5 to
094af95
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 094af95335
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| wf.OutstandingScalarMemAccess-- | ||
| } | ||
|
|
||
| cu.markInstDataReturned(info.Inst, "vmem") |
There was a problem hiding this comment.
Wait for actual last coalesced response
When coalesced memory responses arrive out of order (a case the existing CPI stack tracer already accounts for), CanWaitForCoalesce == false only marks the last request generated, not the last response received. Emitting the new vmem data milestone here can close the data-wait interval before earlier sibling transactions for the same instruction have returned, so those late responses are omitted or misattributed; gate the milestone on there being no remaining in-flight accesses for info.Inst instead.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in f7720b7. The vmem/smem data milestone and the inst-task end are now gated on there being no remaining in-flight access for the instruction (hasInFlight{Vector,Scalar}MemFor), so out-of-order coalesced responses no longer close the data wait early. The outstanding-count decrement stays on the last-generated response (the S_WAITCNT accounting is unchanged). Applied to all three mem-return handlers. New test returns the last-generated transaction first and asserts the data milestone only fires once the sibling also returns.
…ne (review) Addresses Codex review on #279. Coalesced memory responses can return out of order, so CanWaitForCoalesce==false (isLastRead) marks the last request *generated*, not the last *received*. Emitting the vmem/smem data milestone and ending the inst task there could close the data wait while earlier sibling transactions for the same instruction were still in flight, dropping or misattributing those late responses. Keep the outstanding-count decrement on the last-generated response (the S_WAITCNT accounting), but gate the data milestone and logInstTask on there being no remaining in-flight access for the instruction (hasInFlight{Vector,Scalar}MemFor), across all three mem-return handlers. New test returns the last-generated transaction first and asserts the data milestone only fires once its sibling also returns. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f7720b7296
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // hasInFlightVectorMemFor reports whether any vector-memory transaction for the | ||
| // given instruction is still in flight. | ||
| func (cu *ComputeUnit) hasInFlightVectorMemFor(inst *wavefront.Inst) bool { | ||
| for _, info := range cu.InFlightVectorMemAccess { |
There was a problem hiding this comment.
Include shadow-buffered siblings in last-response checks
Fresh issue beyond the earlier out-of-order response case: during CU shadow resend after a flush, siblings that have not been re-sent yet live in shadowInFlightVectorMemAccess, not InFlightVectorMemAccess. Because processInput still accepts memory responses while IsSendingOutShadowBufferReqs is true and sendVectorShadowBufferAccesses only sends one shadow entry per tick, a fast response for the first resent transaction can make this helper return false, causing the vmem milestone and parent inst task to close before later coalesced siblings are even re-sent; the scalar helper below has the same shadow-buffer gap.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 4404b52. hasInFlight{Vector,Scalar}MemFor now also scan shadowInFlight{Vector,Scalar}MemAccess, so the data milestone and inst-task end wait for siblings parked in the shadow buffer during a flush resend — not just those in InFlight*. A fast response to the first resent transaction can no longer close the interval before the others are re-sent. New test parks a sibling in shadowInFlightVectorMemAccess and asserts the resent transaction's response doesn't fire the milestone.
…ne (review) Addresses Codex review on #279. Coalesced memory responses can return out of order, so CanWaitForCoalesce==false (isLastRead) marks the last request *generated*, not the last *received*. Emitting the vmem/smem data milestone and ending the inst task there could close the data wait while earlier sibling transactions for the same instruction were still in flight, dropping or misattributing those late responses. Keep the outstanding-count decrement on the last-generated response (the S_WAITCNT accounting), but gate the data milestone and logInstTask on there being no remaining in-flight access for the instruction (hasInFlight{Vector,Scalar}MemFor), across all three mem-return handlers. New test returns the last-generated transaction first and asserts the data milestone only fires once its sibling also returns. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
f7720b7 to
8995e85
Compare
Addresses Codex review on #279. hasInFlight{Vector,Scalar}MemFor scanned only InFlight{Vector,Scalar}MemAccess, but during a flush resend the coalesced siblings not yet re-sent live in shadowInFlight*. Since input is still processed while IsSendingOutShadowBufferReqs is true and the resend sends one shadow entry per tick, a fast response to the first resent transaction could make the helper return false and close the vmem data milestone and parent inst task before the other siblings are even re-sent. Scan the shadow buffers too, so the milestone and inst-task end wait for every sibling — in-flight or shadow-parked. New test parks a sibling in shadowInFlightVectorMemAccess and asserts the response to the resent transaction does not fire the milestone. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4404b527e3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| gcn3_emu_test: | ||
| name: GCN3 Emulation Test | ||
| runs-on: self-hosted | ||
| runs-on: Github-Large-1 |
There was a problem hiding this comment.
Use a live runner for acceptance jobs
In the acceptance-test section I checked, the GCN3/CDNA3 emulation and timing jobs now target Github-Large-1; the repo notes that this label is offline and previously left CI runs queued indefinitely (roadmap.md lines 199-210), while spec.md line 31 directs non-trivial jobs to the self-hosted Marin runners. On PRs this makes gcn3_emu_test sit queued and prevents the downstream timing/CDNA3 jobs from running, so the workflow no longer provides those validations.
Useful? React with 👍 / 👎.
Brings MGPUSim's own timing components up to the akita v5 task-tracing
convention. The reused
mem/hierarchy was already well-instrumented insideakita, but MGPUSim's compute/control components (CU, CP, DMA, dispatcher, RDMA)
had zero milestones/tags and several tracing-task leaks. This PR fixes
the leaks and adds blocked-time attribution where a task spans the wait.
Step 0 — bump akita to v5.0.0-beta.7
Zero source changes required. beta.7 adds the reset helpers
(
EndReqInOnReset/EndTaskOnReset), the buffer-admission registry(
MsgIDAtIncomingBuffer),MilestoneKindWork, auto-attached incoming+outgoingbuffer port tracing, and
tracingtest.LeakRecorder.Step 1 — stop tracing-task leaks (correctness)
endInflightTracingTasksends the inst task of every running wavefront(covering all sub-units + the scheduler's
internalExecuting),SIMDUnit.Flushends the per-SIMD
pipelinetasks,populateShadowBuffersends theshadow-resend
req_outoriginals, and the sampled-run path ends allwavefront tasks in a work-group (not just the one whose completion event fired).
FlushReqreq_inwas opened but never completed — now completed on theflush-done response.
Step 2 — blocked-time milestones
Admission waits hang on the incoming-buffer task; processing waits on the
req_in / inst task.
processing slot/ToMemdispatcher/ToDMAremote(round trip)vmem-inflight→coalesce→vmem/smems_waitcnt/s_endpgmA vector-memory instruction now tiles cleanly into vmem-inflight
(in-flight slot,
hardware_resource) → coalesce (the coalescing /transaction-issue work,
work, backed by apipelinechild subtask) →data (the memory round trip, aligned with the
req_outtransaction bars).Step 5 — documentation, not change
Two flagged "cross-domain parentage" issues are correct by design given the
shared vis DBTracer (SIMD
pipelinetasks must stay on the SIMD-unit domain forthe per-SIMD busy tracer; the dispatcher's
MapWGReqparent resolves in the onetrace DB). Documented at each site so a future audit doesn't "fix" and regress.
Not in scope (per request): tags, and full control-plane instrumentation beyond
the FlushReq leak. CU pre-issue stalls (
CanAcceptWave,!CanSend) are leftuninstrumented — those waits precede the instruction's task.
amd/timing/TRACING.mddocuments the lifecycle, milestone table, and theintentional gaps.
Verification
go build ./...,go vet ./...clean; golangci-lint 0 issues on touched packages.amd/timing/...+amd/driver/...tests pass, including newLeakRecorderleak tests and milestone-recording tests per component.inst tasks carry the full vmem-inflight→coalesce→data tiling with 0 trailing
gaps and 136
pipelinesubtasks paired 1:1 with the work milestones (0 leaked).🤖 Generated with Claude Code