Skip to content

fix(server): snapshot queue completion results#351

Merged
dmitsh merged 1 commit into
NVIDIA:mainfrom
fallintoplace:fix/trailing-delay-queue-race
Jun 15, 2026
Merged

fix(server): snapshot queue completion results#351
dmitsh merged 1 commit into
NVIDIA:mainfrom
fallintoplace:fix/trailing-delay-queue-race

Conversation

@fallintoplace

Copy link
Copy Markdown
Contributor

Description

Split out from #350 at maintainer request.

TrailingDelayQueue.Get returned the live *Completion stored in the queue cache. The queue lock protected the lookup itself, but callers read the returned pointer after the lock was released while the timer callback could still update the same Completion. That showed up as a race between polling /v1/topology and request completion.

This PR returns a shallow snapshot of the cached completion instead. It also adds a focused regression test that proves an accepted-result snapshot does not mutate after the worker completes.

Validation:

  • go test -race ./pkg/server
  • make qualify LINTER_BIN=/Users/hoangvu/go/bin/golangci-lint

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • All commits are signed off per DCO (git commit -s).

Signed-off-by: Minh Vu <vuhoangminh97@gmail.com>
@fallintoplace fallintoplace requested a review from dmitsh as a code owner June 15, 2026 17:42
@copy-pr-bot

copy-pr-bot Bot commented Jun 15, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a data race in TrailingDelayQueue.Get: the method was returning the live *Completion pointer stored in the cache, so callers could read Status, Message, or Ret after the lock was released while the timer goroutine was simultaneously writing those same fields. The fix performs a struct-value copy under the lock and returns a pointer to the copy.

  • trailing_delay_queue.go: *(res.(*Completion)) copies the struct value while the mutex is still held, so the returned pointer is an independent snapshot — concurrent writes to the original entry no longer affect it.
  • trailing_delay_queue_test.go: TestGetReturnsCompletionSnapshot captures a snapshot while the worker is blocked, unblocks it, waits for the queue entry to transition to 200 OK, then asserts the original snapshot still reads 202 Accepted — directly reproducing the race scenario.

Confidence Score: 5/5

Safe to merge — the change is a minimal, well-targeted struct copy that closes a specific data race without altering any observable contract.

The fix is two lines under an existing lock, leaving all other behavior unchanged. The shallow copy is sufficient because the timer callback replaces entry.Ret with a new assignment rather than mutating the value it points to. The regression test directly reproduces the race scenario and confirms the snapshot is immutable after the worker completes.

No files require special attention.

Important Files Changed

Filename Overview
pkg/server/trailing_delay_queue.go Two-line fix: dereferences the cached *Completion before returning so callers receive a value-copy snapshot rather than the live pointer, eliminating the post-lock data race on Status/Message/Ret fields.
pkg/server/trailing_delay_queue_test.go Adds TestGetReturnsCompletionSnapshot: blocks the worker mid-flight, captures a pending snapshot, unblocks the worker, then asserts the snapshot is unchanged — a focused regression test for the race.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant Get
    participant Mutex
    participant Cache
    participant TimerCallback

    Note over Caller,TimerCallback: Before fix — race window
    Caller->>Get: Get(hash)
    Get->>Mutex: Lock
    Get->>Cache: "lookup *Completion"
    Cache-->>Get: live pointer p
    Get->>Mutex: Unlock
    Note over Get,TimerCallback: lock released, p still shared
    TimerCallback->>Mutex: Lock (writes p.Status, p.Ret)
    Caller->>Get: read p.Status ← RACE

    Note over Caller,TimerCallback: After fix — snapshot returned
    Caller->>Get: Get(hash)
    Get->>Mutex: Lock
    Get->>Cache: "lookup *Completion"
    Cache-->>Get: live pointer p
    Get->>Get: "copy = *p (struct copy under lock)"
    Get->>Mutex: Unlock
    Get-->>Caller: "&copy (independent snapshot)"
    TimerCallback->>Mutex: Lock (writes p.Status, p.Ret)
    Note over Caller: copy.Status unaffected ✓
Loading

Reviews (1): Last reviewed commit: "fix(server): snapshot queue completion r..." | Re-trigger Greptile

@dmitsh

dmitsh commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

/ok-to-test 765115e

@dmitsh

dmitsh commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

@fallintoplace , thank you for your contribution!

@dmitsh dmitsh merged commit 9e6c8eb into NVIDIA:main Jun 15, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants