Skip to content

fix(server): include nodes and creds in request dedupe#350

Open
fallintoplace wants to merge 1 commit into
NVIDIA:mainfrom
fallintoplace:fix/request-dedupe-key
Open

fix(server): include nodes and creds in request dedupe#350
fallintoplace wants to merge 1 commit into
NVIDIA:mainfrom
fallintoplace:fix/request-dedupe-key

Conversation

@fallintoplace

Copy link
Copy Markdown
Contributor

Description

Fixes request aggregation collisions where requests with the same provider and engine params but different explicit node mappings or credentials could share the same request ID.

The request hash now includes canonicalized node mappings and an HMAC digest of the effective credential map. The server resolves configured credentials before enqueue and lookup, so config-backed requests use the same key for both endpoints. TrailingDelayQueue.Get also returns a snapshot of cached completion state so polling cannot race with completion updates.

Validation:

  • make qualify LINTER_BIN=/Users/hoangvu/go/bin/golangci-lint

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • All commits are signed off per DCO (git commit -s).

Signed-off-by: Minh Vu <vuhoangminh97@gmail.com>
@fallintoplace fallintoplace requested a review from dmitsh as a code owner June 13, 2026 15:42
@copy-pr-bot

copy-pr-bot Bot commented Jun 13, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes request-deduplication collisions in the TrailingDelayQueue where requests sharing the same provider/engine parameters but carrying different explicit node mappings or credentials were incorrectly collapsed to the same queue entry. It also eliminates a polling race in TrailingDelayQueue.Get.

  • Request.Hash() now includes a per-process HMAC digest of the effective credential map and a canonically sorted representation of the node-instance mappings, so two logically distinct requests always produce distinct hashes.
  • readRequest in http_server.go now resolves the effective credentials (payload or config fallback) before hashing, ensuring /v1/generate and /v1/lookup use the same key for the same logical request.
  • TrailingDelayQueue.Get returns a shallow struct copy of *Completion instead of the live pointer, preventing a data race where a caller could observe a partially-updated Status/Message while the timer goroutine holds the mutex and writes.

Confidence Score: 4/5

The change is safe to merge. The dedup logic is now correct and the race fix in Get is sound.

The core fix is correct: credentials and node mappings are now included in the request hash, and the credential digest is deterministic (Go's json.Marshal sorts map keys). The Completion copy in Get properly eliminates the concurrent-read race. Two minor concerns: the sync.Once pattern permanently caches a key-generation error with no recovery path, and the deterministic-marshal contract for credential values is implicit. Neither affects correctness under normal operating conditions but could surface in edge cases.

pkg/topology/request.go — the lazy sync.Once initialization of the HMAC key and the implicit json.Marshal determinism assumption are worth a second look.

Important Files Changed

Filename Overview
pkg/topology/request.go Adds HMAC-based credential digest and canonicalized node sort to Hash(). The credential key is process-ephemeral (random, via sync.Once), which is appropriate for an in-memory queue. Key concern: sync.Once permanently caches a key-generation error, making all credentialed Hash() calls fail for the process lifetime if crypto/rand fails.
pkg/server/http_server.go Moves checkCredentials call into readRequest, ensuring credentials are resolved to their effective value before Hash() is called in both the /v1/generate and /v1/lookup paths. The existing call in processTopologyRequest remains and is idempotent. Change is correct and safe.
pkg/server/trailing_delay_queue.go Get now returns a shallow struct copy of *Completion instead of the original pointer. This correctly eliminates the data race where a caller could read Status/Message outside the mutex while the timer goroutine writes them. Ret (a []byte) is write-once after creation, so the shallow copy is sufficient.
pkg/topology/request_test.go Good coverage: verifies canonical node ordering, credential digest distinction/equivalence, and handles nil/empty node edge cases. Tests rely on the package-level sync.Once HMAC key being shared within a single test process, which is correct.
pkg/server/http_server_test.go Adds TestReadRequestAppliesEffectiveCredentials with two cases (config-backed creds and payload-provided creds). Uses the existing pattern of assigning the package-level srv variable directly, which works for serial test execution.
pkg/server/trailing_delay_queue_test.go TestVaryingPayloadByNodesAndCredentials verifies all three request pairs hash distinctly and all three completions eventually resolve correctly. Test correctly covers the collision scenarios described in the PR.

Sequence Diagram

sequenceDiagram
    participant C as Client
    participant S as HTTP Server
    participant Q as TrailingDelayQueue
    participant W as Timer Worker

    Note over S: readRequest resolves effective creds<br/>(payload creds or config fallback)

    C->>S: POST /v1/generate (with or without creds/nodes)
    S->>S: checkCredentials(payload, config)
    S->>Q: Submit(request) → Hash(name+params+HMAC(creds)+canonicalNodes)
    Q-->>S: uid (hash)
    S-->>C: 202 Accepted + uid

    Note over Q,W: Dedup: requests with same logical identity share a queue entry

    W->>W: processTopologyRequest
    W->>Q: "update Completion{Status:200, Ret:data}"

    C->>S: POST /v1/lookup (same effective request)
    S->>S: checkCredentials(payload, config)
    S->>S: request.Hash() → same uid
    S->>Q: Get(uid) → snapshot copy of Completion
    Q-->>S: Completion copy (no race on Status/Ret)
    S-->>C: 200 OK + topology data
Loading

Reviews (1): Last reviewed commit: "fix(server): include nodes and creds in ..." | Re-trigger Greptile

Comment thread pkg/topology/request.go
Comment on lines +265 to +273
func getCredentialHashKey() ([]byte, error) {
credentialHashKeyOnce.Do(func() {
credentialHashKey, credentialHashKeyErr = newCredentialHashKey()
})
if credentialHashKeyErr != nil {
return nil, credentialHashKeyErr
}
return credentialHashKey, nil
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Permanent error caching via sync.Once

If newCredentialHashKey() returns an error (e.g., crypto/rand.Read fails — possible on constrained or misconfigured systems), credentialHashKeyErr is cached permanently by sync.Once. Every subsequent call to getCredentialHashKey() returns the same error, causing every Hash() call on a request with non-empty credentials to fail. This makes the server permanently unable to process credentialed topology requests until it is restarted, with no recovery path short of a restart.

A more defensive pattern is to initialize the key eagerly during server startup (e.g., in an init() or constructor) and fail fast, rather than caching a fatal error silently in a lazy initializer.

Comment thread pkg/topology/request.go
Comment on lines +245 to +253
func getCredentialDigest(creds map[string]any) (string, error) {
if len(creds) == 0 {
return "", nil
}

data, err := json.Marshal(creds)
if err != nil {
return "", fmt.Errorf("failed to marshal credentials for hashing: %v", err)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 json.Marshal key ordering is implicit contract for digest determinism

getCredentialDigest relies on json.Marshal sorting map[string]any keys alphabetically to produce a deterministic byte string for the HMAC input. While this behaviour is stable and documented in encoding/json, it is an implicit contract with no code comment. If a credential value is ever a type that does not marshal deterministically (e.g., a map[interface{}]interface{}, a struct with unexported fields, or a type with a non-deterministic MarshalJSON implementation), two logically identical credentials could produce different digests and thus different request hashes, silently breaking deduplication. A brief comment noting the reliance on sorted key marshaling would help future maintainers.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@NVIDIA NVIDIA deleted a comment from copy-pr-bot Bot Jun 15, 2026
@dmitsh

dmitsh commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

/ok-to-test 4361989

@dmitsh

dmitsh commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Hi @fallintoplace ,
Thank you for your contribution.
I understand the rationale for including credentials and the node set in the request hash. However, we didn't identify a use case that requires this, and for the sake of simplicity, we decided not to include them.
Do you have a specific use case where this is actually required?
In the meantime, could you please submit a separate PR with your changes to pkg/server/trailing_delay_queue.go ?
Those changes fix a race condition and would be valuable independently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants