feat: Empty Block Complementing #343

Open: ravisoundar wants to merge 3 commits into main from rs-complement

@ravisoundar

Collaborator

Description

Empty Block Complementing for Slurm Block Topology.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • All commits are signed off per DCO (git commit -s).

@ravisoundar ravisoundar requested a review from dmitsh as a code owner June 4, 2026 06:00
@copy-pr-bot

copy-pr-bot Bot commented Jun 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jun 4, 2026

Contributor

Greptile Summary

This PR adds Empty Block Complementing for Slurm block topology: when a partition's accelerator domain set is structurally incomplete (e.g., an NVLink domain is missing because a node is down), the complement algorithm pads the block list with empty placeholder entries so that Slurm sees the full power-of-two-aligned block structure it expects. It also refactors DomainMap to carry rich HostInfo structs (adding Domain, InstanceID, and HostName fields), which enables correct per-partition host filtering.

  • New complement pipeline (block_complement.go, block_tree.go): domainsForBlocks scopes the global domain map to partition-local hosts, buildBlockTree packs those domains into a padded tree shaped by BlockSizes, and collectBaseBlockSlots flattens it back into a []*blockInfo list with empty placeholder slots where needed. Both the Slurm flat-block path (toBlockTopology) and the per-partition YAML path (getBlockTopologyUnit) call this pipeline.
  • nodeInfo.blockID refresh (block.go): after complement renumbers blocks, toBlockTopology updates all nodeInfo.blockID entries so GetNodeTopologySpec returns IDs that match the emitted file.
  • validateBlockSizes (slurm.go): enforces that consecutive BlockSizes entries differ by a power-of-two factor, and changes GetTranslateConfig to return *httperr.Error for richer HTTP status propagation upstream.
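
The power-of-two rule in the last bullet can be illustrated with a small Go sketch. This is not the PR's validateBlockSizes; `checkBlockSizes` and `isPowerOfTwo` are hypothetical names, and the sketch assumes that sizes must be positive and strictly increasing, which the summary does not state.

```go
package main

import (
	"fmt"
	"math/bits"
)

// isPowerOfTwo reports whether n is a positive power of two (1, 2, 4, ...).
func isPowerOfTwo(n int) bool {
	return n > 0 && bits.OnesCount(uint(n)) == 1
}

// checkBlockSizes is a hypothetical stand-in for validateBlockSizes.
// Assumption: every size is positive, sizes strictly increase, and each
// size is the previous one multiplied by a power of two.
func checkBlockSizes(sizes []int) error {
	for i, s := range sizes {
		if s <= 0 {
			return fmt.Errorf("block size %d at index %d must be positive", s, i)
		}
		if i == 0 {
			continue
		}
		prev := sizes[i-1]
		if s <= prev || s%prev != 0 || !isPowerOfTwo(s/prev) {
			return fmt.Errorf("block size %d is not a power-of-two multiple of %d", s, prev)
		}
	}
	return nil
}

func main() {
	// 8/4 = 2 and 32/8 = 4 are both powers of two, so this list passes.
	fmt.Println(checkBlockSizes([]int{4, 8, 32}))
	// 12/4 = 3 is not a power of two, so this list is rejected.
	fmt.Println(checkBlockSizes([]int{4, 12}))
}
```

A check of this shape rejects a list such as 4,12, which is the kind of existing configuration the file overview flags as potentially affected.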

Confidence Score: 5/5

Safe to merge; the complement pipeline is well-tested across both the Slurm flat-block and per-partition YAML paths, and pre-existing issues identified in earlier rounds are all addressed.

No defects were found in the current code paths. The tree-building logic relies on an implicit uniformity invariant that holds by construction, and the new validation constraint is a deliberate design choice for the algorithm.

Files worth a closer look: pkg/translate/block_tree.go (capacity uniformity assumption in buildBlockTree) and pkg/engines/slurm/slurm.go (new power-of-two validation constraint).

Important Files Changed

| Filename | Overview |
| --- | --- |
| pkg/translate/block_complement.go | New file implementing the core complement logic: domainsForBlocks correctly scopes to partition-local hosts, complementBlocks drives the tree build, and blocksByName provides the byName index. Cross-partition contamination and the missing-domain cases are handled and tested. |
| pkg/translate/block_tree.go | New file implementing the padded block tree. Core algorithm is sound: packDomainNodes guarantees uniform domain capacity, packAggregateNodes builds the top tier, collectBaseBlockSlots flattens via DFS. The uniformity invariant used in buildBlockTree at line 237 is implicit and undocumented. |
| pkg/translate/block.go | Adds complementBlocks call and nodeInfo.blockID refresh loop so GetNodeTopologySpec returns IDs that match the emitted topology after complement renumbers blocks. Logic is correct and tested. |
| pkg/translate/yaml.go | Per-partition path now calls complementBlocks before building the parents map, adds domain name propagation into the initial blockMap, and skips Nodes= on empty blocks. |
| pkg/translate/topology.go | Stores graph.Domains on NetworkTopology for complement, adds nil guard around hostInfo lookup, and threads hostInfo.InstanceID correctly. |
| pkg/engines/slurm/slurm.go | Adds validateBlockSizes and changes GetTranslateConfig to return *httperr.Error. New power-of-two ratio constraint may break users with existing non-power-of-2 BlockSizes configs. |
| pkg/topology/domain.go | Replaces map[string]string value type with *HostInfo struct. AddHostInfo provides structured population; all prior call sites updated. |
| pkg/engines/slinky/engine.go | Adapts to GetTranslateConfig returning *httperr.Error, removing the redundant double-wrapping. |
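
To make the per-partition host filtering concrete, here is a rough Go sketch of scoping a domain map to one partition's hosts. Only the HostInfo field names (Domain, InstanceID, HostName) come from the summary. The map layout (domain name to host name to *HostInfo), the `addHost` helper, and `filterToPartition` are assumptions for illustration and are not the PR's DomainMap, AddHostInfo, or domainsForBlocks.

```go
package main

import (
	"fmt"
	"sort"
)

// HostInfo carries the three fields named in the summary; the real struct
// may hold more.
type HostInfo struct {
	Domain     string
	InstanceID string
	HostName   string
}

// DomainMap is ASSUMED here to map domain name -> host name -> host info.
type DomainMap map[string]map[string]*HostInfo

// addHost is a hypothetical helper that records one host under its domain.
func (m DomainMap) addHost(domain, instanceID, hostName string) {
	if m[domain] == nil {
		m[domain] = make(map[string]*HostInfo)
	}
	m[domain][hostName] = &HostInfo{Domain: domain, InstanceID: instanceID, HostName: hostName}
}

// filterToPartition is a hypothetical stand-in for domainsForBlocks: it keeps
// only hosts that belong to the partition and omits domains left with none.
func filterToPartition(all DomainMap, partitionHosts map[string]bool) DomainMap {
	out := make(DomainMap)
	for domain, hosts := range all {
		for name, info := range hosts {
			if !partitionHosts[name] {
				continue
			}
			if out[domain] == nil {
				out[domain] = make(map[string]*HostInfo)
			}
			out[domain][name] = info
		}
	}
	return out
}

func main() {
	all := make(DomainMap)
	all.addHost("nvl1", "i-001", "node1")
	all.addHost("nvl1", "i-002", "node2")
	all.addHost("nvl2", "i-003", "node3")

	// The partition contains node1 and node2 only, so nvl2 is filtered out.
	local := filterToPartition(all, map[string]bool{"node1": true, "node2": true})

	names := make([]string, 0, len(local))
	for d := range local {
		names = append(names, d)
	}
	sort.Strings(names)
	fmt.Println(names, len(local["nvl1"]))
}
```

Filtering before the tree build is what keeps hosts from another partition out of the result, the cross-partition contamination case mentioned for block_complement.go.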

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[toBlockTopology / getBlockTopologyUnit] --> B[complementBlocks]
    B --> C{len blockSizes >= 1 AND domains != nil?}
    C -- No --> D[return original blocks unchanged]
    C -- Yes --> E[domainsForBlocks: filter to partition-local hosts]
    E --> F{domains empty?}
    F -- Yes --> D
    F -- No --> G[blocksByName index]
    G --> H[buildBlockTree]
    H --> I[groupSizeFromDomains: compute 2^n group size]
    I --> J[packDomainNodes: pad each domain to groupSize base blocks]
    J --> K{remaining blockSizes above domCapacity?}
    K -- No --> L[flat aggregateBlockNode]
    K -- Yes --> M[packAggregateNodes: build top-tier with empty padding]
    M --> L
    L --> N[collectBaseBlockSlots DFS]
    N --> O[baseBlockToBlockInfo for each slot]
    O --> P[complemented blockInfo list with empty placeholder slots]
    P --> Q{toBlockTopology path?}
    Q -- Yes --> R[refresh nodeInfo.blockID for GetNodeTopologySpec]
    Q -- No --> S[build parents map for YAML TopologyUnit]
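
The padding steps in the flowchart (groupSizeFromDomains, packDomainNodes, collectBaseBlockSlots) can be approximated by the Go sketch below. It assumes the group size is the smallest power of two that fits the largest domain and that an empty string stands for a placeholder block; the actual rules in block_tree.go may differ, and all function names here are hypothetical.

```go
package main

import "fmt"

// nextPowerOfTwo returns the smallest power of two >= n (1 for n <= 1).
func nextPowerOfTwo(n int) int {
	p := 1
	for p < n {
		p <<= 1
	}
	return p
}

// padDomains pads every domain's block list to a common power-of-two length.
// ASSUMPTION: the group size is derived from the largest domain; an empty
// string marks an empty placeholder slot.
func padDomains(domains [][]string) (int, [][]string) {
	maxLen := 0
	for _, d := range domains {
		if len(d) > maxLen {
			maxLen = len(d)
		}
	}
	groupSize := nextPowerOfTwo(maxLen)
	out := make([][]string, 0, len(domains))
	for _, d := range domains {
		padded := make([]string, groupSize) // zero value "" is the placeholder
		copy(padded, d)
		out = append(out, padded)
	}
	return groupSize, out
}

// flatten concatenates the padded domains in order, loosely analogous to
// walking the tree and collecting base block slots.
func flatten(padded [][]string) []string {
	var slots []string
	for _, d := range padded {
		slots = append(slots, d...)
	}
	return slots
}

func main() {
	size, padded := padDomains([][]string{{"b1", "b2", "b3"}, {"b4"}})
	fmt.Println(size)
	fmt.Printf("%q\n", flatten(padded))
}
```

With one domain of three blocks and another of one block, both are padded to four slots, so the flattened list has eight entries, four of them placeholders.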

Reviews (28): Last reviewed commit: "adding more checks"

Comment thread pkg/translate/yaml.go
Comment thread pkg/translate/block_complement.go Outdated
Comment thread pkg/translate/topology.go Outdated
@ravisoundar ravisoundar force-pushed the rs-complement branch 3 times, most recently from 91ad1ef to c122256 Compare June 6, 2026 20:06
Comment thread pkg/translate/block_complement.go
@ravisoundar ravisoundar force-pushed the rs-complement branch 4 times, most recently from f048cb8 to eb4d16a Compare June 10, 2026 22:01
Comment thread pkg/translate/block.go
@ravisoundar ravisoundar force-pushed the rs-complement branch 4 times, most recently from dba4934 to 7f30438 Compare June 10, 2026 23:44
@codecov

codecov Bot commented Jun 11, 2026

Codecov Report

❌ Patch coverage is 80.19481% with 61 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.09%. Comparing base (1875ab8) to head (b0585c2).
⚠️ Report is 71 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/translate/block_tree.go | 81.25% | 21 Missing and 9 partials ⚠️ |
| pkg/translate/block_complement.go | 79.76% | 5 Missing and 12 partials ⚠️ |
| pkg/engines/slurm/slurm.go | 74.07% | 2 Missing and 5 partials ⚠️ |
| pkg/translate/topology.go | 50.00% | 2 Missing and 1 partial ⚠️ |
| pkg/engines/slinky/engine.go | 33.33% | 1 Missing and 1 partial ⚠️ |
| pkg/topology/domain.go | 81.81% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #343      +/-   ##
==========================================
+ Coverage   68.46%   70.09%   +1.62%     
==========================================
  Files          82       86       +4     
  Lines        4842     5333     +491     
==========================================
+ Hits         3315     3738     +423     
+ Misses       1395     1379      -16     
- Partials      132      216      +84     

@dmitsh dmitsh force-pushed the rs-complement branch 6 times, most recently from 6151b49 to 923f9d0 Compare June 12, 2026 19:34
dmitsh previously approved these changes Jun 12, 2026
ravisoundar and others added 2 commits June 12, 2026 12:42
Signed-off-by: Ravi Shankar <ravish@nvidia.com>
Signed-off-by: Dmitry Shmulevich <dshmulevich@nvidia.com>
dmitsh previously approved these changes Jun 12, 2026
@ravisoundar ravisoundar force-pushed the rs-complement branch 7 times, most recently from 4f3e6e3 to 18777af Compare June 15, 2026 17:13
Signed-off-by: Ravi Shankar <ravish@nvidia.com>