feat: Empty Block Complementing #343
Greptile Summary
This PR adds Empty Block Complementing for Slurm block topology: when a partition's accelerator domain set is structurally incomplete (e.g., a missing NVLink domain due to a downed node), the complement algorithm pads the block list with empty placeholder entries so Slurm sees the full power-of-two-aligned block structure it expects. It also refactors
Confidence Score: 5/5
Safe to merge; the complement pipeline is well-tested across both the Slurm flat-block and per-partition YAML paths, and pre-existing issues identified in earlier rounds are all addressed.
No defects were found in the current code paths. The tree-building logic relies on an implicit uniformity invariant that holds by construction, and the new validation constraint is a deliberate design choice for the algorithm.
pkg/translate/block_tree.go (capacity uniformity assumption in buildBlockTree) and pkg/engines/slurm/slurm.go (new power-of-two validation constraint).
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[toBlockTopology / getBlockTopologyUnit] --> B[complementBlocks]
B --> C{len blockSizes >= 1 AND domains != nil?}
C -- No --> D[return original blocks unchanged]
C -- Yes --> E[domainsForBlocks: filter to partition-local hosts]
E --> F{domains empty?}
F -- Yes --> D
F -- No --> G[blocksByName index]
G --> H[buildBlockTree]
H --> I[groupSizeFromDomains: compute 2^n group size]
I --> J[packDomainNodes: pad each domain to groupSize base blocks]
J --> K{remaining blockSizes above domCapacity?}
K -- No --> L[flat aggregateBlockNode]
K -- Yes --> M[packAggregateNodes: build top-tier with empty padding]
M --> L
L --> N[collectBaseBlockSlots DFS]
N --> O[baseBlockToBlockInfo for each slot]
O --> P[complemented blockInfo list with empty placeholder slots]
P --> Q{toBlockTopology path?}
Q -- Yes --> R[refresh nodeInfo.blockID for GetNodeTopologySpec]
Q -- No --> S[build parents map for YAML TopologyUnit]
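The padding steps in the flowchart (computing a 2^n group size, then padding each domain to that many base blocks) can be illustrated with a minimal Go sketch. This is not the PR's implementation: the `block` type and the `groupSize` and `padDomains` helpers are hypothetical names chosen for illustration, and the sketch assumes the group size is the smallest power of two that fits the largest domain, which the summary does not confirm.

```go
package main

import "fmt"

// block is a hypothetical stand-in for the PR's blockInfo type.
// A block with no nodes represents an empty placeholder slot.
type block struct {
	Name  string
	Nodes []string
}

// groupSize returns the smallest power of two that is >= the number of
// base blocks in the largest domain (assumption: this is how the group
// size is chosen; the real groupSizeFromDomains may differ).
func groupSize(domains [][]block) int {
	largest := 0
	for _, d := range domains {
		if len(d) > largest {
			largest = len(d)
		}
	}
	size := 1
	for size < largest {
		size <<= 1
	}
	return size
}

// padDomains flattens the domains in order, appending empty placeholder
// blocks so that every domain occupies exactly size slots.
func padDomains(domains [][]block, size int) []block {
	out := make([]block, 0, len(domains)*size)
	for i, d := range domains {
		out = append(out, d...)
		for slot := len(d); slot < size; slot++ {
			// Placeholder naming is illustrative only.
			out = append(out, block{Name: fmt.Sprintf("empty-%d-%d", i, slot)})
		}
	}
	return out
}

func main() {
	// Two domains: one with three base blocks, one with a single block
	// (for example, after nodes went down).
	domains := [][]block{
		{
			{Name: "b0", Nodes: []string{"n0"}},
			{Name: "b1", Nodes: []string{"n1"}},
			{Name: "b2", Nodes: []string{"n2"}},
		},
		{
			{Name: "b3", Nodes: []string{"n3"}},
		},
	}
	size := groupSize(domains)
	for _, b := range padDomains(domains, size) {
		fmt.Printf("%s nodes=%v\n", b.Name, b.Nodes)
	}
}
```

With these inputs each domain ends up occupying four slots, so the incomplete second domain keeps its position in the flat block list instead of shifting later blocks out of alignment.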
Reviews (28): Last reviewed commit: "adding more checks"
91ad1ef to c122256
f048cb8 to eb4d16a
dba4934 to 7f30438
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #343      +/-   ##
==========================================
+ Coverage   68.46%   70.09%   +1.62%
==========================================
  Files          82       86       +4
  Lines        4842     5333     +491
==========================================
+ Hits         3315     3738     +423
+ Misses       1395     1379      -16
- Partials      132      216      +84
6151b49 to 923f9d0
Signed-off-by: Ravi Shankar <ravish@nvidia.com>
Signed-off-by: Dmitry Shmulevich <dshmulevich@nvidia.com>
🌿 Preview your docs: https://nvidia-preview-pull-request-343.docs.buildwithfern.com/topograph
4f3e6e3 to 18777af
Signed-off-by: Ravi Shankar <ravish@nvidia.com>
18777af to f57a332
Description
Empty Block Complementing for Slurm Block Topology.
Checklist
git commit -s).