99-mellanox: fix array skew and abort on degraded NIC #271
Open
elordahl wants to merge 1 commit into
Three independent sysfs globs (infiniband_verbs, infiniband,
infiniband_mad) built the parallel arrays assuming equal counts and
aligned ordering. When a PCI function exposed a verbs device but no
infiniband/ class entry (BlueField DPU, SF/SR-IOV representor, down
port), ifaces[] ended up shorter than devices[]. The mount loop only
range-checked against ${#devices[@]}, so it dereferenced an unset
ifaces[id] and, under set -euo pipefail, aborted the hook:

    /etc/enroot/hooks.d/99-mellanox.sh: line 88: ifaces[id]: unbound variable
    [ERROR] /etc/enroot/hooks.d/99-mellanox.sh exited with return code 1

This killed every container launch on affected nodes. It was observed
breaking the NCCL tests alltoall_perf_mpi, all_gather_perf_mpi,
all_reduce_perf_mpi, and reduce_scatter_perf_mpi.
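
The failure mode can be reproduced outside the hook. The following is a minimal sketch, not the hook's code: the function name demo_skew and the array contents are made up for illustration. It only shows how indexing a shorter parallel array under set -u aborts the shell.

```shell
#!/usr/bin/env bash
# Hypothetical reproduction; demo_skew and the values below are illustrative,
# not taken from 99-mellanox.sh.
demo_skew() (
    # Runs in a subshell so the abort does not take the caller down with it.
    set -euo pipefail
    devices=(uverbs0 uverbs1)   # two verbs devices were found
    ifaces=(mlx5_0)             # but only one infiniband/ class entry
    # The loop is bounded by devices[] only, as described above.
    for id in "${!devices[@]}"; do
        echo "${devices[id]} -> ${ifaces[id]}"
    done
)
```

Calling demo_skew prints the first pair, then bash reports ifaces[id] as an unbound variable and the subshell exits with a non-zero status, which mirrors the hook abort quoted above.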
Fix: enumerate per PCI function anchored on infiniband_verbs and resolve
the iface and management nodes from the same <bdf> directory, so the
arrays are always index-aligned regardless of which sysfs sub-entries
are present. A missing iface is now an explicit common::err (degraded
NIC detected) rather than an unbound-variable crash, preserving the
job-blocking behavior with a clear message. umad/issm entries are
guarded with [ -n ] since their absence is less critical.
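
The approach described above could look roughly like the sketch below. It is an illustration under assumptions, not the patch itself: enumerate_nics, first_entry, check_nics, the umads[] and issms[] array names, and the root-directory argument are all hypothetical, and a plain message on stderr stands in for common::err. The sketch assumes each PCI function directory may contain infiniband_verbs/, infiniband/, and infiniband_mad/ sub-entries.

```shell
#!/usr/bin/env bash
# Sketch only: every function name here is illustrative, not from the hook.

# Print the basename of the first entry in DIR matching PATTERN (default *),
# or nothing at all when DIR is missing or has no match.
first_entry() {
    local dir=$1 pattern=${2:-*} entry
    for entry in "${dir}"/${pattern}; do
        if [ -e "${entry}" ]; then
            printf '%s' "${entry##*/}"
            return 0
        fi
    done
    return 0
}

# Fill the parallel arrays with exactly one slot per verbs device.
# A missing sub-entry becomes an empty string, so indices never drift.
enumerate_nics() {
    local root=$1 uverbs bdf_dir
    devices=() ifaces=() umads=() issms=()
    for uverbs in "${root}"/*/infiniband_verbs/*; do
        [ -e "${uverbs}" ] || continue          # unmatched glob stays literal
        bdf_dir=${uverbs%/infiniband_verbs/*}   # owning PCI function directory
        devices+=("${uverbs##*/}")
        ifaces+=("$(first_entry "${bdf_dir}/infiniband")")
        umads+=("$(first_entry "${bdf_dir}/infiniband_mad" 'umad*')")
        issms+=("$(first_entry "${bdf_dir}/infiniband_mad" 'issm*')")
    done
}

# Fail with a clear message on a missing iface; treat umad/issm as optional.
check_nics() {
    local id
    for id in "${!devices[@]}"; do
        if [ -z "${ifaces[id]}" ]; then
            # Stand-in for the hook's common::err call.
            echo "degraded NIC detected: ${devices[id]} has no infiniband interface" >&2
            return 1
        fi
        echo "device ${devices[id]} iface ${ifaces[id]}"
        if [ -n "${umads[id]}" ]; then echo "  umad ${umads[id]}"; fi
        if [ -n "${issms[id]}" ]; then echo "  issm ${issms[id]}"; fi
    done
}
```

A caller would run enumerate_nics on the directory holding the per-function entries and then check_nics. Because every array gains one element per verbs device, an absent iface shows up as an empty slot that can be tested explicitly instead of an unset index that trips set -u.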
Signed-off-by: Eric Lordahl <elordahl@nvidia.com>