99-mellanox: fix array skew and abort on degraded NIC #271

Open
elordahl wants to merge 1 commit into NVIDIA:main from elordahl:fix/mellanox-hook-array-skew
@elordahl elordahl commented Jun 8, 2026

The hook built its parallel arrays from three independent sysfs globs (infiniband_verbs, infiniband, infiniband_mad), assuming equal counts and aligned ordering. When a PCI function exposed a verbs device but no infiniband/ class entry (BlueField DPU, SF/SR-IOV representor, down port), ifaces[] ended up shorter than devices[]. The mount loop range-checked only against ${#devices[@]}, so it dereferenced an unset ifaces[id] and, under set -euo pipefail (nounset makes an unset element fatal), aborted the hook:

/etc/enroot/hooks.d/99-mellanox.sh: line 88: ifaces[id]: unbound variable
[ERROR] /etc/enroot/hooks.d/99-mellanox.sh exited with return code 1

This killed every container launch on affected nodes. It was observed breaking the NCCL tests alltoall_perf_mpi, all_gather_perf_mpi, all_reduce_perf_mpi, and reduce_scatter_perf_mpi.
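
The failure mode can be reproduced in isolation. The sketch below is illustrative only: the array contents and the lookup_iface helper are hypothetical and are not taken from the hook. It assumes bash and runs the lookup in a subshell so the nounset abort does not kill the calling shell.

```shell
#!/usr/bin/env bash
# Illustrative reproduction only; names and values are made up.

# Three verbs devices, but only two infiniband/ class entries:
# the middle PCI function (uverbs1) exposes no iface.
devices=(uverbs0 uverbs1 uverbs2)
ifaces=(mlx5_0 mlx5_2)

# Old-style lookup: the index is range-checked against devices[] only.
# The body is a subshell, so a nounset failure exits the subshell, not the caller.
lookup_iface() (
    set -u
    id=$1
    if [ "${id}" -ge "${#devices[@]}" ]; then
        exit 2
    fi
    echo "${ifaces[id]}"
)
```

With these arrays, index 1 silently returns mlx5_2 (the iface that belongs to uverbs2), and index 2 passes the range check but aborts with an unbound-variable error, which matches the two symptoms of the skew.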

Fix: enumerate per PCI function, anchored on infiniband_verbs, and resolve the iface and management nodes from the same <bdf> directory, so the arrays are always index-aligned regardless of which sysfs sub-entries are present. A missing iface is now an explicit common::err (degraded NIC detected) rather than an unbound-variable crash, which preserves the job-blocking behavior but gives a clear message. umad/issm entries are guarded with [ -n ] since their absence is less critical.
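
A minimal sketch of that enumeration strategy follows. It is not the code from this PR: first_entry, enumerate_nics, and require_iface are hypothetical helpers, the error text is a placeholder for the real common::err call, and the root directory is a parameter (assumed to default to /sys/bus/pci/devices) so the logic can be tried against a fake tree.

```shell
#!/usr/bin/env bash
# Sketch only, under the assumptions stated above.

# Print the first entry in dir whose name starts with prefix,
# or print nothing if the directory is missing or has no match.
first_entry() {
    local dir=$1 prefix=${2:-} entry
    for entry in "${dir}/${prefix}"*; do
        if [ -e "${entry}" ]; then
            basename "${entry}"
            return 0
        fi
    done
    return 0
}

# Fill devices[], ifaces[], umads[], issms[] with exactly one slot per
# PCI function that has an infiniband_verbs entry. Missing sub-entries
# become empty strings, so all four arrays stay index-aligned.
enumerate_nics() {
    local root=${1:-/sys/bus/pci/devices} verbs bdf_dir
    devices=() ifaces=() umads=() issms=()
    for verbs in "${root}"/*/infiniband_verbs/*; do
        [ -e "${verbs}" ] || continue          # glob matched nothing
        bdf_dir=${verbs%/infiniband_verbs/*}   # the <bdf> directory
        devices+=("$(basename "${verbs}")")
        ifaces+=("$(first_entry "${bdf_dir}/infiniband")")
        umads+=("$(first_entry "${bdf_dir}/infiniband_mad" umad)")
        issms+=("$(first_entry "${bdf_dir}/infiniband_mad" issm)")
    done
}

# Stand-in for the explicit error path: fail with a clear message
# when the selected device has no iface.
require_iface() {
    local id=$1
    if [ -z "${ifaces[id]}" ]; then
        echo "degraded NIC detected: ${devices[id]} has no infiniband/ entry" >&2
        return 1
    fi
}
```

Because every array gains an element on each iteration, an index that is valid for devices[] is valid for the others too; the only remaining question is whether the slot is empty, which is what the [ -z ] and [ -n ] guards answer.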

Signed-off-by: Eric Lordahl <elordahl@nvidia.com>