[Ubuntu 24.04][Onboard] onboard cannot create a GPU-enabled sandbox on Docker-driver GPU host #4948

@zNeill

Description


On a Docker-driver GPU host (NVIDIA GPU auto-detected), `nemoclaw onboard` cannot bring up a GPU-enabled sandbox. While creating the sandbox, onboard enables GPU passthrough. This is the standard create-then-GPU-enable path, which runs on a normal first onboard whenever a GPU is present on a Docker-driver gateway (gated by `NEMOCLAW_DOCKER_GPU_PATCH`). The OpenShell supervisor never reconnects to the GPU-enabled container, so the sandbox enters the Error phase before the GPU proof can run, the step aborts with exit 1, and onboard fails.

Reproduced on Ubuntu 24.04 (RTX PRO 6000 Blackwell) and Ubuntu 26.04 (RTX A6000) with NemoClaw v0.0.60.

Environment
Device: GPU CI runners, Ubuntu 24.04 (NVIDIA RTX PRO 6000 Blackwell Server Edition, 97887 MB) and Ubuntu 26.04 (NVIDIA RTX A6000, 46068 MB)

OS: Ubuntu 24.04 / Ubuntu 26.04
Architecture: x86_64
Node.js: v22.22.2
npm: 10.9.7
Docker: docker (Docker-driver gateway; Docker CDI GPU support detected, /etc/cdi/nvidia.yaml)
OpenShell CLI: 0.0.44
NemoClaw: v0.0.60

Steps to Reproduce

1. Start on a Docker-driver Linux GPU host (Ubuntu 24.04 or 26.04) with an NVIDIA GPU, Docker CDI GPU support, and no existing NemoClaw sandbox.
2. Run a normal first onboard: `nemoclaw onboard` (GPU auto-detected; OpenShell GPU passthrough enabled by default).
3. Onboard creates the sandbox and then enables GPU access on the Docker container (the GPU-enable step).
4. Observe the GPU-enable step result and the sandbox phase.
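
The preconditions in step 1 can be checked before running onboard. The following is a minimal sketch, not part of NemoClaw: `preflight` is a hypothetical helper, and using `nvidia-smi` on `PATH` as a proxy for "NVIDIA GPU present" is an assumption (the issue does not say how onboard detects the GPU). The CDI spec path `/etc/cdi/nvidia.yaml` is the one listed under Environment. The sketch does not check for an existing NemoClaw sandbox.

```python
import os
import shutil


def preflight(cdi_spec="/etc/cdi/nvidia.yaml",
              tools=("docker", "nemoclaw", "nvidia-smi")):
    """Report which reproduction preconditions hold on this host.

    Hypothetical helper: it only checks that each tool is on PATH and that
    the CDI spec file exists. It does not call NemoClaw or Docker.
    """
    report = {f"{tool} on PATH": shutil.which(tool) is not None for tool in tools}
    report[f"CDI spec {cdi_spec}"] = os.path.isfile(cdi_spec)
    return report


if __name__ == "__main__":
    for check, ok in preflight().items():
        print(f"{'ok' if ok else 'MISSING':8}{check}")
```

If every check reports `ok`, the host matches the environment described in this report closely enough to attempt step 2.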

Expected Result

The sandbox is created with GPU access, the OpenShell supervisor reconnects to the GPU-enabled container, the GPU proof runs, the sandbox reaches Ready, and the first onboard completes.

Actual Result

The GPU-enable step fails and onboard aborts (exit 1). Product log:

  Docker-driver GPU patch active; creating sandbox first, then recreating the Docker container with GPU access.
  ...
  patched_create_option=--gpus all
  Docker GPU patch failed.
  OpenShell supervisor did not reconnect to the GPU-enabled container; pre-patch sandbox restored.
  OpenShell sandbox entered Error phase before the GPU proof could run.
  sandbox_phase=Error
  Diagnostics saved: /var/lib/gitlab-runner/.nemoclaw/onboard-failures/-my-assistant-docker-gpu-patch
  Escape hatch: set NEMOCLAW_DOCKER_GPU_PATCH=0 to skip this patch.
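
For triage across CI runs, a captured onboard log can be matched against the failure signature above. This is a rough sketch under stated assumptions: `summarize_onboard_log` and `FAILURE_MARKERS` are hypothetical names, and the parser assumes the log keeps the `key=value` lines and message strings exactly as quoted in this report.

```python
import re

# Message strings copied from the product log in this report.
FAILURE_MARKERS = (
    "Docker GPU patch failed.",
    "OpenShell supervisor did not reconnect to the GPU-enabled container",
)


def summarize_onboard_log(text):
    """Extract key=value diagnostics and check for this issue's signature."""
    fields = {}
    markers = []
    for raw in text.splitlines():
        line = raw.strip()
        # Lines such as "sandbox_phase=Error" consist of a single lowercase key.
        match = re.fullmatch(r"([a-z_]+)=(.*)", line)
        if match:
            fields[match.group(1)] = match.group(2)
        for marker in FAILURE_MARKERS:
            if marker in line and marker not in markers:
                markers.append(marker)
    matches_issue = (
        fields.get("sandbox_phase") == "Error"
        and len(markers) == len(FAILURE_MARKERS)
    )
    return {"fields": fields, "markers": markers, "matches_issue": matches_issue}


SAMPLE_LOG = """\
  patched_create_option=--gpus all
  Docker GPU patch failed.
  OpenShell supervisor did not reconnect to the GPU-enabled container; pre-patch sandbox restored.
  OpenShell sandbox entered Error phase before the GPU proof could run.
  sandbox_phase=Error
  Escape hatch: set NEMOCLAW_DOCKER_GPU_PATCH=0 to skip this patch.
"""

if __name__ == "__main__":
    print(summarize_onboard_log(SAMPLE_LOG))
```

The escape-hatch line is deliberately not parsed as a field, because it is prose that happens to contain `NEMOCLAW_DOCKER_GPU_PATCH=0` rather than a diagnostic value.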

Bug Details

| Field | Value |
| --- | --- |
| Priority | Unprioritized |
| Action | Dev - Open - To fix |
| Disposition | Open issue |
| Module | Machine Learning - NemoClaw |
| Keyword | NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Onboard, NemoClaw_Sandbox, NemoClaw-SWQA-RelBlckr-Recommended |

[NVB#6281494]

Metadata

Labels

- NV QA (Bugs found by the NVIDIA QA Team)
- UAT (Issues flagged for User Acceptance Testing.)
- area: onboarding (Onboarding FSM, provider setup, sandbox launch, or first-run flow)
- area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
- platform: ubuntu (Affects Ubuntu Linux environments)
