fix: CRI CreateContainer TOCTOU orphan + kill -9 paused-box hang (audit #20/#24) #99

Merged: ZhiXiao-Lin merged 1 commit into main from fix/box-audit-cri-lifecycle-batch on Jun 15, 2026

@ZhiXiao-Lin

Contributor

Two lifecycle fixes from the adversarial audit (double-skeptic verified).

#20 (MEDIUM, leak) — CreateContainer TOCTOU orphan

CreateContainer validated that the sandbox was Ready once, then did unbounded async work (image resolve + prepare_container_rootfs, which yields) and never re-checked before add_container. A concurrent StopPodSandbox + RemovePodSandbox could tear down the sandbox (and its rootfs tree) in that window. CreateContainer would then resume, recreate the rootfs under the now-deleted sandbox tree, and register an orphan container whose sandbox is gone, which nothing reaps.

Fix: re-validate the sandbox immediately before add_container; if it is no longer Ready, clean up the rootfs and return failed_precondition.
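The check / slow work / re-check shape of this fix can be illustrated with a minimal, synchronous sketch. Everything here is hypothetical and stands in for code not shown in this PR: `Store`, `SandboxState`, `CreateError`, and the `slow_work` closure (a stand-in for image resolve + prepare_container_rootfs) are illustration-only names. Holding the sandbox lock across the final check and the registration is an assumption of this sketch, not something the PR states.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SandboxState {
    Ready,
    NotReady,
}

#[derive(Debug, PartialEq, Eq)]
pub enum CreateError {
    /// Stand-in for the `failed_precondition` status named in the PR.
    FailedPrecondition(String),
}

/// Hypothetical in-memory state; the real CRI store is not shown in this PR.
#[derive(Default)]
pub struct Store {
    sandboxes: Mutex<HashMap<String, SandboxState>>,
    containers: Mutex<HashMap<String, String>>, // container id -> sandbox id
    rootfs: Mutex<Vec<String>>,                 // container ids with a prepared rootfs
}

impl Store {
    pub fn set_sandbox(&self, id: &str, state: SandboxState) {
        self.sandboxes.lock().unwrap().insert(id.to_string(), state);
    }

    pub fn remove_sandbox(&self, id: &str) {
        self.sandboxes.lock().unwrap().remove(id);
    }

    pub fn container_count(&self) -> usize {
        self.containers.lock().unwrap().len()
    }

    pub fn rootfs_count(&self) -> usize {
        self.rootfs.lock().unwrap().len()
    }

    pub fn create_container<F: FnOnce()>(
        &self,
        sandbox_id: &str,
        container_id: &str,
        slow_work: F,
    ) -> Result<(), CreateError> {
        let gone =
            || CreateError::FailedPrecondition(format!("sandbox {sandbox_id} is not Ready"));

        // First check: reject early if the sandbox is already unusable.
        if self.sandboxes.lock().unwrap().get(sandbox_id) != Some(&SandboxState::Ready) {
            return Err(gone());
        }

        // Slow step with no lock held; the sandbox may be removed meanwhile.
        slow_work();
        self.rootfs.lock().unwrap().push(container_id.to_string());

        // Second check, immediately before registration. The guard is kept
        // across the insert so a removal cannot slip in between (assumption).
        let sandboxes = self.sandboxes.lock().unwrap();
        if sandboxes.get(sandbox_id) != Some(&SandboxState::Ready) {
            drop(sandboxes);
            // Undo the rootfs preparation so nothing is left orphaned.
            self.rootfs.lock().unwrap().retain(|c| c.as_str() != container_id);
            return Err(gone());
        }
        self.containers
            .lock()
            .unwrap()
            .insert(container_id.to_string(), sandbox_id.to_string());
        Ok(())
    }
}
```

To simulate the race, pass a closure that removes the sandbox as `slow_work`: the call should then return `FailedPrecondition` and leave neither a registered container nor a leftover rootfs entry.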

#24 (MEDIUM, DoS) — kill -9 of a paused box hangs

kill -9 of a paused box skipped the resume (which was gated to signal != SIGKILL) and then routed SIGKILL through the guest exec server. But the guest is SIGSTOP-frozen and can never ack, and the read has no timeout, so the kill hangs before reaching the host-SIGKILL fallback.

Fix: SIGKILL can't be caught, so route it straight to the host shim (force-killing the VM, which is what -9 wants) and resume a paused box first.
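The routing decision described above can be sketched as a small pure function. This is only one reading of the fix, not the project's code: `BoxState`, `Route`, `KillPlan`, and `plan_kill` are hypothetical names, and the sketch assumes a paused box is resumed first for every signal, with SIGKILL always bypassing the guest exec server.

```rust
/// Linux signal numbers used by the sketch.
pub const SIGKILL: i32 = 9;
pub const SIGTERM: i32 = 15;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum BoxState {
    Running,
    Paused,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Route {
    /// Deliver the signal through the guest exec server (needs a live guest).
    GuestExecServer,
    /// Deliver the signal to the host shim, force-killing the VM.
    HostShim,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct KillPlan {
    pub resume_first: bool,
    pub route: Route,
}

/// Decide how to deliver `signal` to a box in `state` (hypothetical helper).
pub fn plan_kill(state: BoxState, signal: i32) -> KillPlan {
    KillPlan {
        // A frozen guest cannot act on anything, so resume it first
        // regardless of the signal (assumption of this sketch).
        resume_first: state == BoxState::Paused,
        // SIGKILL cannot be caught, so nothing in the guest needs to see it:
        // skip the guest round-trip and go to the host shim directly.
        route: if signal == SIGKILL {
            Route::HostShim
        } else {
            Route::GuestExecServer
        },
    }
}
```

With this shape, `plan_kill(BoxState::Paused, SIGKILL)` never waits on a guest acknowledgement, which is the path that hung before the fix.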

cri 242 + cli kill tests pass; fmt + clippy clean.

Found by the adversarial multi-agent audit (double-skeptic verified).
@ZhiXiao-Lin merged commit dfc6957 into main on Jun 15, 2026
3 checks passed
@ZhiXiao-Lin deleted the fix/box-audit-cri-lifecycle-batch branch on June 15, 2026 at 11:35