fix: CRI CreateContainer TOCTOU orphan + kill -9 paused-box hang (audit #20/#24)#99
Merged
Merged
Conversation
#20/#24) #20 (MEDIUM, leak): CreateContainer validated the sandbox Ready ONCE, then did unbounded async work (image resolve + prepare_container_rootfs, which yields) and never re-checked before add_container. A concurrent StopPodSandbox + RemovePodSandbox could tear the sandbox (and its rootfs tree) down in that window; CreateContainer would resume, recreate the rootfs under the now-deleted sandbox tree, and register an orphan container whose sandbox is gone — which nothing reaps. Re-validate the sandbox immediately before add_container; if it is no longer Ready, clean up the rootfs and return failed_precondition. #24 (MEDIUM, dos): `kill -9` of a PAUSED box skipped the resume (gated to signal != SIGKILL) and then routed SIGKILL through the guest exec server — but the guest is SIGSTOP-frozen and can never ack, and the read has no timeout, so the kill HANGS before reaching the host-SIGKILL fallback. SIGKILL cannot be caught, so route it straight to the host shim (force-killing the VM, which is what -9 wants) and resume a paused box first. cri 242 + cli kill tests pass; fmt + clippy clean. Found by the adversarial multi-agent audit (double-skeptic verified).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Two lifecycle fixes from the adversarial audit (double-skeptic verified).
#20 (MEDIUM, leak) — CreateContainer TOCTOU orphan
CreateContainervalidated the sandboxReadyonce, then did unbounded async work (image resolve +prepare_container_rootfs, which yields) and never re-checked beforeadd_container. A concurrentStopPodSandbox+RemovePodSandboxcould tear the sandbox (and its rootfs tree) down in that window;CreateContainerwould resume, recreate the rootfs under the now-deleted sandbox tree, and register an orphan container whose sandbox is gone — which nothing reaps. → re-validate the sandbox immediately beforeadd_container; if no longer Ready, clean up the rootfs and returnfailed_precondition.#24 (MEDIUM, DoS) — kill -9 of a paused box hangs
kill -9of a paused box skipped the resume (gated tosignal != SIGKILL) and then routedSIGKILLthrough the guest exec server — but the guest isSIGSTOP-frozen and can never ack, and the read has no timeout, so the kill hangs before reaching the host-SIGKILL fallback. SIGKILL can't be caught, so → route it straight to the host shim (force-killing the VM, which is what-9wants) and resume a paused box first.cri 242 + cli kill tests pass; fmt + clippy clean.