Bug: killProcess deadlocks for 2 seconds waiting for zombie VMM process to be reaped #790

Description

@Nachiket-Roy

When a VMM process (e.g. Firecracker, QEMU, HVT) terminates, it enters a zombie (defunct) state and stays there until its parent process (the containerd shim) reaps it by calling waitpid.

During forced cleanup or stop paths (like urunc delete --force), urunc executes killProcess(pid) to terminate the process:

  1. killProcess sends SIGKILL to the VMM PID.
  2. The VMM process terminates and becomes a zombie.
  3. killProcess enters a loop polling unix.Kill(pid, 0) to check if the process is dead, waiting for it to return ESRCH.
  4. However, unix.Kill(pid, 0) returns nil (success) for zombie processes since they still exist in the process table.
  5. The parent shim is blocked synchronously waiting for urunc to exit, meaning it cannot process the SIGCHLD and reap the zombie VMM child.

This creates a deadlock: the zombie cannot be reaped until urunc exits, and urunc cannot exit because it is waiting for the zombie to disappear. killProcess eventually times out after 2 seconds and returns an error, causing the command to fail.

Steps to Reproduce

  1. Run a urunc container.
  2. Force delete the container:
     urunc delete --force <container-id>
  3. Observe that the command blocks for 2 seconds and fails with timeout waiting for pid to die.

Expected Behavior

If the VMM process has already terminated (even if it is a zombie), killProcess and isRunning() should recognize it as dead/stopped immediately, rather than timing out or preventing deletion.

Suggested Fix

Read /proc/<pid>/stat to check the process state. If the state is zombie (Z) or dead (X/x), treat it as terminated immediately:

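import (
	"errors"
	"fmt"
	"os"
	"strings"

	"golang.org/x/sys/unix"
)

// isZombieOrDead reports whether pid no longer exists or is in the
// zombie (Z) or dead (X/x) state according to /proc/<pid>/stat.
func isZombieOrDead(pid int) (bool, error) {
	// Signal 0 only checks existence; ESRCH means the process is already gone.
	if err := unix.Kill(pid, 0); err != nil {
		if errors.Is(err, unix.ESRCH) {
			return true, nil
		}
		return false, err
	}
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		if errors.Is(err, os.ErrNotExist) {
			return true, nil
		}
		return false, nil // Fallback if /proc is not mounted
	}
	// The comm field may itself contain ')', so search for the last one.
	idx := strings.LastIndexByte(string(data), ')')
	if idx == -1 || idx+2 >= len(data) {
		return false, fmt.Errorf("invalid stat format")
	}
	// The state character follows ") ".
	state := data[idx+2]
	return state == 'Z' || state == 'X' || state == 'x', nil
}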
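The reason for searching for the last ')' instead of splitting on whitespace is that the comm field in /proc/<pid>/stat is the executable name in parentheses, and that name may itself contain spaces and parentheses. A small sketch of the difference, where stateFromStat and naiveState are hypothetical helper names used only for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// stateFromStat extracts the state character from one /proc/<pid>/stat line
// by locating the last ')' that closes the comm field.
func stateFromStat(stat string) (byte, error) {
	idx := strings.LastIndexByte(stat, ')')
	if idx == -1 || idx+2 >= len(stat) {
		return 0, fmt.Errorf("invalid stat format")
	}
	return stat[idx+2], nil
}

// naiveState splits on whitespace and takes the third field. It is shown
// only for contrast and misreads lines whose comm contains spaces.
func naiveState(stat string) (string, error) {
	fields := strings.Fields(stat)
	if len(fields) < 3 {
		return "", fmt.Errorf("invalid stat format")
	}
	return fields[2], nil
}
```

For a made-up line such as `4242 (evil) R proc) Z 1 4242 4242 0 -1`, stateFromStat returns 'Z' while naiveState returns "R", so the whitespace-based approach would report a zombie as running.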

Use this check inside killProcess and isRunning() to detect terminated processes immediately.
