Sample cgroup counters and detect action OOM by scele · Pull Request #232 · buildbarn/bb-remote-execution

scele · 2026-06-09T12:11:33Z

Add optional cgroup resource usage sampling to bb_runner. When assume_exclusive_cgroup is enabled, do best-effort validation that the runner is running in a dedicated cgroup (such as a container), and then sample memory.events, memory.peak, and PSI counters before and after each action, and attach the deltas to the runner response metadata.

The following new fields are attached as CgroupResourceUsage metadata and logged:

memory_events_low
memory_events_high
memory_events_max
memory_events_oom
memory_events_oom_kill
memory_events_oom_group_kill // Linux >= 5.17
memory_peak_bytes // resettable memory.peak, Linux >= 6.12
psi_memory_some
psi_memory_full
psi_cpu_some
psi_cpu_full
psi_io_some
psi_io_full

For more information about the meaning of each, refer to https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-interface-files and https://www.kernel.org/doc/html/latest/accounting/psi.html

The PSI counters are useful for determining what is the bottleneck of the action, is it IO-bound, memory-bound or CPU-bound.

Finally, use the sampled memory.events counters in bb_worker to report OOM outcomes more accurately. If an action process was OOM-killed after the action cgroup hit its memory limit, surface information about the OOM in ExecuteResponse.Message and force a nonzero exit code even if the top-level command exited with zero. If an action process was OOM-killed without the action cgroup reaching its memory limit, treat it as a retryable infrastructure failure with codes.Unavailable, since that likely indicates system-level memory pressure such as node memory overcommitment.

Add optional cgroup resource usage sampling to bb_runner. When assume_exclusive_cgroup is enabled, do best-effort validation that the runner is running in a dedicated cgroup (such as a container), and then sample memory.events, memory.peak, and PSI counters before and after each action, and attach the deltas to the runner response metadata. The following new fields are attached as CgroupResourceUsage metadata and logged: - memory_events_low - memory_events_high - memory_events_max - memory_events_oom - memory_events_oom_kill - memory_events_oom_group_kill // Linux >= 5.17 - memory_peak_bytes // resettable memory.peak, Linux >= 6.12 - psi_memory_some - psi_memory_full - psi_cpu_some - psi_cpu_full - psi_io_some - psi_io_full For more information about the meaning of each, refer to https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-interface-files and https://www.kernel.org/doc/html/latest/accounting/psi.html The PSI counters are useful for determining what is the bottleneck of the action, is it IO-bound, memory-bound or CPU-bound. Finally, use the sampled memory.events counters in bb_worker to report OOM outcomes more accurately. If an action process was OOM-killed after the action cgroup hit its memory limit, surface information about the OOM in ExecuteResponse.Message and force a nonzero exit code even if the top-level command exited with zero. If an action process was OOM-killed without the action cgroup reaching its memory limit, treat it as a retryable infrastructure failure with codes.Unavailable, since that likely indicates system-level memory pressure such as node memory overcommitment.

EdSchouten · 2026-06-09T14:20:57Z

+  // memory.peak: Peak memory usage in bytes since resetting
+  // memory.peak at action start. Zero if memory.peak reset/read is
+  // unavailable.
+  int64 memory_peak_bytes = 7;


Many of these fields use names that literally match the ones used by cgroups, while some do not. Would it make sense to use exactly the same names, so that it's easier for people to figure out where they come from?

Good idea, tried to align the names better!

EdSchouten · 2026-06-09T14:26:42Z

 	if runErr == nil {
 		response.Result.ExitCode = int32(runResponse.ExitCode)
 		response.Result.ExecutionMetadata.AuxiliaryMetadata = append(response.Result.ExecutionMetadata.AuxiliaryMetadata, runResponse.ResourceUsage...)
+		if cgroupUsage := getCgroupResourceUsage(response.Result); cgroupUsage != nil && cgroupUsage.MemoryEventsOomKill > 0 {


I'd rather not make changes along these lines, as bb_worker should remain oblivious of runtime details like groups. Can't bb_runner just return the correct values?

Also, why do we need this in the first place? If something gets OOM killed, shouldn't that already lead to non-zero exit codes? And even if it returns an exit code of zero, that might be intentional? (For example, you may want to write an integration test that ensures that your software does the right thing when OOM killed?)

Yeah, so this code does three things:

For an action that experienced an OOM-kill, force it to fail with ExitCode=1

Agree with your reasoning that this might be intentional, we should just remove this

For an action that experienced an OOM-kill and hit its cgroup limit, we surface this in response.Message

This is only really for UX convenience, I was thinking bb-browser could show the message next to the nonzero exit code to make it more obvious to the users why the action (likely) failed. But the CgroupResourceUsage is attached as response metadata, so the information is also available there. Would you rather not convey the OOM info in Message?

For an action that experienced an OOM-kill but did not hit its cgroup memory limit, fail it with Unavailable.

This would make it safer to run bb_runner in a memory-overcommitted environment and singleProcessOOMKill; currently such a setup can result in intermittent action failures. So I think this would still be desired, don't you think?

If (and only if) you agree that the third point is desireable, that alone could easily be done in CgroupResourceUsageSamplingRunner. Then we wouldn't need any worker changes. WDYT?

The latest commits remove ExitCode override, remove Message override, and move the "return Unavailable if the action was OOMkilled but the cgroup didn't OOM (since that's an infra issue that should't be attributed to the action)" logic to the runner. Let me know if you still want this to be changed.

EdSchouten · 2026-06-09T14:41:32Z

+	if err != nil {
+		return fmt.Errorf("failed to read cgroup.procs: %w", err)
+	}
+	allowedProcessIDs := getAncestorProcessIDs(os.Getpid())


To me it's unclear why we need to make checks like these. I can well imagine that people may want to launch workers that have some specific common processes running permanently in the background (e.g., a company specific authentication daemon that is used by tests). Is it really that harmful if they run in the same cgroup as the actions itself? It may only lead to some over reporting of resources?

I'd say, let's not have any validation like this. It's up to the user to ensure bb_runner is launched in a properly configured environment.

Agreed, removed the best-effort (and possibly harmful) validation.

EdSchouten · 2026-06-09T14:44:35Z

+}
+
+func (r *cgroupResourceUsageSamplingRunner) Run(ctx context.Context, request *runner_pb.RunRequest) (*runner_pb.RunResponse, error) {
+	if r.activeRuns.Add(1) != 1 {


This could use atomic.Bool, right? You could just do a CompareAndSwap() to see if there's already an action running.

EdSchouten · 2026-06-09T14:44:49Z

+
+	cgroupUsage, readErr := cgroupStatsReader.Read()
+	if readErr != nil {
+		log.Print("Failed to read scoped cgroup stats: ", readErr)


Please don't log if there's a way to propagate the error.

Agreed, done!

EdSchouten · 2026-06-09T14:46:22Z

+
+	cgroupStatsReader, err := r.newScopedCgroupStatsReader()
+	if err != nil {
+		return nil, status.Errorf(codes.Internal, "failed to create scoped cgroup stats reader: %s", err)


Please use util.StatusWrap()

EdSchouten · 2026-06-09T14:53:35Z

+	activeRuns                 atomic.Int32
+}
+
+type scopedCgroupStatsReader interface {


I think we should only try to use interfaces in our code if it allows us to select between multiple implementations, or if we want to aid testing. In this specific case neither apply, because scopedCgroupStatsReader is private and the instance to use is implicitly selected by NewCgroupResourceUsageSamplingRunner().

Can we either:

make this interface public and let take NewCgroupResourceUsageSamplingRunner() take an instance of this type, or

get rid of the interface entirely?

Yeah, I went back and forth with how to keep the public interface minimal, while still allowing tests to work across the runner/runner_test package boundary. The latest version made the whole feature Linux-specific in cgroup_resource_usage_sampling_runner_{linux,other,test}.go and the public API is:

NewCgroupResourceUsageSamplingRunner (with non-linux Stubs) used by main.go

NewCgroupResourceUsageSamplingRunnerWithCgroupfsPath used by tests

ResolveCurrentCgroupfsPathFromProcFiles used by tests

Hopefully this is now more contained / clean, but I'm happy to evolve it further!

- Rename CgroupResourceUsage fields to match their Linux names - Remove the best-effort startup validation for exclusive cgroups - Rename the configuration field from assume_exclusive_cgroup to sample_cgroup_resource_usage - Remove OOM classification from worker, keep "Unavailable" error code remapping on the runner side to keep failures caused by system-level memory pressure retryable - Refactor construction/injecting so that tests can now live in separate packages

Trying to have a separation between generic CgroupResourceUsageSamplingRunner and Linux-specific Cgroup reader started to feel off, let's make the whole feature Linux-only.

scele · 2026-06-16T08:07:49Z

@EdSchouten would you have time to do another review? The latest version is quite a bit more compact and doesn't touch bb_worker, hopefully you'll find it better! 🙇

scele changed the title ~~[runner] Sample cgroup counters and detect action OOM~~ Sample cgroup counters and detect action OOM Jun 9, 2026

EdSchouten reviewed Jun 9, 2026

View reviewed changes

scele added 2 commits June 10, 2026 12:40

Refactor the whole feature to be Linux-specific

8715eec

Trying to have a separation between generic CgroupResourceUsageSamplingRunner and Linux-specific Cgroup reader started to feel off, let's make the whole feature Linux-only.

Conversation

scele commented Jun 9, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

scele Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

scele commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

scele Jun 10, 2026 •

edited

Loading