Sample cgroup counters and detect action OOM#232
Conversation
Add optional cgroup resource usage sampling to bb_runner. When assume_exclusive_cgroup is enabled, do best-effort validation that the runner is running in a dedicated cgroup (such as a container), and then sample memory.events, memory.peak, and PSI counters before and after each action, and attach the deltas to the runner response metadata. The following new fields are attached as CgroupResourceUsage metadata and logged: - memory_events_low - memory_events_high - memory_events_max - memory_events_oom - memory_events_oom_kill - memory_events_oom_group_kill // Linux >= 5.17 - memory_peak_bytes // resettable memory.peak, Linux >= 6.12 - psi_memory_some - psi_memory_full - psi_cpu_some - psi_cpu_full - psi_io_some - psi_io_full For more information about the meaning of each, refer to https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-interface-files and https://www.kernel.org/doc/html/latest/accounting/psi.html The PSI counters are useful for determining what is the bottleneck of the action, is it IO-bound, memory-bound or CPU-bound. Finally, use the sampled memory.events counters in bb_worker to report OOM outcomes more accurately. If an action process was OOM-killed after the action cgroup hit its memory limit, surface information about the OOM in ExecuteResponse.Message and force a nonzero exit code even if the top-level command exited with zero. If an action process was OOM-killed without the action cgroup reaching its memory limit, treat it as a retryable infrastructure failure with codes.Unavailable, since that likely indicates system-level memory pressure such as node memory overcommitment.
| // memory.peak: Peak memory usage in bytes since resetting | ||
| // memory.peak at action start. Zero if memory.peak reset/read is | ||
| // unavailable. | ||
| int64 memory_peak_bytes = 7; |
There was a problem hiding this comment.
Many of these fields use names that literally match the ones used by cgroups, while some do not. Would it make sense to use exactly the same names, so that it's easier for people to figure out where they come from?
There was a problem hiding this comment.
Good idea, tried to align the names better!
| if runErr == nil { | ||
| response.Result.ExitCode = int32(runResponse.ExitCode) | ||
| response.Result.ExecutionMetadata.AuxiliaryMetadata = append(response.Result.ExecutionMetadata.AuxiliaryMetadata, runResponse.ResourceUsage...) | ||
| if cgroupUsage := getCgroupResourceUsage(response.Result); cgroupUsage != nil && cgroupUsage.MemoryEventsOomKill > 0 { |
There was a problem hiding this comment.
I'd rather not make changes along these lines, as bb_worker should remain oblivious of runtime details like groups. Can't bb_runner just return the correct values?
Also, why do we need this in the first place? If something gets OOM killed, shouldn't that already lead to non-zero exit codes? And even if it returns an exit code of zero, that might be intentional? (For example, you may want to write an integration test that ensures that your software does the right thing when OOM killed?)
There was a problem hiding this comment.
Yeah, so this code does three things:
- For an action that experienced an OOM-kill, force it to fail with ExitCode=1
- Agree with your reasoning that this might be intentional, we should just remove this
- For an action that experienced an OOM-kill and hit its cgroup limit, we surface this in
response.Message- This is only really for UX convenience, I was thinking bb-browser could show the message next to the nonzero exit code to make it more obvious to the users why the action (likely) failed. But the CgroupResourceUsage is attached as response metadata, so the information is also available there. Would you rather not convey the OOM info in
Message?
- This is only really for UX convenience, I was thinking bb-browser could show the message next to the nonzero exit code to make it more obvious to the users why the action (likely) failed. But the CgroupResourceUsage is attached as response metadata, so the information is also available there. Would you rather not convey the OOM info in
- For an action that experienced an OOM-kill but did not hit its cgroup memory limit, fail it with
Unavailable.- This would make it safer to run bb_runner in a memory-overcommitted environment and singleProcessOOMKill; currently such a setup can result in intermittent action failures. So I think this would still be desired, don't you think?
There was a problem hiding this comment.
If (and only if) you agree that the third point is desireable, that alone could easily be done in CgroupResourceUsageSamplingRunner. Then we wouldn't need any worker changes. WDYT?
There was a problem hiding this comment.
The latest commits remove ExitCode override, remove Message override, and move the "return Unavailable if the action was OOMkilled but the cgroup didn't OOM (since that's an infra issue that should't be attributed to the action)" logic to the runner. Let me know if you still want this to be changed.
| if err != nil { | ||
| return fmt.Errorf("failed to read cgroup.procs: %w", err) | ||
| } | ||
| allowedProcessIDs := getAncestorProcessIDs(os.Getpid()) |
There was a problem hiding this comment.
To me it's unclear why we need to make checks like these. I can well imagine that people may want to launch workers that have some specific common processes running permanently in the background (e.g., a company specific authentication daemon that is used by tests). Is it really that harmful if they run in the same cgroup as the actions itself? It may only lead to some over reporting of resources?
I'd say, let's not have any validation like this. It's up to the user to ensure bb_runner is launched in a properly configured environment.
There was a problem hiding this comment.
Agreed, removed the best-effort (and possibly harmful) validation.
| } | ||
|
|
||
| func (r *cgroupResourceUsageSamplingRunner) Run(ctx context.Context, request *runner_pb.RunRequest) (*runner_pb.RunResponse, error) { | ||
| if r.activeRuns.Add(1) != 1 { |
There was a problem hiding this comment.
This could use atomic.Bool, right? You could just do a CompareAndSwap() to see if there's already an action running.
|
|
||
| cgroupUsage, readErr := cgroupStatsReader.Read() | ||
| if readErr != nil { | ||
| log.Print("Failed to read scoped cgroup stats: ", readErr) |
There was a problem hiding this comment.
Please don't log if there's a way to propagate the error.
|
|
||
| cgroupStatsReader, err := r.newScopedCgroupStatsReader() | ||
| if err != nil { | ||
| return nil, status.Errorf(codes.Internal, "failed to create scoped cgroup stats reader: %s", err) |
| activeRuns atomic.Int32 | ||
| } | ||
|
|
||
| type scopedCgroupStatsReader interface { |
There was a problem hiding this comment.
I think we should only try to use interfaces in our code if it allows us to select between multiple implementations, or if we want to aid testing. In this specific case neither apply, because scopedCgroupStatsReader is private and the instance to use is implicitly selected by NewCgroupResourceUsageSamplingRunner().
Can we either:
- make this interface public and let take
NewCgroupResourceUsageSamplingRunner()take an instance of this type, or - get rid of the interface entirely?
There was a problem hiding this comment.
Yeah, I went back and forth with how to keep the public interface minimal, while still allowing tests to work across the runner/runner_test package boundary. The latest version made the whole feature Linux-specific in cgroup_resource_usage_sampling_runner_{linux,other,test}.go and the public API is:
NewCgroupResourceUsageSamplingRunner(with non-linux Stubs) used by main.goNewCgroupResourceUsageSamplingRunnerWithCgroupfsPathused by testsResolveCurrentCgroupfsPathFromProcFilesused by tests
Hopefully this is now more contained / clean, but I'm happy to evolve it further!
- Rename CgroupResourceUsage fields to match their Linux names - Remove the best-effort startup validation for exclusive cgroups - Rename the configuration field from assume_exclusive_cgroup to sample_cgroup_resource_usage - Remove OOM classification from worker, keep "Unavailable" error code remapping on the runner side to keep failures caused by system-level memory pressure retryable - Refactor construction/injecting so that tests can now live in separate packages
Trying to have a separation between generic CgroupResourceUsageSamplingRunner and Linux-specific Cgroup reader started to feel off, let's make the whole feature Linux-only.
|
@EdSchouten would you have time to do another review? The latest version is quite a bit more compact and doesn't touch bb_worker, hopefully you'll find it better! 🙇 |
Add optional cgroup resource usage sampling to bb_runner. When assume_exclusive_cgroup is enabled, do best-effort validation that the runner is running in a dedicated cgroup (such as a container), and then sample memory.events, memory.peak, and PSI counters before and after each action, and attach the deltas to the runner response metadata.
The following new fields are attached as CgroupResourceUsage metadata and logged:
For more information about the meaning of each, refer to https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-interface-files and https://www.kernel.org/doc/html/latest/accounting/psi.html
The PSI counters are useful for determining what is the bottleneck of the action, is it IO-bound, memory-bound or CPU-bound.
Finally, use the sampled memory.events counters in bb_worker to report OOM outcomes more accurately. If an action process was OOM-killed after the action cgroup hit its memory limit, surface information about the OOM in ExecuteResponse.Message and force a nonzero exit code even if the top-level command exited with zero. If an action process was OOM-killed without the action cgroup reaching its memory limit, treat it as a retryable infrastructure failure with codes.Unavailable, since that likely indicates system-level memory pressure such as node memory overcommitment.