Skip to content

perf: reduce allocations on the chunk reading hot path#339

Open
matshch wants to merge 4 commits into
buildbarn:mainfrom
matshch:fix-buffer-reuse
Open

perf: reduce allocations on the chunk reading hot path#339
matshch wants to merge 4 commits into
buildbarn:mainfrom
matshch:fix-buffer-reuse

Conversation

@matshch

@matshch matshch commented May 14, 2026

Copy link
Copy Markdown

Note

This PR was authored with AI assistance and reviewed by me before opening.

readerBackedChunkReader previously allocated a fresh maximumChunkSizeBytes slice on every Read() call. With the default 64 KiB chunk size and production traffic around 20-30 KRPS, this allocation site accounted for the bulk of GC pressure on bb_storage.

The reader now owns a single buffer for its entire lifetime, acquired lazily on first Read() and returned to a package-private sync.Pool on Close(). The ChunkReader contract is tightened to reflect that returned slices are only valid until the next Read()/Close() call. The only place where this change mattered is gRPC server, now we use grpc.PreparedMsg to move the marshaling of the message in the main loop, which removes references to the allocated buffers.

The overall effect depends heavily on the load pattern, in our case we are seeing up to 3x CPU usage reduction on production cluster.

@EdSchouten EdSchouten left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure that this is safe? In the bytestream server we place these chunks in messages that we pass to gRPC's .Send(). Is it safe to mutate/reuse these buffers immediately afterwards?

@matshch

matshch commented May 15, 2026

Copy link
Copy Markdown
Author

From the practical point of view it is safe: the current implementation of grpc-go always marshals the message before returning from Send(), that is what my AI agent noticed and used to come to the conclusion that this code is safe.

But it looks like the contract actually does not allow that, https://pkg.go.dev/github.com/grpc/grpc-go#ServerStream.SendMsg:

It is not safe to modify the message after calling SendMsg. Tracing libraries and stats handlers may use the message lazily.

(and that is the method called in https://github.com/googleapis/go-genproto/blob/3700d4141b60f18b5e65fb06d6b8ccebd3ff6384/googleapis/bytestream/bytestream.pb.go#L842)

The exact same issue was brought up in grpc/grpc-go#8186, and the maintainers do recommend using grpc.PreparedMsg as a workaround. It moves the marshaling outside of the Send call, and passes the already marshaled message around. And during marshaling all the bytes are copied into the new buffer, so it should be safe to change the old one.

Let me know if that works for you, or if you deem this too unsafe to merge.

@MarshallOfSound

Copy link
Copy Markdown
Contributor

Fwiw this was a follow up to my last perf PR that I was still bashing around a bit in a prod cluster.

I can confirm the prod wins just the safety of it gave me pause. I'll compare what I've been running to this PR and see if there's any edge case differences

@matshch

matshch commented May 18, 2026

Copy link
Copy Markdown
Author

We've rolled out pre-grpc.PreparedMsg version in production and it looks functional. We need more time to say for sure, but it looks like both #338 and this one together give us 2x performance boost.

Comment thread pkg/blobstore/grpcservers/byte_stream_server.go Outdated
@matshch matshch force-pushed the fix-buffer-reuse branch from e2abf2e to 9e0eb0f Compare May 18, 2026 20:45
@matshch matshch requested review from EdSchouten and tamird May 18, 2026 20:53

@tamird tamird left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Directionally I think this PR is not right. sync.Pool is basically a hack here to attempt to improve the performance of an implementation of ChunkReader but that interface is inherently broken. It is simply not reasonable or possible in the general case to return owned byte slices from an interface - any interface - without forcing oodles of allocations.

The second problem is that the amortized allocations do not actually vanish in this PR because of the continuously allocated PreparedMsg. This PR is just moving the allocations around: before they were in readerBackedChunkReader and now they are in the grpc.PreparedMsg.Encode call.

The benchmarks in this PR are misleading; microbenchmarking around the broken interface doesn't demonstrate anything useful.

I'm planning to send another PR - which is sadly performance-neutral - that removes the broken ChunkReader interface entirely.

@matshch

matshch commented May 18, 2026

Copy link
Copy Markdown
Author

sync.Pool is basically a hack here to attempt to improve the performance of an implementation of ChunkReader but that interface is inherently broken. It is simply not reasonable or possible in the general case to return owned byte slices from an interface - any interface - without forcing oodles of allocations.

I agree, but in the program which whole job is to read some chunks from disk and then chug them into the network — there is not whole lot to do except to manage memory, so that is what we have to optimize.

I'm not quite sure how to approach this issue without owning the buffers, I would be glad to see your PR.

The second problem is that the amortized allocations do not actually vanish in this PR because of the continuously allocated PreparedMsg. This PR is just moving the allocations around: before they were in readerBackedChunkReader and now they are in the grpc.PreparedMsg.Encode call.

You are mistaking two different allocation paths:

  • Allocations from readerBackedChunkReader are gone.
  • Allocations from ServerStream.SendMsg have moved to PreparedMsg.Encode.

So this PR should be performance beneficial, but I am not sure how to demonstrate that easily. I guess we need another benchmark that will hook this server to some client to test out the whole pipeline? I'll measure it empirically on our production soon.

Comment thread pkg/blobstore/grpcservers/byte_stream_server.go Outdated
b := make([]byte, r.maximumChunkSizeBytes)
n, err := io.ReadFull(r.r, b[:])
if r.buf == nil {
r.buf = getChunkBuffer(r.maximumChunkSizeBytes)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand, you're trying to solve the issue that transmitting a large file causes O(n) allocations. However, with this specific change you're trying to reduce this to 'less-than-1' allocation by using a sync.Pool. Is that really necessary?

In other words, why don't we just write this instead?

r.buf = make([]byte, r.maximumChunkSizeBytes)

Sure, it's one more allocation. But are we sure that that's actually part of the bottleneck?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main issue is that bb-storage on our workload is GC-bound:
Stock bb-storage CPU flamegraph

And the biggest source of allocations is this place:
Stock bb-storage allocations flamegraph

We have all kinds of blob sizes in CAS. Many of them do fit in the default chunk size, but we also have a huge tail of blobs which are in the realm of gigabytes:
image

With this distribution of sizes it looks like we need to optimize both small and huge files, that is why I've looked into sync.Pool first, and then found that we can actually reuse one buffer (except for the gRPC part, which is really unfortunate).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, can't we first do some benchmarking to see whether making a single allocation is sufficient?

I am not a fan of using sync.Pool here. I get why sync.Pool exists. Namely, you often want to use it for holding objects that are complex to initialize. But in our case that shouldn't need to apply. We're just talking about simple byte arrays here. If we already need to use sync.Pool for storing simple byte slices, then Go's memory management must be really bad.

I also suspect that a big amount of waste is coming from the fact that we always allocate maximumChunkSizeBytes instead of something like:

max(min(maximumChunkSizeBytes, digest.GetSizeBytes()), 1)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Over-allocating is wasteful, but not the kind of waste that introduces GC overhead. GC overhead is sensitive to the number of allocations and not especially to their size.

I agree that sync.Pool is a hack here.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get why sync.Pool exists. Namely, you often want to use it for holding objects that are complex to initialize.

That's not exactly true. https://pkg.go.dev/sync#Pool:

Pool's purpose is to cache allocated but unused items for later reuse, relieving pressure on the garbage collector. [...]

An appropriate use of a Pool is to manage a group of temporary items silently shared among and potentially reused by concurrent independent clients of a package. Pool provides a way to amortize allocation overhead across many clients.

An example of good use of a Pool is in the fmt package, which maintains a dynamically-sized store of temporary output buffers.

https://victoriametrics.com/blog/go-sync-pool/ provides several examples where the standard library utilizes pools for simple objects. https://cs.opensource.google/go/go/+/master:src/net/http/server.go;l=857-870;drc=15b9fc2659f77608548cb279c5e0565b0664cfca;bpv=1;bpt=1 is basically our exact case — just allocating a buffer of constant size.

The point of the pool here is to make sure that we allocate less buffers overall, so GC needs to do less work overall. It is usually negligible when we have other work to do, but in case of bb-storage it might be noticeable depending on the workload.

I have done some benchmarking with both sync.Pool and without it:

  • For large blobs pool does not matter, we are getting the main benefit from reusing the buffer inside the reader.
  • For small blobs pool does matter, we are saving ~30% of GC CPU, which means ~30% of total CPU usage when we are GC bound.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also suspect that a big amount of waste is coming from the fact that we always allocate maximumChunkSizeBytes instead of something like:

max(min(maximumChunkSizeBytes, digest.GetSizeBytes()), 1)

It might be a big amount of waste from the point of allocated memory, but I wouldn't say we have significant problems in that regard. And speaking about CPU usage:

  • The main driver of GC workload is the number of objects. Having smaller objects probably improves data locality slightly, but I don't think it is significant enough.
  • Having many different sizes might actually make GC/allocator work slightly more complex, as these objects will use different size classes.
  • It is not obvious how to recycle different-sized objects via sync.Pool.

But there is another interesting angle to all of that — Go has a significantly more complex data path for allocations > 32 KB. Maybe it makes sense to decrease the size of these buffers to 32 KB to get lock-free allocations.

@matshch matshch May 26, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have conducted several more experiments:

  • Decreasing buffer sizes to 32 KB (which are better supported by Go runtime) does not help, as we are getting more overhead from more gRPC messages than from heavier allocation/GC paths.
  • sync.Pool definitely helps to reduce CPU usage compared to allocating new constant-size buffers each time, especially in case of smaller blobs, as we can reuse them quicker.
  • Replacing allocations with new purpose-sized buffers helps a bit, but sync.Pool still outperforms this idea by 10-30% on blobs <64 KB.

@matshch

matshch commented May 19, 2026

Copy link
Copy Markdown
Author

Here is a new flame graph of allocations: we can see that both readerBackedChunkReader and serverStream.SendMsg are gone, now we only see grpc.PreparedMsg.Encode:
PR bb-storage allocations flame graph

Old flame graph to compare side by sideStock bb-storage allocations flame graph

That means:

  • readerBackedChunkReader is successfully optimized.
  • grpc.PreparedMsg does not introduce additional allocations, we just moved allocations outside of serverStream.SendMsg and now control the allocations.

If we speak about CPU load, here is a new CPU flame graph:
PR bb-storage CPU flame graph

Old flame graph to compare side by sideStock bb-storage CPU flame graph

Now GC takes only ~25% of CPU time. The main win here is that it is <=25%:

  • The GC mark phase is limited to 25% of allocatable CPU in Go runtime. This limit is hardcoded in the runtime, you cannot really affect it in any way.
  • If you are GC-bound and GC takes more than 25% of your CPU time, that effectively means that you cannot utilize all allocated CPU, as the GC mark phase cannot take more than 25% of allocatable CPU, and you are waiting for mark phase to finish before you can continue serving other requests.
  • That means that previously bb-storage would never reach the allocated CPU limit, but it would still throttle on GC and make some requests time out.
  • Now GC as the whole is less than 25%, so you can better control the allocated CPU.

I'll run this PR in our cluster a couple more days and report back how we feel, but overall it looks like a huge win for us.

matshch added 4 commits May 23, 2026 15:09
Two benchmarks share a Read driver and differ only in load:

BenchmarkByteStreamReadUnderGCPressure exercises the server Read path
in a regime where per-RPC allocation cost dominates: enough concurrent
Read RPCs that fresh chunk buffers are allocated faster than the GC
can reclaim them. The bb-storage instances that motivated this
benchmark exhibited a positive feedback loop where rising GC CPU
stretched out each RPC, which kept more in-flight buffers alive,
which raised GC CPU further. The benchmark does not attempt to
reproduce that runaway behavior but it does measure per-RPC
allocation cost in a regime where the Go runtime is forced to track
it continuously. It pins GOMAXPROCS so the ratio of allocator
throughput to available GC CPU is reproducible across hosts, runs
1024 in-flight RPCs (parallelism*GOMAXPROCS, with HTTP/2's
100-stream default lifted) so the allocator stays under sustained
pressure, and sets GOMEMLIMIT to a fixed multiple of the
steady-state working set so the runtime has to assist GC on every
allocation.

BenchmarkByteStreamRead exercises the same Read path without
touching any runtime knobs: GOMAXPROCS, GOMEMLIMIT and parallelism
are left at their defaults. It measures whatever the host has on
hand and serves as a regression guard for changes to the
chunk-buffer lifecycle outside the artificial pressure regime.

Both sweep blob sizes from 64 KiB to 64 MiB and report four custom
metrics via b.ReportMetric, computed over the benchmark window using
runtime/metrics:

  %gc-cpu        : GC CPU (assist + dedicated + idle + pause) as a
                   fraction of total CPU time.
  gc-pause-us/op : mean STW pause time per RPC, estimated by summing
                   midpoints of the /sched/pauses/total/gc histogram
                   and dividing by b.N.
  gc-cycles/Kop  : completed GC cycles per thousand RPCs.
  heap-MiB-peak  : max HeapInuse observed by a 10ms sampler.
Exercises readerBackedChunkReader through the public Buffer API.
readerBackedChunkReader previously allocated a fresh
maximumChunkSizeBytes slice on every Read() call. With the default
64 KiB chunk size and production traffic around 20-30 KRPS, this
allocation site accounted for the bulk of GC pressure on bb_storage,
with the runtime spending up to 90% of CPU in GC under load.

The reader now owns a single buffer for its entire lifetime, acquired
lazily on first Read() from a package-private sync.Pool and returned
on Close(). Per-reader ownership eliminates the per-chunk allocation
for large blobs; the pool prevents the runtime from re-allocating
that buffer once per RPC, which dominates GC work at small blob
sizes (64 KiB - 256 KiB) where each RPC reads only a few chunks.

The ChunkReader contract is tightened to reflect the long-standing
implicit invariant that returned slices are only valid until the next
Read()/Close() call -- every in-tree consumer already respected this.
multiplexedChunkReader's existing channel-based barrier already
serializes consumers, which keeps buffer reuse safe across
CloneStream().

A similar make() per Read() pattern exists in
pkg/blobstore/grpcclients/cas_blob_access.go's zstdByteStreamChunkReader
and is left for a follow-up PR.
… buffers

The previous commit made readerBackedChunkReader reuse its 64 KiB buffer
across Read() calls. That alone violates the grpc.SendMsg contract,
which documents that 'It is not safe to modify the message after calling
SendMsg' because stats handlers and tracing may capture the message
to process lazily.

grpc.PreparedMsg is the documented escape hatch: PreparedMsg.Encode
marshals and copies the wire bytes synchronously, and prepareMsg
short-circuits on *PreparedMsg, so the source []byte is no longer
referenced by gRPC after SendMsg returns.
@matshch matshch force-pushed the fix-buffer-reuse branch from 9e0eb0f to b993b67 Compare May 26, 2026 13:17
@matshch

matshch commented May 26, 2026

Copy link
Copy Markdown
Author

Hi, thanks for the patience! Last week we were diving deeper and deeper into our bb-storage instance performance. I now have a lot better understanding of what is happening there, and the fix in this PR definitely helps in our scenario, but it is not a silver bullet as it might have seem from initial micro-benchmarks.

The main issue that we are observing is that when a bb-storage instance is overwhelmed with requests, it responds slower to each of those requests (due to limited resources available to the instance), which means that with high RPS we are getting more and more in-flight requests, which means more allocated resources. Go GC runs when there is a certain ratio of new allocations reached, regardless of the memory still available to the process, so in this runaway scenario each allocation creates additional load on GC and makes us run GC more often, which does not help.

We have not yet found specific bottlenecks that we are hitting, but GC contention is visible in pprof, and reducing it helps with overall CPU usage. So this PR looks to me like a good optimization to have, if you agree with new contracts imposed.

Here is a chart of CPU usage over the number of requests processed in our cluster. We have deployed a version from this PR over the previous weekend, so the left half of the chart is the base branch and the right half is this PR branch. You can see that while there are still a lot of noticeable peaks in CPU usage, they are lower overall, as well as the median CPU usage.
Chart showing bb-storage CPU usage over the number of requests processed

I have also added some benchmarks that exercise the whole gRPC server loop. They still don't represent the workload ideally, as the network latency is not modeled, but at least they can be used to compare the effects of this PR more directly. One of the benchmarks tries to simulate the GC contention described above by setting low memory limit. It is not quite the same thing as happens in the real life, but it allows to see a 10x throughput improvement on 1MiB blob size (and no degradation on the rest of the cases).

Let me know what you think!

@matshch matshch requested review from EdSchouten and tamird May 26, 2026 13:59
@tamird

tamird commented May 26, 2026

Copy link
Copy Markdown

It would be great to update the PR description and remove the micro benchmark which is, as discussed, rather meaningless.

@matshch

matshch commented May 26, 2026

Copy link
Copy Markdown
Author

Updated the description to only reflect the main gist of this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants