test(export): add benchmarks for histogram sample processing by bwplotka · Pull Request #324 · GoogleCloudPlatform/prometheus

bwplotka · 2026-07-01T16:53:01Z

Efficiency Assessment: GoogleCloudPlatform/prometheus PR #323

Pull Request: GoogleCloudPlatform/prometheus#323 (fix-histogram-theft-bug)
Assessment Date: 2026-07-01
Evaluated Branch: benchmark-histogram-export (based on origin/fix-histogram-theft-bug)

Executive Summary

PR #323 replaces buildDistribution (singular) with buildDistributions (plural) in transform.go. Previously, when a histogram sample was encountered, the builder consumed samples until one distribution completed and assumed it corresponded to the series of the first sample seen. On interleaved or incomplete histogram streams (such as Kong issue #14925), this caused histogram distribution theft, where buckets from one series were erroneously attached to the metadata and labels of another.

With PR #323, buildDistributions consumes all contiguous samples for a histogram metric family in one pass, groups them by series hash in b.dists, emits all completed distributions with their true identity, and immediately returns the cached distribution objects to sync.Pool.
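The grouping step described above can be illustrated with a toy model. This is a minimal sketch, not the code in transform.go: the `sample` and `dist` types, the `kind` strings, and `groupBySeries` are hypothetical names introduced here, and the real builder works on Prometheus samples and Cloud Monitoring protos rather than these structs.

```go
package main

import (
	"fmt"
	"sort"
)

// sample is a simplified, hypothetical stand-in for one scraped histogram
// sample. seriesHash identifies the label set; kind is "bucket", "sum" or
// "count".
type sample struct {
	seriesHash uint64
	kind       string
	value      float64
}

// dist accumulates the parts seen so far for one histogram series.
type dist struct {
	buckets  int
	hasSum   bool
	hasCount bool
}

// complete reports whether every expected part of the histogram was seen.
func (d *dist) complete(wantBuckets int) bool {
	return d.hasSum && d.hasCount && d.buckets == wantBuckets
}

// groupBySeries consumes all samples in one pass, keyed by series hash, and
// returns the hashes of the series whose distribution is complete.
func groupBySeries(samples []sample, wantBuckets int) []uint64 {
	dists := map[uint64]*dist{}
	for _, s := range samples {
		d, ok := dists[s.seriesHash]
		if !ok {
			d = &dist{}
			dists[s.seriesHash] = d
		}
		switch s.kind {
		case "bucket":
			d.buckets++
		case "sum":
			d.hasSum = true
		case "count":
			d.hasCount = true
		}
	}
	var done []uint64
	for h, d := range dists {
		if d.complete(wantBuckets) {
			done = append(done, h)
		}
	}
	// Map iteration order is random, so sort for a stable result.
	sort.Slice(done, func(i, j int) bool { return done[i] < done[j] })
	return done
}

func main() {
	// Series 1 and 2 are interleaved; series 2 never receives its _count.
	in := []sample{
		{1, "sum", 3}, {2, "sum", 5},
		{1, "count", 2},
		{1, "bucket", 1}, {2, "bucket", 1},
		{1, "bucket", 2}, {2, "bucket", 2},
	}
	fmt.Println(groupBySeries(in, 2)) // prints [1]
}
```

Because every sample is filed under its own series hash, the incomplete series 2 is dropped instead of lending its buckets to series 1.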

Key Efficiency Findings:

Memory & Heap Overhead: Nearly identical. Both implementations allocate almost nothing per operation: 4 allocs/op and 4952 B/op before versus 5 allocs/op and 5016 B/op after, a fixed difference of +1 alloc and +64 B per op regardless of series count.
CPU Execution Time: Minor microbenchmark overhead (~3% to 7%). For example, processing 500 grouped histogram series (6,000 samples) takes ~3.44 ms, up from ~3.35 ms. This overhead comes from iterating over b.dists at the end of each metric family to return distributions to sync.Pool in a deferred function, which prevents memory retention across large scrape batches.
Algorithmic Complexity: Remains $O(N)$ with respect to sample count, while completely resolving distribution theft for both grouped and interleaved histogram series.

Benchmark Results (Before vs. After)

A comprehensive benchmark suite was implemented in transform_bench_test.go testing sampleBuilder.next() across 10, 100, and 500 histogram series per batch (each series consisting of 10 buckets + _sum + _count = 12 samples/series).

Two stream topologies were tested:

Grouped: Standard Prometheus exposition format where all 12 samples for series (A) appear contiguously, followed by series (B), etc.
Ungrouped (Interleaved): All _sum samples appear first, followed by all _count samples, followed by buckets across series.
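The two topologies can be sketched with a small generator. This is an illustrative sketch only: the `sample` type and the `grouped` and `interleaved` functions are hypothetical, and both the order of samples inside a grouped series and the order of buckets in the interleaved stream are assumptions, since the description above does not pin them down.

```go
package main

import "fmt"

// sample is a hypothetical stand-in that records only which series a sample
// belongs to and its metric name suffix.
type sample struct {
	series int
	name   string
}

// grouped emits all samples of series 0, then all samples of series 1, etc.
// The order inside one series (buckets, _sum, _count) is an assumption.
func grouped(nSeries, nBuckets int) []sample {
	var out []sample
	for s := 0; s < nSeries; s++ {
		for b := 0; b < nBuckets; b++ {
			out = append(out, sample{s, "_bucket"})
		}
		out = append(out, sample{s, "_sum"}, sample{s, "_count"})
	}
	return out
}

// interleaved emits every _sum, then every _count, then the buckets, cycling
// through the series for each bucket index (the bucket order is an assumption).
func interleaved(nSeries, nBuckets int) []sample {
	var out []sample
	for s := 0; s < nSeries; s++ {
		out = append(out, sample{s, "_sum"})
	}
	for s := 0; s < nSeries; s++ {
		out = append(out, sample{s, "_count"})
	}
	for b := 0; b < nBuckets; b++ {
		for s := 0; s < nSeries; s++ {
			out = append(out, sample{s, "_bucket"})
		}
	}
	return out
}

func main() {
	// 500 series x (10 buckets + _sum + _count) = 6000 samples either way.
	fmt.Println(len(grouped(500, 10)), len(interleaved(500, 10))) // prints 6000 6000
}
```

Both streams carry the same samples; only their order differs, which is what makes the pair useful for comparing the two builder implementations.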

| Benchmark Scenario | Series Count | Before (`1330987e6`) Time | After (`pr-323`) Time | Diff (Time) | Before Memory | After Memory | Diff (Memory) |
|---|---|---|---|---|---|---|---|
| `HistogramsGrouped` | 10 | 64.0 µs/op | 67.5 µs/op | +5.4% | 4952 B / 4 allocs | 5016 B / 5 allocs | +64 B / +1 alloc |
| `HistogramsGrouped` | 100 | 598.0 µs/op | 639.9 µs/op | +7.0% | 4952 B / 4 allocs | 5016 B / 5 allocs | +64 B / +1 alloc |
| `HistogramsGrouped` | 500 | 3.35 ms/op | 3.44 ms/op | +2.6% | 4952 B / 4 allocs | 5016 B / 5 allocs | +64 B / +1 alloc |
| `HistogramsUngrouped` | 10 | 63.6 µs/op | 65.6 µs/op | +3.1% | 4952 B / 4 allocs | 5016 B / 5 allocs | +64 B / +1 alloc |
| `HistogramsUngrouped` | 100 | 606.1 µs/op | 634.8 µs/op | +4.7% | 4952 B / 4 allocs | 5016 B / 5 allocs | +64 B / +1 alloc |
| `HistogramsUngrouped` | 500 | 3.21 ms/op | 3.33 ms/op | +3.7% | 4952 B / 4 allocs | 5016 B / 5 allocs | +64 B / +1 alloc |

Architectural & Efficiency Analysis

1. Memory Efficiency & Pool Management

In the previous implementation, when buildDistribution completed a series, the underlying *distribution object remained stored inside the sampleBuilder.dists map until builder.close() was invoked at the end of the entire scrape batch. If a batch contained 1,000 unique histogram series, the old code held 1,000 pooled objects simultaneously.

In PR #323, buildDistributions introduces immediate cleanup via defer:

defer func() {
    for _, dist := range b.dists {
        putDistribution(dist)
    }
    clear(b.dists)
}()

As soon as a contiguous metric family is processed, all of its *distribution objects are returned to sync.Pool and b.dists is cleared. When the next metric family in the scrape is evaluated, it reuses the exact same pooled objects from sync.Pool.
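The recycle-per-family pattern can be shown in isolation. The following is a minimal sketch under stated assumptions: only putDistribution and the b.dists map are named in the PR description, while the `distribution` fields, `getDistribution`, `builder`, and `processFamily` are hypothetical stand-ins for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// distribution is a hypothetical, simplified pooled object.
type distribution struct {
	bounds []float64
	values []int64
}

// reset truncates the slices so their backing arrays can be reused.
func (d *distribution) reset() {
	d.bounds = d.bounds[:0]
	d.values = d.values[:0]
}

var distPool = sync.Pool{New: func() any { return &distribution{} }}

// getDistribution is an assumed counterpart to putDistribution.
func getDistribution() *distribution { return distPool.Get().(*distribution) }

// putDistribution resets a distribution and hands it back to the pool.
func putDistribution(d *distribution) {
	d.reset()
	distPool.Put(d)
}

// builder is a toy stand-in for sampleBuilder with only the dists map.
type builder struct {
	dists map[uint64]*distribution
}

// processFamily takes one pooled distribution per unique series hash and
// returns how many were in use. The deferred function returns them all to
// the pool and empties the map before the next metric family is processed.
func (b *builder) processFamily(hashes []uint64) int {
	defer func() {
		for _, d := range b.dists {
			putDistribution(d)
		}
		clear(b.dists)
	}()
	for _, h := range hashes {
		if _, ok := b.dists[h]; !ok {
			b.dists[h] = getDistribution()
		}
	}
	// The return value is evaluated before the deferred cleanup runs.
	return len(b.dists)
}

func main() {
	b := &builder{dists: map[uint64]*distribution{}}
	used := b.processFamily([]uint64{1, 2, 3})
	fmt.Println(used, len(b.dists)) // prints 3 0
}
```

The map is empty again after each call, so the number of live pooled objects is bounded by the largest single metric family rather than by the whole scrape batch.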

The +64 B / +1 alloc difference: Notice buildDistributions returns []hashedSeries (backed by the reused b.histResultsBuf slice). When next() appends this slice to result (result = append(result, histSeries...)), Go performs a single slice header/capacity adjustment worth exactly 64 bytes per batch.

2. CPU Performance & Map Iteration

The minor (~3%–7%) CPU overhead in microbenchmarks is directly attributable to the two map iterations over b.dists per metric family:

Iterating over b.dists to check dist.complete() and build monitoring_pb.TimeSeries.
Iterating over b.dists in the defer block to call putDistribution(dist) before calling clear(b.dists).

In a production scrape, this delta is small: in the 500-series benchmarks it amounts to roughly 90 µs (grouped, 3.35 ms to 3.44 ms) and 120 µs (ungrouped, 3.21 ms to 3.33 ms) per batch. That is a negligible trade-off for eliminating distribution theft and out-of-order errors on interleaved streams.

3. Impact of Commit `fa054e200` (`chore: remove order`)

In the initial PR commit (ec24bbba6), buildDistributions tracked series insertion order using a b.touched []uint64 slice. Commit fa054e200 removed b.touched in favor of iterating directly over b.dists (and sorting slices in test assertions instead). This optimization was effective: removing b.touched eliminated per-sample slice growth overhead and kept the heap allocations at an absolute minimum (5 allocs/op).
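Iterating a Go map yields entries in an unspecified order, which is why dropping an insertion-order slice pushes the ordering concern into the test assertions. The sketch below shows that idea in miniature; the `hashedSeries` fields, `emit`, and `sortByHash` are hypothetical and are not taken from the PR's test code.

```go
package main

import (
	"fmt"
	"sort"
)

// hashedSeries here is a simplified stand-in with hypothetical fields.
type hashedSeries struct {
	hash uint64
	name string
}

// emit collects results by ranging over the map directly, so the output
// order is not deterministic and no extra order-tracking slice is needed.
func emit(dists map[uint64]string) []hashedSeries {
	out := make([]hashedSeries, 0, len(dists))
	for h, name := range dists {
		out = append(out, hashedSeries{h, name})
	}
	return out
}

// sortByHash gives tests a stable order to compare against.
func sortByHash(s []hashedSeries) {
	sort.Slice(s, func(i, j int) bool { return s[i].hash < s[j].hash })
}

func main() {
	got := emit(map[uint64]string{3: "c", 1: "a", 2: "b"})
	sortByHash(got) // done in the assertion path, not in production code
	fmt.Println(got) // prints [{1 a} {2 b} {3 c}]
}
```

Sorting only in tests keeps the production path free of the per-sample bookkeeping that an order-tracking slice would require.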

Running the Benchmark Suite

To execute the benchmarks locally from the repository root:

go test -run=^$ -bench=BenchmarkSampleBuilder_Histograms -benchmem ./google/export

TAG=agy
CONV=20159a34-fd32-4d67-879d-325703fa43ea

…theft

Add cases to TestSampleBuilder for:
- ungrouped (interleaved) histogram samples
- ungrouped (interleaved) histogram samples with first group incomplete
- ungrouped (interleaved) histogram samples with first group skipped due to new bucket

Enforce strict _bucket, _count, _sum ordering for test samples.

TAG=agy
CONV=8f508481-de1c-4e6b-ad3a-718d089a2fbe

When histogram samples in a scrape batch are ungrouped (interleaved across different series label sets for the same metric name), existing buildDistribution would return as soon as any histogram in the cache completed. This caused histogram distribution theft (attaching completed distributions to the wrong series) and sample loss for other interleaved series.

Replace buildDistribution with buildHistograms to consume the entire contiguous block of samples for a histogram metric name. Cache series metadata (hash, proto, lset) on distribution entries and emit all completed distributions in deterministic order when the block ends. Incomplete or skipped histogram distributions are cleanly discarded without being emitted under incorrect series label sets.

TAG=agy
CONV=ae038258-882b-4fbd-8b3f-15976bc590f2

Signed-off-by: bwplotka <bwplotka@gmail.com>

gemini-code-assist

Code Review

This pull request adds a new benchmark file transform_bench_test.go to measure the performance of SampleBuilder when processing grouped and ungrouped histograms. The feedback recommends instantiating and closing the sampleBuilder outside of the benchmark loop to avoid unnecessary allocation and CPU overhead, which will yield more accurate benchmark results.

gemini-code-assist · 2026-07-01T16:58:41Z

+	b.ResetTimer()
+	b.ReportAllocs()
+
+	for i := 0; i < b.N; i++ {
+		sb := newSampleBuilder(cache)
+		batch := batches[1]
+		for len(batch) > 0 {
+			_, tail, err := sb.next(metadata, externalLabels, batch, nil)
+			if err != nil {
+				b.Fatal(err)
+			}
+			batch = tail
+		}
+		sb.close()
+	}


In the benchmark loop, a new sampleBuilder is instantiated and closed on every iteration. Since newSampleBuilder allocates a map of size 128 and a slice, doing this inside the b.N loop introduces significant allocation and CPU overhead that is unrelated to the next() method being benchmarked.

Because sampleBuilder's internal state (b.dists and b.histResultsBuf) is fully cleared or reset during each call to next() (via buildDistributions), it is completely safe to reuse a single sampleBuilder instance across all iterations of the benchmark. Moving the instantiation outside the loop and calling b.ResetTimer() after it will make the benchmark results much more accurate and focused on the actual sample processing logic.

sb := newSampleBuilder(cache)
defer sb.close()

b.ResetTimer()
b.ReportAllocs()

for i := 0; i < b.N; i++ {
	batch := batches[1]
	for len(batch) > 0 {
		_, tail, err := sb.next(metadata, externalLabels, batch, nil)
		if err != nil {
			b.Fatal(err)
		}
		batch = tail
	}
}

Do this, also use b.Loop()
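Combining both requests (reuse one builder, iterate with b.Loop()) might look like the sketch below. It is not the benchmark from transform_bench_test.go: `toyBuilder`, its `next` method, and `benchmarkToyBuilder` are hypothetical stand-ins so the example compiles on its own, and it assumes Go 1.24 or newer, where `testing.B.Loop` is available.

```go
package main

import (
	"fmt"
	"testing"
)

// toyBuilder is a hypothetical stand-in for sampleBuilder. It owns scratch
// state that is expensive to allocate and is reset once a batch is drained.
type toyBuilder struct {
	scratch map[uint64]int
}

func newToyBuilder() *toyBuilder {
	return &toyBuilder{scratch: make(map[uint64]int, 128)}
}

// next consumes one element, returns the tail, and clears the scratch map
// when the batch is exhausted so the builder can be reused.
func (t *toyBuilder) next(batch []uint64) []uint64 {
	t.scratch[batch[0]]++
	if len(batch) == 1 {
		clear(t.scratch)
	}
	return batch[1:]
}

// benchmarkToyBuilder builds the input and the builder once, outside the
// timed region. Only the body of the b.Loop() loop is measured, so no
// explicit b.ResetTimer() call is needed after setup.
func benchmarkToyBuilder(b *testing.B) {
	batch := make([]uint64, 1200)
	for i := range batch {
		batch[i] = uint64(i % 100)
	}
	tb := newToyBuilder()

	b.ReportAllocs()
	for b.Loop() {
		rest := batch
		for len(rest) > 0 {
			rest = tb.next(rest)
		}
	}
}

func main() {
	testing.Init() // registers testing flags when running outside "go test"
	res := testing.Benchmark(benchmarkToyBuilder)
	fmt.Println(res.String(), res.MemString())
}
```

In a real _test.go file the function would be named with the Benchmark prefix and run through go test, and the main function here would not be needed.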

bwplotka added 5 commits July 1, 2026 13:57

chore: remove order

fa054e2

Signed-off-by: bwplotka <bwplotka@gmail.com>

chore: add commentary

bd3aa88

Signed-off-by: bwplotka <bwplotka@gmail.com>

test(export): add benchmarks for histogram sample processing

767ca9f

bwplotka mentioned this pull request Jul 1, 2026

fix(export): Fix histogram theft bug on interleaved and incomplete histograms #323

Open

gemini-code-assist Bot reviewed Jul 1, 2026


bwplotka and others added 2 commits July 2, 2026 19:13

test(export): reuse sampleBuilder in benchmark loop

18fb12c

test(export): use b.Loop() in benchmark

13de0ce

bwplotka force-pushed the fix-histogram-theft-bug branch from bd3aa88 to 7ce3854 Compare July 3, 2026 13:06

bwplotka wants to merge 7 commits into fix-histogram-theft-bug from benchmark-histogram-export
