test(export): add benchmarks for histogram sample processing #324

Draft
bwplotka wants to merge 7 commits into fix-histogram-theft-bug from benchmark-histogram-export
@bwplotka (Collaborator) commented Jul 1, 2026

Efficiency Assessment: GoogleCloudPlatform/prometheus PR #323

Pull Request: GoogleCloudPlatform/prometheus#323 (fix-histogram-theft-bug)
Assessment Date: 2026-07-01
Evaluated Branch: benchmark-histogram-export (based on origin/fix-histogram-theft-bug)


Executive Summary

PR #323 replaces buildDistribution (singular) with buildDistributions (plural) in transform.go. Previously, when a histogram sample was encountered, the builder consumed samples until one distribution completed and assumed it corresponded to the series of the first sample seen. On interleaved or incomplete histogram streams (such as Kong issue #14925), this caused histogram distribution theft, where buckets from one series were erroneously attached to the metadata and labels of another.

With PR #323, buildDistributions consumes all contiguous samples for a histogram metric family in one pass, groups them by series hash in b.dists, emits all completed distributions with their true identity, and immediately returns the cached distribution objects to sync.Pool.
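As a rough illustration of the grouping idea described above, the following sketch consumes a whole block of interleaved samples, keys the partial state by series, and emits only the series that completed. It is not the code from transform.go: the `sample` and `dist` types, the string series key (standing in for the series hash), and `groupBySeries` are all hypothetical names chosen for this example.

```go
package main

import "fmt"

// sample is a hypothetical, simplified histogram sample. series stands in
// for the series hash; kind is "bucket", "sum" or "count".
type sample struct {
	series string
	kind   string
}

// dist tracks which parts of one histogram series have been seen.
type dist struct {
	buckets  int
	hasSum   bool
	hasCount bool
}

// groupBySeries consumes the whole block of samples, keyed by series, and
// returns only the series that saw wantBuckets buckets plus sum and count.
func groupBySeries(block []sample, wantBuckets int) map[string]*dist {
	dists := make(map[string]*dist)
	for _, s := range block {
		d, ok := dists[s.series]
		if !ok {
			d = &dist{}
			dists[s.series] = d
		}
		switch s.kind {
		case "bucket":
			d.buckets++
		case "sum":
			d.hasSum = true
		case "count":
			d.hasCount = true
		}
	}
	// Incomplete series are discarded rather than emitted under another identity.
	for key, d := range dists {
		if d.buckets != wantBuckets || !d.hasSum || !d.hasCount {
			delete(dists, key)
		}
	}
	return dists
}

func main() {
	// Interleaved stream: series A never receives its _count sample.
	block := []sample{
		{"A", "sum"}, {"B", "sum"},
		{"B", "count"},
		{"A", "bucket"}, {"B", "bucket"},
	}
	for key := range groupBySeries(block, 1) {
		fmt.Println("complete:", key) // prints "complete: B"
	}
}
```

Because state is keyed by series rather than by "the first sample seen", an incomplete series A cannot pick up the buckets that belong to series B.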

Key Efficiency Findings:

  1. Memory & Heap Overhead: Virtually identical. Both implementations have a near-constant allocation footprint: 4 vs. 5 allocs/op and +64 B/op (about 1.3% on top of 4,952 B/op) in every scenario, independent of series count.
  2. CPU Execution Time: Minor microbenchmark overhead (~3% to 7%). Processing 500 grouped histogram series (6,000 samples) takes ~3.44 ms, up from ~3.35 ms. This overhead comes from iterating over b.dists at the end of each metric family to immediately recycle distributions back to sync.Pool via defer, which prevents memory retention across large scrape batches.
  3. Algorithmic Complexity: Remains $O(N)$ with respect to sample count, while completely resolving distribution theft for both grouped and interleaved histogram series.

Benchmark Results (Before vs. After)

The benchmark suite in transform_bench_test.go exercises sampleBuilder.next() with 10, 100, and 500 histogram series per batch. Each series consists of 10 buckets plus _sum and _count, i.e. 12 samples per series.

Two stream topologies were tested:

  • Grouped: Standard Prometheus exposition format where all 12 samples for series (A) appear contiguously, followed by series (B), etc.
  • Ungrouped (Interleaved): All _sum samples appear first, followed by all _count samples, followed by buckets across series.
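
To make the two orderings concrete, here is a minimal sketch that generates sample names in each topology. It is not the generator from transform_bench_test.go. The helper names, the `s<N>_...` naming scheme, the position of _sum and _count within a grouped series, and the bucket order in the ungrouped case (by bucket index, then by series) are assumptions made for illustration.

```go
package main

import "fmt"

const (
	bucketsPerSeries = 10
	samplesPerSeries = bucketsPerSeries + 2 // buckets + _sum + _count
)

// grouped emits all 12 samples of series 0, then all 12 of series 1, etc.
func grouped(nSeries int) []string {
	out := make([]string, 0, nSeries*samplesPerSeries)
	for s := 0; s < nSeries; s++ {
		for b := 0; b < bucketsPerSeries; b++ {
			out = append(out, fmt.Sprintf("s%d_bucket%d", s, b))
		}
		out = append(out, fmt.Sprintf("s%d_sum", s), fmt.Sprintf("s%d_count", s))
	}
	return out
}

// ungrouped emits every _sum first, then every _count, then the buckets
// interleaved across series (assumed order: bucket index, then series).
func ungrouped(nSeries int) []string {
	out := make([]string, 0, nSeries*samplesPerSeries)
	for s := 0; s < nSeries; s++ {
		out = append(out, fmt.Sprintf("s%d_sum", s))
	}
	for s := 0; s < nSeries; s++ {
		out = append(out, fmt.Sprintf("s%d_count", s))
	}
	for b := 0; b < bucketsPerSeries; b++ {
		for s := 0; s < nSeries; s++ {
			out = append(out, fmt.Sprintf("s%d_bucket%d", s, b))
		}
	}
	return out
}

func main() {
	fmt.Println(len(grouped(500)), len(ungrouped(500))) // prints "6000 6000"
	fmt.Println(ungrouped(2)[:4])                       // prints "[s0_sum s1_sum s0_count s1_count]"
}
```

Both topologies carry the same 6,000 samples at 500 series; only the order differs, which is what isolates the cost of handling interleaving.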

| Benchmark Scenario | Series Count | Before (1330987e6) Time | After (pr-323) Time | Diff (Time) | Before Memory | After Memory | Diff (Memory) |
|---|---|---|---|---|---|---|---|
| HistogramsGrouped | 10 | 64.0 µs/op | 67.5 µs/op | +5.4% | 4952 B / 4 allocs | 5016 B / 5 allocs | +64 B / +1 alloc |
| HistogramsGrouped | 100 | 598.0 µs/op | 639.9 µs/op | +7.0% | 4952 B / 4 allocs | 5016 B / 5 allocs | +64 B / +1 alloc |
| HistogramsGrouped | 500 | 3.35 ms/op | 3.44 ms/op | +2.6% | 4952 B / 4 allocs | 5016 B / 5 allocs | +64 B / +1 alloc |
| HistogramsUngrouped | 10 | 63.6 µs/op | 65.6 µs/op | +3.1% | 4952 B / 4 allocs | 5016 B / 5 allocs | +64 B / +1 alloc |
| HistogramsUngrouped | 100 | 606.1 µs/op | 634.8 µs/op | +4.7% | 4952 B / 4 allocs | 5016 B / 5 allocs | +64 B / +1 alloc |
| HistogramsUngrouped | 500 | 3.21 ms/op | 3.33 ms/op | +3.7% | 4952 B / 4 allocs | 5016 B / 5 allocs | +64 B / +1 alloc |

Architectural & Efficiency Analysis

1. Memory Efficiency & Pool Management

In the previous implementation, when buildDistribution completed a series, the underlying *distribution object remained stored inside the sampleBuilder.dists map until builder.close() was invoked at the end of the entire scrape batch. If a batch contained 1,000 unique histogram series, the old code held 1,000 pooled objects simultaneously.

In PR #323, buildDistributions introduces immediate cleanup via defer:

defer func() {
    for _, dist := range b.dists {
        putDistribution(dist)
    }
    clear(b.dists)
}()

As soon as a contiguous metric family is processed, all of its *distribution objects are returned to sync.Pool and b.dists is cleared. When the next metric family in the scrape is evaluated, it reuses the exact same pooled objects from sync.Pool.

  • The +64 B / +1 alloc difference: buildDistributions returns []hashedSeries (backed by the reused b.histResultsBuf slice). When next() appends this slice to result (result = append(result, histSeries...)), Go performs a single slice header/capacity adjustment that accounts for the one extra allocation of exactly 64 bytes per batch.
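
The recycle-on-return pattern from this section can be sketched in isolation as follows. This is a simplified model, not the exporter's code: the `distribution` fields, `distPool`, `getDistribution`, the body of `putDistribution`, and `processFamily` are assumptions for illustration. Only the shape of the deferred cleanup mirrors the snippet quoted above.

```go
package main

import (
	"fmt"
	"sync"
)

// distribution is a hypothetical stand-in for the pooled histogram state.
type distribution struct {
	bounds []float64
	counts []int64
}

// distPool hands out reusable distribution objects.
var distPool = sync.Pool{New: func() any { return &distribution{} }}

func getDistribution() *distribution { return distPool.Get().(*distribution) }

// putDistribution truncates the slices (keeping capacity) and recycles d.
func putDistribution(d *distribution) {
	d.bounds = d.bounds[:0]
	d.counts = d.counts[:0]
	distPool.Put(d)
}

type builder struct {
	dists map[uint64]*distribution
}

// processFamily models one contiguous metric family: it takes one pooled
// object per distinct series hash and reports how many were held at the end.
func (b *builder) processFamily(hashes []uint64) (held int) {
	// Runs after the return value is set: recycle everything, then empty the map.
	defer func() {
		for _, d := range b.dists {
			putDistribution(d)
		}
		clear(b.dists)
	}()
	for _, h := range hashes {
		if _, ok := b.dists[h]; !ok {
			b.dists[h] = getDistribution()
		}
	}
	return len(b.dists)
}

func main() {
	b := &builder{dists: make(map[uint64]*distribution)}
	fmt.Println(b.processFamily([]uint64{1, 2, 2, 3})) // prints "3"
	fmt.Println(len(b.dists))                          // prints "0"
}
```

The map is empty again as soon as the family has been processed, so the number of objects held at once is bounded by the largest single family rather than by the whole scrape batch. Note that sync.Pool does not guarantee that a later Get returns the same object.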

2. CPU Performance & Map Iteration

The minor (~3%–7%) CPU overhead in microbenchmarks is directly attributable to the two map iterations over b.dists per metric family:

  1. Iterating over b.dists to check dist.complete() and build monitoring_pb.TimeSeries.
  2. Iterating over b.dists in the defer block to call putDistribution(dist) before calling clear(b.dists).

In a production scrape (e.g., 500 series taking ~3.3 to 3.4 ms), this delta amounts to roughly 90 to 120 microseconds per batch (3.35 ms to 3.44 ms grouped, 3.21 ms to 3.33 ms ungrouped), a negligible trade-off for eliminating distribution theft and out-of-order errors on interleaved streams.

3. Impact of Commit fa054e200 (chore: remove order)

In the initial PR commit (ec24bbba6), buildDistributions tracked series insertion order using a b.touched []uint64 slice. Commit fa054e200 removed b.touched in favor of iterating directly over b.dists (and sorting slices in test assertions instead). This optimization was effective: removing b.touched eliminated per-sample slice growth overhead and kept the heap allocations at an absolute minimum (5 allocs/op).
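
Iterating a Go map directly means the emission order is unspecified, which is why the assertions have to sort before comparing. A minimal sketch of that test-side pattern, using a hypothetical `sortedHashes` helper and a plain map in place of b.dists:

```go
package main

import (
	"fmt"
	"slices"
)

// sortedHashes collects the keys of dists (map iteration order is
// unspecified in Go) and sorts them so comparisons are deterministic.
func sortedHashes(dists map[uint64]string) []uint64 {
	out := make([]uint64, 0, len(dists))
	for h := range dists {
		out = append(out, h)
	}
	slices.Sort(out)
	return out
}

func main() {
	dists := map[uint64]string{3: "c", 1: "a", 2: "b"}
	fmt.Println(sortedHashes(dists)) // prints "[1 2 3]"
}
```

Sorting only in the test keeps the ordering cost out of the hot path, at the price of a non-deterministic emission order in production output.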


Running the Benchmark Suite

To execute the benchmarks locally from the repository root:

go test -run=^$ -bench=BenchmarkSampleBuilder_Histograms -benchmem ./google/export

TAG=agy
CONV=20159a34-fd32-4d67-879d-325703fa43ea

bwplotka added 5 commits July 1, 2026 13:57
…theft

Add cases to TestSampleBuilder for:
- ungrouped (interleaved) histogram samples
- ungrouped (interleaved) histogram samples with first group incomplete
- ungrouped (interleaved) histogram samples with first group skipped due to new bucket

Enforce strict _bucket, _count, _sum ordering for test samples.

TAG=agy
CONV=8f508481-de1c-4e6b-ad3a-718d089a2fbe
When histogram samples in a scrape batch are ungrouped (interleaved across
different series label sets for the same metric name), existing
buildDistribution would return as soon as any histogram in the cache
completed. This caused histogram distribution theft (attaching completed
distributions to the wrong series) and sample loss for other interleaved series.

Replace buildDistribution with buildHistograms to consume the entire
contiguous block of samples for a histogram metric name. Cache series
metadata (hash, proto, lset) on distribution entries and emit all completed
distributions in deterministic order when the block ends. Incomplete or
skipped histogram distributions are cleanly discarded without being emitted
under incorrect series label sets.

TAG=agy
CONV=ae038258-882b-4fbd-8b3f-15976bc590f2
Signed-off-by: bwplotka <bwplotka@gmail.com>
Signed-off-by: bwplotka <bwplotka@gmail.com>

@gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request adds a new benchmark file, transform_bench_test.go, to measure the performance of sampleBuilder when processing grouped and ungrouped histograms. The feedback recommends instantiating and closing the sampleBuilder outside of the benchmark loop to avoid unnecessary allocation and CPU overhead, which will yield more accurate benchmark results.

Comment thread google/export/transform_bench_test.go Outdated
Comment on lines +108 to +122
	b.ResetTimer()
	b.ReportAllocs()

	for i := 0; i < b.N; i++ {
		sb := newSampleBuilder(cache)
		batch := batches[1]
		for len(batch) > 0 {
			_, tail, err := sb.next(metadata, externalLabels, batch, nil)
			if err != nil {
				b.Fatal(err)
			}
			batch = tail
		}
		sb.close()
	}


medium

In the benchmark loop, a new sampleBuilder is instantiated and closed on every iteration. Since newSampleBuilder allocates a map of size 128 and a slice, doing this inside the b.N loop introduces significant allocation and CPU overhead that is unrelated to the next() method being benchmarked.

Because sampleBuilder's internal state (b.dists and b.histResultsBuf) is fully cleared or reset during each call to next() (via buildDistributions), it is completely safe to reuse a single sampleBuilder instance across all iterations of the benchmark. Moving the instantiation outside the loop and calling b.ResetTimer() after it will make the benchmark results much more accurate and focused on the actual sample processing logic.

	sb := newSampleBuilder(cache)
	defer sb.close()

	b.ResetTimer()
	b.ReportAllocs()

	for i := 0; i < b.N; i++ {
		batch := batches[1]
		for len(batch) > 0 {
			_, tail, err := sb.next(metadata, externalLabels, batch, nil)
			if err != nil {
				b.Fatal(err)
			}
			batch = tail
		}
	}

@bwplotka (Collaborator, Author) replied:

Do this, also use b.Loop()
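
A minimal sketch of what that combination could look like, assuming Go 1.24 or newer (where testing.B.Loop is available). The `fakeBuilder` type and `benchmarkNext` are hypothetical stand-ins so the example runs on its own; the real benchmark would call newSampleBuilder(cache) and sb.next(metadata, externalLabels, batch, nil) instead.

```go
package main

import (
	"fmt"
	"testing"
)

// fakeBuilder is a hypothetical stand-in for sampleBuilder.
type fakeBuilder struct{ processed int }

// next consumes one sample and returns the remaining tail of the batch.
func (f *fakeBuilder) next(batch []int) []int {
	f.processed++
	return batch[1:]
}

func (f *fakeBuilder) close() {}

func benchmarkNext(b *testing.B) {
	input := make([]int, 120) // e.g. 10 series x 12 samples

	// Setup happens once, before the loop. b.Loop resets the timer on its
	// first call, so no explicit b.ResetTimer() is needed.
	sb := &fakeBuilder{}
	defer sb.close()

	b.ReportAllocs()
	for b.Loop() {
		batch := input
		for len(batch) > 0 {
			batch = sb.next(batch)
		}
	}
}

func main() {
	// testing.Benchmark lets the sketch run outside of "go test".
	res := testing.Benchmark(benchmarkNext)
	fmt.Println(res.N > 0)
}
```

Reusing one builder is only valid if next() leaves no state behind between iterations, which is the property the review comment above relies on.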

@bwplotka bwplotka force-pushed the fix-histogram-theft-bug branch from bd3aa88 to 7ce3854 Compare July 3, 2026 13:06