Performance optimizations for hot paths and stack trace handling by samber · Pull Request #247 · samber/do

samber · 2026-06-14T23:41:36Z

Summary

This PR improves performance across multiple hot paths in the dependency injection container and stack trace handling, with a focus on reducing lock contention, allocations, and unnecessary work in frequently-called operations.

Key Changes

Stack Trace Optimization

Refactored NewFrameFromCaller() to use runtime.Callers() and runtime.CallersFrames() instead of repeated runtime.Caller() calls, which re-walk the stack on each iteration
Implemented a two-phase approach: first try a small batch of frames (4), then fall back to full capture (10) only if needed
Extracted frame filtering logic into findEligibleFrame() helper for clarity
Added caching for $GOPATH directory splitting and sorting in removeGoPath() to avoid repeated work when the environment variable hasn't changed (re-reads $GOPATH each call, so runtime mutation still works)

DAG and Service Resolution

Added fast path in DAG.addDependency() with read lock check before acquiring write lock, reducing serialization for repeated invocations through transient/alias services
Introduced DAG.servicesWithoutDependents() method that batches dependency checks under a single lock, replacing repeated explainService() calls during parallel shutdown
Optimized Scope.Ancestors() to pre-allocate the result slice and walk the parent chain once, replacing recursive calls

Service Instance Caching

Added fast-path read lock in serviceLazy.getInstance() to check if instance is already built before acquiring write lock
Applied same optimization to serviceAlias.getInstance() and serviceEager.getInstance()
Added defensive atomic.LoadUint32() check before atomic.AddUint32() to avoid contended read-modify-write once frame collection cap is reached

Scope and Lifecycle

Optimized Scope.onServiceInvoke() with fast-path read lock to avoid write lock on subsequent invocations
Refactored RootScope.queueServiceHealthcheck() to handle non-expiring contexts more efficiently and avoid unnecessary channel wrapping

Utilities

Improved orderedUniq() with pre-allocated slice and map based on input length
Optimized newUUID() to use hex.Encode() with a fixed buffer instead of fmt.Sprintf(), reducing allocations

Testing

Added a comprehensive benchmark suite in a dedicated bench/ package, exercising the library exclusively through its public API (invocation, registration, lifecycle, introspection, and stack-frame capture)
A goleak TestMain keeps the goroutine-leak check on the health-check and shutdown benchmarks

Benchmark Results

Baseline (master) vs this branch, measured with the bench/ suite via benchstat. Runs were interleaved over 6 rounds (n=12 each) to average out CPU drift on the shared CI host (-benchmem -benchtime=100ms -count=2 per round; Go 1.24, linux/amd64).

Benchmark	time	allocs/op
InvokeLazySteadyState	−10.5%	6 (unchanged)
InvokeNamed	−18.7%	4 (unchanged)
InvokeTransient	−4.6%	6 (unchanged)
InvokeAlias	−18.7%	9 (unchanged)
InvokeAs	−3.5%	11 (unchanged)
InvokeNestedScopes	−8.7%	6 (unchanged)
ProviderChainTransient	−19.7%	24 (unchanged)
Provide	−18.6%	1426 → 1021 (−28%)
New	−34.1%	16 → 11 (−31%)
HealthCheck	−27.7%	625 → 475 (−24%)
Shutdown	−28.2%	946 → 546 (−42%)
ListProvidedServices	−25.3%	166 → 117 (−30%)
Ancestors	−88.8%	14 → 1 (−93%)
NewFrameFromCaller	−30.3%	6 → 2 (−67%)
NewFrameFromPC	−29.4%	1 → 0 (−100%)
geomean	−27.9%

All deltas are statistically significant (p < 0.001, except InvokeAs at p = 0.03). Steady-state invocation byte/alloc counts are unchanged — the remaining per-invoke allocations come from the invoker chain and virtual scope required for circular-dependency detection — so those wins come from reduced lock contention and cheaper hot-path work rather than fewer allocations. InvokeLazyParallel is omitted: the shared CI CPU makes the parallel measurement too noisy to report reliably (allocations are unchanged).

Notable Implementation Details

All optimizations maintain thread-safety and correctness
Fast-path optimizations use read locks before write locks to reduce contention in steady state
Stack trace improvements reduce both CPU and allocations for frame discovery
Shutdown optimization reduces lock acquisitions during parallel service shutdown by batching dependency checks

https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ

…cktrace hot paths Covers steady-state and parallel lazy invocation, named/transient/alias/ generic-type invocation, nested scope resolution, provider chains (DAG writes), registration, scope creation, health checks, dependency-aware shutdown, service listing, ancestor traversal, and stack frame capture. Baseline for upcoming targeted optimizations; classic b.N loops to stay compatible with the go 1.18 directive. https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ

…n frame capture serviceLazy.getInstance took the write lock on every call, serializing all concurrent invocations of an already-built service. Add a double-checked fast path: read built/instance under RLock, fall back to the write lock and re-check before building. Both paths access the fields under s.mu, so the RWMutex provides the required happens-before ordering (shutdown's built=false reset is observed correctly). Also prefix the invocation-frame counter check (lazy/eager/alias) with an atomic load so steady-state calls stop performing a contended read-modify-write once the MaxInvocationFrames cap is reached; the atomic add remains the authoritative gate, so the frame cap semantics are unchanged (counter is unexported and never read elsewhere). benchstat vs previous commit (count=10, go1.24.7, linux/amd64), significant rows only: sec/op (old) sec/op (new) delta InvokeNamed-4 223.2n ± 3% 217.8n ± 6% -2.40% (p=0.006) InvokeAlias-4 941.6n ± 3% 894.7n ± 2% -4.98% (p=0.000) InvokeAs-4 1.670µ ± 5% 1.588µ ± 4% -4.91% (p=0.019) InvokeNestedScopes-4 552.5n ± 4% 523.4n ± 5% -5.27% (p=0.019) geomean -4.39% https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ

NewFrameFromCaller called runtime.Caller(i) for i=0..9, re-walking the stack from scratch on every iteration, and called runtime.GOROOT() once per frame. Capture up to 10 frames with one runtime.Callers call and iterate via runtime.CallersFrames, hoisting the GOROOT lookup out of the loop. Frame-selection semantics are preserved: skip=1 makes the first examined frame NewFrameFromCaller itself (the runtime.Caller(0) referent), the frame filter expression is unchanged, an empty frame.Function maps to the old runtime.FuncForPC()==nil break, and at most 10 logical frames are examined. benchstat vs baseline (count=10, go1.24.7, linux/amd64): sec/op (base) sec/op (new) delta NewFrameFromCaller-4 2.184µ ± 3% 1.765µ ± 2% -19.18% (p=0.000) B/op -32.35%, allocs/op -33.33% (6 -> 4) https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ

removeGoPath re-read $GOPATH, re-split it, and re-sorted the result on every call (i.e. for every stack frame examined during frame capture). Cache the split+sorted directories in an atomic.Value keyed by the raw environment value. $GOPATH is still read per call on purpose: it can be mutated at runtime (TestRemoveGoPath does), and the cache recomputes whenever the raw value changes. A racing Store only causes a redundant recompute. benchstat with GOPATH set (count=10, go1.24.7, linux/amd64): sec/op (base) sec/op (new) delta RemoveGoPath-4 517.2n ± 4% 390.6n ± 14% -24.5% B/op 64 -> 24, allocs/op 3 -> 1 https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ

The recursive append([]*Scope{parent}, parent.Ancestors()...) pattern allocated a fresh slice at every level (O(n^2) allocations for n ancestors). Walk the immutable parentScope chain twice instead: once to count, once to fill a right-sized slice. Same order (immediate parent to root) and still a non-nil empty slice for the root scope. benchstat (count=10, go1.24.7, linux/amd64): sec/op (old) sec/op (new) delta Ancestors-4 673.20n ± 4% 69.97n ± 13% -89.61% (p=0.000) B/op 496 -> 80, allocs/op 14 -> 1 (10-deep scope chain) https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ

onServiceInvoke took the scope write lock on every invocation even though the orderedInvocation entry only needs to be written once. Check presence under a read lock first and fall back to the write lock with a re-check, so steady-state invocations of the same scope no longer serialize on the scope mutex. benchstat (count=10, go1.24.7, linux/amd64): sec/op (old) sec/op (new) delta InvokeLazySteadyState-4 429.2n ± 20% 404.3n ± 7% -5.78% (p=0.003) https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ

Every nested service resolution records its dependency edge in the global DAG under the write lock, even when the edge already exists — which is the common case for repeated invocations through transient and alias services. Check both directions of the edge under a read lock first; fall back to the write lock (inserts are idempotent) only when the edge is new. benchstat (count=10, go1.24.7, linux/amd64): sec/op (old) sec/op (new) delta InvokeTransient-4 447.4n ± 7% 419.9n ± 3% -6.15% (p=0.000) ProviderChainTransient-4 2.873µ ± 6% 2.352µ ± 2% -18.13% (p=0.000) https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ

orderedUniq started from a zero-capacity slice and map, forcing repeated growth while deduplicating service lists. Size both to the input length (the output can never be larger). benchstat (count=10, go1.24.7, linux/amd64): sec/op (old) sec/op (new) delta ListProvidedServices-4 53.54µ ± 5% 36.45µ ± 4% -31.92% (p=0.000) B/op -21.56%, allocs/op -34.04% (3-deep scopes x 10 services) https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ

newUUID spent most of its time in fmt.Sprintf reflection. Encode the same 16 random bytes (variant/version bits unchanged) into a fixed 36-byte buffer with hex.Encode and literal dashes — output format is identical (lowercase hex, 8-4-4-4-12). Runs on every scope creation. benchstat (count=10, go1.24.7, linux/amd64): sec/op (old) sec/op (new) delta New-4 1143.0n ± 6% 707.8n ± 10% -38.08% (p=0.000) B/op -14.56%, allocs/op 16 -> 11 https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ

Each shutdown round called listServices() twice (two lock acquisitions and key-slice allocations) and queried dag.explainService per service — one DAG read lock plus two slice allocations each, even though only the dependent count matters. Add DAG.servicesWithoutDependents, which filters a whole round under a single read lock without allocating per-service lists, and snapshot the service list once per round. Round-by-round ordering and the circular-dependency fallback (shut down all remaining services when none are dependent-free) are unchanged. benchstat (count=10, go1.24.7, linux/amd64): sec/op (old) sec/op (new) delta Shutdown-4 304.0µ ± 12% 224.5µ ± 9% -26.16% (p=0.000) B/op -16.93%, allocs/op 880 -> 480 (20-service dependency chain) https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ

The non-pooled health check path always spawned two goroutines and two channels per service to race the check against ctx.Done() — but with the default options (no HealthCheckTimeout, no HealthCheckGlobalTimeout, background context) the context can never expire and the race is dead code. Use one goroutine when ctx.Done() == nil; keep the two-goroutine race (with the same ErrHealthCheckTimeout wrapping and buffered, leak-free channels) whenever the context can be canceled. benchstat (count=10, go1.24.7, linux/amd64): sec/op (old) sec/op (new) delta HealthCheck-4 83.15µ ± 26% 60.72µ ± 10% -26.97% (p=0.000) B/op -32.05%, allocs/op 469 -> 319 (50 healthcheckable services) https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ

https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ

The single full-depth runtime.Callers capture regressed registration paths: the eligible frame is almost always one of the immediate callers (the frame filter accepts do-package frames outside do/stacktrace), so walking 10 frames upfront did wasted work where the old loop exited after 2-3 frames. Capture a 4-frame batch first and re-walk the full 10 frames only when the batch is exhausted without a match. Frame-selection semantics unchanged: same filter, same 10-logical-frame cap, the fallback rescans from the same skip depth. benchstat vs baseline (count=10, go1.24.7, linux/amd64): sec/op (base) sec/op (new) delta Provide-4 273.7µ ± 2% 213.5µ ± 10% -21.98% (p=0.000) NewFrameFromCaller-4 2.184µ ± 3% 1.669µ ± 5% -23.60% (p=0.000) (Provide had regressed to +10.8% with the single-stage capture; this restores and improves it. allocs/op for Provide: 1526 -> 821.) https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ

codecov · 2026-06-14T23:42:52Z

Codecov Report

❌ Patch coverage is 90.42553% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.94%. Comparing base (303f9cf) to head (0f2d855).

Files with missing lines	Patch %	Lines
stacktrace/stacktrace.go	65.00%	5 Missing and 2 partials ⚠️
dag.go	86.66%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #247      +/-   ##
==========================================
+ Coverage   89.93%   89.94%   +0.01%     
==========================================
  Files          23       23              
  Lines        2155     2217      +62     
==========================================
+ Hits         1938     1994      +56     
- Misses        183      188       +5     
- Partials       34       35       +1

Flag	Coverage Δ
unittests	`89.94% <90.42%> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Consolidate the do and stacktrace benchmarks under bench/, exercising the library through its public API only (qualified imports, no dot imports per the revive lint rules). A goleak TestMain preserves the goroutine-leak check for the health-check and shutdown benchmarks. The stacktrace removeGoPath micro-benchmark is dropped: it covered an unexported internal that the public-API package cannot reach, and it is already exercised on the hot path by BenchmarkNewFrameFromCaller. https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ

claude added 13 commits June 10, 2026 17:28

style: lint benchmark fixtures (gofumpt, unused fields)

0af4c45

https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Performance optimizations for hot paths and stack trace handling#247

Performance optimizations for hot paths and stack trace handling#247
samber wants to merge 14 commits into
masterfrom
perf/hot-path-optimizations

samber commented Jun 14, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 14, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

samber commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key Changes

Stack Trace Optimization

DAG and Service Resolution

Service Instance Caching

Scope and Lifecycle

Utilities

Testing

Benchmark Results

Notable Implementation Details

Uh oh!

codecov Bot commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

samber commented Jun 14, 2026 •

edited

Loading

codecov Bot commented Jun 14, 2026 •

edited

Loading