Skip to content

Performance optimizations for hot paths and stack trace handling#247

Draft
samber wants to merge 14 commits into
masterfrom
perf/hot-path-optimizations
Draft

Performance optimizations for hot paths and stack trace handling#247
samber wants to merge 14 commits into
masterfrom
perf/hot-path-optimizations

Conversation

@samber

@samber samber commented Jun 14, 2026

Copy link
Copy Markdown
Owner

Summary

This PR improves performance across multiple hot paths in the dependency injection container and stack trace handling, with a focus on reducing lock contention, allocations, and unnecessary work in frequently-called operations.

Key Changes

Stack Trace Optimization

  • Refactored NewFrameFromCaller() to use runtime.Callers() and runtime.CallersFrames() instead of repeated runtime.Caller() calls, which re-walk the stack on each iteration
  • Implemented a two-phase approach: first try a small batch of frames (4), then fall back to full capture (10) only if needed
  • Extracted frame filtering logic into findEligibleFrame() helper for clarity
  • Added caching for $GOPATH directory splitting and sorting in removeGoPath() to avoid repeated work when the environment variable hasn't changed (re-reads $GOPATH each call, so runtime mutation still works)

DAG and Service Resolution

  • Added fast path in DAG.addDependency() with read lock check before acquiring write lock, reducing serialization for repeated invocations through transient/alias services
  • Introduced DAG.servicesWithoutDependents() method that batches dependency checks under a single lock, replacing repeated explainService() calls during parallel shutdown
  • Optimized Scope.Ancestors() to pre-allocate the result slice and walk the parent chain once, replacing recursive calls

Service Instance Caching

  • Added fast-path read lock in serviceLazy.getInstance() to check if instance is already built before acquiring write lock
  • Applied same optimization to serviceAlias.getInstance() and serviceEager.getInstance()
  • Added defensive atomic.LoadUint32() check before atomic.AddUint32() to avoid contended read-modify-write once frame collection cap is reached

Scope and Lifecycle

  • Optimized Scope.onServiceInvoke() with fast-path read lock to avoid write lock on subsequent invocations
  • Refactored RootScope.queueServiceHealthcheck() to handle non-expiring contexts more efficiently and avoid unnecessary channel wrapping

Utilities

  • Improved orderedUniq() with pre-allocated slice and map based on input length
  • Optimized newUUID() to use hex.Encode() with a fixed buffer instead of fmt.Sprintf(), reducing allocations

Testing

  • Added a comprehensive benchmark suite in a dedicated bench/ package, exercising the library exclusively through its public API (invocation, registration, lifecycle, introspection, and stack-frame capture)
  • A goleak TestMain keeps the goroutine-leak check on the health-check and shutdown benchmarks

Benchmark Results

Baseline (master) vs this branch, measured with the bench/ suite via benchstat. Runs were interleaved over 6 rounds (n=12 each) to average out CPU drift on the shared CI host (-benchmem -benchtime=100ms -count=2 per round; Go 1.24, linux/amd64).

Benchmark time allocs/op
InvokeLazySteadyState −10.5% 6 (unchanged)
InvokeNamed −18.7% 4 (unchanged)
InvokeTransient −4.6% 6 (unchanged)
InvokeAlias −18.7% 9 (unchanged)
InvokeAs −3.5% 11 (unchanged)
InvokeNestedScopes −8.7% 6 (unchanged)
ProviderChainTransient −19.7% 24 (unchanged)
Provide −18.6% 1426 → 1021 (−28%)
New −34.1% 16 → 11 (−31%)
HealthCheck −27.7% 625 → 475 (−24%)
Shutdown −28.2% 946 → 546 (−42%)
ListProvidedServices −25.3% 166 → 117 (−30%)
Ancestors −88.8% 14 → 1 (−93%)
NewFrameFromCaller −30.3% 6 → 2 (−67%)
NewFrameFromPC −29.4% 1 → 0 (−100%)
geomean −27.9%

All deltas are statistically significant (p < 0.001, except InvokeAs at p = 0.03). Steady-state invocation byte/alloc counts are unchanged — the remaining per-invoke allocations come from the invoker chain and virtual scope required for circular-dependency detection — so those wins come from reduced lock contention and cheaper hot-path work rather than fewer allocations. InvokeLazyParallel is omitted: the shared CI CPU makes the parallel measurement too noisy to report reliably (allocations are unchanged).

Notable Implementation Details

  • All optimizations maintain thread-safety and correctness
  • Fast-path optimizations use read locks before write locks to reduce contention in steady state
  • Stack trace improvements reduce both CPU and allocations for frame discovery
  • Shutdown optimization reduces lock acquisitions during parallel service shutdown by batching dependency checks

https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ

claude added 13 commits June 10, 2026 17:28
…cktrace hot paths

Covers steady-state and parallel lazy invocation, named/transient/alias/
generic-type invocation, nested scope resolution, provider chains (DAG
writes), registration, scope creation, health checks, dependency-aware
shutdown, service listing, ancestor traversal, and stack frame capture.

Baseline for upcoming targeted optimizations; classic b.N loops to stay
compatible with the go 1.18 directive.

https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
…n frame capture

serviceLazy.getInstance took the write lock on every call, serializing
all concurrent invocations of an already-built service. Add a
double-checked fast path: read built/instance under RLock, fall back to
the write lock and re-check before building. Both paths access the
fields under s.mu, so the RWMutex provides the required happens-before
ordering (shutdown's built=false reset is observed correctly).

Also prefix the invocation-frame counter check (lazy/eager/alias) with
an atomic load so steady-state calls stop performing a contended
read-modify-write once the MaxInvocationFrames cap is reached; the
atomic add remains the authoritative gate, so the frame cap semantics
are unchanged (counter is unexported and never read elsewhere).

benchstat vs previous commit (count=10, go1.24.7, linux/amd64),
significant rows only:

                       sec/op (old)  sec/op (new)  delta
InvokeNamed-4          223.2n ± 3%   217.8n ± 6%   -2.40% (p=0.006)
InvokeAlias-4          941.6n ± 3%   894.7n ± 2%   -4.98% (p=0.000)
InvokeAs-4             1.670µ ± 5%   1.588µ ± 4%   -4.91% (p=0.019)
InvokeNestedScopes-4   552.5n ± 4%   523.4n ± 5%   -5.27% (p=0.019)
geomean                                            -4.39%

https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
NewFrameFromCaller called runtime.Caller(i) for i=0..9, re-walking the
stack from scratch on every iteration, and called runtime.GOROOT() once
per frame. Capture up to 10 frames with one runtime.Callers call and
iterate via runtime.CallersFrames, hoisting the GOROOT lookup out of
the loop.

Frame-selection semantics are preserved: skip=1 makes the first
examined frame NewFrameFromCaller itself (the runtime.Caller(0)
referent), the frame filter expression is unchanged, an empty
frame.Function maps to the old runtime.FuncForPC()==nil break, and at
most 10 logical frames are examined.

benchstat vs baseline (count=10, go1.24.7, linux/amd64):

                      sec/op (base)  sec/op (new)  delta
NewFrameFromCaller-4  2.184µ ± 3%    1.765µ ± 2%   -19.18% (p=0.000)

B/op -32.35%, allocs/op -33.33% (6 -> 4)

https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
removeGoPath re-read $GOPATH, re-split it, and re-sorted the result on
every call (i.e. for every stack frame examined during frame capture).
Cache the split+sorted directories in an atomic.Value keyed by the raw
environment value. $GOPATH is still read per call on purpose: it can be
mutated at runtime (TestRemoveGoPath does), and the cache recomputes
whenever the raw value changes. A racing Store only causes a redundant
recompute.

benchstat with GOPATH set (count=10, go1.24.7, linux/amd64):

                sec/op (base)  sec/op (new)  delta
RemoveGoPath-4  517.2n ± 4%    390.6n ± 14%  -24.5%

B/op 64 -> 24, allocs/op 3 -> 1

https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
The recursive append([]*Scope{parent}, parent.Ancestors()...) pattern
allocated a fresh slice at every level (O(n^2) allocations for n
ancestors). Walk the immutable parentScope chain twice instead: once to
count, once to fill a right-sized slice. Same order (immediate parent
to root) and still a non-nil empty slice for the root scope.

benchstat (count=10, go1.24.7, linux/amd64):

             sec/op (old)  sec/op (new)  delta
Ancestors-4  673.20n ± 4%  69.97n ± 13%  -89.61% (p=0.000)

B/op 496 -> 80, allocs/op 14 -> 1 (10-deep scope chain)

https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
onServiceInvoke took the scope write lock on every invocation even
though the orderedInvocation entry only needs to be written once. Check
presence under a read lock first and fall back to the write lock with a
re-check, so steady-state invocations of the same scope no longer
serialize on the scope mutex.

benchstat (count=10, go1.24.7, linux/amd64):

                         sec/op (old)  sec/op (new)  delta
InvokeLazySteadyState-4  429.2n ± 20%  404.3n ± 7%   -5.78% (p=0.003)

https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
Every nested service resolution records its dependency edge in the
global DAG under the write lock, even when the edge already exists —
which is the common case for repeated invocations through transient and
alias services. Check both directions of the edge under a read lock
first; fall back to the write lock (inserts are idempotent) only when
the edge is new.

benchstat (count=10, go1.24.7, linux/amd64):

                          sec/op (old)  sec/op (new)  delta
InvokeTransient-4         447.4n ± 7%   419.9n ± 3%    -6.15% (p=0.000)
ProviderChainTransient-4  2.873µ ± 6%   2.352µ ± 2%   -18.13% (p=0.000)

https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
orderedUniq started from a zero-capacity slice and map, forcing
repeated growth while deduplicating service lists. Size both to the
input length (the output can never be larger).

benchstat (count=10, go1.24.7, linux/amd64):

                        sec/op (old)  sec/op (new)  delta
ListProvidedServices-4  53.54µ ± 5%   36.45µ ± 4%   -31.92% (p=0.000)

B/op -21.56%, allocs/op -34.04% (3-deep scopes x 10 services)

https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
newUUID spent most of its time in fmt.Sprintf reflection. Encode the
same 16 random bytes (variant/version bits unchanged) into a fixed
36-byte buffer with hex.Encode and literal dashes — output format is
identical (lowercase hex, 8-4-4-4-12). Runs on every scope creation.

benchstat (count=10, go1.24.7, linux/amd64):

       sec/op (old)   sec/op (new)  delta
New-4  1143.0n ± 6%   707.8n ± 10%  -38.08% (p=0.000)

B/op -14.56%, allocs/op 16 -> 11

https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
Each shutdown round called listServices() twice (two lock acquisitions
and key-slice allocations) and queried dag.explainService per service —
one DAG read lock plus two slice allocations each, even though only the
dependent count matters. Add DAG.servicesWithoutDependents, which
filters a whole round under a single read lock without allocating
per-service lists, and snapshot the service list once per round.

Round-by-round ordering and the circular-dependency fallback (shut down
all remaining services when none are dependent-free) are unchanged.

benchstat (count=10, go1.24.7, linux/amd64):

            sec/op (old)  sec/op (new)  delta
Shutdown-4  304.0µ ± 12%  224.5µ ± 9%   -26.16% (p=0.000)

B/op -16.93%, allocs/op 880 -> 480 (20-service dependency chain)

https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
The non-pooled health check path always spawned two goroutines and two
channels per service to race the check against ctx.Done() — but with
the default options (no HealthCheckTimeout, no HealthCheckGlobalTimeout,
background context) the context can never expire and the race is dead
code. Use one goroutine when ctx.Done() == nil; keep the two-goroutine
race (with the same ErrHealthCheckTimeout wrapping and buffered,
leak-free channels) whenever the context can be canceled.

benchstat (count=10, go1.24.7, linux/amd64):

               sec/op (old)  sec/op (new)  delta
HealthCheck-4  83.15µ ± 26%  60.72µ ± 10%  -26.97% (p=0.000)

B/op -32.05%, allocs/op 469 -> 319 (50 healthcheckable services)

https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
The single full-depth runtime.Callers capture regressed registration
paths: the eligible frame is almost always one of the immediate callers
(the frame filter accepts do-package frames outside do/stacktrace), so
walking 10 frames upfront did wasted work where the old loop exited
after 2-3 frames. Capture a 4-frame batch first and re-walk the full 10
frames only when the batch is exhausted without a match.

Frame-selection semantics unchanged: same filter, same 10-logical-frame
cap, the fallback rescans from the same skip depth.

benchstat vs baseline (count=10, go1.24.7, linux/amd64):

                      sec/op (base)  sec/op (new)  delta
Provide-4             273.7µ ± 2%    213.5µ ± 10%  -21.98% (p=0.000)
NewFrameFromCaller-4  2.184µ ± 3%    1.669µ ± 5%   -23.60% (p=0.000)

(Provide had regressed to +10.8% with the single-stage capture; this
restores and improves it. allocs/op for Provide: 1526 -> 821.)

https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
@codecov

codecov Bot commented Jun 14, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.42553% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.94%. Comparing base (303f9cf) to head (0f2d855).

Files with missing lines Patch % Lines
stacktrace/stacktrace.go 65.00% 5 Missing and 2 partials ⚠️
dag.go 86.66% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #247      +/-   ##
==========================================
+ Coverage   89.93%   89.94%   +0.01%     
==========================================
  Files          23       23              
  Lines        2155     2217      +62     
==========================================
+ Hits         1938     1994      +56     
- Misses        183      188       +5     
- Partials       34       35       +1     
Flag Coverage Δ
unittests 89.94% <90.42%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Consolidate the do and stacktrace benchmarks under bench/, exercising
the library through its public API only (qualified imports, no dot
imports per the revive lint rules). A goleak TestMain preserves the
goroutine-leak check for the health-check and shutdown benchmarks.

The stacktrace removeGoPath micro-benchmark is dropped: it covered an
unexported internal that the public-API package cannot reach, and it is
already exercised on the hot path by BenchmarkNewFrameFromCaller.

https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants