Performance optimizations for hot paths and stack trace handling#247
Draft
samber wants to merge 14 commits into
Draft
Performance optimizations for hot paths and stack trace handling#247samber wants to merge 14 commits into
samber wants to merge 14 commits into
Conversation
…cktrace hot paths Covers steady-state and parallel lazy invocation, named/transient/alias/ generic-type invocation, nested scope resolution, provider chains (DAG writes), registration, scope creation, health checks, dependency-aware shutdown, service listing, ancestor traversal, and stack frame capture. Baseline for upcoming targeted optimizations; classic b.N loops to stay compatible with the go 1.18 directive. https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
…n frame capture
serviceLazy.getInstance took the write lock on every call, serializing
all concurrent invocations of an already-built service. Add a
double-checked fast path: read built/instance under RLock, fall back to
the write lock and re-check before building. Both paths access the
fields under s.mu, so the RWMutex provides the required happens-before
ordering (shutdown's built=false reset is observed correctly).
Also prefix the invocation-frame counter check (lazy/eager/alias) with
an atomic load so steady-state calls stop performing a contended
read-modify-write once the MaxInvocationFrames cap is reached; the
atomic add remains the authoritative gate, so the frame cap semantics
are unchanged (counter is unexported and never read elsewhere).
benchstat vs previous commit (count=10, go1.24.7, linux/amd64),
significant rows only:
sec/op (old) sec/op (new) delta
InvokeNamed-4 223.2n ± 3% 217.8n ± 6% -2.40% (p=0.006)
InvokeAlias-4 941.6n ± 3% 894.7n ± 2% -4.98% (p=0.000)
InvokeAs-4 1.670µ ± 5% 1.588µ ± 4% -4.91% (p=0.019)
InvokeNestedScopes-4 552.5n ± 4% 523.4n ± 5% -5.27% (p=0.019)
geomean -4.39%
https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
NewFrameFromCaller called runtime.Caller(i) for i=0..9, re-walking the
stack from scratch on every iteration, and called runtime.GOROOT() once
per frame. Capture up to 10 frames with one runtime.Callers call and
iterate via runtime.CallersFrames, hoisting the GOROOT lookup out of
the loop.
Frame-selection semantics are preserved: skip=1 makes the first
examined frame NewFrameFromCaller itself (the runtime.Caller(0)
referent), the frame filter expression is unchanged, an empty
frame.Function maps to the old runtime.FuncForPC()==nil break, and at
most 10 logical frames are examined.
benchstat vs baseline (count=10, go1.24.7, linux/amd64):
sec/op (base) sec/op (new) delta
NewFrameFromCaller-4 2.184µ ± 3% 1.765µ ± 2% -19.18% (p=0.000)
B/op -32.35%, allocs/op -33.33% (6 -> 4)
https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
removeGoPath re-read $GOPATH, re-split it, and re-sorted the result on
every call (i.e. for every stack frame examined during frame capture).
Cache the split+sorted directories in an atomic.Value keyed by the raw
environment value. $GOPATH is still read per call on purpose: it can be
mutated at runtime (TestRemoveGoPath does), and the cache recomputes
whenever the raw value changes. A racing Store only causes a redundant
recompute.
benchstat with GOPATH set (count=10, go1.24.7, linux/amd64):
sec/op (base) sec/op (new) delta
RemoveGoPath-4 517.2n ± 4% 390.6n ± 14% -24.5%
B/op 64 -> 24, allocs/op 3 -> 1
https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
The recursive append([]*Scope{parent}, parent.Ancestors()...) pattern
allocated a fresh slice at every level (O(n^2) allocations for n
ancestors). Walk the immutable parentScope chain twice instead: once to
count, once to fill a right-sized slice. Same order (immediate parent
to root) and still a non-nil empty slice for the root scope.
benchstat (count=10, go1.24.7, linux/amd64):
sec/op (old) sec/op (new) delta
Ancestors-4 673.20n ± 4% 69.97n ± 13% -89.61% (p=0.000)
B/op 496 -> 80, allocs/op 14 -> 1 (10-deep scope chain)
https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
onServiceInvoke took the scope write lock on every invocation even
though the orderedInvocation entry only needs to be written once. Check
presence under a read lock first and fall back to the write lock with a
re-check, so steady-state invocations of the same scope no longer
serialize on the scope mutex.
benchstat (count=10, go1.24.7, linux/amd64):
sec/op (old) sec/op (new) delta
InvokeLazySteadyState-4 429.2n ± 20% 404.3n ± 7% -5.78% (p=0.003)
https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
Every nested service resolution records its dependency edge in the
global DAG under the write lock, even when the edge already exists —
which is the common case for repeated invocations through transient and
alias services. Check both directions of the edge under a read lock
first; fall back to the write lock (inserts are idempotent) only when
the edge is new.
benchstat (count=10, go1.24.7, linux/amd64):
sec/op (old) sec/op (new) delta
InvokeTransient-4 447.4n ± 7% 419.9n ± 3% -6.15% (p=0.000)
ProviderChainTransient-4 2.873µ ± 6% 2.352µ ± 2% -18.13% (p=0.000)
https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
orderedUniq started from a zero-capacity slice and map, forcing
repeated growth while deduplicating service lists. Size both to the
input length (the output can never be larger).
benchstat (count=10, go1.24.7, linux/amd64):
sec/op (old) sec/op (new) delta
ListProvidedServices-4 53.54µ ± 5% 36.45µ ± 4% -31.92% (p=0.000)
B/op -21.56%, allocs/op -34.04% (3-deep scopes x 10 services)
https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
newUUID spent most of its time in fmt.Sprintf reflection. Encode the
same 16 random bytes (variant/version bits unchanged) into a fixed
36-byte buffer with hex.Encode and literal dashes — output format is
identical (lowercase hex, 8-4-4-4-12). Runs on every scope creation.
benchstat (count=10, go1.24.7, linux/amd64):
sec/op (old) sec/op (new) delta
New-4 1143.0n ± 6% 707.8n ± 10% -38.08% (p=0.000)
B/op -14.56%, allocs/op 16 -> 11
https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
Each shutdown round called listServices() twice (two lock acquisitions
and key-slice allocations) and queried dag.explainService per service —
one DAG read lock plus two slice allocations each, even though only the
dependent count matters. Add DAG.servicesWithoutDependents, which
filters a whole round under a single read lock without allocating
per-service lists, and snapshot the service list once per round.
Round-by-round ordering and the circular-dependency fallback (shut down
all remaining services when none are dependent-free) are unchanged.
benchstat (count=10, go1.24.7, linux/amd64):
sec/op (old) sec/op (new) delta
Shutdown-4 304.0µ ± 12% 224.5µ ± 9% -26.16% (p=0.000)
B/op -16.93%, allocs/op 880 -> 480 (20-service dependency chain)
https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
The non-pooled health check path always spawned two goroutines and two
channels per service to race the check against ctx.Done() — but with
the default options (no HealthCheckTimeout, no HealthCheckGlobalTimeout,
background context) the context can never expire and the race is dead
code. Use one goroutine when ctx.Done() == nil; keep the two-goroutine
race (with the same ErrHealthCheckTimeout wrapping and buffered,
leak-free channels) whenever the context can be canceled.
benchstat (count=10, go1.24.7, linux/amd64):
sec/op (old) sec/op (new) delta
HealthCheck-4 83.15µ ± 26% 60.72µ ± 10% -26.97% (p=0.000)
B/op -32.05%, allocs/op 469 -> 319 (50 healthcheckable services)
https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
The single full-depth runtime.Callers capture regressed registration
paths: the eligible frame is almost always one of the immediate callers
(the frame filter accepts do-package frames outside do/stacktrace), so
walking 10 frames upfront did wasted work where the old loop exited
after 2-3 frames. Capture a 4-frame batch first and re-walk the full 10
frames only when the batch is exhausted without a match.
Frame-selection semantics unchanged: same filter, same 10-logical-frame
cap, the fallback rescans from the same skip depth.
benchstat vs baseline (count=10, go1.24.7, linux/amd64):
sec/op (base) sec/op (new) delta
Provide-4 273.7µ ± 2% 213.5µ ± 10% -21.98% (p=0.000)
NewFrameFromCaller-4 2.184µ ± 3% 1.669µ ± 5% -23.60% (p=0.000)
(Provide had regressed to +10.8% with the single-stage capture; this
restores and improves it. allocs/op for Provide: 1526 -> 821.)
https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #247 +/- ##
==========================================
+ Coverage 89.93% 89.94% +0.01%
==========================================
Files 23 23
Lines 2155 2217 +62
==========================================
+ Hits 1938 1994 +56
- Misses 183 188 +5
- Partials 34 35 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Consolidate the do and stacktrace benchmarks under bench/, exercising the library through its public API only (qualified imports, no dot imports per the revive lint rules). A goleak TestMain preserves the goroutine-leak check for the health-check and shutdown benchmarks. The stacktrace removeGoPath micro-benchmark is dropped: it covered an unexported internal that the public-API package cannot reach, and it is already exercised on the hot path by BenchmarkNewFrameFromCaller. https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR improves performance across multiple hot paths in the dependency injection container and stack trace handling, with a focus on reducing lock contention, allocations, and unnecessary work in frequently-called operations.
Key Changes
Stack Trace Optimization
NewFrameFromCaller()to useruntime.Callers()andruntime.CallersFrames()instead of repeatedruntime.Caller()calls, which re-walk the stack on each iterationfindEligibleFrame()helper for clarity$GOPATHdirectory splitting and sorting inremoveGoPath()to avoid repeated work when the environment variable hasn't changed (re-reads$GOPATHeach call, so runtime mutation still works)DAG and Service Resolution
DAG.addDependency()with read lock check before acquiring write lock, reducing serialization for repeated invocations through transient/alias servicesDAG.servicesWithoutDependents()method that batches dependency checks under a single lock, replacing repeatedexplainService()calls during parallel shutdownScope.Ancestors()to pre-allocate the result slice and walk the parent chain once, replacing recursive callsService Instance Caching
serviceLazy.getInstance()to check if instance is already built before acquiring write lockserviceAlias.getInstance()andserviceEager.getInstance()atomic.LoadUint32()check beforeatomic.AddUint32()to avoid contended read-modify-write once frame collection cap is reachedScope and Lifecycle
Scope.onServiceInvoke()with fast-path read lock to avoid write lock on subsequent invocationsRootScope.queueServiceHealthcheck()to handle non-expiring contexts more efficiently and avoid unnecessary channel wrappingUtilities
orderedUniq()with pre-allocated slice and map based on input lengthnewUUID()to usehex.Encode()with a fixed buffer instead offmt.Sprintf(), reducing allocationsTesting
bench/package, exercising the library exclusively through its public API (invocation, registration, lifecycle, introspection, and stack-frame capture)goleakTestMainkeeps the goroutine-leak check on the health-check and shutdown benchmarksBenchmark Results
Baseline (
master) vs this branch, measured with thebench/suite viabenchstat. Runs were interleaved over 6 rounds (n=12 each) to average out CPU drift on the shared CI host (-benchmem -benchtime=100ms -count=2per round; Go 1.24, linux/amd64).All deltas are statistically significant (p < 0.001, except InvokeAs at p = 0.03). Steady-state invocation byte/alloc counts are unchanged — the remaining per-invoke allocations come from the invoker chain and virtual scope required for circular-dependency detection — so those wins come from reduced lock contention and cheaper hot-path work rather than fewer allocations.
InvokeLazyParallelis omitted: the shared CI CPU makes the parallel measurement too noisy to report reliably (allocations are unchanged).Notable Implementation Details
https://claude.ai/code/session_019ESFVztuAbfjQv9N9dJnXZ