perf(tokens): seed celo balances prior_balances from a current-balance state table#9846
Draft
a-monteiro wants to merge 1 commit into
Draft
perf(tokens): seed celo balances prior_balances from a current-balance state table#9846a-monteiro wants to merge 1 commit into
a-monteiro wants to merge 1 commit into
Conversation
…e state table Replace the full-history prior_balances self-scan in balances_daily_agg_from_transfers with a small key-grained current-balance state table (tokens_celo_balances_current), read by the base model via adapter.get_relation so there is no dbt dependency cycle. Falls back to the original full-history aggregation when the state table does not yet exist (first deploy and CI). Equivalence proven by the max_by split-at-cutoff identity (EXCEPT=0 both ways on prod celo data).
Member
Author
This stack of pull requests is managed by Graphite. Learn more about stacking. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

What
tokens_celo_balances_daily_agg_baserebuilds each wallet's running token balance every hour by seeding from aprior_balancesCTE that does a full-history self-scan of its own output:max_by(balance_raw, day)over{{ this }}whereblock_time < window_start. On celo that scan reads 2.13B rows / 54.4 GB every run and its hash-aggregation is the model's dominant cost.This PR seeds
prior_balancesfrom a small, key-grained current-balance state table (tokens_celo_balances_current) — one row per(address, token_address, token_standard)holding that key's latest balance below the incremental window — instead of re-aggregating all history. The base readsstate(< T-5)UNION a bounded recent gap[T-7, T-3)from its own output and takesmax_byover the union, which is exactly equal to the old full-history aggregation.The macro change is backward-compatible:
balances_daily_agg_from_transfersgains an optionalcurrent_balances_identifierparam. Only celo passes it; unichain (the only other caller, ~0.4 CPU-hrs/day) is unchanged and keeps the original code path.Why it is correct
Equivalence rests on the
max_bysplit-at-cutoff identity:max_by(v, day)overA ∪ Bequalsmax_by(v, day)over(max_by-per-key(A) ∪ B)— the global argmax-by-day is the argmax among partial argmaxes, regardless of any overlap betweenAandB.Proven read-only on prod celo data (
spellbook-hourly, bounded to history[2026-06-01, window_start)for cost, UTC), both with a clean split and with the exact deployed predicates (state< T-5, gap[T-7, T-3), overlap[T-7, T-5)):prior_balancesis the only thing this PR changes; the rest of the model (cumulative_flows, the seed join, the clamp/cast) is untouched, so a bit-identicalprior_balanceyields a bit-identicalbalance_raw.Design / cycle-break (please review)
tokens_celo_balances_currentref()s the base table, so dbt builds the base first and the state table after. The base reads the state table viaadapter.get_relation()(noref, hence no dbt cycle) and falls back to the original full-history aggregation when the state table does not exist yet — i.e. on first deploy and in CI, where the base builds before the state table is present. This keeps the critical base model and its CI regression test behaving exactly as today until the state table is in place.Because the base reads the previous run's state, the state cutoff (
< T-5) deliberately lags the base window (< T-3); the base unions the[T-7, T-3)gap from its own output so the result is exact even if the state table is up to ~2 days stale. The state table maintains itself incrementally with a self-healing catch-up (day >= (select max(day) from {{ this }})), so a missed run is absorbed automatically.Measured baseline vs predicted impact
Baseline (celo
balances_daily_agg_base, 24 MERGE builds/day, fromanalyze_query): prior_balances self-scan 2.13B rows / 54.4 GB, its hash-agg is ~73% of model CPU, peak 92.8 GB (growing with history; an OOM and cluster-memory-floor risk). Model total ≈ 28.5 CPU-hrs/day.Predicted (the magnitude cannot be measured fully read-only — the state read and the new peak only exist once the table is deployed; equivalence above is what is proven now): the agg input drops from 2.13B rows to ~166M (state ~146M + gap ~20M), so the ~237 GB exchange that drives the 92.8 GB peak shrinks ~15-20x. Expect IO ~54 GB -> ~12 GB/build, peak 92.8 GB -> well under node limits, per-build wall (~12 min) materially lower, less the new (bounded, mostly no-op within a day) state-maintenance build. To be confirmed on the first post-deploy incremental.
Rollout notes / caveats
prior_balances) before the state table exists. The fast-path SQL is the EXCEPT-proven query above; monitor the first incremental run.tokens_celo_balances_current(the state is derived from the base).tokens_celo_balances_currentis unpartitioned; if the per-run merge into the ~146M-row target proves heavy, clustering/Z-ordering the state onaddressis a follow-up.Fixes CUR2-2723