perf(tokens): seed celo balances prior_balances from a current-balance state table#9846

Draft
a-monteiro wants to merge 1 commit into main from andre/cur2-2723-tokens-celo-balances-current


@a-monteiro a-monteiro commented Jun 29, 2026


What

tokens_celo_balances_daily_agg_base rebuilds each wallet's running token balance every hour by seeding from a prior_balances CTE that does a full-history self-scan of its own output: max_by(balance_raw, day) over {{ this }} where block_time < window_start. On celo that scan reads 2.13B rows / 54.4 GB every run and its hash-aggregation is the model's dominant cost.

This PR seeds prior_balances from a small, key-grained current-balance state table (tokens_celo_balances_current) — one row per (address, token_address, token_standard) holding that key's latest balance below the incremental window — instead of re-aggregating all history. The base reads state(< T-5) UNION a bounded recent gap [T-7, T-3) from its own output and takes max_by over the union, which is exactly equal to the old full-history aggregation.

The macro change is backward-compatible: balances_daily_agg_from_transfers gains an optional current_balances_identifier param. Only celo passes it; unichain (the only other caller, ~0.4 CPU-hrs/day) is unchanged and keeps the original code path.

Why it is correct

Equivalence rests on the max_by split-at-cutoff identity: max_by(v, day) over A ∪ B equals max_by(v, day) over (max_by-per-key(A) ∪ B) — the global argmax-by-day is the argmax among partial argmaxes, regardless of any overlap between A and B.
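
The identity can be checked on toy data. The sketch below is illustrative only: the `(key, day, value)` row layout, the `max_by` helper, and the sample balances are assumptions made for the example, not the macro's actual SQL. It also assumes one row per key per day, so a row that appears in both A and B carries the same value.

```python
def max_by(rows):
    """Per key, keep the (day, value) pair with the greatest day."""
    best = {}
    for key, day, value in rows:
        if key not in best or day > best[key][0]:
            best[key] = (day, value)
    return best


# Hypothetical daily balances: (key, day, value).
history = [
    ("w1", 1, 10), ("w1", 4, 25), ("w1", 6, 40),
    ("w2", 2, 7),
    ("w3", 3, 5), ("w3", 4, 9),
]

# A plays the role of the state input, B the recent gap; they overlap on days 3-4.
A = [r for r in history if r[1] < 5]
B = [r for r in history if 3 <= r[1] < 7]

# Old path: aggregate every row of A and B.
full = max_by(A + B)

# New path: collapse A to one row per key first, then aggregate with B.
state = [(key, day, value) for key, (day, value) in max_by(A).items()]
split = max_by(state + B)

print(full == split)
```

The collapsed state has to keep each key's `day` next to its value; without it the second `max_by` could not rank state rows against gap rows.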

Verified with read-only queries on prod celo data (spellbook-hourly; history bounded to [2026-06-01, window_start), UTC, to limit cost), both with a clean split and with the exact deployed predicates (state < T-5, gap [T-7, T-3), overlap [T-7, T-5)):

old_prior EXCEPT new_prior = 0
new_prior EXCEPT old_prior = 0   (32,249,278 keys, identical)
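
Both directions matter because EXCEPT is one-sided: an empty `old EXCEPT new` only shows that no old row is missing, not that no extra row was added. A minimal sketch with Python sets standing in for the two result sets; the rows and the `except_rows` helper are made up for illustration:

```python
def except_rows(left, right):
    """Rows in left that are absent from right, like SQL EXCEPT on distinct rows."""
    return set(left) - set(right)


# Hypothetical (address, token_address, prior_balance) rows.
old_prior = {("w1", "tokenA", 40), ("w2", "tokenA", 7)}
new_prior = {("w1", "tokenA", 40), ("w2", "tokenA", 7)}

# A deliberately wrong result with one spurious extra key.
new_with_extra = new_prior | {("w3", "tokenA", 9)}

# Identical sets: both differences are empty.
print(except_rows(old_prior, new_prior), except_rows(new_prior, old_prior))

# A superset passes the first check and is caught only by the second.
print(except_rows(old_prior, new_with_extra), except_rows(new_with_extra, old_prior))
```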

prior_balances is the only thing this PR changes; the rest of the model (cumulative_flows, the seed join, the clamp/cast) is untouched, so a bit-identical prior_balance yields a bit-identical balance_raw.

Design / cycle-break (please review)

tokens_celo_balances_current ref()s the base table, so dbt builds the base first and the state table after. The base reads the state table via adapter.get_relation() (no ref, hence no dbt cycle) and falls back to the original full-history aggregation when the state table does not exist yet — i.e. on first deploy and in CI, where the base builds before the state table is present. This keeps the critical base model and its CI regression test behaving exactly as today until the state table is in place.

Because the base reads the previous run's state, the state cutoff (< T-5) deliberately lags the base window (< T-3); the base unions the [T-7, T-3) gap from its own output so the result is exact even if the state table is up to ~2 days stale. The state table maintains itself incrementally with a self-healing catch-up (day >= (select max(day) from {{ this }})), so a missed run is absorbed automatically.
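
The staleness tolerance follows from the offsets: the state covers days below its own cutoff, the gap covers [T-7, T-3), and there is no hole as long as the state cutoff has not fallen behind the gap start. A small sketch of that arithmetic, assuming T is a calendar day and that a stale state simply means it was built on an earlier run day; the function and constant names are hypothetical:

```python
from datetime import date, timedelta

# Offsets taken from the PR text; day granularity is an assumption.
STATE_CUTOFF_LAG = timedelta(days=5)  # state holds rows with day < T - 5
GAP_START_LAG = timedelta(days=7)     # gap starts at T - 7 (inclusive)
WINDOW_START_LAG = timedelta(days=3)  # gap ends at T - 3 (exclusive)


def history_is_covered(state_run_day: date, base_run_day: date) -> bool:
    """True when state plus gap leave no uncovered day below the base window."""
    state_cutoff = state_run_day - STATE_CUTOFF_LAG
    gap_start = base_run_day - GAP_START_LAG
    window_start = base_run_day - WINDOW_START_LAG
    # No hole between state and gap, and the state never reaches into the window.
    return gap_start <= state_cutoff <= window_start


today = date(2026, 6, 29)
for stale_days in range(4):
    state_day = today - timedelta(days=stale_days)
    print(stale_days, history_is_covered(state_day, today))
```

Under these assumptions a state built up to two days earlier still lines up with the gap, while a three-day-old state would leave day T-8 uncovered.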

Measured baseline vs predicted impact

Baseline (celo balances_daily_agg_base, 24 MERGE builds/day, from analyze_query): prior_balances self-scan 2.13B rows / 54.4 GB, its hash-agg is ~73% of model CPU, peak 92.8 GB (growing with history; an OOM and cluster-memory-floor risk). Model total ≈ 28.5 CPU-hrs/day.

Predicted (the magnitude cannot be fully measured read-only: the state read and the new peak only exist once the table is deployed; the equivalence above is what is proven now): the agg input drops from 2.13B rows to ~166M (state ~146M + gap ~20M), so the ~237 GB exchange that drives the 92.8 GB peak shrinks ~15-20x. Expect IO to fall from ~54 GB to ~12 GB per build, peak memory from 92.8 GB to well under node limits, and per-build wall time (~12 min) to drop materially, net of the new state-maintenance build (bounded, and mostly a no-op within a day). To be confirmed on the first post-deploy incremental run.

Rollout notes / caveats

  • The fast path (state UNION gap) first executes in prod, since CI builds the base full-refresh (no prior_balances) before the state table exists. The fast-path SQL is the EXCEPT-proven query above; monitor the first incremental run.
  • A full-refresh of the base should be followed by a rebuild of tokens_celo_balances_current (the state is derived from the base).
  • tokens_celo_balances_current is unpartitioned; if the per-run merge into the ~146M-row target proves heavy, clustering/Z-ordering the state on address is a follow-up.

Fixes CUR2-2723

…e state table

Replace the full-history prior_balances self-scan in
balances_daily_agg_from_transfers with a small key-grained current-balance
state table (tokens_celo_balances_current), read by the base model via
adapter.get_relation so there is no dbt dependency cycle. Falls back to the
original full-history aggregation when the state table does not yet exist
(first deploy and CI). Equivalence proven by the max_by split-at-cutoff
identity (EXCEPT=0 both ways on prod celo data).
github-actions bot added the labels "WIP" (work in progress) and "dbt: tokens" (covers the Tokens dbt subproject) on Jun 29, 2026


@a-monteiro a-monteiro requested a review from 0xRobin June 30, 2026 16:49