delta_cdf incremental strategy — Delta CDF POC (draft)#9840
Draft
a-monteiro wants to merge 2 commits into
Draft
Conversation
A custom dbt-trino incremental strategy that applies a Delta Change Data Feed change-set (table_changes) instead of a wall-clock lookback + MERGE. - cdf/ : source_changes (bootstrap snapshot vs table_changes feed, with uint256/int256 re-decode since table_changes drops the logical type to varbinary) and the watermark macros (read/advance/current source version, stored as the dune.cdf.source_version table property). - get_incremental_delta_cdf_sql: the strategy macro - dedup the feed to the latest change per unique_key, MERGE upsert, prune the target scan to the change-set partition range, then stamp the watermark in one MERGE;ALTER. - adapters: forward change_data_feed_enabled into the CTAS, and use a connector-managed location for CDF tables (an explicit location breaks credential vending and the stats-on-write read).
Throwaway POC artifacts under _poc_cdf/ that exercise the delta_cdf strategy against a bounded clone of tokens_bnb.base_transfers, plus a baseline target for an apples-to-apples A/B. transfers_enrich_cdf is a CDF-aware fork of transfers_enrich (change feed source, prices bounded to the change-set time range, CDF metadata carried only on the incremental path). Delete once the strategy is productionized.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Draft / POC — not for merge. A working proof-of-concept for a Delta Change Data Feed (CDF) incremental strategy that replaces the wall-clock-lookback + MERGE pattern with reading only the upstream's change feed.
MERGE is ~83% of fleet Trino CPU, and the incremental models that drive it routinely scan millions of input rows to write a few hundred (e.g.
tokens_bnb.transfersaverages 1.68M input rows → 406 output). CDF reads only what changed since the model's last run.Built and measured live against a bounded clone of
tokens_bnb.base_transfers. Applying the same change set, CDF vs the wall-clock-lookback baseline: 7.1× less CPU (161.8s vs 1,144.5s), 12× less I/O (1.17GB vs 14.1GB), ~490× fewer rows written. A target-scan partition-pruning optimization adds ~4.8× on a 4-partition target (scans only the touched partition; grows to 12×+ on ablock_monthtable with history).amount_rawwas bit-exact; the only divergence wasamount_usdin one hour where an upstream price was revised between runs — the expected semantic limit (CDF doesn't restate already-loaded rows).Shippable infra (commit 1): the
delta_cdfstrategy macro, thecdf/macros (source_changes+ watermark), and theadapters.sqlchanges. Throwaway POC (commit 2): everything under_poc_cdf/andtransfers_enrich_cdf.Opportunity sizing (full analysis in
~/Dune/docs/cdf-candidate-models.md)Per-subproject daily MERGE cost vs estimated CDF savings (P1 = usable now, P2 = needs CDF on raw ingestion):
Totals: ~223 CPU-h/day usable now (P1), ~581 CPU-h/day full (P1+P2), ~72 TB/day I/O.
Capacity, not runtime. The CDF cost concentrates on the 24/7 clusters → it's a worker-capacity/downsize win, not a job-runtime win. The scale-to-zero builds (daily/solana) are gated by non-CDF long-poles, so their wall-clock lever is DAG parallelism (daily runs
threads=8, thread-capped at its 04:00 trough; ~1.5–2h recoverable) — separate from CDF. curated-data is a weak CDF target (~53% of its cost is stateful aggregations/windows that need a recompute-affected-groups pattern, not this merge).Prerequisites before any real CDF spell ships
system.table_changesisPERMISSION_DENIEDfor thespellbookTrino role (works only asadmin) — needs a function-execution grant.CREATE OR REPLACEis unsupported on CDF tables, so--full-refreshneeds aDROP-first materialization tweak.table_changesdrops Duneuint256/int256to rawvarbinary;source_changesre-decodes them in SQL the same way Dune's raw views already do (_unsafe_uint256-style) — this is the permanent fix, not a connector change.Towards CUR2-2963