delta_cdf incremental strategy — Delta CDF POC (draft) #9840

Draft
a-monteiro wants to merge 2 commits into main from andre/cur2-2963-delta-cdf

a-monteiro (Member) commented Jun 27, 2026

Draft / POC — not for merge. A working proof-of-concept for a Delta Change Data Feed (CDF) incremental strategy that replaces the wall-clock-lookback + MERGE pattern with reading only the upstream's change feed.

MERGE is ~83% of fleet Trino CPU, and the incremental models that drive it routinely scan millions of input rows to write a few hundred (e.g. tokens_bnb.transfers averages 1.68M input rows → 406 output). CDF reads only what changed since the model's last run.
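To make the contrast concrete, here is a minimal Python sketch of the two read patterns. It is illustrative only: the row layout, function names, and values are made up for this example and are not part of the PR, which does this in SQL.

```python
# Upstream rows as (key, event_hour, commit_version). All values are invented.
upstream = [
    ("a", 10, 1),
    ("b", 11, 1),
    ("c", 12, 2),
    ("d", 13, 3),  # the only row committed since the last run
]


def lookback_read(rows, now_hour, window_hours):
    """Baseline: re-read every row whose event time falls in the lookback window."""
    return [r for r in rows if now_hour - r[1] <= window_hours]


def cdf_read(rows, last_version):
    """CDF-style: read only rows committed after the stored source version."""
    return [r for r in rows if r[2] > last_version]


lookback = lookback_read(upstream, now_hour=13, window_hours=3)
changed = cdf_read(upstream, last_version=2)
print(len(lookback), len(changed))  # 4 1
```

The lookback read scales with the window size, while the change-feed read scales with what actually changed, which is the gap the input/output row counts above point at.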

Built and measured live against a bounded clone of tokens_bnb.base_transfers. Applying the same change set, CDF vs the wall-clock-lookback baseline: 7.1× less CPU (161.8s vs 1,144.5s), 12× less I/O (1.17GB vs 14.1GB), ~490× fewer rows written. A target-scan partition-pruning optimization adds ~4.8× on a 4-partition target (scans only the touched partition; grows to 12×+ on a block_month table with history). amount_raw was bit-exact; the only divergence was amount_usd in one hour where an upstream price was revised between runs — the expected semantic limit (CDF doesn't restate already-loaded rows).
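The target-scan pruning idea can be sketched in a few lines of Python. This is a hedged illustration, not the macro itself: `block_month` is the partition column named above, but the helper names and the dates are hypothetical, and the 1-of-4 outcome here is just what this toy input produces.

```python
from datetime import date


def block_month(d: date) -> date:
    """Truncate a date to the first day of its month (the partition value)."""
    return d.replace(day=1)


def partitions_to_scan(target_partitions, change_dates):
    """Keep only target partitions inside the change set's min/max month."""
    if not change_dates:
        return []  # nothing changed, so nothing needs scanning
    months = [block_month(d) for d in change_dates]
    lo, hi = min(months), max(months)
    return [p for p in target_partitions if lo <= p <= hi]


# A hypothetical 4-partition target and a change set confined to one month.
target_partitions = [date(2026, m, 1) for m in (3, 4, 5, 6)]
change_dates = [date(2026, 6, 26), date(2026, 6, 27)]

scanned = partitions_to_scan(target_partitions, change_dates)
print(scanned)  # [datetime.date(2026, 6, 1)]
```

Because the range is a min/max bound, a change set that touches two distant months would still scan everything in between; that trade-off is an assumption of this sketch.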

Shippable infra (commit 1): the delta_cdf strategy macro, the cdf/ macros (source_changes + watermark), and the adapters.sql changes. Throwaway POC (commit 2): everything under _poc_cdf/ and transfers_enrich_cdf.

Opportunity sizing (full analysis in ~/Dune/docs/cdf-candidate-models.md)

Per-subproject daily MERGE cost vs estimated CDF savings (P1 = usable now, P2 = needs CDF on raw ingestion):

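| subproject | build | MERGE CPU-h/day | CDF now (P1) | CDF full (P1+P2) | I/O saved/day |
|---|---|---|---|---|---|
| tokens | 24/7 | 466 | −17% | −52% | 19 TB |
| hourly | 24/7 | 275 | −10% | −45% | 10 TB |
| dex | 24/7 | 227 | −27% | −51% | 27 TB |
| curated-data | 24/7 | 250 | ~−12% | ~−17% | ~4 TB |
| solana | scale-to-zero | 75 | −25% | −38% | 5 TB |
| daily | scale-to-zero | 110 | −4% | −27% | 6 TB |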

Totals: ~223 CPU-h/day usable now (P1), ~581 CPU-h/day full (P1+P2), ~72 TB/day I/O.
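As a sanity check on the totals, the Python snippet below recomputes them from the rounded per-subproject figures in the table. It lands at roughly 221 and 583 CPU-h/day and 71 TB/day, close to the stated totals; the small gaps presumably come from the table's rounded percentages (an assumption, since the unrounded inputs are not shown here).

```python
# (subproject, MERGE CPU-h/day, CDF now, CDF full, I/O saved TB/day)
# Figures copied from the table; the curated-data values are approximate there.
rows = [
    ("tokens", 466, 0.17, 0.52, 19),
    ("hourly", 275, 0.10, 0.45, 10),
    ("dex", 227, 0.27, 0.51, 27),
    ("curated-data", 250, 0.12, 0.17, 4),
    ("solana", 75, 0.25, 0.38, 5),
    ("daily", 110, 0.04, 0.27, 6),
]

p1_cpu_h = sum(cpu * now for _, cpu, now, _, _ in rows)
full_cpu_h = sum(cpu * full for _, cpu, _, full, _ in rows)
io_tb = sum(io for *_, io in rows)

print(round(p1_cpu_h), round(full_cpu_h), io_tb)  # 221 583 71
```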

Capacity, not runtime. The MERGE cost that CDF removes is concentrated on the 24/7 clusters, so the payoff is worker capacity (room to downsize), not shorter job runtime. The scale-to-zero builds (daily, solana) are gated by long poles unrelated to CDF, so their wall-clock lever is DAG parallelism: daily runs with threads=8 and is thread-capped at its 04:00 trough, with ~1.5–2h recoverable. That work is separate from CDF. curated-data is a weak CDF target: ~53% of its cost is stateful aggregations and window functions, which need a recompute-affected-groups pattern rather than this merge.

Prerequisites before any real CDF spell ships

  • system.table_changes is PERMISSION_DENIED for the spellbook Trino role (works only as admin) — needs a function-execution grant.
  • CREATE OR REPLACE is unsupported on CDF tables, so --full-refresh needs a DROP-first materialization tweak.
  • A restatement-pass policy for price/metadata-enriched models.
  • table_changes drops Dune uint256/int256 to raw varbinary; source_changes re-decodes them in SQL the same way Dune's raw views already do (_unsafe_uint256-style) — this is the permanent fix, not a connector change.
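For the last item, a minimal Python sketch of what re-decoding a raw 256-bit value involves. The PR does this in SQL inside source_changes; the function names here are hypothetical, and the big-endian, two's-complement byte layout is an assumption of this sketch that is not stated above.

```python
def decode_uint256(raw: bytes) -> int:
    """Interpret up to 32 raw bytes as an unsigned integer (big-endian assumed)."""
    if len(raw) > 32:
        raise ValueError("uint256 cannot exceed 32 bytes")
    return int.from_bytes(raw, byteorder="big", signed=False)


def decode_int256(raw: bytes) -> int:
    """Interpret exactly 32 raw bytes as a two's-complement signed integer."""
    if len(raw) != 32:
        raise ValueError("int256 sketch expects a fixed 32-byte value")
    return int.from_bytes(raw, byteorder="big", signed=True)


# Round-trip a token amount and a negative value through raw bytes.
one_token = (10**18).to_bytes(32, byteorder="big")
minus_one = (-1).to_bytes(32, byteorder="big", signed=True)

print(decode_uint256(one_token))  # 1000000000000000000
print(decode_int256(minus_one))   # -1
```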

Towards CUR2-2963

A custom dbt-trino incremental strategy that applies a Delta Change Data
Feed change-set (table_changes) instead of a wall-clock lookback + MERGE.

- cdf/ : source_changes (bootstrap snapshot vs table_changes feed, with
  uint256/int256 re-decode since table_changes drops the logical type to
  varbinary) and the watermark macros (read/advance/current source version,
  stored as the dune.cdf.source_version table property).
- get_incremental_delta_cdf_sql: the strategy macro - dedup the feed to the
  latest change per unique_key, MERGE upsert, prune the target scan to the
  change-set partition range, then stamp the watermark in one MERGE;ALTER.
- adapters: forward change_data_feed_enabled into the CTAS, and use a
  connector-managed location for CDF tables (an explicit location breaks
  credential vending and the stats-on-write read).
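
The dedup, upsert, and watermark steps of the strategy macro can be modeled in plain Python as below. This is a sketch under stated assumptions, not the macro: the `_change_type` and `_commit_version` names follow Delta's change feed metadata columns, while the function names, the dict-based target, and the sample rows are hypothetical. Deletes and partition pruning are left out, and pre-images are skipped so that only the post-update row is applied.

```python
APPLIED_TYPES = {"insert", "update_postimage"}  # pre-images and deletes are ignored here


def dedup_latest(feed, unique_key):
    """Keep the latest applicable change per unique_key, by commit version."""
    latest = {}
    for change in feed:
        if change["_change_type"] not in APPLIED_TYPES:
            continue
        key = change[unique_key]
        if key not in latest or change["_commit_version"] > latest[key]["_commit_version"]:
            latest[key] = change
    return list(latest.values())


def apply_change_set(target, feed, unique_key, watermark):
    """Upsert the deduped changes into target and return the advanced watermark."""
    for change in dedup_latest(feed, unique_key):
        # Strip the feed's metadata columns before writing the row.
        row = {col: val for col, val in change.items() if not col.startswith("_")}
        target[change[unique_key]] = row
    return max((c["_commit_version"] for c in feed), default=watermark)


target = {"tx1": {"id": "tx1", "amount": 1}}
feed = [
    {"id": "tx1", "amount": 1, "_change_type": "update_preimage", "_commit_version": 8},
    {"id": "tx1", "amount": 2, "_change_type": "update_postimage", "_commit_version": 8},
    {"id": "tx2", "amount": 7, "_change_type": "insert", "_commit_version": 8},
    {"id": "tx2", "amount": 9, "_change_type": "update_postimage", "_commit_version": 9},
]

new_watermark = apply_change_set(target, feed, unique_key="id", watermark=7)
print(target["tx1"]["amount"], target["tx2"]["amount"], new_watermark)  # 2 9 9
```

With an empty feed the watermark is returned unchanged, so a run with no upstream changes is a no-op in this model.
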
Throwaway POC artifacts under _poc_cdf/ that exercise the delta_cdf strategy
against a bounded clone of tokens_bnb.base_transfers, plus a baseline target
for an apples-to-apples A/B. transfers_enrich_cdf is a CDF-aware fork of
transfers_enrich (change feed source, prices bounded to the change-set time
range, CDF metadata carried only on the incremental path). Delete once the
strategy is productionized.
github-actions (bot) added the labels "WIP" (work in progress) and "dbt: tokens" (covers the Tokens dbt subproject) on Jun 27, 2026

Member Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

a-monteiro changed the title from "Add delta_cdf incremental strategy and Delta CDF macros" to "delta_cdf incremental strategy — Delta CDF POC (draft)" on Jun 27, 2026
@a-monteiro a-monteiro requested a review from 0xRobin June 28, 2026 11:36