injective keys - 75%+ faster expansion, 50% reduction in file size, massive allocation reduction && Soundness#993
injective keys - 75%+ faster expansion, 50% reduction in file size, massive allocation reduction && Soundness#993kans wants to merge 7 commits into
Conversation
7eb0dda to
9616b70
Compare
General PR Review: injective keys - 75%+ faster expansion, 50% reduction in file size, massive allocation reduction && SoundnessBlocking Issues: 1 | Suggestions: 3 | Threads Resolved: 0 Review SummaryRe-reviewed the full PR diff (115 files, ~13.3k additions) for security and correctness, with focused verification of the wire/proto surfaces, the on-disk id-index migration, and the serialized-state paths. Two candidate blocking issues were evaluated and dismissed as false positives: the build-then-replace id-index migration is crash-safe because each range is swapped atomically via Security IssuesNone found. Correctness Issues
Suggestions
Prompt for AI agents``` Correctness IssuesIn `pkg/dotc1z/to_pebble.go` and `pkg/dotc1z/engine/pebble/bulk_import.go`:
SuggestionsIn `pkg/sdk/version.go`:
In `pkg/dotc1z/engine/pebble/adapter.go` / `bulk_import.go`:
In `pkg/dotc1z/engine/pebble/deferred_index.go`:
|
9616b70 to
93646ec
Compare
93646ec to
84500e6
Compare
| message Grant { | ||
| c1.connector.v2.Entitlement entitlement = 1 [(validate.rules).message = {required: true}]; | ||
| c1.connector.v2.Resource principal = 2 [(validate.rules).message = {required: true}]; | ||
| // These ids may not map one to one with the grant itself. |
There was a problem hiding this comment.
🟡 Suggestion: The generated pb/c1/connector/v2/grant.pb.go rawDesc in this PR marks field id = 3 as deprecated=true (the added \x18\x01 option bytes), but this .proto source only adds a comment — it does not declare deprecated = true on the field. This is generated-artifact drift: running make protogen would drop the deprecated flag. If the intent is to deprecate id, add deprecated = true to the field options here and regenerate; otherwise the generated file shouldn't carry it. Confidence: high (mismatch), low blast radius.
84500e6 to
a37841f
Compare
…iously trip compaction thresholds hundreds of times due to data volume alone
| // runs) but competes with those phases for CPU and IO, so stop granting | ||
| // new compactions. StartNewSync/SetCurrentSync resume the scheduler if | ||
| // the store is written to again. | ||
| a.engine.PauseCompactions() |
There was a problem hiding this comment.
🟡 Suggestion (low confidence): Compactions are paused for the whole EndSync-to-close window, but that window is exactly where the deferred by_principal index build ingests its SSTs. Flushes still drain memtables to L0 and ingested SSTs that overlap existing key ranges can also land in L0; with granting paused, L0 can accumulate toward L0StopWritesThreshold and stall the very index build this is meant to speed up. Worth confirming (via the whale run's LogCompactionMetrics) that read-amp/L0 stays bounded through the index build, or resuming compactions if a write phase follows the pause. The code comment acknowledges this tradeoff, so this is a verification ask rather than a confirmed bug.
| // isGranting serializes granting from Done and the periodic granter, | ||
| // and lets Unregister wait out an in-flight grant loop. | ||
| isGranting bool | ||
| isGrantingCond *sync.Cond |
There was a problem hiding this comment.
🟡 Suggestion: pausableCompactionScheduler reimplements pebble's ConcurrencyLimitScheduler with non-trivial concurrency (the isGranting handoff, Unregister waiting out an in-flight grant loop, Done/periodic-granter/TrySchedule all mutating runningCompactions). It has no direct unit test — only exercised indirectly via the env-gated whale benchmark. A focused test covering pause during an in-flight grant, resume-poke pickup, and Register/Unregister ordering would guard this against future pebble-version bumps. Same applies to the new Begin/Add/Finish/Abort layer session (segment cut, worker-error propagation, abort mid-wave), only reached with BATON_PEBBLE_SYNTH_LAYER_SST set.
| // Tee the raw row into the primary-keyspace rebuild pipeline. The | ||
| // iterator yields globally sorted keys, so each SST is sorted and | ||
| // the cut files are mutually disjoint. | ||
| if err := rebuild.add(iter.Key(), iter.Value()); err != nil { |
There was a problem hiding this comment.
🟡 Suggestion: A tee write error surfaced here (via add→flushBatch→takeErr) is returned from BuildDeferredGrantIndexes and fails the whole EndSync/sync. That contradicts the design intent stated at line 380 and the excise GUARD (line 414), where post-scan rebuild failures are downgraded to a loud no-op that "leaves the LSM alone rather than failing the sync." Consider capturing a scan-time tee error and continuing the scan (so the by_principal index still builds and the excise is skipped) rather than aborting the sync. Low confidence — an SST-write error mid-scan may also fail the concurrent spill sorter, but the asymmetry with the post-scan guard is real. (incremental)
| for k := 0; k < len(entitlementID); k++ { | ||
| if entitlementID[k] != ':' { | ||
| continue | ||
| } | ||
| for l := k + 1; l < len(entitlementID); l++ { | ||
| if entitlementID[l] != ':' { | ||
| continue | ||
| } | ||
| cand := entitlementIdentity{ | ||
| resourceTypeID: entitlementID[:k], | ||
| resourceID: entitlementID[k+1 : l], | ||
| stripped: true, | ||
| tail: entitlementID[l+1:], | ||
| } | ||
| nonEmpty, err := e.grantPrimaryPrefixNonEmpty(encodeGrantPrimaryEntitlementPrefix(cand)) | ||
| if err != nil { | ||
| return entitlementIdentity{}, err |
There was a problem hiding this comment.
🟡 Suggestion: This direct-split fallback opens a fresh Pebble iterator (grantPrimaryPrefixNonEmpty) for every O(colons²) candidate with no upper bound, unlike resolveGrantIdentityByExternalID which caps enumeration at maxGrantIDCandidates. An entitlement id with many colons and no matching entitlement record would open a large number of iterators on this bare-id edge. Consider applying a similar cap here for consistency and robustness. (low confidence)
24fba77 to
687fe2d
Compare
687fe2d to
e47abd8
Compare
e47abd8 to
a804d65
Compare
a804d65 to
376f9b4
Compare
376f9b4 to
a1fa055
Compare
ba5f7da to
104a04c
Compare
104a04c to
407f025
Compare
| } | ||
| pid := principal.GetId() | ||
| grantID := descEntitlement.GetId() + ":" + pid.GetResourceType() + ":" + pid.GetResource() | ||
| grantID := batonGrant.NewGrantID(principal, descEntitlement) |
There was a problem hiding this comment.
🟡 Suggestion: newExpandedGrantWithSources checks principal == nil but not principal.GetId() == nil. batonGrant.NewGrantID panics ("principal resource must have a valid resource ID") when the principal's Id is nil (grant.go:143-145), whereas the pre-PR manual ":"-join was nil-safe. Reachable only on a malformed principal, but consider guarding principal.GetId() == nil and returning an error instead of panicking. (confidence: low)
407f025 to
133e3a1
Compare
133e3a1 to
ee7896f
Compare
ee7896f to
159a719
Compare
Pebble: injective grant identities + sorted-ingest grant expansion
Summary
This PR makes grant expansion on the Pebble engine fast and predictable for very
large datasets by fixing two compounding problems: grant/entitlement IDs were
not injective (distinct logical entities could collide, and equal entities
could carry unequal IDs), and expansion emitted grants out of key order,
turning the LSM write path into a write-amplification machine.
On a production-scale benchmark (3.68M base grants expanding to 57.7M), a full
projection expansion + index build + save went from ~285s to ~164s (-43%),
while also shipping a fully consolidated artifact (compaction debt ~0) instead
of a partially-compacted one.
1. Injective structural identities
grant =
(ent_rt, ent_rid, ent_kind, ent_name, prin_rt, prin_id),entitlement =
(rt, rid, kind, name), stored as raw components with a0x00-separated tuple codec. Same logical entity ⇒ same key, always.
RPC, connector-facing APIs);
Grant.idis no longer deprecated.EntitlementKindCustommarker (the sanitizer intentionally keeps the marker cleartext; documented).
Migration
SSTs and swapped in atomically via
IngestAndExcise, then stamped. Crash-safe(re-runs from scratch), no per-key fallback paths in readers.
deterministic collapse rule.
SQLite read-only migrations). SQLite→Pebble conversion emits the new format
directly and stamps the migration marker.
2. Index diet
Folded into primary-key prefix scans (the grant primary key is
entitlement-first): grant
by_entitlement,by_entitlement_resource,by_principal_resource_type, and entitlementby_resource. Remainingsecondary indexes (
by_principal,by_needs_expansion) are key-only.by_principalis scattered relative to expansion's write order, so its inlinemaintenance is deferred: expansion skips it and EndSync rebuilds the entire
family from one primary scan into a single sorted SST (
IngestAndExcise).3. Sorted-ingest expansion write path
Expansion previously wrote synthesized grants through batch commits in
topological-node order — massively out of key order, so the memtable/L0 path
rewrote the data repeatedly through compaction.
Now:
Every parent of a wave-k node lives in a wave < k, so a wave's output is
never read within that wave — its writes can be deferred and published at
the wave boundary.
key/value bytes immediately and streamed into a background-sorted spill
sorter (128MiB chunks). Every ~8M rows the session cuts a segment, and a
background worker k-way merges the segment into one SST and ingests it —
overlapping merge/ingest with expansion compute.
with a hand-rolled deterministic wire encoder for
sources(byte-identicalto the reflective proto marshal, pinned by test).
4. LSM lifecycle around bulk ingest
Segment SSTs span interleaved key ranges, so Pebble stacks them in upper
levels and schedules hundreds of compactions to untangle them — work that
competed with EndSync and whose output never survived to the saved artifact
(the file is checkpointed and closed minutes later). Two changes:
stops granting automatic compactions from EndSync to close; binding a new
sync resumes it. In-flight compactions finish normally.
goroutine) into rolling flat SSTs, and one
IngestAndExciseswaps the wholegrant range: each byte is rewritten exactly once, on a scan already being
paid for, replacing what background compaction needed ~2m48s of thread time
to approximate. The saved artifact ships with compaction debt ~0.
Benchmark (3.68M → 57.7M grants, Apple Silicon)
Compactions during the post-expansion window: 236 → 9. Output c1z is
byte-count comparable (slightly smaller) with identical logical content,
verified by differential expansion tests against SQLite and the legacy path.
Compatibility
structural-snapshot tests, and a seeded fuzz sweep.