[POC] Bucket storage report (operations vs rows)#683

Draft: bean1352 wants to merge 30 commits into main from feat/bucket-storage-report
bean1352 commented Jun 23, 2026

Bucket storage report

Summary

This adds an admin endpoint that reports the worst-offender buckets for the active sync configuration, comparing each bucket's total operations against its live rows. Buckets are ranked by operation count, with fragmentation as a tie-break. It is implemented for MongoDB storage only.

The ratio of operations to rows is a fragmentation indicator. A freshly compacted bucket sits close to 1, because each live row is represented by roughly one operation. A high ratio means the bucket has accumulated operation history that no longer maps to live rows, which is a sign that a compaction or a defragmentation would reduce what new clients have to download.

The endpoint is built to stay within the storage query timeout even on large instances. It ranks buckets inside the database rather than loading them into memory, and it estimates per-bucket row counts by sampling the operation history rather than scanning every current row.

API

POST /api/admin/v1/bucket-report

Authenticated with an admin API token (Authorization: Token <token>), the same as the other /api/admin/* endpoints.

Request

{ "limit": 20 }

limit is optional and defaults to 50. It sets how many worst-offender buckets to return, ranked by operation count. Because row counts are sampled once per returned bucket, the limit also bounds the cost of the report.
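
As a rough illustration of calling the endpoint, the sketch below builds the request from the pieces described above. The helper name, the base URL, the Content-Type header, and sending an empty JSON object when limit is omitted are all assumptions made for the example, not part of this PR.

```typescript
// Hypothetical helper (not part of the PR): builds the request for
// POST /api/admin/v1/bucket-report without sending it.
interface BucketReportRequest {
  url: string;
  init: { method: 'POST'; headers: Record<string, string>; body: string };
}

function buildBucketReportRequest(baseUrl: string, token: string, limit?: number): BucketReportRequest {
  // Assumption: omitting `limit` lets the server apply its default of 50.
  const body = limit === undefined ? {} : { limit };
  return {
    // Strip trailing slashes so the path is not doubled up.
    url: `${baseUrl.replace(/\/+$/, '')}/api/admin/v1/bucket-report`,
    init: {
      method: 'POST',
      headers: {
        // Same admin token scheme as the other /api/admin/* endpoints.
        Authorization: `Token ${token}`,
        // Assumption: the endpoint expects a JSON content type.
        'Content-Type': 'application/json'
      },
      body: JSON.stringify(body)
    }
  };
}

// Usage (base URL and token are placeholders):
// const { url, init } = buildBucketReportRequest('http://localhost:8080', 'my-admin-token', 20);
// const report = await (await fetch(url, init)).json();
```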

Response

{
  "data": {
    "buckets": [
      {
        "bucket": "1#by_list[\"a3db…\"]",
        "operations": 3001,
        "rows": 3,
        "operation_bytes": 793829,
        "fragmentation": 1000.33,
        "rows_estimated": true,
        "suggested_action": "compact",
        "tables": ["todos"]
      },
      {
        "bucket": "1#by_list[\"81d3…\"]",
        "operations": 1,
        "rows": 1,
        "operation_bytes": 329,
        "fragmentation": 1,
        "rows_estimated": false,
        "suggested_action": "none",
        "tables": ["todos"]
      }
    ],
    "definitions": [
      {
        "definition": "1#by_list",
        "bucket_count": 5000,
        "operations": 14000,
        "operation_bytes": 4025500,
        "rows": 6100,
        "fragmentation": 2.3,
        "rows_estimated": true,
        "suggested_action": "none",
        "tables": ["todos"]
      }
    ],
    "totals": {
      "bucket_count": 5000,
      "operations": 14000,
      "operation_bytes": 4025500,
      "estimated": false
    },
    "buckets_truncated": true,
    "definitions_truncated": false
  }
}
| Field | Description |
| --- | --- |
| buckets[].bucket | Full bucket name (versioned in storage v2 and later, for example 1#global[]) |
| buckets[].operations | Total operations in the bucket history (PUT, REMOVE, MOVE, CLEAR). Exact, read from bucket_state |
| buckets[].rows | Live rows in the bucket. Exact for small buckets, a sampled estimate for large ones |
| buckets[].operation_bytes | Approximate size of the operation history |
| buckets[].fragmentation | operations / max(rows, 1) |
| buckets[].rows_estimated | True when rows and fragmentation are a sampled estimate |
| buckets[].suggested_action | none, compact, defragment, or both (see below) |
| buckets[].tables | Tables making up the bucket's history, largest share first. These are the tables whose rows a defragment should touch |
| definitions[] | Rollup per bucket definition: bucket_count, operations, operation_bytes, plus sampled rows, fragmentation, rows_estimated, suggested_action, and tables with the same meanings as per bucket. Rows count once per bucket containing them |
| totals.bucket_count | Number of buckets with stored operations. Estimated when the bucket set was sampled |
| totals.operations, totals.operation_bytes | Instance-wide operation sums. Estimated when the bucket set was sampled |
| totals.estimated | True when the bucket set was sampled rather than fully ranked |
| buckets_truncated | True when there are more buckets than were returned (the request's limit) |
| definitions_truncated | True when the definition rollup is incomplete: more definitions exist than the report caps at (20), or one was skipped because sampling it failed |

Buckets are ranked worst first: most operations first, then most fragmented as a tie-break. There is deliberately no instance-wide row or fragmentation total. Row counts are sampled per returned bucket, not summed across the whole instance, because an instance-wide row total would require the full scan this design sets out to avoid.
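
The fragmentation formula and the ranking order can be sketched as follows. The type and function names are illustrative only and are not taken from bucket-report.ts.

```typescript
// Illustrative shape; only the fields needed for ranking are included.
interface BucketStats {
  bucket: string;
  operations: number;
  rows: number;
}

// fragmentation = operations / max(rows, 1), as in the field table above.
function fragmentation(operations: number, rows: number): number {
  return operations / Math.max(rows, 1);
}

// Worst first: most operations, then most fragmented as a tie-break.
// Returns a new array and leaves the input untouched.
function rankWorstFirst(buckets: BucketStats[]): BucketStats[] {
  return [...buckets].sort(
    (a, b) =>
      b.operations - a.operations ||
      fragmentation(b.operations, b.rows) - fragmentation(a.operations, a.rows)
  );
}
```

With the first bucket from the example response, fragmentation(3001, 3) is about 1000.33, and a bucket with no live rows divides by 1 rather than by 0.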

Each suggested action maps to a command. compact means running the service compact, either instance-wide or scoped to the reported buckets with compact -b '<bucket>,<bucket>'. defragment means touching the rows of the bucket's tables (scoped by the bucket's parameters) so they are rewritten at the end of the bucket, then compacting. both means the bucket has both kinds of overhead, so compact first and defragment after.

How it works

The report does two things: it finds the buckets with the most operations, and it works out how many live rows each of those buckets has. Each step is exact where that is cheap and estimated where it is not, and every estimated value is flagged in the response.

Ranking buckets. Operation counts are already maintained per bucket in bucket_state, one small document per bucket, so a single aggregation returns the top N buckets and the instance-wide totals. It runs with allowDiskUse: false and a 60 second maxTimeMS, so an oversized instance fails fast instead of degrading. Up to 50,000 buckets the ranking is exact. Above that, the report ranks a random sample of 10,000 buckets, scales the totals up by the number of buckets belonging to the active configuration, and sets totals.estimated.
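
A minimal sketch of the exact-versus-sampled decision and the scaling of totals, assuming the totals are scaled linearly by the ratio of active buckets to sampled buckets. The PR text only says the totals are scaled up by the number of buckets in the active configuration, so the exact formula, the rounding, and all names except BUCKET_SELECTION_SAMPLE_SIZE (which appears in the reviewed code) are assumptions.

```typescript
const EXACT_RANKING_MAX_BUCKETS = 50_000; // threshold stated above; constant name is hypothetical
const BUCKET_SELECTION_SAMPLE_SIZE = 10_000; // assumed to be the 10,000-bucket sample described above

interface RankedTotals {
  bucketCount: number; // buckets the aggregation actually ranked
  operations: number;
  operationBytes: number;
}

interface ReportTotals {
  bucket_count: number;
  operations: number;
  operation_bytes: number;
  estimated: boolean;
}

function resolveTotals(ranked: RankedTotals, activeBucketCount: number): ReportTotals {
  if (activeBucketCount <= EXACT_RANKING_MAX_BUCKETS) {
    // Exact path: every bucket was ranked, so the sums are reported as-is.
    return {
      bucket_count: ranked.bucketCount,
      operations: ranked.operations,
      operation_bytes: ranked.operationBytes,
      estimated: false
    };
  }
  // Sampled path: assume the sample is representative and scale linearly.
  const factor = activeBucketCount / Math.max(ranked.bucketCount, 1);
  return {
    bucket_count: activeBucketCount,
    operations: Math.round(ranked.operations * factor),
    operation_bytes: Math.round(ranked.operationBytes * factor),
    estimated: true
  };
}
```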

Counting rows for the returned buckets. Each returned bucket's row count comes from its operation history in bucket_data, selected with an _id range query so that only that bucket's documents are read (the range form is what lets MongoDB use the primary key index). A bucket with 1,000 operations or fewer is read in full, so its count is exact and rows_estimated is false. A larger bucket is sampled with $sampleRate and rows_estimated is true. In the rare case a sample selects nothing, the bucket is read in full instead.
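
The query shape described above might look like the sketch below. It assumes a compound _id of the form { g, b, o } (group, bucket, operation id), inferred from the _id.g / _id.b fields mentioned in review, and it takes the lower and upper sentinels as parameters; in real code these would typically be BSON MinKey and MaxKey values. The function name and the way the sample rate is derived are assumptions.

```typescript
type Doc = Record<string, unknown>;

const EXACT_READ_MAX_OPERATIONS = 1_000; // buckets at or below this are read in full

function bucketRowPipeline<K>(
  groupId: number,
  bucket: string,
  operations: number,
  sampleSize: number,
  minKey: K, // assumption: a BSON MinKey in real code
  maxKey: K // assumption: a BSON MaxKey in real code
): { pipeline: Doc[]; rowsEstimated: boolean } {
  const pipeline: Doc[] = [
    {
      // Range on the whole _id value, so the primary key index can be used.
      // Field order (g, b, o) must match the stored _id for the comparison to work.
      $match: {
        _id: {
          $gte: { g: groupId, b: bucket, o: minKey },
          $lt: { g: groupId, b: bucket, o: maxKey }
        }
      }
    }
  ];
  const rowsEstimated = operations > EXACT_READ_MAX_OPERATIONS;
  if (rowsEstimated) {
    // $sampleRate keeps each matched document with the given probability.
    pipeline.push({ $match: { $sampleRate: Math.min(1, sampleSize / operations) } });
  }
  return { pipeline, rowsEstimated };
}
```

The fallback for an empty sample (reading the bucket in full) is left to the caller in this sketch: it would re-run the pipeline with only the first stage.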

Estimating rows from a sample. A sample on its own undercounts rows, because rows that were not sampled are invisible. The estimator corrects for this using repetition: if the sample keeps landing on the same rows, the bucket has few rows; if every sampled operation lands on a new row, it has many. Two things follow from that. The sample must be large enough to contain repeats, so its size grows with the bucket (sqrt(200 * operations), clamped between 1,000 and 25,000). And the estimate is found numerically, by searching for the row count whose expected number of distinct sampled rows matches the observed number. Operations left behind by compaction (MOVE and CLEAR) carry no row identity and are excluded from the model; without that, freshly compacted buckets under-count rows. The formula, its assumptions, and unit tests live with estimateDistinctRows in bucket-report.ts.
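
To make the idea concrete, here is one possible estimator built on the same principle. It is a sketch under stated assumptions, not the formula in estimateDistinctRows: it assumes every identity-carrying operation is kept independently with probability rate, and that operations are spread evenly across rows. Under that model a bucket with N rows and M identity-carrying operations is expected to show N * (1 - (1 - rate)^(M / N)) distinct rows in the sample, and the estimate is the N that matches the observed count, found by bisection. The rounding in sampleSizeFor is also an assumption.

```typescript
// Sample size from the text: sqrt(200 * operations), clamped to [1,000, 25,000].
function sampleSizeFor(operations: number): number {
  const raw = Math.round(Math.sqrt(200 * operations));
  return Math.min(25_000, Math.max(1_000, raw));
}

// Expected number of distinct rows seen in the sample under the even-churn model.
function expectedDistinct(rows: number, identityOps: number, rate: number): number {
  return rows * (1 - Math.pow(1 - rate, identityOps / rows));
}

// Hypothetical estimator: solves expectedDistinct(N) = observedDistinct for N.
// identityOps counts PUT/REMOVE operations only; MOVE and CLEAR are excluded.
function estimateRowsSketch(observedDistinct: number, identityOps: number, rate: number): number {
  if (observedDistinct <= 0) return 0;
  let lo = observedDistinct; // there are at least as many rows as were seen
  let hi = identityOps; // and at most one row per operation
  if (lo >= hi || expectedDistinct(hi, identityOps, rate) <= observedDistinct) {
    return Math.round(Math.max(lo, hi));
  }
  // expectedDistinct is increasing in `rows`, so bisection converges.
  for (let i = 0; i < 60; i++) {
    const mid = (lo + hi) / 2;
    if (expectedDistinct(mid, identityOps, rate) < observedDistinct) {
      lo = mid;
    } else {
      hi = mid;
    }
  }
  return Math.round((lo + hi) / 2);
}
```

Feeding the model its own expected value back, for example 6,000 operations over 600 rows sampled at a rate of 0.5, recovers a row count of about 600. Real samples are noisy, so actual estimates scatter around the true value.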

The definition rollup and the suggested action. The same aggregation that ranks buckets also groups them by definition (the bucket-name prefix), giving exact per-definition operation totals; each returned definition's rows are then sampled the same way as a bucket's, over its whole history. Because that reads more data than a single bucket, a definition whose sampling exceeds the time budget is omitted from the rollup rather than failing the report. The suggested action falls out of the sample: PUT and REMOVE operations carry a row identity while MOVE and CLEAR are residue left by compaction, so the mix says what would help. Fragmentation under 3 suggests none (the same 3x rule of thumb the PowerSync diagnostics app uses); mostly superseded PUT/REMOVE history suggests compact (cheap, reclaims it); mostly MOVE/CLEAR residue suggests defragment (a compact already ran and cannot reclaim more); both together, or an inconclusive mix, suggest both. A definition can read none while some of its buckets read compact: the rollup scores the definition as a whole, so localized churn in a few buckets is exactly the case for the bucket-scoped compact. The thresholds are heuristics and the report is meant to be re-run after acting on it. The same sample also yields tables, the tables that make up the history ordered by their share of it, which names what a defragment should touch.
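
The decision rule described above can be sketched as a small function. How the superseded and residue shares are derived from the sample is not spelled out in this description, so they are plain inputs here; the function and constant names are hypothetical, and the 50% share threshold is the one listed under known limitations.

```typescript
type SuggestedAction = 'none' | 'compact' | 'defragment' | 'both';

const FRAGMENTATION_THRESHOLD = 3; // the 3x rule of thumb
const SHARE_THRESHOLD = 0.5; // share thresholds chosen against local data

// supersededShare: fraction of overhead that is superseded PUT/REMOVE history.
// residueShare: fraction of overhead that is MOVE/CLEAR residue from compaction.
// Both are assumed to be supplied by the sampling step.
function suggestAction(
  fragmentation: number,
  supersededShare: number,
  residueShare: number
): SuggestedAction {
  if (fragmentation < FRAGMENTATION_THRESHOLD) return 'none';
  const mostlySuperseded = supersededShare > SHARE_THRESHOLD;
  const mostlyResidue = residueShare > SHARE_THRESHOLD;
  if (mostlySuperseded && !mostlyResidue) return 'compact';
  if (mostlyResidue && !mostlySuperseded) return 'defragment';
  // Both kinds of overhead, or an inconclusive mix.
  return 'both';
}
```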

The per-bucket queries run ten at a time, so report cost scales with limit. Nothing reads current_data or scans a whole bucket_data collection, except the per-definition row sampling described above, which is bounded by the query timeout. The endpoint is implemented for MongoDB storage v1, v2, and v3; v3 stores operations in batched documents, so its pipeline unwinds the batches before counting.

Compact versus defragment

The report distinguishes a bucket that a routine compaction can fix from one that needs a defragmentation. This was verified end to end against real MongoDB storage, using a single global bucket of 550 rows.

| Stage | operations | rows | fragmentation | Notes |
| --- | --- | --- | --- | --- |
| Baseline (after snapshot) | 550 | 550 | 1.0 | |
| After heavy updates | 6550 | 550 | 11.9 | |
| Compact only | 6550 | 550 | 11.9 | bytes drop, operation count holds |
| Defragment then compact | 551 | 550 | 1.0 | |

Compaction on its own reduces storage size but not the number of operations a new client downloads, because this bucket led with rows that were never updated. Compaction replaces superseded operations with lightweight MOVE operations: the data is removed but the operation count is preserved. Only a defragmentation, which touches every row and is then compacted, collapses the count back to roughly one operation per row.

Testing

Automated tests cover:

  • MongoDB storage v1, v2, and v3 against a real MongoDB, exercising operation counts, fragmentation, ranking, limit and truncation, the definition rollup and its cap, suggested actions, tables, and v3 active-config scoping.
  • The shared report assembly (fragmentation, ranking, truncation, suggested actions), the row estimator, and limit resolution, as unit tests.
  • The endpoint itself, including the snake_case response encoding and the no-active-config and unsupported-storage error branches.

Manual testing at high bucket counts was run on Docker:

  • 10,000 buckets, 12.75 million operations, exact ranking path. Totals were exact (estimated: false) with the correct bucket count. Ranking and per-bucket sampling returned the top 50 buckets in about 280 ms and the top 300 in about 700 ms. Across 300 returned buckets the estimated row counts had a mean error of 0.3%, a 95th percentile of 2.3%, and a worst case of about 10% on the smallest fragmented buckets.
  • About 65,000 buckets, sampled ranking path. The bucket set was sampled (estimated: true) and ranking plus per-bucket sampling completed in roughly 150 to 170 ms per request, against a 60 second timeout. The operation total was a scaled estimate that bracketed the true value.
  • Sample size. An earlier fixed sample of 1,000 operations produced row errors of up to 100% on wide buckets, because the sample rarely repeated a row and the estimator had nothing to work from. Scaling the sample to the bucket size brought the worst case under about 10% and made it stable run to run.
  • Per-bucket query cost. Selecting each bucket by an _id range is what keeps the report fast. Matching on the sub-fields of the compound _id instead scanned the whole collection once per returned bucket, which took about 28 seconds for 50 buckets over 12 million operations, against under 300 ms with the range match.
  • Multi-definition configuration. Four bucket definitions over seven tables (10,501 buckets, 4.5 million operations), with one table shared by two definitions. Ranking mixed the definitions correctly, totals stayed exact, and a fully churned global bucket was recovered exactly at 63:1 fragmentation. This run also surfaced that compaction's MOVE operations skewed the row estimates; excluding identity-less operations from the model fixed it (row-error mean 1%, worst case 19% on buckets with deliberately skewed churn).
  • Definition rollup and suggested actions. On the same dataset the rollup reported all four definitions with exact operation totals and sampled rows within 0% to 6% of ground truth, and the suggested actions matched the storage state: healthy definitions read none, buckets with un-compacted churn read compact, and already-compacted fragmented buckets read defragment or both. With the rollup included the full report completed in about 3 seconds, almost all of it spent sampling the definitions' rows.

Known limitations

  • Estimates assume even churn within a bucket. Sampled rows and fragmentation values (flagged rows_estimated) are accurate when a bucket's operations are spread roughly evenly across its rows. A bucket with a few hot rows among many cold ones will be approximate. On storage v3 the sample is drawn in batches of operations rather than one at a time, which can additionally lean the estimate toward over-reporting fragmentation.
  • Rare offenders at very large scale. Above the sample threshold the ranking is drawn from a random $sample, so a small number of fragmented buckets among hundreds of thousands may be missed on any given request.
  • Sample size cap. The per-bucket sample is capped at 25,000 operations to bound cost. A bucket that is both very wide and barely fragmented can go past the point where the cap still gives a strong estimate, but such buckets are healthy and are not the offenders the report exists to surface. The cap and the sampling constants were tuned against local data and should be checked against production-scale data.
  • No instance-wide row or fragmentation total. Row counts are sampled per returned bucket and per returned definition, not totalled across the instance, so totals covers operations only.
  • Definition rollup cost. Sampling a definition's rows reads (a sample of) its whole history, which is heavier than the per-bucket queries. Each definition query is bounded by the same time budget, and a definition that exceeds it is omitted from definitions without failing the report. Both that and hitting the 20-definition cap set definitions_truncated.
  • Suggested actions are heuristics. The fragmentation threshold of 3 matches the rule of thumb used by the PowerSync diagnostics app; the residue and superseded shares of 50% were chosen against local data. The intended workflow is to act on the suggestion and re-run the report.
  • Tables are drawn from the sample. A table contributing a tiny share of a large bucket's history can be missing from tables. That does not affect the defragment use case, which targets the dominant tables.
  • Legacy v1 and v2 data. bucket_state is not backfilled, so buckets whose data predates bucket_state tracking under-report operations until they are next written to or compacted.
  • Postgres storage is excluded. Postgres has no bucket_state pre-aggregate to rank from, and a storage re-architecture is pending, so a Postgres implementation would be built on ground that is about to move.

AI usage

Claude Opus was used to help research the codebase, assist with the implementation, and generate test cases. I reviewed and tested everything myself.

changeset-bot Bot commented Jun 23, 2026

🦋 Changeset detected

Latest commit: 7d8fa99

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 16 packages
| Name | Type |
| --- | --- |
| @powersync/service-core | Minor |
| @powersync/service-types | Minor |
| @powersync/service-module-mongodb-storage | Minor |
| @powersync/service-core-tests | Minor |
| @powersync/service-module-convex | Patch |
| @powersync/service-module-core | Patch |
| @powersync/service-module-mongodb | Patch |
| @powersync/service-module-mssql | Patch |
| @powersync/service-module-mysql | Patch |
| @powersync/service-module-postgres-storage | Patch |
| @powersync/service-module-postgres | Patch |
| @powersync/service-image | Minor |
| test-client | Patch |
| @powersync/service-schema | Minor |
| @powersync/service-client | Patch |
| @powersync/lib-service-postgres | Patch |

bean1352 and others added 24 commits June 24, 2026 13:50

rkistner left a comment

I had a quick check on $sampleRate performance, and unfortunately it's not a magic bullet in terms of performance. Essentially MongoDB still has to scan through the index at least before applying the $sampleRate. It can work, as long as we ensure:

  1. Limit the number of index entries a query is scanning to around 100k-1M at most.
  2. Make sure no document lookup is performed before filtering using $sampleRate. E.g. if we scan through 100k index entries to sample around 1000 of them, the query must scan through 1000 documents, not 100k documents (fetching documents is much slower than the index entries). Use MongoDB's explain to confirm this.

As an additional safeguard, we can use readPreference: 'secondary' for these queries, to make sure they don't affect performance on the primary node.
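
The two conditions above could be checked against explain output with something like the sketch below. nReturned, totalKeysExamined, and totalDocsExamined are standard executionStats fields; where executionStats sits inside an aggregation explain depends on the server version and topology, so the caller is assumed to extract it. The function name, the slack factor, and the default ceiling are hypothetical.

```typescript
// Subset of MongoDB's explain executionStats that the check needs.
interface ExecutionStats {
  nReturned: number;
  totalKeysExamined: number;
  totalDocsExamined: number;
}

// Returns a list of problems; an empty list means both conditions hold.
function checkSamplingPlan(
  stats: ExecutionStats,
  maxKeysExamined = 1_000_000, // upper end of the 100k-1M guideline
  docSlack = 2 // hypothetical tolerance: docs fetched vs. docs returned
): string[] {
  const problems: string[] = [];
  if (stats.totalKeysExamined > maxKeysExamined) {
    problems.push(
      `scanned ${stats.totalKeysExamined} index entries, above the ${maxKeysExamined} ceiling`
    );
  }
  // If documents are fetched before sampling, docs examined tracks keys examined
  // instead of the (much smaller) number of sampled documents.
  if (stats.totalDocsExamined > docSlack * Math.max(stats.nReturned, 1)) {
    problems.push(
      `fetched ${stats.totalDocsExamined} documents to return ${stats.nReturned}; sampling may run after the fetch`
    );
  }
  return problems;
}
```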


const pipeline: mongo.Document[] = [{ $match: match }];
if (sampled) {
  pipeline.push({ $sample: { size: BUCKET_SELECTION_SAMPLE_SIZE } });

$sample does not help for performance unless it's the first stage in the pipeline.

Potential options:

  1. $sample first, then filter. That would require the initial sample size to be higher than the limit we want.
  2. Filter first, then use $sampleRate. I'm not actually sure what the performance is like for $sampleRate - would need some testing.

And if you go for option 2, note that the current _id.b / _id.g filters aren't efficient either, and require a full collection scan. To filter efficiently, you need to use a pattern such as _id: {$gte: ..., $lt: ...} - there should be a couple of examples like that in this repo you can use as a starting point.

Comment on lines +226 to +228
expect(resolveBucketReportLimit(2.7)).toBe(2);
expect(resolveBucketReportLimit(-5)).toBe(1);
expect(resolveBucketReportLimit(0)).toBe(1);

I'd recommend responding with an error for these cases, rather than clamping the value.

Comment on lines +181 to +186
export function resolveBucketReportLimit(limit?: number): number {
  if (limit == null) {
    return DEFAULT_BUCKET_REPORT_LIMIT;
  }
  return Math.max(1, Math.floor(limit));
}

We should also have a reasonable maximum, otherwise it could be easy to overload the service or storage database by passing in arbitrary large values.
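
One way the two review suggestions (reject invalid values instead of clamping, and enforce an upper bound) could be combined is sketched below. The function name, the maximum of 1,000, and the use of RangeError are assumptions for illustration; the default of 50 is the one documented in the PR description.

```typescript
const DEFAULT_BUCKET_REPORT_LIMIT = 50;
const MAX_BUCKET_REPORT_LIMIT = 1_000; // hypothetical ceiling, to be tuned

// Hypothetical stricter variant of resolveBucketReportLimit:
// invalid input is an error rather than being clamped or floored.
function resolveBucketReportLimitStrict(limit?: number | null): number {
  if (limit == null) {
    return DEFAULT_BUCKET_REPORT_LIMIT;
  }
  if (!Number.isInteger(limit) || limit < 1 || limit > MAX_BUCKET_REPORT_LIMIT) {
    throw new RangeError(
      `limit must be an integer between 1 and ${MAX_BUCKET_REPORT_LIMIT}, got ${limit}`
    );
  }
  return limit;
}
```

The route handler would then translate the thrown error into a client error response instead of silently adjusting the value.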
