[POC] Bucket storage report (operations vs rows) #683
🦋 Changeset detected
Latest commit: 7d8fa99
The changes in this PR will be included in the next version bump. This PR includes changesets to release 16 packages.
rkistner
left a comment
I had a quick check on `$sampleRate` performance, and unfortunately it's not a magic bullet. Essentially, MongoDB still has to scan through the index before applying the `$sampleRate`. It can work, as long as we:
- Limit the number of index entries a query is scanning to around 100k-1M at most.
- Make sure no document lookup is performed before filtering using `$sampleRate`. E.g. if we scan through 100k index entries to sample around 1000 of them, the query must fetch 1000 documents, not 100k documents (fetching documents is much slower than scanning index entries). Use MongoDB's explain to confirm this.
As an additional safeguard, we can use `readPreference: 'secondary'` for these queries, to make sure they don't affect performance on the primary node.
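A minimal sketch of the query shape described in this comment, assuming a hypothetical helper `buildSampledPipeline` and an `_id` range that the caller has already built. The range match comes first so the scan is bounded by the index, and `$sampleRate` runs in its own `$match` stage. The cap constant and function names are illustrative, and the sketch does not by itself guarantee that sampling happens before documents are fetched; that still needs to be confirmed with `explain`.

```typescript
// Hypothetical helper: the names and the cap value are illustrative, not from the PR.
type Document = Record<string, unknown>;

const MAX_INDEX_ENTRIES = 1000000; // upper end of the 100k-1M range suggested in the review

/** Sample rate that yields roughly `targetSample` entries out of `scannedEntries`. */
function sampleRateFor(scannedEntries: number, targetSample: number): number {
  if (scannedEntries <= 0) return 1;
  return Math.min(1, targetSample / scannedEntries);
}

/** Range match first (index-bounded), then $sampleRate in a separate $match stage. */
function buildSampledPipeline(idRange: Document, scannedEntries: number, targetSample: number): Document[] {
  if (scannedEntries > MAX_INDEX_ENTRIES) {
    throw new RangeError(`Refusing to scan ${scannedEntries} index entries`);
  }
  return [
    { $match: { _id: idRange } },
    { $match: { $sampleRate: sampleRateFor(scannedEntries, targetSample) } }
  ];
}
```

The returned stages could then be passed to `collection.aggregate(pipeline, { readPreference: 'secondary' })` and the plan checked with `explain`.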
    const pipeline: mongo.Document[] = [{ $match: match }];
    if (sampled) {
      pipeline.push({ $sample: { size: BUCKET_SELECTION_SAMPLE_SIZE } });
`$sample` does not help for performance unless it's the first stage in the pipeline.
Potential options:
1. `$sample` first, then filter. That would require the initial sample size to be higher than the limit we want.
2. Filter first, then use `$sampleRate`. I'm not actually sure what the performance is like for `$sampleRate`; it would need some testing.
And if you go for option 2, note that the current `_id.b` / `_id.g` filters aren't efficient either, and require a full collection scan. To filter efficiently, you need to use a pattern such as `_id: {$gte: ..., $lt: ...}`. There should be a couple of examples like that in this repo you can use as a starting point.
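The range pattern mentioned in this comment can be sketched as follows. This assumes the compound `_id` holds the fields `g`, `b`, and a trailing operation id `o`, in that order; only `g` and `b` appear in the comment, so `o` and the key order are assumptions. `bucketIdRange` is a hypothetical name, and the caller is expected to pass BSON `MinKey` and `MaxKey` instances as the bounds.

```typescript
// Hypothetical helper; the `o` field and the key order are assumptions.
interface BucketDataId<O> {
  g: number; // sync rules group id
  b: string; // bucket name
  o: O;      // operation id, or a MinKey/MaxKey sentinel when used as a bound
}

function bucketIdRange<O>(group: number, bucket: string, minKey: O, maxKey: O) {
  // Field order matters: BSON compares embedded documents field by field.
  const lower: BucketDataId<O> = { g: group, b: bucket, o: minKey };
  const upper: BucketDataId<O> = { g: group, b: bucket, o: maxKey };
  // A range on the whole _id value, rather than on the _id.g / _id.b sub-fields,
  // is the form the review describes as efficient.
  return { _id: { $gte: lower, $lt: upper } };
}
```

With the `bson` package this would be called as `bucketIdRange(1, 'global[]', new MinKey(), new MaxKey())`.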
    expect(resolveBucketReportLimit(2.7)).toBe(2);
    expect(resolveBucketReportLimit(-5)).toBe(1);
    expect(resolveBucketReportLimit(0)).toBe(1);
I'd recommend responding with an error for these cases, rather than clamping the value.
    export function resolveBucketReportLimit(limit?: number): number {
      if (limit == null) {
        return DEFAULT_BUCKET_REPORT_LIMIT;
      }
      return Math.max(1, Math.floor(limit));
    }
We should also have a reasonable maximum, otherwise it could be easy to overload the service or storage database by passing in arbitrarily large values.
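One way the two suggestions on this function could be combined, as a sketch rather than the PR's final code: reject invalid values with an error instead of clamping, and enforce an upper bound. The maximum of 1000 is a hypothetical value chosen for illustration; the default of 50 is the one stated in the PR description.

```typescript
const DEFAULT_BUCKET_REPORT_LIMIT = 50;
const MAX_BUCKET_REPORT_LIMIT = 1000; // hypothetical cap, not from the PR

function resolveBucketReportLimit(limit?: number): number {
  if (limit == null) {
    return DEFAULT_BUCKET_REPORT_LIMIT;
  }
  // Reject fractions, zero, negatives, NaN, and oversized values instead of clamping.
  if (!Number.isInteger(limit) || limit < 1 || limit > MAX_BUCKET_REPORT_LIMIT) {
    throw new RangeError(
      `limit must be an integer between 1 and ${MAX_BUCKET_REPORT_LIMIT}, got ${limit}`
    );
  }
  return limit;
}
```

A route handler would typically translate the `RangeError` into a 400 response.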
Bucket storage report
Summary
This adds an admin endpoint that reports the worst-offender buckets for the active sync configuration, ranked by total operations against total live rows. It is implemented for MongoDB storage only.
The ratio of operations to rows is a fragmentation indicator. A freshly compacted bucket sits close to `1`, because each live row is represented by roughly one operation. A high ratio means the bucket has accumulated operation history that no longer maps to live rows, which is a sign that a compaction or a defragmentation would reduce what new clients have to download.

The endpoint is built to stay within the storage query timeout even on large instances. It ranks buckets inside the database rather than loading them into memory, and it estimates per-bucket row counts by sampling the operation history rather than scanning every current row.
API
    POST /api/admin/v1/bucket-report

Authenticated with an admin API token (`Authorization: Token <token>`), the same as the other `/api/admin/*` endpoints.

Request

    { "limit": 20 }

`limit` is optional and defaults to 50. It sets how many worst-offender buckets to return, ranked by operation count. Because row counts are sampled once per returned bucket, the limit also bounds the cost of the report.

Response
    {
      "data": {
        "buckets": [
          {
            "bucket": "1#by_list[\"a3db…\"]",
            "operations": 3001,
            "rows": 3,
            "operation_bytes": 793829,
            "fragmentation": 1000.33,
            "rows_estimated": true,
            "suggested_action": "compact",
            "tables": ["todos"]
          },
          {
            "bucket": "1#by_list[\"81d3…\"]",
            "operations": 1,
            "rows": 1,
            "operation_bytes": 329,
            "fragmentation": 1,
            "rows_estimated": false,
            "suggested_action": "none",
            "tables": ["todos"]
          }
        ],
        "definitions": [
          {
            "definition": "1#by_list",
            "bucket_count": 5000,
            "operations": 14000,
            "operation_bytes": 4025500,
            "rows": 6100,
            "fragmentation": 2.3,
            "rows_estimated": true,
            "suggested_action": "compact",
            "tables": ["todos"]
          }
        ],
        "totals": {
          "bucket_count": 5000,
          "operations": 14000,
          "operation_bytes": 4025500,
          "estimated": false
        },
        "buckets_truncated": true,
        "definitions_truncated": false
      }
    }

| Field | Description |
| --- | --- |
| `buckets[].bucket` | Bucket name (for example `1#global[]`) |
| `buckets[].operations` | Operation count, from `bucket_state` |
| `buckets[].rows` | Live rows in the bucket |
| `buckets[].operation_bytes` | Size of the bucket's operations in bytes |
| `buckets[].fragmentation` | `operations / max(rows, 1)` |
| `buckets[].rows_estimated` | Whether `rows` and `fragmentation` are a sampled estimate |
| `buckets[].suggested_action` | `none`, `compact`, `defragment`, or `both` (see below) |
| `buckets[].tables` | Tables that make up the bucket's history |
| `definitions[]` | Per-definition `bucket_count`, `operations`, `operation_bytes`, plus sampled `rows`, `fragmentation`, `rows_estimated`, `suggested_action`, and `tables` with the same meanings as per bucket. Rows count once per bucket containing them |
| `totals.bucket_count` | Number of buckets in the active sync configuration |
| `totals.operations`, `totals.operation_bytes` | Instance-wide totals |
| `totals.estimated` | Whether the totals are a scaled estimate |
| `buckets_truncated` | Whether more buckets exist than were returned (capped by `limit`) |
| `definitions_truncated` | Whether definitions were left out of the rollup |

Buckets are ranked worst first: most operations first, then most fragmented as a tie-break. There is deliberately no instance-wide row or fragmentation total. Row counts are sampled per returned bucket, not summed across the whole instance, because an instance-wide row total would require the full scan this design sets out to avoid.
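The `fragmentation` formula and the worst-first ordering can be illustrated with a short sketch. `BucketEntry`, `fragmentation`, and `compareWorstFirst` are hypothetical names, and rounding to two decimals is an assumption inferred from the sample response rather than confirmed behavior.

```typescript
// Hypothetical types and helpers for illustration only.
interface BucketEntry {
  bucket: string;
  operations: number;
  rows: number;
}

// fragmentation = operations / max(rows, 1); two-decimal rounding is an assumption.
function fragmentation(operations: number, rows: number): number {
  return Math.round((operations / Math.max(rows, 1)) * 100) / 100;
}

// Worst first: most operations, then most fragmented as a tie-break.
function compareWorstFirst(a: BucketEntry, b: BucketEntry): number {
  return (
    b.operations - a.operations ||
    fragmentation(b.operations, b.rows) - fragmentation(a.operations, a.rows)
  );
}
```

For the first bucket in the sample response, `fragmentation(3001, 3)` gives `1000.33`, matching the reported value.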
Each suggested action maps to a command.
- `compact` means running the service compact, either instance-wide or scoped to the reported buckets with `compact -b '<bucket>,<bucket>'`.
- `defragment` means touching the rows of the bucket's `tables` (scoped by the bucket's parameters) so they are rewritten at the end of the bucket, then compacting.
- `both` means the bucket has both kinds of overhead, so compact first and defragment after.

How it works
The report does two things: it finds the buckets with the most operations, and it works out how many live rows each of those buckets has. Each step is exact where that is cheap and estimated where it is not, and every estimated value is flagged in the response.
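The row estimate described later in this section can be sketched with a standard occupancy model. This is an assumption, not the formula in `estimateDistinctRows`: it supposes each sampled operation lands on one of `R` rows uniformly at random, so a sample of `n` operations is expected to touch `R * (1 - (1 - 1/R)^n)` distinct rows, and it searches for the `R` that matches the observed count. The sample-size rule is the one given in this description; rounding up is an assumption, and all function names are hypothetical.

```typescript
// Sample size grows with the bucket: sqrt(200 * operations), clamped to [1000, 25000].
function sampleSize(operations: number): number {
  return Math.min(25000, Math.max(1000, Math.ceil(Math.sqrt(200 * operations))));
}

// Expected distinct rows seen when n operations land uniformly on `rows` rows.
function expectedDistinct(rows: number, n: number): number {
  return rows * (1 - Math.pow(1 - 1 / rows, n));
}

/**
 * Estimate live rows from a sample.
 * `sampled` counts only operations that carry a row identity (PUT/REMOVE);
 * `distinct` is how many different rows they touched;
 * `maxRows` is an upper bound, for example the bucket's operation count.
 */
function estimateRows(sampled: number, distinct: number, maxRows: number): number {
  if (sampled <= 0 || distinct <= 0) return 0;
  if (distinct >= sampled) return maxRows; // no repeats: the sample cannot bound the count
  let lo = distinct;
  let hi = Math.max(distinct, maxRows);
  // Bisection: expectedDistinct is increasing in `rows`.
  for (let i = 0; i < 100; i++) {
    const mid = (lo + hi) / 2;
    if (expectedDistinct(mid, sampled) < distinct) {
      lo = mid;
    } else {
      hi = mid;
    }
  }
  return (lo + hi) / 2;
}
```

When the sample keeps hitting the same rows the estimate stays close to the observed distinct count; when almost every sampled operation hits a new row, the estimate rises toward `maxRows`.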
Ranking buckets. Operation counts are already maintained per bucket in `bucket_state`, one small document per bucket, so a single aggregation returns the top `N` buckets and the instance-wide totals. It runs with `allowDiskUse: false` and a 60 second `maxTimeMS`, so an oversized instance fails fast instead of degrading. Up to 50,000 buckets the ranking is exact. Above that, the report ranks a random sample of 10,000 buckets, scales the totals up by the number of buckets belonging to the active configuration, and sets `totals.estimated`.

Counting rows for the returned buckets. Each returned bucket's row count comes from its operation history in `bucket_data`, selected with an `_id` range query so that only that bucket's documents are read (the range form is what lets MongoDB use the primary key index). A bucket with 1,000 operations or fewer is read in full, so its count is exact and `rows_estimated` is `false`. A larger bucket is sampled with `$sampleRate` and `rows_estimated` is `true`. In the rare case a sample selects nothing, the bucket is read in full instead.

Estimating rows from a sample. A sample on its own undercounts rows, because rows that were not sampled are invisible. The estimator corrects for this using repetition: if the sample keeps landing on the same rows, the bucket has few rows; if every sampled operation lands on a new row, it has many. Two things follow from that. The sample must be large enough to contain repeats, so its size grows with the bucket (`sqrt(200 * operations)`, clamped between 1,000 and 25,000). And the estimate is found numerically, by searching for the row count whose expected number of distinct sampled rows matches the observed number. Operations left behind by compaction (MOVE and CLEAR) carry no row identity and are excluded from the model; without that, freshly compacted buckets under-count rows. The formula, its assumptions, and unit tests live with `estimateDistinctRows` in `bucket-report.ts`.

The definition rollup and the suggested action. The same aggregation that ranks buckets also groups them by definition (the bucket-name prefix), giving exact per-definition operation totals; each returned definition's rows are then sampled the same way as a bucket's, over its whole history. Because that reads more data than a single bucket, a definition whose sampling exceeds the time budget is omitted from the rollup rather than failing the report. The suggested action falls out of the sample: PUT and REMOVE operations carry a row identity while MOVE and CLEAR are residue left by compaction, so the mix says what would help. Fragmentation under 3 suggests `none` (the same 3x rule of thumb the PowerSync diagnostics app uses); mostly superseded PUT/REMOVE history suggests `compact` (cheap, reclaims it); mostly MOVE/CLEAR residue suggests `defragment` (a compact already ran and cannot reclaim more); both together, or an inconclusive mix, suggest `both`. A definition can read `none` while some of its buckets read `compact`: the rollup scores the definition as a whole, so localized churn in a few buckets is exactly the case for the bucket-scoped compact. The thresholds are heuristics and the report is meant to be re-run after acting on it. The same sample also yields `tables`, the tables that make up the history ordered by their share of it, which names what a defragment should touch.

The per-bucket queries run ten at a time, so report cost scales with `limit`. Nothing reads `current_data` or scans a whole `bucket_data` collection, except the per-definition row sampling described above, which is bounded by the query timeout. The endpoint is implemented for MongoDB storage v1, v2, and v3; v3 stores operations in batched documents, so its pipeline unwinds the batches before counting.

Compact versus defragment
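How the operation mix in a sample could map to a suggested action, as a rough sketch. Only the fragmentation threshold of 3 comes from the description above; the 0.8 cut-off standing in for "mostly" and the names `SampleMix` and `suggestAction` are assumptions chosen for illustration. How the two counts are derived from the sample is left out.

```typescript
type SuggestedAction = 'none' | 'compact' | 'defragment' | 'both';

// Hypothetical summary of a bucket's sampled operations.
interface SampleMix {
  superseded: number; // sampled PUT/REMOVE operations that a later operation replaced
  residue: number;    // sampled MOVE/CLEAR operations left behind by compaction
}

const FRAGMENTATION_THRESHOLD = 3; // the 3x rule of thumb from the description
const MOSTLY = 0.8;                // hypothetical cut-off, not from the PR

function suggestAction(fragmentation: number, mix: SampleMix): SuggestedAction {
  if (fragmentation < FRAGMENTATION_THRESHOLD) return 'none';
  const overhead = mix.superseded + mix.residue;
  if (overhead === 0) return 'both'; // inconclusive sample
  if (mix.superseded / overhead >= MOSTLY) return 'compact';   // compaction can reclaim it
  if (mix.residue / overhead >= MOSTLY) return 'defragment';   // a compact already ran
  return 'both';
}
```

Because the thresholds are heuristics, the result is best treated as a hint and re-checked by running the report again after acting on it.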
The report distinguishes a bucket that a routine compaction can fix from one that needs a defragmentation. This was verified end to end against real MongoDB storage, using a single `global` bucket of 550 rows.

Compaction on its own reduces storage size but not the number of operations a new client downloads, because this bucket led with rows that were never updated. Compaction replaces superseded operations with lightweight `MOVE` operations: the data is removed but the operation count is preserved. Only a defragmentation, which touches every row and is then compacted, collapses the count back to roughly one operation per row.

Testing
Automated tests cover:
Manual testing at high bucket counts was run on Docker:
- […] `estimated: false`) with the correct bucket count. Ranking and per-bucket sampling returned the top 50 buckets in about 280 ms and the top 300 in about 700 ms. Across 300 returned buckets the estimated row counts had a mean error of 0.3%, a 95th percentile of 2.3%, and a worst case of about 10% on the smallest fragmented buckets.
- […] `estimated: true`) and ranking plus per-bucket sampling completed in roughly 150 to 170 ms per request, against a 60 second timeout. The operation total was a scaled estimate that bracketed the true value.
- […] `_id` range is what keeps the report fast. Matching on the sub-fields of the compound `_id` instead scanned the whole collection once per returned bucket, which took about 28 seconds for 50 buckets over 12 million operations, against under 300 ms with the range match.
- […] `none`, buckets with un-compacted churn read `compact`, and already-compacted fragmented buckets read `defragment` or `both`. With the rollup included the full report completed in about 3 seconds, almost all of it spent sampling the definitions' rows.

Known limitations
- `rows` and `fragmentation` values (flagged `rows_estimated`) are accurate when a bucket's operations are spread roughly evenly across its rows. A bucket with a few hot rows among many cold ones will be approximate. On storage v3 the sample is drawn in batches of operations rather than one at a time, which can additionally lean the estimate toward over-reporting fragmentation.
- Above 50,000 buckets the ranking uses `$sample`, so a small number of fragmented buckets among hundreds of thousands may be missed on any given request.
- `totals` covers operations only.
- A definition whose row sampling exceeds the time budget is omitted from `definitions` without failing the report. Both that and hitting the 20-definition cap set `definitions_truncated`.
- […] `tables`. That does not affect the defragment use case, which targets the dominant tables.
- `bucket_state` is not backfilled, so buckets whose data predates `bucket_state` tracking under-report operations until they are next written to or compacted.
- […] `bucket_state` pre-aggregate to rank from, and a storage re-architecture is pending, so a Postgres implementation would be built on ground that is about to move.

AI usage
Claude Opus was used to help research the codebase, assist with the implementation, and generate test cases. I reviewed and tested everything myself.