fix(sof-export): clean up partial results on cancel, plus a TTL reaper (#144) #180
mauripunzueta wants to merge 4 commits
…lled

Per the SQL-on-FHIR operations-common spec (HL7/sql-on-fhir#365), a server SHOULD clean up partial results when an export is cancelled via DELETE on the status URL. Previously, cancelling a $viewdefinition-export / $sqlquery-export job transitioned it to Cancelled, but already-written output shards were never deleted and remained downloadable via GET /export/{job_id}/{filename}.

- Add `delete_job(&self, job_id)` to the `ExportSink` trait and implement it for all three sinks: FilesystemSink (remove_dir_all), InMemorySink (drop matching keys), S3Sink (paginated list + delete under the job key prefix).
- Call it from the cancellation path, and again from the background task when it finishes a job that was cancelled mid-run (covers the write-after-cancel race).
- Gate `read_shard` on Cancelled status so a cancelled job's files 404 even while deletion is still draining.

Closes #144
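The per-sink cleanup described in this commit can be illustrated with a minimal sketch. It is not the PR's code: the trait here is synchronous, the S3 sink is omitted, and the field names `shards` and `root` are assumptions made for illustration. Only the trait name, the `delete_job(&self, job_id)` shape, and the per-sink strategies come from the commit message.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::PathBuf;
use std::sync::Mutex;

/// Simplified, synchronous stand-in for the PR's `ExportSink` trait.
pub trait ExportSink {
    /// Remove every output shard that belongs to `job_id`.
    /// Calling it again for the same job must succeed (idempotent).
    fn delete_job(&self, job_id: &str) -> io::Result<()>;
}

/// Hypothetical in-memory sink: shards keyed by "{job_id}/{filename}".
pub struct InMemorySink {
    pub shards: Mutex<HashMap<String, Vec<u8>>>,
}

impl ExportSink for InMemorySink {
    fn delete_job(&self, job_id: &str) -> io::Result<()> {
        // The trailing slash keeps "job-1" from also matching "job-10/...".
        let prefix = format!("{job_id}/");
        let mut shards = self.shards.lock().expect("shard map lock poisoned");
        shards.retain(|key, _| !key.starts_with(&prefix));
        Ok(())
    }
}

/// Hypothetical filesystem sink: one directory per job under `root`.
pub struct FilesystemSink {
    pub root: PathBuf,
}

impl ExportSink for FilesystemSink {
    fn delete_job(&self, job_id: &str) -> io::Result<()> {
        match fs::remove_dir_all(self.root.join(job_id)) {
            // A missing directory means there is nothing left to delete.
            Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
            other => other,
        }
    }
}
```

Treating a missing directory as success is what lets both the cancellation path and the background task call `delete_job` for the same job without coordinating with each other.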
…ials

Follow-up to the cancellation cleanup. Two adjacent storage-hygiene gaps in the SoF async export controller, found while reviewing #144:

1. No retention at all. The completion manifest advertises a 24h `Expires`, but nothing ever reclaimed finished jobs: the in-memory `jobs` map and the sink output grew unbounded for the process lifetime. Add a background reaper (HFS_EXPORT_OUTPUT_TTL, default 24h; HFS_EXPORT_CLEANUP_INTERVAL, default 300s) that deletes a terminal job's output and drops its bookkeeping once it ages past the TTL. `JobStatus::Cancelled` now carries `cancelled_at` and a `terminal_at()` helper lets all three terminal states age uniformly.

2. Failed jobs left orphaned partial shards. A failed job's result URL returns 500 with no manifest, so its shards are unreachable but were never deleted. The background task now deletes them on failure, and `read_shard` gates `Failed` (alongside `Cancelled`) so a racing poll never serves stale output.

DELETE-on-completed is intentionally left as a no-op that preserves the Completed state (per test_export_cancel_after_completion_preserves_completed_state); completed output is reclaimed by the TTL reaper above, not by DELETE.

Tests: reap_expired_reclaims_terminal_jobs_only. Full helios-rest lib suite (282) and the sof_export integration suite (44) pass; clippy clean with --features s3.
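A rough sketch of the aging rule the reaper applies, under stated assumptions: the `completed_at` and `failed_at` fields, the `reap_expired` signature, and the use of `SystemTime` are hypothetical; only `cancelled_at`, `terminal_at()`, and the "terminal jobs past the TTL" rule come from the commit message. The sink call is left as a comment.

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime};

/// Simplified job status; field names other than `cancelled_at` are assumed.
#[derive(Debug, Clone, PartialEq)]
pub enum JobStatus {
    Running,
    Completed { completed_at: SystemTime },
    Failed { failed_at: SystemTime },
    Cancelled { cancelled_at: SystemTime },
}

impl JobStatus {
    /// When the job reached a terminal state, or `None` if it is still running.
    pub fn terminal_at(&self) -> Option<SystemTime> {
        match self {
            JobStatus::Running => None,
            JobStatus::Completed { completed_at } => Some(*completed_at),
            JobStatus::Failed { failed_at } => Some(*failed_at),
            JobStatus::Cancelled { cancelled_at } => Some(*cancelled_at),
        }
    }
}

/// Drop bookkeeping for terminal jobs whose age is at least `ttl`.
/// Returns the ids that were reaped.
pub fn reap_expired(
    jobs: &mut HashMap<String, JobStatus>,
    now: SystemTime,
    ttl: Duration,
) -> Vec<String> {
    let expired: Vec<String> = jobs
        .iter()
        .filter(|(_, status)| {
            status
                .terminal_at()
                // A timestamp in the future yields Err and is never reaped.
                .and_then(|t| now.duration_since(t).ok())
                .is_some_and(|age| age >= ttl)
        })
        .map(|(id, _)| id.clone())
        .collect();
    for id in &expired {
        // A real reaper would delete the job's sink output here first.
        jobs.remove(id);
    }
    expired
}
```

Because `Running` has no terminal timestamp, a long-running job is immune to the reaper regardless of how old it is.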
Document the output lifecycle at the module level and flesh out the per-sink delete_job contracts (idempotency, key-prefix semantics, and why the S3 path bridges to the blocking pool). Comments only — no behavior change.
Code review: No issues found. Checked for bugs and CLAUDE.md compliance.
…2026-0185
Two ambient CI breakages on origin/main, unrelated to this PR's changes:
- clippy 1.91 newly flags collapsible_else_if in sof/emit.rs; collapse the
nested else { if .. } into else if.
- cargo audit fails on RUSTSEC-2026-0185 (quinn-proto remote memory
exhaustion), a transitive reqwest QUIC dep. We never accept inbound QUIC,
so the reassembly path is unreachable; ignore it with justification.
@claude and @mauripunzueta, how does this change work in a clustered, multi-hfs environment?
Claude finished @smunini's task in 1m 15s
How this works in a clustered multi-
@aacruzgon and @mauripunzueta, this is another case where we need a cluster-capable job store. Relates to #169, #170, and #150. We need a discussion document that describes the areas of hfs that need to be modified to use a unified cluster-capable job store, enabled via an environment-variable config setting, and that documents its use for those who want to run hfs in a cluster.
Summary

Closes #144 and addresses the adjacent storage-hygiene gaps it surfaced in the SQL-on-FHIR async export controller (`$viewdefinition-export` / `$sqlquery-export`).

Previously, cancelling a job transitioned it to `Cancelled`, but its already-written output shards were never deleted and stayed downloadable. Finished jobs were never reclaimed at all: the in-process `jobs` map and the sink output grew unbounded for the process lifetime, despite the completion manifest advertising a 24h `Expires`.

What changed

1. Cancellation cleanup (#144 as scoped)
- `delete_job(&job_id)` on the `ExportSink` trait, implemented for all three sinks:
  - `FilesystemSink`: `remove_dir_all` on the job dir (missing dir is not an error)
  - `InMemorySink`: drops entries under the `{job_id}/` key prefix
  - `S3Sink`: paginated `list_objects_v2` + `delete_object` under the job key prefix (idempotent)
- `read_shard` is gated on terminal state so a cancelled job's files 404 even while deletion is still draining.

2. TTL reaper for finished jobs
- A background reaper deletes a terminal job's output and drops its bookkeeping once it ages past the TTL, in line with the manifest's advertised `Expires`.
- `JobStatus::Cancelled` now carries `cancelled_at`, and a `terminal_at()` helper lets all three terminal states age uniformly.
- Configured via `HFS_EXPORT_OUTPUT_TTL` (default `86400`) and `HFS_EXPORT_CLEANUP_INTERVAL` (default `300`). The interval is clamped to ≥ 1s (`tokio::time::interval` panics on zero).

3. Failed-job partial cleanup
- The background task now deletes a failed job's partial shards.
- `read_shard` gates `Failed` alongside `Cancelled`.

Deliberately not changed
- `DELETE` on a completed job remains a no-op that preserves the `Completed` state (the status URL keeps redirecting to the result manifest), per `test_export_cancel_after_completion_preserves_completed_state`. Completed output is reclaimed by the TTL reaper above, not by `DELETE`.

Testing
- Tests: `cancel_deletes_partial_output_and_download_404s`, `reap_expired_reclaims_terminal_jobs_only` (age threshold, running-immunity, output deletion + bookkeeping removal).
- `helios-rest` lib suite (282) and the `sof_export` integration suite (44) pass, including the completion-preservation test.
- `cargo fmt` clean; `cargo clippy --features s3 --all-targets -D warnings` clean.

🤖 Generated with Claude Code
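The interval clamp mentioned under the TTL reaper might look roughly like the sketch below. The helper names (`parse_secs`, `cleanup_interval`, `output_ttl`) and the fallback-to-default behavior on unparseable input are assumptions; only the variable names, the defaults, and the ≥ 1s clamp come from the description.

```rust
use std::time::Duration;

/// Parse a whole number of seconds, falling back to `default_secs` when the
/// value is unset or not a valid unsigned integer (assumed behavior).
pub fn parse_secs(raw: Option<&str>, default_secs: u64) -> u64 {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(default_secs)
}

/// Cleanup interval: default 300s, clamped to at least 1s because
/// `tokio::time::interval` panics when given a zero period.
pub fn cleanup_interval(raw: Option<&str>) -> Duration {
    Duration::from_secs(parse_secs(raw, 300).max(1))
}

/// Output TTL: default 86400s (24h). No clamp is applied in this sketch.
pub fn output_ttl(raw: Option<&str>) -> Duration {
    Duration::from_secs(parse_secs(raw, 86_400))
}
```

A caller could feed it from the environment with `cleanup_interval(std::env::var("HFS_EXPORT_CLEANUP_INTERVAL").ok().as_deref())`; taking `Option<&str>` rather than reading the environment directly keeps the parsing testable.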