Resumable GCS uploads with server-controlled integrity (replace single-PUT) #5975

@rtibbles

Overview

Studio uploads each file in a single signed PUT. This is unreliable for large files (an interruption restarts from byte 0) and unsuitable for very large objects. Switch to GCS resumable uploads: sign a resumable initiation, move the web frontend to chunked uploads, and replace single-PUT entirely. Preserve the single-PUT scheme's integrity guarantee: the server, not the client, decides whether stored content matches its content-addressed checksum.

Complexity: High
Target branch: hotfixes

Context

get_presigned_upload_url / _get_gcs_presigned_put_url sign a single PUT (content_md5, content_type); the web frontend POSTs /api/file/upload_url, then PUTs the whole file. With GCS resumable uploads, the server instead signs a POST initiation (x-goog-resumable: start); the initiation returns a session URI, which then serves as the credential for the chunked PUTs.

Security model to preserve. The single-PUT scheme signed Content-MD5, so GCS rejected bytes not hashing to the server-pinned value — binding content and size. Files are content-addressed (object path = MD5 checksum). The replacement must keep the server as source of truth: a client must not be able to store non-matching content, bypass quota by under-declaring size, or poison the store (including via never-finalized uploads or the "skip if object exists" dedup path).
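A minimal sketch of the app-side comparison this model implies, assuming the expected checksum is the hex MD5 used in the object path and that the stored value is the base64-encoded md5Hash GCS reports in object metadata. The function names are hypothetical, not existing Studio code, and fetching the metadata is left out.

```python
import base64
import binascii
from typing import Optional


def hex_md5_to_gcs_md5_hash(hex_checksum: str) -> str:
    """Convert a hex MD5 checksum to the base64 form GCS uses for md5Hash."""
    digest = bytes.fromhex(hex_checksum)
    if len(digest) != 16:
        raise ValueError("an MD5 digest must be exactly 16 bytes")
    return base64.b64encode(digest).decode("ascii")


def object_matches_checksum(expected_hex: str, gcs_md5_hash: Optional[str]) -> bool:
    """Return True only if GCS's computed md5Hash equals the expected checksum.

    A missing md5Hash (for example on a composite object) counts as a
    mismatch, so object existence alone never passes the gate.
    """
    if not gcs_md5_hash:
        return False
    try:
        stored = base64.b64decode(gcs_md5_hash, validate=True)
    except (binascii.Error, ValueError):
        # Malformed metadata is treated as a mismatch, not an error.
        return False
    return stored == bytes.fromhex(expected_hex)
```

The accept, dedup, and serve paths would each call something like object_matches_checksum with metadata read by the server, never with a value supplied by the client.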

Constraint. Proxying uploads through the app server is not an option — it previously caused severe app-server performance problems and is why Studio moved to direct-to-GCS uploads. Integrity must hold within the direct-upload model.

The Change

  • Backend: sign a resumable initiation (object path signed, as today) instead of a single PUT; drop content_md5 (it can't bind a resumable upload). The upload_url response signals the resumable scheme.
  • Frontend: initiate the session and upload in chunks (each a multiple of 256 KiB, except the final chunk), resuming on interruption.
  • Server-controlled integrity (never trust the client):
    • Verified content-addressing (app-side): accept, dedup, and serve are gated on GCS's computed md5Hash equalling the expected checksum, never on object existence.
    • Spike: confirm whether a signed x-goog-hash at the initiation is GCS-enforced on finalize; if so, adopt it to reject non-matching bytes at upload time.
    • Infrastructure-side controls (prerequisite): object-finalize validation and lifecycle cleanup are handled infrastructure-side (in this issue's dependencies).
  • Cutover: remove single-PUT once both clients are migrated, coordinated with the ricecooker client so older clients don't break mid-transition.
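The chunking and resume arithmetic in the Frontend item can be sketched as follows. It is written in Python for illustration only (the real client is the web frontend), and next_offset and chunk_plan are hypothetical names. It assumes the standard GCS behaviour that a status query on an incomplete session answers 308 with a Range header such as "bytes=0-524287", and that a missing Range header means no bytes have been persisted.

```python
import re
from typing import Iterator, Optional, Tuple

# Non-final chunks must be a multiple of 256 KiB.
CHUNK_GRANULARITY = 256 * 1024


def next_offset(range_header: Optional[str]) -> int:
    """Return the first byte offset GCS has not yet persisted."""
    if not range_header:
        return 0  # no Range header: nothing stored yet
    match = re.fullmatch(r"bytes=0-(\d+)", range_header.strip())
    if match is None:
        raise ValueError(f"unexpected Range header: {range_header!r}")
    return int(match.group(1)) + 1


def chunk_plan(
    total_size: int, chunk_size: int, start: int = 0
) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end_exclusive, Content-Range value) for each chunk PUT.

    Only the final chunk may be shorter than chunk_size. Zero-byte files
    need a different Content-Range form and are not handled here.
    """
    if chunk_size <= 0 or chunk_size % CHUNK_GRANULARITY:
        raise ValueError("chunk_size must be a positive multiple of 256 KiB")
    if not 0 <= start <= total_size:
        raise ValueError("start must lie within the file")
    offset = start
    while offset < total_size:
        end = min(offset + chunk_size, total_size)
        yield offset, end, f"bytes {offset}-{end - 1}/{total_size}"
        offset = end
```

After an interruption, the client would query the session, pass the returned Range header to next_offset, and continue with chunk_plan(total_size, chunk_size, start=offset) rather than restarting from byte 0.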

Out of Scope

  • ricecooker's resumable client.
  • The File.file_size change.
  • Parallel / XML-multipart uploads (composite objects expose only crc32c, not md5Hash, complicating verification).
  • Proxying uploads through the app server.
  • Object-finalize validation and lifecycle cleanup (infrastructure-side prerequisite, in this issue's dependencies).

Acceptance Criteria

  • Studio signs a GCS resumable initiation; the upload_url response signals the resumable scheme.
  • The web frontend uploads via resumable chunks and resumes an interrupted upload instead of restarting.
  • App-side verified content-addressing: accept / dedup / serve require GCS's computed md5Hash to equal the expected checksum; a non-matching object is never accepted, dedup-matched, or served.
  • A client cannot cause Studio to accept, dedup-match, or serve content that doesn't match its checksum (upload-time / orphan / quota enforcement is covered by the infrastructure dependency).
  • The spike determines whether a signed x-goog-hash at initiation is GCS-enforced; if so, it is adopted.
  • A file larger than 2.1 GB uploads successfully from the web frontend.
  • Single-PUT is removed once clients are migrated and infrastructure-side finalize validation is in place.

References

AI usage

I used Claude (Opus 4.8, via le-skills:writing-github-issues) to verify the GCS resumable and checksum mechanics against the docs and draft this issue. I drove the security analysis (client-trust and never-finalized-upload risks) and the design decisions; I edited the drafts where they over-trusted the client.
