Skip to content

Reusable zero-downtime migration capability for large tables #5973

@rtibbles

Description

@rtibbles

Overview

Studio has no pattern or tooling for migrating large tables safely: on a big table (e.g. File, ~100M rows) an unguarded migration can take an ACCESS EXCLUSIVE lock and cause downtime. Build a reusable capability so large-table schema changes are safe by default and the expand/contract pattern is repeatable — the foundation for widening File.file_size and for future migrations.

Complexity: Medium
Target branch: hotfixes

Context

Studio runs Django-managed migrations on PostgreSQL, applied at deploy. Unguarded DDL that rewrites or exclusively locks a large table blocks all queries for the duration. There is no lock-timeout guard, no CI check for unsafe migrations, and no reusable online backfill (a RunPython backfill runs in one large transaction).

The Change

  • Safe DDL by default: migrations run under a lock timeout (fail fast, don't block); indexes/constraints added without long locks; unsafe operations (table-rewriting type changes, blocking DDL) surfaced rather than silently applied.
  • CI linting flagging backward-incompatible / unsafe migrations.
  • Declarative dual-write: a database-level way to keep a shadow column in sync with its source.
  • Reusable online batched backfill: idempotent, resumable, throttled — not a single transaction.
  • A runbook for the expand/contract recipe.

Chosen building blocks: django-pg-zero-downtime-migrations (safe-DDL backend), django-migration-linter (CI), django-pgtrigger (dual-write); the backfill command and runbook are ours.

Acceptance Criteria

  • Migrations run under a lock timeout so they fail fast instead of blocking a large table.
  • Unsafe migrations (table-rewriting type changes, blocking DDL) are flagged in CI and surfaced — not silently applied — at runtime.
  • A declarative dual-write mechanism keeps a shadow column in sync with its source at the database level.
  • A reusable batched-backfill command is idempotent, resumable, throttled, and not a single transaction.
  • A runbook documents the expand/contract recipe for large tables.
  • The capability is reusable by future migrations, not specific to one column.

References

AI usage

I used Claude (Opus 4.8, via le-skills:writing-github-issues) to research the tooling and draft this issue. Adopting these libraries and building a reusable capability were my decisions; I edited for scope.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No fields configured for Task.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions