The dangerous migration is rarely the one that changes a schema; it is the one that rewrites operational data while the system is still serving traffic.

Situation

Most teams eventually outgrow ad hoc data fixes.

At first, a one-off script is reasonable: backfill a nullable column, correct malformed rows, reassign ownership after a product change, repair denormalized state, or move records from an old workflow into a new one. The operator knows the table, runs the script from a laptop or CI job, watches a few logs, and calls it done.

That works until the data change becomes operational infrastructure.

The same script now has to run in staging and production. It must survive deploy retries. It must not run twice. It must pause when database latency rises. It must expose progress to the incident channel. It must prove what it plans to touch before it touches it. It must be auditable after the engineer who wrote it has moved on.

Schema migration tools solve only part of this. Alembic, Django migrations, Rails migrations, and Flyway are good at ordering structural changes. They are less suited to long-running, chunked, resumable operational data changes where the core risk is not DDL correctness but production behavior under load.

The Problem

The failure mode is not simply “the script has a bug.”

The more common failure is that the script has no operating model. It scans too much. It holds locks too long. It retries without idempotency. It mixes deploy logic with data repair logic. It emits logs but no durable checkpoint. It has a --dry-run flag that exercises a different path from the real run. It assumes rollback means reversing the script, even though the application may already have observed the new state.

Operational data migrations need different guarantees from normal application jobs:

  • only one runner can own a migration at a time
  • every unit of work can be retried safely
  • progress is stored outside process memory
  • batches are small enough to bound lock time
  • validation runs before, during, and after execution
  • operators can pause, resume, and abort without editing code
  • CI can test the plan without touching production data

The core question is: how do we make Python data migrations boring enough to run through the same platform controls as a deployment?

Core Concept

A safe Python migration runner is a control plane around dangerous work. The migration code still contains domain-specific logic, but the runner owns orchestration, locking, checkpointing, validation, and observability.

flowchart TD
  A[CI job — migration request] --> B[plan builder — validate manifest]
  B --> C[dry run — estimate rows and batches]
  C --> D[approval gate — human or policy]
  D --> E[runner — acquire advisory lock]
  E --> F[checkpoint store — record state]
  F --> G[batch executor — bounded transaction]
  G --> H[validators — preflight and postflight]
  H --> I[metrics and logs — progress stream]
  I --> J{more batches}
  J -->|yes| G
  J -->|no| K[complete — release lock]
  E --> L[pause switch — operator control]
  L -->|paused| F

The unit of deployment is a migration package, not a loose script. Each package has a manifest:

id: backfill_account_tiers_2026_05_24
owner: platform-data
database: primary
mode: online
batch_size: 500
max_runtime_seconds: 1800
requires_approval: true

The Python interface should be small:

class Migration:
    def plan(self, db) -> Plan:
        ...

    def select_batch(self, db, checkpoint) -> list[RowRef]:
        ...

    def apply_batch(self, db, rows) -> BatchResult:
        ...

    def validate(self, db) -> ValidationResult:
        ...

The runner calls these methods; migration authors do not implement retries, locks, metrics, or state transitions. That division matters because platform safety depends on consistent behavior across migrations.

The first guardrail is a durable state machine. A migration moves through planned, approved, running, paused, failed, and completed. Each batch records a checkpoint, row count, checksum if practical, start time, end time, and error. If the process dies, the next run resumes from the last committed checkpoint.

The second guardrail is database-level ownership. In PostgreSQL, advisory locks are designed for application-defined coordination and are automatically cleaned up at session end or transaction end depending on the lock type. The runner can use a transaction-scoped advisory lock to prevent two workers from running the same migration concurrently without creating a coordination table hot spot. This follows PostgreSQL’s documented advisory lock behavior rather than inventing distributed locking semantics in Python.

The third guardrail is batch isolation. Each batch runs in its own bounded transaction. That gives the system a chance to pause between batches, reduces lock duration, and makes retries tractable. Long transactions are operationally expensive: they hold locks, delay vacuum progress, and make failures harder to contain. A runner should default to many small commits rather than one heroic commit.

The fourth guardrail is symmetry between dry run and execution. Dry run should call the same plan and select_batch logic, then stop before mutation. It should report estimated row counts, index usage assumptions, batch count, runtime budget, and the exact safety checks that will gate execution. A dry run that only prints “would update rows” is theater.

The fifth guardrail is an operator contract. Pause means finish the current batch and stop. Abort means stop scheduling new work and mark the migration as failed or canceled. Retry means resume from the checkpoint. Rollback is not a button unless the migration defines a verified compensating action. In many operational data changes, the safer rollback is a forward fix.

In Practice

Context: GitLab documents both post-deployment migrations and batched background migrations for database changes that should not be coupled directly to the main deploy path. Its documentation states that batched background migrations are used to update database tables in batches, and that queueing a batched background migration should happen in a post-deployment migration.

Action: The architectural pattern is to separate application rollout, migration scheduling, and migration execution. A Python runner should copy that separation: CI packages and validates the migration, a deploy step registers it, and a worker executes batches under operational controls.

Result: The documented pattern avoids treating a long-running data rewrite as a single deploy transaction. Operators can inspect migration state, reason about active background work, and keep application rollback concerns separate from data progress. That is the important lesson, not GitLab’s specific Rails implementation.

Learning: Do not hide operational data changes inside app startup, release hooks, or arbitrary one-off jobs. Make them first-class platform objects with lifecycle, ownership, and status.

Context: PostgreSQL documents explicit locking and advisory locks as mechanisms with well-defined transaction and session behavior. It also documents that table-level locks conflict differently depending on the operation. This matters because a migration that is “just updating rows” can still create production pressure through lock waits, index churn, and transaction age.

Action: The runner should encode database behavior into policy. It should require indexed batch selectors, set statement and lock timeouts, cap rows per transaction, and fail closed when the query plan is unsafe.

Result: Safety moves from reviewer memory into automation. Reviewers still evaluate business logic, but the runner consistently enforces the mechanical rules that prevent common production incidents.

Learning: A safe migration runner is not a clever script framework. It is a production workload scheduler for database mutations.

Where It Breaks

Failure modeWhy it happensMitigation
Full table scan during batch selectionmigration selects by an unindexed predicaterequire EXPLAIN checks and indexed cursor columns
Duplicate mutation after retrybatch writes are not idempotentuse deterministic row selection and write guards
Long lock waitstransaction touches too many rows or waits behind trafficset lock timeout and shrink batch size
Unbounded runtimerunner has no budget or pause pointenforce max runtime and pause between batches
False dry run confidencedry run uses different logicshare plan and selection code with execution
Unsafe rollback expectationdata has already been consumed by live coderequire compensating migration or forward fix plan
Invisible progressonly process logs existpersist checkpoint and emit metrics per batch

What to Do Next

  • Problem: Operational data changes fail when they are treated as scripts instead of production workflows.
  • Solution: Build a Python runner that owns lifecycle, locking, checkpointing, batch execution, validation, and operator controls.
  • Proof: The pattern is consistent with documented systems behavior: GitLab separates post-deployment and batched background migrations, while PostgreSQL provides explicit primitives for lock-aware coordination.
  • Action: Start with a minimal runner: manifest validation, dry run, advisory lock, checkpoint table, bounded batch transaction, pause flag, and postflight validator. Add policy only after every migration goes through that path.