Building a Safe Python Migration Runner for Operational Data Changes

The dangerous migration is rarely the one that changes a schema; it is the one that rewrites operational data while the system is still serving traffic.

Situation

Most teams eventually outgrow ad hoc data fixes.

At first, a one-off script is reasonable: backfill a nullable column, correct malformed rows, reassign ownership after a product change, repair denormalized state, or move records from an old workflow into a new one. The operator knows the table, runs the script from a laptop or CI job, watches a few logs, and calls it done.

That works until the data change becomes operational infrastructure.

The same script now has to run in staging and production. It must survive deploy retries. It must not run twice. It must pause when database latency rises. It must expose progress to the incident channel. It must prove what it plans to touch before it touches it. It must be auditable after the engineer who wrote it has moved on.

Schema migration tools solve only part of this. Alembic, Django migrations, Rails migrations, and Flyway are good at ordering structural changes. They are less suited to long-running, chunked, resumable operational data changes where the core risk is not DDL correctness but production behavior under load.

The Problem

The failure mode is not simply “the script has a bug.”

The more common failure is that the script has no operating model. It scans too much. It holds locks too long. It retries without idempotency. It mixes deploy logic with data repair logic. It emits logs but no durable checkpoint. It has a --dry-run flag that exercises a different path from the real run. It assumes rollback means reversing the script, even though the application may already have observed the new state.

Operational data migrations need different guarantees from normal application jobs:

only one runner can own a migration at a time
every unit of work can be retried safely
progress is stored outside process memory
batches are small enough to bound lock time
validation runs before, during, and after execution
operators can pause, resume, and abort without editing code
CI can test the plan without touching production data

The core question is: how do we make Python data migrations boring enough to run through the same platform controls as a deployment?

Core Concept

A safe Python migration runner is a control plane around dangerous work. The migration code still contains domain-specific logic, but the runner owns orchestration, locking, checkpointing, validation, and observability.

flowchart TD
  A[CI job — migration request] --> B[plan builder — validate manifest]
  B --> C[dry run — estimate rows and batches]
  C --> D[approval gate — human or policy]
  D --> E[runner — acquire advisory lock]
  E --> F[checkpoint store — record state]
  F --> G[batch executor — bounded transaction]
  G --> H[validators — preflight and postflight]
  H --> I[metrics and logs — progress stream]
  I --> J{more batches}
  J -->|yes| G
  J -->|no| K[complete — release lock]
  E --> L[pause switch — operator control]
  L -->|paused| F

The unit of deployment is a migration package, not a loose script. Each package has a manifest:

id: backfill_account_tiers_2026_05_24
owner: platform-data
database: primary
mode: online
batch_size: 500
max_runtime_seconds: 1800
requires_approval: true

The Python interface should be small:

class Migration:
    def plan(self, db) -> Plan:
        ...

    def select_batch(self, db, checkpoint) -> list[RowRef]:
        ...

    def apply_batch(self, db, rows) -> BatchResult:
        ...

    def validate(self, db) -> ValidationResult:
        ...

The runner calls these methods; migration authors do not implement retries, locks, metrics, or state transitions. That division matters because platform safety depends on consistent behavior across migrations.

The first guardrail is a durable state machine. A migration moves through planned, approved, running, paused, failed, and completed. Each batch records a checkpoint, row count, checksum if practical, start time, end time, and error. If the process dies, the next run resumes from the last committed checkpoint.

The second guardrail is database-level ownership. In PostgreSQL, advisory locks are designed for application-defined coordination and are automatically cleaned up at session end or transaction end depending on the lock type. The runner can use a transaction-scoped advisory lock to prevent two workers from running the same migration concurrently without creating a coordination table hot spot. This follows PostgreSQL’s documented advisory lock behavior rather than inventing distributed locking semantics in Python.

The third guardrail is batch isolation. Each batch runs in its own bounded transaction. That gives the system a chance to pause between batches, reduces lock duration, and makes retries tractable. Long transactions are operationally expensive: they hold locks, delay vacuum progress, and make failures harder to contain. A runner should default to many small commits rather than one heroic commit.

The fourth guardrail is symmetry between dry run and execution. Dry run should call the same plan and select_batch logic, then stop before mutation. It should report estimated row counts, index usage assumptions, batch count, runtime budget, and the exact safety checks that will gate execution. A dry run that only prints “would update rows” is theater.

The fifth guardrail is an operator contract. Pause means finish the current batch and stop. Abort means stop scheduling new work and mark the migration as failed or canceled. Retry means resume from the checkpoint. Rollback is not a button unless the migration defines a verified compensating action. In many operational data changes, the safer rollback is a forward fix.

In Practice

Context: GitLab documents both post-deployment migrations and batched background migrations for database changes that should not be coupled directly to the main deploy path. Its documentation states that batched background migrations are used to update database tables in batches, and that queueing a batched background migration should happen in a post-deployment migration.

Action: The architectural pattern is to separate application rollout, migration scheduling, and migration execution. A Python runner should copy that separation: CI packages and validates the migration, a deploy step registers it, and a worker executes batches under operational controls.

Result: The documented pattern avoids treating a long-running data rewrite as a single deploy transaction. Operators can inspect migration state, reason about active background work, and keep application rollback concerns separate from data progress. That is the important lesson, not GitLab’s specific Rails implementation.

Learning: Do not hide operational data changes inside app startup, release hooks, or arbitrary one-off jobs. Make them first-class platform objects with lifecycle, ownership, and status.

Context: PostgreSQL documents explicit locking and advisory locks as mechanisms with well-defined transaction and session behavior. It also documents that table-level locks conflict differently depending on the operation. This matters because a migration that is “just updating rows” can still create production pressure through lock waits, index churn, and transaction age.

Action: The runner should encode database behavior into policy. It should require indexed batch selectors, set statement and lock timeouts, cap rows per transaction, and fail closed when the query plan is unsafe.

Result: Safety moves from reviewer memory into automation. Reviewers still evaluate business logic, but the runner consistently enforces the mechanical rules that prevent common production incidents.

Learning: A safe migration runner is not a clever script framework. It is a production workload scheduler for database mutations.

Where It Breaks

Failure mode	Why it happens	Mitigation
Full table scan during batch selection	migration selects by an unindexed predicate	require `EXPLAIN` checks and indexed cursor columns
Duplicate mutation after retry	batch writes are not idempotent	use deterministic row selection and write guards
Long lock waits	transaction touches too many rows or waits behind traffic	set lock timeout and shrink batch size
Unbounded runtime	runner has no budget or pause point	enforce max runtime and pause between batches
False dry run confidence	dry run uses different logic	share plan and selection code with execution
Unsafe rollback expectation	data has already been consumed by live code	require compensating migration or forward fix plan
Invisible progress	only process logs exist	persist checkpoint and emit metrics per batch

What to Do Next

Problem: Operational data changes fail when they are treated as scripts instead of production workflows.
Solution: Build a Python runner that owns lifecycle, locking, checkpointing, batch execution, validation, and operator controls.
Proof: The pattern is consistent with documented systems behavior: GitLab separates post-deployment and batched background migrations, while PostgreSQL provides explicit primitives for lock-aware coordination.
Action: Start with a minimal runner: manifest validation, dry run, advisory lock, checkpoint table, bounded batch transaction, pause flag, and postflight validator. Add policy only after every migration goes through that path.

Situation

The Problem

Core Concept

In Practice

Where It Breaks

What to Do Next

Rajiv

Related Posts

The Platform Automation Maturity Model: Scripts, Modules, Catalogs, Pipelines, Control Planes

Automation Rollback Playbook: Disable, Revert, Repair State, and Reconcile Reality

DB Team Automation Roadmap: Backups, Patching, Refreshes, Provisioning, and Guardrails