Database Changes in CI/CD: Migrations, Backfills, Expand-Contract, and Verification

A deployment pipeline that treats database change as a shell command is not automated; it is just moving the outage closer to production.

Situation

Application delivery has become routine. Every merge can build, test, package, scan, deploy, and roll back. The uncomfortable exception is the database. Schema changes are durable, shared, stateful, and often expensive. A bad application deploy can be rolled back by moving traffic to a previous artifact. A bad column drop, blocking index build, or half-completed backfill is a different class of failure.

That is why database delivery needs its own release protocol inside CI/CD. Migrations are not just files in a repository. They are operations against a live, contended system with locks, replication lag, query plans, old application versions, new application versions, background workers, and human rollback expectations.

Rails describes migrations as a way to evolve schema over time, but its own documentation also notes that not every database supports transactional DDL for every schema operation; when a migration fails, some completed parts may not be rolled back automatically.¹ That small detail is the heart of the problem. Database change is deployment, data repair, capacity management, and verification all at once.

The Problem

Most teams begin with a simple rule: run migrations before deploy. That works until the migration is slow, incompatible, or logically coupled to code that is not fully rolled out.

The common failure modes are predictable:

A deploy adds code that reads a column before the migration is complete.
A migration drops a column still used by an older application instance.
A backfill competes with production traffic and creates lock waits or replica lag.
A new constraint fails validation against existing dirty data and blocks the deploy.
A rollback reverts application code but leaves the database in the new shape.
CI proves the migration works on an empty test database but not on production-sized data.

The question is not whether database changes should be automated. They should. The question is what the pipeline must know before it is allowed to change shared state.

Core Concept

The safe pattern is expand, deploy, backfill, verify, contract. It turns a dangerous one-step migration into a sequence of compatible states.

flowchart TD
  A[proposal — schema change request] --> B[static checks — unsafe operation detection]
  B --> C[expand migration — additive schema]
  C --> D[deploy code — dual read or dual write]
  D --> E[backfill job — bounded batches]
  E --> F[verification — counts constraints and query plans]
  F --> G[contract migration — remove obsolete shape]
  G --> H[post deploy audit — drift and health checks]

  B -->|reject| X[manual review — lock risk or data risk]
  E -->|pause| Y[traffic protection — throttle or stop]
  F -->|fail| Z[remediation — repair data before contract]

The first design rule is compatibility. Every production state must tolerate old code and new code running together. That means additive migrations first: add nullable columns, create tables, add indexes concurrently where the database supports it, and avoid immediate destructive changes.
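As a minimal sketch of the compatibility rule, the following uses Python's built-in sqlite3 module as a stand-in for a production database. The `users` table, the `full_name` and `display_name` columns, and the reader and writer functions are hypothetical names chosen for illustration. It shows an additive expand step that leaves old code working while new code dual-writes and dual-reads.

```python
import sqlite3

def expand(conn):
    """Expand step: add a nullable column; nothing is removed or renamed."""
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

def old_writer(conn, name):
    # Old code knows nothing about display_name and must keep working.
    conn.execute("INSERT INTO users (full_name) VALUES (?)", (name,))

def new_writer(conn, name):
    # New code dual-writes so either version finds the column it expects.
    conn.execute(
        "INSERT INTO users (full_name, display_name) VALUES (?, ?)",
        (name, name),
    )

def new_reader(conn, user_id):
    # Dual read: prefer the new column, fall back to the old one.
    row = conn.execute(
        "SELECT COALESCE(display_name, full_name) FROM users WHERE id = ?",
        (user_id,),
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)"
)
expand(conn)
old_writer(conn, "Ada")    # an old instance still running during rollout
new_writer(conn, "Grace")  # a new instance
conn.commit()
```

Rows written by the old instance keep a NULL `display_name` until a backfill fills them in, which is why the reader falls back to `full_name` rather than assuming the new column is populated.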

The second rule is separation. Schema migration and data migration are different operations. A schema migration changes shape. A backfill changes data, often at high volume. Backfills belong in resumable, observable jobs, not inside a deploy transaction. They need batch size, sleep interval, retry policy, progress state, error quarantine, and an emergency stop.
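A backfill of this kind might look roughly like the sketch below, again using sqlite3 as a stand-in. The `backfill_progress` checkpoint table, the job name, and the `should_stop` callback are assumptions made for illustration. Retry policy and error quarantine are left out to keep the example short.

```python
import sqlite3
import time

JOB = "users.display_name"  # hypothetical job name used as the checkpoint key

def backfill(conn, batch_size=2, sleep_s=0.0, should_stop=lambda: False):
    """Copy full_name into display_name in bounded, resumable batches."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS backfill_progress "
        "(job TEXT PRIMARY KEY, last_id INTEGER NOT NULL)"
    )
    row = conn.execute(
        "SELECT last_id FROM backfill_progress WHERE job = ?", (JOB,)
    ).fetchone()
    last_id = row[0] if row else 0  # resume from the checkpoint if present
    changed = 0
    while not should_stop():  # emergency stop is checked before every batch
        ids = [r[0] for r in conn.execute(
            "SELECT id FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        )]
        if not ids:
            break
        with conn:  # one short transaction per batch, committed on exit
            cur = conn.execute(
                "UPDATE users SET display_name = full_name "
                "WHERE id BETWEEN ? AND ? AND display_name IS NULL",
                (ids[0], ids[-1]),
            )
            changed += cur.rowcount
            last_id = ids[-1]
            conn.execute(
                "INSERT OR REPLACE INTO backfill_progress (job, last_id) "
                "VALUES (?, ?)",
                (JOB, last_id),
            )
        time.sleep(sleep_s)  # throttle between batches
    return changed

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, "
    "full_name TEXT NOT NULL, display_name TEXT)"
)
conn.executemany(
    "INSERT INTO users (full_name) VALUES (?)",
    [("a",), ("b",), ("c",), ("d",), ("e",)],
)
conn.commit()

stops = iter([False, True])  # simulate an operator stopping after one batch
first = backfill(conn, should_stop=lambda: next(stops))
second = backfill(conn)      # a later run resumes from the checkpoint
```

The `display_name IS NULL` guard keeps the update idempotent, so re-running a batch after a crash does not overwrite values that newer application code has already written.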

The third rule is verification as a gate, not a dashboard. The pipeline should not merely run db:migrate and report success. It should ask whether the resulting database state is compatible with the next release step. That means verifying migration order, expected columns, indexes, constraints, row counts, null rates, duplicate keys, backfill completion, and query plan changes for critical paths.
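One way to treat verification as a gate is to have the pipeline call a function that returns a list of failures and refuse to continue unless the list is empty. The sketch below checks only three of the items named above (expected column, null rate, duplicate keys) and uses sqlite3 as a stand-in. The function name `verify_backfill` and its parameters are hypothetical.

```python
import sqlite3

def verify_backfill(conn, table, column, unique_column=None, max_null_rate=0.0):
    """Return a list of failures; an empty list means the gate may open.

    Identifiers are interpolated directly, so they must come from trusted
    pipeline configuration, never from user input.
    """
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in existing:
        return [f"expected column {table}.{column} is missing"]
    failures = []
    total, nulls = conn.execute(
        f"SELECT COUNT(*), COALESCE(SUM({column} IS NULL), 0) FROM {table}"
    ).fetchone()
    null_rate = nulls / total if total else 0.0
    if null_rate > max_null_rate:
        failures.append(
            f"null rate {null_rate:.2%} on {column} exceeds {max_null_rate:.2%}"
        )
    if unique_column is not None:
        dupes = conn.execute(
            f"SELECT COUNT(*) FROM (SELECT {unique_column} FROM {table} "
            f"GROUP BY {unique_column} HAVING COUNT(*) > 1)"
        ).fetchone()[0]
        if dupes:
            failures.append(f"{dupes} duplicate value(s) in {unique_column}")
    return failures

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, display_name TEXT)"
)
conn.executemany(
    "INSERT INTO users (email, display_name) VALUES (?, ?)",
    [("a@example.com", "A"), ("b@example.com", None), ("b@example.com", "B2")],
)
before = verify_backfill(conn, "users", "display_name", unique_column="email")

# Repair the data, then ask the gate again.
conn.execute("UPDATE users SET display_name = 'B' WHERE display_name IS NULL")
conn.execute("UPDATE users SET email = 'c@example.com' WHERE id = 3")
after = verify_backfill(conn, "users", "display_name", unique_column="email")
```

Returning reasons rather than a boolean lets the pipeline log why the contract phase was blocked, which matters when the owning team has to repair data before retrying.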

The fourth rule is delayed destruction. Contract migrations happen only after the system has proven that the old shape is unused. Dropping a column is not the rollback plan. It is the last step after telemetry, code search, deploy completion, and data verification say the old contract is gone.

In Practice

Context: The documented pattern across mature systems is that schema change must be decoupled from ordinary deploy speed. GitLab documents post-deployment migrations for changes that should run after application code is deployed, and it separately documents batched background migrations for long-running data changes.²³ That is not an exotic optimization. It is an acknowledgement that different database operations belong at different points in the release lifecycle.

Action: The platform should encode those phases directly. A pull request that adds a column should pass static migration checks. A deploy should apply only migrations that are safe before code rollout. A post-deploy phase should run operations that depend on new code being present. A backfill worker should own data movement in controlled batches. A final contract migration should be blocked until verification proves the old path is no longer required.

Result: The result is not zero risk. It is localized risk. A failed additive migration can block a deploy before incompatible code ships. A slow backfill can be paused without rolling back the application. A failed verification can stop the contract phase while production continues using the expanded schema. GitHub’s gh-ost is an example of the same operational instinct for MySQL schema changes: online migration machinery exists because directly altering large production tables can couple migration workload to user-facing database load.⁴⁵

Learning: The important lesson is that database CI/CD should optimize for reversible application states, not reversible SQL files. Rollback is often a code movement back to a compatible version while the database remains expanded. The database should move forward through safe states, with destructive changes delayed until they are boring.

The Pipeline Contract

A serious database pipeline needs more than a migration runner.

It needs a classifier. Additive operations can proceed automatically. Potentially blocking operations require review. Destructive operations require proof that they are in the contract phase. Data rewrites require a backfill plan.
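A classifier along these lines could start as a small rule table. The sketch below assumes PostgreSQL-style statements and matches them with regular expressions, which is only an approximation; a real SQL parser would be more robust. The `Risk` enum and the rule set are illustrative rather than exhaustive.

```python
import re
from enum import Enum

class Risk(Enum):
    ADDITIVE = "proceed automatically"
    BLOCKING = "require review"
    DESTRUCTIVE = "require proof of contract phase"
    DATA_REWRITE = "require a backfill plan"

# Rules are checked in order; the first match wins.
_RULES = [
    (re.compile(r"^\s*(DROP\s+(TABLE|INDEX)|TRUNCATE)\b|\bDROP\s+COLUMN\b", re.I),
     Risk.DESTRUCTIVE),
    (re.compile(r"^\s*(UPDATE|DELETE)\b", re.I), Risk.DATA_REWRITE),
    (re.compile(r"\bSET\s+NOT\s+NULL\b|\bALTER\s+COLUMN\b.*\bTYPE\b", re.I),
     Risk.BLOCKING),
    # A plain CREATE INDEX (without CONCURRENTLY) can block writes.
    (re.compile(r"^\s*CREATE\s+(UNIQUE\s+)?INDEX\s+(?=\S)(?!CONCURRENTLY\b)", re.I),
     Risk.BLOCKING),
    (re.compile(
        r"^\s*CREATE\s+(TABLE|(UNIQUE\s+)?INDEX\s+CONCURRENTLY)\b|\bADD\s+COLUMN\b",
        re.I),
     Risk.ADDITIVE),
]

def classify(statement):
    """Map one SQL statement to a risk class."""
    for pattern, risk in _RULES:
        if pattern.search(statement):
            return risk
    return Risk.BLOCKING  # anything unrecognized is escalated to review

plan = [
    "ALTER TABLE users ADD COLUMN display_name text",
    "CREATE INDEX CONCURRENTLY idx_users_email ON users (email)",
    "CREATE INDEX idx_users_name ON users (full_name)",
    "UPDATE users SET display_name = full_name",
    "ALTER TABLE users DROP COLUMN full_name",
]
routing = [(sql, classify(sql)) for sql in plan]
```

Defaulting unknown statements to review keeps the automated path narrow: the pipeline only fast-tracks operations it positively recognizes as additive.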

It needs production realism. CI should run migrations from both an empty database and a recent schema snapshot. The empty case catches ordering problems. The snapshot case catches drift, long-forgotten assumptions, and migrations that only work when no data exists.
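The difference between the two cases can be shown with a small sketch. It uses sqlite3 in place of a real database, and a "snapshot" that is simply the baseline schema plus two representative rows; the table, index name, and helper functions are hypothetical. The pending migration succeeds on the empty database and fails on the snapshot because existing rows violate the new unique index.

```python
import sqlite3

BASELINE = "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)"
PENDING = ["CREATE UNIQUE INDEX idx_users_email ON users (email)"]

def try_migrations(conn, migrations):
    """Apply migrations in order; return None on success or the first error."""
    for sql in migrations:
        try:
            conn.execute(sql)
        except sqlite3.Error as exc:
            return f"{sql!r} failed: {exc}"
    return None

def empty_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(BASELINE)
    return conn

def snapshot_db():
    # Stand-in for a restored snapshot: same schema, plus existing rows.
    conn = empty_db()
    conn.executemany(
        "INSERT INTO users (email) VALUES (?)",
        [("a@example.com",), ("a@example.com",)],
    )
    conn.commit()
    return conn

results = {
    "empty": try_migrations(empty_db(), PENDING),
    "snapshot": try_migrations(snapshot_db(), PENDING),
}
for case, error in results.items():
    print(case, "ok" if error is None else error)
```

A CI job built on this idea would fail the build whenever either entry in `results` is not `None`.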

It needs policy checks. Examples include rejecting column drops outside a contract migration, requiring concurrent index creation where supported, blocking non-null constraints without a prior validation plan, and requiring idempotent backfill jobs with checkpoints.

It needs observability. A backfill without progress reporting is just a long-running incident with a friendlier name. Track rows scanned, rows changed, error rate, lock waits, deadlocks, replica lag, batch latency, and estimated completion. The deploy system should be able to pause the job automatically when the database is under stress.
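The automatic pause can be sketched as a pure function over a health sample, so it is easy to test apart from the job itself. The `DbHealth` and `Limits` types and every threshold value below are assumptions for illustration; real limits depend on the database and workload.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class DbHealth:
    replica_lag_s: float
    lock_waits: int
    deadlocks: int
    batch_latency_s: float

@dataclass(frozen=True)
class Limits:
    # Illustrative thresholds only; tune per database and workload.
    max_replica_lag_s: float = 5.0
    max_lock_waits: int = 10
    max_deadlocks: int = 0
    max_batch_latency_s: float = 2.0

def pause_reasons(health: DbHealth, limits: Limits = Limits()) -> List[str]:
    """Return the reasons to pause; an empty list means keep running."""
    reasons = []
    if health.replica_lag_s > limits.max_replica_lag_s:
        reasons.append(
            f"replica lag {health.replica_lag_s}s exceeds {limits.max_replica_lag_s}s"
        )
    if health.lock_waits > limits.max_lock_waits:
        reasons.append(f"{health.lock_waits} lock waits exceed {limits.max_lock_waits}")
    if health.deadlocks > limits.max_deadlocks:
        reasons.append(f"{health.deadlocks} deadlocks exceed {limits.max_deadlocks}")
    if health.batch_latency_s > limits.max_batch_latency_s:
        reasons.append(
            f"batch latency {health.batch_latency_s}s exceeds {limits.max_batch_latency_s}s"
        )
    return reasons

def eta_seconds(rows_done: int, rows_total: int, elapsed_s: float) -> Optional[float]:
    """Estimate remaining time from average throughput so far."""
    if rows_done <= 0 or elapsed_s <= 0:
        return None  # not enough progress to estimate
    return (rows_total - rows_done) * elapsed_s / rows_done

healthy = pause_reasons(DbHealth(0.5, 1, 0, 0.2))
stressed = pause_reasons(DbHealth(12.0, 3, 0, 0.4))
remaining = eta_seconds(rows_done=250, rows_total=1000, elapsed_s=50.0)
```

A backfill worker could call `pause_reasons` between batches and stop when the list is non-empty, recording the reasons alongside its checkpoint.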

It needs explicit ownership. The author of a migration owns the full lifecycle: expand, application compatibility, backfill, verification, and contract. Platform automation can enforce the gates, but it cannot infer the business invariant. Only the owning team can say what “fully backfilled” or “safe to remove” means.

Where It Breaks

| Failure mode | Why it happens | Mitigation |
| --- | --- | --- |
| Migration passes CI but blocks production | Test data is too small and lock behavior is invisible | Run static checks, use realistic schema snapshots, require online patterns for large tables |
| Backfill overloads the primary | Data movement is deployed like code instead of operated like workload | Use bounded batches, throttling, checkpoints, and automatic pause conditions |
| Rollback expectation is false | Application rollback cannot undo destructive schema changes | Use expand-contract and keep old schema available through rollback windows |
| Constraint validation fails late | Existing data violates the new invariant | Add constraints in stages, preflight violations, repair data before enforcement |
| Contract happens too early | Old code path still exists in workers, scripts, or delayed jobs | Verify usage with telemetry, code search, deploy completion, and job drain checks |
| Pipeline becomes too slow | Every change is treated as maximum risk | Classify operations and automate the safe path while escalating only risky changes |

What to Do Next

Problem: Database changes fail differently than application changes because they mutate shared durable state.
Solution: Treat schema migration, code rollout, backfill, verification, and contract as separate CI/CD phases.
Proof: Use documented patterns such as post-deployment migrations, batched background migrations, and online schema migration tools as evidence that mature systems separate risk by operation type.
Action: Add pipeline gates for unsafe DDL, require resumable backfills, block destructive changes until verification passes, and make every database change declare its expand-contract plan.

Rajiv
