Database Migration Cutover Workflow: Dual Writes, CDC, Backfill, Freeze, and Rollback

A database migration does not fail at the data copy step; it fails when the organization discovers that “almost synchronized” is not an operational state.

Situation

Teams migrate databases for good reasons: splitting a monolith, moving from self-managed infrastructure to managed cloud, changing storage engines, isolating high-growth domains, or replacing a schema that can no longer carry product behavior. The hard part is rarely the first export. The hard part is keeping the old and new systems correct while real traffic continues to mutate the source of truth.

That creates a familiar migration timeline: capture the source log position to start CDC, backfill historical rows up to that position, stream changes through CDC to catch up, run dual writes for application-owned mutations, validate both sides, freeze writes, cut over traffic, and preserve a rollback path. Each step sounds independently reasonable. Together, they form a distributed system with ordering, idempotency, schema drift, replay, and ownership problems.

The mistake is treating cutover as a deployment event. It is not. Cutover is the final state transition in a long-running data protocol.

The Problem

Most migration failures come from ambiguous ownership. During the migration, which system owns a row? Which write path is authoritative? Which timestamp wins? What happens when the new database accepts a write but the old database times out? Can the team roll back after target-only writes begin?

Dual writes are especially dangerous when they are framed as “write to both databases.” A correct dual-write path needs idempotency keys, retry semantics, deterministic mapping, observability, and a defined failure policy. Without those controls, the system can silently create divergence while all application requests return success.

CDC has a different failure mode. It is good at preserving ordered change streams from a database log, but it does not magically repair bad transformations, missing DDL, incompatible constraints, or application writes that bypass the captured source. A backfill can load yesterday’s truth while CDC races to deliver today’s mutations. If validation only checks row counts, the migration may pass while balances, permissions, inventory, or workflow states are wrong.

The core question is: how do you design a migration cutover so that every phase has one owner, one verification gate, and one rollback boundary?

Core Concept

The safest pattern is to run the migration as a controlled state machine, not as a collection of scripts. Each phase should have explicit entry criteria, exit criteria, metrics, and rollback behavior.

flowchart TD
  A[source database — current owner] --> B[backfill worker — bounded chunks]
  A --> C[CDC stream — ordered changes]
  B --> D[target database — candidate owner]
  C --> D
  E[application — feature flags] --> F[dual write adapter — idempotent operations]
  F --> A
  F --> D
  D --> G[validation — counts checksums invariants]
  G --> H{cutover gate — lag zero errors zero}
  H -->|not ready| I[rollback plan — source remains owner]
  H -->|ready| J[write freeze — drain queues]
  J --> K[flip reads and writes — target owner]
  K --> L[post cutover watch — repair or revert]
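
One way to make the phases above explicit is a small state machine that refuses to advance until the exit gates of the current phase pass. This is a minimal sketch, not a definitive implementation: the phase names, the `Migration` class, and the gate callables are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Phase(Enum):
    CAPTURE_POSITION = "capture_position"
    BACKFILL = "backfill"
    CDC_CATCH_UP = "cdc_catch_up"
    DUAL_WRITE = "dual_write"
    VALIDATION = "validation"
    FREEZE = "freeze"
    CUTOVER = "cutover"
    POST_CUTOVER = "post_cutover"


# Only these forward transitions are allowed (hypothetical ordering).
NEXT_PHASE: Dict[Phase, Phase] = {
    Phase.CAPTURE_POSITION: Phase.BACKFILL,
    Phase.BACKFILL: Phase.CDC_CATCH_UP,
    Phase.CDC_CATCH_UP: Phase.DUAL_WRITE,
    Phase.DUAL_WRITE: Phase.VALIDATION,
    Phase.VALIDATION: Phase.FREEZE,
    Phase.FREEZE: Phase.CUTOVER,
    Phase.CUTOVER: Phase.POST_CUTOVER,
}

# Phases in which the target database owns the data.
TARGET_OWNED = {Phase.CUTOVER, Phase.POST_CUTOVER}

Gate = Callable[[], bool]


@dataclass
class Migration:
    phase: Phase = Phase.CAPTURE_POSITION
    exit_gates: Dict[Phase, List[Gate]] = field(default_factory=dict)
    history: List[Phase] = field(default_factory=list)

    def owner(self) -> str:
        """Name the single authoritative database for the current phase."""
        return "target" if self.phase in TARGET_OWNED else "source"

    def can_switch_back(self) -> bool:
        """Rollback is a routing switch only while the source still owns truth."""
        return self.owner() == "source"

    def advance(self) -> Phase:
        """Move to the next phase only if every exit gate of this phase passes."""
        if self.phase not in NEXT_PHASE:
            raise RuntimeError(f"no phase after {self.phase.value}")
        failed = [g.__name__ for g in self.exit_gates.get(self.phase, []) if not g()]
        if failed:
            raise RuntimeError(f"exit gates failed for {self.phase.value}: {failed}")
        self.history.append(self.phase)
        self.phase = NEXT_PHASE[self.phase]
        return self.phase
```

The point of the design is that a script cannot skip a phase by accident: the only way forward is `advance()`, and the gate names that blocked it appear in the error.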

Start with ownership. Before cutover, the source database remains authoritative and the target is a candidate copy. The operational timeline begins by establishing the CDC stream and recording the source log position before any data moves. Once that position is recorded, the backfill copies historical state up to it in bounded chunks, so that each chunk can be paused, resumed, and re-run. Each chunk should record its high-water mark, row count, checksum where practical, and transformation version.
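
A chunked backfill of this kind can be sketched with keyset pagination and a per-chunk ledger. The example below is an illustration under stated assumptions: it uses Python's built-in `sqlite3` for both sides, a hypothetical `customers(id, email)` table, and a made-up `TRANSFORM_VERSION` label. A real migration would use its own drivers and schema.

```python
import hashlib
import sqlite3
from typing import List, NamedTuple

TRANSFORM_VERSION = "v1"  # hypothetical label for the transformation code in use


class ChunkRecord(NamedTuple):
    start_after: int       # exclusive lower bound of the chunk
    high_water_mark: int   # highest primary key copied in the chunk
    row_count: int
    checksum: str
    transform_version: str


def chunk_checksum(rows) -> str:
    """Hash the rows of one chunk in primary-key order."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def backfill(source: sqlite3.Connection, target: sqlite3.Connection,
             chunk_size: int, resume_after: int = 0) -> List[ChunkRecord]:
    """Copy rows in bounded chunks; safe to stop, resume, and re-run."""
    ledger: List[ChunkRecord] = []
    last_id = resume_after
    while True:
        # Keyset pagination: each chunk starts after the previous high-water mark.
        rows = source.execute(
            "SELECT id, email FROM customers WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return ledger
        with target:  # one transaction per chunk
            # Upsert so that re-running a chunk does not duplicate rows.
            target.executemany(
                "INSERT OR REPLACE INTO customers (id, email) VALUES (?, ?)", rows
            )
        ledger.append(ChunkRecord(last_id, rows[-1][0], len(rows),
                                  chunk_checksum(rows), TRANSFORM_VERSION))
        last_id = rows[-1][0]
```

Resuming after an interruption means calling `backfill` again with `resume_after` set to the last recorded high-water mark. Because each chunk is an upsert, re-running a range that was already copied leaves the target unchanged.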

CDC then continuously carries the delta from the established start point. The stream should be monitored as a first-class dependency: replication lag, apply latency, failed records, retry queue depth, schema errors, and last committed source position. AWS Database Migration Service documents this as a full-load plus CDC pattern for minimizing downtime during migration, where ongoing changes are cached during the initial load and then replicated continuously (AWS DMS CDC documentation, AWS cutover guidance).

Dual writes should be introduced only after the transformation path is deterministic. The adapter should not be scattered through business logic. It should be a narrow write boundary with idempotency, structured error handling, and a kill switch. The old database remains the commit authority until the cutover gate. If the target write fails before cutover, the system can retry or enqueue repair because the source still owns truth. If the source write fails, the request fails.
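
The failure policy described above can be sketched as a single adapter class. This is a hedged illustration: `DualWriteAdapter`, its callables, and the in-memory repair queue are hypothetical names, and the sketch assumes both stores deduplicate on the idempotency key.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Payload = Dict[str, object]
Writer = Callable[[str, Payload], None]  # (idempotency_key, payload)


@dataclass
class DualWriteAdapter:
    write_source: Writer
    write_target: Writer
    target_enabled: bool = True  # kill switch for the target leg
    repair_queue: List[Tuple[str, Payload, str]] = field(default_factory=list)

    def write(self, payload: Payload, idempotency_key: Optional[str] = None) -> str:
        key = idempotency_key or str(uuid.uuid4())

        # The source is the commit authority before cutover:
        # if this raises, the exception propagates and the request fails.
        self.write_source(key, payload)

        if not self.target_enabled:
            return key  # kill switch engaged: source-only writes

        try:
            self.write_target(key, payload)
        except Exception as exc:
            # The source already committed, so the request still succeeds.
            # Record the miss so reconciliation can replay it with the same key.
            self.repair_queue.append((key, payload, repr(exc)))
        return key
```

Ordering the source write first is deliberate: the only partial outcome is "source committed, target missing", which the repair queue can fix because the source still owns truth. A production repair queue would need to be durable rather than an in-memory list.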

Validation must go beyond “the table loaded.” Use layered checks: row counts, sampled checksums, domain invariants, referential integrity, read comparison on production-shaped queries, and reconciliation of recent writes by source position. The most useful checks are business invariants: every paid invoice has ledger entries, every active entitlement maps to a customer, every order state has a valid transition history.
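
The layering can be illustrated with three checks that run against both databases. The sketch below assumes hypothetical `invoices` and `ledger_entries` tables and uses `sqlite3` so that it runs anywhere. The function names are illustrative, not part of any tool.

```python
import hashlib
import sqlite3
from typing import Dict, List, Sequence


def row_count(db: sqlite3.Connection, table: str) -> int:
    # The table name is interpolated, so it must come from trusted code, not user input.
    return db.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def sample_checksum(db: sqlite3.Connection, ids: Sequence[int]) -> str:
    """Checksum a sample of invoices, read in primary-key order."""
    placeholders = ",".join("?" for _ in ids)
    rows = db.execute(
        "SELECT id, customer_id, status FROM invoices "
        f"WHERE id IN ({placeholders}) ORDER BY id",
        list(ids),
    ).fetchall()
    return hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()


def paid_invoices_without_ledger(db: sqlite3.Connection) -> List[int]:
    """Business invariant: every paid invoice has at least one ledger entry."""
    rows = db.execute(
        "SELECT i.id FROM invoices i "
        "LEFT JOIN ledger_entries l ON l.invoice_id = i.id "
        "WHERE i.status = 'paid' AND l.invoice_id IS NULL "
        "ORDER BY i.id"
    ).fetchall()
    return [r[0] for r in rows]


def validate(source: sqlite3.Connection, target: sqlite3.Connection,
             sample_ids: Sequence[int]) -> Dict[str, bool]:
    return {
        "row_counts_match":
            row_count(source, "invoices") == row_count(target, "invoices"),
        "sample_checksums_match":
            sample_checksum(source, sample_ids) == sample_checksum(target, sample_ids),
        "paid_invoices_have_ledger":
            not paid_invoices_without_ledger(target),
    }
```

A target that is missing one ledger entry passes the first two checks on `invoices` and fails only the invariant, which is the case that count-based validation cannot see.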

The write freeze is the shortest phase, but it is the most important. Freeze application writes, drain queues, stop scheduled jobs that mutate data, wait for CDC lag to reach zero, record the final source log position, run final validation, then flip reads and writes. If the system cannot tolerate a global freeze, freeze the migrating domain behind routing, feature flags, or partition ownership.
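
The freeze checklist can be expressed as a gate that returns the reasons it is not ready instead of a bare yes or no. This is a sketch under assumptions: the `FreezeSnapshot` fields are hypothetical and would be filled from whatever metrics the migration actually exposes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FreezeSnapshot:
    writes_frozen: bool
    queue_depth: int
    mutating_jobs_running: int
    cdc_lag_seconds: float
    failed_applies: int
    final_source_position: int    # recorded after writes stopped
    target_applied_position: int  # last source position applied to the target
    validation_passed: bool


def cutover_blockers(s: FreezeSnapshot) -> List[str]:
    """Return every reason the flip must not happen yet."""
    blockers: List[str] = []
    if not s.writes_frozen:
        blockers.append("application writes are not frozen")
    if s.queue_depth > 0:
        blockers.append(f"{s.queue_depth} queued messages not drained")
    if s.mutating_jobs_running > 0:
        blockers.append(f"{s.mutating_jobs_running} mutating jobs still running")
    if s.cdc_lag_seconds > 0:
        blockers.append(f"CDC lag is {s.cdc_lag_seconds}s")
    if s.failed_applies > 0:
        blockers.append(f"{s.failed_applies} failed CDC applies")
    if s.target_applied_position < s.final_source_position:
        blockers.append("target has not applied the final source position")
    if not s.validation_passed:
        blockers.append("final validation did not pass")
    return blockers


def ready_to_flip(s: FreezeSnapshot) -> bool:
    return not cutover_blockers(s)
```

Comparing positions as well as lag is a deliberate redundancy: a lag metric can read zero because the stream has stalled, while the applied position shows whether the last frozen write actually arrived.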

Rollback must be defined before the flip. Before target-only writes, rollback is simple: route traffic back to the source because the source remains authoritative. After target-only writes, rollback is no longer a switch; it is another migration. You either need reverse replication already proven, or you need to roll forward by repairing the target. Teams often say “we can roll back” when they only mean “we can redeploy the old application.” That is not database rollback.

In Practice

Context: AWS’s published migration guidance describes cutover strategies including offline migration, flash-cut migration, active/active database configuration, and incremental migration. Its DMS model commonly combines full load with CDC so that ongoing changes are tracked from a specific log sequence number during the initial copy, followed by continuous replication until the cutover window (AWS Prescriptive Guidance).

Action: The documented pattern is to capture the log position first, separate the initial load from ongoing change capture, monitor replication progress, and choose a cutover strategy based on acceptable downtime and write behavior. For application teams, that means the migration plan should expose replication lag and failed apply operations as release gates, not background metrics.

Result: The operational result is reduced downtime, but not zero responsibility. CDC narrows the freeze window; it does not remove the need for validation, schema compatibility, application quiescence, and a final ownership flip.

Learning: Treat CDC as a transport, not as correctness. Correctness comes from deterministic transformations, replayable writes, invariant checks, and a cutover gate that can say no.

Context: GitHub’s gh-ost is a public example of a migration tool designed around online MySQL schema change. Its repository describes it as a triggerless online schema migration tool that uses the binary log and supports controlled cutover behavior (GitHub gh-ost).

Action: The documented pattern is to create a shadow structure, stream changes from the database log, copy data incrementally while applying those changes concurrently, throttle work, and postpone the final cutover until the system is ready.

Result: That architecture makes the dangerous part explicit. The copy and catch-up phases can run while production continues, but the final rename or ownership switch is still a deliberate cutover step.

Learning: Online migration tools succeed because they isolate phases. They do not pretend the final switch is ordinary background work.

Context: Shopify has publicly described moving toward log-based CDC for capturing changes from its sharded MySQL monolith, emphasizing immutable append-only change capture rather than query-based extraction (Shopify Engineering).

Action: The documented pattern is to capture database changes from the log so downstream consumers can process a durable sequence of mutations.

Result: This supports more reliable propagation than periodically querying mutable tables, especially when many consumers need to react to changes.

Learning: A migration target should consume changes like a durable event stream where possible. Polling and ad hoc extracts are weaker foundations for cutover because they obscure ordering and missed updates.

Where It Breaks

Failure mode	Why it happens	Control
Silent divergence	Dual writes succeed on one side and fail on the other	Idempotency keys, retry queues, reconciliation
False validation confidence	Counts match but business state differs	Domain invariants and query comparison
CDC lag hides cutover risk	Backfill load or schema errors slow apply	Lag SLOs and failed-record gates
Rollback is fictional	Target accepts writes with no reverse path	Define rollback boundary before cutover
Freeze misses writers	Jobs, queues, admin tools, or batch systems keep mutating source	Write inventory and freeze enforcement
Schema drift breaks apply	DDL changes during migration are not mirrored	Migration change freeze and schema contract
Replayed events corrupt state	Updates are not idempotent or ordering-aware	Source positions and deterministic merge rules
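
The last control in the table, source positions with a deterministic merge rule, can be sketched as an apply function that ignores any event at or below the position already applied for that key. The `ChangeEvent` shape and the `TargetTable` class are assumptions for illustration. Real CDC events carry engine-specific positions such as an LSN or a binlog coordinate.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ChangeEvent:
    key: str
    position: int  # source log position, assumed to increase monotonically
    op: str        # "upsert" or "delete"
    value: Optional[dict] = None


class TargetTable:
    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}
        # Highest source position applied per key; kept after deletes as a tombstone.
        self.applied_position: Dict[str, int] = {}

    def apply(self, event: ChangeEvent) -> bool:
        """Apply an event once; return False for replays and stale events."""
        if event.position <= self.applied_position.get(event.key, -1):
            return False  # already applied, or older than what the target holds
        if event.op == "delete":
            self.rows.pop(event.key, None)
        elif event.op == "upsert":
            self.rows[event.key] = dict(event.value or {})
        else:
            raise ValueError(f"unknown operation: {event.op}")
        self.applied_position[event.key] = event.position
        return True
```

Keeping the position after a delete matters: without it, a replayed upsert from before the delete would resurrect the row.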

What to Do Next

Problem: The migration is not safe while ownership is ambiguous. Name the authoritative database for every phase and document when that changes.
Solution: Build the workflow around a correct timeline: capture log position, backfill, CDC catch-up, validation, freeze, cutover, and post-cutover monitoring. Keep dual writes behind one idempotent adapter.
Proof: Require gates for CDC lag, failed applies, invariant checks, sampled read comparison, queue drain, and final source log position. A cutover without these gates is a bet.
Action: Write the rollback plan before writing the migration script. If rollback after target-only writes requires reverse replication, prove it before cutover. Otherwise call the plan what it is: roll forward with repair.

Rajiv
