Schema Deployment Risk Checklist

The most dangerous moment in a schema deployment is not the migration itself — it is the 30 seconds before you run it when you think you understand the lock behavior but haven’t confirmed it. ALTER TABLE ADD COLUMN on a 2 GB table is instantaneous on PostgreSQL 11 and later. The same statement on PostgreSQL 10 can hold an ACCESS EXCLUSIVE lock for minutes. CREATE INDEX without CONCURRENTLY will block all writes on the table for the duration of the build. Understanding which statement takes which lock, and what the options are to avoid it, is table stakes for schema work on production databases.

Situation

Schema migrations in a running production system have three risk dimensions: lock duration, reversibility, and execution time. These are independent axes. A migration can be fast but irreversible (dropping a column). It can be slow but non-blocking (CREATE INDEX CONCURRENTLY). It can be fast, reversible, and still dangerous because the lock type is wrong for the traffic pattern.

Most teams have learned about CREATE INDEX CONCURRENTLY. Fewer have mapped out the full lock table for ALTER TABLE variants. The failure pattern is predictable: an engineer runs ALTER TABLE orders ADD COLUMN tax_id VARCHAR(32) NOT NULL DEFAULT '' on a table with 500 million rows, assumes it is fast because they have done it before on small tables, and discovers it is holding an ACCESS EXCLUSIVE lock while taking 12 minutes to backfill the default.

This checklist forces the assessment before the migration runs, not after it starts.

The Problem

When schema migrations fail, they usually do not corrupt data — they corrupt availability. A migration that holds an ACCESS EXCLUSIVE lock on a heavily trafficked table causes all incoming queries to queue. Once the connection pool saturates, the application begins dropping requests, triggering an escalating cascade of timeouts.

Signal	Where to see it	What it means
Application connection queuing after migration started	APM or `pg_stat_activity`	Migration holding ACCESS EXCLUSIVE lock — connections waiting
Migration running longer than expected	`pg_stat_activity` with `state = 'active'` and old `xact_start`	Table size or data backfill underestimated on staging
Replication lag spiking during migration	`pg_stat_replication` — `replay_lag` growing	Migration WAL volume causing replication to fall behind
Migration script fails with lock timeout	Application or migration tool error log	Lock acquisition timed out — another transaction holding the table
Rollback script unavailable	Migration tool history	Migration was run without a matching down migration

The traditional approach of “test it on staging” provides a false sense of security. A deployment that runs in two seconds on a 100 MB staging table can stall for twenty minutes on a 500 GB production table. Furthermore, if a migration blocks mid-execution due to lock contention or disk space limits, the lack of an immediate, tested rollback plan forces engineers to invent recovery strategies during an active incident.

How can a team systematically verify the lock behavior, execution duration, and reversibility of a schema migration before it ever touches production?

Core Concept

The solution is a structured evaluation that categorizes migrations by lock type, table size, and rollback complexity before execution.

Decision Tree

flowchart TD
    A[Schema migration planned] --> B{Requires ACCESS EXCLUSIVE lock?}
    B -->|no — CONCURRENTLY or ANALYZE| C[Safe to run anytime — proceed]
    B -->|yes| D{Table size greater than 1 GB?}
    D -->|yes| E{Online alternative available?}
    E -->|yes| F[Use online alternative — see options below]
    E -->|no| G[Schedule maintenance window]
    D -->|no — small table| H{Traffic pattern allows short lock?}
    H -->|yes| I[Run during low-traffic window]
    H -->|no| J[Use online alternative or maintenance window]
    F --> K{NOT NULL without default?}
    K -->|yes| L[3-step split — nullable then backfill then constraint]
    K -->|no| M[ADD COLUMN with DEFAULT on PG11 or later — instant]

What this diagram shows: A migration risk decision tree. The first branch identifies whether the operation requires ACCESS EXCLUSIVE lock. If so, table size determines whether an online alternative exists. The final branch handles NOT NULL without a default — which requires the three-step pattern: add as nullable, backfill, then add the constraint.

First Five Checks

Does the migration require ACCESS EXCLUSIVE lock? — the most important question to answer first:

-- Check the lock type for common DDL operations:
-- ACCESS EXCLUSIVE (blocks reads AND writes):
--   ALTER TABLE (most variants)
--   DROP TABLE, TRUNCATE, DROP INDEX
--   VACUUM FULL, CLUSTER
--
-- SHARE UPDATE EXCLUSIVE (allows reads and writes):
--   CREATE INDEX CONCURRENTLY
--   VACUUM, ANALYZE, CREATE STATISTICS
--
-- SHARE (allows reads, blocks writes):
--   CREATE INDEX (without CONCURRENTLY)

-- To confirm lock behavior during a migration, check what is waiting:
SELECT pid, relation::regclass, mode, granted
FROM pg_locks
WHERE NOT granted
ORDER BY pid;

If your migration uses ALTER TABLE on a large table, it will take ACCESS EXCLUSIVE. Period. Understand this before starting.

What is the table size? — execution time scales with table size for any migration that rewrites rows:

SELECT
  pg_size_pretty(pg_total_relation_size('orders'::regclass)) AS total_size,
  pg_size_pretty(pg_relation_size('orders'::regclass)) AS heap_size,
  pg_size_pretty(pg_indexes_size('orders'::regclass)) AS index_size,
  reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'orders';

For any migration that rewrites the heap (ADD COLUMN with default on PG10, changing column types, ADD CONSTRAINT), the lock duration is proportional to table size. A migration that runs in 3 seconds on a 100 MB staging table will run for 18 minutes on a 36 GB production table.

Is the migration reversible? — classify before running:

-- Check existing column definitions before adding or dropping
SELECT
  column_name,
  data_type,
  is_nullable,
  column_default
FROM information_schema.columns
WHERE table_name = 'orders'
ORDER BY ordinal_position;

Reversibility classification:

ADD COLUMN nullable — reversible: DROP COLUMN
ADD COLUMN NOT NULL DEFAULT value — reversible on PG11 and later: DROP COLUMN (PG11+ stores the default in catalog, no rewrite)
DROP COLUMN — irreversible: data is gone after vacuum runs
ALTER COLUMN TYPE — reversible in principle, but requires another full rewrite; plan carefully
CREATE INDEX — fully reversible: DROP INDEX
ADD CONSTRAINT CHECK — reversible: DROP CONSTRAINT, but adds a lock; use NOT VALID + VALIDATE CONSTRAINT split

Test the migration on a production-sized staging database — estimate true execution time:

# Time the migration on a copy of production data
psql -d staging_prod_copy -c "\timing" -c "ALTER TABLE orders ADD COLUMN archived_at timestamptz;"

# For longer migrations, use EXPLAIN to see what the operation will do before committing
BEGIN;
ALTER TABLE orders ADD COLUMN archived_at timestamptz;
-- Check pg_locks here to observe lock behavior
ROLLBACK;  -- abort to avoid actual change

Timing on staging with a production-sized dataset is the only reliable estimate. Factor-of-10 size differences between staging and production are common and explain most migration surprises.

Is the migration idempotent? — essential for safe retries:

-- Idempotent column addition
ALTER TABLE orders ADD COLUMN IF NOT EXISTS archived_at timestamptz;

-- Idempotent index creation
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_archived_at ON orders (archived_at);

-- Idempotent constraint addition
DO $$
BEGIN
  IF NOT EXISTS (
    SELECT 1 FROM pg_constraint
    WHERE conname = 'chk_orders_status'
    AND conrelid = 'orders'::regclass
  ) THEN
    ALTER TABLE orders ADD CONSTRAINT chk_orders_status
    CHECK (status IN ('pending', 'processing', 'shipped', 'cancelled'))
    NOT VALID;
  END IF;
END $$;

A migration that fails midway and cannot be safely retried creates recovery debt. IF NOT EXISTS guards on CREATE INDEX CONCURRENTLY and ADD COLUMN are the standard pattern.

Remediation Options

Option 1 — Lock-safe online alternatives

For the most common migration types, online alternatives avoid ACCESS EXCLUSIVE:

-- ADD INDEX: always use CONCURRENTLY on production
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);

-- ADD COLUMN with default (PostgreSQL 11 and later): instant, no table rewrite
-- PostgreSQL 11 and later stores the default in pg_attrdef, not in the heap
ALTER TABLE orders ADD COLUMN archived_at timestamptz DEFAULT NULL;

-- ADD NOT NULL constraint without default: 3-step split
-- Step 1: Add column as nullable
ALTER TABLE orders ADD COLUMN IF NOT EXISTS tax_id VARCHAR(32);

-- Step 2: Backfill in batches (do NOT do this in a single UPDATE)
DO $$
DECLARE
  batch_size INT := 10000;
  offset_val INT := 0;
  rows_updated INT;
BEGIN
  LOOP
    UPDATE orders
    SET tax_id = ''
    WHERE id IN (
      SELECT id FROM orders
      WHERE tax_id IS NULL
      LIMIT batch_size
    );
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    EXIT WHEN rows_updated = 0;
    PERFORM pg_sleep(0.1);  -- brief pause between batches
  END LOOP;
END $$;

-- Step 3: Add NOT NULL constraint (fast — validates only in PG12 and later)
ALTER TABLE orders ALTER COLUMN tax_id SET NOT NULL;
-- PG12 and later: uses a not-null marker in pg_attribute, not a CHECK constraint scan

Option 2 — Table rewrite with pg_repack

For bloated tables needing a full rewrite (e.g., removing a column after many deletes), pg_repack performs online table rebuilding without extended ACCESS EXCLUSIVE:

# Install pg_repack extension
CREATE EXTENSION pg_repack;

# Run repack online — rebuilds table without long lock
pg_repack -h localhost -U postgres -d mydb -t orders

# With specific columns (version 1.4.7 and later)
pg_repack -h localhost -U postgres -d mydb --table orders

pg_repack works by building a new table copy online, capturing changes via a trigger, then performing a fast swap at the end. The final swap takes a brief ACCESS EXCLUSIVE lock (usually under a second). Per the pg_repack documentation, it requires the table to have a primary key or a unique constraint.

Option 3 — Scheduled maintenance window with monitoring

When no online alternative exists — changing a column type, adding a foreign key that requires a full scan, or truncating a large table — execute during a maintenance window with active monitoring:

-- Set a lock timeout to abort if the migration waits too long for a lock
SET lock_timeout = '5s';

-- Set a statement timeout as a safety net
SET statement_timeout = '10min';

-- Run migration
ALTER TABLE orders ALTER COLUMN amount TYPE NUMERIC(12, 4);

-- Monitor from a second session during execution
SELECT pid, state, query, now() - xact_start AS duration
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;

A lock_timeout prevents the migration from queuing indefinitely behind a long-running transaction. If the migration cannot acquire its lock in 5 seconds, it aborts cleanly, allowing you to investigate what is holding the lock before retrying.

Rollback Plan

For every migration, have the rollback command written before running the forward migration:

-- Forward: add column
ALTER TABLE orders ADD COLUMN archived_at timestamptz;

-- Rollback: drop column
ALTER TABLE orders DROP COLUMN IF EXISTS archived_at;

-- Forward: create index
CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);

-- Rollback: drop index
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_status;

-- Forward: add constraint (using NOT VALID to avoid full scan)
ALTER TABLE orders ADD CONSTRAINT chk_orders_positive_amount
  CHECK (amount > 0) NOT VALID;

-- Validate separately (allows reads and writes during validation)
ALTER TABLE orders VALIDATE CONSTRAINT chk_orders_positive_amount;

-- Rollback: drop constraint
ALTER TABLE orders DROP CONSTRAINT IF EXISTS chk_orders_positive_amount;

For migrations that are irreversible at the data level (DROP COLUMN, TRUNCATE), the rollback plan is: restore from backup. This should be documented explicitly in the migration, and the backup should be confirmed current before running.

Automation Opportunity

A pre-migration risk assessment script that runs before any ALTER TABLE in your CI pipeline catches most issues automatically:

#!/bin/bash
# Check if a migration will require a table rewrite on a large table
TABLE_SIZE=$(psql -tAc "SELECT pg_relation_size('${TABLE_NAME}'::regclass)")
TABLE_SIZE_GB=$(echo "scale=2; ${TABLE_SIZE}/1073741824" | bc)

if (( $(echo "$TABLE_SIZE_GB > 1" | bc -l) )); then
  echo "WARNING: Table ${TABLE_NAME} is ${TABLE_SIZE_GB}GB"
  echo "Verify migration is CONCURRENTLY-safe or schedule maintenance window"
  exit 1
fi

For teams using schema migration tools (Flyway, Liquibase, golang-migrate), pre-migration hooks that run the size check and lock-type classification against the target SQL are the standard pattern.

Schema Deployment Checklist

Identify the SQL statement and its lock type — ACCESS EXCLUSIVE, SHARE, or SHARE UPDATE EXCLUSIVE
Query pg_total_relation_size for the target table — flag if greater than 1 GB
Determine if the migration is reversible — write the rollback SQL before running the forward migration
Test execution time on a production-sized staging database with \timing
Confirm the migration is idempotent — add IF NOT EXISTS and IF EXISTS guards where applicable
Determine if an online alternative exists — CONCURRENTLY index, PG11+ ADD COLUMN, 3-step NOT NULL
For ACCESS EXCLUSIVE on large tables — schedule a maintenance window or use the online alternative
Set lock_timeout = '5s' and statement_timeout before running any blocking migration
Confirm a current backup exists before running any irreversible migration (DROP COLUMN, TRUNCATE)
Monitor pg_stat_activity for lock contention during the migration window from a second session
Verify replication lag does not spike during migration — check pg_stat_replication
After migration completes, run EXPLAIN (ANALYZE) on the primary affected queries to confirm plan is correct

In Practice

The PostgreSQL documentation for ADD COLUMN explicitly describes the behavioral change in PostgreSQL 11: prior to version 11, ADD COLUMN with a DEFAULT clause required a full table rewrite to store the default in every existing row. PostgreSQL 11 introduced storage of the default in pg_attrdef, allowing ADD COLUMN ... DEFAULT to complete in milliseconds regardless of table size — the default is applied on read for existing rows, not during the migration. This behavior is documented in the PostgreSQL 11 release notes.

The documentation for CREATE INDEX CONCURRENTLY documents its two-pass scan approach: it makes two passes over the table — one to build the initial index, one to incorporate concurrent changes — before marking the index valid. This means it takes longer than non-concurrent index creation, but it never holds an ACCESS EXCLUSIVE lock. The tradeoff is explicit in the documentation: “the table is not locked against writes for an extended period of time, but the build takes longer.”

Where It Breaks

Failure mode	Trigger	Fix
`CREATE INDEX CONCURRENTLY` leaves invalid index	Transaction conflict or cancellation during build	Drop the invalid index; recreate with CONCURRENTLY
`NOT VALID` constraint skips existing data violations	Backfill was incomplete before constraint was added	Run `VALIDATE CONSTRAINT` to enforce on all rows; fix violations first
3-step NOT NULL breaks if backfill is skipped	Developer runs step 1 and step 3 without step 2	Enforce step ordering in migration tooling; use explicit progress markers
`lock_timeout` causes migration abort	Another long transaction holds an incompatible lock	Identify and wait for blocking transaction; retry migration with longer timeout
`pg_repack` fails on table with no primary key	Table uses composite key or has no unique identifier	Add a surrogate primary key first, or use a maintenance window rewrite

What This Post Does Not Cover

This checklist covers schema migration risk for PostgreSQL and MySQL. It does not cover: migration tooling comparisons (Flyway vs Liquibase vs sqitch), zero-downtime application deployment patterns when schema and code changes must roll out together, MongoDB schema validation evolution, or database-level encryption key rotation during schema changes. Each of those is a separate decision area.

What to Do Next

Problem: Schema migrations that appear safe on small staging tables can hold ACCESS EXCLUSIVE locks for minutes on large production tables, queuing and dropping connections until they complete or are killed.
Solution: Classify every migration by lock type and table size before running it; use CREATE INDEX CONCURRENTLY and the 3-step NOT NULL split for large tables; and always have the rollback command written before the forward migration runs.
Proof: After implementing CONCURRENTLY and deferred NOT NULL patterns, migration deployments should complete with zero connection queuing — observable in pg_stat_activity showing no waiting state during the migration window.
Action: Pick one upcoming schema migration and run through this checklist before executing it. If it requires ACCESS EXCLUSIVE on a table over 1 GB, find the online alternative or schedule the maintenance window before the deployment date.