Schema Deployment Risk Checklist
The most dangerous moment in a schema deployment is not the migration itself — it is the 30 seconds before you run it when you think you understand the lock behavior but haven’t confirmed it. ALTER TABLE ADD COLUMN on a 2 GB table is instantaneous on PostgreSQL 11 and later. The same statement on PostgreSQL 10 can hold an ACCESS EXCLUSIVE lock for minutes. CREATE INDEX without CONCURRENTLY will block all writes on the table for the duration of the build. Understanding which statement takes which lock, and what the options are to avoid it, is table stakes for schema work on production databases.
Situation
Schema migrations in a running production system have three risk dimensions: lock duration, reversibility, and execution time. These are independent axes. A migration can be fast but irreversible (dropping a column). It can be slow but non-blocking (CREATE INDEX CONCURRENTLY). It can be fast, reversible, and still dangerous because the lock type is wrong for the traffic pattern.
Most teams have learned about CREATE INDEX CONCURRENTLY. Fewer have mapped out the full lock table for ALTER TABLE variants. The failure pattern is predictable: an engineer runs ALTER TABLE orders ADD COLUMN tax_id VARCHAR(32) NOT NULL DEFAULT '' on a table with 500 million rows, assumes it is fast because they have done it before on small tables, and discovers it is holding an ACCESS EXCLUSIVE lock while taking 12 minutes to backfill the default.
This checklist forces the assessment before the migration runs, not after it starts.
The Problem
When schema migrations fail, they usually do not corrupt data — they corrupt availability. A migration that holds an ACCESS EXCLUSIVE lock on a heavily trafficked table causes all incoming queries to queue. Once the connection pool saturates, the application begins dropping requests, triggering an escalating cascade of timeouts.
| Signal | Where to see it | What it means |
|---|---|---|
| Application connection queuing after migration started | APM or pg_stat_activity | Migration holding ACCESS EXCLUSIVE lock — connections waiting |
| Migration running longer than expected | pg_stat_activity with state = 'active' and old xact_start | Table size or data backfill underestimated on staging |
| Replication lag spiking during migration | pg_stat_replication — replay_lag growing | Migration WAL volume causing replication to fall behind |
| Migration script fails with lock timeout | Application or migration tool error log | Lock acquisition timed out — another transaction holding the table |
| Rollback script unavailable | Migration tool history | Migration was run without a matching down migration |
The traditional approach of “test it on staging” provides a false sense of security. A deployment that runs in two seconds on a 100 MB staging table can stall for twenty minutes on a 500 GB production table. Furthermore, if a migration blocks mid-execution due to lock contention or disk space limits, the lack of an immediate, tested rollback plan forces engineers to invent recovery strategies during an active incident.
How can a team systematically verify the lock behavior, execution duration, and reversibility of a schema migration before it ever touches production?
Core Concept
The solution is a structured evaluation that categorizes migrations by lock type, table size, and rollback complexity before execution.
Decision Tree
flowchart TD
A[Schema migration planned] --> B{Requires ACCESS EXCLUSIVE lock?}
B -->|no — CONCURRENTLY or ANALYZE| C[Safe to run anytime — proceed]
B -->|yes| D{Table size greater than 1 GB?}
D -->|yes| E{Online alternative available?}
E -->|yes| F[Use online alternative — see options below]
E -->|no| G[Schedule maintenance window]
D -->|no — small table| H{Traffic pattern allows short lock?}
H -->|yes| I[Run during low-traffic window]
H -->|no| J[Use online alternative or maintenance window]
F --> K{NOT NULL without default?}
K -->|yes| L[3-step split — nullable then backfill then constraint]
K -->|no| M[ADD COLUMN with DEFAULT on PG11 or later — instant]
What this diagram shows: A migration risk decision tree. The first branch identifies whether the operation requires ACCESS EXCLUSIVE lock. If so, table size determines whether an online alternative exists. The final branch handles NOT NULL without a default — which requires the three-step pattern: add as nullable, backfill, then add the constraint.
First Five Checks
- Does the migration require ACCESS EXCLUSIVE lock? — the most important question to answer first:
-- Check the lock type for common DDL operations:
-- ACCESS EXCLUSIVE (blocks reads AND writes):
-- ALTER TABLE (most variants)
-- DROP TABLE, TRUNCATE, DROP INDEX
-- VACUUM FULL, CLUSTER
--
-- SHARE UPDATE EXCLUSIVE (allows reads and writes):
-- CREATE INDEX CONCURRENTLY
-- VACUUM, ANALYZE, CREATE STATISTICS
--
-- SHARE (allows reads, blocks writes):
-- CREATE INDEX (without CONCURRENTLY)
-- To confirm lock behavior during a migration, check what is waiting:
SELECT pid, relation::regclass, mode, granted
FROM pg_locks
WHERE NOT granted
ORDER BY pid;
If your migration uses ALTER TABLE on a large table, it will take ACCESS EXCLUSIVE. Period. Understand this before starting.
- What is the table size? — execution time scales with table size for any migration that rewrites rows:
SELECT
pg_size_pretty(pg_total_relation_size('orders'::regclass)) AS total_size,
pg_size_pretty(pg_relation_size('orders'::regclass)) AS heap_size,
pg_size_pretty(pg_indexes_size('orders'::regclass)) AS index_size,
reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'orders';
For any migration that rewrites the heap (ADD COLUMN with default on PG10, changing column types, ADD CONSTRAINT), the lock duration is proportional to table size. A migration that runs in 3 seconds on a 100 MB staging table will run for 18 minutes on a 36 GB production table.
- Is the migration reversible? — classify before running:
-- Check existing column definitions before adding or dropping
SELECT
column_name,
data_type,
is_nullable,
column_default
FROM information_schema.columns
WHERE table_name = 'orders'
ORDER BY ordinal_position;
Reversibility classification:
ADD COLUMN nullable— reversible:DROP COLUMNADD COLUMN NOT NULL DEFAULT value— reversible on PG11 and later:DROP COLUMN(PG11+ stores the default in catalog, no rewrite)DROP COLUMN— irreversible: data is gone after vacuum runsALTER COLUMN TYPE— reversible in principle, but requires another full rewrite; plan carefullyCREATE INDEX— fully reversible:DROP INDEXADD CONSTRAINT CHECK— reversible:DROP CONSTRAINT, but adds a lock; useNOT VALID+VALIDATE CONSTRAINTsplit
- Test the migration on a production-sized staging database — estimate true execution time:
# Time the migration on a copy of production data
psql -d staging_prod_copy -c "\timing" -c "ALTER TABLE orders ADD COLUMN archived_at timestamptz;"
# For longer migrations, use EXPLAIN to see what the operation will do before committing
BEGIN;
ALTER TABLE orders ADD COLUMN archived_at timestamptz;
-- Check pg_locks here to observe lock behavior
ROLLBACK; -- abort to avoid actual change
Timing on staging with a production-sized dataset is the only reliable estimate. Factor-of-10 size differences between staging and production are common and explain most migration surprises.
- Is the migration idempotent? — essential for safe retries:
-- Idempotent column addition
ALTER TABLE orders ADD COLUMN IF NOT EXISTS archived_at timestamptz;
-- Idempotent index creation
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_archived_at ON orders (archived_at);
-- Idempotent constraint addition
DO $$
BEGIN
IF NOT EXISTS (
SELECT 1 FROM pg_constraint
WHERE conname = 'chk_orders_status'
AND conrelid = 'orders'::regclass
) THEN
ALTER TABLE orders ADD CONSTRAINT chk_orders_status
CHECK (status IN ('pending', 'processing', 'shipped', 'cancelled'))
NOT VALID;
END IF;
END $$;
A migration that fails midway and cannot be safely retried creates recovery debt. IF NOT EXISTS guards on CREATE INDEX CONCURRENTLY and ADD COLUMN are the standard pattern.
Remediation Options
Option 1 — Lock-safe online alternatives
For the most common migration types, online alternatives avoid ACCESS EXCLUSIVE:
-- ADD INDEX: always use CONCURRENTLY on production
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);
-- ADD COLUMN with default (PostgreSQL 11 and later): instant, no table rewrite
-- PostgreSQL 11 and later stores the default in pg_attrdef, not in the heap
ALTER TABLE orders ADD COLUMN archived_at timestamptz DEFAULT NULL;
-- ADD NOT NULL constraint without default: 3-step split
-- Step 1: Add column as nullable
ALTER TABLE orders ADD COLUMN IF NOT EXISTS tax_id VARCHAR(32);
-- Step 2: Backfill in batches (do NOT do this in a single UPDATE)
DO $$
DECLARE
batch_size INT := 10000;
offset_val INT := 0;
rows_updated INT;
BEGIN
LOOP
UPDATE orders
SET tax_id = ''
WHERE id IN (
SELECT id FROM orders
WHERE tax_id IS NULL
LIMIT batch_size
);
GET DIAGNOSTICS rows_updated = ROW_COUNT;
EXIT WHEN rows_updated = 0;
PERFORM pg_sleep(0.1); -- brief pause between batches
END LOOP;
END $$;
-- Step 3: Add NOT NULL constraint (fast — validates only in PG12 and later)
ALTER TABLE orders ALTER COLUMN tax_id SET NOT NULL;
-- PG12 and later: uses a not-null marker in pg_attribute, not a CHECK constraint scan
Option 2 — Table rewrite with pg_repack
For bloated tables needing a full rewrite (e.g., removing a column after many deletes), pg_repack performs online table rebuilding without extended ACCESS EXCLUSIVE:
# Install pg_repack extension
CREATE EXTENSION pg_repack;
# Run repack online — rebuilds table without long lock
pg_repack -h localhost -U postgres -d mydb -t orders
# With specific columns (version 1.4.7 and later)
pg_repack -h localhost -U postgres -d mydb --table orders
pg_repack works by building a new table copy online, capturing changes via a trigger, then performing a fast swap at the end. The final swap takes a brief ACCESS EXCLUSIVE lock (usually under a second). Per the pg_repack documentation, it requires the table to have a primary key or a unique constraint.
Option 3 — Scheduled maintenance window with monitoring
When no online alternative exists — changing a column type, adding a foreign key that requires a full scan, or truncating a large table — execute during a maintenance window with active monitoring:
-- Set a lock timeout to abort if the migration waits too long for a lock
SET lock_timeout = '5s';
-- Set a statement timeout as a safety net
SET statement_timeout = '10min';
-- Run migration
ALTER TABLE orders ALTER COLUMN amount TYPE NUMERIC(12, 4);
-- Monitor from a second session during execution
SELECT pid, state, query, now() - xact_start AS duration
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
A lock_timeout prevents the migration from queuing indefinitely behind a long-running transaction. If the migration cannot acquire its lock in 5 seconds, it aborts cleanly, allowing you to investigate what is holding the lock before retrying.
Rollback Plan
For every migration, have the rollback command written before running the forward migration:
-- Forward: add column
ALTER TABLE orders ADD COLUMN archived_at timestamptz;
-- Rollback: drop column
ALTER TABLE orders DROP COLUMN IF EXISTS archived_at;
-- Forward: create index
CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);
-- Rollback: drop index
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_status;
-- Forward: add constraint (using NOT VALID to avoid full scan)
ALTER TABLE orders ADD CONSTRAINT chk_orders_positive_amount
CHECK (amount > 0) NOT VALID;
-- Validate separately (allows reads and writes during validation)
ALTER TABLE orders VALIDATE CONSTRAINT chk_orders_positive_amount;
-- Rollback: drop constraint
ALTER TABLE orders DROP CONSTRAINT IF EXISTS chk_orders_positive_amount;
For migrations that are irreversible at the data level (DROP COLUMN, TRUNCATE), the rollback plan is: restore from backup. This should be documented explicitly in the migration, and the backup should be confirmed current before running.
Automation Opportunity
A pre-migration risk assessment script that runs before any ALTER TABLE in your CI pipeline catches most issues automatically:
#!/bin/bash
# Check if a migration will require a table rewrite on a large table
TABLE_SIZE=$(psql -tAc "SELECT pg_relation_size('${TABLE_NAME}'::regclass)")
TABLE_SIZE_GB=$(echo "scale=2; ${TABLE_SIZE}/1073741824" | bc)
if (( $(echo "$TABLE_SIZE_GB > 1" | bc -l) )); then
echo "WARNING: Table ${TABLE_NAME} is ${TABLE_SIZE_GB}GB"
echo "Verify migration is CONCURRENTLY-safe or schedule maintenance window"
exit 1
fi
For teams using schema migration tools (Flyway, Liquibase, golang-migrate), pre-migration hooks that run the size check and lock-type classification against the target SQL are the standard pattern.
Schema Deployment Checklist
- Identify the SQL statement and its lock type — ACCESS EXCLUSIVE, SHARE, or SHARE UPDATE EXCLUSIVE
- Query
pg_total_relation_sizefor the target table — flag if greater than 1 GB - Determine if the migration is reversible — write the rollback SQL before running the forward migration
- Test execution time on a production-sized staging database with
\timing - Confirm the migration is idempotent — add
IF NOT EXISTSandIF EXISTSguards where applicable - Determine if an online alternative exists —
CONCURRENTLYindex, PG11+ ADD COLUMN, 3-step NOT NULL - For ACCESS EXCLUSIVE on large tables — schedule a maintenance window or use the online alternative
- Set
lock_timeout = '5s'andstatement_timeoutbefore running any blocking migration - Confirm a current backup exists before running any irreversible migration (DROP COLUMN, TRUNCATE)
- Monitor
pg_stat_activityfor lock contention during the migration window from a second session - Verify replication lag does not spike during migration — check
pg_stat_replication - After migration completes, run
EXPLAIN (ANALYZE)on the primary affected queries to confirm plan is correct
In Practice
The PostgreSQL documentation for ADD COLUMN explicitly describes the behavioral change in PostgreSQL 11: prior to version 11, ADD COLUMN with a DEFAULT clause required a full table rewrite to store the default in every existing row. PostgreSQL 11 introduced storage of the default in pg_attrdef, allowing ADD COLUMN ... DEFAULT to complete in milliseconds regardless of table size — the default is applied on read for existing rows, not during the migration. This behavior is documented in the PostgreSQL 11 release notes.
The documentation for CREATE INDEX CONCURRENTLY documents its two-pass scan approach: it makes two passes over the table — one to build the initial index, one to incorporate concurrent changes — before marking the index valid. This means it takes longer than non-concurrent index creation, but it never holds an ACCESS EXCLUSIVE lock. The tradeoff is explicit in the documentation: “the table is not locked against writes for an extended period of time, but the build takes longer.”
Where It Breaks
| Failure mode | Trigger | Fix |
|---|---|---|
CREATE INDEX CONCURRENTLY leaves invalid index | Transaction conflict or cancellation during build | Drop the invalid index; recreate with CONCURRENTLY |
NOT VALID constraint skips existing data violations | Backfill was incomplete before constraint was added | Run VALIDATE CONSTRAINT to enforce on all rows; fix violations first |
| 3-step NOT NULL breaks if backfill is skipped | Developer runs step 1 and step 3 without step 2 | Enforce step ordering in migration tooling; use explicit progress markers |
lock_timeout causes migration abort | Another long transaction holds an incompatible lock | Identify and wait for blocking transaction; retry migration with longer timeout |
pg_repack fails on table with no primary key | Table uses composite key or has no unique identifier | Add a surrogate primary key first, or use a maintenance window rewrite |
What This Post Does Not Cover
This checklist covers schema migration risk for PostgreSQL and MySQL. It does not cover: migration tooling comparisons (Flyway vs Liquibase vs sqitch), zero-downtime application deployment patterns when schema and code changes must roll out together, MongoDB schema validation evolution, or database-level encryption key rotation during schema changes. Each of those is a separate decision area.
What to Do Next
- Problem: Schema migrations that appear safe on small staging tables can hold ACCESS EXCLUSIVE locks for minutes on large production tables, queuing and dropping connections until they complete or are killed.
- Solution: Classify every migration by lock type and table size before running it; use
CREATE INDEX CONCURRENTLYand the 3-step NOT NULL split for large tables; and always have the rollback command written before the forward migration runs. - Proof: After implementing
CONCURRENTLYand deferred NOT NULL patterns, migration deployments should complete with zero connection queuing — observable inpg_stat_activityshowing no waiting state during the migration window. - Action: Pick one upcoming schema migration and run through this checklist before executing it. If it requires ACCESS EXCLUSIVE on a table over 1 GB, find the online alternative or schedule the maintenance window before the deployment date.
Interactive tools for this topic