Automation Rollback Playbook: Disable, Revert, Repair State, and Reconcile Reality
Rollback is not one action. In an automated platform, rollback is a sequence: stop the machine, reverse the change, repair the control state, and prove that production matches the story your tools now tell.
Situation
Modern delivery systems are not just deployment scripts. They are standing control planes.
A merge to main can trigger CI, publish an artifact, update an environment, apply infrastructure, rotate configuration, invalidate caches, and notify downstream systems. The platform team usually sees this as maturity: fewer handoffs, fewer tickets, tighter feedback loops, and less operational waiting.
That model works while the automation is correct. It becomes dangerous when the automation is still running after the team has decided the change is bad.
The old rollback model assumed an operator could undo the last step. The new model has to assume the pipeline may keep creating new steps while the incident is in progress. A failed deploy might not be the only problem. A reconciliation loop might reapply the failed version. A CI workflow might publish a second bad artifact. An infrastructure apply might stop partway and leave the state file describing a resource in a shape that no longer matches reality.
The playbook must therefore treat rollback as control-system recovery, not merely code recovery.
The Problem
Most rollback procedures start too late. They begin with “revert the commit” or “roll back the deployment,” which is necessary but incomplete.
If the automation remains enabled, the revert can race the same machinery that caused the failure. For example, if an operator manually reverts a workload with kubectl rollout undo while a GitOps controller such as Flux or Argo CD is still syncing that workload, the controller will detect the drift on its next reconciliation and restore the broken revision that Git still declares. If the state store is wrong, the next infrastructure plan can destroy the wrong object or recreate something that already exists. If the team only checks the deployment object, it can miss external reality: queues still draining bad messages, caches containing invalid data, feature flags still pointing users into broken paths, or infrastructure bindings still attached to the wrong resource.
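The GitOps race described above can be shown with a toy model. This is a minimal sketch, not a real controller: the Cluster and GitOpsController classes, the suspended flag, and the image names are all hypothetical stand-ins for whatever the platform actually runs.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    image: str  # image the workload is currently running


@dataclass
class GitOpsController:
    desired_image: str       # what the Git commit declares
    suspended: bool = False  # stands in for the emergency brake

    def reconcile(self, cluster: Cluster) -> bool:
        """Re-apply the declared image; return True if the cluster was changed."""
        if self.suspended or cluster.image == self.desired_image:
            return False
        cluster.image = self.desired_image
        return True


# Git still points at the broken release.
controller = GitOpsController(desired_image="app:2.0-broken")
cluster = Cluster(image="app:2.0-broken")

# Manual runtime rollback while the controller is still active:
# the next reconciliation overwrites it.
cluster.image = "app:1.9"
overwritten = controller.reconcile(cluster)

# Same rollback with the controller suspended first: the rollback holds.
controller.suspended = True
cluster.image = "app:1.9"
held = not controller.reconcile(cluster)
```

The only difference between the two attempts is the order of operations, which is why disabling the writer has to come before the manual rollback.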
Automation failures also produce two timelines. Git has one timeline. Production has another. The CI system, deployment controller, infrastructure state file, cloud provider, database migrations, and customer-visible behavior may each have a different view of what happened.
The question is not “how do we undo the change?” The better question is: what order lets us regain control before we attempt repair?
Core Concept
A reliable rollback playbook has four phases: disable, revert, repair state, and reconcile reality.
flowchart TD
A[Incident trigger — automation suspected] --> B[Disable automation — stop new writes]
B --> C[Freeze inputs — protect deploy branch]
C --> D[Revert change — create explicit inverse commit]
D --> E[Roll back runtime — restore known workload revision]
E --> F[Repair state — align controller memory]
F --> G[Reconcile reality — compare declared and observed]
G --> H[Restart automation — guarded and observable]
G --> I[Escalate repair — manual owner review]
Disable comes first because it changes the system from active to bounded. This can mean disabling a CI workflow, pausing a deployment controller, locking an environment, freezing a branch, disabling scheduled jobs, or turning off a feature flag writer. The exact mechanism depends on the platform, but the goal is the same: no new automated writes while humans are repairing the failed one.
Revert should be explicit, reviewable, and forward-moving. In Git, revert records a new commit that reverses a prior commit rather than rewriting shared history. That matters during incidents because the audit trail is part of the recovery artifact. A rollback commit should name the production symptom, the reverted change, the expected runtime effect, and the verification owner.
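One way to make those four fields hard to forget is to generate the rollback commit message from a small record. The sketch below is illustrative only: the RollbackCommit class, its field names, and the incident_id field are assumptions, not part of Git. The two commands it builds use git revert --no-commit to stage the inverse change and git commit -m to record it with the structured message.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackCommit:
    reverted_sha: str
    symptom: str             # production symptom that triggered the rollback
    expected_effect: str     # what should change at runtime once this lands
    verification_owner: str  # who confirms the effect
    incident_id: str         # hypothetical incident tracker reference

    def message(self) -> str:
        subject = f"Revert {self.reverted_sha[:12]}: {self.symptom}"
        body = [
            f"Incident: {self.incident_id}",
            f"Reverts: {self.reverted_sha}",
            f"Expected runtime effect: {self.expected_effect}",
            f"Verification owner: {self.verification_owner}",
        ]
        return subject + "\n\n" + "\n".join(body)

    def git_commands(self) -> list[list[str]]:
        # --no-commit stages the inverse change so the commit carries our message.
        return [
            ["git", "revert", "--no-commit", self.reverted_sha],
            ["git", "commit", "-m", self.message()],
        ]
```

The commands are returned as argument lists rather than executed, so the same record can feed a reviewed script, a pull request template, or a dry run.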
Repair state is the phase teams skip until it hurts. Infrastructure and deployment tools maintain memory. Terraform state binds configuration addresses to remote objects. Kubernetes deployment history binds revisions to ReplicaSets. CI systems bind workflow runs to artifacts and environments. If those memories disagree with actual resources, a clean Git revert can still leave the platform unsafe.
Reconcile reality means checking the external system, not just the control plane. The source repository may say the old version is restored. The deployment API may say the rollout is complete. Neither proves that the load balancer sends traffic to the expected pods, the database schema matches the application, the queue has stopped amplifying bad work, or the next automation run will be harmless.
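At its simplest, reconciling reality is a comparison between what the tools declare and what probes of the live system observe. The following sketch assumes both views have already been collected into flat dictionaries. The key names and values are hypothetical, and gathering the observed side (load balancer targets, schema version, flag values) is the hard part that this example leaves out.

```python
from __future__ import annotations


def find_drift(
    declared: dict[str, str], observed: dict[str, str]
) -> dict[str, tuple[str | None, str | None]]:
    """Return every key whose declared and observed values differ.

    A value of None means the key is missing from that side entirely.
    """
    drift = {}
    for key in sorted(declared.keys() | observed.keys()):
        want = declared.get(key)
        have = observed.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical snapshot taken after the revert and the runtime rollback.
declared = {
    "deployment.image": "web:1.9",
    "db.schema_version": "41",
    "flag.new_checkout": "off",
}
observed = {
    "deployment.image": "web:1.9",
    "db.schema_version": "42",    # migration from the bad release is still applied
    "flag.new_checkout": "on",    # flag writer was never rolled back
    "queue.retry_backlog": "1800",  # nothing in Git describes this at all
}

drift = find_drift(declared, observed)
```

Keys that appear on only one side are reported too, because state that no tool declares is exactly the kind of drift a Git-only check misses.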
The final restart should be staged. Re-enable automation only after a dry run, plan, diff, or no-op deploy proves the controller is not about to recreate the incident.
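The ordering of the phases can be enforced rather than remembered. The sketch below is one possible shape for such a gate, with hypothetical names throughout (Phase, RollbackRun, may_restart_automation): each phase can only be marked complete with a piece of evidence and only after every earlier phase has been completed.

```python
from __future__ import annotations

from enum import Enum


class Phase(Enum):
    DISABLE = 1
    REVERT = 2
    REPAIR_STATE = 3
    RECONCILE = 4
    RESTART = 5


class RollbackRun:
    """Tracks which phases are done and refuses out-of-order completion."""

    def __init__(self) -> None:
        self.evidence: dict[Phase, str] = {}

    def complete(self, phase: Phase, evidence: str) -> None:
        if not evidence.strip():
            raise ValueError(f"{phase.name} needs evidence")
        missing = [
            p.name for p in Phase
            if p.value < phase.value and p not in self.evidence
        ]
        if missing:
            raise RuntimeError(f"cannot complete {phase.name}; missing {missing}")
        self.evidence[phase] = evidence

    def may_restart_automation(self) -> bool:
        # Restart is allowed only once every earlier phase has evidence.
        return all(p in self.evidence for p in Phase if p is not Phase.RESTART)
```

In use, an attempt to record the revert before the disable step raises an error, and may_restart_automation() stays false until the reconcile phase has evidence attached, such as a link to a no-op plan.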
In Practice
Context: GitHub documents that Actions workflows can be disabled and enabled through the UI, REST API, or CLI. That is not just an administrative convenience; it is the first rollback primitive for a platform where merges, schedules, and manual dispatches can trigger more writes. The documented pattern is to stop the workflow before assuming the repository is stable again: GitHub Actions workflow disablement.
Action: During a rollback, disable the workflow or environment path that can deploy, publish, or mutate state. Then protect the branch or environment so the revert is the only authorized write.
Result: The rollback becomes bounded. Operators are no longer debugging a moving target where a scheduled workflow can produce a second artifact or redeploy the failed revision.
Learning: Automation must have an emergency brake that is separate from the normal delivery path. A rollback button that depends on the broken pipeline is not a rollback plan.
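As a sketch of what that brake can look like outside the normal delivery path, the snippet below builds the GitHub REST call that disables a workflow (PUT /repos/{owner}/{repo}/actions/workflows/{workflow_id}/disable, where the workflow can be given by ID or file name). The owner, repository, workflow file, and token values are placeholders, and the helper names are assumptions. The GitHub CLI equivalent is gh workflow disable followed by the workflow name.

```python
import urllib.request

API = "https://api.github.com"


def build_disable_request(
    owner: str, repo: str, workflow: str, token: str
) -> urllib.request.Request:
    """Build (but do not send) the request that disables one workflow."""
    url = f"{API}/repos/{owner}/{repo}/actions/workflows/{workflow}/disable"
    return urllib.request.Request(
        url,
        method="PUT",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


def disable_workflow(request: urllib.request.Request) -> bool:
    """Send the request; GitHub answers 204 No Content on success."""
    with urllib.request.urlopen(request) as response:
        return response.status == 204


# Placeholder values; substitute the real repository and a scoped token.
request = build_disable_request(
    "example-org", "example-repo", "deploy.yml", "<token>"
)
```

Building the request separately from sending it keeps the brake easy to test and easy to run from an operator laptop when the pipeline itself is the thing that is broken.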
Context: Git defines git revert as an operation that applies inverse changes and records them as new commits, preserving shared history instead of moving it. That behavior is well suited to incident recovery because the rollback itself becomes reviewable history. The documented pattern is to issue explicit revert commits rather than rewriting history during an incident: Git revert documentation.
Action: Prefer revert commits over force-pushing history on shared release branches. Link the rollback commit to the incident and to the verification evidence.
Result: The team can audit what was undone, who approved it, and when the system moved from mitigation to repair.
Learning: Rollback is production change management. Treat the inverse commit with the same rigor as the original change.
Context: Kubernetes Deployments keep a rollout history, and kubectl rollout undo returns a Deployment to its previous revision, or to a specific one with --to-revision. The Deployment controller then scales the corresponding ReplicaSets to match. The documented pattern is to mitigate runtime impact quickly by rolling the Deployment back to an earlier revision: Kubernetes Deployments and kubectl rollout undo.
Action: Use workload rollback to restore a known runtime revision, then verify pods, readiness, traffic routing, and application health. Do not stop at the deployment status.
Result: The runtime can recover faster than the repository or infrastructure layers, which buys time for deeper state repair.
Learning: Runtime rollback is mitigation, not closure. It reduces impact while the platform state catches up.
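The first of those checks can be automated against the Deployment object itself. This sketch assumes the dictionary comes from kubectl get deployment with -o json and reads only standard Deployment fields. The function names are hypothetical, and a passing result says nothing about traffic routing or application health, which still need their own probes.

```python
from __future__ import annotations


def rollout_is_settled(deployment: dict) -> bool:
    """True when the controller has observed the latest spec and every
    desired replica is updated and available, with no old pods left."""
    meta = deployment.get("metadata", {})
    spec = deployment.get("spec", {})
    status = deployment.get("status", {})
    desired = spec.get("replicas", 1)
    return (
        status.get("observedGeneration", 0) >= meta.get("generation", 0)
        and status.get("updatedReplicas", 0) == desired
        and status.get("availableReplicas", 0) == desired
        and status.get("replicas", 0) == desired
    )


def declared_images(deployment: dict) -> list[str]:
    """Images in the pod template, to confirm the rollback restored the
    intended revision rather than merely a healthy one."""
    containers = (
        deployment.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    return [c["image"] for c in containers]
```

Comparing declared_images() against the image recorded for the last known-good release catches the case where the rollout settles cleanly on the wrong revision.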
Context: Terraform documents state as the binding between configuration and remote objects. Its state guidance warns that if bindings are added or removed outside the normal flow, for example with terraform import or terraform state rm, operators must preserve the one-to-one relationship between resource instances and remote objects themselves. The documented pattern is to manage those bindings explicitly and deliberately before the next plan: Terraform state and state commands.
Action: After a partial apply, inspect state before the next plan. Use imports, moves, or removals deliberately, with backups and peer review.
Result: The next automation run is less likely to destroy, duplicate, or orphan infrastructure because the controller memory has been repaired before reactivation.
Learning: Declarative automation is only as safe as its state model. Reality reconciliation is part of rollback, not cleanup.
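A practical test that the controller memory has been repaired is that a fresh plan proposes nothing. The sketch below assumes the plan was saved and rendered with terraform show -json, whose output lists each resource under resource_changes with a change.actions array. The function names and the resource addresses in the usage note are hypothetical.

```python
from __future__ import annotations

import json

# Action lists that do not modify any remote object.
HARMLESS = (["no-op"], ["read"])


def pending_changes(plan_json: str) -> dict[str, list[str]]:
    """Map each resource address to its planned actions, skipping harmless ones."""
    plan = json.loads(plan_json)
    pending = {}
    for resource in plan.get("resource_changes", []):
        actions = resource["change"]["actions"]
        if actions not in HARMLESS:
            pending[resource["address"]] = actions
    return pending


def safe_to_reenable(plan_json: str) -> bool:
    """True only when the plan would change nothing."""
    return not pending_changes(plan_json)
```

For a plan where one resource is a no-op and another shows the actions delete then create, pending_changes() returns only the second address, which is the signal to go back to state repair instead of re-enabling the pipeline.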
Where It Breaks
| Failure mode | Why it happens | Control |
|---|---|---|
| Automation replays the bad change | Workflow, scheduler, or controller remains active | Disable write paths before reverting |
| Revert succeeds but production stays broken | Runtime has separate rollout state or cached configuration | Verify workload, traffic, cache, and flags |
| Infrastructure plan becomes dangerous | State no longer matches remote resources | Repair bindings before applying |
| Database rollback is not reversible | Migration destroyed or reshaped data | Prefer forward repair migrations and backups |
| Incident ends with hidden drift | Teams trust Git or CI status alone | Reconcile declared state against observed reality |
| Automation restart causes a second incident | No dry run before re-enabling | Require no-op plan, diff, or canary |
What to Do Next
- Problem: Your rollback procedure probably assumes a single failed change, but your platform has multiple controllers that can continue writing after the incident begins.
- Solution: Rewrite the runbook around the four phases: disable automation, revert the change, repair control-plane state, and reconcile observed reality.
- Proof: A good rollback is not “the build is green.” It is a verified no-op plan, stable runtime health, correct state bindings, and a controlled automation restart.
- Action: Add emergency brakes to every production writer this quarter: CI workflows, deployment controllers, infrastructure pipelines, schedulers, feature flag writers, and release automation. Then rehearse the rollback with a harmless change and require evidence for each phase before calling it complete.