Terraform automation fails when teams treat infrastructure delivery like application delivery: build an artifact, deploy it anywhere, and roll it back if the deployment misbehaves. Infrastructure has a different failure shape. The artifact is a proposed mutation against live state, the reviewer is approving blast radius, the lock is protecting a shared control plane, and rollback is usually another forward change.

Situation

Platform teams are moving Terraform out of laptops and into CI/CD because local applies do not scale across many contributors, accounts, environments, and compliance boundaries. Pull requests give teams review, audit history, policy checks, and a familiar approval surface. CI gives them consistent versions, ephemeral credentials, structured logs, and a repeatable path from change request to apply.

That shift is necessary, but it changes the unit of control. A Terraform pipeline is not just fmt, validate, plan, and apply glued together. It is a workflow for deciding who can propose infrastructure changes, who can approve them, which exact plan is allowed to run, how concurrent mutation is prevented, and where the organization accepts that rollback becomes manual recovery.

The mature pattern is to make CI/CD boring: speculative plans on pull requests, human or policy review before merge, serialized applies against each state, narrowly scoped credentials, and explicit recovery procedures for failed applies.

The Problem

Most broken Terraform pipelines fail at the boundaries between those steps, not inside a single command.

A pull request plan can be reviewed and then become stale before apply because another change landed first. An apply job can recompute a new plan after approval, silently expanding the reviewed blast radius. Two applies can race against the same state if the backend or automation layer does not lock correctly. A failed apply can leave real infrastructure partially changed while state reflects only the operations Terraform completed. A revert commit can remove configuration, but it does not guarantee that the cloud provider can reverse every side effect safely.

The hard question is not “how do we run Terraform from CI?” It is: what boundary makes a Terraform change reviewed, serialized, attributable, and recoverable enough to trust?

Core Concept

The answer is to make apply a privileged boundary, not a continuation of generic CI.

flowchart TD
  A[developer opens pull request — terraform change] --> B[ci plan job — format validate plan]
  B --> C[plan output — human readable diff]
  B --> D[plan file — opaque artifact]
  C --> E[review boundary — code owners policy checks]
  E --> F[merge boundary — approved intent]
  F --> G[apply job — protected environment]
  D --> G
  G --> H[state lock — one writer per state]
  H --> I[provider mutation — cloud control plane]
  I --> J[state update — recorded outcome]
  J --> K[rollback boundary — roll forward or recover]

The plan stage should answer “what would this change do from the current state?” It should run on every pull request, publish readable output, and fail closed on formatting, validation, and policy violations. It should not have broad production mutation rights.

The review stage should approve intent and blast radius. Reviewers need enough signal to distinguish expected churn from dangerous replacement, privilege escalation, data loss, or changes outside the intended workspace. For high-risk modules, approval should come from code owners who operate that infrastructure, not only from the service team that benefits from it.
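One way to give reviewers that signal is to bucket changes from the machine-readable plan instead of asking them to read raw diff output. The sketch below assumes the plan job has already run `terraform show -json` on the plan file and reads the `resource_changes` list from that JSON. The `sensitive_prefixes` list and the bucket names are illustrative assumptions, not Terraform concepts.

```python
import json
from collections import defaultdict

# Action lists as they appear under resource_changes[].change.actions.
# A replacement is reported as a delete/create pair in either order.
REPLACE_ACTIONS = (["delete", "create"], ["create", "delete"])


def classify(actions):
    """Map a Terraform action list to a single review bucket."""
    if actions in REPLACE_ACTIONS:
        return "replace"
    if actions in (["no-op"], ["read"]):
        return "ignore"
    return actions[0]  # "create", "update", or "delete"


def summarize_plan(plan_json, sensitive_prefixes=("aws_iam_", "aws_security_group")):
    """Group changed resource addresses by bucket and flag sensitive types.

    sensitive_prefixes is a hypothetical, team-chosen list of resource type
    prefixes that deserve a separate line in the review summary.
    """
    plan = json.loads(plan_json)
    buckets = defaultdict(list)
    for rc in plan.get("resource_changes", []):
        bucket = classify(rc["change"]["actions"])
        if bucket == "ignore":
            continue
        buckets[bucket].append(rc["address"])
        if rc["type"].startswith(tuple(sensitive_prefixes)):
            buckets["sensitive"].append(rc["address"])
    return dict(buckets)
```

A pull request comment built from this summary can lead with the `replace`, `delete`, and `sensitive` buckets, so a database replacement is not buried under tag updates.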

The apply stage should run only after the review boundary is satisfied. In strict pipelines, the apply uses a saved plan file generated by the approved run. HashiCorp documents terraform plan -out=FILE and applying that saved file with terraform apply FILE. Terraform refuses to apply a saved plan once the state has changed since the plan was created, so a stale artifact fails instead of mutating something nobody reviewed. The same documentation warns that saved plan files can contain sensitive values in cleartext, so the artifact store becomes part of the security boundary. See HashiCorp’s terraform plan command reference.
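If the saved plan is the authoritative artifact, the apply job needs a way to confirm it received the same file the reviewers saw. A minimal sketch, assuming the plan job records a small manifest next to the artifact: the `PlanRecord` schema, its field names, and the `may_apply` helper are all hypothetical. How the approved commit is determined (for example, under squash merges) is left to the pipeline.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanRecord:
    """What the plan job records next to the artifact (hypothetical schema)."""
    commit_sha: str
    workspace: str
    plan_sha256: str


def digest(plan_bytes: bytes) -> str:
    """SHA-256 of the saved plan file contents."""
    return hashlib.sha256(plan_bytes).hexdigest()


def may_apply(record: PlanRecord, plan_bytes: bytes, approved_sha: str, workspace: str) -> bool:
    """Allow apply only for the exact reviewed artifact, commit, and workspace."""
    same_artifact = hmac.compare_digest(record.plan_sha256, digest(plan_bytes))
    return (
        same_artifact
        and record.commit_sha == approved_sha
        and record.workspace == workspace
    )
```

This check covers the artifact's path through CI storage. It does not replace encryption, access control, or short retention on the artifact store.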

When teams instead recompute the plan after merge, they should admit the tradeoff: the reviewed plan was advisory, and the apply-time plan is the authoritative mutation. That can be acceptable when the apply job posts the final diff, requires a protected environment approval, and serializes per workspace. It is unsafe when merge approval is treated as approval for whatever CI later discovers.
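For pipelines that recompute, one hedged way to keep merge approval honest is to compare the apply-time plan with the reviewed one and require fresh approval when they differ. The sketch assumes both plans are available as parsed `terraform show -json` output. The function names are illustrative.

```python
def change_set(plan: dict) -> set:
    """Reduce parsed plan JSON to a set of (address, actions) pairs."""
    return {
        (rc["address"], tuple(rc["change"]["actions"]))
        for rc in plan.get("resource_changes", [])
        if rc["change"]["actions"] not in (["no-op"], ["read"])
    }


def unreviewed_changes(reviewed: dict, final: dict) -> list:
    """Changes in the apply-time plan that were not in the reviewed plan.

    A resource whose action changed (for example update -> replace) counts
    as unreviewed, because the (address, actions) pair differs.
    """
    return sorted(change_set(final) - change_set(reviewed))
```

An empty result means the apply-time plan introduces nothing beyond what reviewers saw. A non-empty result is the list to post for apply-time approval.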

In Practice

Context. The documented industry pattern is pull-request planning with protected application. HCP Terraform documents speculative plans for VCS-backed pull requests and states that speculative plans show possible changes but cannot apply them. That separates review visibility from mutation authority. See HashiCorp’s docs on remote operations.

Action. Put the pipeline on three rails. First, pull requests run speculative plans with read-oriented permissions and publish a summarized diff. Second, merges trigger applies in protected environments with restricted credentials. Third, every apply targets one state backend key or workspace and relies on state locking. Terraform’s own state locking documentation says Terraform locks state for operations that could write state when the backend supports locking. See HashiCorp’s state locking documentation.

Result. The result is not faster Terraform. It is a smaller failure domain. Reviewers approve a visible intent. Apply credentials exist only where mutation is allowed. Concurrent writes are blocked at the state boundary. If the provider API fails halfway through, the team knows which run held the lock, which change initiated it, and which workspace must be reconciled.

Learning. The useful lesson from tools such as Atlantis is that Terraform automation needs an application-level coordination layer in addition to backend locking. Atlantis documents pull-request locks around project and workspace operations, while noting that Terraform’s native command locking still applies underneath. See the Atlantis docs on locking. The pattern is explicit coordination: prevent competing plans and applies from pretending they are independent when they share state.
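The coordination idea can be sketched as a lock registry keyed by project directory and workspace. This is an in-memory illustration of the pattern, not Atlantis's implementation. A real coordinator would need durable storage, and the class and method names here are assumptions.

```python
import threading
from typing import Dict, Optional, Tuple

LockKey = Tuple[str, str]  # (project directory, workspace)


class PullRequestLocks:
    """In-memory sketch of pull-request locks keyed by project and workspace."""

    def __init__(self) -> None:
        self._owners: Dict[LockKey, int] = {}
        self._mutex = threading.Lock()

    def acquire(self, project: str, workspace: str, pr: int) -> Optional[int]:
        """Return None on success, or the PR number that already holds the lock."""
        key = (project, workspace)
        with self._mutex:
            owner = self._owners.setdefault(key, pr)
            return None if owner == pr else owner

    def release(self, pr: int) -> None:
        """Drop every lock held by a PR, for example when it merges or closes."""
        with self._mutex:
            self._owners = {k: v for k, v in self._owners.items() if v != pr}
```

The second pull request to touch the same project and workspace learns which pull request holds the lock, which is more useful to a contributor than a backend lock error discovered at apply time.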

A second documented pattern is removing long-lived cloud secrets from CI. GitHub Actions documents OpenID Connect for exchanging workflow identity for short-lived cloud credentials without storing long-lived credentials as repository secrets. See GitHub’s OIDC security hardening documentation. For Terraform, this matters because the apply boundary should be time-limited, environment-scoped, and auditable.
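The scoping logic can be illustrated with the claims GitHub places in its OIDC tokens. In practice the cloud provider's trust policy evaluates these conditions. The Python below only sketches the decision and assumes the token signature has already been verified elsewhere. The role names, the `production` environment name, and the `role_for_claims` helper are hypothetical.

```python
GITHUB_ISSUER = "https://token.actions.githubusercontent.com"


def role_for_claims(claims: dict, repo: str, audience: str) -> str:
    """Pick a role from already-verified OIDC claims.

    Returns "apply", "plan", or "deny". Only jobs running in the protected
    environment get the mutating role; pull request jobs get the plan role.
    """
    if claims.get("iss") != GITHUB_ISSUER or claims.get("aud") != audience:
        return "deny"
    sub = claims.get("sub", "")
    if sub == f"repo:{repo}:environment:production":
        return "apply"
    if sub == f"repo:{repo}:pull_request":
        return "plan"
    return "deny"
```

The design choice is that mutation authority follows the subject claim, so a workflow cannot obtain apply credentials merely by running in the right repository.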

Where It Breaks

| Boundary | Failure mode | Design response |
| --- | --- | --- |
| Plan artifact | Saved plan contains sensitive data | Encrypt artifacts, restrict access, expire quickly, avoid broad log exposure |
| Review | Reviewer approves unreadable churn | Summarize replacements, deletes, IAM changes, network exposure, and data resources separately |
| Merge | Approved plan becomes stale | Apply the saved plan or require apply-time approval for the final plan |
| Lock | CI serializes jobs but backend does not lock | Use a backend with locking and keep CI concurrency as a second guard |
| Workspace | Multiple environments share state | Split state by ownership and blast radius, not by repository convenience |
| Credentials | Pull request job can mutate production | Separate plan and apply roles, use protected environments, prefer short-lived identity |
| Rollback | Revert commit is treated as undo | Treat rollback as a new plan, review provider side effects, reconcile drift first |
| Failed apply | Infrastructure and state disagree | Stop further applies, inspect state, import or remove resources deliberately, then roll forward |

Rollback is the most commonly misunderstood boundary. Terraform does not provide a transaction across cloud APIs. If a database parameter group changes, a security group rule is removed, and an instance replacement starts, there is no universal “undo” that restores all external behavior. A rollback commit is just another desired state. It still needs a plan, a lock, credentials, and review.

The operational runbook should therefore say “recover,” not “rollback.” Recovery may mean applying the previous configuration, importing a resource that was created before failure, removing a bad object from state, manually restoring a provider setting, or rolling forward with a compensating change. The right move depends on what the provider actually did.
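A first triage step after a failed apply can be sketched as a set comparison between what state records and what the operator finds in the provider. The sketch assumes the state side comes from `terraform state list` and that the operator has mapped observed cloud objects to resource addresses by hand. That mapping and the `reconcile` helper are hypothetical.

```python
def reconcile(state_addresses: set, observed_addresses: set) -> dict:
    """Compare addresses recorded in state with addresses observed in the provider.

    The output is a list of candidates for a human to review, not a set of
    commands to run automatically.
    """
    return {
        # Exists in the provider but Terraform never recorded it:
        # a candidate for `terraform import`.
        "import_candidates": sorted(observed_addresses - state_addresses),
        # Recorded in state but missing from the provider:
        # a candidate for `terraform state rm` or a re-create.
        "state_rm_candidates": sorted(state_addresses - observed_addresses),
        # Present on both sides; settings may still differ.
        "consistent": sorted(state_addresses & observed_addresses),
    }
```

Even a "consistent" resource may have drifted attributes, so the output narrows where to look. It does not decide which recovery move is right.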

What to Do Next

Problem: Your pipeline probably shows a plan, but it may not preserve the reviewed mutation through apply, serialize all writers, or define what happens after partial failure.

Solution: Treat apply as a protected boundary. Separate speculative planning from mutation, scope credentials to the stage, lock per state, and decide whether saved plans or apply-time approvals are the authoritative control.

Proof: Use documented Terraform behaviors as the design base: saved plans are executable artifacts, state locking protects supported backends from concurrent writes, speculative plans are review-only, and tools like Atlantis add pull-request coordination around shared workspaces.

Action: Audit one production workspace this week. Trace a change from pull request to apply. Verify who can approve it, which credentials can mutate it, whether a second apply can race it, where the plan artifact lives, and what the operator does if the apply fails halfway through.