Argo CD Deployment Workflow: Sync Waves, Health Checks, Rollbacks, and Drift

A deployment system is not production-grade because it can apply YAML; it is production-grade when it can order change, prove readiness, reverse bad state, and expose drift before users discover it.

Situation

Platform teams adopted GitOps because Kubernetes made desired state declarative and visible. A commit can describe a namespace, deployment, service, ingress, policy, secret reference, and database migration job. Argo CD then reconciles the live cluster toward that declared state.

That model works well when applications are small and independent. The repository changes, Argo CD detects the new revision, renders manifests, compares them with live resources, and syncs the difference.

The harder case is the ordinary production case: one release touches multiple resource classes with different readiness semantics. Custom resource definitions must exist before custom resources. Service accounts and RBAC must exist before controllers start. Migrations may need to run before new pods receive traffic. Rollouts must wait for Kubernetes health, not merely for a successful kubectl apply. Some drift is harmless, some drift is an incident, and some drift is a controller doing its job.

Argo CD’s deployment workflow matters because it sits between Git’s clean history and Kubernetes’ eventually consistent reality.

The Problem

The default failure mode in GitOps is treating reconciliation as a single flat apply. That hides several operational problems.

Ordering is the first problem. Kubernetes accepts many objects independently, but applications often have dependencies. If a workload starts before its config, permissions, CRDs, or prerequisite jobs exist, the sync may technically complete while the rollout fails later.

Readiness is the second problem. A resource can be applied and still be unhealthy. A Deployment may be progressing, an Ingress may not have an address, a Job may still be running, and a custom resource may need controller-specific health logic. Without health gates, the deployment system reports movement rather than safety.

Rollback is the third problem. A GitOps rollback is not only “go back to the old image.” It must reconcile the entire declared state: manifests, config, hooks, generated resources, and dependent objects. Rolling back through a manual cluster edit creates a second source of truth.

Drift is the fourth problem. Drift can come from emergency manual patches, mutating admission controllers, autoscalers, operators, or failed pruning. Some drift should be repaired automatically. Some should be surfaced but left alone. The platform has to decide which is which.

The core question is: how do you design an Argo CD workflow that makes deployment order, health, rollback, and drift explicit enough to operate under pressure?

Core Concept

Treat Argo CD as a staged reconciliation pipeline, not a YAML launcher. The useful pattern is:

Declare ordering with sync phases and sync waves.
Let health checks decide whether later work should proceed.
Make rollback a Git operation or a controlled Argo CD revision operation.
Classify drift by ownership before enabling automated repair.

flowchart TD
  A[Git commit — desired state] --> B[Argo CD diff — compare live state]
  B --> C[PreSync hooks — validation and migration]
  C --> D[Sync wave negative one — namespaces and CRDs]
  D --> E[Sync wave zero — config and access]
  E --> F[Sync wave one — workloads]
  F --> G[Health checks — readiness gate]
  G --> H[PostSync hooks — verification]
  H --> I[Drift monitor — live state comparison]
  I --> B
  G --> J[Rollback path — revert desired state]
  J --> A

Sync waves are the ordering mechanism. Argo CD reads the argocd.argoproj.io/sync-wave annotation: resources default to wave 0, waves may be negative, and lower waves apply before higher waves. Argo CD waits for the resources in one wave to be synced and healthy before it starts the next. A practical convention is to put foundational resources in negative or early waves, application workloads in the middle, and verification hooks at the end.

Health checks are the gate. Built-in health exists for common Kubernetes resources, and custom health checks can be defined for resource types whose readiness is domain-specific. The important architectural decision is that apply success is not deployment success. The workflow should wait until health reflects the state users depend on.
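To make the difference between "applied" and "healthy" concrete, here is a minimal Python sketch of a health assessment for a Deployment-shaped object. It approximates the kind of logic a health check uses and is not Argo CD's built-in check (custom checks in Argo CD are written differently). The function names and the three status strings as used here are assumptions for illustration; the status fields are standard Kubernetes Deployment fields.

```python
from typing import Any


def deployment_health(obj: dict[str, Any]) -> str:
    """Return "Healthy", "Progressing", or "Degraded" for a Deployment-shaped dict."""
    meta = obj.get("metadata", {})
    spec = obj.get("spec", {})
    status = obj.get("status", {})

    # The Deployment controller has not yet observed the latest spec.
    if status.get("observedGeneration", 0) < meta.get("generation", 0):
        return "Progressing"

    # A rollout that exceeded its progress deadline is treated as failed.
    for cond in status.get("conditions", []):
        if cond.get("type") == "Progressing" and cond.get("reason") == "ProgressDeadlineExceeded":
            return "Degraded"

    desired = spec.get("replicas", 1)
    updated = status.get("updatedReplicas", 0)
    available = status.get("availableReplicas", 0)
    total = status.get("replicas", 0)

    # Still rolling: not enough new pods, old pods remain, or new pods not ready.
    if updated < desired or total > updated or available < updated:
        return "Progressing"
    return "Healthy"


def gate_open(resources: list[dict[str, Any]]) -> bool:
    """Later work may proceed only when every resource reports Healthy."""
    return all(deployment_health(r) == "Healthy" for r in resources)


APPLIED_BUT_ROLLING = {
    "metadata": {"generation": 2},
    "spec": {"replicas": 3},
    "status": {"observedGeneration": 2, "replicas": 4,
               "updatedReplicas": 1, "availableReplicas": 1},
}

READY = {
    "metadata": {"generation": 2},
    "spec": {"replicas": 3},
    "status": {"observedGeneration": 2, "replicas": 3,
               "updatedReplicas": 3, "availableReplicas": 3},
}

if __name__ == "__main__":
    print(deployment_health(APPLIED_BUT_ROLLING), deployment_health(READY))
```

Both example objects were accepted by the API server; only the second should open the gate.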

Rollbacks should restore declared state. In the cleanest case, rollback is a Git revert that returns the application to a previous known-good manifest set. Argo CD can also sync to a prior revision from its history (argocd app rollback), although it refuses to do so while automated sync is enabled, and the long-term source of truth still needs to converge back into Git. Otherwise, the next sync may reintroduce the bad state.
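A toy model shows why a rollback that never reaches Git does not hold. The classes and functions below (GitRepo, sync, cluster_only_rollback, git_first_rollback) are hypothetical stand-ins, not Argo CD or Git APIs; the only behavior modeled is that reconciliation always follows the latest commit.

```python
from dataclasses import dataclass, field


@dataclass
class GitRepo:
    """Toy stand-in for the Git history that the reconciler tracks."""
    commits: list = field(default_factory=list)

    def commit(self, desired: dict) -> int:
        self.commits.append(dict(desired))
        return len(self.commits) - 1

    def head(self) -> dict:
        return self.commits[-1]

    def restore(self, index: int) -> int:
        """Record a new commit whose content matches an earlier one."""
        return self.commit(self.commits[index])


def sync(repo: GitRepo, live: dict) -> dict:
    """Reconcile: whatever is live gets replaced by what HEAD declares."""
    return dict(repo.head())


def cluster_only_rollback() -> dict:
    repo = GitRepo()
    good = repo.commit({"image": "app:1.0"})
    repo.commit({"image": "app:2.0"})       # the bad release
    live = sync(repo, {})
    live = dict(repo.commits[good])          # rolled back in the cluster only
    return sync(repo, live)                  # the next reconcile follows HEAD


def git_first_rollback() -> dict:
    repo = GitRepo()
    good = repo.commit({"image": "app:1.0"})
    repo.commit({"image": "app:2.0"})       # the bad release
    live = sync(repo, {})
    repo.restore(good)                       # rollback recorded as a new commit
    return sync(repo, live)


if __name__ == "__main__":
    print(cluster_only_rollback(), git_first_rollback())
```

In the first path the bad image returns on the next sync; in the second, the desired state itself has changed, so reconciliation keeps the rollback in place.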

Drift handling needs policy. Automated sync with self-heal is powerful when Argo CD owns the field and manual edits are not allowed. It is dangerous when other controllers intentionally mutate resources. Ignore rules, diff customization, and clear ownership boundaries keep drift detection useful instead of noisy.
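Classifying drift by ownership can be sketched as a diff that separates ignored paths from real drift. This is an assumption-laden simplification: it compares only fields the Git manifest declares and matches ignore rules as JSON-pointer-style path prefixes, loosely in the spirit of Argo CD's ignoreDifferences jsonPointers setting. The function names and the example objects are hypothetical.

```python
from typing import Any


def flatten(obj: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts and lists into {"/a/0/b": value} style paths."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    out: dict[str, Any] = {}
    for key, value in items:
        out.update(flatten(value, f"{prefix}/{key}"))
    return out


def classify_drift(desired: dict, live: dict, ignored: list[str]) -> dict[str, list[str]]:
    """Split differing declared fields into ignored mutations and real drift."""
    want, have = flatten(desired), flatten(live)
    report: dict[str, list[str]] = {"ignored": [], "drift": []}
    for path in sorted(want):  # only fields that Git declares are compared
        if want[path] == have.get(path):
            continue
        is_ignored = any(path == p or path.startswith(p + "/") for p in ignored)
        report["ignored" if is_ignored else "drift"].append(path)
    return report


DESIRED = {"spec": {"replicas": 2, "template": {"spec": {"containers": [
    {"name": "api", "image": "registry.example.com/api:1.4"}]}}}}

# An autoscaler changed replicas; someone hot-patched the image by hand.
LIVE = {"spec": {"replicas": 5, "template": {"spec": {"containers": [
    {"name": "api", "image": "registry.example.com/api:1.4-hotfix"}]}}}}

if __name__ == "__main__":
    print(classify_drift(DESIRED, LIVE, ignored=["/spec/replicas"]))
```

With /spec/replicas assigned to the autoscaler, the replica count is reported as an ignored mutation while the hand-patched image remains drift that needs a decision.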

In Practice

Context: The documented Kubernetes pattern is declarative reconciliation: controllers compare desired state with observed state and continuously move the system toward the desired state. Argo CD applies the same pattern at the Git repository boundary, using Git as the desired state and the cluster API as observed state. Intuit’s documented public decision when creating Argo CD was to use the Git repository as the single source of truth to avoid split-brain scenarios between manual cluster edits and code.

Action: The documented Argo CD pattern is to encode ordering through sync phases and waves. PreSync hooks run before normal sync work, sync waves order resources within a phase, and PostSync hooks run only after the Sync phase has applied successfully and its resources report healthy. This allows a deployment to place validation, migration, base infrastructure, workloads, and verification into separate steps without leaving the GitOps model.

Result: The result is not a guarantee that the application is correct. The result is a more inspectable state machine. Operators can see which resource, hook, wave, or health check blocked progress. Kubernetes still owns pod scheduling, rollout progress, and controller convergence; Argo CD owns comparison, ordering, and sync orchestration.

Learning: The documented pattern is to make implicit dependencies explicit in metadata and policy. If a migration must precede traffic, it belongs in a hook or separate controlled release step. If a CRD must precede a custom resource, it belongs in an earlier wave. If a controller mutates fields after admission, those fields need a drift policy rather than repeated manual explanations.

A strong Argo CD workflow therefore does not hide Kubernetes behavior. It exposes it at the right level.

Where It Breaks

| Failure mode | Why it happens | Mitigation |
| --- | --- | --- |
| Sync succeeds but release fails | Apply completed before real readiness | Require health checks and verification hooks |
| Waves become a dependency graph language | Too much orchestration is encoded in annotations | Split applications or move complex workflows into purpose-built jobs |
| Rollback replays old assumptions | Older manifests may not match current external state | Test rollback paths and keep migrations backward compatible |
| Self-heal fights other controllers | Multiple systems own the same live fields | Define ownership and use diff customization |
| Hooks become hidden deployment logic | Critical behavior lives outside normal manifests | Keep hooks small, observable, and idempotent |
| Pruning deletes shared resources | Argo CD thinks it owns resources used elsewhere | Scope applications carefully and avoid shared mutable ownership |

What to Do Next

Problem: Your Argo CD app syncs manifests, but whether production stays healthy still depends on ordering, readiness, rollback, and drift behavior that may be implicit.
Solution: Model deployment as a gated reconciliation pipeline using sync waves, hooks, health checks, Git-first rollback, and explicit drift policy.
Proof: The architecture follows documented Kubernetes and Argo CD reconciliation patterns: desired state is declared, live state is compared, controllers converge, and health determines operational readiness.
Action: Audit one critical application. List every dependency, assign sync waves, define health gates, document rollback mechanics, and classify every recurring diff as either owned drift, ignored controller mutation, or an incident.

Rajiv
