GitOps Is Reconciliation, Not Just YAML in Git

GitOps fails when teams treat the repository as the product; the product is the control loop that continuously makes reality match the repository.

Situation

Platform teams adopted GitOps because it gave delivery a better audit trail. Instead of asking who ran a command against production, they could point to a commit, a pull request, a reviewer, and a deployment controller. That was a real improvement over snowflake scripts and privileged laptops.

But the operational value was never simply “put YAML in Git.” A static repository does not deploy anything. A pull request does not detect drift. A merge commit does not know whether a rollout became healthy, whether a namespace was manually changed, or whether a dependency failed halfway through an apply.

The useful architecture is reconciliation: declare intended state, observe actual state, compute the delta, act, then repeat. Git is the durable input. The controller is the system.
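The loop described above can be sketched in a few lines. This is a minimal illustration, not how any particular controller is implemented: the in-memory `State` dictionaries stand in for Git and the cluster API, and the names `compute_delta` and `reconcile_once` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory model: resource name -> spec.
State = Dict[str, dict]


def compute_delta(desired: State, actual: State) -> List[Tuple[str, str]]:
    """Return (action, name) pairs needed to move actual toward desired."""
    delta = []
    for name, spec in desired.items():
        if name not in actual:
            delta.append(("create", name))
        elif actual[name] != spec:
            delta.append(("update", name))
    for name in actual:
        if name not in desired:
            delta.append(("delete", name))
    return delta


def reconcile_once(desired: State, actual: State) -> List[Tuple[str, str]]:
    """One pass of the loop: observe, diff, act. Mutates `actual` in place."""
    delta = compute_delta(desired, actual)
    for action, name in delta:
        if action == "delete":
            del actual[name]
        else:
            # Create and update both write the declared spec.
            actual[name] = dict(desired[name])
    return delta
```

Calling `reconcile_once` repeatedly is the "repeat" step: a pass that returns an empty list means nothing has drifted since the previous pass.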

The Problem

Many teams rebuild their old CI/CD pipeline and call it GitOps. The pipeline renders manifests, runs kubectl apply, exits green, and leaves the cluster to deal with whatever happens next. If an operator hotfixes a deployment, the pipeline does not notice. If a resource is deleted by accident, nothing repairs it. If an admission policy rejects half the rollout, the job may have already moved on. If the target environment is unavailable, the deployment depends on retry logic in a build system that was designed for jobs, not long-lived convergence.
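To make the contrast with a one-shot job concrete, here is a rough sketch of retrying an apply with capped exponential backoff. The `apply` and `sleep` callables, the function names, and the default delays are all assumptions for illustration. The `max_attempts` limit exists only so the sketch terminates; a long-lived reconciler would typically keep requeueing instead.

```python
import itertools
from typing import Callable, Iterator


def backoff_delays(base: float = 1.0, cap: float = 60.0) -> Iterator[float]:
    """Yield exponentially growing retry delays, capped at `cap` seconds."""
    for attempt in itertools.count():
        # Bound the exponent so the float multiplication can never overflow.
        yield min(cap, base * (2 ** min(attempt, 32)))


def converge(
    apply: Callable[[], bool],
    sleep: Callable[[float], None],
    max_attempts: int = 10,
) -> int:
    """Call `apply` until it reports success; return the attempts used.

    Raises RuntimeError if the target never accepts the change, so the
    failure stays visible instead of disappearing with a finished job.
    """
    for attempt, delay in zip(range(1, max_attempts + 1), backoff_delays()):
        if apply():
            return attempt
        sleep(delay)
    raise RuntimeError(f"not converged after {max_attempts} attempts")
```

Passing `time.sleep` as `sleep` gives real waiting; passing a list's `append` method records the delays instead, which is convenient in tests.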

This creates a dangerous split-brain model. Git contains the desired state, but the cluster contains the operating truth. The longer those two diverge, the less useful Git becomes as a source of record. Engineers start asking whether the manifest is real, whether production was patched manually, and whether rollback means reverting Git or reverse-engineering the live environment.

The core question is not whether the platform stores YAML in Git. The core question is: what mechanism continuously proves that the running system still matches the declared intent?

Reconciliation as the Architecture

A GitOps platform should be evaluated as a control system, not as a repository convention. The minimum loop has five responsibilities: source acquisition, diffing, apply, health evaluation, and drift response.

flowchart TD
  A[Git commit — desired state] --> B[Source controller — fetch revision]
  B --> C[Diff engine — compare live state]
  G[Cluster API — actual state] --> C
  C -->|drift found| D[Apply engine — converge resources]
  D --> G
  G --> E[Health model — observe readiness]
  E -->|healthy| F[Policy gates — pause or promote]
  E -->|not healthy| H[Alerts — unresolved drift]
  F --> B

This loop changes the engineering contract. CI no longer deploys to production directly. CI builds, tests, signs, scans, and proposes a desired state change. The reconciler owns convergence. That separation matters because delivery is not a single event. It is an ongoing relationship between declared intent and live state.

Good GitOps platforms therefore expose state, not just logs. They should show the desired revision, the observed revision, the diff, the sync status, the health status, the last reconciliation result, and the reason a resource cannot converge. Without those signals, teams are back to reading pipeline output and guessing what the cluster did afterward.
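One way to picture "state, not just logs" is a status record that a controller could publish after each pass. The sketch below is hypothetical: the field names and the `Synced`, `OutOfSync`, `Healthy`, and `Degraded` labels are illustrative choices, not any tool's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReconcileStatus:
    desired_revision: str             # commit the controller was asked to apply
    observed_revision: Optional[str]  # commit last applied successfully, if any
    diff: List[str]                   # resources that differ from desired state
    healthy: bool                     # outcome of the health evaluation
    message: str = ""                 # reason a resource cannot converge

    @property
    def sync_status(self) -> str:
        """Synced only when revisions match and no resource differs."""
        in_sync = self.observed_revision == self.desired_revision and not self.diff
        return "Synced" if in_sync else "OutOfSync"

    def summary(self) -> str:
        """One-line view an operator could read without opening job logs."""
        health = "Healthy" if self.healthy else "Degraded"
        text = (
            f"{self.sync_status}/{health} "
            f"desired={self.desired_revision} observed={self.observed_revision}"
        )
        return f"{text} reason={self.message}" if self.message else text
```

Keeping sync and health as separate fields reflects the distinction in the text: an application can match Git exactly and still be unhealthy.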

Pruning is also part of the architecture. If Git removes a resource, the reconciler must decide whether the live resource should be removed too. That decision should be explicit because deletion is a production behavior, not a formatting side effect. The same is true for self-healing. Automatically correcting drift is powerful, but only when teams understand which resources are managed, which fields are ignored, and which emergency changes will be overwritten.
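A small sketch of what "explicit" could mean here: pruning is an opt-in flag, and ignored fields are a declared set rather than an accident of the diff. The `plan` function, the `orphaned` label, and the flat field model are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical flat spec: field name -> value.
Spec = Dict[str, object]


def strip_ignored(spec: Spec, ignored: FrozenSet[str]) -> Spec:
    """Drop fields the reconciler has been told not to manage."""
    return {k: v for k, v in spec.items() if k not in ignored}


def plan(
    desired: Dict[str, Spec],
    live: Dict[str, Spec],
    *,
    prune: bool = False,
    ignored_fields: FrozenSet[str] = frozenset(),
) -> List[Tuple[str, str]]:
    """Plan actions; live resources missing from Git are deleted only if prune is set."""
    actions = []
    for name, spec in desired.items():
        if name not in live:
            actions.append(("create", name))
        elif strip_ignored(spec, ignored_fields) != strip_ignored(live[name], ignored_fields):
            actions.append(("update", name))
    for name in live:
        if name not in desired:
            # Without an explicit prune policy, report the resource instead of deleting it.
            actions.append(("delete" if prune else "orphaned", name))
    return actions
```

With `ignored_fields=frozenset({"replicas"})`, a manual or autoscaler-driven replica change is not treated as drift. With the default `prune=False`, a resource removed from Git is surfaced as orphaned rather than silently deleted.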

In Practice

Context: Kubernetes itself is built around controller reconciliation. The Kubernetes controller documentation describes controllers as control loops that watch cluster state and act to move current state toward desired state. That is the architectural root of GitOps on Kubernetes, not a marketing layer on top of manifests. See the Kubernetes controller pattern documentation: kubernetes.io/docs/concepts/architecture/controller.

Action: A GitOps controller applies the same pattern to delivery. Argo CD documents an automated sync policy with an optional self-heal setting. With self-heal enabled, the application controller syncs again when live state deviates from the state declared in Git; without it, changes made directly to the live cluster do not trigger an automatic sync. See Argo CD automated sync policy: argo-cd.readthedocs.io/en/stable/user-guide/auto_sync.

Result: The documented result is not “the pipeline ran.” The result is that the platform can detect out-of-sync resources, attempt convergence, and surface whether the application is healthy. That is a different failure model. A failed deployment becomes an unresolved reconciliation condition rather than a forgotten CI job. A manual production edit becomes drift rather than hidden state.

Learning: Flux exposes the same pattern through its Kustomization reconciliation model. Its documentation describes reconciling manifests from a Git repository and reports status during build, drift detection, and apply phases. It also documents suspension, which pauses reconciliation so that new source revisions are not applied and drift is not corrected until the Kustomization is resumed. See Flux Kustomization documentation: fluxcd.io/flux/components/kustomize/kustomizations.

The documented pattern across these systems is consistent: GitOps is useful when Git is the source of desired state and a controller continuously reconciles actual state. The repository is necessary, but insufficient.

Where It Breaks

Failure mode	Why it happens	Engineering response
YAML sprawl	Every team invents its own structure, overlays, and naming rules	Provide paved templates, policy checks, and ownership conventions
Hidden drift	Operators patch live resources outside the reconciler	Enable drift detection, define emergency workflows, and audit ignored fields
Unsafe pruning	Deleted manifests remove live dependencies unexpectedly	Require explicit pruning policy and environment-specific deletion review
Weak health checks	The controller applies resources but cannot tell whether the service works	Define health checks for workloads, dependencies, and rollout gates
CI ownership confusion	Build pipelines still try to deploy directly	Make CI produce artifacts and desired state; make reconciliation own convergence
Secret handling gaps	Teams commit references without a clear runtime secret model	Use sealed, external, or controller-managed secrets with rotation ownership
Multi-cluster ambiguity	One commit fans out without clear blast-radius control	Use progressive rollout, cluster targeting, and per-environment status visibility

The hardest failure is cultural. Engineers trust GitOps when they can predict what the controller will do. They bypass it when it behaves like a mysterious bot with cluster-admin access. That means platform teams must design for explainability: clear diffs, clear ownership, clear pause controls, and clear recovery paths.

What to Do Next

Problem: If deployment is just kubectl apply from CI, production state will eventually diverge from repository state.
Solution: Put a reconciliation controller between Git and the runtime, and make convergence a continuous platform responsibility.
Proof: Kubernetes controllers, Argo CD automated sync, and Flux Kustomization reconciliation all implement the same desired-state control-loop pattern.
Action: Audit your delivery system for five capabilities: drift detection, health evaluation, retry behavior, pruning policy, and visible reconciliation status.

Rajiv