Progressive Delivery Reference Architecture: CI, GitOps, Flags, SLOs, and Rollback

Most delivery failures are not caused by teams shipping too often. They are caused by platforms that treat deploy, release, verification, and rollback as the same event.

Situation

Modern engineering organizations have mostly accepted continuous integration, containerized workloads, infrastructure as code, and GitOps-style reconciliation. The industry has moved from quarterly change windows to many small production changes per day. That shift is healthy: smaller changes are easier to review, easier to reason about, and easier to reverse.

But many platforms still have a blunt delivery model. A pull request merges. A pipeline builds an image. A deployment controller applies manifests. Production traffic moves. Observability lights up after the fact. Rollback becomes a human decision made under time pressure.

That model was tolerable when deployments were rare and hand-held. It breaks when platforms support dozens or hundreds of teams. At that scale, the delivery system must encode judgment: which artifact is allowed to run, where it is allowed to run, how much traffic it may receive, what signals prove it is healthy, and what happens when those signals fail.

Progressive delivery is the reference architecture for that problem.

The Problem

The common failure is coupling promotion to deployment mechanics. The CI system proves that code compiled and tests passed. The GitOps controller proves that desired state reached the cluster. Neither proves that the new behavior is safe for users.

Feature flags are often added later, but only as application toggles. SLOs are defined in dashboards, but not connected to rollout decisions. Rollback exists, but it is treated as an emergency command instead of a normal control path. The result is a platform where each piece is locally reasonable and globally unsafe.

The platform question is not, “Can we deploy automatically?”

The better question is: how do we make production exposure increase only when the artifact, configuration, runtime signals, and user-impact metrics agree that it should?

Progressive Delivery Control Plane

The answer is to separate five concerns that are often collapsed: build, desired state, exposure, verification, and reversal.

CI should produce immutable artifacts and evidence. GitOps should reconcile environment state. The rollout controller should manage traffic movement. The feature flag service should manage behavioral exposure. The observability layer should evaluate SLOs and guardrails. Rollback should be automated, rehearsed, and boring.

flowchart TD
  A[developer change — pull request] --> B[CI pipeline — test and package]
  B --> C[artifact registry — immutable image]
  B --> D[policy evidence — tests scans provenance]
  C --> E[GitOps repository — desired environment state]
  D --> E
  E --> F[GitOps reconciler — apply declared state]
  F --> G[rollout controller — staged traffic]
  G --> H[service mesh or ingress — traffic weights]
  G --> I[feature flag service — behavior exposure]
  H --> J[telemetry pipeline — metrics logs traces]
  I --> J
  J --> K[SLO evaluator — error budget and guardrails]
  K -->|healthy| L[promote — wider exposure]
  K -->|unhealthy| M[rollback — reduce exposure]
  M --> G
  M --> I

CI is the admission layer. It should answer whether an artifact is eligible for promotion, not whether production should receive all traffic. Required evidence includes unit tests, integration tests, static checks, dependency checks, image scanning, and provenance. The output is an immutable image digest, not a mutable tag.
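
As a minimal sketch of that admission check, assuming the evidence is recorded as a set of named checks: the evidence names mirror the list above, while `Candidate` and `admission_problems` are hypothetical names introduced only for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import FrozenSet, List

# Hypothetical evidence names, taken from the checks listed in the text.
REQUIRED_EVIDENCE = frozenset({
    "unit_tests", "integration_tests", "static_checks",
    "dependency_checks", "image_scan", "provenance",
})

# An image reference pinned by digest ends in "@sha256:<64 hex characters>".
DIGEST_PATTERN = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


@dataclass(frozen=True)
class Candidate:
    image_ref: str
    evidence: FrozenSet[str] = field(default_factory=frozenset)


def admission_problems(candidate: Candidate) -> List[str]:
    """Return the reasons a candidate is not eligible for promotion."""
    problems = []
    if not DIGEST_PATTERN.match(candidate.image_ref):
        problems.append("image is not pinned by digest")
    for name in sorted(REQUIRED_EVIDENCE - candidate.evidence):
        problems.append(f"missing evidence: {name}")
    return problems
```

An empty list means the artifact is eligible for promotion. It says nothing about how much traffic the artifact should receive.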

GitOps is the convergence layer. It should make the environment reproducible and auditable. A production promotion is a change to declared state, reviewed and recorded in Git. The reconciler applies that state, but it should not own the full release decision. Its job is convergence, not judgment.
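
One way to picture the split between promotion and convergence is the sketch below. It assumes declared state is a plain mapping from service name to settings. The `DesiredState` shape and both function names are hypothetical, not the API of any particular GitOps tool.

```python
from copy import deepcopy
from typing import Dict, List

# Hypothetical shape of a declared-state document: service name -> settings.
DesiredState = Dict[str, Dict[str, str]]


def promote(declared: DesiredState, service: str, image_ref: str) -> DesiredState:
    """Return a new declared state with one service's image changed.

    The input is left untouched so the change can be reviewed as a diff
    and committed, which is how the promotion ends up recorded in Git.
    """
    if service not in declared:
        raise KeyError(f"unknown service: {service}")
    updated = deepcopy(declared)
    updated[service]["image"] = image_ref
    return updated


def reconcile_plan(declared: DesiredState, observed: DesiredState) -> List[str]:
    """List the services whose observed image differs from the declared one.

    This is convergence only: it does not ask whether the change is safe.
    """
    return sorted(
        name
        for name, settings in declared.items()
        if observed.get(name, {}).get("image") != settings.get("image")
    )
```

The reconciler side only compares declared and observed state. Nothing in it decides whether users should see the new version.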

The rollout controller is the exposure layer. It shifts traffic in stages: internal, one percent, five percent, twenty-five percent, fifty percent, then full. Each step pauses for analysis. The step sizes are policy, not developer preference. Riskier services can move more slowly; low-risk internal services can move faster.
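
The idea that step sizes are policy can be sketched as a lookup table keyed by risk tier. The tier names and the faster schedule are assumptions made for illustration. The standard schedule follows the stages named above, with 0 standing in for the internal stage.

```python
from typing import Dict, Optional, Tuple

# Traffic weights in percent. 0 stands for the "internal" stage from the text:
# the new version is running but receives no public traffic.
# The tier names and the fast schedule are hypothetical policy values.
ROLLOUT_POLICY: Dict[str, Tuple[int, ...]] = {
    "standard": (0, 1, 5, 25, 50, 100),
    "low_risk_internal": (0, 25, 100),
}


def next_weight(tier: str, current_weight: int) -> Optional[int]:
    """Return the next traffic weight for a tier, or None when fully exposed."""
    steps = ROLLOUT_POLICY[tier]
    if current_weight not in steps:
        raise ValueError(f"{current_weight}% is not a step in tier {tier!r}")
    position = steps.index(current_weight)
    if position == len(steps) - 1:
        return None
    return steps[position + 1]
```

Because the schedule lives in platform-owned data, changing how fast a class of services rolls out is a policy change and does not require edits to each team's pipeline.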

Feature flags are the behavior layer. They let teams deploy code without exposing every path immediately. That matters because many incidents are not caused by broken containers. They are caused by valid code exercising a new path under real production data. Flags let the platform separate binary health from behavioral safety.
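
A common way to expose a behavior to a stable fraction of users is to hash the flag and user together into a bucket. The sketch below assumes that approach. It is not the evaluation logic of any specific flag service, and the function names are hypothetical.

```python
import hashlib


def rollout_bucket(flag_key: str, user_id: str) -> float:
    """Map a (flag, user) pair to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") % 10_000) / 100


def is_enabled(flag_key: str, user_id: str, rollout_percent: float) -> bool:
    """True when the user's bucket falls under the flag's exposure percentage."""
    return rollout_bucket(flag_key, user_id) < rollout_percent
```

A user's bucket never changes, so raising the percentage only adds users and lowering it only removes them. Setting the percentage to zero turns the behavior off for everyone without touching the deployed binary.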

SLOs are the decision layer. A rollout should not advance because a fixed timer expired. It should advance because user-impact indicators remain inside agreed bounds. Availability, latency, error rate, saturation, queue depth, payment failures, search quality, and job completion rate can all be valid checks, depending on the service.
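
A minimal sketch of a bounds-based decision, assuming each guardrail is a metric name with an optional minimum and maximum. The `Guardrail` type, the metric names in the usage note, and the three outcome strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Guardrail:
    metric: str
    minimum: Optional[float] = None  # lowest acceptable value, if any
    maximum: Optional[float] = None  # highest acceptable value, if any


def rollout_decision(guardrails: Iterable[Guardrail], observed: Dict[str, float]) -> str:
    """Return "promote", "hold", or "rollback" for the current step."""
    missing = False
    for rail in guardrails:
        value = observed.get(rail.metric)
        if value is None:
            missing = True  # no data is not evidence of health
            continue
        if rail.minimum is not None and value < rail.minimum:
            return "rollback"
        if rail.maximum is not None and value > rail.maximum:
            return "rollback"
    return "hold" if missing else "promote"
```

With guardrails such as `Guardrail("success_rate", minimum=0.999)` and `Guardrail("p99_latency_ms", maximum=400)`, the step advances only when both metrics are present and in bounds. No timer appears anywhere in the decision.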

Rollback is the reverse exposure layer. It should be expressed as policy: reduce traffic, disable a flag, restore a previous image, or revert declared state. The platform should prefer the smallest reversal that stops user harm. Turning off a flag is often safer than rolling back an entire deployment. Reverting traffic is often faster than rebuilding.
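
The preference for the smallest reversal can be written down as an ordered policy. The ordering below is an assumption for illustration, and the `Incident` fields are hypothetical. A real platform would choose its own order for each service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Incident:
    suspect_flag: Optional[str]    # flag guarding the suspected code path, if known
    canary_weight: int             # percent of traffic on the new version
    previous_image: Optional[str]  # last known-good image reference, if recorded


def smallest_reversal(incident: Incident) -> str:
    """Pick the narrowest action expected to stop user harm.

    The ordering is an assumption for illustration, not a universal rule.
    """
    if incident.suspect_flag is not None:
        return f"disable flag {incident.suspect_flag}"
    if 0 < incident.canary_weight < 100:
        # A stable version is still serving, so traffic can simply move back.
        return "shift traffic back to the stable version"
    if incident.previous_image is not None:
        return f"restore image {incident.previous_image}"
    return "revert declared state in Git"
```

Encoding the order as code means the reversal path can be exercised in drills, so it is not being improvised during an incident.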

In Practice

Context: Kubernetes documents the Deployment as a resource whose controller manages ReplicaSets and supports rolling updates and rollback. The documented pattern is that a desired-state controller replaces pods gradually, not all at once. That gives the platform a primitive for safe convergence, but not a full release-safety model. See the Kubernetes Deployment documentation.

Action: Argo Rollouts and Flagger build on the Kubernetes controller model by adding canary and blue-green strategies, metric analysis, and traffic-provider integration. The documented pattern is to connect rollout steps with measurements from systems such as Prometheus, Datadog, or service mesh telemetry. In this architecture, those tools occupy the rollout-controller position, not the CI position.

Result: The delivery decision moves closer to production reality. A pipeline can still fail fast on bad artifacts, but a rollout can also stop when real request success rate, latency, or custom business metrics degrade. This is derived from how progressive delivery controllers behave: they watch analysis results during rollout and can pause, promote, or abort based on configured thresholds.

Learning: Google SRE material frames reliability through SLOs and error budgets. The documented pattern is that reliability targets should influence release velocity. Progressive delivery turns that principle into automation: if the service is burning error budget or violating guardrails, exposure stops increasing. If the system is healthy, exposure expands without waiting for a manual meeting.
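
To make the error-budget gate concrete, here is a small sketch using burn rate, the observed error ratio divided by the error ratio the SLO allows. The function names and the default limit of 1.0 are assumptions. A real policy would also choose an evaluation window.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows.

    1.0 means the budget is being spent exactly as fast as the SLO permits.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    allowed_error_ratio = 1.0 - slo_target
    return (errors / total) / allowed_error_ratio


def may_increase_exposure(errors: int, total: int, slo_target: float,
                          max_burn_rate: float = 1.0) -> bool:
    """Exposure may grow only while the burn rate stays at or under the limit."""
    return burn_rate(errors, total, slo_target) <= max_burn_rate
```

For a 99.9% target, 5 errors in 10,000 requests burns budget at about half the allowed rate, so exposure may grow. At 50 errors in 10,000 requests the rate is about five times the allowed pace, so exposure stops increasing.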

The important lesson is that no single tool owns progressive delivery. CI, GitOps, flags, metrics, and rollback each enforce a different boundary. The architecture works when those boundaries are explicit.

Where It Breaks

| Failure mode | Why it happens | Platform response |
| --- | --- | --- |
| Metrics lag behind rollout | Telemetry windows are too short or pipelines are delayed | Require minimum sample sizes and warm-up periods before promotion |
| Guardrails are too generic | CPU and memory look fine while users see failures | Use service-level indicators tied to user outcomes |
| Flags become permanent forks | Teams never remove old conditional paths | Add flag ownership, expiry dates, and cleanup checks |
| Rollback is untested | The path exists only in runbooks | Run rollback drills and include reversal in rollout policy |
| GitOps fights emergency action | Manual rollback drifts from declared state | Represent rollback as a Git change or controller-owned state transition |
| Canary users are not representative | Early traffic misses the failing segment | Route by region, tenant class, endpoint, or workload shape where appropriate |
| Database changes are irreversible | Schema migration cannot be safely undone | Use expand-and-contract migrations before progressive exposure |
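
Two of the platform responses above are small enough to sketch directly: a readiness gate for lagging metrics and a cleanup check for stale flags. The default thresholds, the `FlagRecord` type, and both function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


def analysis_ready(request_count: int, seconds_in_step: float,
                   min_requests: int = 500, warmup_seconds: float = 120.0) -> bool:
    """True once a step has both warmed up and collected enough samples."""
    return seconds_in_step >= warmup_seconds and request_count >= min_requests


@dataclass(frozen=True)
class FlagRecord:
    key: str
    owner: str
    expires_on: date


def expired_flags(flags: Iterable[FlagRecord], today: date) -> List[str]:
    """Return the keys of flags whose expiry date has passed."""
    return sorted(flag.key for flag in flags if flag.expires_on < today)
```

The first function would run before any promotion decision is evaluated. The second could run as a scheduled check that notifies each flag's owner.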

The hardest boundary is data. Stateless service rollback is straightforward compared with schema changes, backfills, queue semantics, and external side effects. Progressive delivery does not remove that complexity. It exposes it earlier.

For database-backed systems, the platform should require backward-compatible migrations: expand the schema, deploy code that can read both shapes, migrate data, switch writes, then contract later. Rollback should not depend on restoring a database snapshot except in disaster recovery scenarios. A snapshot restore is not a release mechanism.
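
The migration sequence above can be enforced as an ordered set of phases. This is a sketch under the assumption that the platform can count how many running code versions still read the old shape. The `Phase` names and `next_phase_allowed` are hypothetical.

```python
from enum import IntEnum


class Phase(IntEnum):
    EXPAND_SCHEMA = 1
    DEPLOY_DUAL_READ_CODE = 2
    MIGRATE_DATA = 3
    SWITCH_WRITES = 4
    CONTRACT_SCHEMA = 5


def next_phase_allowed(current: Phase, requested: Phase,
                       old_shape_readers: int = 0) -> bool:
    """Allow only the next phase in order, and guard the destructive one."""
    if requested != current + 1:
        return False  # no skipping ahead and no moving backwards
    if requested == Phase.CONTRACT_SCHEMA and old_shape_readers > 0:
        return False  # something still depends on the old shape
    return True
```

Contract is the only step that removes anything, so it is the only step with an extra precondition.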

What to Do Next

Problem: Deploy pipelines often conflate artifact creation, environment convergence, user exposure, and release judgment. That creates fast systems that fail loudly and recover slowly.

Solution: Build a progressive delivery control plane with separate responsibilities: CI for evidence, GitOps for declared state, rollout controllers for staged traffic, feature flags for behavior, SLO evaluators for promotion decisions, and rollback automation for reversal.

Proof: Kubernetes, Argo Rollouts, Flagger, and Google SRE practices all point to the same architectural pattern: desired state is necessary, but production safety requires measured exposure against reliability signals.

Action: Start with one critical service. Require immutable image digests, define two or three user-impact guardrails, add a canary rollout, connect it to metrics, and rehearse rollback. Once the path is boring, turn it into a platform template rather than a team-by-team convention.

Rajiv
