Azure Multi-Region Design: Front Door, Cosmos DB, SQL Failover, and Operational Tradeoffs

A multi-region Azure architecture is not a diagram with two identical boxes; it is a set of explicit bets about which failures you will absorb, which inconsistencies you will tolerate, and which operations team will be awake during failover.

Situation

Cloud teams are under pressure to make regional outages uneventful. The business asks for active-active. The platform team hears global ingress, replicated data, zero downtime, and automated failover. Azure provides credible building blocks: Azure Front Door for global HTTP entry, Azure Cosmos DB for globally distributed NoSQL data, Azure SQL Database failover groups for relational continuity, and zone-redundant regional services for local resilience.

The trap is that these services do not compose into a single availability guarantee. Front Door can route traffic away from an unhealthy origin, but it cannot make a half-failed application safe. Cosmos DB can accept writes in multiple regions, but consistency and conflict behavior become application concerns. Azure SQL failover groups can redirect relational workloads, but forced failover can lose data because geo-replication is asynchronous. Each layer solves a different part of the failure.

The architecture has to start with failure ownership, not product selection.

The Problem

The naive design is symmetrical: deploy the same application into East US and West US, put Front Door in front, replicate Cosmos DB globally, configure SQL failover, and call the system active-active.

That design usually fails in the gaps between layers.

A user request can be routed to West US while its relational write path still depends on a primary SQL database in East US. A Cosmos DB document can be written locally under session consistency while a downstream relational transaction is serialized through a different region. Front Door health probes can mark an origin healthy because /healthz returns 200, while checkout, billing, or identity is degraded because a dependency is timing out. A failover group can move SQL to the secondary, but application connection pools, caches, background workers, and idempotency tables might still assume the old primary.

The hard question is not “how do we deploy two regions?” It is: which requests are allowed to continue when one region, one data system, or one replication path is impaired?

The Answer — Regional Stamps With Explicit Data Ownership

A safer Azure multi-region architecture uses regional stamps. Each stamp contains the compute, cache, queues, and regional dependencies needed to serve a bounded slice of traffic. Azure Front Door routes users to healthy stamps. Cosmos DB handles data that can tolerate distributed consistency semantics. Azure SQL Database remains the system of record only for data that needs relational constraints, with failover treated as a controlled operational event.

flowchart TD
  U[users — global clients] --> AFD[Azure Front Door — global ingress]
  AFD -->|latency routing| R1[region one stamp — app and workers]
  AFD -->|latency routing| R2[region two stamp — app and workers]

  R1 --> C1[Cosmos DB region one — local reads and writes]
  R2 --> C2[Cosmos DB region two — local reads and writes]
  C1 --> CR[Cosmos DB replication — consistency policy]
  C2 --> CR

  R1 --> S1[Azure SQL primary — relational system of record]
  R2 --> S2[Azure SQL secondary — failover target]
  S1 --> SG[SQL failover group — listener and replication]
  S2 --> SG

  R1 --> Q1[regional queue — retry and isolation]
  R2 --> Q2[regional queue — retry and isolation]
  SG --> OPS[operations runbook — failover decision]

Azure Front Door should route at the edge, not decide business correctness. Its job is to evaluate origin health, priority, latency, and weight, then send HTTP traffic to an origin group. Microsoft documents Front Door routing methods including latency and priority routing, and health probes are the signal used to evaluate origin health. That means the probe endpoint must represent real dependency readiness, not just process liveness.
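A minimal sketch of what such a probe endpoint might compute, assuming each critical dependency can be expressed as a quick boolean check. The function name `readiness`, the dependency names, and the one-second budget are illustrative assumptions, not Front Door settings. The only Front Door behavior relied on here is that a 200 response counts as a healthy probe and any other status counts as a failure.

```python
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable, Dict, Tuple


def readiness(
    checks: Dict[str, Callable[[], bool]], timeout_s: float = 1.0
) -> Tuple[int, Dict[str, str]]:
    """Run all dependency checks in parallel under one shared deadline.

    Returns (http_status, detail). The status is 200 only when every
    check returned True before the deadline; otherwise it is 503.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    futures = {pool.submit(fn): name for name, fn in checks.items()}
    done, not_done = wait(futures, timeout=timeout_s)
    # Do not block the probe response on checks that are still hanging.
    pool.shutdown(wait=False)

    detail: Dict[str, str] = {}
    for fut in done:
        name = futures[fut]
        if fut.exception() is not None:
            detail[name] = "error"
        else:
            detail[name] = "ok" if fut.result() else "failed"
    for fut in not_done:
        detail[futures[fut]] = "timeout"

    status = 200 if all(v == "ok" for v in detail.values()) else 503
    return status, detail
```

A handler for a path such as /healthz could return the status code from this function and include the detail map in the body, so that an operator reading the probe output during an incident can see which dependency pulled the stamp out of rotation.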

Cosmos DB should be used deliberately. Multi-region writes can reduce regional write latency and improve availability, but conflict handling and consistency become part of the application contract. Microsoft documents five consistency levels: strong, bounded staleness, session, consistent prefix, and eventual. Strong consistency improves programmability but increases cross-region write latency and can reduce availability during failures; it is also not available on accounts configured for multi-region writes, so choosing it means accepting a single write region. Session consistency is often the pragmatic default for user-facing workloads because it preserves read-your-writes within a client session, but it is not a global serial order.
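To see why conflict handling becomes product behavior, consider two regions updating the same document at nearly the same time. The sketch below is plain Python, not the Cosmos DB SDK. It imitates a last-writer-wins policy keyed on the `_ts` timestamp property, which is the documented default for multi-region write accounts, and contrasts it with a hypothetical application-level merge. The document shape and the `merge_cart` function are assumptions for illustration.

```python
from typing import Any, Dict

Doc = Dict[str, Any]


def resolve_lww(local: Doc, remote: Doc, path: str = "_ts") -> Doc:
    """Keep the version with the higher value at `path`.

    Ties favor the local version; that tie-break is an arbitrary
    choice made for this sketch.
    """
    return remote if remote[path] > local[path] else local


def merge_cart(local: Doc, remote: Doc) -> Doc:
    """Hypothetical custom merge: union the cart items from both
    versions and take every other field from the last writer."""
    merged = dict(resolve_lww(local, remote))
    merged["items"] = sorted(set(local["items"]) | set(remote["items"]))
    return merged


# Two regions each add an item to the same cart before replication.
east = {"id": "cart-1", "items": ["book"], "_ts": 100}
west = {"id": "cart-1", "items": ["pen"], "_ts": 101}

lww_result = resolve_lww(east, west)      # the "book" write is dropped
merged_result = merge_cart(east, west)    # both items survive
```

Under last-writer-wins the user silently loses the item added in the other region. Whether that is acceptable, or whether the workload needs a merge along the lines of `merge_cart`, is a product decision that has to be made per container.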

Azure SQL failover groups are a different tool. They are appropriate when the relational model is required and the application can tolerate a failover event. The operational distinction matters: Cosmos DB can be designed for continuous regional writes, while SQL failover is usually a promotion decision. A forced failover prioritizes recovery time over potential data loss because replication to the secondary is asynchronous.
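The promotion decision can be framed as a small piece of arithmetic: the last observed replication lag multiplied by the write rate approximates how many committed writes a forced failover puts at risk. The sketch below assumes the team already collects those two numbers from its own monitoring. The `ReplicaState` type, the three decision labels, and the thresholds are hypothetical, and the estimate is a rough upper bound rather than a guarantee.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReplicaState:
    primary_reachable: bool
    replication_lag_s: float   # last observed lag to the secondary
    writes_per_s: float        # recent write rate on the primary


def choose_failover(state: ReplicaState, rpo_budget_s: float) -> Tuple[str, float]:
    """Return (decision, estimated_writes_at_risk).

    "planned":  the primary is reachable, so the failover can wait for
                the secondary to synchronize; no loss is expected.
    "forced":   the primary is unreachable and the last observed lag
                fits inside the agreed RPO budget.
    "escalate": the lag exceeds the RPO budget; a human has to decide.
    """
    if state.primary_reachable:
        return "planned", 0.0
    at_risk = state.replication_lag_s * state.writes_per_s
    if state.replication_lag_s <= rpo_budget_s:
        return "forced", at_risk
    return "escalate", at_risk
```

For example, a two-second lag at 50 writes per second puts roughly 100 writes at risk. Putting that number in the runbook next to the RPO makes the tradeoff visible before anyone triggers the failover.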

In Practice

Context: Microsoft’s Azure Well-Architected mission-critical guidance recommends multi-region deployment and scale-unit thinking for workloads with high availability requirements. The documented pattern is to avoid one large shared platform and instead use repeatable deployment units that can fail independently.

Action: Apply that pattern by making each Azure region a stamp with its own app instances, queue consumers, cache, observability, and dependency configuration. Put Front Door in front, but keep the routing policy simple enough to reason about during an incident. Use priority routing for active-passive systems and latency or weighted routing only when both regions can safely process the same class of request.
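The difference between the two routing modes can be modeled in a few lines. This is a simplified simulation, not Front Door's actual selection algorithm: it ignores weights and latency sensitivity, and the `Origin` type and `pick_origin` function are illustrative names. It relies only on the convention that a lower priority number means a more preferred origin.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Origin:
    name: str
    healthy: bool
    priority: int       # lower number = preferred
    latency_ms: float   # measured from the edge location


def pick_origin(origins: List[Origin], active_active: bool) -> Optional[Origin]:
    """Pick an origin for one request class.

    active_active=True  models latency routing across all healthy origins.
    active_active=False models priority routing: the preferred healthy
    origin wins even when a lower-priority origin is closer.
    """
    healthy = [o for o in origins if o.healthy]
    if not healthy:
        # This sketch gives up here. The real service's behavior when
        # every origin fails its probe may differ; check the documentation.
        return None
    if active_active:
        return min(healthy, key=lambda o: o.latency_ms)
    return min(healthy, key=lambda o: (o.priority, o.latency_ms))
```

The useful part is the `active_active` flag: it should be decided per request class. A read-only catalog request might safely use latency routing, while a checkout request that depends on the SQL primary stays on priority routing until both regions can process it safely.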

Result: The blast radius becomes clearer. If one stamp loses its cache, queue, or regional app tier, Front Door can drain traffic from that origin. If Cosmos DB replication is delayed, the application can apply its documented consistency contract. If SQL must fail over, the team knows which write paths pause, which read paths remain available, and which workers must be restarted or re-pointed.

Learning: The documented pattern is not “make everything active-active.” It is to separate failure domains and match the data model to the recovery behavior. Cosmos DB is a good fit for globally distributed user state, catalogs, preferences, idempotency records, and event-derived materialized views when the consistency model is explicit. Azure SQL is a better fit for relational invariants, financial ledgers, complex transactions, and reporting models that require schema constraints. Mixing both is normal; hiding their different failure modes is the mistake.

Where It Breaks

Decision	Benefit	Failure Mode	Mitigation
Front Door latency routing	Sends users to nearby healthy origins	Healthy probe does not mean healthy transaction path	Probe critical dependencies and expose degraded readiness
Front Door priority routing	Simple active-passive failover	Passive region can rot if it receives no real traffic	Send synthetic and controlled production traffic
Cosmos DB multi-region writes	Low regional write latency and high availability	Conflicts and stale reads become product behavior	Define partitioning, conflict policy, and consistency per workload
Cosmos DB strong consistency	Easier correctness model	Higher cross-region latency and lower failure tolerance	Reserve for data that truly needs linearizable reads
SQL failover groups	Relational disaster recovery with listener abstraction	Forced failover can lose recent committed primary writes	Define RPO, rehearse failover, and pause unsafe writers
Shared global cache	Simpler application code	Cross-region dependency becomes hidden single point of failure	Prefer regional caches with explicit invalidation
Background workers in both regions	Faster recovery and local processing	Duplicate side effects during failover	Use idempotency keys and lease ownership
One global deployment pipeline	Consistent releases	Bad release reaches every region quickly	Use staged regional rollout and automatic rollback
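The mitigation listed for background workers, idempotency keys plus lease ownership, can be sketched with in-memory stand-ins. Everything here is hypothetical: `LeaseTable` and the shared `processed_keys` set stand in for a durable shared store, and a real implementation would need the lease update and the processed-key record to be atomic in that store.

```python
import time
from typing import Callable, Dict, Set, Tuple


class LeaseTable:
    """In-memory stand-in for a shared, durable lease store."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._leases: Dict[str, Tuple[str, float]] = {}

    def acquire(self, resource: str, owner: str, ttl_s: float) -> bool:
        """Grant or renew the lease unless another owner holds an unexpired one."""
        now = self._clock()
        current = self._leases.get(resource)
        if current is not None:
            holder, expires_at = current
            if holder != owner and expires_at > now:
                return False
        self._leases[resource] = (owner, now + ttl_s)
        return True


class Worker:
    def __init__(self, name: str, leases: LeaseTable, processed_keys: Set[str]) -> None:
        self.name = name
        self.leases = leases
        self.processed_keys = processed_keys

    def handle(self, resource: str, idempotency_key: str,
               side_effect: Callable[[], None], ttl_s: float = 30.0) -> str:
        # Lease ownership: only one region works on a resource at a time.
        if not self.leases.acquire(resource, self.name, ttl_s):
            return "lease-held"
        # Idempotency key: a takeover after failover must not repeat the effect.
        if idempotency_key in self.processed_keys:
            return "duplicate"
        side_effect()
        self.processed_keys.add(idempotency_key)
        return "done"
```

The two checks cover different windows. The lease prevents concurrent processing while both regions are alive, and the idempotency key prevents a repeat when the lease expires and the other region takes over work that had already completed.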

What to Do Next

Problem: Start by listing failure modes, not Azure services. For each user journey, decide what happens when the local app, remote app, Cosmos DB region, SQL primary, queue, cache, or Front Door origin is impaired.

Solution: Build regional stamps behind Azure Front Door. Use Cosmos DB for data that can live with an explicit distributed consistency contract. Use Azure SQL failover groups for relational state, but treat failover as an operational mode with runbooks, alerts, and rehearsals.

Proof: Test the architecture with regional game days. Disable one origin, block SQL primary connectivity, inject Cosmos DB latency, poison a queue consumer, and verify that routing, retries, idempotency, and dashboards show the expected behavior.

Action: Write the failover contract before the next implementation sprint: routing policy, data ownership, consistency level, SQL RPO and RTO, manual approval points, rollback steps, and the exact request classes that must stop rather than run incorrectly.
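One way to keep that contract from staying aspirational is to hold it as structured data and have a check flag any item nobody has filled in. The sketch below is a hypothetical shape for such a record. The field names mirror the list above, but the types, units, and the `missing_items` helper are assumptions.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List, Optional


@dataclass
class FailoverContract:
    routing_policy: str = ""                                        # e.g. "priority" or "latency"
    data_ownership: Dict[str, str] = field(default_factory=dict)    # dataset -> owning store
    consistency_level: str = ""                                     # e.g. "session"
    sql_rpo_s: Optional[int] = None                                 # tolerated data loss, seconds
    sql_rto_s: Optional[int] = None                                 # tolerated downtime, seconds
    manual_approvals: List[str] = field(default_factory=list)
    rollback_steps: List[str] = field(default_factory=list)
    must_stop_request_classes: List[str] = field(default_factory=list)


def missing_items(contract: FailoverContract) -> List[str]:
    """Return the names of contract items that are still empty.

    A value of 0 counts as filled in, so an RPO of zero seconds is
    treated as a deliberate decision rather than a gap.
    """
    return [
        f.name
        for f in fields(contract)
        if getattr(contract, f.name) in (None, "", [], {})
    ]
```

Running `missing_items` in a review checklist or a pipeline gate gives a concrete list of decisions that still need an owner before the sprint starts.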

Rajiv
