GCP Multi-Region Architecture: Global Load Balancing, Spanner, Pub/Sub, and Failure Testing

A multi-region architecture does not fail when a region goes dark; it fails earlier, when the control plane, data model, and test discipline quietly assume the region will never go dark.

Situation

Cloud teams move to multi-region GCP for predictable reasons: lower user latency, higher availability targets, regulatory placement, and protection from regional incidents. The default architecture often starts cleanly: Cloud Load Balancing in front, stateless services on GKE or Cloud Run, Cloud Spanner for globally replicated state, Pub/Sub for asynchronous work, and Cloud Monitoring for visibility.

That design is directionally right. It uses managed primitives that were built for global systems. Google’s global external Application Load Balancer (formerly HTTP(S) Load Balancing) provides a single global entry point. Spanner provides synchronous replication with strong consistency across configured replicas. Pub/Sub decouples request paths from background processing and, through message retention and seek, supports replay-oriented recovery patterns.

The operational question is not whether these services can run across regions. They can. The question is whether the application, deployment system, and failure tests agree on what “multi-region” actually means.

The Problem

Most failed multi-region designs are not missing regions. They are missing decision boundaries.

A global load balancer can route around an unhealthy backend, but only if the health check represents real service health. A backend that returns 200 while its regional Spanner access path is saturated is not healthy. A service that accepts writes but cannot publish required events is not healthy. A cache that serves stale entitlement data may look fast while violating business correctness.

Spanner can replicate data across regions, but it does not remove the cost of coordination. Strong consistency is useful because it gives the application a clear correctness contract. It also means write latency, leader placement, schema design, and transaction shape become architectural concerns. A careless transaction that spans user profile, billing state, and workflow history may work in one region and become expensive under global replication.

Pub/Sub can absorb spikes and help recover work, but it changes the failure mode. Instead of a synchronous request failing visibly, work may queue, retry, duplicate, or arrive later than the caller expects. That is a better failure mode only when handlers are idempotent, ordering assumptions are explicit, and backlog age is treated as production health.

The core question: how do you design a GCP multi-region system that survives regional failure without pretending every dependency is equally global?

A Control Plane for Regional Failure

The answer is to separate global routing, regional execution, globally consistent state, asynchronous work, and failure testing into different responsibilities.

flowchart TD
  U[users — global traffic] --> LB[global load balancer — policy and health]
  LB --> R1[region one — stateless services]
  LB --> R2[region two — stateless services]

  R1 --> S[spanner — multi-region database]
  R2 --> S

  R1 --> P[pubsub — durable event intake]
  R2 --> P

  P --> W1[workers region one — idempotent handlers]
  P --> W2[workers region two — idempotent handlers]

  T[failure tests — regional drills] --> LB
  T --> R1
  T --> R2
  T --> P
  T --> S

  O[observability — user visible health] --> LB
  O --> R1
  O --> R2
  O --> P
  O --> S

The global load balancer should make traffic decisions based on meaningful health. A shallow process check is insufficient. Health should include whether the service can reach its critical dependencies, whether it can complete a representative read path, and whether regional queues are within acceptable lag. Not every dependency belongs in every health check, but the check should match the promise the endpoint makes to users.
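A minimal sketch of a health evaluation that matches the endpoint's promise rather than process liveness. The `Probe` structure, the probe names, and the 300-second backlog limit are assumptions made for illustration; they are not part of any GCP API. The idea is only that critical dependencies decide the status code the load balancer sees, while optional ones are reported without failing the check.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Probe:
    """One dependency check behind a health endpoint (hypothetical structure)."""
    name: str
    check: Callable[[], bool]  # returns True when the dependency is usable
    critical: bool             # only critical probes decide the HTTP status


def evaluate_health(probes: List[Probe]) -> Tuple[int, Dict[str, str]]:
    """Return (http_status, per-probe detail) for a load balancer health check."""
    detail: Dict[str, str] = {}
    healthy = True
    for probe in probes:
        try:
            ok = probe.check()
        except Exception:
            ok = False  # a probe that raises counts as a failed dependency
        detail[probe.name] = "ok" if ok else "failing"
        if probe.critical and not ok:
            healthy = False
    return (200 if healthy else 503), detail


def backlog_within(oldest_unacked_seconds: float, limit_seconds: float) -> bool:
    """Queue-lag probe: healthy only while the oldest unacked message is young."""
    return oldest_unacked_seconds <= limit_seconds


# Example wiring with stubbed probes; real probes would call the dependencies.
example_probes = [
    Probe("spanner_read", lambda: True, critical=True),
    Probe("event_publish", lambda: True, critical=True),
    Probe("backlog_age", lambda: backlog_within(420.0, 300.0), critical=True),
    Probe("recommendations", lambda: False, critical=False),
]
example_status, example_detail = evaluate_health(example_probes)
# example_status is 503 here: the backlog probe is critical and over its limit.
```

Which probes count as critical is a per-endpoint decision. A read-only endpoint probably should not fail because event publishing is down, while a write endpoint that must emit events probably should.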

Regional services should stay stateless where possible. If a regional instance disappears, another region should be able to serve new requests without local disk recovery, manual cache promotion, or hidden singleton ownership. Session state, workflow state, and idempotency records belong in durable stores, not inside regional processes.

Spanner should hold state that truly requires strong consistency: account balances, ownership, entitlements, inventory, global uniqueness, and workflow state machines. The schema should reflect access patterns. Keep write transactions narrow. Avoid cross-entity transactions unless the invariant demands them. Choose leader placement deliberately because it affects write latency. Multi-region Spanner is not a latency eraser; it is a consistency system with explicit topology.
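To make the leader-placement point concrete, here is a deliberately simplified latency model, not a description of Spanner internals. It assumes a commit costs one round trip from the client to the leader plus the time for the leader to hear from enough voting replicas to form a majority. It ignores lock waits, commit wait, and multi-split transactions, and the round-trip numbers are invented for illustration.

```python
from typing import List


def estimated_commit_ms(client_to_leader_ms: float,
                        leader_to_replicas_ms: List[float]) -> float:
    """Rough commit latency under a majority-quorum assumption.

    leader_to_replicas_ms holds round-trip times from the leader to every
    other voting replica. The leader counts as one vote toward the majority.
    """
    voters = len(leader_to_replicas_ms) + 1
    majority = voters // 2 + 1
    acks_needed = majority - 1  # votes still required from other replicas
    if acks_needed == 0:
        quorum_ms = 0.0
    else:
        # The quorum forms when the acks_needed-th fastest replica responds.
        quorum_ms = sorted(leader_to_replicas_ms)[acks_needed - 1]
    return client_to_leader_ms + quorum_ms


# Illustrative round trips (assumed values, in milliseconds).
replica_rtts = [2.0, 15.0, 15.0, 90.0]
near_leader = estimated_commit_ms(2.0, replica_rtts)   # client next to leader
far_from_leader = estimated_commit_ms(90.0, replica_rtts)  # client a continent away
```

Under these assumed numbers the same write costs roughly 17 ms from a client near the leader and roughly 105 ms from a distant one. The model is crude, but it shows why the leader region should sit where most writes originate.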

Pub/Sub should carry work that can be retried safely: email delivery, projection updates, audit fanout, search indexing, billing workflow steps, and integration calls. Delivery is at least once by default, so message handlers must tolerate duplicates, and consumers should use stable idempotency keys. Backlog age, dead-letter volume, and retry rate should be first-class service indicators.
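A minimal sketch of a replay-safe consumer, assuming each message carries a stable `idempotency_key` field (a hypothetical field name). The in-memory set stands in for a durable record such as a database table; this is not the Pub/Sub client API, only the deduplication logic a handler would wrap around its side effect.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class IdempotentHandler:
    """Applies each logical message at most once, keyed by idempotency key."""
    apply: Callable[[Dict[str, str]], None]
    # Stand-in for a durable store; process memory would not survive a restart.
    processed: Set[str] = field(default_factory=set)

    def handle(self, message: Dict[str, str]) -> str:
        key = message["idempotency_key"]
        if key in self.processed:
            return "duplicate"  # safe to acknowledge without redoing the work
        # If apply raises, the key is not recorded and redelivery retries it.
        self.apply(message)
        self.processed.add(key)
        return "applied"


# Example: the same message delivered twice sends one email.
sent = []
handler = IdempotentHandler(apply=lambda m: sent.append(m["to"]))
msg = {"idempotency_key": "order-42:confirmation", "to": "user@example.com"}
first = handler.handle(msg)
second = handler.handle(msg)
```

In a real system the side effect and the key record should commit together, or the side effect itself should be idempotent. Otherwise a crash between the two steps reintroduces the duplicate.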

The architecture also needs a small but explicit operational control plane. That can be a runbook, an internal tool, or automated policy, but the decisions must be named: drain a region, disable writes for a path, pause consumers, replay a subscription, switch to read-only mode, or fail closed for a sensitive operation.
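One way to keep those decisions named is to treat them as a closed set and check that each has a trigger, an owner, and a rollback step. The sketch below is an assumption about how a team might encode that; the `RunbookEntry` fields and the example values are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ControlAction(Enum):
    DRAIN_REGION = "drain_region"
    DISABLE_WRITES = "disable_writes"
    PAUSE_CONSUMERS = "pause_consumers"
    REPLAY_SUBSCRIPTION = "replay_subscription"
    READ_ONLY_MODE = "read_only_mode"
    FAIL_CLOSED = "fail_closed"


@dataclass(frozen=True)
class RunbookEntry:
    action: ControlAction
    trigger: str   # the observable condition that justifies the action
    owner: str     # who is allowed to execute it
    rollback: str  # how to undo it


def missing_actions(entries: List[RunbookEntry]) -> List[ControlAction]:
    """Return actions with no complete runbook entry, in declaration order."""
    complete = {
        e.action
        for e in entries
        if e.trigger.strip() and e.owner.strip() and e.rollback.strip()
    }
    return [a for a in ControlAction if a not in complete]


# Example: one complete entry, one entry with no owner.
runbook = [
    RunbookEntry(ControlAction.DRAIN_REGION,
                 "regional error rate above agreed threshold",
                 "on-call SRE",
                 "restore backend weight"),
    RunbookEntry(ControlAction.PAUSE_CONSUMERS, "poison message loop", "", "resume"),
]
gaps = missing_actions(runbook)
```

A check like this can run in CI, so that an emergency action without an owner or a rollback is caught before an incident rather than during one.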

In Practice

Context: Google published Spanner as a globally distributed database providing externally consistent transactions across replicated data. The documented pattern is not “put every query in a global transaction.” The pattern is to use strong consistency where the business invariant needs it and to understand that replication topology affects latency and availability behavior.

Action: In a GCP architecture, place Spanner behind service APIs that own transaction boundaries. Do not let every caller compose arbitrary cross-table writes. Keep the transactional surface narrow: one aggregate, one workflow transition, one ownership decision. Use asynchronous Pub/Sub fanout for derived state.
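A sketch of what "one aggregate, one workflow transition" can look like at the service boundary. The order domain, state names, and event shape are hypothetical. The in-memory function stands in for a single read-write transaction on one row, and the returned event is what the service would publish for derived state.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Allowed workflow transitions for the hypothetical order aggregate.
ALLOWED = {
    ("pending", "approved"),
    ("pending", "rejected"),
    ("approved", "shipped"),
}


class TransitionError(Exception):
    """Raised when a transition violates the version or the state machine."""


@dataclass(frozen=True)
class Order:
    order_id: str
    state: str
    version: int


def transition(order: Order, new_state: str,
               expected_version: int) -> Tuple[Order, Dict[str, str]]:
    """Apply one state change to one aggregate and describe it as an event."""
    if order.version != expected_version:
        raise TransitionError("stale version: another writer got there first")
    if (order.state, new_state) not in ALLOWED:
        raise TransitionError(f"{order.state} -> {new_state} is not allowed")
    updated = Order(order.order_id, new_state, order.version + 1)
    event = {
        # Aggregate id plus version is stable across retries of this change.
        "idempotency_key": f"{order.order_id}:{updated.version}",
        "type": "order_state_changed",
        "state": new_state,
    }
    return updated, event


order = Order("o-1", "pending", 1)
approved, approved_event = transition(order, "approved", expected_version=1)
```

Callers never compose their own cross-table writes here. They ask for a transition, and the service decides whether the invariant allows it.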

Result: The system has a smaller correctness core. Regional services can fail over without also moving hidden state. Pub/Sub consumers can rebuild projections after interruption. Spanner remains responsible for authoritative state, not every operational side effect.

Learning: Multi-region reliability improves when strong consistency and eventual completion are separated. Spanner is the authority for invariants. Pub/Sub is the recovery channel for work. The load balancer is the traffic decision point. Each has a different contract.

Context: Google’s SRE material emphasizes testing reliability assumptions through controlled failure exercises and disaster recovery planning. The documented pattern is that availability is not only a design property; it is an operational practice.

Action: Test regional failure before it is needed. Run drills that remove one regional backend from service, block a dependency from a region, pause a subscription, and inject latency into a critical path. Measure user-visible success rate, write latency, queue backlog age, and recovery time.
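The measurements above only become a contract when they are compared against targets agreed before the drill. A minimal sketch, assuming the team has already collected the raw numbers; the target fields and all thresholds are placeholders, not recommendations.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DrillTargets:
    min_success_rate: float   # fraction of user-visible requests that succeed
    max_p99_write_ms: float
    max_backlog_age_s: float
    max_recovery_s: float


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_drill(successes: int, total: int, write_latencies_ms: List[float],
                   peak_backlog_age_s: float, recovery_s: float,
                   targets: DrillTargets) -> Dict[str, bool]:
    """Return a pass/fail verdict per measured dimension."""
    return {
        "success_rate": total > 0 and successes / total >= targets.min_success_rate,
        "p99_write_latency": percentile(write_latencies_ms, 99) <= targets.max_p99_write_ms,
        "backlog_age": peak_backlog_age_s <= targets.max_backlog_age_s,
        "recovery_time": recovery_s <= targets.max_recovery_s,
    }


# Example with placeholder numbers from an imagined region-drain drill.
targets = DrillTargets(0.995, 250.0, 300.0, 600.0)
verdict = evaluate_drill(9_970, 10_000, [40.0, 55.0, 60.0, 310.0],
                         peak_backlog_age_s=120.0, recovery_s=480.0,
                         targets=targets)
```

Reporting a verdict per dimension, rather than one overall pass, keeps a drill honest: a failover can meet its success-rate target while still missing the write-latency one.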

Result: The team learns which failures are automatic and which require human judgment. A load balancer failover that works for reads may still expose write hot spots. A Pub/Sub backlog may drain cleanly under normal load and fail under catch-up pressure. A region may be removable only after a deployment dependency is made global.

Learning: Failure tests turn architecture diagrams into contracts. If a diagram says traffic can move from one region to another, the drill must prove it under realistic dependency behavior.

Where It Breaks

Area	Failure mode	Mitigation
Load balancing	Health check passes while the service cannot complete real work	Use endpoint-specific health and synthetic transactions
Spanner	Global writes become slow because transactions are too broad	Model aggregates carefully and keep write paths narrow
Pub/Sub	Duplicate or delayed messages corrupt derived state	Require idempotency keys and replay-safe consumers
Regional services	Local state prevents clean failover	Move durable state to Spanner or another managed store
Deployment	A bad rollout reaches every region at once	Use staged regional rollout and fast rollback
Observability	Metrics show infrastructure health but not user impact	Track success rate, latency, backlog age, and correctness signals
Runbooks	Engineers know the design but not the emergency decisions	Predefine drain, pause, replay, and read-only procedures

What to Do Next

Problem: The architecture claims multi-region availability, but health checks, transaction boundaries, and recovery paths may still be regional assumptions.
Solution: Put global load balancing at the edge, keep services stateless, use Spanner for authoritative invariants, use Pub/Sub for retryable work, and define explicit regional control actions.
Proof: Validate the design with failure drills: drain a region, pause consumers, inject dependency latency, replay messages, and measure user-visible outcomes.
Action: Before calling the system multi-region, write down the top five failure scenarios and run them in staging or production under controlled conditions. The architecture is not complete until the tests can fail honestly and recover predictably.

Rajiv
