Cloud Spanner vs Cloud SQL: The Real Distributed Database Decision

Most teams do not outgrow Cloud SQL because they need a more interesting database. They outgrow it when the failure domain of a single primary stops matching the business contract.

Situation

Attribute	Cloud SQL	Cloud Spanner
Architecture	Single primary, optional replicas	Distributed, multi-region native
Write scaling	Primary is the ceiling	Horizontal by key design and split routing
Read scaling	Read replicas, in-region or cross-region (async)	Stale reads from the nearest replica; strong reads may contact the leader
Consistency	Strong on the primary; replica reads can lag	Externally consistent globally (TrueTime)
Failover	Managed event, HA standby in secondary zone (~60s)	Built-in; no promotion event
Engine compatibility	PostgreSQL, MySQL, SQL Server	GoogleSQL dialect or PostgreSQL interface
Schema changes	Standard DDL	Online schema changes, fully managed
Starting cost	Low	Significant base cost (minimum 100 processing units)
Choose when	Regional system, standard engine tooling needed	Global writes, distributed consistency, horizontal scale

The usual database decision starts too low in the stack. Teams compare PostgreSQL compatibility, MySQL familiarity, query syntax, managed backups, pricing pages, and migration tooling. Those details matter, but they are rarely the real decision between Cloud SQL and Cloud Spanner.

Cloud SQL is a managed relational database service for engines teams already know: PostgreSQL, MySQL, and SQL Server. Its operating model is familiar: one writable primary, optional replicas, managed backups, maintenance windows, and high availability inside the constraints of a traditional database architecture.

Cloud Spanner is a distributed relational database. It is built for horizontal scale, synchronous replication, strong consistency, and multi-region availability. Its operating model is less familiar because the database is not a single machine with replicas attached. It is a distributed system that happens to expose SQL and transactions.

That difference changes the architecture conversation. The question is not “which one is better?” The question is whether your system can survive the operational shape of a primary database.

The Problem

Cloud SQL works extremely well when the write path fits on a primary, the application can tolerate regional recovery behavior, and scaling pressure is mostly read-heavy. In that world, replicas absorb analytics and reporting, indexes are tuned, connection pools are sized, and vertical scaling buys time.

The trouble begins when the application contract quietly becomes distributed while the database contract stays centralized.

A checkout system wants writes accepted during regional impairment. A financial ledger wants globally ordered transactions. A SaaS control plane wants tenant placement across regions without writing custom shard routing. A mobile backend wants low-latency reads from multiple continents but cannot allow stale business invariants. A marketplace wants inventory decrements, payment state, and fulfillment reservations to commit consistently even as traffic shifts between regions.

Teams often respond by building the missing distribution layer above Cloud SQL. They introduce application-level sharding, dual writes, queue-based reconciliation, read-your-writes exceptions, regional failover procedures, and increasingly complicated runbooks. The database remains familiar, but the system becomes less honest. The hard part moved into application code.
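To make that cost concrete, here is a minimal sketch of the dual-write problem. It uses in-memory dictionaries as stand-ins for two regional primaries, and the names `RegionStore`, `dual_write`, and `find_divergence` are hypothetical, chosen only for illustration. The point is that two sequential writes share no transaction, so a failure between them leaves the regions disagreeing.

```python
class RegionStore:
    """In-memory stand-in for one regional primary (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.available = True

    def write(self, key, value):
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        self.rows[key] = value


def dual_write(first, second, key, value):
    """Write to both regions in sequence; there is no shared transaction."""
    first.write(key, value)
    second.write(key, value)  # if this raises, `first` already holds the row


def find_divergence(a, b):
    """Return keys whose values differ; a reconciliation job would repair these."""
    all_keys = a.rows.keys() | b.rows.keys()
    return sorted(k for k in all_keys if a.rows.get(k) != b.rows.get(k))


us = RegionStore("us-east")
eu = RegionStore("eu-west")

dual_write(us, eu, "order-1", "paid")  # both regions healthy

eu.available = False  # simulate a regional impairment
try:
    dual_write(us, eu, "order-2", "paid")
except ConnectionError:
    pass  # the caller sees an error, but us-east kept the write

diverged = find_divergence(us, eu)
print(diverged)
```

Everything needed to repair `diverged` (retries, outboxes, reconciliation jobs) is code the application team now owns.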

So the real question is: do you need a managed relational database, or do you need the database itself to own distributed consistency and failure recovery?

The Real Decision Boundary

The clean decision boundary is the write contract.

Use Cloud SQL when the system has a natural primary region, write throughput is within the practical limits of a single primary, and failover can be treated as an operational event. Use Cloud Spanner when the write contract is distributed, the data model must scale horizontally, and consistency across failure domains is part of the product requirement rather than an optimization.

flowchart TD
    A[database decision — start with failure contract] --> B[Cloud SQL — primary database architecture]
    A --> C[Cloud Spanner — distributed database architecture]

    B --> D[single writable primary — familiar operations]
    B --> E[read replicas — scale read paths]
    B --> F[regional HA — managed failover event]

    C --> G[synchronous replication — database owned consistency]
    C --> H[horizontal splits — scale write paths]
    C --> I[multi-region topology — failure domain in design]

    D --> J[best fit — monoliths and regional services]
    E --> J
    F --> J

    G --> K[best fit — ledgers and global control planes]
    H --> K
    I --> K

Cloud SQL’s advantage is operational simplicity. You get standard engines, deep ecosystem support, straightforward local development, and a migration path that most engineers understand. If your bottleneck is schema design, query performance, connection management, or basic high availability, Cloud SQL is usually the sharper tool.

Cloud Spanner’s advantage is removing a category of application-owned distributed systems work. It gives up some engine-specific compatibility and some familiar tuning knobs, but it replaces them with a database architecture designed around replication, partitioning, and strong consistency. That trade is worth making only when the system’s correctness depends on it.

The mistake is choosing Spanner as an expensive scaling talisman. Spanner does not fix unclear ownership boundaries, unbounded transactions, careless indexes, or chatty request paths. It rewards teams that model access patterns deliberately. Poor key design can create hot ranges. Cross-region writes still pay physics. Distributed transactions are powerful, not free.
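The hot-range point can be illustrated with a small simulation. This is a sketch under simplifying assumptions: the key space is cut into eight fixed, equal ranges, whereas real Spanner splits are created and moved dynamically. The helpers `split_for`, `hashed_key`, and `load_by_split` are hypothetical names, and the starting key is an arbitrary timestamp-like value.

```python
import hashlib
from collections import Counter

KEY_SPACE = 2 ** 64   # assume 64-bit integer primary keys
NUM_SPLITS = 8        # assumed fixed number of contiguous key ranges


def split_for(key: int) -> int:
    """Map a key to one of NUM_SPLITS equal, contiguous key ranges."""
    return key * NUM_SPLITS // KEY_SPACE


def hashed_key(n: int) -> int:
    """Derive a well-spread 64-bit key from a sequence number."""
    digest = hashlib.sha256(str(n).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def load_by_split(keys):
    """Count how many writes land in each range."""
    counts = Counter(split_for(k) for k in keys)
    return [counts.get(i, 0) for i in range(NUM_SPLITS)]


# 10,000 consecutive inserts with monotonically increasing keys
start = 1_700_000_000_000
sequential_load = load_by_split(range(start, start + 10_000))

# The same inserts, keyed by a hash of the sequence number instead
hashed_load = load_by_split(hashed_key(n) for n in range(start, start + 10_000))

print("sequential:", sequential_load)
print("hashed:    ", hashed_load)
```

With sequential keys, every write lands in a single range; with hashed keys, the load spreads roughly evenly. The trade is that hashed keys give up cheap range scans in insertion order, which is why key choice has to follow the access pattern.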

The opposite mistake is staying on Cloud SQL after the architecture has already become distributed. Once teams are coordinating shards, replaying outboxes, reconciling duplicate writes, and maintaining regional promotion playbooks, they are already paying the complexity cost. They are just paying it in application code, incident response, and human judgment.

In Practice

Context: Google’s Spanner paper, “Spanner: Google’s Globally-Distributed Database,” documents the core pattern: a database designed to distribute data across datacenters while still supporting externally consistent transactions. The important lesson is not that every company needs global SQL. The lesson is that once correctness spans datacenters, the transaction protocol and clock uncertainty become first-class architecture concerns.

Action: Spanner exposes a model where replication and transaction ordering are part of the database contract. Google’s public documentation describes TrueTime, a clock API that reports bounded uncertainty, as the mechanism behind external consistency: if one transaction commits before another starts, the first receives the earlier commit timestamp. That is a database-level answer to a problem many teams otherwise approximate with queues, timestamps, locks, and compensating jobs.
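A toy model of the commit-wait idea from the Spanner paper may help. This is a simplified sketch, not Spanner's API: `SimulatedTrueTime` is a hypothetical clock that returns an interval guaranteed to contain the true time, and the 4 ms uncertainty is an arbitrary assumption. A transaction takes the upper bound of the interval as its commit timestamp, then waits until even the lower bound has passed that timestamp before it is allowed to become visible.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    earliest: float
    latest: float


class SimulatedTrueTime:
    """Toy clock with bounded uncertainty; not Spanner's actual API."""

    def __init__(self, epsilon_ms: float):
        self.epsilon_ms = epsilon_ms
        self.true_time_ms = 0.0  # the real time, which callers cannot read directly

    def now(self) -> Interval:
        return Interval(self.true_time_ms - self.epsilon_ms,
                        self.true_time_ms + self.epsilon_ms)

    def advance(self, ms: float) -> None:
        self.true_time_ms += ms


def commit(clock: SimulatedTrueTime, tick_ms: float = 1.0):
    """Pick a commit timestamp, then wait until it is certainly in the past."""
    commit_ts = clock.now().latest
    waited = 0.0
    while clock.now().earliest <= commit_ts:  # commit wait
        clock.advance(tick_ms)
        waited += tick_ms
    return commit_ts, waited


clock = SimulatedTrueTime(epsilon_ms=4.0)
first_ts, first_wait = commit(clock)
second_ts, second_wait = commit(clock)  # starts only after the first is visible

print(first_ts, first_wait, second_ts, second_wait)
```

In this model each commit waits a little over twice the clock uncertainty, and a transaction that starts after an earlier one finishes always receives a later timestamp. That wait is the price of ordering, which is why tight clock bounds matter so much to the design.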

Result: The documented pattern is simpler application reasoning at the cost of a more specialized database architecture. Application code can rely on strong consistency guarantees instead of encoding a large amount of regional coordination logic itself. The tradeoff is that schema design, key choice, and transaction shape become central performance decisions.

Learning: Cloud SQL follows the traditional managed relational pattern. Google Cloud’s documentation for Cloud SQL high availability and read replicas describes a familiar architecture: a primary instance, standby or failover behavior, backups, and replicas used to offload reads. That pattern is excellent when the system can name a primary write location. It becomes strained when the product needs the database to behave like a multi-region coordination system.

The practical conclusion is not “Spanner for scale, Cloud SQL for small.” Many large systems should stay on Cloud SQL because their data ownership is regional, their operational model is simple, and their engineering leverage comes from standard PostgreSQL or MySQL behavior. Some smaller systems may need Spanner because their correctness boundary is global from day one: payments, identity, inventory, entitlement, or control-plane state.

Where It Breaks

Decision area	Cloud SQL failure mode	Cloud Spanner failure mode
Write scaling	Primary becomes the ceiling for write throughput	Hot keys or poor split behavior concentrate load
Regional resilience	Failover is an event the system must tolerate	Multi-region writes pay latency and topology costs
Consistency	Cross-region correctness often moves into application code	Strong consistency can encourage oversized transactions
Ecosystem	Excellent compatibility with PostgreSQL, MySQL, or SQL Server tooling	SQL support is relational but not identical to a chosen engine
Operations	Familiar tuning can hide growing sharding complexity	Distributed design requires deliberate schema and key choices
Cost model	Starts simple, then grows through replicas, larger instances, and operations	Starts higher, but may replace custom coordination machinery

What to Do Next

Problem: Write down the failure contract before choosing the database. Name the maximum acceptable write outage, recovery point, recovery time, and regions that must continue accepting writes.
Solution: Choose Cloud SQL when a primary-region relational database satisfies that contract. Choose Cloud Spanner when consistency, availability, and horizontal write scale must be owned by the database across failure domains.
Proof: Test the architecture under the failure it claims to survive. Promote replicas, block regions, replay writes, measure stale reads, and verify whether application invariants still hold without manual reconciliation.
Action: Do not migrate because “distributed” sounds safer. Migrate when the current architecture has already forced you to build a distributed database outside the database.
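
One way to make the first two steps concrete is to write the failure contract down as data and check it mechanically. The sketch below is an illustration, not a sizing tool: `FailureContract` and `suggest_database` are hypothetical names, the 60-second default comes from the approximate failover figure in the comparison table, and the rule only encodes the decision boundary argued above. The region names are examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureContract:
    max_write_outage_s: float   # longest acceptable pause in accepting writes
    rpo_s: float                # recovery point: tolerable window of lost writes
    rto_s: float                # recovery time: how fast service must return
    write_regions: frozenset    # regions that must keep accepting writes
    needs_horizontal_write_scale: bool = False


def suggest_database(contract: FailureContract, failover_s: float = 60.0) -> str:
    """Apply the write-contract rule; failover_s is an assumed failover duration."""
    multi_region_writes = len(contract.write_regions) > 1
    failover_too_slow = failover_s > min(contract.max_write_outage_s,
                                         contract.rto_s)
    if (multi_region_writes
            or contract.needs_horizontal_write_scale
            or failover_too_slow):
        return "Cloud Spanner"
    return "Cloud SQL"


regional_service = FailureContract(
    max_write_outage_s=300, rpo_s=0, rto_s=300,
    write_regions=frozenset({"us-central1"}),
)
global_ledger = FailureContract(
    max_write_outage_s=5, rpo_s=0, rto_s=5,
    write_regions=frozenset({"us-central1", "europe-west1"}),
)

print(suggest_database(regional_service))
print(suggest_database(global_ledger))
```

The recovery point is recorded but deliberately not used in the rule: it is something to verify in the failure test rather than infer from a formula.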

Rajiv
