CAP Theorem in Operational Terms

CAP theorem is not an academic curiosity. It tells you what your distributed database will do when the network between its nodes fails — and that is exactly when the wrong answer causes data loss or an outage. Most engineers have heard of CAP and most have the wrong mental model for applying it.

Situation

CAP theorem, stated by Eric Brewer in 2000 and proved by Gilbert and Lynch in 2002, says that a distributed system can guarantee at most two of three properties: Consistency, Availability, and Partition Tolerance. In practice, network partitions happen — so every distributed system must choose between consistency and availability when a partition occurs.

This is the trade-off that matters operationally: when two nodes in your database cluster cannot communicate, what does the system do?

The Problem

Engineers designing distributed systems often say “we chose a CP database” or “we chose an AP database” without being able to answer a concrete operational question: if two of your five Cassandra nodes lose connectivity to the other three, what happens to reads and writes? What does a “consistent” or “available” choice mean in practice during a partial outage?

CAP is only useful if you can translate it into a failure scenario answer.

CP vs AP in Operational Terms

CP (Consistency + Partition Tolerance): During a partition, the system refuses to serve reads or writes that could return stale data or lose acknowledged writes. This means the system becomes unavailable for some or all operations during the partition. Correctness is preserved; availability is sacrificed.

Examples of CP systems: PostgreSQL with synchronous replication (primary refuses writes if the synchronous standby is unreachable), etcd, ZooKeeper, HBase (when configured conservatively).

AP (Availability + Partition Tolerance): During a partition, the system continues to serve reads and writes from whichever nodes are reachable, accepting that different nodes may diverge and return different data. After the partition heals, the system reconciles the divergent state (using last-write-wins, vector clocks, or application-level conflict resolution). Availability is preserved; consistency is sacrificed temporarily.

Examples of AP systems: Cassandra (by default with eventual consistency), DynamoDB (with eventual consistency reads), CouchDB.

Partition occurs between Node A and Node B

CP system:
  - Node A: "I cannot confirm my data is consistent — refusing reads/writes"
  - Clients: receive errors or timeouts

AP system:
  - Node A: "I'll serve what I have"
  - Node B: "I'll serve what I have"
  - Clients: may get different answers from A and B
  - After partition heals: A and B reconcile (last-write-wins or merge)

In Practice

PostgreSQL’s documented behavior during replication failure depends on synchronous_commit setting. With synchronous_commit = on and a synchronous standby, the primary will not acknowledge writes that have not been confirmed by the standby — this is CP behavior. If the standby disconnects, the primary waits for wal_sender_timeout before giving up and continuing without the standby. During that wait, writes are blocked — the system chooses consistency over availability.

Cassandra’s documented consistency levels operationalize the tradeoff explicitly: QUORUM reads and writes require a majority of replicas to respond — this provides a stronger consistency guarantee but will fail if too many nodes are unreachable. ONE reads and writes require only one replica to respond — maximizing availability at the cost of potentially reading stale data.

The practical insight from Brewer’s later work (CAP Twelve Years Later, 2012): most distributed systems are not purely CP or AP — they allow the tradeoff to be tuned per-operation. This is the more useful mental model.

Where It Breaks

Scenario	CP choice	AP choice
Payment processing	Correct — cannot accept double-spend or lost payment	Dangerous — inconsistent state during partition
User session data	Usually unnecessary — stale session is acceptable	Correct — availability matters more than freshness
Inventory count	Depends — over-selling may be acceptable; negative inventory is not	Risky without application-level conflict resolution
Distributed counter	CP is expensive (coordination cost); AP requires conflict resolution	Use CRDT or centralized counter

What to Do Next

Problem: Distributed databases make different choices during network partitions, and engineers must understand those choices before selecting a database for a use case — not after a partition happens in production.
Solution: For each data entity in your system, ask: during a 60-second network partition, is it acceptable for two nodes to return different answers? If no, you need CP semantics for that entity.
Proof: Run a partition test in staging — use tc netem to drop packets between nodes — and observe whether your database returns errors (CP) or potentially stale data (AP).
Action: Identify the one table in your system where a consistency failure would cause the most business harm, and verify that your database’s consistency configuration matches the requirement you assumed it had.

Situation

The Problem

CP vs AP in Operational Terms

In Practice

Where It Breaks

What to Do Next

Rajiv

Related Posts

Caches, Queues, and Databases: When to Use Each

B-tree vs LSM Tree: The Storage Engine Tradeoff

Consistency Models Your Application Actually Needs