CAP Theorem in Operational Terms
CAP theorem is not an academic curiosity. It tells you what your distributed database will do when the network between its nodes fails — and that is exactly when the wrong answer causes data loss or an outage. Most engineers have heard of CAP and most have the wrong mental model for applying it.
Situation
CAP theorem, stated by Eric Brewer in 2000 and proved by Gilbert and Lynch in 2002, says that a distributed system can guarantee at most two of three properties: Consistency, Availability, and Partition Tolerance. In practice, network partitions happen — so every distributed system must choose between consistency and availability when a partition occurs.
This is the trade-off that matters operationally: when two nodes in your database cluster cannot communicate, what does the system do?
The Problem
Engineers designing distributed systems often say “we chose a CP database” or “we chose an AP database” without being able to answer a concrete operational question: if two of your five Cassandra nodes lose connectivity to the other three, what happens to reads and writes? What does a “consistent” or “available” choice mean in practice during a partial outage?
CAP is only useful if you can translate it into a failure scenario answer.
CP vs AP in Operational Terms
CP (Consistency + Partition Tolerance): During a partition, the system refuses to serve reads or writes that could return stale data or lose acknowledged writes. This means the system becomes unavailable for some or all operations during the partition. Correctness is preserved; availability is sacrificed.
Examples of CP systems: PostgreSQL with synchronous replication (primary refuses writes if the synchronous standby is unreachable), etcd, ZooKeeper, HBase (when configured conservatively).
AP (Availability + Partition Tolerance): During a partition, the system continues to serve reads and writes from whichever nodes are reachable, accepting that different nodes may diverge and return different data. After the partition heals, the system reconciles the divergent state (using last-write-wins, vector clocks, or application-level conflict resolution). Availability is preserved; consistency is sacrificed temporarily.
Examples of AP systems: Cassandra (by default with eventual consistency), DynamoDB (with eventual consistency reads), CouchDB.
Partition occurs between Node A and Node B
CP system:
- Node A: "I cannot confirm my data is consistent — refusing reads/writes"
- Clients: receive errors or timeouts
AP system:
- Node A: "I'll serve what I have"
- Node B: "I'll serve what I have"
- Clients: may get different answers from A and B
- After partition heals: A and B reconcile (last-write-wins or merge)
In Practice
PostgreSQL’s documented behavior during replication failure depends on synchronous_commit setting. With synchronous_commit = on and a synchronous standby, the primary will not acknowledge writes that have not been confirmed by the standby — this is CP behavior. If the standby disconnects, the primary waits for wal_sender_timeout before giving up and continuing without the standby. During that wait, writes are blocked — the system chooses consistency over availability.
Cassandra’s documented consistency levels operationalize the tradeoff explicitly: QUORUM reads and writes require a majority of replicas to respond — this provides a stronger consistency guarantee but will fail if too many nodes are unreachable. ONE reads and writes require only one replica to respond — maximizing availability at the cost of potentially reading stale data.
The practical insight from Brewer’s later work (CAP Twelve Years Later, 2012): most distributed systems are not purely CP or AP — they allow the tradeoff to be tuned per-operation. This is the more useful mental model.
Where It Breaks
| Scenario | CP choice | AP choice |
|---|---|---|
| Payment processing | Correct — cannot accept double-spend or lost payment | Dangerous — inconsistent state during partition |
| User session data | Usually unnecessary — stale session is acceptable | Correct — availability matters more than freshness |
| Inventory count | Depends — over-selling may be acceptable; negative inventory is not | Risky without application-level conflict resolution |
| Distributed counter | CP is expensive (coordination cost); AP requires conflict resolution | Use CRDT or centralized counter |
What to Do Next
- Problem: Distributed databases make different choices during network partitions, and engineers must understand those choices before selecting a database for a use case — not after a partition happens in production.
- Solution: For each data entity in your system, ask: during a 60-second network partition, is it acceptable for two nodes to return different answers? If no, you need CP semantics for that entity.
- Proof: Run a partition test in staging — use
tc netemto drop packets between nodes — and observe whether your database returns errors (CP) or potentially stale data (AP). - Action: Identify the one table in your system where a consistency failure would cause the most business harm, and verify that your database’s consistency configuration matches the requirement you assumed it had.