Commerce platforms do not fail because they lack databases; they fail because every datastore is asked to be the source of truth during the same incident.

Situation

A commerce platform starts with one obvious requirement: take orders correctly. Then the surface area expands. Catalog pages need fast filters. Carts need low-latency reads. Checkout needs transactional guarantees. Inventory changes need fan-out to every system that displays or reserves stock. Finance needs warehouse-grade history. Fraud, personalization, search, fulfillment, support, and analytics all want the same facts at different latencies.

The usual early architecture is simple: one OLTP database, one cache, one search index, and some jobs. That works while humans can reason about the order of writes. It breaks when the business adds marketplaces, promotions, cross-region traffic, flash sales, and asynchronous fulfillment.

At that point, “the database” is no longer a single technology. It is a data plane: OLTP for truth, search for discovery, cache for serving pressure, queue for ordered propagation, and warehouse for analytical memory.

The Problem

The common failure is treating these systems as interchangeable replicas.

Search is allowed to lag, so it cannot decide whether an item is sellable. Cache is allowed to evict, so it cannot be the only copy of a cart. A queue can preserve order within a partition, but it cannot magically make downstream consumers correct. A warehouse can explain what happened, but it cannot sit in checkout’s critical path. The OLTP database can enforce invariants, but it cannot absorb every read, query shape, and analytical scan without becoming the platform bottleneck.

The question is not “which datastore should we use?” The question is: which system owns each failure mode, and how does every other system recover from being wrong?

The Data Plane Contract

The commerce data plane should be designed around ownership, latency, and repair.

flowchart TD
  A[clients — storefront and admin] --> B[API layer — command validation]
  B --> C[OLTP store — orders carts inventory payments]
  B --> D[cache — hot reads and session state]
  C --> E[outbox table — committed domain events]
  E --> F[queue — ordered propagation]
  F --> G[search index — catalog discovery]
  F --> H[warehouse lake — analytical history]
  F --> I[read models — account and fulfillment views]
  C --> J[replicas — operational reads]
  K[repair workers — reconciliation and replay] --> G
  K --> D
  K --> I
  H --> L[metrics and finance — reporting]

The OLTP store owns irreversible business facts: order placement, payment state, inventory reservation, refund state, merchant configuration, and customer entitlements. It should be normalized enough to enforce constraints and partitioned along a business boundary that keeps most transactions local.

Search owns discovery, not truth. It can answer “what products match this query?” It should not answer “can this exact unit be sold right now?” The product page can show indexed attributes, but checkout must re-read sellability from the transactional path.
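To make the checkout re-read concrete, here is a minimal sketch using SQLite from the Python standard library as a stand-in for the OLTP store. The `inventory` table, the `reserve_unit` helper, and the `search_hit` document are hypothetical names chosen for illustration. The point is that the reservation is a conditional update against the transactional store, so a stale search result cannot oversell.

```python
import sqlite3

def reserve_unit(conn: sqlite3.Connection, sku: str, qty: int) -> bool:
    """Reserve stock with a conditional update; True only if enough was available."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE inventory SET available = available - ? "
            "WHERE sku = ? AND available >= ?",
            (qty, sku, qty),
        )
        # rowcount is 0 when the WHERE clause rejected the reservation
        return cur.rowcount == 1

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory ("
    "sku TEXT PRIMARY KEY, "
    "available INTEGER NOT NULL CHECK (available >= 0))"
)
conn.execute("INSERT INTO inventory VALUES ('sku-1', 1)")
conn.commit()

# A possibly stale search document: it claims the item is in stock.
search_hit = {"sku": "sku-1", "in_stock": True}

first = reserve_unit(conn, search_hit["sku"], 1)   # last unit is reserved
second = reserve_unit(conn, search_hit["sku"], 1)  # index still says in stock
print(first, second)
```

The second call is refused by the database even though the index document never changed, which is the behavior the search-as-discovery rule relies on.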

Cache owns latency relief, not correctness. It is allowed to be stale, absent, and rebuilt. That means every cached value needs a source, a TTL or invalidation strategy, and a clear behavior on miss. If the cache is down, the platform should degrade by shedding noncritical reads before it risks order correctness.
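One way to express "a source, a TTL, and a clear behavior on miss" is a read-through helper. The sketch below is an assumption-laden illustration: `TTLCache`, `read_through`, and `load_price` are hypothetical names, the cache is an in-process dictionary rather than Redis or Memcached, and a cache error is simply treated as a miss.

```python
import time
from typing import Callable, Dict, Optional, Tuple

class TTLCache:
    """Tiny in-process cache with per-entry expiry (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired entries behave like misses
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._items[key] = (self._clock() + self._ttl, value)

def read_through(cache: TTLCache, key: str, load_from_source: Callable[[str], str]) -> str:
    try:
        cached = cache.get(key)
    except Exception:
        cached = None  # a cache outage is treated as a miss, not an error
    if cached is not None:
        return cached
    value = load_from_source(key)  # the source of truth answers on miss
    try:
        cache.set(key, value)
    except Exception:
        pass  # failing to populate the cache must not fail the read
    return value

now = [0.0]   # fake clock so the example is deterministic
calls = []    # records every trip to the source

def load_price(key: str) -> str:
    calls.append(key)
    return "19.99"

cache = TTLCache(ttl_seconds=30, clock=lambda: now[0])
a = read_through(cache, "price:sku-1", load_price)  # miss, loads from source
b = read_through(cache, "price:sku-1", load_price)  # hit
now[0] = 31.0
c = read_through(cache, "price:sku-1", load_price)  # expired, reloads
print(len(calls))
```

This sketch does not shed load or coalesce concurrent misses; those protections would sit around `load_from_source` in a real deployment.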

The queue owns propagation. It is the buffer between the write model and every derived model. The outbox pattern is the important boundary: commit the business change and the event record in the same database transaction, then publish from the committed outbox rows. Without that, a platform eventually hits the classic dual-write failure: an order exists without downstream visibility, or downstream systems react to an order that never committed.
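A minimal sketch of the outbox boundary, assuming SQLite as a stand-in for the OLTP store and a plain Python callable as a stand-in for the queue producer. The table layout and the names `place_order`, `publish_pending`, and `send` are hypothetical.

```python
import json
import sqlite3
import uuid
from typing import Callable, List, Tuple

SCHEMA = """
CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL);
CREATE TABLE outbox (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    event_id TEXT UNIQUE NOT NULL,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    published INTEGER NOT NULL DEFAULT 0
);
"""

def place_order(conn: sqlite3.Connection, order_id: str, total_cents: int) -> None:
    """Write the order and its event in one transaction: both commit or neither does."""
    payload = json.dumps({"order_id": order_id, "total_cents": total_cents})
    with conn:
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox (event_id, topic, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "order.placed", payload),
        )

def publish_pending(conn: sqlite3.Connection, send: Callable[[str, str, str], None]) -> int:
    """Publish committed, unpublished rows in commit order; returns rows handled."""
    rows = conn.execute(
        "SELECT seq, event_id, topic, payload FROM outbox "
        "WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_id, topic, payload in rows:
        # If the process dies after send() but before the UPDATE, the row is
        # sent again on the next run, so consumers must dedupe on event_id.
        send(topic, event_id, payload)
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

sent: List[Tuple[str, str, str]] = []
place_order(conn, "o-1", 4200)
first_run = publish_pending(conn, lambda t, e, p: sent.append((t, e, p)))
second_run = publish_pending(conn, lambda t, e, p: sent.append((t, e, p)))
print(first_run, second_run)
```

The publisher gives at-least-once delivery, not exactly-once, which is why the event carries a stable `event_id`.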

The warehouse owns history and reconciliation. It is not just for dashboards. It should be the place where finance, audit, merchandising, and anomaly detection can ask questions across time without punishing the checkout database.

In Practice

Context: Shopify documents a commerce platform split into pods, where each pod contains a subset of shops and includes a MySQL shard plus datastores such as Redis and Memcached. Their engineering writing also describes moving shops between MySQL shards without downtime. Sources: Shopify shard balancing and Shopify Rails patterns.

Action: The documented pattern is tenant-aware partitioning: keep a merchant’s core transactional workload local to one shard boundary, then build operational tooling for movement, isolation, and balancing.

Result: The result is not “sharding solves commerce.” The result is a manageable failure domain: a hot or oversized tenant can be reasoned about as a unit, and platform teams can move load without redefining every table relationship.

Learning: Partition by the business invariant you need to protect. For commerce, merchant, store, region, or marketplace boundary usually matters more than evenly distributing row counts.
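As an illustration of partitioning by tenant rather than by row count, the sketch below routes each shop through an explicit directory. This is not Shopify's implementation; `ShardDirectory`, `shard_for`, and `move_shop` are hypothetical names, and the data copy and cutover steps of a real move are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class ShardDirectory:
    """Hypothetical lookup from shop to shard; every name here is illustrative."""
    default_shard: str
    assignments: Dict[str, str] = field(default_factory=dict)

    def shard_for(self, shop_id: str) -> str:
        # All of a shop's transactional tables live on the shard returned here.
        return self.assignments.get(shop_id, self.default_shard)

    def move_shop(self, shop_id: str, target_shard: str) -> None:
        # Only the routing flip is shown; copying data and cutover are omitted.
        self.assignments[shop_id] = target_shard

directory = ShardDirectory(default_shard="shard-a")
before = directory.shard_for("shop-42")
directory.move_shop("shop-42", "shard-b")   # isolate a hot tenant
after = directory.shard_for("shop-42")
neighbor = directory.shard_for("shop-7")    # other tenants are unaffected
print(before, after, neighbor)
```

A directory lookup costs an extra hop compared with hashing the shop ID, but it lets operators place one oversized tenant deliberately instead of accepting wherever a hash function puts it.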

Context: LinkedIn’s Kafka work describes Kafka as a distributed messaging system for log processing, built for activity streams and operational data. Source: Kafka paper.

Action: The documented pattern is append-first propagation: write immutable records to a partitioned log, then let many consumers build their own views.

Result: The important result for commerce is decoupling. Search indexing, fraud signals, fulfillment views, warehouse ingestion, and notifications do not need to run inside the checkout transaction.

Learning: A queue is not merely background jobs. It is the contract for every derived state. Partition keys, idempotency keys, schema evolution, and replay procedures are part of the data model.
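Two of those contract pieces, partition keys and idempotency keys, can be sketched in a few lines. The hash below is a stand-in and not the partitioner of any particular broker, and `FulfillmentView` is a hypothetical derived model that remembers event IDs in memory; a real consumer would persist them alongside the view.

```python
import hashlib
from typing import Dict, Set

def partition_for(key: str, partitions: int) -> int:
    """Stable key-to-partition mapping so one order's events stay in order."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions

class FulfillmentView:
    """Derived read model that ignores events it has already applied."""

    def __init__(self) -> None:
        self.status_by_order: Dict[str, str] = {}
        self._seen: Set[str] = set()

    def apply(self, event: Dict[str, str]) -> bool:
        if event["event_id"] in self._seen:
            return False  # redelivery: safe to drop
        self._seen.add(event["event_id"])
        self.status_by_order[event["order_id"]] = event["status"]
        return True

events = [
    {"event_id": "e1", "order_id": "o-1", "status": "placed"},
    {"event_id": "e2", "order_id": "o-1", "status": "packed"},
    {"event_id": "e1", "order_id": "o-1", "status": "placed"},  # redelivered
]

view = FulfillmentView()
applied = [view.apply(e) for e in events]
same_partition = partition_for("o-1", 8) == partition_for("o-1", 8)
print(applied, view.status_by_order, same_partition)
```

Without the `_seen` check, the redelivered `e1` would move the order back from "packed" to "placed", which is the kind of silent regression an idempotency key exists to prevent.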

Context: Amazon’s Dynamo paper documents a highly available key-value store motivated by services such as shopping cart, where write availability was a core requirement. Source: Dynamo paper.

Action: The documented pattern is making the availability tradeoff explicit: some user-facing state can accept reconciliation, while other state requires stronger coordination.

Result: For a commerce platform, that distinction separates carts from orders. A cart can merge or be repaired. An order cannot be double-charged, silently dropped, or ambiguously fulfilled.

Learning: Do not apply the same consistency model to every commerce object. Model the cost of being stale, duplicated, missing, or delayed for each object.
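For an object that can tolerate reconciliation, a merge function is often enough. The sketch below assumes a cart is a mapping from SKU to quantity and that two divergent copies are merged by union, keeping the larger quantity; `merge_carts` is a hypothetical helper, not the reconciliation logic from the Dynamo paper.

```python
from typing import Dict

def merge_carts(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, int]:
    """Union of two cart versions; on conflict keep the larger quantity."""
    merged = dict(a)
    for sku, qty in b.items():
        merged[sku] = max(merged.get(sku, 0), qty)
    return merged

phone_cart = {"sku-1": 2, "sku-2": 1}
laptop_cart = {"sku-1": 1, "sku-3": 4}
merged = merge_carts(phone_cart, laptop_cart)
print(merged)
```

The tradeoff is visible in the rule itself: an added item is never lost, but an item removed on one device can reappear after a merge. That is tolerable for a cart and unacceptable for an order, which is why orders stay on the transactional path.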

Where It Breaks

| Component | Failure mode | Symptom | Design response |
| --- | --- | --- | --- |
| OLTP | Hot partition | Checkout slows for one merchant or product drop | Partition by business boundary, add admission control, isolate noisy tenants |
| Search | Stale index | Product appears available after sellout | Treat search as discovery, revalidate at product page and checkout |
| Cache | Stale or missing value | Wrong price, cart mismatch, thundering herd | Version cache keys, use TTLs, protect origins with request coalescing |
| Queue | Consumer lag | Orders placed but fulfillment view is delayed | Track lag by topic and partition, expose derived state freshness |
| Warehouse | Late or duplicated events | Finance reports disagree with operations | Use immutable event IDs, replayable ingestion, reconciliation jobs |
| Outbox | Publisher stuck | OLTP has facts that downstream systems cannot see | Alert on unpublished rows, make publishing idempotent |
| Schema | Event drift | Consumers parse old meanings incorrectly | Version schemas, enforce compatibility, publish deprecation windows |

The architecture breaks when teams hide these failure modes behind generic “eventual consistency” language. Eventual consistency is not a repair plan. It is a warning label. A commerce data plane needs explicit freshness indicators, replay tooling, poison message handling, and runbooks that say which user promises still hold when each component is impaired.
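An explicit freshness indicator can start as a comparison of offsets. The sketch below assumes each partition exposes a "next offset to write" at the head of the log and the consumer records a "next offset to apply"; `partition_lag` and `freshness_label` are hypothetical helpers and the threshold is arbitrary.

```python
from typing import Dict

def partition_lag(head: Dict[int, int], applied: Dict[int, int]) -> Dict[int, int]:
    """Events not yet applied, per partition. Both maps hold 'next offset' values."""
    return {p: max(0, h - applied.get(p, 0)) for p, h in head.items()}

def freshness_label(lag: Dict[int, int], max_lag: int) -> str:
    """Turn raw lag into something a UI or runbook can show."""
    worst = max(lag.values(), default=0)
    if worst <= max_lag:
        return "fresh"
    return f"delayed ({worst} events behind)"

head = {0: 120, 1: 98}      # log end offsets for two partitions
applied = {0: 120, 1: 40}   # what the fulfillment view has consumed
lag = partition_lag(head, applied)
label = freshness_label(lag, max_lag=10)
print(lag, label)
```

Reporting the worst partition rather than the average matters here, because a single stuck partition can hide behind healthy ones and still delay every order routed to it.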

What to Do Next

  • Problem: List the commerce facts that must never be ambiguous: order state, payment state, inventory reservation, refund state, merchant entitlement, tax basis.
  • Solution: Assign each fact one writer in OLTP, then derive every other view through an outbox and queue contract.
  • Proof: For each derived system, run a replay test, a lag test, a stale read test, and a source outage test before calling the design production-ready.
  • Action: Build the first version around boring boundaries: transactional core, cache-as-optimization, search-as-discovery, queue-as-propagation, warehouse-as-memory. Then document exactly how each system is allowed to be wrong.
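The replay test from the list above can be sketched as: rebuild the derived view from the event log, then diff it against what the live view currently serves. The event shape and the names `rebuild_order_status` and `diff_views` are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Event = Dict[str, object]

def rebuild_order_status(events: List[Event]) -> Dict[str, str]:
    """Replay events in log order to reconstruct the expected view."""
    state: Dict[str, str] = {}
    for event in sorted(events, key=lambda e: e["seq"]):
        state[str(event["order_id"])] = str(event["status"])
    return state

def diff_views(
    expected: Dict[str, str], actual: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each mismatched key to (expected, actual); an empty dict means in sync."""
    keys = expected.keys() | actual.keys()
    return {
        k: (expected.get(k), actual.get(k))
        for k in keys
        if expected.get(k) != actual.get(k)
    }

log = [
    {"seq": 1, "order_id": "o-1", "status": "placed"},
    {"seq": 2, "order_id": "o-2", "status": "placed"},
    {"seq": 3, "order_id": "o-1", "status": "shipped"},
]
live_view = {"o-1": "placed", "o-2": "placed"}  # missed the seq 3 update

expected = rebuild_order_status(log)
drift = diff_views(expected, live_view)
print(drift)
```

A nonempty diff is the signal for the repair workers: either replay the missing range into the view or rebuild it from the log.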