Overselling inventory is not a traffic problem; it is a truth problem disguised as a scaling problem.

Situation

E-commerce inventory systems used to be dominated by synchronous request flows: product page reads stock, cart reserves stock, checkout decrements stock, warehouse systems reconcile later. That model works while the business is small enough for one database, one warehouse, and one operational clock.

The failure arrives when inventory becomes multi-channel. A single SKU can be sold through the website, mobile app, marketplace integrations, customer support tooling, backorder workflows, promotions, and warehouse adjustments. Each channel wants low latency. Each channel also wants the right to say, with confidence, that an item can be sold.

On Google Cloud, teams often reach for Spanner, Pub/Sub, Dataflow, and BigQuery. Spanner becomes the transactional inventory system. Pub/Sub carries committed inventory events. Dataflow derives stream projections. BigQuery serves analytics, reconciliation, and planning.

That stack can work well, but only if the ownership boundary is explicit. Spanner should not be “one more database in the pipeline.” It should be the system that decides whether inventory exists. Everything else should derive, distribute, or analyze that decision.

The Problem

The common failure mode is treating inventory as a cacheable attribute instead of a ledgered constraint.

A product detail page can tolerate stale stock counts. A merchandising dashboard can tolerate delayed aggregates. A warehouse forecast can tolerate batch correction. Checkout cannot tolerate ambiguity. If two customers attempt to buy the last unit of a SKU, only one transaction can win.

Event-driven systems make this more subtle. Pub/Sub can move updates quickly, but messaging speed does not create transactional correctness. Dataflow can compute reliable stream results, but stream correctness is not the same as reservation correctness. BigQuery can expose powerful analytical views, but analytical truth is not operational authority.

The architecture breaks when downstream projections are allowed to answer upstream questions. A search index says five units remain, a cached product page says three, BigQuery says seven, and the order service tries to reconcile the conflict after payment authorization. At that point the business is no longer choosing between consistency models. It is choosing between customer apologies, manual fulfillment work, and hidden financial leakage.

The question is: how do you keep checkout strongly correct while still letting the rest of the commerce platform move asynchronously?

Core Concept

The answer is to make inventory a ledger in Spanner and make every other system downstream of committed ledger mutations.

The operational model has three tables: current inventory, reservations, and inventory movements. The checkout service writes through a Spanner transaction that verifies available quantity, creates a reservation, appends a movement record, and updates the current balance. If the transaction cannot prove availability, it fails before payment capture or order confirmation.
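The transaction described above can be sketched in a few lines. This is a minimal illustration, not a Spanner implementation: it uses Python's built-in sqlite3 module as a local stand-in for Spanner so the example runs anywhere, and the table names, column names, and the `reserve` function are assumptions made for illustration. What it shows is the shape of the write: verify, reserve, append a movement, and update the balance together, or change nothing.

```python
import sqlite3
import uuid

# Hypothetical schema for the three tables; sqlite3 stands in for Spanner here.
SCHEMA = """
CREATE TABLE current_inventory (
    sku TEXT, location TEXT, available INTEGER NOT NULL,
    PRIMARY KEY (sku, location));
CREATE TABLE reservations (
    reservation_id TEXT PRIMARY KEY, sku TEXT, location TEXT,
    quantity INTEGER, state TEXT, expires_at REAL);
CREATE TABLE inventory_movements (
    movement_id TEXT PRIMARY KEY, sku TEXT, location TEXT,
    delta INTEGER, reason TEXT, reservation_id TEXT);
"""


class InsufficientStock(Exception):
    """Raised when the transaction cannot prove availability."""


def reserve(conn, sku, location, quantity, expires_at):
    """Verify, reserve, record, and decrement in one transaction."""
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    reservation_id = str(uuid.uuid4())
    with conn:  # commits on success, rolls back if an exception escapes
        row = conn.execute(
            "SELECT available FROM current_inventory "
            "WHERE sku = ? AND location = ?",
            (sku, location),
        ).fetchone()
        if row is None or row[0] < quantity:
            # Fail before any write, so nothing needs to be repaired later.
            raise InsufficientStock(f"{sku} at {location}")
        conn.execute(
            "INSERT INTO reservations VALUES (?, ?, ?, ?, 'pending', ?)",
            (reservation_id, sku, location, quantity, expires_at),
        )
        conn.execute(
            "INSERT INTO inventory_movements VALUES (?, ?, ?, ?, 'reserve', ?)",
            (str(uuid.uuid4()), sku, location, -quantity, reservation_id),
        )
        conn.execute(
            "UPDATE current_inventory SET available = available - ? "
            "WHERE sku = ? AND location = ?",
            (quantity, sku, location),
        )
    return reservation_id
```

With one unit in stock, the first call to `reserve` returns a reservation ID and the second raises `InsufficientStock` without touching any table. In a real deployment the same steps would run inside a Spanner read-write transaction, which is what makes the check and the writes safe under concurrent checkouts.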

Pub/Sub is not the authority. It is the distribution layer. After Spanner commits, an outbox table or a Spanner change stream exposes the committed inventory mutations, and a relay (for example, a Dataflow job reading the change stream) publishes them to Pub/Sub. Dataflow consumes those events to maintain read-optimized projections: product availability feeds, search index updates, alerting streams, warehouse deltas, and BigQuery fact tables.

BigQuery is not asked whether an item can be sold. It is asked what happened, where drift is emerging, and which SKUs require operational attention.

flowchart TD
  Checkout[Checkout service — reserve inventory] --> Spanner[Spanner inventory ledger — transactional authority]
  Spanner --> Current[Current inventory — committed balance]
  Spanner --> Reservations[Reservations — expiring holds]
  Spanner --> Movements[Inventory movements — immutable facts]
  Spanner --> ChangeStream[Spanner change stream — committed mutations]
  ChangeStream --> PubSub[PubSub topic — inventory events]
  PubSub --> Dataflow[Dataflow pipeline — derived projections]
  Dataflow --> Search[Search index — availability hints]
  Dataflow --> Cache[Product cache — read path acceleration]
  Dataflow --> BigQuery[BigQuery warehouse — analytics and reconciliation]
  BigQuery --> Ops[Operations dashboards — drift and planning]

This design separates decisions from distribution. The decision path is short, transactional, and owned by Spanner. The distribution path is elastic, asynchronous, and owned by event processing.

A reservation should have an expiration timestamp and a state machine: pending, confirmed, released, expired. The expiration path must be idempotent because retries are normal in distributed systems. A release event for an already released reservation should not add stock twice. A confirmation event for an expired reservation should fail unless the checkout flow creates a new valid reservation.
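The state machine above might look like the following sketch. The class and function names are hypothetical, and one rule is an assumption not stated in the text: releasing an already confirmed reservation is treated as an invalid transition. Each function returns the number of units to add back to stock, so a repeated release or expiry returns zero instead of restoring stock twice.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    RELEASED = "released"
    EXPIRED = "expired"


class InvalidTransition(Exception):
    """Raised when a state change is not allowed."""


@dataclass
class Reservation:
    reservation_id: str
    quantity: int
    expires_at: float  # epoch seconds
    state: State = State.PENDING


def confirm(res: Reservation, now: float) -> None:
    """Confirm a pending, unexpired hold. Repeated confirms are no-ops."""
    if res.state is State.CONFIRMED:
        return  # retry of an already applied confirmation
    if res.state is not State.PENDING or now >= res.expires_at:
        raise InvalidTransition(f"cannot confirm {res.reservation_id}")
    res.state = State.CONFIRMED


def release(res: Reservation) -> int:
    """Release a hold. Returns units to restore: 0 on a repeat."""
    if res.state is State.PENDING:
        res.state = State.RELEASED
        return res.quantity
    if res.state in (State.RELEASED, State.EXPIRED):
        return 0  # stock was already returned; do not add it twice
    # Assumption: confirmed reservations are not released through this path.
    raise InvalidTransition(f"cannot release {res.reservation_id}")


def expire(res: Reservation, now: float) -> int:
    """Expire an overdue pending hold. Returns units to restore."""
    if res.state is State.PENDING and now >= res.expires_at:
        res.state = State.EXPIRED
        return res.quantity
    return 0  # not due yet, or already in a terminal state
```

In the ledger, the returned quantity would be applied to the balance in the same transaction that changes the reservation state, so the guard and the stock adjustment cannot drift apart.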

SKU partitioning also matters. A hot SKU during a flash sale can turn one logical product into a write hotspot. The usual mitigation is to model inventory at the right granularity: SKU, location, fulfillment pool, and sometimes allocation bucket. The goal is not to avoid contention entirely. The goal is to put contention exactly where the business requires serial decisions.
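One possible reading of "allocation bucket" is to split a hot SKU's stock across several rows and route each request to one of them. The sketch below assumes that interpretation; the key fields, function names, and the hash-then-probe routing are illustrative choices, not something the architecture prescribes.

```python
import hashlib
from typing import NamedTuple


class InventoryKey(NamedTuple):
    """Hypothetical row key at the granularity described above."""
    sku: str
    location: str
    pool: str
    bucket: int


def split_into_buckets(total: int, buckets: int) -> list[int]:
    """Spread a quantity across buckets as evenly as possible."""
    base, extra = divmod(total, buckets)
    return [base + (1 if i < extra else 0) for i in range(buckets)]


def bucket_order(request_id: str, buckets: int) -> list[int]:
    """Start at a stable hashed bucket, then probe the rest in order."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % buckets
    return [(start + i) % buckets for i in range(buckets)]
```

Each bucket row is still reserved through its own transaction, so correctness does not change. The trade-off is that selling the last few units may require probing several buckets, which is why the bucket count should follow the contention the business actually sees.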

In Practice

Context: Google’s Spanner documentation describes external consistency as its strongest transaction guarantee, and the original Spanner paper explains how TrueTime supports globally ordered transactions. The documented pattern is that Spanner is appropriate when the system needs SQL transactions with strong consistency across distributed data, not merely highly available storage. See Google’s Spanner documentation on TrueTime and external consistency and the Spanner OSDI paper, “Spanner: Google’s Globally-Distributed Database”.

Action: Put the inventory invariant inside Spanner transactions. The invariant is simple: available quantity cannot go below zero for the sellable unit being reserved. Write the reservation and movement record in the same transaction that changes the balance. Do not rely on a Pub/Sub consumer to repair oversell after checkout.

Result: The system narrows its correctness boundary. If Spanner commits, the reservation exists and the ledger records why stock changed. If Spanner rejects the write, the order path has no ambiguous intermediate state to explain later.

Learning: Strong consistency should be spent where the business invariant lives. Most of the platform can be eventually consistent, but the moment that decides whether money can be accepted for scarce inventory should not be.

Context: Pub/Sub documentation states that default delivery is at least once and that ordering requires explicit ordering keys. It also documents exactly-once delivery options with scope and subscriber requirements. See Google Cloud Pub/Sub docs on subscription behavior, message ordering, and exactly-once delivery.

Action: Treat Pub/Sub messages as repeatable notifications, not single-use commands. Give every inventory event a stable event ID, reservation ID, SKU, location, sequence, and committed timestamp. Consumers should deduplicate by event ID and update projections idempotently.

Result: Redelivery becomes a normal case. Replaying the same event may refresh a projection, but it does not double-count inventory, duplicate a warehouse task, or corrupt an analytical aggregate.

Learning: Messaging guarantees do not remove the need for idempotent application semantics. The event contract must make duplicate handling boring.
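The event contract and idempotent consumer described in this section could be sketched as follows. The field names follow the list above, while `available_after` (the balance after the mutation) and the `AvailabilityProjection` class are assumptions added for illustration. The in-memory set of seen event IDs stands in for whatever durable store a real consumer would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InventoryEvent:
    event_id: str
    reservation_id: str
    sku: str
    location: str
    sequence: int          # per (sku, location) ordering from the ledger
    committed_at: str      # commit timestamp, ISO 8601
    available_after: int   # assumption: event carries the resulting balance


class AvailabilityProjection:
    """Read-side view that tolerates redelivery and out-of-order arrival."""

    def __init__(self) -> None:
        self.seen_event_ids: set[str] = set()
        self.latest: dict[tuple[str, str], tuple[int, int]] = {}

    def apply(self, event: InventoryEvent) -> bool:
        """Apply an event once. Returns False for a duplicate delivery."""
        if event.event_id in self.seen_event_ids:
            return False
        self.seen_event_ids.add(event.event_id)
        key = (event.sku, event.location)
        current = self.latest.get(key)
        # Keep only the newest sequence so a late event cannot roll back state.
        if current is None or event.sequence > current[0]:
            self.latest[key] = (event.sequence, event.available_after)
        return True
```

Because the projection stores the balance from the highest sequence seen, replaying an old event or receiving one late leaves the view unchanged.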

Context: Dataflow documentation describes exactly-once processing behavior and the constraints around timely records and streaming sources. See Google Cloud Dataflow’s documentation on exactly-once processing.

Action: Use Dataflow for projections whose correctness is defined by event processing: availability feeds, low-stock alerts, BigQuery loads, and reconciliation streams. Keep checkout outside this path.

Result: Stream processing can scale independently from the checkout transaction rate. If a Dataflow job lags, product pages may show conservative availability or temporarily hide stock, but confirmed orders remain correct.

Learning: Stream processors are excellent at deriving state from facts. They should not be the first place where scarce inventory is promised.
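The "conservative availability" behavior mentioned in the result above can be expressed as a small read-path rule. This is a sketch under assumed names and thresholds: `max_age_s`, `low_stock_threshold`, and the returned labels are hypothetical and would be tuned per storefront.

```python
def display_availability(
    projected_qty: int,
    projection_age_s: float,
    max_age_s: float = 60.0,        # hypothetical staleness budget
    low_stock_threshold: int = 5,   # hypothetical merchandising threshold
) -> str:
    """Choose what the product page shows from a possibly stale projection."""
    if projection_age_s > max_age_s:
        # The projection is lagging: promise nothing, defer to checkout.
        return "check_at_checkout"
    if projected_qty <= 0:
        return "out_of_stock"
    if projected_qty <= low_stock_threshold:
        return "low_stock"
    return "in_stock"
```

Whatever the page shows, the reservation transaction remains the only place where the sale is actually decided.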

Context: BigQuery descends from Google’s Dremel architecture for interactive analysis of large read-only datasets, and Google’s Dremel papers describe the analytical model behind BigQuery’s scale. See “Dremel: Interactive Analysis of Web-Scale Datasets” and “Dremel: A Decade of Interactive SQL Analysis at Web Scale”.

Action: Load inventory movements into BigQuery as facts, not mutable truth. Build reconciliation queries that compare Spanner balances, movement sums, warehouse adjustments, and order states.

Result: BigQuery becomes the place to find drift, not the place to authorize sales. Analysts can ask why inventory moved without adding latency or coupling to checkout.

Learning: Analytical systems should explain operational truth after the fact. They should not own the write path that creates it.
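The reconciliation idea in this section reduces to comparing a balance snapshot with the balance implied by the movement facts. In BigQuery this would normally be a SQL query over the loaded tables; the Python below only illustrates the comparison logic, with hypothetical function and parameter names and plain dictionaries standing in for query results.

```python
from collections import defaultdict

Key = tuple[str, str]  # (sku, location)


def find_drift(
    opening: dict[Key, int],
    movements: list[tuple[Key, int]],
    snapshot: dict[Key, int],
) -> dict[Key, int]:
    """Return snapshot minus expected balance for every key that disagrees."""
    expected: defaultdict[Key, int] = defaultdict(int, opening)
    for key, delta in movements:
        expected[key] += delta  # movements are signed: negative for reserves
    drift = {}
    for key in set(expected) | set(snapshot):
        diff = snapshot.get(key, 0) - expected.get(key, 0)
        if diff != 0:
            drift[key] = diff
    return drift
```

A non-empty result is a prompt for investigation, such as a missing movement or a late event, not an instruction to overwrite the balance.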

Where It Breaks

| Failure mode | Why it happens | Mitigation |
| --- | --- | --- |
| Hot SKU contention | Many buyers reserve the same scarce item at once | Partition by fulfillment pool, use explicit reservation limits, and accept serialization where correctness requires it |
| Duplicate events | Pub/Sub redelivers or consumers retry after partial work | Use event IDs, idempotent writes, and projection checkpoints |
| Stale product availability | Cache and search projections lag committed inventory | Show conservative states, expire cache aggressively, and re-check availability at checkout |
| Reservation leaks | Holds are created but never confirmed or released | Use expiration timestamps, scheduled cleanup, and state transition guards |
| Analytics disagreement | BigQuery loads lag or late events arrive | Model event time and processing time separately, then reconcile with Spanner snapshots |
| Warehouse drift | Physical counts diverge from system counts | Append adjustment movements rather than rewriting balances silently |

What to Do Next

  • Problem: Checkout correctness fails when inventory is treated as a distributed cache value.
  • Solution: Put the sellable inventory invariant inside Spanner transactions and publish committed changes downstream.
  • Proof: Spanner provides the transactional consistency boundary, Pub/Sub distributes committed facts, Dataflow builds repeatable projections, and BigQuery explains history.
  • Action: Start by defining the inventory ledger schema, reservation state machine, event ID contract, and reconciliation queries before optimizing the read path.