Order State Machines: The Database Model Behind Checkout Reliability

Checkout does not fail because a button was clicked twice; it fails because the database allowed the same business fact to be represented twice.

Situation

Modern checkout paths are distributed long before the architecture diagram admits it. The browser retries after a timeout. The API gateway retries after a connection reset. The payment provider responds slowly, then eventually succeeds. Inventory reservation, tax calculation, fraud review, fulfillment, email, and analytics all want to react to the same order.

The mistake is treating orders.status as a display field instead of the control plane for money movement. A checkout system needs a database-backed state machine: a constrained model of valid transitions, idempotent commands, auditable attempts, and recoverable side effects.

The core design is not exotic. It is usually a relational table, a few uniqueness constraints, transaction boundaries, and an outbox. The hard part is refusing to let application code improvise around those constraints.

The Problem

The naive model starts clean:

orders(id, user_id, status, total_amount, created_at)

Then production arrives.

A shopper submits checkout and sees a network timeout. The browser retries. The first request is still charging the card while the second request creates another order. A worker polls pending orders and races with the API thread. A webhook says payment succeeded after the order has already been canceled. Inventory is reserved for an order that never reaches fulfillment. Customer support sees three rows that each look plausible.

The operational failure is not merely duplicate orders. It is ambiguous authority. Which row owns the payment? Which transition is legal? Which retry is safe? Which side effect has already happened? Which subsystem is allowed to move the order forward?

When the database only stores the latest status, every caller becomes a partial state machine with a different memory of the world.

The question is: how do you model checkout so retries, workers, webhooks, and human recovery all converge on one order history instead of multiplying failure modes?

Answer: Make The Database Own The State Machine

A reliable checkout model separates identity, state, attempts, and side effects.

flowchart TD
  A[checkout request — idempotency key] -->|unique insert| B[order row — pending checkout]
  B -->|create attempt| C[payment attempt row — authorization pending]
  C -->|conditional transition| D[order row — payment authorized]
  D -->|reserve stock| E[inventory reservation — confirmed]
  E -->|append message| F[outbox event — order placed]
  F -->|retry delivery| G[worker delivery — acknowledged]

The orders table is the aggregate root. It stores the current state and a monotonic version.

orders(
  id,
  customer_id,
  checkout_id,
  state,
  state_version,
  total_amount,
  created_at,
  updated_at,
  UNIQUE(customer_id, checkout_id)
)

The checkout_id is supplied by the caller or generated before submission. It is not a tracing field. It is the idempotency boundary for creating the order. If the same customer retries the same checkout, the database must return the same order, not create a sibling.
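
A minimal sketch of that idempotency boundary, assuming an in-memory SQLite database as a stand-in for the production store. The column types, the pending_checkout state name, and the create_order helper are illustrative assumptions, not part of the model above. The point is only that the unique constraint, not the application, decides whether a retry creates a row.

```python
import sqlite3

# In-memory SQLite stands in for the production database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id            INTEGER PRIMARY KEY,
        customer_id   TEXT NOT NULL,
        checkout_id   TEXT NOT NULL,
        state         TEXT NOT NULL,
        state_version INTEGER NOT NULL DEFAULT 0,
        total_amount  INTEGER NOT NULL,  -- minor units, e.g. cents
        UNIQUE (customer_id, checkout_id)
    )
""")

def create_order(conn, customer_id, checkout_id, total_amount):
    """Insert the order once; a retry with the same key gets the same row back."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """
            INSERT INTO orders (customer_id, checkout_id, state, total_amount)
            VALUES (?, ?, 'pending_checkout', ?)
            ON CONFLICT (customer_id, checkout_id) DO NOTHING
            """,
            (customer_id, checkout_id, total_amount),
        )
        row = conn.execute(
            "SELECT id FROM orders WHERE customer_id = ? AND checkout_id = ?",
            (customer_id, checkout_id),
        ).fetchone()
    return row[0]

first = create_order(conn, "cust-1", "chk-abc", 4999)
retry = create_order(conn, "cust-1", "chk-abc", 4999)  # e.g. a browser retry
other = create_order(conn, "cust-1", "chk-def", 1500)  # a different checkout
```

Here first and retry refer to the same order id, while other is a new order because its checkout_id differs.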

Valid transitions should be represented explicitly:

order_state_transitions(
  from_state,
  to_state,
  command,
  PRIMARY KEY(from_state, to_state, command)
)

Application code can still contain transition logic, but the database model should make illegal transitions hard to persist. The important rule is that every command updates from an expected state:

UPDATE orders
SET state = 'payment_authorized',
    state_version = state_version + 1,
    updated_at = now()
WHERE id = $1
  AND state = 'payment_pending'
  AND state_version = $2;

If zero rows are updated, the command did not own the transition. It must reload the order and decide whether the desired result already happened, became impossible, or should be retried.
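
One way to combine the transition table with the conditional update is sketched below, again assuming SQLite as a stand-in. The transition helper and the authorize_payment and ship command names are hypothetical. The update succeeds only when the row still has the expected state and version and the transition is listed as legal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id            INTEGER PRIMARY KEY,
        state         TEXT NOT NULL,
        state_version INTEGER NOT NULL
    );
    CREATE TABLE order_state_transitions (
        from_state TEXT NOT NULL,
        to_state   TEXT NOT NULL,
        command    TEXT NOT NULL,
        PRIMARY KEY (from_state, to_state, command)
    );
    INSERT INTO order_state_transitions VALUES
        ('payment_pending', 'payment_authorized', 'authorize_payment');
    INSERT INTO orders (id, state, state_version)
        VALUES (1, 'payment_pending', 0);
""")

def transition(conn, order_id, command, from_state, to_state, expected_version):
    """Apply the transition only if it is listed as legal and the row still
    has the expected state and version. Returns True if this caller won."""
    with conn:
        cur = conn.execute(
            """
            UPDATE orders
            SET state = ?, state_version = state_version + 1
            WHERE id = ? AND state = ? AND state_version = ?
              AND EXISTS (
                  SELECT 1 FROM order_state_transitions
                  WHERE from_state = ? AND to_state = ? AND command = ?
              )
            """,
            (to_state, order_id, from_state, expected_version,
             from_state, to_state, command),
        )
    return cur.rowcount == 1

won = transition(conn, 1, "authorize_payment",
                 "payment_pending", "payment_authorized", 0)
# A second caller holding the old state and version loses the race.
stale = transition(conn, 1, "authorize_payment",
                   "payment_pending", "payment_authorized", 0)
# A transition that is not in the table is never persisted.
illegal = transition(conn, 1, "ship", "payment_authorized", "shipped", 1)
```

A False return is the "zero rows" case: the caller reloads the order and decides what to do instead of overwriting the state.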

Payment attempts should not be collapsed into the order row. They are separate facts:

payment_attempts(
  id,
  order_id,
  provider,
  provider_request_id,
  provider_payment_id,
  state,
  amount,
  created_at,
  updated_at,
  UNIQUE(provider, provider_request_id)
)

This gives the system a place to record uncertainty. authorization_pending, authorized, declined, timed_out, and reversed are attempt states and not necessarily order states. The order should advance only when the attempt produces a business fact the order can consume.
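
The separation can be sketched as follows, assuming SQLite as a stand-in and a simplified rule that only an authorized outcome promotes the order. The helper names and the example_psp provider are hypothetical, and a real system would also need to handle outcomes that arrive after a timeout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        state TEXT NOT NULL,
        state_version INTEGER NOT NULL
    );
    CREATE TABLE payment_attempts (
        id                  INTEGER PRIMARY KEY,
        order_id            INTEGER NOT NULL REFERENCES orders(id),
        provider            TEXT NOT NULL,
        provider_request_id TEXT NOT NULL,
        state               TEXT NOT NULL,
        amount              INTEGER NOT NULL,
        UNIQUE (provider, provider_request_id)
    );
    INSERT INTO orders VALUES (1, 'payment_pending', 0);
""")

def start_attempt(conn, order_id, provider, request_id, amount):
    """Record the attempt before any provider call is made."""
    with conn:
        cur = conn.execute(
            """INSERT INTO payment_attempts
                   (order_id, provider, provider_request_id, state, amount)
               VALUES (?, ?, ?, 'authorization_pending', ?)""",
            (order_id, provider, request_id, amount),
        )
    return cur.lastrowid

def apply_provider_result(conn, attempt_id, outcome):
    """Store the outcome on the attempt; promote the order only on 'authorized'."""
    with conn:
        cur = conn.execute(
            """UPDATE payment_attempts SET state = ?
               WHERE id = ? AND state = 'authorization_pending'""",
            (outcome, attempt_id),
        )
        if cur.rowcount != 1:
            return False  # attempt already resolved, e.g. a repeated webhook
        if outcome == "authorized":
            conn.execute(
                """UPDATE orders
                   SET state = 'payment_authorized',
                       state_version = state_version + 1
                   WHERE id = (SELECT order_id FROM payment_attempts WHERE id = ?)
                     AND state = 'payment_pending'""",
                (attempt_id,),
            )
    return True

def order_state(conn, order_id):
    row = conn.execute("SELECT state FROM orders WHERE id = ?", (order_id,))
    return row.fetchone()[0]

a1 = start_attempt(conn, 1, "example_psp", "req-1", 4999)
apply_provider_result(conn, a1, "declined")
state_after_decline = order_state(conn, 1)  # the order has not moved

a2 = start_attempt(conn, 1, "example_psp", "req-2", 4999)
apply_provider_result(conn, a2, "authorized")
duplicate = apply_provider_result(conn, a2, "authorized")  # repeated callback
state_after_auth = order_state(conn, 1)
```

The declined attempt leaves a record without touching the order, and the repeated callback for the second attempt is ignored because the attempt is no longer pending.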

Side effects need the same discipline. Sending an email, publishing OrderPlaced, or notifying fulfillment should be driven through an outbox table written in the same transaction as the order transition:

order_outbox(
  id,
  order_id,
  event_type,
  payload,
  published_at,
  created_at
)

The transition and the event become atomic. Delivery can be retried without re-deciding whether the order was placed.
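
A rough sketch of the outbox write and a retryable publisher, assuming SQLite as a stand-in. The placed state name, the place_order and publish_pending helpers, and the send callback are assumptions made for illustration. Delivery here is at-least-once, so consumers would still need to tolerate a repeated event.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        state TEXT NOT NULL,
        state_version INTEGER NOT NULL
    );
    CREATE TABLE order_outbox (
        id           INTEGER PRIMARY KEY,
        order_id     INTEGER NOT NULL REFERENCES orders(id),
        event_type   TEXT NOT NULL,
        payload      TEXT NOT NULL,
        published_at TEXT,  -- NULL until a worker has delivered the event
        created_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
    INSERT INTO orders VALUES (1, 'payment_authorized', 1);
""")

def place_order(conn, order_id, expected_version):
    """Move the order to 'placed' and append OrderPlaced in one transaction."""
    with conn:
        cur = conn.execute(
            """UPDATE orders
               SET state = 'placed', state_version = state_version + 1
               WHERE id = ? AND state = 'payment_authorized'
                 AND state_version = ?""",
            (order_id, expected_version),
        )
        if cur.rowcount != 1:
            return False  # transition not owned by this caller; no event written
        conn.execute(
            """INSERT INTO order_outbox (order_id, event_type, payload)
               VALUES (?, ?, ?)""",
            (order_id, "OrderPlaced", json.dumps({"order_id": order_id})),
        )
    return True

def publish_pending(conn, send):
    """Deliver unpublished events, marking each row only after send returns."""
    rows = conn.execute(
        """SELECT id, event_type, payload FROM order_outbox
           WHERE published_at IS NULL ORDER BY id"""
    ).fetchall()
    for event_id, event_type, payload in rows:
        send(event_type, json.loads(payload))  # may repeat after a crash
        with conn:
            conn.execute(
                "UPDATE order_outbox SET published_at = CURRENT_TIMESTAMP WHERE id = ?",
                (event_id,),
            )

delivered = []
placed = place_order(conn, 1, expected_version=1)
placed_again = place_order(conn, 1, expected_version=1)  # stale retry
publish_pending(conn, lambda kind, body: delivered.append((kind, body)))
publish_pending(conn, lambda kind, body: delivered.append((kind, body)))
```

The stale retry writes no second event, and the second publisher run finds nothing left to deliver.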

In Practice

Context: Stripe documents idempotent requests as a way for clients to safely retry create or update operations, with the first result saved and returned for later requests using the same key. Stripe also notes that keys should be unique and that parameter mismatches are rejected to prevent accidental key reuse (Stripe API docs).

Action: The checkout command should persist an idempotency key at the boundary where money movement begins. The database equivalent is a uniqueness constraint on the caller, checkout key, and operation, plus a stored response or stored aggregate reference. This matches the documented pattern: retry returns the original result instead of executing the mutation again (Stripe API docs).

Result: Duplicate HTTP requests stop being duplicate business commands. They become repeated reads of the same command result. The learning is that idempotency is not a middleware concern; it is a persisted contract.

Context: Shopify’s engineering write-up on payment idempotency describes tracking incoming requests by client and idempotency key, and using a lock around the API call so simultaneous duplicate requests do not both proceed (Shopify Engineering).

Action: A checkout system should record the command before doing external work and mark whether it is in progress, completed, or failed in a retryable way. A concurrent duplicate can then return a conflict or pollable result instead of entering the payment path twice (Shopify Engineering).

Result: The database becomes the rendezvous point for concurrent retries. The learning is that idempotency keys need an in-progress state, not only a completed-response cache.
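
A small sketch of a command record with an in-progress state, assuming SQLite as a stand-in. The checkout_commands table, its status values, and both helpers are hypothetical and are not taken from the Shopify write-up. The primary key makes the first insert the only caller allowed to start external work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE checkout_commands (
        client_id       TEXT NOT NULL,
        idempotency_key TEXT NOT NULL,
        status          TEXT NOT NULL,  -- in_progress, completed, failed_retryable
        response        TEXT,
        PRIMARY KEY (client_id, idempotency_key)
    )
""")

def begin_command(conn, client_id, key):
    """Claim the key. The first caller gets ('started', None); any later
    caller gets the recorded status and response of the earlier attempt."""
    try:
        with conn:
            conn.execute(
                """INSERT INTO checkout_commands
                       (client_id, idempotency_key, status)
                   VALUES (?, ?, 'in_progress')""",
                (client_id, key),
            )
        return ("started", None)
    except sqlite3.IntegrityError:
        # The key already exists: report its state instead of running again.
        row = conn.execute(
            """SELECT status, response FROM checkout_commands
               WHERE client_id = ? AND idempotency_key = ?""",
            (client_id, key),
        ).fetchone()
        return (row[0], row[1])

def complete_command(conn, client_id, key, response):
    """Store the final response so later retries become reads."""
    with conn:
        conn.execute(
            """UPDATE checkout_commands
               SET status = 'completed', response = ?
               WHERE client_id = ? AND idempotency_key = ?
                 AND status = 'in_progress'""",
            (response, client_id, key),
        )

first = begin_command(conn, "client-1", "key-1")
concurrent = begin_command(conn, "client-1", "key-1")  # duplicate mid-flight
complete_command(conn, "client-1", "key-1", '{"order_id": 1}')
later = begin_command(conn, "client-1", "key-1")       # retry after completion
```

The mid-flight duplicate sees in_progress and can be answered with a conflict or a pollable result, while the later retry receives the stored response.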

Context: PostgreSQL documents row-level locking with SELECT FOR UPDATE, and SKIP LOCKED for cases where locked rows should be skipped rather than waited on (PostgreSQL documentation).

Action: Workers that advance orders from payment_authorized to ready_for_fulfillment can claim rows with explicit locks, or use conditional updates that move exactly one expected state. For queue-like recovery jobs, SKIP LOCKED lets multiple workers avoid processing the same locked row (PostgreSQL documentation).

Result: Background processors stop competing through stale reads. The learning is that state machines need concurrency control at the row that owns the transition.

Context: DynamoDB condition expressions allow writes only when an expression evaluates true, such as inserting an item only when the key does not already exist (AWS DynamoDB documentation).

Action: The same state-machine model works outside SQL when transitions are conditional writes: create only if absent, advance only if the current state and version match, and treat failed conditions as a signal to reload (AWS DynamoDB documentation).

Result: The pattern is not tied to one database engine. The learning is that checkout reliability comes from conditional ownership of business facts.

Where It Breaks

| Failure mode | What happens | Mitigation |
| --- | --- | --- |
| State explosion | Every provider callback becomes a new order state | Keep provider details in attempt tables and promote only business-level states to the order |
| Long transactions | Payment calls hold database locks while waiting on the network | Persist intent first, call the provider outside the lock, then conditionally apply the result |
| Weak idempotency scope | The same key is reused across different carts or amounts | Store a request fingerprint and reject mismatched retries |
| Outbox backlog | Order transitions succeed but downstream delivery lags | Monitor unpublished event age and retry count as production health signals |
| Manual repair bypasses rules | Support edits `orders.state` directly | Build repair commands that use the same transition table and append audit records |
| Webhook races | Provider success arrives before the API request finishes | Record provider events independently, then reconcile through conditional transitions |
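
The request-fingerprint mitigation for weak idempotency scope can be sketched like this, assuming SQLite as a stand-in and a SHA-256 hash over canonical JSON as the fingerprint. The table, the helper names, and the FingerprintMismatch exception are all hypothetical.

```python
import hashlib
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE idempotency_keys (
        client_id       TEXT NOT NULL,
        idempotency_key TEXT NOT NULL,
        fingerprint     TEXT NOT NULL,
        PRIMARY KEY (client_id, idempotency_key)
    )
""")

class FingerprintMismatch(Exception):
    """The key was reused with a different request body."""

def fingerprint(params):
    # Canonical JSON, so key order in the request does not change the hash.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def register_request(conn, client_id, key, params):
    """Return True for a new key and False for a matching retry.
    Raise FingerprintMismatch when the key is reused with other parameters."""
    fp = fingerprint(params)
    try:
        with conn:
            conn.execute(
                """INSERT INTO idempotency_keys
                       (client_id, idempotency_key, fingerprint)
                   VALUES (?, ?, ?)""",
                (client_id, key, fp),
            )
        return True
    except sqlite3.IntegrityError:
        stored = conn.execute(
            """SELECT fingerprint FROM idempotency_keys
               WHERE client_id = ? AND idempotency_key = ?""",
            (client_id, key),
        ).fetchone()[0]
        if stored != fp:
            raise FingerprintMismatch(key) from None
        return False

is_new = register_request(conn, "client-1", "key-1",
                          {"cart": "c-9", "amount": 4999})
is_retry = register_request(conn, "client-1", "key-1",
                            {"amount": 4999, "cart": "c-9"})  # same request
try:
    register_request(conn, "client-1", "key-1",
                     {"cart": "c-9", "amount": 5999})  # different amount
    rejected = False
except FingerprintMismatch:
    rejected = True
```

A retry with the same parameters is accepted as a repeat, while the same key with a different amount is rejected rather than silently mapped to the original order.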

What to Do Next

Problem: Checkout failures become expensive when retries and callbacks can create new business facts.
Solution: Model orders as database-owned state machines with idempotent commands, conditional transitions, separate attempt records, and an outbox.
Proof: Stripe and Shopify document idempotency as a persisted retry contract, while PostgreSQL and DynamoDB expose the locking and conditional-write primitives needed to enforce transition ownership.
Action: Start by adding checkout_id, state_version, payment attempt records, and an outbox. Then change every checkout mutation to update from an expected state instead of assigning a new status directly.

Rajiv