Idempotency Keys: The Small Table That Saves Distributed Systems
The most reliable distributed systems often depend on an unimpressive table with a unique constraint, a request hash, and a saved response.
Situation
Distributed systems no longer fail as single, clean transactions. A client submits a payment, the API times out, the load balancer retries, the worker restarts, the message broker redelivers, and the user refreshes the page. Each component is doing something reasonable. Together, they can charge twice, create duplicate orders, send duplicate emails, or enqueue the same downstream workflow more than once.
Retries are now part of the contract. Cloud SDKs retry transient failures. Queue consumers retry failed messages. Frontends retry after ambiguous network errors. Operators replay jobs after incidents. The system has to assume that a request may arrive again even after the original request succeeded.
This is why idempotency is not a payment feature. It is a control plane pattern for uncertainty.
The Problem
The dangerous failure is not a clean error. The dangerous failure is an unknown result.
A client sends POST /charges. The service writes the charge to the payment processor. Before the response reaches the client, the connection drops. From the client’s point of view, the outcome is unknown: the request may have failed, or it may have succeeded without the response arriving. From the service’s point of view, the side effect may already be committed.
If the client retries a normal POST, the service cannot tell whether this is a new business action or the same action arriving again. Timestamps do not solve it. Request bodies do not solve it by themselves. “Check whether a similar row exists” usually becomes a race condition under concurrency.
The core question is: how can a service make retries safe when it cannot know whether the previous attempt succeeded?
The Idempotency Ledger
The answer is to turn each client intent into a named operation.
An idempotency key is a caller-provided identifier for one logical command. The server records that key before or during execution, associates it with a canonical request hash, and returns the same final result for repeated attempts with the same key.
flowchart TD
A[client sends command — idempotency key] --> B[api validates request — canonical hash]
B --> C[idempotency table — unique key]
C -->|new key| D[execute side effect — payment order message]
D --> E[store final response — status and body]
E --> F[return cached response — same key]
C -->|seen key| F
B -->|hash mismatch| G[reject mismatch — same key different request]
What this diagram shows: The client sends a command with an idempotency key. The API builds a canonical hash of the request and checks the idempotency table. A new key executes the side effect and stores the response. A duplicate key with the same hash returns the stored response without re-executing. A reused key with a different request body is rejected, which prevents the subtle class of double-execution bugs that occur when clients change payloads on retry.
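One way to build the canonical hash is to serialize the request in a fixed form and hash the bytes. This is a minimal sketch, not a prescribed format: the function name `canonical_request_hash` and the choice to include the method and path are assumptions made for illustration.

```python
import hashlib
import json


def canonical_request_hash(method: str, path: str, body: dict) -> str:
    """Hash a request so that semantically equal payloads produce the same digest."""
    canonical = json.dumps(
        {"method": method.upper(), "path": path, "body": body},
        sort_keys=True,         # key order must not change the hash
        separators=(",", ":"),  # no insignificant whitespace
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = canonical_request_hash("POST", "/charges", {"amount": 500, "currency": "usd"})
retry = canonical_request_hash("post", "/charges", {"currency": "usd", "amount": 500})
changed = canonical_request_hash("POST", "/charges", {"amount": 900, "currency": "usd"})

print(first == retry)    # same intent, different key order
print(first == changed)  # different amount
```

Anything that legitimately varies between retries, such as a client timestamp or trace header, should stay out of the hashed structure, or honest retries would be rejected as mismatches.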
The table is small, but the contract is strong:
- idempotency_key: unique per caller scope.
- request_hash: canonical representation of the intended command.
- status: processing, succeeded, or failed.
- response_code and response_body: what the caller should receive on replay.
- resource_id: optional pointer to the created domain object.
- expires_at: retention boundary for operational cleanup.
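The fields above could map to a table like the following. This is a sketch that uses Python's built-in sqlite3 module as a stand-in database. The table name `idempotency_keys`, the column types, and the use of a `tenant_id` column as the caller scope are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE idempotency_keys (
    tenant_id       TEXT NOT NULL,   -- assumed caller scope
    idempotency_key TEXT NOT NULL,
    request_hash    TEXT NOT NULL,
    status          TEXT NOT NULL
        CHECK (status IN ('processing', 'succeeded', 'failed')),
    response_code   INTEGER,         -- NULL while the command is in flight
    response_body   TEXT,
    resource_id     TEXT,
    expires_at      TEXT NOT NULL,
    UNIQUE (tenant_id, idempotency_key)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(idempotency_keys)")]
print(columns)
```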
The important detail is that idempotency is not deduplication after the fact. It is a write path protocol. The service must reserve the key with an atomic operation, usually an insert guarded by a unique constraint, before it executes any side effect. Only the request that wins the reservation is allowed to execute.
A typical flow looks like this:
- Validate the request enough to build a stable hash.
- Insert the key into the idempotency table.
- If insert succeeds, execute the command.
- Persist the final response against the key.
- If insert conflicts, compare the stored hash.
- If the hash matches, return the stored result or wait for the in-flight operation.
- If the hash differs, reject the request as a key reuse error.
This lets the client retry until it receives a response. The system stops treating retry as a suspicious event and starts treating it as normal recovery behavior.
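The flow above can be sketched end to end. This is an illustrative sketch under stated assumptions, not a production handler: it uses sqlite3 as a stand-in store, `charge` is a hypothetical side effect, the hash is a simplified JSON digest, and the 422 and 409 status codes are arbitrary choices for the mismatch and in-flight cases.

```python
import hashlib
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE idempotency_keys (
           idempotency_key TEXT PRIMARY KEY,
           request_hash    TEXT NOT NULL,
           status          TEXT NOT NULL,
           response_code   INTEGER,
           response_body   TEXT)"""
)
executions = []  # records each real side effect so duplicates would be visible


def charge(body):
    """Hypothetical side effect standing in for a payment call."""
    executions.append(body)
    return 201, {"charge_id": f"ch_{len(executions)}", "amount": body["amount"]}


def handle(key, body):
    request_hash = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    try:
        with conn:  # commit the reservation before any side effect runs
            conn.execute(
                "INSERT INTO idempotency_keys (idempotency_key, request_hash, status)"
                " VALUES (?, ?, 'processing')",
                (key, request_hash),
            )
    except sqlite3.IntegrityError:  # the key was already reserved
        stored_hash, status, code, stored_body = conn.execute(
            "SELECT request_hash, status, response_code, response_body"
            " FROM idempotency_keys WHERE idempotency_key = ?",
            (key,),
        ).fetchone()
        if stored_hash != request_hash:
            return 422, {"error": "idempotency key reused with a different request"}
        if status == "processing":
            return 409, {"error": "original request is still in flight"}
        return code, json.loads(stored_body)  # replay without re-executing

    code, response = charge(body)
    with conn:
        conn.execute(
            "UPDATE idempotency_keys SET status = 'succeeded', response_code = ?,"
            " response_body = ? WHERE idempotency_key = ?",
            (code, json.dumps(response), key),
        )
    return code, response


first = handle("order-123", {"amount": 500})
retry = handle("order-123", {"amount": 500})   # replayed from the ledger
reused = handle("order-123", {"amount": 900})  # same key, different request
```

In this sketch the retry receives the stored response and `executions` still holds a single entry, while the reused key is rejected. Returning an error for the in-flight case is one option. Waiting for the original operation to finish is the other one the flow describes.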
In Practice
Context: Stripe documents idempotency keys for POST requests and stores the resulting status code and body for a key, including failures. Their public guidance says subsequent requests with the same key return the same result, and that keys should be unique and removable after a retention window.
Action: The architectural pattern is to bind the key to the parameters of the original request. Stripe’s documentation says the idempotency layer compares incoming parameters with the original request and errors if they differ. That prevents a client from accidentally reusing order-123 for a different charge.
Result: The retry contract becomes simple. If the original request succeeded but the response was lost, a retry receives the original success. If the original request failed and that failure response was stored, the retry receives the same failure. The client no longer has to guess whether it should issue a second business command.
Learning: The key is not just a cache key. It is evidence of caller intent. A good implementation protects both sides: the client can retry safely, and the server can reject ambiguous reuse.
Context: AWS APIs commonly expose client tokens for idempotent requests. The Amazon EC2 API documentation describes client tokens as a way to make mutating calls idempotent, so retries do not create duplicate resources when the original result is unknown.
Action: The caller supplies a token when creating resources such as instances. The service uses that token to identify retries of the same operation within the idempotency scope defined by the API.
Result: Resource creation becomes safer under network failures, SDK retries, and operator replays. The caller can repeat the same command with the same token instead of building custom duplicate detection around resource names, tags, or timing.
Learning: Idempotency belongs at the API boundary because only the caller can reliably name the logical command. The server can enforce uniqueness, but the caller supplies intent.
Context: PostgreSQL unique constraints and INSERT ... ON CONFLICT provide the database behavior needed for an idempotency ledger. The documented behavior is that a unique index prevents two committed rows from holding the same key.
Action: Use a unique constraint on (tenant_id, idempotency_key) and reserve the key in the same transaction that records the command's execution metadata, such as its request hash and status.
Result: Concurrent duplicate requests collapse into one winner and one conflict path. Without the unique constraint, two workers can both observe “no existing request” and execute the side effect.
Learning: Idempotency is only as strong as the atomicity of the reservation. A table without a uniqueness guarantee is an audit log, not a concurrency control mechanism.
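The difference between check-then-insert and an atomic reservation can be shown with a hand-interleaved example. This sketch uses sqlite3 rather than PostgreSQL so it runs anywhere. The table names and the `seen` and `reserve` helpers are hypothetical, and in PostgreSQL the reservation would more likely use INSERT ... ON CONFLICT instead of catching an exception.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ledger_no_constraint (tenant_id TEXT, idempotency_key TEXT)")
conn.execute(
    "CREATE TABLE ledger (tenant_id TEXT, idempotency_key TEXT,"
    " UNIQUE (tenant_id, idempotency_key))"
)


def seen(tenant, key):
    """Non-atomic existence check against the table without a constraint."""
    row = conn.execute(
        "SELECT 1 FROM ledger_no_constraint WHERE tenant_id = ? AND idempotency_key = ?",
        (tenant, key),
    ).fetchone()
    return row is not None


def reserve(tenant, key):
    """Atomic reservation: True for the single winner, False for every duplicate."""
    try:
        conn.execute("INSERT INTO ledger VALUES (?, ?)", (tenant, key))
        return True
    except sqlite3.IntegrityError:
        return False


# Check-then-insert, interleaved by hand: both workers check before either writes.
worker_a_seen = seen("acme", "order-123")
worker_b_seen = seen("acme", "order-123")
side_effects = 0
for already_seen in (worker_a_seen, worker_b_seen):
    if not already_seen:
        conn.execute("INSERT INTO ledger_no_constraint VALUES (?, ?)", ("acme", "order-123"))
        side_effects += 1  # both workers reach this line

# With the unique constraint, the same two attempts produce one winner.
winners = [reserve("acme", "order-123"), reserve("acme", "order-123")]
other_tenant = reserve("globex", "order-123")  # same key, different scope
```

Because the constraint covers the tenant as well as the key, a second tenant can use the same key string without colliding.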
Where It Breaks
| Failure mode | Why it happens | Mitigation |
|---|---|---|
| Key reused for a different command | Client generates predictable or coarse keys | Store a canonical request hash and reject mismatches |
| Duplicate side effect before key reservation | Service performs work before the atomic insert | Reserve the key before side effects |
| In-flight retry sees processing forever | Worker crashes after reserving the key | Add leases, heartbeats, timeout recovery, or reconciliation |
| Response body changes across deployments | Replay recomputes the response from current code | Persist the original response or stable resource reference |
| Retention window too short | Client retries after cleanup | Align expiration with retry policies, queue retention, and dispute windows |
| Downstream system is not idempotent | Your boundary is safe but the next one is not | Pass idempotency keys downstream or create a local outbox |
| Global key namespace collision | Multiple tenants or clients use the same key | Scope uniqueness by tenant, account, or caller |
| Treating all failures as final | Transient infrastructure failure gets cached as a permanent response | Decide which failures are stored and which keep the operation retryable |
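The lease mitigation for a stuck processing row could look like the following sketch. The `lease_expires_at` column, the 30-second lease, and the function names are assumptions, and time is passed in explicitly so the behavior is easy to follow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idempotency_keys (idempotency_key TEXT PRIMARY KEY,"
    " status TEXT NOT NULL, lease_expires_at REAL NOT NULL)"
)

LEASE_SECONDS = 30.0  # assumed value; align it with the slowest expected side effect


def reserve(key, now):
    """Reserve the key and start a lease for the worker that won."""
    try:
        conn.execute(
            "INSERT INTO idempotency_keys VALUES (?, 'processing', ?)",
            (key, now + LEASE_SECONDS),
        )
        return True
    except sqlite3.IntegrityError:
        return False


def take_over_if_expired(key, now):
    """Claim a stuck reservation. The conditional UPDATE lets only one retry win."""
    cursor = conn.execute(
        "UPDATE idempotency_keys SET lease_expires_at = ?"
        " WHERE idempotency_key = ? AND status = 'processing' AND lease_expires_at < ?",
        (now + LEASE_SECONDS, key, now),
    )
    return cursor.rowcount == 1


reserve("order-123", now=0.0)                         # worker reserves, then crashes
early = take_over_if_expired("order-123", now=10.0)   # lease still valid
late = take_over_if_expired("order-123", now=45.0)    # lease expired, retry takes over
second = take_over_if_expired("order-123", now=46.0)  # the new lease is now held
```

Taking over the lease only decides who performs the recovery. Whether the side effect already happened still has to be checked against the downstream system before anything is executed again.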
The hardest case is the gap between reserving the key and committing the external side effect. If the service calls a payment provider and crashes before recording the response, the ledger may say processing while the payment may exist. That is not solved by idempotency alone. It needs reconciliation: query the downstream provider by its own idempotency key, repair the local state, and then complete the original response.
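A reconciliation step along these lines might look like the sketch below. `FakeProvider` is a hypothetical in-memory stand-in for a downstream provider that supports lookup by idempotency key, and the ledger is a plain dictionary to keep the example short. Real providers expose different lookup APIs.

```python
class FakeProvider:
    """Hypothetical downstream provider that is idempotent on the caller's key."""

    def __init__(self):
        self.charges = {}

    def create_charge(self, key, amount):
        # Repeating the call with the same key returns the existing charge.
        return self.charges.setdefault(key, {"charge_id": f"ch_{key}", "amount": amount})

    def find_charge(self, key):
        return self.charges.get(key)


def reconcile(ledger, provider, key):
    """Repair a ledger entry that is stuck in 'processing'."""
    entry = ledger[key]
    if entry["status"] != "processing":
        return entry  # already final, nothing to repair
    charge = provider.find_charge(key)
    if charge is None:
        # No record downstream: re-issue with the same key, which stays safe
        # because the provider deduplicates on it.
        charge = provider.create_charge(key, entry["amount"])
    entry.update(status="succeeded", response_code=201, response_body=charge)
    return entry


provider = FakeProvider()
ledger = {"order-123": {"status": "processing", "amount": 500}}
provider.create_charge("order-123", 500)  # charge committed, then the service crashed

repaired = reconcile(ledger, provider, "order-123")
```

After reconciliation the ledger entry holds the provider's original charge, so the next retry from the client can be answered from the ledger like any other replay.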
For message-driven systems, pair the idempotency table with an outbox. The command handler records intent and emits work from a durable table. Consumers also need idempotency at their boundary, because brokers usually promise at-least-once delivery, not exactly-once business effects.
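The pairing of an outbox with consumer-side idempotency can be sketched as follows. The table names, the message ID format, and the `emails_sent` list standing in for a real side effect are all assumptions, and sqlite3 again stands in for the service database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount INTEGER NOT NULL);
    CREATE TABLE outbox (message_id TEXT PRIMARY KEY, payload TEXT NOT NULL);
    CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY);
    """
)
emails_sent = []  # stand-in for an externally visible side effect


def place_order(order_id, amount):
    """Write the domain row and the outgoing message in one transaction."""
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
        conn.execute(
            "INSERT INTO outbox VALUES (?, ?)", (f"order-placed:{order_id}", order_id)
        )


def consume(message_id, payload):
    """Consumer-side idempotency: a redelivered message is acknowledged, not re-run."""
    try:
        with conn:
            conn.execute("INSERT INTO processed_messages VALUES (?)", (message_id,))
            emails_sent.append(payload)
    except sqlite3.IntegrityError:
        pass  # duplicate delivery, already handled


place_order("o-1", 500)
messages = conn.execute("SELECT message_id, payload FROM outbox").fetchall()
for message_id, payload in messages * 2:  # simulate at-least-once redelivery
    consume(message_id, payload)
```

The message is delivered twice here, but the side effect runs once because the second delivery fails the insert into `processed_messages`.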
What to Do Next
- Problem: Retries turn ambiguous outcomes into duplicate side effects when a service cannot distinguish a new command from a repeated one.
- Solution: Require idempotency keys on mutating API calls, reserve them with a unique constraint, bind them to a request hash, and replay the stored result.
- Proof: Stripe’s idempotency-key contract, AWS client-token APIs, and PostgreSQL uniqueness behavior all support the same pattern: name the intent, reserve it atomically, and make retries converge.
- Action: Add an idempotency ledger to the write paths where duplicate execution is expensive, externally visible, or difficult to reverse. Start with payments, orders, provisioning, notifications, and workflow launches.