Retries are not reliability unless the second execution is harmless.

Situation

Python is everywhere in platform engineering because it is the shortest path from operational intent to automation. A small job opens a pull request, syncs permissions, backfills a table, refreshes a cache, exports billing data, or reconciles cloud resources. The job starts as a script. Then it gets scheduled. Then it gets retried. Then it becomes part of the production control plane.

That change matters. A local script can fail loudly and wait for a human. A platform job is expected to recover from transient failures: network timeouts, rate limits, dead database connections, worker restarts, queue redelivery, deploy interruptions, and expired credentials. The operational reflex is to add retry logic.

Retry is necessary, but retry alone only answers one question: can the operation be attempted again? It does not answer the more important one: what happens if the first attempt partially succeeded?

Idempotency is the boundary between recovery and duplicate damage.

The Problem

A Python job rarely fails at the clean boundary the author had in mind. It fails after the database row was inserted but before the outbound API returned. It fails after the ticket was created but before the local state was marked complete. It fails after sending the notification but before acknowledging the queue message. It fails after claiming work but before writing the final status.

From the job runner’s point of view, the attempt failed. From the outside world’s point of view, something may already have happened.

That gap creates duplicate damage. The retry opens a second ticket. The replay sends a second email. The worker provisions a second resource. The batch process double-counts revenue. The cleanup job deletes something that was recreated between attempts. The CI automation posts the same comment on every retry until a pull request becomes unreadable.

The trap is that unit tests often miss this. They validate the happy path and maybe the exception path, but not the ambiguous path where a side effect succeeded and the acknowledgement failed. That is the path production retries find first.

The core question is not “how many times should this job retry?” It is “what state transition makes every retry converge on one correct outcome?”

Idempotency as a Job Contract

An idempotent job is not a job that never runs twice. It is a job whose repeated executions produce the same durable result for the same logical request.

That contract usually needs three pieces:

  1. A stable operation key.
  2. A durable record of progress.
  3. Side effects guarded by uniqueness, compare-and-set, or provider idempotency.

In Python, the mistake is often putting idempotency inside process memory: a set of seen IDs, an object cache, a module-level lock. That helps only until the worker restarts, the job moves to another machine, or the queue redelivers the message. Idempotency belongs in durable state.

flowchart TD
    A[Job starts — input received] --> B[Derive operation key — stable identity]
    B --> C[Claim work — durable uniqueness]
    C --> D{Already completed}
    D -->|yes| E[Return prior result — no new side effect]
    D -->|no| F[Execute guarded side effect — provider key or local constraint]
    F --> G[Persist outcome — completed state]
    G --> H[Acknowledge message — retry no longer needed]
    F --> I[Failure after side effect — ambiguous state]
    I --> B

The operation key is the identity of the intent, not the identity of the attempt. A retry should not get a new key. A queue message ID can work if the queue message is the logical operation. A pull request number plus check name can work for CI comments. A customer ID plus billing period can work for invoice generation. A migration name plus target table can work for backfills.
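As a minimal sketch of that idea, the helper below derives a key from the business identity of a request and nothing else. The function name `operation_key` and the sample identifiers are hypothetical, chosen only for illustration; the point is that two attempts at the same intent produce the same key regardless of argument order or attempt number.

```python
import hashlib
import json


def operation_key(kind: str, **identity: object) -> str:
    """Build a stable key from the logical request, never from the attempt."""
    # Sorting the fields means argument order cannot change the key.
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"), default=str)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{kind}:{digest}"


# The same intent yields the same key on every attempt.
first_attempt = operation_key("invoice", customer_id="cus_123", period="2024-05")
retry_attempt = operation_key("invoice", period="2024-05", customer_id="cus_123")

# A different billing period is a different operation.
other_period = operation_key("invoice", customer_id="cus_123", period="2024-06")
```

Hashing is optional here. A readable key such as invoice:{customer_id}:{period} works just as well when the identity fields are short and safe to store in plain form.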

The durable record is what lets the next attempt know whether it is starting, resuming, or returning an existing result. A simple table is often enough:

  • operation_key
  • status
  • attempt_count
  • locked_until
  • result_reference
  • error_code
  • created_at
  • updated_at
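One possible shape for that table, sketched with Python's built-in sqlite3 module so it runs anywhere. The table name `job_operations`, the column types, the four status values, and the use of epoch seconds for timestamps are all assumptions made for this example; a production store would more likely be PostgreSQL or DynamoDB with its own types.

```python
import sqlite3

# Column names follow the list above; types and status values are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS job_operations (
    operation_key    TEXT PRIMARY KEY,
    status           TEXT NOT NULL DEFAULT 'pending'
                     CHECK (status IN ('pending', 'running', 'completed', 'failed')),
    attempt_count    INTEGER NOT NULL DEFAULT 0,
    locked_until     REAL,
    result_reference TEXT,
    error_code       TEXT,
    created_at       REAL NOT NULL,
    updated_at       REAL NOT NULL
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and make sure the progress table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn
```

The primary key on operation_key is the part that matters most: it is what lets the database, rather than the worker, reject a second record for the same intent.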

The side effect guard is the most important part. If the side effect is local, use database constraints. If the side effect is external, use the provider’s idempotency feature when available. If neither exists, store enough remote identity to detect and reconcile prior work before creating anything new.

This turns retry from “run the function again” into “advance the operation toward a known terminal state.”
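To make "advance toward a terminal state" concrete, here is a small decision function a worker could call after reading the durable record. The `OperationRecord` class, the status strings, and the returned step names are hypothetical labels for this sketch, not part of any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OperationRecord:
    status: str                      # assumed values: pending, running, completed, failed
    locked_until: float = 0.0        # assumed to be epoch seconds
    result_reference: Optional[str] = None


def next_step(record: Optional[OperationRecord], now: float) -> str:
    """Decide what this attempt should do, given what earlier attempts left behind."""
    if record is None:
        return "start"                    # no earlier attempt is known
    if record.status == "completed":
        return "return_prior_result"      # terminal state: no new side effect
    if record.status == "running" and record.locked_until > now:
        return "wait"                     # another worker still holds the lease
    # pending, failed, or running with an expired lease: an earlier attempt
    # may have partially succeeded, so check the side effect before redoing it.
    return "reconcile_then_resume"
```

Every branch either does nothing new or moves the operation closer to completed, which is the property the retry loop depends on.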

In Practice

Context: Stripe publicly documents idempotency keys for API requests. The documented behavior is that clients can send an idempotency key with a request so retried calls do not create duplicate operations for the same intent. Stripe also stores the response associated with the key, allowing a retry to receive the same result rather than blindly executing another side effect. See Stripe’s documentation on idempotent requests.

Action: The architectural pattern is to generate the key at the workflow boundary and pass it through the job, not generate it inside the retry loop. For a Python billing job, that means the key should look like a business operation: invoice:{customer_id}:{period}, not uuid4() per attempt.

Result: Retries become safe because the external system can recognize the duplicate intent. The job still needs local state, but the highest-risk side effect is protected by the system that owns it.

Learning: Idempotency keys are not retry counters. They are part of the operation identity. If the key changes on every attempt, the system has retry behavior without duplicate protection.
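The sketch below illustrates where the key is created relative to the retry loop. `FakeBillingProvider` is a hypothetical in-memory stand-in written for this example; it is not Stripe's client library and only imitates the documented idea that a provider remembers the response for a key. The lost-response failure is simulated with a flag.

```python
import itertools


class FakeBillingProvider:
    """Hypothetical provider that remembers one response per idempotency key."""

    def __init__(self) -> None:
        self._responses = {}
        self._ids = itertools.count(1)
        self.invoices_created = 0
        self.lose_next_response = False  # test hook: side effect succeeds, reply is lost

    def create_invoice(self, customer_id: str, period: str, idempotency_key: str) -> dict:
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]  # duplicate intent: replay the result
        self.invoices_created += 1
        response = {"invoice_id": f"inv_{next(self._ids)}", "customer_id": customer_id, "period": period}
        self._responses[idempotency_key] = response
        if self.lose_next_response:
            self.lose_next_response = False
            raise TimeoutError("response lost after the invoice was created")
        return response


def bill_customer(provider: FakeBillingProvider, customer_id: str, period: str, max_attempts: int = 3) -> dict:
    # The key is derived once, at the workflow boundary, from business identity.
    key = f"invoice:{customer_id}:{period}"
    last_error = None
    for _ in range(max_attempts):
        try:
            return provider.create_invoice(customer_id, period, idempotency_key=key)
        except TimeoutError as exc:
            last_error = exc  # ambiguous failure: retry with the SAME key
    raise RuntimeError(f"gave up on {key}") from last_error


provider = FakeBillingProvider()
provider.lose_next_response = True
result = bill_customer(provider, "cus_123", "2024-05")
```

Moving the `key = ...` line inside the loop and swapping it for a fresh random value would leave the retry behavior intact and remove the duplicate protection, which is the failure the Learning above describes.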

Context: PostgreSQL documents INSERT ... ON CONFLICT, which lets a write handle uniqueness conflicts deterministically. This is the database-level foundation for many idempotent job claims and result records. See the PostgreSQL documentation for INSERT.

Action: A Python worker can insert an operation_key into a table with a unique constraint. If the insert succeeds, it owns the first execution. If the insert conflicts, it reads the existing row and decides whether to return, resume, or wait.

Result: The database becomes the arbiter of duplicate work. This is stronger than checking first and inserting later, because the check-then-insert pattern races under concurrency.

Learning: Idempotency is a consistency problem before it is a Python problem. The code should ask the database to enforce the invariant, not merely hope all workers observe it.
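A minimal sketch of the claim step follows. To keep it runnable without a database server, it uses Python's built-in sqlite3 module, whose INSERT also accepts an ON CONFLICT ... DO NOTHING clause (SQLite 3.24 or newer); this is a stand-in for PostgreSQL, not a PostgreSQL client example. The table name `job_claims`, the `claim` function, and the sample key are hypothetical.

```python
import sqlite3


def open_claims(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_claims ("
        " operation_key TEXT PRIMARY KEY,"
        " status TEXT NOT NULL)"
    )
    return conn


def claim(conn: sqlite3.Connection, operation_key: str) -> tuple:
    """Return (owns_first_execution, current_status) for the operation."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT INTO job_claims (operation_key, status) VALUES (?, 'running') "
            "ON CONFLICT (operation_key) DO NOTHING",
            (operation_key,),
        )
        # One row written means this attempt created the claim; zero means
        # an earlier attempt already holds it.
        owns_first_execution = cursor.rowcount == 1
        status = conn.execute(
            "SELECT status FROM job_claims WHERE operation_key = ?",
            (operation_key,),
        ).fetchone()[0]
    return owns_first_execution, status


conn = open_claims()
first = claim(conn, "backfill:orders:add_region")
second = claim(conn, "backfill:orders:add_region")
```

There is no separate existence check before the insert. The unique constraint decides the winner in a single statement, so two workers cannot both conclude that the row is missing.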

Context: Powertools for AWS Lambda (Python), formerly named AWS Lambda Powertools for Python, includes an idempotency utility that records invocation state in a persistence layer such as DynamoDB. Its documentation frames idempotency as protection against repeated Lambda invocations with the same payload. See the Powertools for AWS Lambda (Python) documentation on idempotency.

Action: The documented pattern is to extract an idempotency key from the event, persist execution state, and return a stored response for duplicate invocations.

Result: The handler can tolerate platform-level retries, client retries, and duplicate events without treating every invocation as new work.

Learning: Serverless and queued jobs make duplicate execution normal. The correct design assumption is at-least-once execution, not exactly-once execution.
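The same pattern can be sketched by hand to show its moving parts. The decorator below is a hypothetical illustration backed by sqlite3; it is not the Powertools API and omits things a real utility needs, such as expiring stale in-progress records. The names `idempotent`, `key_field`, and `handle_order` are invented for this example.

```python
import functools
import json
import sqlite3


def idempotent(conn: sqlite3.Connection, key_field: str):
    """Hypothetical decorator: one stored response per event[key_field]."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS idempotency_records ("
        " key TEXT PRIMARY KEY, status TEXT NOT NULL, response TEXT)"
    )

    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event: dict):
            key = f"{handler.__name__}:{event[key_field]}"
            with conn:  # atomically record that work on this key has begun
                claimed = conn.execute(
                    "INSERT INTO idempotency_records (key, status) VALUES (?, 'in_progress') "
                    "ON CONFLICT (key) DO NOTHING",
                    (key,),
                ).rowcount == 1
            if not claimed:
                status, stored = conn.execute(
                    "SELECT status, response FROM idempotency_records WHERE key = ?", (key,)
                ).fetchone()
                if status == "completed":
                    return json.loads(stored)  # duplicate event: replay the stored response
                raise RuntimeError(f"{key} is already in progress")
            try:
                response = handler(event)
            except Exception:
                # Release the claim so a retry can run. This is safe only if the
                # handler's own side effects are guarded as well.
                with conn:
                    conn.execute("DELETE FROM idempotency_records WHERE key = ?", (key,))
                raise
            with conn:
                conn.execute(
                    "UPDATE idempotency_records SET status = 'completed', response = ? WHERE key = ?",
                    (json.dumps(response), key),
                )
            return response

        return wrapper

    return decorator


conn = sqlite3.connect(":memory:")
calls = []


@idempotent(conn, key_field="order_id")
def handle_order(event: dict) -> dict:
    calls.append(event["order_id"])
    return {"order_id": event["order_id"], "status": "shipped"}


first = handle_order({"order_id": "o-1"})
duplicate = handle_order({"order_id": "o-1"})
```

The handler body runs once; the duplicate invocation is answered from the stored record.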

Where It Breaks

| Failure mode | Why it happens | Mitigation | Tradeoff |
| --- | --- | --- | --- |
| Key is generated inside the retry | Every attempt looks like new work | Derive the key from business identity | Requires stable input modeling |
| Claim table is separate from side effect | Local state says pending while remote work succeeded | Store remote identifiers and reconcile before creating | More code paths and provider reads |
| Check-then-insert race | Two workers observe missing state | Use unique constraints or atomic conditional writes | Pushes design into storage semantics |
| Long-running job holds a lock forever | Worker dies mid-operation | Use leases with locked_until and heartbeats | Requires timeout tuning |
| Result cannot be replayed | Duplicate attempt cannot return prior output | Persist result references or normalized responses | More storage and schema design |
| External API has no idempotency key | Provider cannot detect duplicate intent | Search by deterministic metadata before create | Reconciliation may be imperfect |
| Side effect is not reversible | Duplicate damage cannot be cheaply repaired | Guard before the side effect and add manual repair workflow | Slower first implementation |
| Batch job mixes many identities | One failed item causes whole batch replay | Track idempotency per item, not only per batch | More rows and more observability needed |
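The lease mitigation is the least obvious row to implement, so here is a rough sketch of it. It uses sqlite3 as a stand-in store, takes the current time as a parameter so the behavior is easy to test, and leaves out an owner token that a real heartbeat would check. The table name `job_leases`, both function names, and the 30-second default are assumptions for this example.

```python
import sqlite3


def open_leases(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_leases ("
        " operation_key TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " attempt_count INTEGER NOT NULL DEFAULT 0,"
        " locked_until REAL NOT NULL DEFAULT 0)"
    )
    return conn


def acquire_lease(conn: sqlite3.Connection, operation_key: str, now: float, lease_seconds: float = 30.0) -> bool:
    """Take the lease only if nobody holds an unexpired one (compare-and-set)."""
    with conn:
        conn.execute(
            "INSERT INTO job_leases (operation_key) VALUES (?) "
            "ON CONFLICT (operation_key) DO NOTHING",
            (operation_key,),
        )
        cursor = conn.execute(
            "UPDATE job_leases SET status = 'running', "
            " attempt_count = attempt_count + 1, locked_until = ? "
            "WHERE operation_key = ? AND status != 'completed' AND locked_until <= ?",
            (now + lease_seconds, operation_key, now),
        )
    return cursor.rowcount == 1


def heartbeat(conn: sqlite3.Connection, operation_key: str, now: float, lease_seconds: float = 30.0) -> bool:
    """Extend a lease that is still live; returns False if it already expired."""
    with conn:
        cursor = conn.execute(
            "UPDATE job_leases SET locked_until = ? "
            "WHERE operation_key = ? AND status = 'running' AND locked_until > ?",
            (now + lease_seconds, operation_key, now),
        )
    return cursor.rowcount == 1


conn = open_leases()
worker_a = acquire_lease(conn, "export:billing:2024-05", now=100.0)        # lease until 130
worker_b = acquire_lease(conn, "export:billing:2024-05", now=110.0)        # still held
extended = heartbeat(conn, "export:billing:2024-05", now=120.0)            # lease until 150
after_crash = acquire_lease(conn, "export:billing:2024-05", now=151.0)     # expired, so taken over
```

The timeout tuning mentioned in the table shows up directly here: a lease shorter than the longest gap between heartbeats hands live work to a second worker, and a lease that is too long delays recovery after a crash.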

What to Do Next

  • Problem: Treat every retryable Python job as an at-least-once workflow. Assume the worker can crash after any side effect and before any acknowledgement.

  • Solution: Add a durable operation key, a uniqueness-backed claim record, explicit statuses, and guarded side effects. Prefer provider idempotency keys for external APIs and database constraints for local writes.

  • Proof: Test the ambiguous failures. Force exceptions after the database write, after the API call, before the queue acknowledgement, and during concurrent execution. The second attempt should converge, not duplicate.

  • Action: Pick one production job with retry logic and trace its side effects. If the retry generates a new identifier, performs a check-then-create, or lacks a durable completed state, it is not idempotent yet.