Idempotent Python Jobs: The Difference Between Retry and Duplicate Damage
Retries are not reliability unless the second execution is harmless.
Situation
Python is everywhere in platform engineering because it is the shortest path from operational intent to automation. A small job opens a pull request, syncs permissions, backfills a table, refreshes a cache, exports billing data, or reconciles cloud resources. The job starts as a script. Then it gets scheduled. Then it gets retried. Then it becomes part of the production control plane.
That change matters. A local script can fail loudly and wait for a human. A platform job is expected to recover from transient failures: network timeouts, rate limits, dead database connections, worker restarts, queue redelivery, deploy interruptions, and expired credentials. The operational reflex is to add retry logic.
Retry is necessary, but retry alone only answers one question: can the operation be attempted again? It does not answer the more important one: what happens if the first attempt partially succeeded?
Idempotency is the boundary between recovery and duplicate damage.
The Problem
A Python job rarely fails at the clean boundary the author had in mind. It fails after the database row was inserted but before the outbound API returned. It fails after the ticket was created but before the local state was marked complete. It fails after sending the notification but before acknowledging the queue message. It fails after claiming work but before writing the final status.
From the job runner’s point of view, the attempt failed. From the outside world’s point of view, something may already have happened.
That gap creates duplicate damage. The retry opens a second ticket. The replay sends a second email. The worker provisions a second resource. The batch process double-counts revenue. The cleanup job deletes something that was recreated between attempts. The CI automation posts the same comment on every retry until a pull request becomes unreadable.
The trap is that unit tests often miss this. They validate the happy path and maybe the exception path, but not the ambiguous path where a side effect succeeded and the acknowledgement failed. That is the path production retries find first.
The core question is not “how many times should this job retry?” It is “what state transition makes every retry converge on one correct outcome?”
Idempotency as a Job Contract
An idempotent job is not a job that never runs twice. It is a job whose repeated executions produce the same durable result for the same logical request.
That contract usually needs three pieces:
- A stable operation key.
- A durable record of progress.
- Side effects guarded by uniqueness, compare-and-set, or provider idempotency.
In Python, a common mistake is keeping idempotency state in process memory: a set of seen IDs, an object cache, a module-level lock. That works only until the worker restarts, the job moves to another machine, or the queue redelivers the message to a different process. Idempotency state belongs in durable storage.
flowchart TD
A[Job starts — input received] --> B[Derive operation key — stable identity]
B --> C[Claim work — durable uniqueness]
C --> D{Already completed}
D -->|yes| E[Return prior result — no new side effect]
D -->|no| F[Execute guarded side effect — provider key or local constraint]
F --> G[Persist outcome — completed state]
G --> H[Acknowledge message — retry no longer needed]
F --> I[Failure after side effect — ambiguous state]
I --> B
The operation key is the identity of the intent, not the identity of the attempt. A retry should not get a new key. A queue message ID can work if the queue message is the logical operation. A pull request number plus check name can work for CI comments. A customer ID plus billing period can work for invoice generation. A migration name plus target table can work for backfills.
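As a minimal sketch of deriving a key from intent rather than from the attempt, the helpers below build keys from business identifiers or from a canonical hash of the payload. The function names, key formats, and sample identifiers are illustrative assumptions, not a standard.

```python
import hashlib
import json


def invoice_key(customer_id: str, period: str) -> str:
    """Key for invoice generation: same customer and period, same key."""
    return f"invoice:{customer_id}:{period}"


def payload_key(namespace: str, payload: dict) -> str:
    """Key for operations identified by their payload.

    Canonical JSON (sorted keys, fixed separators) keeps the hash stable
    regardless of the order in which the dict was built.
    """
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{namespace}:{digest[:32]}"


# Every attempt for the same intent derives the same key.
first_attempt = invoice_key("cus_123", "2024-05")
retry_attempt = invoice_key("cus_123", "2024-05")

# Hypothetical CI comment identity: pull request number plus check name.
comment_key = payload_key("pr-comment", {"pr": 42, "check": "lint"})
```

Nothing in these helpers depends on the clock, a random source, or the attempt number, which is what keeps the key stable across retries.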
The durable record is what lets the next attempt know whether it is starting, resuming, or returning an existing result. A simple table is often enough:
| operation_key | status | attempt_count | locked_until | result_reference | error_code | created_at | updated_at |
|---|---|---|---|---|---|---|---|
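One way to sketch that record is with SQLite from the Python standard library. The table name `job_operations`, the column types, and the status values are assumptions made for illustration; only the column names come from the list above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS job_operations (
    operation_key    TEXT PRIMARY KEY,   -- identity of the intent, not the attempt
    status           TEXT NOT NULL DEFAULT 'pending'
                     CHECK (status IN ('pending', 'completed', 'failed')),
    attempt_count    INTEGER NOT NULL DEFAULT 0,
    locked_until     REAL,               -- lease expiry as epoch seconds
    result_reference TEXT,               -- for example a remote ticket or invoice ID
    error_code       TEXT,
    created_at       REAL NOT NULL,
    updated_at       REAL NOT NULL
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and make sure the progress table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn
```

The primary key on `operation_key` is the part that matters most: it is what lets the database, rather than the worker, reject a second record for the same intent.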
The side effect guard is the most important part. If the side effect is local, use database constraints. If the side effect is external, use the provider’s idempotency feature when available. If neither exists, store enough remote identity to detect and reconcile prior work before creating anything new.
This turns retry from “run the function again” into “advance the operation toward a known terminal state.”
In Practice
Context: Stripe publicly documents idempotency keys for API requests. The documented behavior is that clients can send an idempotency key with a request so retried calls do not create duplicate operations for the same intent. Stripe also stores the response associated with the key, allowing a retry to receive the same result rather than blindly executing another side effect. See Stripe’s documentation on idempotent requests.
Action: The architectural pattern is to generate the key at the workflow boundary and pass it through the job, not generate it inside the retry loop. For a Python billing job, that means the key should look like a business operation: invoice:{customer_id}:{period}, not uuid4() per attempt.
Result: Retries become safe because the external system can recognize the duplicate intent. The job still needs local state, but the highest-risk side effect is protected by the system that owns it.
Learning: Idempotency keys are not retry counters. They are part of the operation identity. If the key changes on every attempt, the system has retry behavior without duplicate protection.
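The difference between a stable key and a per-attempt key can be shown with a small simulation. `FakeProvider` below is a hypothetical in-memory stand-in for an external API that honors idempotency keys; it is not Stripe's client library, and the retry loop simply assumes every response is lost after the provider has already acted.

```python
import uuid


class FakeProvider:
    """Hypothetical stand-in for an external API that honors idempotency keys."""

    def __init__(self) -> None:
        self.responses: dict[str, dict] = {}
        self.invoices_created = 0

    def create_invoice(self, customer_id: str, period: str, idempotency_key: str) -> dict:
        if idempotency_key in self.responses:
            # Known key: replay the stored response, create nothing new.
            return self.responses[idempotency_key]
        self.invoices_created += 1
        response = {
            "invoice_id": f"inv_{self.invoices_created}",
            "customer": customer_id,
            "period": period,
        }
        self.responses[idempotency_key] = response
        return response


def run_with_retries(provider, customer_id, period, key_factory, attempts=3):
    """Simulate attempts whose responses are lost, so the job keeps retrying."""
    result = None
    for _ in range(attempts):
        result = provider.create_invoice(customer_id, period, key_factory())
    return result


# Key generated once at the workflow boundary and reused by every attempt.
stable_key = "invoice:cus_123:2024-05"
safe = FakeProvider()
run_with_retries(safe, "cus_123", "2024-05", key_factory=lambda: stable_key)

# Key generated inside the retry loop: every attempt looks like new work.
unsafe = FakeProvider()
run_with_retries(unsafe, "cus_123", "2024-05", key_factory=lambda: str(uuid.uuid4()))
```

In this simulation `safe` ends up with a single invoice while `unsafe` creates one per attempt, which is the failure the Learning above describes.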
Context: PostgreSQL documents INSERT ... ON CONFLICT, which lets a write handle uniqueness conflicts deterministically. This is the database-level foundation for many idempotent job claims and result records. See the PostgreSQL documentation for INSERT.
Action: A Python worker can insert an operation_key into a table with a unique constraint. If the insert succeeds, it owns the first execution. If the insert conflicts, it reads the existing row and decides whether to return, resume, or wait.
Result: The database becomes the arbiter of duplicate work. This is stronger than checking first and inserting later, because the check-then-insert pattern races under concurrency.
Learning: Idempotency is a consistency problem before it is a Python problem. The code should ask the database to enforce the invariant, not merely hope all workers observe it.
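The claim step might look like the sketch below. To stay runnable without a server it uses SQLite from the standard library, which accepts a similar `ON CONFLICT ... DO NOTHING` clause; against PostgreSQL the same statement would go through a driver such as psycopg with `%s` placeholders. The table and function names are assumptions.

```python
import sqlite3
import time


def open_claims() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE claims ("
        " operation_key TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " result_reference TEXT,"
        " created_at REAL NOT NULL)"
    )
    return conn


def claim(conn: sqlite3.Connection, operation_key: str):
    """Try to claim the operation; return (owned, status, result_reference).

    The unique constraint decides who wins. There is no separate
    'does it exist?' query that could race with another worker.
    """
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "INSERT INTO claims (operation_key, status, created_at) "
            "VALUES (?, 'pending', ?) "
            "ON CONFLICT (operation_key) DO NOTHING",
            (operation_key, time.time()),
        )
        owned = cur.rowcount == 1  # 1 row inserted: this attempt is first
        status, result_reference = conn.execute(
            "SELECT status, result_reference FROM claims WHERE operation_key = ?",
            (operation_key,),
        ).fetchone()
    return owned, status, result_reference
```

A caller that gets `owned` as false reads `status` to choose between returning the stored result, resuming, or waiting.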
Context: Powertools for AWS Lambda (Python), formerly AWS Lambda Powertools for Python, includes an idempotency utility that records invocation state in a persistence layer such as DynamoDB. Its documentation frames idempotency as protection against repeated Lambda invocations with the same payload. See the Powertools for AWS Lambda (Python) documentation on idempotency.
Action: The documented pattern is to extract an idempotency key from the event, persist execution state, and return a stored response for duplicate invocations.
Result: The handler can tolerate platform-level retries, client retries, and duplicate events without treating every invocation as new work.
Learning: Serverless and queued jobs make duplicate execution normal. The correct design assumption is at-least-once execution, not exactly-once execution.
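The extract, persist, and replay pattern can be sketched as a decorator. This is not the Powertools API: `idempotent_handler`, `InMemoryStore`, and the status names are hypothetical, and the in-memory dict only stands in for a durable persistence layer so the example runs on its own.

```python
import functools
import json


class InMemoryStore:
    """Test double for a durable persistence layer. It is NOT durable."""

    def __init__(self) -> None:
        self.records: dict[str, dict] = {}


def idempotent_handler(store: InMemoryStore, key_of):
    """Wrap a handler so duplicate events return the stored response."""

    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event):
            key = key_of(event)
            record = store.records.get(key)
            if record and record["status"] == "COMPLETED":
                return json.loads(record["response"])  # replay, no new work
            if record and record["status"] == "INPROGRESS":
                raise RuntimeError(f"operation {key} is already in progress")
            # Assumption: a real store makes this an atomic conditional write.
            store.records[key] = {"status": "INPROGRESS", "response": None}
            try:
                response = handler(event)
            except Exception:
                # Assumes the handler's own side effects are guarded,
                # so a later retry may safely start over.
                del store.records[key]
                raise
            store.records[key] = {"status": "COMPLETED", "response": json.dumps(response)}
            return response

        return wrapper

    return decorator


store = InMemoryStore()
calls = []


@idempotent_handler(store, key_of=lambda event: event["order_id"])
def handle(event):
    calls.append(event["order_id"])
    return {"order_id": event["order_id"], "status": "processed"}


first = handle({"order_id": "o-1"})
duplicate = handle({"order_id": "o-1"})  # handler body does not run again
```

Storing the response as JSON forces it into a normalized, replayable form, which is the property a duplicate invocation depends on.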
Where It Breaks
| Failure mode | Why it happens | Mitigation | Tradeoff |
|---|---|---|---|
| Key is generated inside the retry | Every attempt looks like new work | Derive the key from business identity | Requires stable input modeling |
| Claim table is separate from side effect | Local state says pending while remote work succeeded | Store remote identifiers and reconcile before creating | More code paths and provider reads |
| Check-then-insert race | Two workers observe missing state | Use unique constraints or atomic conditional writes | Pushes design into storage semantics |
| Long-running job holds a lock forever | Worker dies mid-operation | Use leases with locked_until and heartbeats | Requires timeout tuning |
| Result cannot be replayed | Duplicate attempt cannot return prior output | Persist result references or normalized responses | More storage and schema design |
| External API has no idempotency key | Provider cannot detect duplicate intent | Search by deterministic metadata before create | Reconciliation may be imperfect |
| Side effect is not reversible | Duplicate damage cannot be cheaply repaired | Guard before the side effect and add a manual repair workflow | Slower first implementation |
| Batch job mixes many identities | One failed item causes whole batch replay | Track idempotency per item, not only per batch | More rows and more observability needed |
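The lease mitigation from the table above can be sketched as a compare-and-set update on `locked_until`. The table name, the 30-second default, and passing `now` explicitly (to keep the example deterministic) are assumptions. A heartbeat is left out: it would extend `locked_until` for the current holder and would need an owner token to be safe.

```python
import sqlite3


def open_leases() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE leases ("
        " operation_key TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " attempt_count INTEGER NOT NULL DEFAULT 0,"
        " locked_until REAL NOT NULL DEFAULT 0)"
    )
    return conn


def try_acquire(conn, operation_key: str, now: float, lease_seconds: float = 30.0) -> bool:
    """Take the lease only if the work is unfinished and no live lease exists."""
    with conn:
        # Make sure the row exists; the unique key absorbs duplicates.
        conn.execute(
            "INSERT INTO leases (operation_key) VALUES (?) "
            "ON CONFLICT (operation_key) DO NOTHING",
            (operation_key,),
        )
        # Compare-and-set: the WHERE clause is the guard, so only one
        # worker can move an expired lease forward.
        cur = conn.execute(
            "UPDATE leases "
            "SET locked_until = ?, attempt_count = attempt_count + 1 "
            "WHERE operation_key = ? AND status = 'pending' AND locked_until <= ?",
            (now + lease_seconds, operation_key, now),
        )
    return cur.rowcount == 1


conn = open_leases()
worker_a = try_acquire(conn, "backfill:orders", now=100.0)     # lease runs to 130.0
worker_b = try_acquire(conn, "backfill:orders", now=110.0)     # lease still live
after_crash = try_acquire(conn, "backfill:orders", now=131.0)  # lease has expired
```

Because a dead worker's lease simply expires, no other process has to detect the crash. The timeout tuning named in the table is the cost: too short and two workers overlap, too long and recovery is slow.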
What to Do Next
- Problem: Treat every retryable Python job as an at-least-once workflow. Assume the worker can crash after any side effect and before any acknowledgement.
- Solution: Add a durable operation key, a uniqueness-backed claim record, explicit statuses, and guarded side effects. Prefer provider idempotency keys for external APIs and database constraints for local writes.
- Proof: Test the ambiguous failures. Force exceptions after the database write, after the API call, before the queue acknowledgement, and during concurrent execution. The second attempt should converge, not duplicate.
- Action: Pick one production job with retry logic and trace its side effects. If the retry generates a new identifier, performs a check-then-create, or lacks a durable completed state, it is not idempotent yet.
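A test for one ambiguous failure might look like the sketch below, which forces a crash after the side effect and before local state is written, then checks that the retry converges. `FakeTicketSystem`, `open_ticket_job`, and the `state` dict are hypothetical stand-ins. A real job would keep `state` in durable storage, and searching the provider by operation key assumes the provider lets you attach and query such metadata.

```python
class FakeTicketSystem:
    """Hypothetical external system that can be searched by operation key."""

    def __init__(self) -> None:
        self.tickets: list[dict] = []

    def create(self, operation_key: str, title: str) -> dict:
        ticket = {
            "id": f"T-{len(self.tickets) + 1}",
            "operation_key": operation_key,
            "title": title,
        }
        self.tickets.append(ticket)
        return ticket

    def find_by_operation_key(self, operation_key: str):
        return next((t for t in self.tickets if t["operation_key"] == operation_key), None)


class CrashAfterSideEffect(Exception):
    """Injected fault: the ticket exists, but the job never recorded it."""


def open_ticket_job(tickets, state, operation_key, title, crash_after_create=False):
    record = state.get(operation_key)
    if record and record["status"] == "completed":
        return record["ticket_id"]  # terminal state: nothing left to do
    # Reconcile before creating: a prior attempt may have succeeded remotely.
    ticket = tickets.find_by_operation_key(operation_key)
    if ticket is None:
        ticket = tickets.create(operation_key, title)
        if crash_after_create:
            raise CrashAfterSideEffect(operation_key)
    state[operation_key] = {"status": "completed", "ticket_id": ticket["id"]}
    return ticket["id"]


tickets, state = FakeTicketSystem(), {}
key = "ticket:deploy-failure:build-981"

try:
    open_ticket_job(tickets, state, key, "Deploy failed", crash_after_create=True)
except CrashAfterSideEffect:
    pass  # the first attempt 'failed' although the ticket was created

ticket_id = open_ticket_job(tickets, state, key, "Deploy failed")  # the retry
```

The assertion worth making is on the outside world rather than the return value: after both attempts exactly one ticket exists, and the local state points at it.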