Queues vs Streams: The Decision Engineers Keep Reversing
The queue looked cheaper until the first replay request turned a clean incident into a data archaeology exercise.
Situation
| Attribute | Queue | Stream |
|---|---|---|
| Primary invariant | Task completion — work disappears after success | Event retention — facts persist until retention expires |
| Delivery model | At-most-once or at-least-once; broker assigns work | At-least-once; consumers track own offset |
| Consumer model | Work pool — claim, process, delete | Consumer group — track offset, replay independently |
| Replay | No — messages deleted on success | Yes — any consumer can reread from any offset still within retention |
| Multiple consumers | Requires fanout or pub/sub layer | Native consumer groups, each at own position |
| Evidence after success | Gone — observability must be externalized | Retained — log is the audit trail |
| AWS examples | SQS, Amazon MQ | Kinesis, Amazon MSK (Kafka) |
| Open-source examples | RabbitMQ, Celery (task framework on top of a broker) | Apache Kafka, Apache Pulsar, Redpanda |
| Use when | Job queues, email delivery, API calls, one-time work | CDC, analytics pipelines, audit logs, event sourcing |
Most teams choose between queues and streams too early. The decision is usually framed as an API preference: push work into a queue, or publish events into a stream. That framing is too small.
The real decision is about operational memory.
A queue is optimized for work assignment. A producer creates a task, a worker claims it, and successful processing removes it from the system. That is the right shape for email delivery, image resizing, webhook dispatch, fraud checks, and other jobs where the business cares that work completes once.
A stream is optimized for durable event history. A producer appends facts, consumers track their own position, and the log remains available for replay until retention expires. That is the right shape for audit pipelines, analytics feeds, change data capture, machine learning features, and projections where multiple consumers need different interpretations of the same event.
The confusion starts because both can move messages asynchronously. Both can buffer spikes. Both can decouple producers from consumers. Under light load, the first implementation often works either way.
Then production starts asking questions the original abstraction cannot answer.
The Problem
The failure mode is not that engineers pick the wrong technology. It is that requirements change direction after the system already encodes a delivery model.
A team starts with a queue because there is one consumer and the task should disappear after completion. Three months later, analytics wants the same events. Compliance wants a retained trail. A backfill is needed because a bug dropped a field. The queue has already deleted the evidence.
Another team starts with a stream because replay sounds powerful. The workload is actually command execution: charge this invoice, send this notification, call this partner API. Consumers retry, fall behind, and duplicate side effects because the system stored history but did not define ownership of work.
The question is not, “Should we use Kafka or SQS?”
The question is: is this data a disposable unit of work, or a durable fact that future systems must reinterpret?
The Decision Boundary
Use queues when the system’s primary invariant is task completion. Use streams when the system’s primary invariant is event retention.
```mermaid
flowchart TD
    A[producer — business change] --> B{primary invariant}
    B --> C[queue — assign work]
    B --> D[stream — retain facts]
    C --> E[worker pool — claim task]
    E --> F[acknowledge — remove task]
    D --> G[event log — append record]
    G --> H[consumer group — track offset]
    G --> I[new consumer — replay history]
    H --> J[projection — current view]
    I --> K[backfill — rebuild view]
```
What this diagram shows: A single producer branches into two fundamentally different systems. A queue assigns work: tasks are claimed by a worker pool and removed on acknowledgment. A stream retains facts: events are appended to a durable log, consumer groups track their read position by offset, and new consumers can replay whatever history is still within retention. The branching point is whether the message is a disposable unit of work (queue) or a durable fact (stream).
A queue makes work distribution easy because the broker owns the claim. Visibility timeouts, acknowledgements, dead letter queues, and retry policies exist to answer one question: which worker is responsible for this task now?
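The claim-and-acknowledge cycle can be sketched with a toy in-memory queue. This is an illustration under simplifying assumptions, not how any real broker is implemented: `ToyQueue`, `visibility_timeout`, and `max_receives` are hypothetical names chosen for this example, and time is passed in explicitly so the behavior is easy to follow.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Message:
    body: str
    receive_count: int = 0
    invisible_until: float = 0.0


@dataclass
class ToyQueue:
    """In-memory stand-in for a broker that leases tasks to workers."""
    visibility_timeout: float = 30.0
    max_receives: int = 3
    messages: dict = field(default_factory=dict)
    dead_letters: list = field(default_factory=list)
    _next_id: int = 0

    def send(self, body):
        self._next_id += 1
        self.messages[self._next_id] = Message(body)
        return self._next_id

    def receive(self, now=None):
        now = time.monotonic() if now is None else now
        for msg_id, msg in list(self.messages.items()):
            if msg.invisible_until > now:
                continue  # another worker currently holds the lease
            if msg.receive_count >= self.max_receives:
                # Too many failed attempts: isolate the task for inspection.
                self.dead_letters.append(self.messages.pop(msg_id))
                continue
            msg.receive_count += 1
            msg.invisible_until = now + self.visibility_timeout
            return msg_id, msg.body
        return None

    def delete(self, msg_id):
        # Acknowledgement: success removes the task, and with it the evidence.
        self.messages.pop(msg_id, None)
```

A worker that crashes simply never calls `delete`, so the lease expires and the task becomes visible again. That is also why handlers behind a queue like this have to tolerate seeing the same task twice.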
A stream makes replay easy because the broker owns the ordered log. Offsets, partitions, retention, compaction, and consumer groups exist to answer a different question: which part of the history has this consumer observed?
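The offset model can be sketched the same way. The class below is a toy single-partition log, not a Kafka client: `ToyLog`, `poll`, `commit`, and `seek` are hypothetical names for this illustration, and retention and compaction are deliberately left out.

```python
from collections import defaultdict


class ToyLog:
    """Append-only log for one partition; reading never removes records."""

    def __init__(self):
        self.records = []
        # Consumer group name -> next offset that group will read.
        self.committed = defaultdict(int)

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

    def poll(self, group, max_records=10):
        start = self.committed[group]
        batch = self.records[start:start + max_records]
        return list(enumerate(batch, start))  # (offset, record) pairs

    def commit(self, group, next_offset):
        self.committed[group] = next_offset

    def seek(self, group, offset):
        # Rewinding one group's position is all that replay requires.
        self.committed[group] = offset
```

Each group's progress is just a number stored next to the log. A new group starts at offset 0 and sees everything that is still there, and an existing group can rewind without affecting anyone else.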
Those are not cosmetic differences. They determine how incidents are debugged.
With a queue, the happy path deletes evidence. Observability must be externalized into logs, traces, metrics, or a separate audit store. With a stream, the happy path preserves evidence, but every consumer must handle replay, ordering limits, duplicate delivery, and offset management.
A queue turns time into responsibility.
A stream turns time into data.
In Practice
Context: Amazon SQS documents a queue model built around message visibility, deletion after successful processing, and dead letter queues for messages that cannot be processed. The documented pattern is work dispatch: a consumer receives a message, processes it, and deletes it.
Action: That model fits workloads where the system can tolerate a message becoming invisible while a worker owns it, and where completion removes the need for the broker to retain the task. Engineers should pair it with idempotent handlers because SQS standard queues can deliver messages more than once.
Result: The operational surface is simple for worker pools. Scaling consumers increases throughput. Failed jobs can be isolated. But replaying a historical business event is not a native operation once messages are deleted.
Learning: A queue is not a database of facts. If the business later needs audit, analytics, or reconstruction, the architecture needs a separate durable event store or an outbox before the queue boundary.
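Because a standard queue can hand the same message to a worker more than once, the handler is where duplicates have to be absorbed. A minimal sketch, assuming each message carries a stable ID: `IdempotentHandler` and its `completed` store are hypothetical, and a production version would keep that store somewhere durable rather than in process memory.

```python
class IdempotentHandler:
    """Runs a side effect at most once per message ID within this process."""

    def __init__(self, side_effect):
        self.side_effect = side_effect
        # message_id -> result. Assumption: in practice this lives in a
        # durable store shared by all workers, not in a local dict.
        self.completed = {}

    def handle(self, message_id, payload):
        if message_id in self.completed:
            # Duplicate delivery: return the earlier result, do no new work.
            return self.completed[message_id]
        result = self.side_effect(payload)
        self.completed[message_id] = result
        return result
```

Note the gap this sketch leaves open: a crash after the side effect but before the result is recorded will still repeat the work on redelivery. Closing it usually means recording the attempt in the same transaction as the effect, or relying on an idempotency key that the downstream system honors.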
Context: Apache Kafka’s design, as described by Jay Kreps and the original LinkedIn engineering work, treats the log as a durable, partitioned sequence of records. Consumers maintain positions independently, which lets multiple applications read the same event history at different speeds.
Action: That model fits event propagation, change data capture, and derived views. A payments service can publish an invoice event once while accounting, analytics, and search indexers consume independently.
Result: New consumers can be introduced without changing the producer. A broken projection can be rebuilt from retained events. But the cost moves into schema discipline, partition design, consumer lag management, and careful handling of side effects during replay.
Learning: A stream is not a magic queue with history. If a consumer sends emails or charges cards, replay can repeat the real world unless the side effect is guarded by idempotency keys and durable execution records.
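One way to make replay safe is to separate the pure part of a consumer (rebuilding a view) from the part that touches the outside world, and to guard the latter with a durable execution record. The sketch below uses SQLite from the standard library as the record store. The table name, the `receipt:` key format, and the event fields are all assumptions made for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS executions (idempotency_key TEXT PRIMARY KEY)"
    )
    return db


def claim(db, key):
    """Return True only the first time a key is recorded."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO executions (idempotency_key) VALUES (?)", (key,)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # key already present: the side effect was claimed before


def consume(db, events, send_receipt):
    """Rebuild a balance view from events; send each receipt at most once."""
    balances = {}
    for event in events:
        account = event["account"]
        # Pure projection logic: safe to repeat on every replay.
        balances[account] = balances.get(account, 0) + event["amount"]
        # Side effect: guarded by the execution record.
        if claim(db, f"receipt:{event['event_id']}"):
            send_receipt(account, event["amount"])
    return balances
```

Claiming before sending gives at-most-once behavior for the side effect: a crash between the claim and the send loses that receipt rather than duplicating it. Whether that is the right trade depends on the side effect. For card charges, an idempotency key passed through to the payment provider is usually the safer choice.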
Context: PostgreSQL logical decoding and replication slots show the same boundary in database form. The write-ahead log (WAL) can be consumed as a stream of changes, but slots also retain WAL until consumers advance.
Action: Teams use this behavior for change data capture into search, caches, warehouses, and event pipelines.
Result: The database becomes a source of ordered change history, but slow consumers create retention pressure. If lag is ignored, disk growth becomes an availability risk.
Learning: Replayable history is an operational liability as well as a capability. Retention must be budgeted, monitored, and owned.
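Budgeting retention starts with measuring it. PostgreSQL reports log positions as LSNs of the form `high/low` in hexadecimal, and the retained WAL for a slot is the byte distance between the current position (from `pg_current_wal_lsn()`) and the slot's `restart_lsn` in `pg_replication_slots`. The helper below does that arithmetic on values already fetched from the database. The function names, slot names, and budget are hypothetical.

```python
def parse_lsn(lsn):
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn, restart_lsn):
    """Bytes of WAL the server must keep for a slot at restart_lsn."""
    return parse_lsn(current_lsn) - parse_lsn(restart_lsn)


def slots_over_budget(current_lsn, slots, budget_bytes):
    """Return names of slots retaining more WAL than the budget allows.

    `slots` maps slot name -> restart_lsn, as read from pg_replication_slots.
    """
    return sorted(
        name
        for name, restart_lsn in slots.items()
        if retained_wal_bytes(current_lsn, restart_lsn) > budget_bytes
    )
```

The same difference can be computed in SQL with `pg_wal_lsn_diff`. What matters operationally is that the number feeds an alert with a named owner, so a stalled consumer is noticed before the disk fills.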
Where It Breaks
| Decision | Works When | Breaks When | Engineering Control |
|---|---|---|---|
| Queue | One logical owner must complete work | Later consumers need old events | Add outbox, audit table, or stream before deletion |
| Stream | Events need replay or multiple independent consumers | Consumers perform non-idempotent side effects | Store execution records and idempotency keys |
| Queue with fanout | Several workers perform equivalent work | Each downstream needs its own interpretation | Use pub/sub or a stream with separate consumer groups |
| Stream as task queue | Ordering and history matter more than claiming | Work must be leased to exactly one worker | Add task ownership table or use a real queue |
| Long stream retention | Backfills and delayed consumers are expected | Storage and lag ownership are unclear | Define retention, compaction, and lag alerts |
| Short queue retention | Failures are resolved quickly | Incidents require forensic reconstruction | Persist facts before enqueueing tasks |
The most expensive architecture is the hybrid built accidentally: a queue used as a stream, with teams copying messages into side stores after the fact; or a stream used as a queue, with every consumer reinventing leases, retries, and dead letter behavior.
The right hybrid is deliberate. A common pattern is transactional outbox first, then two paths: publish durable facts to a stream, and enqueue derived commands for workers. The outbox records what happened. The queue drives what must be done. The stream lets future systems reinterpret the facts.
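A minimal sketch of that outbox-first split, using SQLite so it runs anywhere. Everything named here is an assumption for illustration: the `invoices` and `outbox` tables, the `invoice_created` event type, the `send_invoice_email` command, and the `publish_fact` and `enqueue_command` callbacks that stand in for a real stream producer and queue client.

```python
import json
import sqlite3


def open_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE invoices (id TEXT PRIMARY KEY, amount INTEGER NOT NULL);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            relayed INTEGER NOT NULL DEFAULT 0
        );
    """)
    return db


def create_invoice(db, invoice_id, amount):
    payload = json.dumps({"invoice_id": invoice_id, "amount": amount})
    with db:  # one transaction: the state change and the fact commit together
        db.execute("INSERT INTO invoices VALUES (?, ?)", (invoice_id, amount))
        db.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("invoice_created", payload),
        )


def relay(db, publish_fact, enqueue_command):
    """Forward unrelayed outbox rows to both paths; return how many were sent."""
    rows = db.execute(
        "SELECT seq, event_type, payload FROM outbox WHERE relayed = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_type, payload in rows:
        event = json.loads(payload)
        publish_fact(event_type, event)  # stream path: what happened
        enqueue_command(  # queue path: what must be done
            "send_invoice_email", {"invoice_id": event["invoice_id"]}
        )
        with db:
            db.execute("UPDATE outbox SET relayed = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

The relay here is at-least-once: a crash after publishing but before the row is marked will forward the same row again on the next pass. That is acceptable only if stream consumers and queue workers are idempotent, which is the same discipline both models already demand.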
That split keeps the system honest.
What to Do Next
- Problem: If the message represents work that should disappear after success, a stream will force every consumer to carry task execution semantics.
- Solution: Use a queue for command execution, retries, worker scaling, and dead letter isolation.
- Proof: If the message represents a business fact that future consumers may need, a queue will delete the source of truth too early.
- Action: Put durable facts in an outbox or stream, put disposable work in a queue, and make the boundary explicit in design reviews.