Queues vs Streams: The Decision Engineers Keep Reversing

The queue looked cheaper until the first replay request turned a clean incident into a data archaeology exercise.

Situation

Attribute	Queue	Stream
Primary invariant	Task completion — work disappears after success	Event retention — facts persist until retention expires
Delivery model	At-most-once or at-least-once; broker assigns work	Typically at-least-once; consumers track own offset
Consumer model	Work pool — claim, process, delete	Consumer group — track offset, replay independently
Replay	No, messages are deleted on success	Yes, any consumer can reread from any offset still within retention
Multiple consumers	Requires fanout or pub/sub layer	Native consumer groups, each at own position
Evidence after success	Gone — observability must be externalized	Retained — log is the audit trail
AWS examples	SQS, Amazon MQ	Kinesis, Amazon MSK (Kafka)
Open-source examples	RabbitMQ, Celery	Apache Kafka, Apache Pulsar, Redpanda
Use when	Job queues, email delivery, API calls, one-time work	CDC, analytics pipelines, audit logs, event sourcing

Most teams choose between queues and streams too early. The decision is usually framed as an API preference: push work into a queue, or publish events into a stream. That framing is too small.

The real decision is about operational memory.

A queue is optimized for work assignment. A producer creates a task, a worker claims it, and successful processing removes it from the system. That is the right shape for email delivery, image resizing, webhook dispatch, fraud checks, and other jobs where the business cares that work completes once.

A stream is optimized for durable event history. A producer appends facts, consumers track their own position, and the log remains available for replay until retention expires. That is the right shape for audit pipelines, analytics feeds, change data capture, machine learning features, and projections where multiple consumers need different interpretations of the same event.

The confusion starts because both can move messages asynchronously. Both can buffer spikes. Both can decouple producers from consumers. Under light load, the first implementation often works either way.

Then production starts asking questions the original abstraction cannot answer.

The Problem

The failure mode is not that engineers pick the wrong technology. It is that requirements change direction after the system already encodes a delivery model.

A team starts with a queue because there is one consumer and the task should disappear after completion. Three months later, analytics wants the same events. Compliance wants a retained trail. A backfill is needed because a bug dropped a field. The queue has already deleted the evidence.

Another team starts with a stream because replay sounds powerful. The workload is actually command execution: charge this invoice, send this notification, call this partner API. Consumers retry, fall behind, and duplicate side effects because the system stored history but did not define ownership of work.

The question is not, “Should we use Kafka or SQS?”

The question is: is this data a disposable unit of work, or a durable fact that future systems must reinterpret?

The Decision Boundary

Use queues when the system’s primary invariant is task completion. Use streams when the system’s primary invariant is event retention.

flowchart TD
    A[producer — business change] --> B{primary invariant}
    B --> C[queue — assign work]
    B --> D[stream — retain facts]
    C --> E[worker pool — claim task]
    E --> F[acknowledge — remove task]
    D --> G[event log — append record]
    G --> H[consumer group — track offset]
    G --> I[new consumer — replay history]
    H --> J[projection — current view]
    I --> K[backfill — rebuild view]

What this diagram shows: A single producer branches into two fundamentally different systems. A queue assigns work: tasks are claimed by a worker pool and removed on acknowledgement. A stream retains facts: events are appended to a durable log, consumer groups track their read position by offset, and new consumers can replay the retained history. The branching point is whether the message is a unit of work (queue) or a durable fact (stream).

A queue makes work distribution easy because the broker owns the claim. Visibility timeouts, acknowledgements, dead letter queues, and retry policies exist to answer one question: which worker is responsible for this task now?
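To make "the broker owns the claim" concrete, here is a minimal in-memory sketch. It leases a task to one worker for a visibility timeout, deletes it on acknowledgement, and sets it aside after too many receives. The class `LeaseQueue`, its parameters, and the explicit `now` argument are illustrative assumptions, not the API of SQS or any other broker.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: str
    body: str
    receive_count: int = 0
    invisible_until: float = 0.0  # lease expiry; claimable once now >= this value


class LeaseQueue:
    """Toy queue (hypothetical): the broker decides which worker owns a task."""

    def __init__(self, visibility_timeout: float, max_receives: int) -> None:
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives
        self.tasks: dict[str, Task] = {}
        self.dead_letters: list[Task] = []

    def send(self, task_id: str, body: str) -> None:
        self.tasks[task_id] = Task(task_id, body)

    def receive(self, now: float) -> Optional[Task]:
        for task in list(self.tasks.values()):
            if task.invisible_until > now:
                continue  # another worker still holds the lease
            if task.receive_count >= self.max_receives:
                # Retry budget exhausted: isolate the task instead of redelivering it.
                self.dead_letters.append(self.tasks.pop(task.task_id))
                continue
            task.receive_count += 1
            task.invisible_until = now + self.visibility_timeout
            return task
        return None

    def ack(self, task_id: str) -> None:
        # Success deletes the task, so nothing is left to replay afterwards.
        self.tasks.pop(task_id, None)
```

A worker that crashes simply never calls `ack`; the lease expires and the next `receive` hands the task to someone else. That is also why handlers behind such a queue need to tolerate seeing the same task twice.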

A stream makes replay easy because the broker owns the ordered log. Offsets, partitions, retention, compaction, and consumer groups exist to answer a different question: which part of the history has this consumer observed?
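The same idea in toy form: a single-partition log where reading never removes anything and each consumer group only moves its own offset. `EventLog` and its method names are assumptions made for illustration; they are not Kafka's client API, and retention, partitions, and compaction are left out.

```python
from __future__ import annotations


class EventLog:
    """Toy append-only log (hypothetical): history stays, consumers track position."""

    def __init__(self) -> None:
        self.records: list[str] = []
        self.offsets: dict[str, int] = {}  # consumer group -> next offset to read

    def append(self, record: str) -> int:
        self.records.append(record)
        return len(self.records) - 1  # offset assigned to the new record

    def poll(self, group: str, max_records: int = 10) -> list[tuple[int, str]]:
        # Reading does not delete; it returns (offset, record) pairs from the
        # group's current position. An unknown group starts at offset 0.
        start = self.offsets.get(group, 0)
        return list(enumerate(self.records[start:start + max_records], start))

    def commit(self, group: str, next_offset: int) -> None:
        # Committing a smaller offset rewinds this group only (a replay).
        self.offsets[group] = next_offset

    def lag(self, group: str) -> int:
        return len(self.records) - self.offsets.get(group, 0)
```

A group that never existed before polls from offset 0 and sees everything the log still holds, which is the "new consumer replays history" branch of the diagram.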

Those are not cosmetic differences. They determine how incidents are debugged.

With a queue, the happy path deletes evidence. Observability must be externalized into logs, traces, metrics, or a separate audit store. With a stream, the happy path preserves evidence, but every consumer must handle replay, ordering limits, duplicate delivery, and offset management.

A queue turns time into responsibility.

A stream turns time into data.

In Practice

Context: Amazon SQS documents a queue model built around message visibility, deletion after successful processing, and dead letter queues for messages that cannot be processed. The documented pattern is work dispatch: a consumer receives a message, processes it, and deletes it.

Action: That model fits workloads where the system can tolerate a message becoming invisible while a worker owns it, and where completion removes the need for the broker to retain the task. Engineers should pair it with idempotent handlers because SQS standard queues can deliver messages more than once.

Result: The operational surface is simple for worker pools. Scaling consumers increases throughput. Failed jobs can be isolated. But replaying a historical business event is not a native operation once messages are deleted.

Learning: A queue is not a database of facts. If the business later needs audit, analytics, or reconstruction, the architecture needs a separate durable event store or an outbox before the queue boundary.

Context: Apache Kafka’s design, as described by Jay Kreps and the original LinkedIn engineering work, treats the log as a durable, partitioned sequence of records. Consumers maintain positions independently, which lets multiple applications read the same event history at different speeds.

Action: That model fits event propagation, change data capture, and derived views. A payments service can publish an invoice event once while accounting, analytics, and search indexers consume independently.

Result: New consumers can be introduced without changing the producer. A broken projection can be rebuilt from retained events. But the cost moves into schema discipline, partition design, consumer lag management, and careful handling of side effects during replay.

Learning: A stream is not a magic queue with history. If a consumer sends emails or charges cards, replay can repeat the real world unless the side effect is guarded by idempotency keys and durable execution records.
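One way to picture that guard is a small execution-record table keyed by an idempotency key, sketched below with SQLite from the Python standard library. The table name, the key format, and the `run_once` helper are assumptions for illustration. The sketch is also deliberately naive: a crash between the side effect and the insert would still repeat the effect, so real systems usually pass the same key to the downstream API as well.

```python
import sqlite3
from typing import Callable


def make_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE executions (idempotency_key TEXT PRIMARY KEY, result TEXT)"
    )
    return conn


def run_once(
    conn: sqlite3.Connection, idempotency_key: str, side_effect: Callable[[], str]
) -> str:
    """Skip the side effect if a record for this key already exists."""
    row = conn.execute(
        "SELECT result FROM executions WHERE idempotency_key = ?",
        (idempotency_key,),
    ).fetchone()
    if row is not None:
        return row[0]  # replayed or redelivered event: reuse the recorded result
    result = side_effect()
    with conn:  # commit the execution record
        conn.execute(
            "INSERT INTO executions (idempotency_key, result) VALUES (?, ?)",
            (idempotency_key, result),
        )
    return result
```

With a key such as `"invoice-42:receipt-email"` (a made-up format), replaying the invoice event a second time returns the stored result instead of sending a second email.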

Context: PostgreSQL logical decoding and replication slots show the same boundary in database form. The write-ahead log (WAL) can be consumed as a stream of changes, but a slot also retains WAL until its consumer advances.

Action: Teams use this behavior for change data capture into search, caches, warehouses, and event pipelines.

Result: The database becomes a source of ordered change history, but slow consumers create retention pressure. If lag is ignored, disk growth becomes an availability risk.

Learning: Replayable history is an operational liability as well as a capability. Retention must be budgeted, monitored, and owned.
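Budgeting retention can start as plain arithmetic: when a consumer reads more slowly than producers write, the backlog grows at the difference between the two rates. The helper below is a rough sketch under that assumption; the function name and the example numbers are hypothetical, not measurements from any system.

```python
from typing import Optional


def seconds_until_disk_full(
    free_bytes: float,
    write_rate_bytes_per_s: float,
    consume_rate_bytes_per_s: float,
) -> Optional[float]:
    """Estimate how long retained history can grow before free space runs out.

    Returns None when the consumer keeps up and the backlog is not growing.
    """
    growth = write_rate_bytes_per_s - consume_rate_bytes_per_s
    if growth <= 0:
        return None
    return free_bytes / growth


GIB = 1024 ** 3
MIB = 1024 ** 2

# Hypothetical numbers: 200 GiB free, 5 MiB/s written, 1 MiB/s consumed.
remaining = seconds_until_disk_full(200 * GIB, 5 * MIB, 1 * MIB)
print(remaining / 3600)  # roughly 14.2 hours of headroom
```

A figure like this turns "lag is high" into a deadline, which is usually what an alert threshold and an owner need.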

Where It Breaks

Decision	Works When	Breaks When	Engineering Control
Queue	One logical owner must complete work	Later consumers need old events	Add outbox, audit table, or stream before deletion
Stream	Events need replay or multiple independent consumers	Consumers perform non-idempotent side effects	Store execution records and idempotency keys
Queue with fanout	Several workers perform equivalent work	Each downstream needs its own interpretation	Use pub/sub or stream with separate consumer groups
Stream as task queue	Ordering and history matter more than claiming	Work must be leased to exactly one worker	Add task ownership table or use a real queue
Long stream retention	Backfills and delayed consumers are expected	Storage and lag ownership are unclear	Define retention, compaction, and lag alerts
Short queue retention	Failures are resolved quickly	Incidents require forensic reconstruction	Persist facts before enqueueing tasks

The most expensive architecture is the hybrid built accidentally: a queue used as a stream, with teams copying messages into side stores after the fact; or a stream used as a queue, with every consumer reinventing leases, retries, and dead letter behavior.

The right hybrid is deliberate. A common pattern is transactional outbox first, then two paths: publish durable facts to a stream, and enqueue derived commands for workers. The outbox records what happened. The queue drives what must be done. The stream lets future systems reinterpret the facts.

That split keeps the system honest.

What to Do Next

Problem: If the message represents work that should disappear after success, a stream will force every consumer to carry task execution semantics.
Solution: Use a queue for command execution, retries, worker scaling, and dead letter isolation.
Proof: The inverse failure is just as concrete. If the message represents a business fact that future consumers may need, a queue will delete the source of truth too early.
Action: Put durable facts in an outbox or stream, put disposable work in a queue, and make the boundary explicit in design reviews.


Rajiv
