The queue looked cheaper until the first replay request turned a clean incident into a data archaeology exercise.

Situation

| Attribute | Queue | Stream |
|---|---|---|
| Primary invariant | Task completion: work disappears after success | Event retention: facts persist until retention expires |
| Delivery model | At-most-once or at-least-once; broker assigns work | At-least-once; consumers track own offset |
| Consumer model | Work pool: claim, process, delete | Consumer group: track offset, replay independently |
| Replay | No: messages deleted on success | Yes: any consumer can reread from any offset |
| Multiple consumers | Requires fanout or pub/sub layer | Native consumer groups, each at own position |
| Evidence after success | Gone: observability must be externalized | Retained: log is the audit trail |
| AWS examples | SQS, Amazon MQ | Kinesis, Amazon MSK (Kafka) |
| Open-source examples | RabbitMQ, Celery | Apache Kafka, Apache Pulsar, Redpanda |
| Use when | Job queues, email delivery, API calls, one-time work | CDC, analytics pipelines, audit logs, event sourcing |

Most teams choose between queues and streams too early. The decision is usually framed as an API preference: push work into a queue, or publish events into a stream. That framing is too small.

The real decision is about operational memory.

A queue is optimized for work assignment. A producer creates a task, a worker claims it, and successful processing removes it from the system. That is the right shape for email delivery, image resizing, webhook dispatch, fraud checks, and other jobs where the business cares that work completes once.

A stream is optimized for durable event history. A producer appends facts, consumers track their own position, and the log remains available for replay until retention expires. That is the right shape for audit pipelines, analytics feeds, change data capture, machine learning features, and projections where multiple consumers need different interpretations of the same event.

The confusion starts because both can move messages asynchronously. Both can buffer spikes. Both can decouple producers from consumers. Under light load, the first implementation often works either way.

Then production starts asking questions the original abstraction cannot answer.

The Problem

The failure mode is not that engineers pick the wrong technology. It is that requirements change direction after the system already encodes a delivery model.

A team starts with a queue because there is one consumer and the task should disappear after completion. Three months later, analytics wants the same events. Compliance wants a retained trail. A backfill is needed because a bug dropped a field. The queue has already deleted the evidence.

Another team starts with a stream because replay sounds powerful. The workload is actually command execution: charge this invoice, send this notification, call this partner API. Consumers retry, fall behind, and duplicate side effects because the system stored history but did not define ownership of work.

The question is not, “Should we use Kafka or SQS?”

The question is: is this data a disposable unit of work, or a durable fact that future systems must reinterpret?

The Decision Boundary

Use queues when the system’s primary invariant is task completion. Use streams when the system’s primary invariant is event retention.

flowchart TD
    A[producer — business change] --> B{primary invariant}
    B --> C[queue — assign work]
    B --> D[stream — retain facts]
    C --> E[worker pool — claim task]
    E --> F[acknowledge — remove task]
    D --> G[event log — append record]
    G --> H[consumer group — track offset]
    G --> I[new consumer — replay history]
    H --> J[projection — current view]
    I --> K[backfill — rebuild view]

What this diagram shows: A single producer feeds one of two fundamentally different systems. A queue assigns work: tasks are claimed by a worker pool and removed on acknowledgment. A stream retains facts: events are appended to a durable log, consumer groups track their read position by offset, and new consumers can replay whatever history is still within retention. The branching point is whether the message is a disposable unit of work (queue) or a durable fact (stream).

A queue makes work distribution easy because the broker owns the claim. Visibility timeouts, acknowledgements, dead letter queues, and retry policies exist to answer one question: which worker is responsible for this task now?
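To make the ownership question concrete, here is a minimal in-memory sketch of a lease-based queue. It is not how SQS or any real broker is implemented; `LeaseQueue`, `max_receives`, and the explicit `now` argument are hypothetical names chosen for illustration, and time is passed in by hand so the behavior is easy to follow.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Message:
    id: int
    body: str
    receive_count: int = 0
    invisible_until: float = 0.0  # lease expiry; 0.0 means claimable now


class LeaseQueue:
    """Toy queue (hypothetical): the broker owns the claim on each message."""

    def __init__(self, visibility_timeout: float, max_receives: int):
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives
        self.messages: Dict[int, Message] = {}
        self.dead_letters: List[Message] = []
        self._ids = itertools.count(1)

    def send(self, body: str) -> int:
        msg = Message(id=next(self._ids), body=body)
        self.messages[msg.id] = msg
        return msg.id

    def receive(self, now: float) -> Optional[Message]:
        """Lease the first visible message, or return None if none is claimable."""
        for msg in list(self.messages.values()):
            if msg.invisible_until > now:
                continue  # another worker still holds the lease
            if msg.receive_count >= self.max_receives:
                # Too many failed attempts: isolate instead of retrying forever.
                self.dead_letters.append(self.messages.pop(msg.id))
                continue
            msg.receive_count += 1
            msg.invisible_until = now + self.visibility_timeout
            return msg
        return None

    def ack(self, msg_id: int) -> None:
        """Successful processing deletes the task, and the evidence with it."""
        self.messages.pop(msg_id, None)
```

If a worker never calls `ack`, the lease expires and the next `receive` hands the same task to someone else. Once `ack` runs, nothing in the queue records that the task ever existed.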

A stream makes replay easy because the broker owns the ordered log. Offsets, partitions, retention, compaction, and consumer groups exist to answer a different question: which part of the history has this consumer observed?
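The same idea can be sketched for a stream. The toy below models a single partition with no retention limit, which real systems such as Kafka do not assume; `EventLog`, `poll`, `commit`, and `seek` are illustrative names for this sketch, not a real client API.

```python
from typing import Dict, List, Tuple


class EventLog:
    """Toy single-partition log (hypothetical): reads never delete records."""

    def __init__(self) -> None:
        self.records: List[str] = []
        self.committed: Dict[str, int] = {}  # consumer group -> next offset to read

    def append(self, record: str) -> int:
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

    def poll(self, group: str, max_records: int = 10) -> List[Tuple[int, str]]:
        """Return records starting at the group's committed offset."""
        start = self.committed.get(group, 0)
        end = min(start + max_records, len(self.records))
        return [(offset, self.records[offset]) for offset in range(start, end)]

    def commit(self, group: str, next_offset: int) -> None:
        """Record how far this group has read; other groups are unaffected."""
        self.committed[group] = next_offset

    def seek(self, group: str, offset: int) -> None:
        """Rewind one group to replay history from the given offset."""
        self.committed[group] = offset
```

A group that joins late starts at offset 0 and sees everything; a group that calls `seek` replays without disturbing anyone else. The broker tracks what each consumer has observed, not who owns a task.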

Those are not cosmetic differences. They determine how incidents are debugged.

With a queue, the happy path deletes evidence. Observability must be externalized into logs, traces, metrics, or a separate audit store. With a stream, the happy path preserves evidence, but every consumer must handle replay, ordering limits, duplicate delivery, and offset management.

A queue turns time into responsibility.

A stream turns time into data.

In Practice

Context: Amazon SQS documents a queue model built around message visibility, deletion after successful processing, and dead letter queues for messages that cannot be processed. The documented pattern is work dispatch: a consumer receives a message, processes it, and deletes it.

Action: That model fits workloads where the system can tolerate a message becoming invisible while a worker owns it, and where completion removes the need for the broker to retain the task. Engineers should pair it with idempotent handlers because SQS standard queues can deliver messages more than once.

Result: The operational surface is simple for worker pools. Scaling consumers increases throughput. Failed jobs can be isolated. But replaying a historical business event is not a native operation once messages are deleted.

Learning: A queue is not a database of facts. If the business later needs audit, analytics, or reconstruction, the architecture needs a separate durable event store or an outbox before the queue boundary.

Context: Apache Kafka’s design, as described by Jay Kreps and the original LinkedIn engineering work, treats the log as a durable, partitioned sequence of records. Consumers maintain positions independently, which lets multiple applications read the same event history at different speeds.

Action: That model fits event propagation, change data capture, and derived views. A payments service can publish an invoice event once while accounting, analytics, and search indexers consume independently.

Result: New consumers can be introduced without changing the producer. A broken projection can be rebuilt from retained events. But the cost moves into schema discipline, partition design, consumer lag management, and careful handling of side effects during replay.

Learning: A stream is not a magic queue with history. If a consumer sends emails or charges cards, replay can repeat the real world unless the side effect is guarded by idempotency keys and durable execution records.
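One way to guard a side effect against replay is to claim an idempotency key in a durable store before acting. The sketch below uses SQLite from the Python standard library; the `executions` table, the `run_once` helper, and the key format are assumptions made for illustration, not a prescribed schema.

```python
import sqlite3
from typing import Callable


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a store of execution records (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS executions ("
        " idempotency_key TEXT PRIMARY KEY,"
        " status TEXT NOT NULL)"
    )
    return conn


def run_once(conn: sqlite3.Connection, key: str, side_effect: Callable[[], None]) -> bool:
    """Run side_effect at most once per key. Returns True if it ran in this call."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO executions (idempotency_key, status) VALUES (?, 'started')",
                (key,),
            )
    except sqlite3.IntegrityError:
        return False  # key already claimed by an earlier delivery or replay
    side_effect()
    with conn:
        conn.execute(
            "UPDATE executions SET status = 'done' WHERE idempotency_key = ?",
            (key,),
        )
    return True
```

A replayed event that derives the same key, for example `"email:" + event_id`, is skipped. The trade-off in this sketch is deliberate: if the process dies after claiming the key but before the side effect finishes, the row stays at `started` and needs reconciliation rather than a blind retry.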

Context: PostgreSQL logical decoding and replication slots show the same boundary in database form. The write-ahead log (WAL) can be consumed as a stream of changes, but each slot also retains WAL until its consumer advances.

Action: Teams use this behavior for change data capture into search, caches, warehouses, and event pipelines.

Result: The database becomes a source of ordered change history, but slow consumers create retention pressure. If lag is ignored, disk growth becomes an availability risk.

Learning: Replayable history is an operational liability as well as a capability. Retention must be budgeted, monitored, and owned.
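A retention budget can be checked with simple arithmetic on log sequence numbers (LSNs). PostgreSQL prints an LSN as two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit position. The helpers below are a sketch under that assumption; the slot names and the budget value are made up, and in practice the same difference is usually computed in SQL with `pg_wal_lsn_diff` over `pg_replication_slots`.

```python
from typing import Dict, List


def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' into a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_bytes(current_lsn: str, restart_lsn: str) -> int:
    """WAL bytes the server must keep because a slot has not advanced past restart_lsn."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)


def slots_over_budget(current_lsn: str, slots: Dict[str, str], budget_bytes: int) -> List[str]:
    """Return the names of slots (hypothetical input mapping name -> restart_lsn)
    whose retained WAL exceeds the budget."""
    return sorted(
        name
        for name, restart_lsn in slots.items()
        if retained_bytes(current_lsn, restart_lsn) > budget_bytes
    )
```

Wiring the result into an alert gives retention an owner: a slot that exceeds its budget is either caught up, dropped, or explicitly granted more disk.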

Where It Breaks

| Decision | Works When | Breaks When | Engineering Control |
|---|---|---|---|
| Queue | One logical owner must complete work | Later consumers need old events | Add outbox, audit table, or stream before deletion |
| Stream | Events need replay or multiple independent consumers | Consumers perform non-idempotent side effects | Store execution records and idempotency keys |
| Queue with fanout | Several workers perform equivalent work | Each downstream needs its own interpretation | Use pub/sub or stream with separate consumer groups |
| Stream as task queue | Ordering and history matter more than claiming | Work must be leased to exactly one worker | Add task ownership table or use a real queue |
| Long stream retention | Backfills and delayed consumers are expected | Storage and lag ownership are unclear | Define retention, compaction, and lag alerts |
| Short queue retention | Failures are resolved quickly | Incidents require forensic reconstruction | Persist facts before enqueueing tasks |

The most expensive architecture is the hybrid built accidentally: a queue used as a stream, with teams copying messages into side stores after the fact; or a stream used as a queue, with every consumer reinventing leases, retries, and dead letter behavior.

The right hybrid is deliberate. A common pattern is transactional outbox first, then two paths: publish durable facts to a stream, and enqueue derived commands for workers. The outbox records what happened. The queue drives what must be done. The stream lets future systems reinterpret the facts.

That split keeps the system honest.
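The deliberate hybrid can be sketched with SQLite and two in-memory stand-ins for the stream and the queue. Everything here is an assumption for illustration: the `invoices` and `outbox` tables, the `relay` function, and the `send_invoice_email` command name.

```python
import json
import sqlite3
from collections import deque
from typing import Deque, Dict, List


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (id TEXT PRIMARY KEY, amount_cents INTEGER)")
    conn.execute(
        "CREATE TABLE outbox ("
        " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " event_type TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " relayed INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def create_invoice(conn: sqlite3.Connection, invoice_id: str, amount_cents: int) -> None:
    """Write the business row and the outbox row in one transaction."""
    payload = json.dumps({"invoice_id": invoice_id, "amount_cents": amount_cents})
    with conn:  # both inserts commit together or not at all
        conn.execute("INSERT INTO invoices VALUES (?, ?)", (invoice_id, amount_cents))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("invoice_created", payload),
        )


def relay(conn: sqlite3.Connection, stream: List[Dict], queue: Deque[Dict]) -> int:
    """Publish unrelayed outbox rows: facts to the stream, commands to the queue."""
    rows = conn.execute(
        "SELECT seq, event_type, payload FROM outbox WHERE relayed = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_type, payload in rows:
        fact = {"seq": seq, "type": event_type, **json.loads(payload)}
        stream.append(fact)  # durable fact for future consumers
        queue.append({       # disposable work for the worker pool
            "command": "send_invoice_email",
            "invoice_id": fact["invoice_id"],
            "idempotency_key": "email:%d" % seq,
        })
        with conn:
            conn.execute("UPDATE outbox SET relayed = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

In this sketch the relay marks a row only after publishing, so a crash in between republishes it on the next run. That is at-least-once delivery, which is why each command carries an idempotency key.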

What to Do Next

  • Problem: If the message represents work that should disappear after success, a stream will force every consumer to carry task execution semantics.

  • Solution: Use a queue for command execution, retries, worker scaling, and dead letter isolation.

  • Proof: If the message represents a business fact that future consumers may need, a queue will delete the source of truth too early.

  • Action: Put durable facts in an outbox or stream, put disposable work in a queue, and make the boundary explicit in design reviews.