AWS Reference Architecture: ALB, ECS, RDS, ElastiCache, and SQS

Most AWS reference architectures look clean until the first dependency slows down, the cache goes cold, or a queue starts redriving poison messages faster than the service can recover.

Situation

A common production web architecture on AWS starts with an Application Load Balancer, routes traffic to ECS services, stores transactional state in RDS, uses ElastiCache for low-latency reads or coordination, and pushes asynchronous work through SQS.

On paper, this stack is straightforward. ALB terminates HTTP traffic and performs health checks. ECS runs stateless containers. RDS provides durable relational storage. ElastiCache absorbs read pressure and expensive computed lookups. SQS decouples slow work from request latency.

The architecture becomes interesting when each managed service is treated less like a box on a diagram and more like an operational contract. ALB does not know whether a task is logically healthy, only whether its configured health check passes. ECS can replace containers, but replacement does not fix a bad deploy, an exhausted connection pool, or a database migration that locks hot tables. RDS is durable, but durability does not remove the need to manage connections, failover behavior, read amplification, and transaction scope. ElastiCache is fast, but it is not a source of truth. SQS gives buffering, but also at-least-once delivery, retries, and duplicate processing risk.

The reference architecture is not the answer by itself. The answer is where failure boundaries are drawn.

The Problem

The failure mode usually begins with a small latency shift.

A downstream dependency slows. ECS tasks hold request threads longer. Connection pools fill. ALB continues sending traffic because the health endpoint still returns 200. Application retries multiply the load against RDS. Cache misses increase because timed-out requests never populate the shared keys they were meant to warm. SQS consumers fall behind, visibility timeouts expire, and the same messages are processed again.

Nothing has fully failed, so every layer keeps trying.

That is the dangerous state: partial failure with automated persistence. The system is alive enough to create more work and unhealthy enough to make that work more expensive.

The core question is: how should ALB, ECS, RDS, ElastiCache, and SQS be arranged so that each layer limits blast radius instead of amplifying it?

Core Concept

A practical AWS reference architecture separates synchronous request handling from asynchronous work, treats RDS as the source of truth, treats ElastiCache as disposable acceleration, and makes SQS consumers idempotent by default.

flowchart TD
  U[users — browsers and clients] --> A[ALB — public entry]
  A --> W[ECS web service — stateless requests]
  W --> C[ElastiCache — hot reads and short lived coordination]
  W --> D[RDS — transactional source of truth]
  W --> Q[SQS — durable work buffer]
  Q --> P[ECS worker service — async processors]
  P --> D
  P --> C
  D --> B[RDS backups — recovery point]
  W --> M[CloudWatch — metrics and alarms]
  P --> M
  Q --> M

The ALB should protect the service from dead tasks, not certify the whole application. Health checks should be cheap and specific: process up, listener responsive, local dependencies initialized. Deep health checks that query RDS on every probe can turn a database incident into a load balancer incident.
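A shallow check of this kind can be sketched as a function that reads only in-process state. This is a minimal illustration, not a prescribed implementation: `TaskState`, its field names, and `shallow_health` are hypothetical, and wiring the function to an HTTP route is left to whatever web framework the task runs.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TaskState:
    """Local readiness flags for one task; the field names are illustrative."""
    config_loaded: bool = False
    pool_initialized: bool = False  # pool object exists; no live query is made


def shallow_health(state: TaskState) -> Tuple[int, str]:
    """Return (status_code, body) using only in-process state.

    Nothing here calls RDS, ElastiCache, or SQS, so a slow dependency
    cannot make every target fail its probe at the same moment.
    """
    if state.config_loaded and state.pool_initialized:
        return 200, "ok"
    return 503, "starting"
```

Dependency health still matters, but in this sketch it would be reported through metrics and alarms rather than through the probe the ALB uses for routing.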

The ECS web service should stay stateless. Session state belongs outside the task, usually in cookies, RDS, or ElastiCache depending on durability requirements. Tasks should be replaceable without draining user identity, shopping carts, workflow state, or background progress.

RDS should own facts. Orders, payments, permissions, inventory, audit records, and workflow transitions should not depend on cache survival. Use transactions where correctness requires atomicity. Keep transactions short. Avoid holding database locks across network calls.

ElastiCache should reduce pressure, not define truth. Cache-aside is the default pattern: read from cache, fall back to RDS, then populate cache with a bounded TTL. When correctness matters, invalidate or version keys after writes rather than assuming TTLs will converge fast enough.
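The cache-aside flow can be sketched with an in-memory stand-in for ElastiCache. Everything named here (`TTLCache`, `read_through`, `write_then_invalidate`, the 30-second default TTL) is an assumption made for illustration; a real service would call a Redis or Memcached client instead of a dictionary.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for ElastiCache: entries expire after a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries count as a miss
            return None
        return value

    def set(self, key: str, value: str, ttl: float) -> None:
        self._data[key] = (self._clock() + ttl, value)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


def read_through(cache: TTLCache, key: str,
                 load_from_db: Callable[[str], Optional[str]],
                 ttl: float = 30.0) -> Optional[str]:
    """Cache-aside read: try the cache, fall back to the database, repopulate."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(key)
    if value is not None:
        cache.set(key, value, ttl)  # bounded TTL so stale data ages out
    return value


def write_then_invalidate(cache: TTLCache, key: str, value: str,
                          save_to_db: Callable[[str, str], None]) -> None:
    """Write to the source of truth first, then drop the cached copy."""
    save_to_db(key, value)
    cache.delete(key)
```

Deleting the key after the write, rather than overwriting it, keeps the database as the only place a value is ever computed from.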

SQS should absorb work that does not need to complete inside the user request. Email sends, webhook delivery, media processing, search indexing, ledger fanout, and third-party synchronization are better behind a queue than inside an ALB request path. The user request records intent in RDS, enqueues work, and returns.
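The "record intent, enqueue, return" sequence might look like the sketch below. It uses `sqlite3` as a stand-in for RDS and a plain list as a stand-in for SQS so that it runs anywhere; the `email_jobs` table, `create_job_table`, and `handle_request` are hypothetical names, not part of any AWS API.

```python
import json
import sqlite3
import uuid
from typing import Dict, List


def create_job_table(db: sqlite3.Connection) -> None:
    db.execute(
        "CREATE TABLE IF NOT EXISTS email_jobs ("
        " job_id TEXT PRIMARY KEY,"
        " recipient TEXT NOT NULL,"
        " status TEXT NOT NULL)"
    )


def handle_request(db: sqlite3.Connection, queue: List[str],
                   recipient: str) -> Dict[str, object]:
    """Persist the intent, enqueue a pointer to it, and return immediately."""
    job_id = str(uuid.uuid4())
    with db:  # commits on success, rolls back if the insert raises
        db.execute(
            "INSERT INTO email_jobs (job_id, recipient, status)"
            " VALUES (?, ?, 'pending')",
            (job_id, recipient),
        )
    # Enqueue only after the commit, so a worker never receives a job_id
    # that the database does not know about.
    queue.append(json.dumps({"job_id": job_id}))
    return {"status": 202, "job_id": job_id}
```

One gap remains in this sketch: if the task dies between the commit and the enqueue, the row stays `pending` with no message. A periodic sweep that re-enqueues old pending rows is one way to close it, assuming the worker is idempotent.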

The worker service then processes messages with idempotency. A message can be delivered more than once. A worker can crash after performing a side effect but before deleting the message. The handler must be safe under replay.
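A replay-safe handler can be sketched by letting a primary key reject the second delivery. As above, `sqlite3` stands in for RDS, and the table names and `handle_once` are illustrative assumptions. The dedup record and the side effect share one transaction, so either both are written or neither is.

```python
import sqlite3


def create_worker_tables(db: sqlite3.Connection) -> None:
    db.execute(
        "CREATE TABLE IF NOT EXISTS processed_messages ("
        " message_id TEXT PRIMARY KEY)"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS ledger ("
        " message_id TEXT NOT NULL, amount INTEGER NOT NULL)"
    )


def handle_once(db: sqlite3.Connection, message_id: str, amount: int) -> bool:
    """Apply the ledger entry at most once; return False on a replay."""
    try:
        with db:  # one transaction for the dedup row and the side effect
            db.execute(
                "INSERT INTO processed_messages (message_id) VALUES (?)",
                (message_id,),
            )
            db.execute(
                "INSERT INTO ledger (message_id, amount) VALUES (?, ?)",
                (message_id, amount),
            )
    except sqlite3.IntegrityError:
        # The primary key already exists: this delivery is a duplicate.
        return False
    return True
```

This only covers side effects that live in the same database. A call to an external API cannot join the transaction and needs its own idempotency key, if the API offers one.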

In Practice

Context: AWS documents ALB target health checks as a routing signal, not an application correctness proof. A target can be considered healthy when it responds successfully to the configured check path, even if a deeper dependency is degraded.

Action: Keep ALB health checks shallow and use separate readiness, dependency, and business health metrics in CloudWatch. Route traffic based on whether the task can accept work; alert based on whether the system can complete work.

Result: The documented pattern separates traffic eligibility from operational diagnosis. The load balancer removes dead targets, while alarms catch rising RDS latency, cache error rates, SQS age, and application-level failures.

Learning: A health check is a routing primitive. It should not become a distributed transaction across every dependency.

Context: The Amazon Builders' Library describes timeouts, retries, and backoff with jitter as essential tools for avoiding retry amplification during overload. The pattern is explicit: retries can help with transient faults, but unbounded, synchronized retries make incidents worse.

Action: Put tight timeouts on calls from ECS to RDS, ElastiCache, and external APIs. Use bounded retries with exponential backoff and jitter. Do not retry every failed operation at every layer. For non-urgent work, prefer SQS retry behavior over holding an ALB request open.
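Bounded retries with exponential backoff and full jitter can be sketched as follows. The function names, the three-attempt limit, and the base and cap values are assumptions chosen for illustration. The operation passed in is expected to enforce its own timeout, for example through a driver or HTTP client setting.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 2.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(op: Callable[[], T],
                      max_attempts: int = 3,
                      retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Run op up to max_attempts times, sleeping with jitter between tries."""
    for attempt in range(max_attempts):
        try:
            return op()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # budget spent: fail fast instead of piling on load
            sleep(backoff_delay(attempt, rng=rng))
    raise ValueError("max_attempts must be at least 1")
```

Only the exception types listed in `retry_on` are retried. Anything else propagates at once, which keeps non-transient failures from consuming the retry budget.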

Result: The documented pattern turns retry behavior into load control. When a dependency slows, callers stop waiting indefinitely and avoid synchronized retry spikes.

Learning: Retry policy is capacity policy. Treat it as part of the architecture, not as an SDK default.

Context: AWS documents at-least-once delivery for Amazon SQS standard queues. Messages can be delivered more than once, and consumers must tolerate duplicates. The visibility timeout controls when an in-flight message can be received again.

Action: Design workers around idempotency keys stored in RDS. Record message handling state before or inside the same transaction as the durable side effect. Set visibility timeout longer than normal processing time, and send failed messages to a dead-letter queue after a bounded number of receives.
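The visibility timeout and dead-letter settings can be expressed as queue attributes. The sketch below only builds the attribute map (the shape accepted by the SQS `CreateQueue` and `SetQueueAttributes` operations) and makes no AWS call. The helper name, the safety factor of 3, and the default of 5 receives are assumptions, not recommendations from AWS.

```python
import json
from typing import Dict

SQS_MAX_VISIBILITY_SECONDS = 43200  # 12 hours, the SQS upper limit


def worker_queue_attributes(dlq_arn: str,
                            p99_processing_seconds: int,
                            max_receives: int = 5,
                            safety_factor: int = 3) -> Dict[str, str]:
    """Build SQS queue attributes for a worker queue with a dead-letter queue."""
    # Keep the timeout well above normal processing time so a slow but
    # healthy worker does not trigger a second delivery.
    visibility = p99_processing_seconds * safety_factor
    if not 0 <= visibility <= SQS_MAX_VISIBILITY_SECONDS:
        raise ValueError("visibility timeout outside the SQS range")
    return {
        "VisibilityTimeout": str(visibility),
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": max_receives,  # bounded receives before the DLQ
        }),
    }
```

With a 20-second p99 and the defaults above, the queue would hide an in-flight message for 60 seconds and move it to the dead-letter queue once it exceeds five receives.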

Result: The documented pattern makes duplicate delivery survivable. Redrive becomes an operational tool rather than a correctness hazard.

Learning: SQS decouples availability, not correctness. Correctness still belongs in the consumer and the database schema.

Context: Redis and ElastiCache are commonly used for cache-aside reads, but Redis persistence and replication settings do not make cached values the system of record. AWS ElastiCache documentation emphasizes in-memory performance and managed cache operations.

Action: Keep source-of-truth writes in RDS. Use ElastiCache for derived values, hot keys, rate counters, and short-lived coordination only when stale or lost data is acceptable. Add TTLs to all cache keys unless there is a specific invalidation mechanism.

Result: The documented pattern allows cache nodes to fail, restart, or evict keys without losing durable business state.

Learning: Cache failure should hurt latency before it hurts correctness.

Where It Breaks

Component	Failure Mode	Mitigation	Residual Risk
ALB	Health check passes while business flow fails	Separate shallow health checks from deep alarms	Bad deploys can still pass routing checks
ECS	Tasks scale out but all block on RDS	Connection limits, timeouts, backpressure	Scaling compute cannot fix database contention
RDS	Locking, failover, or connection exhaustion	Short transactions, pool sizing, read replicas where appropriate	Failover can still create brief write unavailability
ElastiCache	Hot key, eviction, stale value	TTLs, key versioning, cache-aside fallback	Cache loss can expose database capacity limits
SQS	Duplicate or poison messages	Idempotency keys, DLQs, visibility timeout tuning	Reprocessing still requires operational judgment
Workers	Side effect succeeds before message delete	Durable processing records	External APIs may not support idempotency

The most common mistake is treating this architecture as independently scalable boxes. ECS scales horizontally, but RDS has shared limits. ElastiCache lowers read load, but cold-start traffic can still hit the database. SQS buffers work, but a growing queue is deferred user pain, not free capacity.
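One way to respect a shared RDS limit from inside each task is a small concurrency gate that sheds load instead of queueing without bound. This is a sketch under assumptions: `DbGate`, `Overloaded`, and the 50 ms wait are hypothetical, and the per-task limit would need to be derived from the database's connection ceiling divided across the maximum task count.

```python
import threading
from contextlib import contextmanager
from typing import Iterator


class Overloaded(Exception):
    """Raised when the task is already at its database concurrency limit."""


class DbGate:
    """Caps in-flight database work per task and rejects the excess quickly."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    @contextmanager
    def slot(self, wait_seconds: float = 0.05) -> Iterator[None]:
        # Wait briefly for a free slot, then fail fast rather than hold
        # the request thread while the database is saturated.
        if not self._slots.acquire(timeout=wait_seconds):
            raise Overloaded("database concurrency limit reached")
        try:
            yield
        finally:
            self._slots.release()
```

A request handler would wrap its query in `with gate.slot():` and translate `Overloaded` into a fast 503, so that scaling out ECS tasks does not translate directly into more pressure on RDS.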

The second mistake is placing too much logic in the synchronous request. If the user does not need the result immediately, persist intent and enqueue work. This shortens request latency, reduces ALB exposure to downstream slowness, and creates a controlled retry surface.

The third mistake is ignoring deletion semantics. A worker that completes work but crashes before deleting the SQS message will receive that message again, so the work runs twice. A worker that deletes first and then performs the work loses that work if it crashes in between. The only robust answer is to delete after processing and make processing idempotent with durable state.
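The ordering can be seen in a small simulation. `FakeQueue` is an in-memory stand-in that imitates only the receive, delete, and visibility-timeout behavior described here. It is not the SQS API, and every name in it is illustrative.

```python
from typing import Callable, Dict, List, Optional, Tuple


class FakeQueue:
    """Tiny in-memory imitation of a queue with visibility timeouts."""

    def __init__(self) -> None:
        self._visible: List[Tuple[str, str]] = []  # (message_id, body)
        self._in_flight: Dict[str, str] = {}

    def send(self, message_id: str, body: str) -> None:
        self._visible.append((message_id, body))

    def receive(self) -> Optional[Tuple[str, str]]:
        if not self._visible:
            return None
        message_id, body = self._visible.pop(0)
        self._in_flight[message_id] = body  # hidden until deleted or expired
        return message_id, body

    def delete(self, message_id: str) -> None:
        self._in_flight.pop(message_id, None)

    def expire_visibility(self) -> None:
        """Simulate the visibility timeout elapsing for undeleted messages."""
        for message_id, body in self._in_flight.items():
            self._visible.append((message_id, body))
        self._in_flight.clear()


def work_once(queue: FakeQueue, handler: Callable[[str, str], None]) -> bool:
    """Receive one message, run the handler, and delete only on success."""
    received = queue.receive()
    if received is None:
        return False
    message_id, body = received
    handler(message_id, body)  # if this raises, the delete below is skipped
    queue.delete(message_id)
    return True
```

If the handler raises after its side effect, the message stays in flight and reappears once visibility expires. The second delivery is harmless only when the handler recognizes the message ID and skips the repeated side effect.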

What to Do Next

Problem: The stack fails badly when partial dependency slowness causes every layer to retry, wait, and amplify load.
Solution: Use ALB for traffic routing, ECS for stateless execution, RDS for durable truth, ElastiCache for disposable acceleration, and SQS for asynchronous buffering.
Proof: The architecture follows documented AWS patterns: ALB target health checks, SQS at-least-once delivery, cache-aside behavior, bounded retries, visibility timeouts, dead-letter queues, and durable relational transactions.
Action: Review one production request path and mark every synchronous dependency, retry, timeout, cache read, database transaction, and queued side effect. Then decide which failures should return fast, which should retry later, and which must stop the workflow entirely.
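The review can start as simple arithmetic over the request path. The sketch below assumes a hypothetical `Call` record per dependency and computes two numbers: the worst-case time a request can spend waiting, and how many calls reach the bottom dependency when nested layers each retry. Backoff sleeps are left out of the bound for simplicity.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Call:
    name: str
    timeout_s: float
    attempts: int = 1  # total tries, including the first


def worst_case_seconds(path: List[Call]) -> float:
    """Upper bound on waiting time if every attempt of every call times out."""
    return sum(c.timeout_s * c.attempts for c in path)


def amplification(attempts_per_layer: List[int]) -> int:
    """Calls hitting the lowest dependency when each nested layer retries."""
    total = 1
    for attempts in attempts_per_layer:
        total *= attempts
    return total


# Illustrative numbers only: one request touching a cache, RDS, and an
# external API, with the timeouts and attempt counts given here.
path = [
    Call("cache", timeout_s=0.05, attempts=1),
    Call("rds", timeout_s=0.5, attempts=2),
    Call("partner-api", timeout_s=1.0, attempts=3),
]
budget = worst_case_seconds(path)    # 0.05 + 1.0 + 3.0 seconds
fanout = amplification([3, 3, 3])    # three layers retrying three times each
```

If the worst-case figure exceeds the ALB idle timeout or the client's patience, some of that work is a candidate for the queue. A fanout of 27 from three nested layers of three attempts is the reason to retry at one layer only.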

Rajiv
