AWS Database Cost Triage: RDS, Aurora, DynamoDB, ElastiCache, and OpenSearch

Database bills rarely explode because one engineer chose the wrong service. They usually grow because ownership, workload shape, and control loops drift apart until nobody can explain which queries, tenants, indexes, caches, or shards are buying what outcome.

Situation

AWS gives teams a broad database portfolio: RDS for conventional relational workloads, Aurora for managed high-availability relational systems, DynamoDB for key-value and document access patterns, ElastiCache for Redis or Memcached acceleration, and OpenSearch for search and analytical indexing.

That portfolio is useful because workloads are not uniform. A checkout path, a feature flag read, a session cache, a text search endpoint, and an operational dashboard should not all be forced through the same persistence layer.

The cost problem begins when each service is treated as an isolated bill line. RDS cost is reviewed by instance class. Aurora cost is reviewed by cluster. DynamoDB cost is reviewed by table. OpenSearch cost is reviewed by domain. ElastiCache cost is reviewed by node group.

Those views are necessary, but insufficient. They show what was purchased. They rarely show whether the purchase still matches the access pattern.

The Problem

The failure mode is not “databases are expensive.” The failure mode is unmanaged mismatch.

A relational workload moves to Aurora but keeps inefficient polling queries. DynamoDB gets adopted for scale but receives ad hoc access patterns that force scans or secondary indexes nobody budgeted. ElastiCache is added to reduce database load, but eviction policy and key design cause poor hit rates. OpenSearch becomes the destination for every debug query and slowly turns into a second data warehouse.

The team then enters cost triage under pressure. Finance wants a reduction. Engineering wants reliability. Product wants no visible regression. The easy move is to resize or delete capacity. The safer move is to identify the cost control plane: the few measurements and architectural decisions that connect dollars to workload behavior.

The core question is: how do you reduce database cost without turning cost cutting into an availability incident?

Core Concept

Treat database cost as an operational signal attached to workload intent. The unit of analysis is not the AWS service. It is the access pattern.

```mermaid
flowchart TD
    A["monthly bill spike: unknown workload"] --> B["classify access pattern: transactional or cache or search"]
    B --> C["RDS and Aurora: relational query pressure"]
    B --> D["DynamoDB: key access and capacity mode"]
    B --> E["ElastiCache: hit rate and memory pressure"]
    B --> F["OpenSearch: index and shard pressure"]

    C --> G["query plan review: indexes and connection shape"]
    C --> H["capacity review: instance and storage and replicas"]

    D --> I["partition review: hot keys and scans"]
    D --> J["capacity review: on demand or provisioned"]

    E --> K["key review: ttl and eviction"]
    E --> L["node review: memory and network"]

    F --> M["index review: mappings and retention"]
    F --> N["cluster review: shards and replicas"]

    G --> O["cost decision: remove waste with rollback"]
    H --> O
    I --> O
    J --> O
    K --> O
    L --> O
    M --> O
    N --> O
```

For RDS and Aurora, start with query behavior before instance behavior. Expensive instances are often compensating for missing indexes, unbounded result sets, inefficient joins, chatty connection pools, or read replicas used as a substitute for query ownership. Right-sizing helps only after the workload is legible.
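One way to make a relational workload legible is to rank statements by total time consumed (calls multiplied by mean latency) rather than by slowest single execution. The sketch below assumes a hypothetical `QueryStat` row shape and made-up sample numbers; real input would come from whatever statement statistics or slow query logs the engine exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryStat:
    # Hypothetical row shape: one normalized statement with its call count
    # and mean latency over the observation window.
    fingerprint: str
    calls: int
    mean_ms: float


def rank_by_total_time(stats: list[QueryStat]) -> list[tuple[str, float, float]]:
    """Return (fingerprint, total_ms, share_of_load), highest total time first."""
    totals = [(s.fingerprint, s.calls * s.mean_ms) for s in stats]
    grand_total = sum(total for _, total in totals)
    if grand_total == 0:
        return [(name, 0.0, 0.0) for name, _ in totals]
    ranked = sorted(totals, key=lambda pair: pair[1], reverse=True)
    return [(name, total, total / grand_total) for name, total in ranked]


# Illustrative numbers only.
sample = [
    QueryStat("SELECT * FROM orders WHERE status = ?", calls=120_000, mean_ms=45.0),
    QueryStat("SELECT id FROM jobs WHERE state = 'pending'", calls=2_000_000, mean_ms=3.0),
    QueryStat("UPDATE carts SET updated_at = ? WHERE id = ?", calls=300_000, mean_ms=1.5),
]

for name, total_ms, share in rank_by_total_time(sample):
    print(f"{share:6.1%}  {total_ms / 1000:10.1f} s  {name}")
```

In this sample the fast polling query on `jobs` outranks the slow `orders` query because it runs far more often, which is the kind of load an instance resize would only paper over.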

For DynamoDB, cost follows request shape. A table with clean partition keys and predictable access can be cheap at high scale. A table with scans, hot keys, oversized items, or poorly chosen global secondary indexes can become expensive while still looking “serverless” from the application side. Triage must inspect consumed capacity, throttling, partition heat, item size, and index usage together.
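The link between item size and consumed capacity can be sketched with a little arithmetic. This assumes the documented unit sizes (a read unit covers up to 4 KB, a write unit up to 1 KB, and eventually consistent reads cost half); the function names are illustrative, and transactional operations and index write amplification are left out.

```python
import math

READ_UNIT_BYTES = 4 * 1024   # one read unit covers up to 4 KB
WRITE_UNIT_BYTES = 1024      # one write unit covers up to 1 KB


def _read_units(size_bytes: int, strongly_consistent: bool) -> float:
    units = max(1, math.ceil(size_bytes / READ_UNIT_BYTES))
    # Eventually consistent reads consume half the units of strong reads.
    return float(units) if strongly_consistent else units / 2


def read_units_per_item(item_bytes: int, strongly_consistent: bool = False) -> float:
    """Read units consumed by fetching a single item of the given size."""
    return _read_units(item_bytes, strongly_consistent)


def write_units_per_item(item_bytes: int) -> int:
    """Write units consumed by writing a single item of the given size."""
    return max(1, math.ceil(item_bytes / WRITE_UNIT_BYTES))


def scan_read_units(total_bytes_scanned: int, strongly_consistent: bool = False) -> float:
    """Scans are charged on the data read, not on items left after filtering."""
    return _read_units(total_bytes_scanned, strongly_consistent)


print(read_units_per_item(3500))      # small item, eventually consistent read
print(write_units_per_item(3500))     # same item on the write path
print(scan_read_units(1024 ** 3))     # scanning 1 GiB of table data
```

The asymmetry is the point: an item that is cheap to read can cost several units to write, and a scan pays for every byte it touches regardless of how few rows the filter returns.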

For ElastiCache, the key question is whether the cache is reducing origin work. A cache with low hit rate, excessive churn, large values, or no meaningful TTL discipline can add cost without reducing database pressure. The control plane is hit rate, eviction, memory fragmentation, network throughput, and the shape of misses.
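A rough way to test whether a cache is reducing origin work is to compare the origin cost it avoids with what the cache itself costs. The sketch below is a deliberately crude model: the cost figures are hypothetical inputs, hit and miss counts are assumed to come from the cache's own counters, and the value of lower latency is ignored.

```python
def hit_rate(hits: int, misses: int) -> float:
    """Fraction of lookups served from the cache."""
    total = hits + misses
    return hits / total if total else 0.0


def net_cache_value(hits: int, origin_cost_per_million_reads: float,
                    cache_monthly_cost: float) -> float:
    """Origin spend avoided by cache hits minus the cost of running the cache.

    A negative result suggests the cache costs more than the reads it absorbs.
    """
    avoided = hits / 1_000_000 * origin_cost_per_million_reads
    return avoided - cache_monthly_cost


# Illustrative monthly numbers only.
healthy = {"hits": 900_000_000, "misses": 100_000_000}
churning = {"hits": 200_000_000, "misses": 800_000_000}

for label, c in (("healthy", healthy), ("churning", churning)):
    rate = hit_rate(c["hits"], c["misses"])
    net = net_cache_value(c["hits"], origin_cost_per_million_reads=0.50,
                          cache_monthly_cost=300.0)
    print(f"{label}: hit rate {rate:.0%}, net value {net:+.2f}")
```

A negative net value is not proof the cache should go, since it may still protect tail latency or shield the origin during bursts, but it does mean the cost argument has to rest on something other than avoided reads.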

For OpenSearch, cost is dominated by index design, shard count, retention, replica policy, and query fanout. A domain can be oversized because ingestion is too broad, mappings are too loose, shards are too small and too numerous, or retention is treated as infinite. Search clusters need lifecycle management, not just bigger nodes.
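Oversharding is easier to argue about with a number. The sketch below estimates a primary shard count from index size, following the common sizing approach of dividing indexed data by a target shard size. The 10% indexing overhead and 30 GiB target are assumed defaults, not values from this article, and should be tuned per workload.

```python
import math


def primary_shard_estimate(source_gib: float, growth_gib: float = 0.0,
                           indexing_overhead: float = 0.10,
                           target_shard_gib: float = 30.0) -> int:
    """Approximate primary shard count for one index.

    indexing_overhead and target_shard_gib are assumptions to adjust:
    search-heavy indexes usually want smaller shards than log indexes.
    """
    if target_shard_gib <= 0:
        raise ValueError("target_shard_gib must be positive")
    indexed_gib = (source_gib + growth_gib) * (1 + indexing_overhead)
    return max(1, math.ceil(indexed_gib / target_shard_gib))


def total_shard_copies(primary_shards: int, replicas: int) -> int:
    """Every replica duplicates each primary shard."""
    return primary_shards * (1 + replicas)


big_index = primary_shard_estimate(source_gib=200, growth_gib=50)
daily_log_index = primary_shard_estimate(source_gib=5)

print(big_index, total_shard_copies(big_index, replicas=1))
print(daily_log_index)
```

Comparing the estimate with the configured shard count shows where the gap is: a small daily index created with several primary shards multiplies shard overhead every day the retention window stays open.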

In Practice

Context: Amazon’s DynamoDB documentation describes capacity modes, partition keys, secondary indexes, item size, and scan behavior as central to table performance and cost. This is a documented system behavior, not an anecdote.

Action: During cost triage, separate DynamoDB tables by access pattern: predictable high-volume tables, bursty tables, tables with global secondary indexes, and tables showing scan-heavy behavior in CloudWatch or hot keys in Contributor Insights. Check whether on-demand mode is buying useful elasticity or masking a workload that should be provisioned with autoscaling.

Result: The documented pattern is that DynamoDB cost optimization comes from aligning capacity mode and key design with access shape. Cutting capacity without fixing scans, hot keys, or oversized indexes only moves the failure from the bill to throttling.

Learning: DynamoDB triage should begin with key and index behavior, then capacity mode. The billing model is downstream of the data model.
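The on-demand versus provisioned question reduces to a utilization threshold once the key and index behavior is understood. The sketch below computes that break-even point. Prices are passed in by the caller, and the figures used here are placeholders rather than current AWS prices.

```python
SECONDS_PER_HOUR = 3600


def break_even_utilization(on_demand_price_per_million: float,
                           provisioned_price_per_unit_hour: float) -> float:
    """Average utilization above which provisioned capacity is cheaper.

    One provisioned unit can serve up to 3600 requests in an hour, so its
    hourly price is compared with the on-demand price of those requests.
    """
    full_hour_on_demand = SECONDS_PER_HOUR * on_demand_price_per_million / 1_000_000
    return provisioned_price_per_unit_hour / full_hour_on_demand


def hourly_costs(avg_consumed_per_sec: float, provisioned_units: int,
                 on_demand_price_per_million: float,
                 provisioned_price_per_unit_hour: float) -> tuple[float, float]:
    """Return (on_demand_cost, provisioned_cost) for one hour of traffic."""
    on_demand = (avg_consumed_per_sec * SECONDS_PER_HOUR
                 * on_demand_price_per_million / 1_000_000)
    provisioned = provisioned_units * provisioned_price_per_unit_hour
    return on_demand, provisioned


# Placeholder prices; substitute the current prices for your region.
threshold = break_even_utilization(1.00, 0.0006)
on_demand, provisioned = hourly_costs(50, 200, 1.00, 0.0006)
print(f"break-even utilization: {threshold:.1%}")
print(f"on-demand {on_demand:.2f} vs provisioned {provisioned:.2f} per hour")
```

With these placeholder prices, a table averaging 25% utilization of its provisioned capacity sits above the threshold, so provisioned mode would be cheaper. A table that spends most of the day near zero would sit below it, and there on-demand is buying real elasticity.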

Context: Amazon RDS and Aurora expose database load through tools such as Performance Insights, Enhanced Monitoring, slow query logs, and engine-native explain plans. PostgreSQL and MySQL behavior around indexes, joins, locks, and connection pressure is documented and observable.

Action: Group RDS and Aurora spend by cluster role: write primary, read replica, reporting replica, and idle legacy instance. For high-cost clusters, inspect top SQL, wait events, storage growth, replica lag, and connection count before resizing. Validate reserved capacity or savings plans only after the steady-state footprint is understood.

Result: The documented pattern is that relational cost optimization depends on workload diagnosis. A larger instance may be hiding missing indexes, lock contention, or application pooling problems. A smaller instance may be safe only after query pressure is reduced.

Learning: For relational systems, instance size is the last mile of triage. Query shape, storage growth, and availability requirements decide the real envelope.
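Grouping relational spend by cluster role can start as a very small script. The sketch below assumes a hypothetical inventory, with instance ids, costs, and connection counts invented for illustration; in practice the rows would come from a billing export joined to instance tags and monitoring data.

```python
from collections import defaultdict

# Hypothetical inventory rows; field names and values are illustrative.
inventory = [
    {"id": "orders-writer", "role": "write primary", "monthly_cost": 2100.0, "avg_connections": 180},
    {"id": "orders-reader-1", "role": "read replica", "monthly_cost": 1400.0, "avg_connections": 95},
    {"id": "orders-reader-2", "role": "read replica", "monthly_cost": 1400.0, "avg_connections": 4},
    {"id": "bi-replica", "role": "reporting replica", "monthly_cost": 900.0, "avg_connections": 6},
    {"id": "legacy-crm", "role": "idle legacy instance", "monthly_cost": 650.0, "avg_connections": 0},
]


def spend_by_role(rows: list[dict]) -> dict[str, float]:
    """Total monthly cost per role, most expensive role first."""
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row["role"]] += row["monthly_cost"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


def review_candidates(rows: list[dict], max_connections: int = 5) -> list[str]:
    """Instances quiet enough to deserve a closer look, not automatic removal."""
    return [row["id"] for row in rows if row["avg_connections"] <= max_connections]


for role, cost in spend_by_role(inventory).items():
    print(f"{cost:8.2f}  {role}")
print("review:", review_candidates(inventory))
```

The connection threshold is an arbitrary starting point. A nearly idle replica may still be the failover target, which is why the output is a review list and not a delete list.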

Context: Redis and Memcached are documented as memory-backed caching systems. ElastiCache pricing follows nodes and capacity, while operational value depends on reducing backend work through cache hits and predictable eviction.

Action: Review cache hit rate, evictions, memory utilization, key cardinality, TTL distribution, and value size. Identify caches used for durable state, caches with no expiry discipline, and caches that duplicate data already served cheaply by DynamoDB or Aurora replicas.

Result: The documented pattern is that cache cost is justified only when it reduces more expensive work or protects latency. A cache with poor hit rate is not an optimization layer; it is another production datastore.

Learning: ElastiCache triage should ask what origin load disappears because the cache exists.

Context: OpenSearch documentation emphasizes shard sizing, index lifecycle management, mappings, replicas, and query design. These are known drivers of cluster stability and cost.

Action: Split indexes by purpose: product search, logs, metrics, audit, and exploratory debugging. Apply retention rules, reduce unnecessary replicas, fix oversharding, and move non-search analytics to more appropriate storage when search is being used as a warehouse.

Result: The documented pattern is that OpenSearch cost is often index lifecycle cost. Compute, storage, and memory pressure follow from how much data is indexed, how it is mapped, and how widely queries fan out.

Learning: OpenSearch is expensive when it becomes the universal answer to “we might need to query this later.”
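Retention by purpose can be expressed as a small policy table. The sketch below assumes a hypothetical policy, with index names and day counts invented for illustration. It only reports which indexes are past retention and which have no declared purpose; it does not delete anything.

```python
from datetime import date

# Hypothetical retention policy in days; None means no automatic expiry.
RETENTION_DAYS = {
    "product search": None,
    "logs": 14,
    "metrics": 30,
    "audit": 365,
    "debug": 3,
}


def classify_indexes(indexes: list[tuple[str, str, date]],
                     today: date) -> tuple[list[str], list[str]]:
    """Return (expired, unclassified) index names.

    Each input tuple is (index_name, purpose, created_on).
    """
    expired, unclassified = [], []
    for name, purpose, created_on in indexes:
        if purpose not in RETENTION_DAYS:
            unclassified.append(name)   # nobody has declared what this is for
            continue
        limit = RETENTION_DAYS[purpose]
        if limit is not None and (today - created_on).days > limit:
            expired.append(name)
    return expired, unclassified


sample_indexes = [
    ("logs-2024.06.01", "logs", date(2024, 6, 1)),
    ("logs-2024.06.25", "logs", date(2024, 6, 25)),
    ("debug-adhoc", "debug", date(2024, 6, 20)),
    ("products-v3", "product search", date(2023, 1, 10)),
    ("audit-2024.01", "audit", date(2024, 1, 1)),
    ("tmp-export", "unknown", date(2024, 5, 1)),
]

expired, unclassified = classify_indexes(sample_indexes, today=date(2024, 6, 30))
print("expired:", expired)
print("unclassified:", unclassified)
```

The unclassified list is usually the more useful output. An index nobody can assign a purpose to is the "we might need to query this later" case made visible.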

Where It Breaks

| Service | Common Cost Failure | Safer Triage Move | Risk |
| --- | --- | --- | --- |
| RDS | Oversized instances hiding inefficient SQL | Review top queries, waits, indexes, and storage before resizing | Latency regression from premature downsizing |
| Aurora | Read replicas used to absorb avoidable query load | Separate read scaling from query cleanup | Replica lag or failover surprises |
| DynamoDB | Scans, hot keys, oversized items, unused indexes | Inspect consumed capacity and access patterns per table | Throttling if capacity is cut first |
| ElastiCache | Low hit rate or unbounded key growth | Measure hit rate, eviction, TTLs, and origin reduction | Cache removal can overload the origin |
| OpenSearch | Oversharding and infinite retention | Fix index lifecycle, mappings, replicas, and shard count | Search latency or recovery impact |

What to Do Next

Problem: The database bill is not actionable when it is grouped only by AWS service.
Solution: Build a cost control plane around access patterns: relational queries, key-value reads, cache behavior, and search indexes.
Proof: Use documented service signals: Performance Insights, CloudWatch capacity metrics, cache hit rate, eviction behavior, shard health, index retention, and query fanout.
Action: For each expensive datastore, write down the workload it serves, the metric proving it earns its cost, the rollback plan for any reduction, and the owner who can change the access pattern.
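The four items in that action can be kept as a structured record so that gaps are visible before any reduction is attempted. The sketch below is one possible shape; the class name, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class DatastoreCostRecord:
    # Hypothetical record: one entry per expensive datastore.
    name: str
    workload: str        # what the datastore serves
    value_metric: str    # the metric proving it earns its cost
    rollback_plan: str   # how to undo a reduction
    owner: str           # who can change the access pattern


def missing_fields(record: DatastoreCostRecord) -> list[str]:
    """Names of fields left blank, in declaration order."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


complete = DatastoreCostRecord(
    name="sessions-cache",
    workload="session lookups for the web tier",
    value_metric="cache hit rate and origin reads avoided",
    rollback_plan="restore previous node count from saved configuration",
    owner="web-platform team",
)
incomplete = DatastoreCostRecord(
    name="debug-search", workload="ad hoc debugging queries",
    value_metric="", rollback_plan="", owner="",
)

print(missing_fields(complete))
print(missing_fields(incomplete))
```

A datastore whose record cannot be completed is not ready for a cost reduction, because nobody can yet say what would break or who would notice.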

Rajiv
