Capacity Planning From First Principles: QPS, Fanout, and Hot Keys

Capacity planning fails when teams size the average request and forget that production traffic is a graph, not a number.

Situation

Most capacity reviews start with a deceptively clean question: how many requests per second can this service handle?

That question is useful, but incomplete. A service does not handle a request in isolation. It fans out to caches, databases, queues, search indexes, feature stores, payment gateways, and internal APIs. Each hop has its own concurrency limit, latency distribution, retry policy, and partitioning model.

The result is that user-visible QPS is only the first term in the equation. The system’s real load is shaped by fanout, amplification, skew, and recovery behavior.

A homepage endpoint at 2,000 QPS may look safe if the service can serve 3,000 QPS in a benchmark. It is not safe if each request reads 12 downstream resources, retries twice during brownouts, and concentrates half its reads on one tenant, celebrity account, or trending object. In that case the dependencies can see up to 2,000 × 12 × 3 = 72,000 operations per second, and half of them land in one place.

The capacity question is not “can one service handle X QPS?” The question is whether every constrained resource in the request path can survive the worst credible product behavior.

The Problem

Averages hide the failure mode.

If one request performs one database read, 5,000 frontend QPS means 5,000 database reads per second. If one request performs 20 reads, it means 100,000 reads per second. If p95 latency rises and clients retry once, the downstream system may now see 200,000 reads per second while the user-facing traffic graph still says 5,000 QPS.

That is fanout, and retries multiply it.

Hot keys make the problem sharper. A distributed datastore can have enormous aggregate capacity and still fail because one logical key, partition, row range, or tenant receives more traffic than a single shard can serve. Adding more machines does not help if the routing function keeps sending the hot workload to the same place.

This is why “we have enough total capacity” is not a proof. Total capacity answers the wrong question. The practical question is:

Can the hottest constrained unit in the system handle peak amplified demand while dependencies are slow, retries are active, and traffic is uneven?

Capacity as a Load Graph

Capacity planning should begin with a request graph and a budget for every edge.

flowchart TD
    A[user traffic — peak QPS] --> B[entry service — admission control]
    B --> C[fanout map — downstream calls]
    C --> D[cache tier — key distribution]
    C --> E[database tier — partition limits]
    C --> F[queue tier — write amplification]
    E --> G[hot key analysis — tenant and object skew]
    F --> H[consumer capacity — drain rate]
    G --> I[capacity envelope — steady state and failure state]
    H --> I

The first-principles model is simple:

downstream_qps = user_qps × calls_per_request × retry_multiplier × amplification_factor

That formula is not sufficient, but it prevents magical thinking. It forces the review to name the multipliers.

user_qps should be peak, not average. Use launch traffic, daily peak, regional failover, batch overlap, and marketing events as separate scenarios.

calls_per_request should count actual downstream operations. A single API call may perform one cache read, three database reads, one authorization lookup, one feature flag fetch, and one async write.

retry_multiplier should reflect client behavior under partial failure. Retries are useful when they are bounded, jittered, and budgeted. They are dangerous when every layer retries independently.

amplification_factor captures work created after the synchronous path: denormalized writes, index updates, queue messages, CDC consumers, search indexing, cache invalidation, and analytics events.
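A minimal Python sketch of the formula above, reusing the 5,000 QPS and 20-read figures from earlier in the post. The `Scenario` class, its field values, and the `layered_retry_multiplier` helper are illustrative assumptions, not measurements from a real system. The helper shows the worst case when every layer retries independently: attempts multiply rather than add.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Scenario:
    """One traffic scenario. Every value here is an illustrative assumption."""
    name: str
    user_qps: float              # peak, not average
    calls_per_request: float     # downstream operations per user request
    retry_multiplier: float      # total attempts per call; 1.0 means no retries
    amplification_factor: float  # extra async work per call; 1.0 means none


def downstream_qps(s: Scenario) -> float:
    """Apply the formula from the text, term by term."""
    return (s.user_qps * s.calls_per_request
            * s.retry_multiplier * s.amplification_factor)


def layered_retry_multiplier(attempts_per_layer: list[int]) -> int:
    """Worst case when each layer retries independently: attempts multiply."""
    return prod(attempts_per_layer)


steady = Scenario("steady", 5_000, 20, 1.0, 1.0)
brownout = Scenario("brownout", 5_000, 20, 2.0, 1.0)  # clients retry once

for s in (steady, brownout):
    print(f"{s.name}: {downstream_qps(s):,.0f} downstream ops/s")

# Three layers that each allow 3 attempts can turn one call into 27.
print(layered_retry_multiplier([3, 3, 3]))
```

Keeping each scenario as a separate record makes the review argue about the multipliers one at a time instead of about a single blended number.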

Then the model must be projected onto physical constraints: connection pools, thread pools, database partitions, row ranges, shard leaders, queue partitions, cache nodes, and rate limits.

The unit that matters is the smallest thing that can become hot.
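To make the projection concrete, here is a rough sketch of demand on the hottest unit. It assumes a fixed share of traffic targets one key and the rest spreads evenly across all units. The function name, the 10% hot share, and the 10,000 ops/s per-unit limit are hypothetical values chosen for illustration.

```python
def hottest_unit_qps(total_qps: float, units: int, hot_share: float) -> float:
    """Demand on the unit that owns the hot key.

    Assumes `hot_share` of all traffic targets one key, and the remaining
    traffic is spread evenly across every unit, including the hot one.
    """
    if units < 1 or not 0.0 <= hot_share <= 1.0:
        raise ValueError("units must be >= 1 and hot_share within [0, 1]")
    hot = total_qps * hot_share
    background = total_qps * (1.0 - hot_share) / units
    return hot + background


total = 200_000          # amplified downstream demand (assumed)
per_unit_limit = 10_000  # hypothetical limit of one partition or shard

for units in (32, 64, 128):
    load = hottest_unit_qps(total, units, hot_share=0.10)
    verdict = "ok" if load <= per_unit_limit else "over limit"
    print(f"{units:>3} units: hottest unit sees {load:,.0f} ops/s ({verdict})")
```

Under these assumptions the hot unit never drops below the 20,000 ops/s aimed at the hot key, no matter how many units are added. Only the background share shrinks.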

In Practice

Context

Amazon’s Dynamo paper describes the use of consistent hashing and virtual nodes to distribute key ranges across storage nodes. The documented design addresses load distribution and membership changes in a highly available key-value store, rather than assuming that a single global capacity number is enough. See Dynamo: Amazon’s Highly Available Key-value Store.

Action

The architectural pattern is to hash keys into many ownership ranges, assign multiple virtual nodes to each physical node, and rebalance ownership as nodes enter or leave the cluster.

Result

This improves distribution when traffic is broad across keys. It does not eliminate hot keys. If one logical key dominates request volume, hashing places that key in exactly one ownership range, so only the few nodes responsible for that range can serve it. The cluster may be balanced by bytes and still overloaded by requests.

Learning

Partitioning solves aggregate distribution. It does not solve popularity skew by itself. Capacity planning must model both total keyspace distribution and hottest-key demand.
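The difference between keyspace balance and request balance can be seen in a toy hash ring. This is a simplified sketch of consistent hashing with virtual nodes, not Dynamo's implementation: it ignores replication and membership changes. The node names, the count of 100 virtual nodes, and the traffic mix are assumptions.

```python
import bisect
import hashlib
from collections import Counter


def ring_position(value: str) -> int:
    """Stable 64-bit position on the ring (hash used for spreading only)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes and a single owner per key."""

    def __init__(self, nodes: list[str], vnodes: int = 100):
        self._ring = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._ring]

    def owner(self, key: str) -> str:
        # First virtual node clockwise from the key, wrapping at the end.
        idx = bisect.bisect_right(self._positions, ring_position(key))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing([f"node-{i}" for i in range(8)])

# Broad traffic: 10,000 requests across 10,000 distinct keys.
broad = Counter(ring.owner(f"user:{i}") for i in range(10_000))

# Skewed traffic: same volume, but half of it targets a single key.
skewed = Counter(ring.owner(f"user:{i}") for i in range(5_000))
skewed[ring.owner("user:celebrity")] += 5_000

print("broad  max per node:", max(broad.values()))
print("skewed max per node:", max(skewed.values()))
```

In the skewed run, whichever node owns `user:celebrity` carries at least half of all requests. Adding nodes or virtual nodes changes which node that is, not how much it receives.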

Context

Google Cloud Bigtable documentation explains that row keys are stored in lexicographic order and warns that poor row-key design can create hotspotting. Google’s schema guidance recommends designing keys around access patterns and using techniques such as salting when needed. See Bigtable schema design best practices and Google’s key salting discussion.

Action

The documented pattern is to avoid monotonically increasing or highly clustered row keys when write traffic is high. For skewed workloads, prepend or otherwise include a distribution component so adjacent hot writes do not land on the same tablet range.

Result

Spreading the write path across multiple ranges lets the system use more of its physical capacity. The tradeoff is query complexity: reads may need to scan multiple salted ranges and merge results.

Learning

You cannot choose partition keys only for query convenience. The key must also carry enough entropy to distribute peak write and read load.
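The salting tradeoff can be sketched without any real datastore. The example below uses a plain Python dict as a stand-in for a sorted table and does not use the Bigtable client API. The key layout, the bucket count of 4, and the function names are hypothetical.

```python
import hashlib
import heapq

SALT_BUCKETS = 4  # hypothetical bucket count


def salt_for(entity_id: str, timestamp_ms: int, buckets: int = SALT_BUCKETS) -> int:
    """Deterministic salt so the same event always maps to the same bucket."""
    digest = hashlib.sha256(f"{entity_id}:{timestamp_ms}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % buckets


def row_key(entity_id: str, timestamp_ms: int, buckets: int = SALT_BUCKETS) -> str:
    # Zero padding keeps lexicographic order equal to numeric order in a bucket.
    salt = salt_for(entity_id, timestamp_ms, buckets)
    return f"{salt:02d}#{entity_id}#{timestamp_ms:013d}"


def read_entity(table: dict[str, str], entity_id: str,
                buckets: int = SALT_BUCKETS) -> list[str]:
    """Scan one prefix per bucket, then merge the results by timestamp."""
    per_bucket = []
    for bucket in range(buckets):
        prefix = f"{bucket:02d}#{entity_id}#"
        # Stand-in for a prefix range scan against a sorted store.
        rows = sorted(k for k in table if k.startswith(prefix))
        per_bucket.append([(k.rsplit("#", 1)[1], table[k]) for k in rows])
    return [value for _, value in heapq.merge(*per_bucket)]


start = 1_700_000_000_000
table = {row_key("sensor-1", t): f"reading-{t}" for t in range(start, start + 20)}

print(sorted({key[:2] for key in table}))  # salt prefixes actually used
print(read_entity(table, "sensor-1")[:3])  # earliest readings, merged in order
```

Writes for one hot entity now land in several key ranges, while every read pays for `SALT_BUCKETS` scans plus a merge. That cost is the query complexity the text describes.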

Context

AWS DynamoDB documentation describes adaptive capacity for uneven access patterns and separately documents throttling caused by hot key ranges. AWS notes that adaptive capacity can help with hot partitions, but within table and partition limits. See DynamoDB adaptive capacity and hot partition mitigation.

Action

The documented pattern is to design partition keys for uniform access, monitor throttling at the key-range level, and rely on adaptive behavior as a mitigation rather than the primary design.

Result

A workload may run normally until one tenant, item, or time bucket becomes dominant. At that point, provisioned or on-demand capacity at the table level is less important than whether the hot key range can absorb the concentrated request stream.

Learning

Managed services reduce operational burden, but they do not remove the need to understand the unit of isolation. Capacity planning still has to ask which key range, partition, or item becomes hot first.

Where It Breaks

Failure mode	Why the plan looked safe	What actually failed	Better capacity question
Fanout explosion	Frontend QPS was below service benchmark	Downstream reads multiplied per request	What is peak QPS at every dependency?
Retry storm	Normal latency was acceptable	Slow dependencies triggered synchronized retries	What is the retry budget during brownout?
Hot tenant	Aggregate database capacity was high	One tenant exceeded one partition’s capacity	What is max QPS for the busiest tenant?
Hot object	Cache hit rate looked strong globally	One key overloaded one cache node or shard	What is per-key request concentration?
Queue backlog	Producers were healthy	Consumers could not drain amplified writes	What is sustained drain rate under peak writes?
Regional failover	Each region passed steady-state load tests	One region received another region’s traffic	Can one region absorb failover plus retries?

The common theme is that the failing unit was smaller than the dashboard. Service-level QPS, cluster CPU, and average latency are necessary signals, but they are not capacity guarantees.

A useful review works from the bottom up:

1. Identify the constrained units.
2. Estimate demand per constrained unit.
3. Add amplification from fanout, retries, and async work.
4. Test the highest-risk skew scenarios.
5. Put admission control before irreversible overload.

Admission control matters because overload changes the system. Queues grow, caches churn, connection pools saturate, thread pools block, and clients retry. Once the system enters that state, raw capacity is no longer the only problem. Recovery becomes a separate capacity event.
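One common way to put admission control in front of a constrained unit is a token bucket. The sketch below is a minimal single-threaded version. The class name, rate, and burst values are assumptions, and a production limiter would also need locking and per-tenant or per-key buckets.

```python
import time


class TokenBucket:
    """Admit at most `rate` requests per second on average, bursts up to `burst`."""

    def __init__(self, rate: float, burst: float, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self._clock = clock      # injectable so the behavior can be tested
        self._tokens = burst     # start full
        self._last = clock()

    def try_admit(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self._tokens = min(self.burst,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False  # shed the request now instead of queueing it


# Usage with a fake clock: 20 simultaneous requests against a burst of 5.
fake_now = [0.0]
bucket = TokenBucket(rate=10, burst=5, clock=lambda: fake_now[0])
first_wave = sum(bucket.try_admit() for _ in range(20))

fake_now[0] = 0.5  # half a second later, 5 tokens have refilled
second_wave = sum(bucket.try_admit() for _ in range(20))
print(first_wave, second_wave)
```

Rejecting early keeps queues, connection pools, and thread pools out of the saturated state described above, so recovery stays a smaller event.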

What to Do Next

Problem — Your service-level QPS target is not a capacity plan. It is only the first input. Expand it into a request graph that includes synchronous calls, async writes, retries, cache behavior, and database partitions.
Solution — Build capacity budgets per constrained unit: per dependency, per shard, per partition, per queue, per tenant, and per hot object. Treat fanout and write amplification as first-class multipliers.
Proof — Validate the model with load tests that include skew. Test one hot tenant, one hot key, one slow dependency, one retrying client population, and one regional failover case. Compare observed downstream QPS against the budget.
Action — Before the next launch, write the capacity equation beside the architecture diagram. Name the hottest unit in the design. If no one can say what fails first, the system is not capacity planned; it is only benchmarked.
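For the skew tests in the Proof step, a load generator needs a key stream that is deliberately uneven. A minimal sketch follows, assuming a Zipf-like popularity curve. The exponent, key count, and key names are placeholders, and real traffic should be fitted from production logs where available.

```python
import random
from collections import Counter


def zipf_weights(n_keys: int, s: float = 1.0) -> list[float]:
    """Unnormalized Zipf-like weights: the key at rank r gets weight 1 / r**s."""
    return [1.0 / (rank ** s) for rank in range(1, n_keys + 1)]


def skewed_key_stream(n_keys: int, n_requests: int,
                      s: float = 1.0, seed: int = 0) -> list[str]:
    """Seeded, reproducible request stream with popularity skew."""
    rng = random.Random(seed)
    keys = [f"key-{i}" for i in range(n_keys)]
    return rng.choices(keys, weights=zipf_weights(n_keys, s), k=n_requests)


stream = skewed_key_stream(n_keys=1_000, n_requests=50_000)
counts = Counter(stream)
hottest_key, hottest_count = counts.most_common(1)[0]

print(f"hottest key: {hottest_key}")
print(f"share of all requests: {hottest_count / len(stream):.1%}")
```

Multiplying the hottest key's share by the amplified downstream QPS gives the number to compare against the per-unit budget.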

Rajiv
