Rate Limiting Is a Product Contract, Not Just a Redis Counter
The failure mode is not that too many requests reached Redis. The failure mode is that the product promised one behavior, the platform enforced another, and clients learned the difference in production.
Situation
Rate limiting usually enters the design review as an infrastructure problem. Someone draws a gateway, a Redis cluster, a token bucket, and a 429 Too Many Requests response. That is a useful mechanism, but it is not the architecture.
The architecture starts earlier: who is entitled to do what, at what cost, under which plan, from which identity, and with what recovery semantics when they exceed the boundary. A free user sending ten expensive export jobs is not the same as an enterprise tenant sending ten cheap metadata reads. A customer retrying after a timeout is not the same as a bot scanning every endpoint. A batch integration that can wait is not the same as a checkout path that must preserve latency.
Modern APIs are product surfaces. Their limits shape customer onboarding, billing, abuse protection, fairness between tenants, and incident blast radius. Once customers automate against the limit, the limit becomes part of the contract whether the team wrote it down or not.
The Problem
The common implementation is deceptively simple: increment a key in Redis, set an expiry, reject when the count crosses a threshold. It works for a single endpoint, a single identity model, and a single failure budget. It collapses when the system needs to express product reality.
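For reference, the simple counter described above can be sketched as follows. This is a minimal illustration, assuming an in-memory dictionary as a stand-in for Redis; `FixedWindowCounter` and the injected `clock` are hypothetical names, not part of any library.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowCounter:
    """In-memory stand-in for the increment-and-expire pattern (illustrative only)."""

    def __init__(self, limit: int, window_seconds: int,
                 clock: Callable[[], float] = time.time) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        # key -> (window index, count); Redis would expire the key instead.
        self._counts: Dict[str, Tuple[int, int]] = {}

    def allow(self, key: str) -> bool:
        window = int(self.clock() // self.window_seconds)
        stored_window, count = self._counts.get(key, (window, 0))
        if stored_window != window:
            count = 0  # the previous window has "expired"
        count += 1
        self._counts[key] = (window, count)
        return count <= self.limit
```

Note what the sketch cannot express: the key is an opaque string, every request costs one unit, and the only answer is yes or no.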
The first break is identity. Is the unit of fairness an API key, OAuth app, user, tenant, IP address, organization, workload, or billing account? If the limiter uses the wrong key, one noisy integration can starve an entire customer, or one customer can bypass protection by fanning out credentials.
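One way to avoid both identity failures is to charge more than one budget per request. The sketch below is an assumption about how that could look: `RequestIdentity` and `budget_keys` are hypothetical names, and the choice of tenant plus credential is only one possible hierarchy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RequestIdentity:
    ip_address: str
    api_key: Optional[str] = None
    tenant_id: Optional[str] = None


def budget_keys(identity: RequestIdentity) -> List[str]:
    """Return every budget a request should be charged against."""
    keys: List[str] = []
    if identity.tenant_id is not None:
        # Shared tenant budget: fanning out credentials does not add capacity.
        keys.append(f"tenant:{identity.tenant_id}")
    if identity.api_key is not None:
        # Per-credential budget: one noisy integration cannot starve its siblings.
        keys.append(f"key:{identity.api_key}")
    if not keys:
        # Anonymous traffic falls back to the weakest identity available.
        keys.append(f"ip:{identity.ip_address}")
    return keys
```

Two credentials from the same tenant share the first key and differ in the second, so each failure mode is bounded by a different budget.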
The second break is cost. One request is not one unit of work. A cache hit, a paginated search, a graph expansion, and a report generation path may all look like HTTP requests while consuming radically different CPU, database, queue, and third-party quota.
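A cost-weighted budget makes the difference concrete. The unit values below are invented for illustration; in a real system they would come from measurement, and `COST_UNITS` and `CostBudget` are hypothetical names.

```python
from typing import Dict

# Hypothetical cost units per work class (assumed values, not measurements).
COST_UNITS: Dict[str, int] = {
    "cache_hit": 1,
    "paginated_search": 5,
    "graph_expansion": 20,
    "report_generation": 100,
}


class CostBudget:
    """Budget expressed in cost units rather than request count."""

    def __init__(self, units_per_window: int) -> None:
        self.remaining = units_per_window

    def try_charge(self, work_class: str) -> bool:
        cost = COST_UNITS[work_class]
        if cost > self.remaining:
            return False  # deny without charging
        self.remaining -= cost
        return True
```

With a budget of 120 units, one report and one graph expansion exhaust the window, while the same budget would cover 120 cache hits.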
The third break is communication. If clients only receive 429, they do not know whether to retry in one second, one hour, with a smaller page size, with a different credential, or never. Bad limit responses create retry storms. Good limit responses create coordinated backpressure.
The fourth break is operations. During an incident, teams need to lower limits for one route, exempt one tenant, shed one class of work, and observe which contracts are being enforced. A hard-coded Redis counter gives the operator a knob. A contract-oriented limiter gives the operator a control plane.
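The difference between a knob and a control plane can be sketched as a policy store that accepts scoped overrides at runtime and records who changed what. This is a simplified assumption, not a reference design; `PolicyStore` and its fields are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PolicyStore:
    # route -> default requests per minute
    defaults: Dict[str, int]
    # (tenant, route) -> overridden limit, set at runtime without a deploy
    overrides: Dict[Tuple[str, str], int] = field(default_factory=dict)
    audit_log: List[str] = field(default_factory=list)

    def set_override(self, tenant: str, route: str, limit: int,
                     operator: str, reason: str) -> None:
        self.overrides[(tenant, route)] = limit
        self.audit_log.append(
            f"{operator} set {tenant} {route} to {limit}: {reason}"
        )

    def limit_for(self, tenant: str, route: str) -> int:
        # An override affects only the targeted tenant and route.
        return self.overrides.get((tenant, route), self.defaults[route])
```

Lowering one tenant on one route leaves every other tenant on the default, and the audit log explains the change after the incident.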
The question is not “which rate limiting algorithm should we use?” The question is: what product contract should the platform enforce when demand exceeds safe capacity?
Make the Limit a Contract
A rate limit contract has five parts: identity, budget, scope, response, and observability.
Identity defines who owns the budget. Budget defines the allowed cost over time. Scope defines where the budget applies: route, method, feature, tenant, region, or dependency. Response defines what the client can rely on when it is throttled. Observability proves whether the contract is fair, effective, and safe.
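Written as data, the five parts might look like the record below. The field names and the example values are assumptions chosen for illustration; the point is only that each part of the contract is explicit and inspectable.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RateLimitContract:
    identity: str                  # who owns the budget, e.g. "tenant"
    budget_units: int              # allowed cost per window
    window_seconds: int
    scope: str                     # where it applies, e.g. "POST /exports"
    throttled_status: int          # what a throttled client can rely on
    hard_limit: bool               # True: documented product limit; False: protective
    event_labels: Tuple[str, ...]  # dimensions emitted for observability

    def __post_init__(self) -> None:
        if self.budget_units <= 0 or self.window_seconds <= 0:
            raise ValueError("budget and window must be positive")
        if not self.event_labels:
            raise ValueError("a contract without observability cannot be verified")


# Hypothetical example: 500 cost units per hour per tenant on the export route.
export_contract = RateLimitContract(
    identity="tenant", budget_units=500, window_seconds=3600,
    scope="POST /exports", throttled_status=429, hard_limit=True,
    event_labels=("tenant", "route", "rule"),
)
```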
The implementation can still use token buckets, leaky buckets, fixed windows, sliding windows, or distributed counters. Those are enforcement details. The durable design decision is to separate policy from enforcement.
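To show what separating policy from enforcement can mean in code, here is a minimal token bucket whose numbers live in a policy object rather than in the algorithm. `BucketPolicy`, `TokenBucket`, and the injected `clock` are hypothetical names; this is a single-process sketch, not a distributed limiter.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BucketPolicy:
    capacity: float           # maximum burst, in cost units
    refill_per_second: float  # sustained rate, in cost units per second


class TokenBucket:
    """Enforcement only: the bucket knows nothing about plans or tenants."""

    def __init__(self, policy: BucketPolicy,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.policy = policy
        self.clock = clock
        self.tokens = policy.capacity
        self.updated_at = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self.clock()
        elapsed = now - self.updated_at
        self.updated_at = now
        # Refill for the elapsed time, never exceeding the burst capacity.
        self.tokens = min(self.policy.capacity,
                          self.tokens + elapsed * self.policy.refill_per_second)
        if cost > self.tokens:
            return False
        self.tokens -= cost
        return True
```

Swapping the bucket for a sliding window would change this class but not the policy that feeds it.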
```mermaid
flowchart TD
    A["product plan: entitlement"] --> B["policy compiler: routes and budgets"]
    B --> C["edge gateway: cheap rejection"]
    B --> D["global limiter: shared quota"]
    B --> E["service guardrail: expensive work"]
    C -->|allow| F["request handler: business path"]
    D -->|allow| F
    E -->|allow| F
    C -->|deny| G["limit response: status and reset"]
    D -->|deny| G
    E -->|deny| G
    F --> H["response contract: headers and retry"]
    G --> H
    C -->|events| I["observability: tenant and route"]
    D -->|events| I
    E -->|events| I
```
The edge gateway should reject obviously over-budget traffic before it consumes expensive resources. The global limiter should coordinate shared tenant or account budgets across regions and workers. The service guardrail should protect the scarce dependency the gateway cannot understand: a database connection pool, a model inference queue, an export worker, or a search cluster.
The response contract matters as much as the rejection. Clients need stable status codes, remaining budget headers where appropriate, reset information, and retry guidance. Some limits should be documented as hard product limits. Others should be documented as protective limits that may vary during abuse or incidents.
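A throttled response that carries this information might be built as in the sketch below. `Retry-After` is a standard HTTP header; the `X-RateLimit-*` names follow a common convention rather than a standard, and `X-RateLimit-Kind` is a hypothetical header invented here to distinguish hard limits from protective ones.

```python
import math
from typing import Dict, Tuple


def throttle_response(limit: int, reset_epoch: float, now_epoch: float,
                      protective: bool) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for a request that exceeded its budget."""
    # Round up so a client that waits exactly this long is past the reset.
    retry_after = max(1, math.ceil(reset_epoch - now_epoch))
    headers = {
        "Retry-After": str(retry_after),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": str(int(reset_epoch)),
        # Hypothetical header: tells the client which kind of limit fired.
        "X-RateLimit-Kind": "protective" if protective else "quota",
    }
    return 429, headers
```

A client that honors `Retry-After` backs off in step with the server instead of guessing, which is the difference between coordinated backpressure and a retry storm.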
The contract should also admit hierarchy. A platform may need an account-level daily quota, a per-route burst limit, a concurrency cap for expensive jobs, and an emergency regional drain rule. Treating all of that as “requests per minute” hides the product decision inside infrastructure syntax.
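A hierarchy of limits can be sketched as a list of budgets that must all agree before any of them is charged. The names below (`Budget`, `admit`) are hypothetical, and the sketch ignores time windows to keep the admission rule visible.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Budget:
    name: str
    remaining: int


def admit(budgets: List[Budget], cost: int = 1) -> Optional[str]:
    """Return None if admitted, else the name of the first exhausted budget."""
    # Check every level first so a denial never partially charges the hierarchy.
    for budget in budgets:
        if budget.remaining < cost:
            return budget.name
    for budget in budgets:
        budget.remaining -= cost
    return None
```

Returning the name of the rule that denied the request is what lets the response and the dashboards say which limit fired, rather than reporting a generic rejection.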
In Practice
Context: GitHub’s REST API documentation describes primary rate limits, secondary rate limits, response headers that report the remaining quota, and 403 or 429 responses when limits are exceeded. The documented pattern is that client-visible limits are not just counters; they are part of the API behavior clients must code against. (Source: GitHub REST API rate limits.)
Action: A contract-oriented design copies that separation. Primary limits express the normal entitlement. Secondary limits protect platform health when behavior is abusive, highly concurrent, or expensive even if the primary quota is not exhausted.
Result: The client can reason about normal consumption while the provider keeps room for protective enforcement. That is a better contract than pretending every unsafe behavior can be captured by a single remaining counter.
Learning: Publish the steady-state budget, but reserve an explicitly documented protective layer for overload and abuse. If the protective layer is invisible, customers experience it as randomness.
Context: AWS API Gateway usage plans associate API keys with throttling and quota settings, and AWS documents that throttling and quota limits for usage plans are applied across stages within a usage plan. AWS also documents method-level throttling for usage plans. (Source: API Gateway usage plans.)
Action: The useful pattern is plan-driven policy, not merely gateway-side rejection. Product packaging, API identity, route-level cost, and operational throttling meet in one control surface.
Result: Teams can express different budgets for different customers and methods without forcing every backend service to rediscover the commercial model.
Learning: Put product policy in a place where product, platform, and operations can all inspect it. If the policy only exists as scattered constants, no one owns the contract.
Context: Kubernetes API Priority and Fairness controls API server behavior under overload by classifying requests and managing fairness between flows. The documented pattern is load shedding with priority, not undifferentiated rejection. (Source: Kubernetes API Priority and Fairness.)
Action: Apply the same idea to product APIs. Separate interactive reads, background sync, admin operations, and bulk exports into classes with different queues, concurrency, and rejection behavior.
Result: A batch customer job can be slowed without taking down a latency-sensitive operational path. The system fails by policy instead of by accident.
Learning: Fairness is a product and reliability decision. A limiter that cannot distinguish work classes will eventually protect the wrong thing.
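The work-class idea can be illustrated with per-class concurrency caps. This is a deliberately simplified sketch and not the Kubernetes algorithm, which also queues and fair-shares between flows; `ClassLimiter` and the class names are hypothetical.

```python
from typing import Dict


class ClassLimiter:
    """Independent concurrency caps per work class (no queuing in this sketch)."""

    def __init__(self, max_in_flight: Dict[str, int]) -> None:
        self.max_in_flight = dict(max_in_flight)
        self.in_flight = {name: 0 for name in max_in_flight}

    def try_start(self, work_class: str) -> bool:
        if self.in_flight[work_class] >= self.max_in_flight[work_class]:
            return False  # shed only this class; others are unaffected
        self.in_flight[work_class] += 1
        return True

    def finish(self, work_class: str) -> None:
        self.in_flight[work_class] = max(0, self.in_flight[work_class] - 1)
```

When bulk exports hit their cap, interactive reads still start, which is the behavior the paragraph above describes as failing by policy.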
Where It Breaks
| Failure mode | What happens | Design response |
|---|---|---|
| Wrong identity key | One integration starves a tenant, or one tenant bypasses limits | Model budgets around the accountable product entity |
| Flat request pricing | Cheap reads and expensive jobs consume the same quota | Charge budget by cost class, not only request count |
| Hidden protective limits | Clients see random throttling and retry harder | Document secondary limits and retry behavior |
| Single enforcement point | Gateway allows work that later melts a dependency | Add service-level guardrails near scarce resources |
| No emergency controls | Incident response requires code deploys | Keep runtime policy overrides with audit trails |
| Poor observability | Operators cannot explain who was throttled or why | Emit decision events by tenant, route, class, and rule |
| Over-strict consistency | Limiter becomes a global latency dependency | Use approximate distributed enforcement where exactness is not worth the availability cost |
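The observability row in the table can be made concrete with a structured decision event. The field set mirrors the dimensions named there (tenant, route, class, rule); `LimitDecision` and `to_log_line` are hypothetical names, and JSON lines are only one possible transport.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LimitDecision:
    tenant: str
    route: str
    work_class: str
    rule: str        # which contract or rule produced the decision
    allowed: bool
    remaining: int   # budget left after the decision


def to_log_line(decision: LimitDecision) -> str:
    """Serialize one decision as a JSON line with stable key order."""
    return json.dumps(asdict(decision), sort_keys=True)
```

Emitting allowed decisions as well as denials is what makes it possible to answer who was throttled, by which rule, and how close everyone else was to the same boundary.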
What to Do Next
- Problem: A Redis counter answers “how many requests arrived,” but the product needs to answer “which customer, plan, route, and work class is allowed to consume scarce capacity.”
- Solution: Define the rate limit contract first: identity, budget, scope, response, and observability. Then choose enforcement algorithms that fit each layer.
- Proof: Public systems such as GitHub, AWS API Gateway, and Kubernetes expose the same pattern in different forms: documented limits, plan-aware throttling, and fairness under overload.
- Action: Inventory every public and internal API limit. For each one, write down the accountable identity, the cost model, the client response, the operational override, and the dashboard that proves enforcement is behaving as intended.