The most reliable indicator that an AI feature has moved from prototype to production is the moment the team stops optimizing for intelligence and starts optimizing for cost per inference.

Situation

Engineering teams are embedding LLM calls into production application paths: search ranking, customer support routing, document processing, data extraction pipelines. At prototype scale these costs are invisible. At production scale — millions of requests per day, 50k–200k token prompts, hundreds of API keys across dozens of services — the unit economics become a board-level concern.

The initial response is to aggressively downgrade to smaller models. This reliably breaks edge-case reasoning that the larger models handled gracefully, and causes a wave of quality regressions that are expensive to diagnose. The industry pattern that emerges after that first cycle: treat LLM cost optimization as a distributed systems routing and caching problem, not a model selection problem.

The Problem

The naive production LLM architecture has a structural flaw: it sends the full context — system prompt, retrieved documents, conversation history, tool schemas — to a frontier model for every single user request, regardless of whether the request requires frontier-level reasoning.

This breaks in two compounding ways. First, large context windows are expensive. A 100k-token prompt costs roughly 100x more than a 1k-token prompt on most provider pricing tiers. Second, time-to-first-token degrades with context size for uncached requests, degrading user experience even when cost is not yet a concern.

Teams that try to fix this by blindly truncating context introduce hallucination — the model answers without necessary information. Teams that route everything to smaller models introduce quality regressions. The actual engineering problem is: how do you route each request to the cheapest model that can correctly handle it, while dynamically pruning context to only what that request needs?

Context-Aware Routing and Caching Architecture

The architecture that solves this decouples prompt construction from inference, introduces a routing classifier, and structures prompts for maximum cache hit rates.

flowchart TD
    Req[Incoming Request] --> R[Semantic Router — intent classifier]
    R -->|Simple intent — summarize, extract, format| S[Small Model — Llama 3 8B or Haiku-tier]
    R -->|Complex intent — reason, plan, multi-step| CP[Context Builder]
    
    CP --> Cache[Provider Cache Lookup]
    Cache -->|Hit — prefix cached| F[Frontier Model — cached rate]
    Cache -->|Miss| B[Frontier Model — full rate]
    
    S --> Res[Response]
    F --> Res
    B --> Res
    B --> Store[Cache warm — next request hits]

The system operates in three phases:

Phase 1 — Semantic routing. Every incoming request passes through a fast intent classifier — either an embedding similarity check or a locally hosted small model. The classifier assigns the request to one of two paths: trivial intent (summarization, data extraction, structured formatting) or complex intent (multi-step reasoning, planning, code generation, ambiguous queries). Trivial intent routes to the small model tier; complex intent proceeds to context construction.

Phase 2 — Structured context construction. For complex requests, the context is assembled deterministically. Static content — system prompt, tool schemas, domain rules, reference documents — is placed first in the prompt as a stable prefix. Dynamic content — the specific user query, retrieved documents, conversation history — is appended at the end. This ordering is not cosmetic; it is the structural requirement for provider-side prefix caching.

Phase 3 — Prefix caching. Anthropic’s documented prompt caching behavior (introduced 2024) requires that cached content appear as a continuous prefix. If you interleave dynamic content within the static block, the cache is invalidated on every request. Groups that structure prompts correctly — all static content at the top, all dynamic content at the bottom — achieve the documented 90% input token discount on cached tokens. The cache TTL is 5 minutes, meaning high-traffic services maintain warm caches naturally.

In Practice

A) Anthropic’s documented prefix caching behavior: When Anthropic released prompt caching in 2024, the published documentation specifies that the cache_control parameter must be applied to a continuous prefix block. The documented discount is up to 90% on cached input tokens, with a cache write surcharge of 25% on first insertion. The 5-minute TTL means applications with consistent traffic profiles will maintain warm caches; batch jobs or low-frequency services should pre-warm caches explicitly.

B) Cloudflare AI Gateway’s semantic routing behavior: Cloudflare’s AI Gateway intercepts requests before they reach providers and supports routing rules based on request metadata. The documented pattern is to configure routing rules that direct simple-intent requests to cheaper models (Llama 3 running on Workers AI or Groq) while passing complex requests through to OpenAI or Anthropic. This requires no application code changes — the gateway handles routing based on a configured intent classifier or explicit request headers.

C) OpenAI’s Automatic Prompt Caching behavior: OpenAI documented automatic prefix caching in 2024 for prompts over 1,024 tokens. The caching is implicit — no API parameter required — and the discount applies automatically to the cached prefix. The documented behavior is that the first 1,024-token boundary of repeated prefixes is cached after the first request. This means structuring your system prompts to front-load stable content produces cache benefits without explicit instrumentation.

The acknowledged production pattern for RAG pipelines is to apply context pruning before constructing the prompt. Rather than passing all retrieved documents, teams filter to the top 2–3 most relevant documents by a secondary re-ranking step, and apply a maximum token budget per document. This keeps the dynamic context block small enough that the static prefix represents a large proportion of total prompt tokens — maximizing the economic benefit of prefix caching.

Where It Breaks

StrategyFailure ModeMitigation
Semantic routingThe classifier misroutes a complex request to the small model, which returns a confident but wrong answer with no indication of uncertainty.Implement a rejection mechanism: the small model returns a structured “needs escalation” response if it detects ambiguous or multi-step reasoning. Route that response back through the frontier model path.
Prefix cachingLow-traffic services never keep the 5-minute TTL warm. Cache misses incur the full token cost plus the write surcharge.For low-frequency services, pre-warm the cache explicitly at service startup and on a scheduled refresh before the TTL expires. Only enable explicit caching for prompts that justify the write overhead.
Context truncationAggressively truncating retrieved documents to reduce token count causes the model to answer from incomplete information, producing confidently wrong responses.Set a minimum token budget per document based on empirical evaluation. Do not truncate below the threshold that your quality benchmarks require.
Static prefix driftSystem prompt or tool schema is updated by one team without notifying the routing/caching layer. The cache is invalidated on every request until the deployment propagates.Treat the static prefix block as a versioned artifact. Deploy prompt changes as versioned releases, not ad-hoc edits.

What to Do Next

  • Problem: Production LLM features that send full unoptimized context to frontier models for every request are structurally expensive — costs scale with context size, not with request complexity.
  • Solution: Implement semantic routing to separate trivial from complex requests, structure prompts for maximum prefix cache hit rates, and apply context size budgets per retrieved document.
  • Proof: Anthropic’s documented prefix caching discount (up to 90% on cached input tokens) and Cloudflare AI Gateway’s documented routing behavior provide the infrastructure primitives — both are deployed configuration, not custom code.
  • Action: Audit your five highest-volume LLM API calls. For each: identify what percentage of the prompt is static vs. dynamic, whether the static content is placed first, and whether the request complexity justifies a frontier model. Those three answers determine which optimization to apply first.