
Prompt Caching: The Math

Ravinder · 7 min read
AI · LLM · Prompt Caching · Cost Optimization · Anthropic

The Optimization Everyone Skips

Teams spend weeks on RAG pipelines and vector search tuning, then send the same 50K-token system prompt with every single request and pay full price each time. Prompt caching is the highest-leverage LLM cost optimization available, and the math takes five minutes to do. Most teams do not do it.

Anthropic charges $0.30 per million tokens for cache reads versus $3.00 for regular input tokens — a 10× discount. OpenAI's automatic caching gives a 50% discount on repeated prefixes. At any meaningful scale, this is your biggest cost lever.


How Prompt Caching Works

The provider stores a KV representation of your prompt prefix server-side. Subsequent requests that share the same prefix pay a dramatically reduced rate for the cached portion. The prefix must be marked explicitly (Anthropic) or is detected automatically via longest-prefix matching (OpenAI).

sequenceDiagram
    participant C as Client
    participant P as Provider Cache
    participant M as Model
    C->>P: Request with 50K token prefix + 100 token user message
    P->>P: Cache MISS (prefix not stored)
    P->>M: Full 50.1K tokens
    M-->>C: Response (billed: 50.1K input tokens)
    P->>P: Store prefix in cache (TTL = 5 min)
    C->>P: Request with same 50K prefix + 200 token user message
    P->>P: Cache HIT
    P->>M: 50K cached + 200 new tokens
    M-->>C: Response (billed: 50K @ cache rate + 200 @ full rate)

The key constraint: the cache prefix must be identical and stable. Any mutation — a timestamp, a user ID, a dynamic value — in the cached portion breaks the cache hit.
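
A minimal sketch of marking a cacheable prefix with the Anthropic Python SDK (the model name is illustrative, and LARGE_STATIC_CONTEXT stands in for your real system prompt; the usage field names match Anthropic's current API):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LARGE_STATIC_CONTEXT = "..."    # placeholder: a large, stable system prompt

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        # Everything up to and including this block becomes the cached prefix
        {"type": "text", "text": LARGE_STATIC_CONTEXT,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": "Summarize the context above."}],
)

# The usage block reports what was written to and read from the cache
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)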


The Break-Even Calculation

Anthropic charges a 25% write premium on the first request that populates the cache (cache miss). The break-even is:

cache_miss_cost = tokens × input_price × 1.25      # first request pays a 25% write premium
cache_hit_cost  = tokens × cache_read_price

miss_premium    = tokens × input_price × 0.25      # extra cost vs. sending the prompt uncached
savings_per_hit = (input_price - cache_read_price) × tokens
break_even_hits = ceil(miss_premium / savings_per_hit)
                = ceil(0.25 / (1 - cache_read_price / input_price))

With Claude 3.5 Sonnet ($3.00 input, $0.30 cache read):

savings_per_hit = ($3.00 - $0.30) × tokens/M = $2.70 × tokens/M
miss_premium    = $3.00 × 0.25 × tokens/M    = $0.75 × tokens/M
break_even_hits = ceil(0.75 / 2.70) = ceil(0.278) = 1

You break even after a single cache hit. Every request after the first that hits cache is pure savings.

For OpenAI GPT-4o ($2.50 input, $1.25 cache):

savings_per_hit = $1.25 × tokens/M
break_even_hits = ceil(0 / 1.25) = 0  (no write premium)

OpenAI charges no write premium — every cache hit is immediate savings.
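
The same arithmetic generalizes to any provider. A small helper, with prices passed in as $/million tokens (a sketch, not tied to any live pricing API):

import math

def break_even_hits(input_price: float, cache_read_price: float,
                    write_premium: float = 0.0) -> int:
    """Cache hits needed to recoup the write premium (0.25 for Anthropic, 0 for OpenAI)."""
    savings_per_hit = input_price - cache_read_price
    if savings_per_hit <= 0:
        raise ValueError("cache reads are not cheaper than regular input")
    return math.ceil(write_premium * input_price / savings_per_hit)

print(break_even_hits(3.00, 0.30, 0.25))  # Claude 3.5 Sonnet -> 1
print(break_even_hits(2.50, 1.25))        # GPT-4o            -> 0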


TTL Strategy

Anthropic's cache TTL is 5 minutes (ephemeral), and it refreshes each time the cached prefix is read. This means requests must arrive at least once every 5 minutes to keep the entry alive. If your system serves bursty traffic with quiet periods, you will pay for repeated cache writes.

gantt
    title Cache Write vs Hit Pattern (5-min TTL)
    dateFormat HH:mm
    axisFormat %H:%M
    section Steady Traffic (Cache Efficient)
    Cache Write (miss)    :crit, a1, 09:00, 1m
    Cache Hits            :active, a2, 09:01, 4m
    Cache Write (expiry)  :crit, a3, 09:05, 1m
    Cache Hits            :active, a4, 09:06, 4m
    section Bursty Traffic (Cache Wasted)
    Cache Write           :crit, b1, 10:00, 1m
    Idle                  :b2, 10:01, 9m
    Cache Write (expired) :crit, b3, 10:10, 1m
    Idle                  :b4, 10:11, 9m

Strategies for Bursty Traffic

Option 1: Synthetic keep-alive requests

Send a minimal no-op request every 4 minutes to keep the cache warm. This costs nearly nothing (a tiny user message plus a cached prefix read) but maintains the cache entry.

import asyncio
import logging

from anthropic import AsyncAnthropic

logger = logging.getLogger(__name__)

async def keep_cache_warm(client: AsyncAnthropic, system_prompt: str):
    """Send a minimal request every 4 minutes to prevent cache expiry."""
    while True:
        await asyncio.sleep(240)  # 4 minutes, inside the 5-minute TTL
        try:
            await client.messages.create(
                model="claude-opus-4-5",
                max_tokens=1,
                system=[{"type": "text", "text": system_prompt,
                         "cache_control": {"type": "ephemeral"}}],
                messages=[{"role": "user", "content": "ping"}],
            )
        except Exception as e:
            logger.warning("Cache keep-alive failed", exc_info=e)

Option 2: Schedule the cache write

If you know traffic arrives in batches (e.g., nightly jobs, business hours), write the cache just before the batch and let it expire naturally during idle periods.
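
A sketch of the scheduled-write pattern for a nightly batch, reusing the async client and cached system blocks from the keep-alive example (run_nightly_batch and the document loop are illustrative names):

async def run_nightly_batch(client, system_blocks: list, documents: list[str]):
    """Populate the cache once, then process the batch while it is warm."""
    # One tiny request pays the 25% write premium and stores the prefix
    await client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1,
        system=system_blocks,  # includes the cache_control breakpoint
        messages=[{"role": "user", "content": "warm-up"}],
    )
    # Each subsequent request reads the cached prefix and refreshes the 5-min TTL
    for doc in documents:
        await client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            system=system_blocks,
            messages=[{"role": "user", "content": doc}],
        )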

Option 3: Persistent cache (Beta)

Anthropic's persistent cache (where available) extends the TTL significantly. Prefer it for stable system prompts shared across your application.


Cache Key Design

The cache key is implicitly the full content of the prefix up to the cache_control breakpoint. This means cache key design is really about prompt structure.

Rule: Put Stable Content First

from datetime import datetime

# These blocks are passed as the top-level `system=` parameter of messages.create()

# WRONG: dynamic content before the cache breakpoint breaks every request
system = [
    {"type": "text", "text": f"Today is {datetime.now().isoformat()}."},  # dynamic!
    {"type": "text", "text": LARGE_STATIC_CONTEXT,
     "cache_control": {"type": "ephemeral"}},  # cache breakpoint
]

# RIGHT: stable content first, dynamic content after the cache breakpoint
system = [
    {"type": "text", "text": LARGE_STATIC_CONTEXT,
     "cache_control": {"type": "ephemeral"}},  # cache breakpoint
    {"type": "text", "text": f"Today is {datetime.now().isoformat()}."},  # dynamic, not cached
]

Multiple Cache Breakpoints

Anthropic supports up to 4 cache breakpoints per request. Use them to cache at multiple granularities:

system = [
    # Tier 1: Global knowledge base (rarely changes)
    {"type": "text", "text": GLOBAL_KNOWLEDGE_BASE,
     "cache_control": {"type": "ephemeral"}},

    # Tier 2: User-specific context (changes per user)
    {"type": "text", "text": user_context,
     "cache_control": {"type": "ephemeral"}},

    # Tier 3: Session instructions (not cached, too variable)
    {"type": "text", "text": session_instructions},
]

The first breakpoint caches the global knowledge base for all users. The second caches per-user context across that user's session. The third is sent fresh each time.


Measuring Cache Performance

Never assume your cache is working. Measure it.

def log_cache_metrics(usage: dict, request_id: str):
    input_tokens = usage.get("input_tokens", 0)
    cache_creation = usage.get("cache_creation_input_tokens", 0)
    cache_read = usage.get("cache_read_input_tokens", 0)
 
    total_input = input_tokens + cache_creation + cache_read
    hit_rate = cache_read / total_input if total_input > 0 else 0
 
    # Cost breakdown
    cost_no_cache = total_input * 3.00 / 1_000_000
    cost_actual = (
        input_tokens * 3.00 / 1_000_000 +
        cache_creation * 3.75 / 1_000_000 +  # 25% write premium
        cache_read * 0.30 / 1_000_000
    )
    savings = cost_no_cache - cost_actual
 
    metrics.record({
        "request_id": request_id,
        "cache_hit_rate": hit_rate,
        "cost_usd": cost_actual,
        "savings_usd": savings,
        "cache_read_tokens": cache_read,
        "cache_creation_tokens": cache_creation,
    })

Track this per endpoint, not just globally. A chat endpoint may have 80% cache hit rate while a document analysis endpoint has 10% — they need different caching strategies.
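
One lightweight way to get that per-endpoint view is to accumulate the same usage fields keyed by endpoint (a sketch; a production system would push these to your metrics backend instead of module-level state):

from collections import defaultdict

# Rolling token totals per endpoint
_endpoint_totals = defaultdict(lambda: {"cache_read": 0, "total_input": 0})

def record_endpoint_usage(endpoint: str, usage: dict) -> None:
    """Accumulate token counts so hit rate can be reported per endpoint."""
    totals = _endpoint_totals[endpoint]
    totals["cache_read"] += usage.get("cache_read_input_tokens", 0)
    totals["total_input"] += (
        usage.get("input_tokens", 0)
        + usage.get("cache_creation_input_tokens", 0)
        + usage.get("cache_read_input_tokens", 0)
    )

def endpoint_hit_rate(endpoint: str) -> float:
    totals = _endpoint_totals[endpoint]
    return totals["cache_read"] / totals["total_input"] if totals["total_input"] else 0.0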

Alerting Thresholds

Set alerts for the following; a minimal check is sketched after the list:

  • Cache hit rate drops below 60% (something mutated the prefix)
  • Cache creation tokens spike (burst of cold starts, possible prompt change)
  • Cost per request rises 3× (cache broke)
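
A minimal version of those checks over the recorded metrics (a sketch; the cache-creation threshold is an arbitrary placeholder you would tune to your prompt size and traffic):

def check_cache_alerts(hit_rate: float, cache_creation_tokens: int,
                       cost_per_request: float, baseline_cost: float) -> list[str]:
    """Return human-readable alerts when cache metrics cross the thresholds above."""
    alerts = []
    if hit_rate < 0.60:
        alerts.append(f"hit rate {hit_rate:.0%} below 60%: prefix likely mutated")
    if cache_creation_tokens > 100_000:  # placeholder threshold
        alerts.append("cache creation tokens spiking: cold starts or prompt change")
    if baseline_cost > 0 and cost_per_request > 3 * baseline_cost:
        alerts.append("cost per request above 3x baseline: cache may be broken")
    return alerts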

Common Cache-Breaking Mistakes

import json
import textwrap
import uuid

# Mistake 1: Conversation history in the system prompt
# Each turn adds tokens, shifting the cache boundary
system = f"History: {json.dumps(conversation_history)}"  # BREAKS CACHE
 
# Fix: Keep history in messages array, not system prompt
messages = [{"role": "user", "content": "..."}, ...]  # history here
 
# Mistake 2: Unique IDs or timestamps in cached portion
system = f"Request ID: {uuid.uuid4()}\n{STATIC_CONTENT}"  # BREAKS CACHE
 
# Fix: Move dynamic values after the cache breakpoint
 
# Mistake 3: Whitespace differences from template rendering
# A trailing newline difference = cache miss
# Fix: Strip and normalize before constructing messages
static_content = textwrap.dedent(TEMPLATE).strip()
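
A cheap guard against silent prefix drift, including the whitespace issue above, is to fingerprint the rendered prefix and alert when it changes (a sketch; how you surface the fingerprint is up to your logging setup):

import hashlib

def prefix_fingerprint(prefix: str) -> str:
    """Stable hash of the rendered prefix; any change guarantees cache misses."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:16]

# Log this at deploy time and per request batch; a changed fingerprint means
# something dynamic or a template change leaked into the cached portion.
print(prefix_fingerprint(static_content))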

Real Savings at Scale

A team with a 40K-token system prompt, 20K requests/day, at Claude 3.5 Sonnet pricing:

  • Without caching: 40K tokens × 20K requests = 800M tokens/day × $3.00/M = $2,400/day
  • With caching (85% hit rate): misses 40K × 3K requests = 120M tokens × $3.75/M = $450; hits 40K × 17K requests = 680M tokens × $0.30/M = $204; total $654/day
  • Savings: $1,746/day ≈ $52,000/month

That is not a rounding error. That is a hiring decision.
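
A quick sanity check of those numbers in plain arithmetic, using the Sonnet prices quoted earlier:

PROMPT_TOKENS = 40_000
REQUESTS_PER_DAY = 20_000
HIT_RATE = 0.85

INPUT_PRICE = 3.00 / 1_000_000  # $/token
WRITE_PRICE = 3.75 / 1_000_000  # includes the 25% write premium
READ_PRICE = 0.30 / 1_000_000

no_cache = PROMPT_TOKENS * REQUESTS_PER_DAY * INPUT_PRICE
misses = PROMPT_TOKENS * REQUESTS_PER_DAY * (1 - HIT_RATE) * WRITE_PRICE
hits = PROMPT_TOKENS * REQUESTS_PER_DAY * HIT_RATE * READ_PRICE

print(f"no cache:   ${no_cache:,.0f}/day")                           # $2,400/day
print(f"with cache: ${misses + hits:,.0f}/day")                      # $654/day
print(f"savings:    ${(no_cache - misses - hits) * 30:,.0f}/month")  # ~$52,380/month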


Key Takeaways

  • Prompt caching break-even is 1 cache hit for Anthropic (due to the 25% write premium) — essentially always worth enabling on stable system prompts.
  • The 5-minute TTL refreshes on each hit but requires a request at least every 5 minutes to stay warm; use synthetic keep-alive requests (every ~4 minutes) for bursty workloads.
  • Cache key = the exact content of your prompt prefix — any dynamic value before the cache breakpoint breaks every hit.
  • Use multiple cache breakpoints (up to 4) to cache at different granularities: global knowledge, per-user context, per-session state.
  • Measure cache hit rate per endpoint and alert on drops — a broken cache is invisible without instrumentation.
  • At 20K requests/day with a 40K-token system prompt, caching saves ~$52K/month on Claude 3.5 Sonnet.