
Context Caching

Context caching (referred to below as prompt caching) reduces the cost of input tokens that are sent repeatedly. It suits scenarios such as fixed system prompts in multi-turn conversations and long-document analysis.

How It Works

In multi-turn conversations, each request must carry the complete message history, meaning fixed content (such as system prompts, tool definitions, long documents) is billed repeatedly.

The core idea of prompt caching: the repeated prefix of a request is billed at the cache-write price (slightly above the normal price) only the first time it is written; subsequent cache hits on that prefix are billed at a much lower rate (roughly 0.1x the normal price).

Request 1: [system prompt + tool definitions + user message 1]
           ├── First write to cache ──┤
Request 2: [system prompt + tool definitions + user message 1 + reply 1 + user message 2]
           ├── Cache hit (low price) ──┤ ├── Normal billing ──┤
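
The prefix idea in the diagram above can be sketched in a few lines. This is only an illustration: it compares requests block by block, whereas the service presumably matches on tokens, and the function name `shared_prefix_blocks` is hypothetical.

```python
def shared_prefix_blocks(previous, current):
    """Count how many leading blocks two requests have in common."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # the first difference ends the reusable prefix
        count += 1
    return count


# The two requests from the diagram, modelled as lists of blocks.
request_1 = ["system prompt", "tool definitions", "user message 1"]
request_2 = request_1 + ["reply 1", "user message 2"]

cached = shared_prefix_blocks(request_1, request_2)
print(f"{cached} blocks reusable from cache, {len(request_2) - cached} billed normally")
```

Everything after the first differing block falls outside the prefix, which is why stable content should come first.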

Supported Protocols

CloudBase AI supports prompt caching through the Anthropic Messages API protocol:

| Protocol | Cache Support | Description |
| --- | --- | --- |
| Anthropic Messages API | ✅ Supported | Explicit caching + automatic caching |
| Chat Completions API | - | No client-side cache control |
| CloudBase SDK | - | No client-side cache control |
Note: When using the Chat Completions protocol or the CloudBase SDK, the server may apply internal caching optimizations, but the client cannot control them explicitly. For precise cache control, use the Anthropic Messages API.

Usage

Add cache_control markers to content blocks that should be cached:

curl "https://<ENV_ID>.api.tcloudbasegateway.com/v1/ai/cloudbase/v1/messages" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "hy3-preview",
    "max_tokens": 1024,
    "system": [
      {
        "type": "text",
        "text": "You are a legal document analysis expert. Here is a complete contract specification (50,000 words)...",
        "cache_control": {"type": "ephemeral"}
      }
    ],
    "messages": [
      {"role": "user", "content": "Find the penalty clauses in this contract"}
    ]
  }'

The first request writes the system prompt to cache. Subsequent requests with the same prefix will hit the cache, significantly reducing token costs.
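
For callers who prefer a script over curl, the same request could be built in Python along the following lines. This is a sketch using only the standard library; the helper names `build_cached_request` and `send` and the second question are hypothetical, while the URL, headers, and body fields are copied from the curl example above.

```python
import json
import urllib.request


def build_cached_request(system_text, user_message, model="hy3-preview", max_tokens=1024):
    """Build a Messages API body whose system block carries a cache_control marker."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [
            {"type": "text", "text": system_text, "cache_control": {"type": "ephemeral"}}
        ],
        "messages": [{"role": "user", "content": user_message}],
    }


def send(payload, env_id, api_key):
    """POST the payload to the endpoint from the curl example (not called here)."""
    url = f"https://{env_id}.api.tcloudbasegateway.com/v1/ai/cloudbase/v1/messages"
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
            "anthropic-version": "2023-06-01",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder for the long document; the real text is elided in the example above.
contract = "You are a legal document analysis expert. ..."
first = build_cached_request(contract, "Find the penalty clauses in this contract")
second = build_cached_request(contract, "Summarize the termination terms")  # hypothetical follow-up

# Both requests share a byte-identical system block, so the second can reuse the cached prefix.
print(first["system"] == second["system"])
```

Keeping the system text in one variable and reusing it unchanged is the simplest way to guarantee the prefix stays identical across requests.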

Checking Cache Hits

Determine cache status through the usage field in the response:

{
  "usage": {
    "input_tokens": 350,
    "output_tokens": 120,
    "cache_creation_input_tokens": 4200,
    "cache_read_input_tokens": 0
  }
}

| Field | Description | Billing |
| --- | --- | --- |
| input_tokens | Input tokens not cached | Normal price |
| cache_creation_input_tokens | Tokens newly written to cache | ~1.25x normal price |
| cache_read_input_tokens | Tokens read from cache | ~0.1x normal price |

Logic:

  • cache_creation_input_tokens > 0: First write to cache (slightly more expensive)
  • cache_read_input_tokens > 0: Cache hit (about 10% of normal price)
  • Both are 0: Cache not used
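
The rules above can be applied to a response with a small helper. This is a minimal sketch: the function names are hypothetical, the "hit+write" case (both counters non-zero in one response) is an assumption not described on this page, and the multipliers are the approximate values from the billing table.

```python
def cache_status(usage):
    """Classify a response's usage block as a cache write, hit, or neither."""
    created = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    if created > 0 and read > 0:
        return "hit+write"  # assumption: part of the prefix was read, the rest newly written
    if read > 0:
        return "hit"
    if created > 0:
        return "write"
    return "unused"


def billed_input_units(usage, write_rate=1.25, read_rate=0.1):
    """Estimate input cost in units of the normal per-token price (approximate rates)."""
    return (
        usage.get("input_tokens", 0)
        + usage.get("cache_creation_input_tokens", 0) * write_rate
        + usage.get("cache_read_input_tokens", 0) * read_rate
    )


# The usage block from the sample response above.
usage = {
    "input_tokens": 350,
    "output_tokens": 120,
    "cache_creation_input_tokens": 4200,
    "cache_read_input_tokens": 0,
}
print(cache_status(usage), billed_input_units(usage))
```

For the sample response, the status is a cache write, which is what a first request with a new prefix should report.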

Use Cases

| Scenario | Cache Benefit | Description |
| --- | --- | --- |
| Long system prompt reuse | ⭐⭐⭐ | Thousands of characters of role settings, rules |
| Long document analysis | ⭐⭐⭐ | Put document in system, ask multiple questions |
| Fixed tool definitions | ⭐⭐ | Same tools list in every request |
| Multi-turn conversation | ⭐⭐ | Previous rounds' history gets cached |
| Single-round short chat | - | No repeated prefix, limited benefit |

Cache Invalidation

The following situations cause cache misses:

| Condition | Description |
| --- | --- |
| Content change | Cached prefix must exactly match; any character difference causes a miss |
| Tool definition change | Any field modified in the tools list |
| Cache expiration | Cache has a TTL (typically 5 minutes); expired caches need rewriting |
| Breakpoint limit exceeded | Explicit caching supports up to 4 cache_control breakpoints |
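
To stay within the breakpoint limit, a client could count the cache_control markers in a request body before sending it. The sketch below is a local check only; `count_breakpoints` and `MAX_BREAKPOINTS` are hypothetical names, and it assumes every object carrying a `cache_control` key counts as one breakpoint.

```python
MAX_BREAKPOINTS = 4  # limit stated in the table above


def count_breakpoints(node):
    """Recursively count objects in a request body that carry a cache_control key."""
    if isinstance(node, dict):
        own = 1 if "cache_control" in node else 0
        nested = sum(count_breakpoints(v) for k, v in node.items() if k != "cache_control")
        return own + nested
    if isinstance(node, list):
        return sum(count_breakpoints(item) for item in node)
    return 0  # strings, numbers, None


payload = {
    "system": [
        {"type": "text", "text": "Fixed instructions", "cache_control": {"type": "ephemeral"}}
    ],
    "messages": [{"role": "user", "content": "Hello"}],
}

used = count_breakpoints(payload)
print(f"{used} of {MAX_BREAKPOINTS} breakpoints used")
```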
Tip

To maximize cache hit rate:

  1. Put fixed content (system prompt, tool definitions) at the beginning of the request, ahead of anything that changes between requests
  2. Avoid inserting dynamic information (like timestamps) in fixed content
  3. Keep tool definitions stable
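
One way to catch accidental prefix changes during development is to fingerprint the fixed part of each request and compare fingerprints across calls. This is a local debugging aid under stated assumptions, not how the service itself matches prefixes; `prefix_fingerprint` and the sample prompts are hypothetical.

```python
import hashlib
import json


def prefix_fingerprint(system_blocks, tools):
    """Hash the fixed part of a request; any character change yields a new hash."""
    # Keys are not sorted, so a change in serialization order also shows up.
    canonical = json.dumps({"system": system_blocks, "tools": tools}, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


stable = [{"type": "text", "text": "You are a support assistant."}]
with_clock_1 = [{"type": "text", "text": "You are a support assistant. Now: 10:00:01"}]
with_clock_2 = [{"type": "text", "text": "You are a support assistant. Now: 10:00:02"}]

# A stable prompt produces the same fingerprint on every request.
print(prefix_fingerprint(stable, []) == prefix_fingerprint(stable, []))
# An embedded timestamp changes the fingerprint, and so would miss the cache.
print(prefix_fingerprint(with_clock_1, []) == prefix_fingerprint(with_clock_2, []))
```

If the current time is needed, placing it in the latest user message rather than the system prompt keeps the fixed prefix unchanged.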

Cost Optimization Example

Suppose your app has a 4000-token system prompt with 10 conversation rounds:

Without caching:

Per round input cost = 4000 × normal price
10 rounds total = 4000 × 10 × normal price = 40000 × normal price

With caching:

Round 1 = 4000 × 1.25 (write to cache)
Rounds 2-10 = 4000 × 0.1 × 9 (read from cache)
Total = 5000 + 3600 = 8600 × normal price

This saves approximately 78.5% of system prompt costs (1 − 8600 / 40000).
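
The same arithmetic can be parameterized to estimate savings for other prompt sizes and round counts. This is a minimal sketch: the function name is hypothetical, the default rates are the approximate multipliers from the billing table, and it assumes the cache never expires between rounds.

```python
def system_prompt_cost(tokens, rounds, write_rate=1.25, read_rate=0.1):
    """Return (cost without cache, cost with cache, savings ratio) in normal-price units."""
    without_cache = tokens * rounds
    # One cache write on the first round, then cache reads on every later round.
    with_cache = tokens * write_rate + tokens * read_rate * (rounds - 1)
    savings = 1 - with_cache / without_cache
    return without_cache, with_cache, savings


without_cache, with_cache, savings = system_prompt_cost(4000, 10)
print(f"without cache: {without_cache}, with cache: {with_cache:.0f}, savings: {savings:.1%}")
```

With a single round the cached variant costs more than the uncached one (1.25x versus 1x), so caching only pays off once the prefix is reused at least once.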