Context Caching
Prompt caching reduces the cost of input tokens that would otherwise be billed repeatedly. It suits scenarios such as fixed system prompts in multi-turn conversations and long document analysis.
How It Works
In multi-turn conversations, each request must carry the complete message history, meaning fixed content (such as system prompts, tool definitions, long documents) is billed repeatedly.
The core idea of prompt caching: a repeated request prefix is written to the cache once, at a small premium (about 1.25x the normal price), and subsequent cache hits on that prefix are billed at a significantly reduced rate (about 0.1x the normal price).
Request 1: [system prompt + tool definitions + user message 1]
           ├── First write to cache ──┤
Request 2: [system prompt + tool definitions + user message 1 + reply 1 + user message 2]
           ├── Cache hit (low price) ──┤ ├── Normal billing ──┤
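To make the prefix idea concrete, the sketch below compares two requests block by block and counts how many leading blocks are identical. It only illustrates prefix matching and is not the server's actual algorithm; `shared_prefix_len` and the list-of-strings representation are hypothetical.

```python
def shared_prefix_len(previous, current):
    """Count the leading blocks that are identical in both requests."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # the first difference ends the reusable prefix
        count += 1
    return count


request_1 = ["system prompt", "tool definitions", "user message 1"]
request_2 = request_1 + ["reply 1", "user message 2"]

# Request 2 repeats all three blocks of request 1 before adding new ones.
print(shared_prefix_len(request_1, request_2))  # 3
```

Only the blocks after the shared prefix (here `reply 1` and `user message 2`) would be billed at the normal rate.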
Supported Protocols
CloudBase AI supports prompt caching through the Anthropic Messages API protocol:
| Protocol | Cache Support | Description |
|---|---|---|
| Anthropic Messages API | ✅ Supported | Explicit caching + automatic caching |
| Chat Completions API | — | No client-side cache control |
| CloudBase SDK | — | No client-side cache control |
When using the Chat Completions protocol or the CloudBase SDK, the server may apply internal caching optimizations, but the client cannot control them explicitly. For precise cache control, use the Anthropic Messages API.
Usage
Two modes are available: explicit caching and automatic caching.
Explicit caching: add cache_control markers to the content blocks that should be cached:
curl "https://<ENV_ID>.api.tcloudbasegateway.com/v1/ai/cloudbase/v1/messages" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "hy3-preview",
    "max_tokens": 1024,
    "system": [
      {
        "type": "text",
        "text": "You are a legal document analysis expert. Here is a complete contract specification (50,000 words)...",
        "cache_control": {"type": "ephemeral"}
      }
    ],
    "messages": [
      {"role": "user", "content": "Find the penalty clauses in this contract"}
    ]
  }'
The first request writes the system prompt to cache. Subsequent requests with the same prefix will hit the cache, significantly reducing token costs.
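The same request body can be assembled in code. The following is a minimal sketch that only builds the JSON payload and does not send it; `build_explicit_cache_request` is a hypothetical helper, and `<contract text>` is a placeholder for the long document.

```python
import json


def build_explicit_cache_request(document_text, question, model="hy3-preview"):
    """Return a Messages API body whose system block is marked for caching."""
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": document_text,
                # Same marker as in the curl example above.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": question}],
    }


contract = "You are a legal document analysis expert. <contract text>"
first = build_explicit_cache_request(contract, "Find the penalty clauses in this contract")
second = build_explicit_cache_request(contract, "Summarize the termination terms")

# Only the user message differs, so the serialized system prefix is identical.
assert json.dumps(first["system"]) == json.dumps(second["system"])
post_body = json.dumps(first)  # string to send as the POST body
```

Keeping the document in one helper-controlled block makes it harder to alter the cached prefix by accident between questions.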
Automatic caching: declare cache_control at the top level of the request body, and the system automatically identifies repeated static prefixes to cache:
curl "https://<ENV_ID>.api.tcloudbasegateway.com/v1/ai/cloudbase/v1/messages" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "hy3-preview",
    "max_tokens": 1024,
    "cache_control": {"type": "ephemeral"},
    "system": "You are a professional weather query assistant.",
    "tools": [
      {
        "name": "get_weather",
        "description": "Get real-time weather for a specified city",
        "input_schema": {
          "type": "object",
          "properties": {
            "city": {"type": "string", "description": "City name"}
          },
          "required": ["city"]
        }
      }
    ],
    "messages": [
      {"role": "user", "content": "What is the weather in Beijing today?"}
    ]
  }'
Automatic caching is suitable for multi-turn conversation scenarios where the system automatically determines which prefix content can be cached.
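For a multi-turn conversation, each request simply appends to the previous history while the top-level marker stays in place. The sketch below shows one way to do that; `build_auto_cache_request` is a hypothetical helper, `<reply 1>` stands in for the model's answer, and the tools list is omitted for brevity.

```python
def build_auto_cache_request(history, user_message, model="hy3-preview"):
    """Return a request body with a top-level cache_control marker."""
    return {
        "model": model,
        "max_tokens": 1024,
        "cache_control": {"type": "ephemeral"},
        "system": "You are a professional weather query assistant.",
        # Earlier turns are kept unchanged so they form a repeated prefix.
        "messages": history + [{"role": "user", "content": user_message}],
    }


turn_1 = build_auto_cache_request([], "What is the weather in Beijing today?")

# Append the assistant reply, then the next user message.
history = turn_1["messages"] + [{"role": "assistant", "content": "<reply 1>"}]
turn_2 = build_auto_cache_request(history, "And tomorrow?")

print(len(turn_2["messages"]))  # 3
```

Because earlier messages are never edited, the second request begins with exactly the content of the first.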
Checking Cache Hits
Determine cache status through the usage field in the response:
{
  "usage": {
    "input_tokens": 350,
    "output_tokens": 120,
    "cache_creation_input_tokens": 4200,
    "cache_read_input_tokens": 0
  }
}
| Field | Description | Billing |
|---|---|---|
| input_tokens | Input tokens not cached | Normal price |
| cache_creation_input_tokens | Tokens newly written to cache | ~1.25x normal price |
| cache_read_input_tokens | Tokens read from cache | ~0.1x normal price |
Logic:
- cache_creation_input_tokens > 0: First write to cache (slightly more expensive)
- cache_read_input_tokens > 0: Cache hit (only about 10% of normal price)
- Both are 0: Cache not used
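These rules translate directly into a small check on the parsed response. The following is a sketch rather than an official client helper: `cache_status` is a hypothetical name, and the combined `hit_and_write` case (both counters non-zero) is an assumption that the rules above do not cover.

```python
def cache_status(usage):
    """Classify a response's usage dict according to the rules above."""
    created = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    if created > 0 and read > 0:
        return "hit_and_write"  # assumption: case not listed in the rules
    if read > 0:
        return "hit"
    if created > 0:
        return "write"
    return "unused"


usage = {
    "input_tokens": 350,
    "output_tokens": 120,
    "cache_creation_input_tokens": 4200,
    "cache_read_input_tokens": 0,
}
print(cache_status(usage))  # write
```

Logging this value per request is a simple way to confirm that later rounds actually hit the cache.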
Use Cases
| Scenario | Cache Benefit | Description |
|---|---|---|
| Long system prompt reuse | ⭐⭐⭐ | Thousands of characters of role settings, rules |
| Long document analysis | ⭐⭐⭐ | Put document in system, ask multiple questions |
| Fixed tool definitions | ⭐⭐ | Same tools list in every request |
| Multi-turn conversation | ⭐⭐ | Previous rounds' history gets cached |
| Single-round short chat | ⭐ | No repeated prefix, limited benefit |
Cache Invalidation
The following situations cause cache misses:
| Condition | Description |
|---|---|
| Content change | Cached prefix must exactly match; any character difference causes a miss |
| Tool definition change | Any field modified in the tools list |
| Cache expiration | Cache has a TTL (typically 5 minutes); expired caches need rewriting |
| Breakpoint limit exceeded | Explicit caching supports up to 4 cache_control breakpoints |
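The breakpoint limit can be checked on the client before a request is sent. The sketch below counts cache_control markers on system blocks, tool definitions, and message content blocks. `count_breakpoints` is a hypothetical helper, the placement of markers on tools and message blocks is an assumption based on the Anthropic Messages format, and the tool entry is shortened for illustration.

```python
MAX_BREAKPOINTS = 4  # limit stated in the table above


def count_breakpoints(body):
    """Count blocks that carry a cache_control marker."""
    blocks = []
    system = body.get("system")
    if isinstance(system, list):  # a plain-string system prompt has no markers
        blocks.extend(system)
    blocks.extend(body.get("tools", []))
    for message in body.get("messages", []):
        content = message.get("content")
        if isinstance(content, list):
            blocks.extend(content)
    return sum(1 for block in blocks if "cache_control" in block)


body = {
    "system": [
        {"type": "text", "text": "<role settings>", "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": "<long document>", "cache_control": {"type": "ephemeral"}},
    ],
    "tools": [{"name": "get_weather", "cache_control": {"type": "ephemeral"}}],
    "messages": [{"role": "user", "content": "What is the weather in Beijing today?"}],
}
used = count_breakpoints(body)
print(used, used <= MAX_BREAKPOINTS)  # 3 True
```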
To maximize cache hit rate:
- Put fixed content (system prompt, tool definitions) at the beginning of the request, ahead of any dynamic content
- Avoid inserting dynamic information (like timestamps) in fixed content
- Keep tool definitions stable
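As an illustration of these tips, the sketch below keeps the system prompt constant and moves the timestamp into the user message, then fingerprints the fixed prefix to detect accidental changes. `build_request` and `prefix_fingerprint` are hypothetical helpers, and hashing the prefix is only a client-side sanity check, not how the server matches caches.

```python
import hashlib
import json
from datetime import datetime, timezone

SYSTEM_PROMPT = "You are a professional weather query assistant."


def build_request(question, now):
    """Keep the system block fixed; put the dynamic timestamp in the user turn."""
    return {
        "model": "hy3-preview",
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}
        ],
        "messages": [{"role": "user", "content": f"[{now.isoformat()}] {question}"}],
    }


def prefix_fingerprint(body):
    """Hash the fixed prefix with sorted keys so key order cannot change it."""
    prefix = {"system": body.get("system"), "tools": body.get("tools")}
    return hashlib.sha256(json.dumps(prefix, sort_keys=True).encode()).hexdigest()


a = build_request("Weather in Beijing?", datetime(2025, 1, 1, 8, 0, tzinfo=timezone.utc))
b = build_request("Weather in Shanghai?", datetime(2025, 1, 1, 8, 5, tzinfo=timezone.utc))

# The timestamps differ, but the cached prefix does not.
print(prefix_fingerprint(a) == prefix_fingerprint(b))  # True
```

If the fingerprint changes between deployments, the first request afterwards should be expected to rewrite the cache.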
Cost Optimization Example
Suppose your app has a 4000-token system prompt with 10 conversation rounds:
Without caching:
Per round input cost = 4000 × normal price
10 rounds total = 4000 × 10 × normal price = 40000 × normal price
With caching:
Round 1 = 4000 × 1.25 (write to cache)
Rounds 2-10 = 4000 × 0.1 × 9 (read from cache)
Total = 5000 + 3600 = 8600 × normal price
This saves approximately 78.5% of the system prompt cost (1 − 8600 / 40000).
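The same arithmetic can be reproduced for other prompt sizes and round counts. This is a minimal sketch that assumes the approximate 1.25x write and 0.1x read multipliers quoted earlier and that every round after the first hits the cache; `relative_cost` is a hypothetical helper, and costs are expressed in units of the normal per-token price.

```python
def relative_cost(prompt_tokens, rounds, write_multiplier=1.25, read_multiplier=0.1):
    """Return (cost without cache, cost with cache, fraction saved)."""
    without_cache = prompt_tokens * rounds
    with_cache = (
        prompt_tokens * write_multiplier                   # round 1: cache write
        + prompt_tokens * read_multiplier * (rounds - 1)   # later rounds: cache reads
    )
    savings = 1 - with_cache / without_cache
    return without_cache, with_cache, savings


without_cache, with_cache, savings = relative_cost(4000, 10)
print(f"{without_cache} {with_cache:.0f} {savings:.1%}")  # 40000 8600 78.5%
```

With a single round the cached variant costs more (1.25x), so caching pays off only once the prefix is reused.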