Context Caching

Prompt caching reduces the cost of input tokens that are sent repeatedly. It suits scenarios such as fixed system prompts in multi-turn conversations and long document analysis.

How It Works

In multi-turn conversations, each request must carry the complete message history, meaning fixed content (such as system prompts, tool definitions, long documents) is billed repeatedly.

The core idea of prompt caching: a repeated request prefix is billed at a slightly higher write price (about 1.25x) only the first time it is written to the cache; subsequent cache hits on that prefix are billed at a much lower rate (about 0.1x).

Request 1: [system prompt + tool definitions + user message 1]
             ├── First write to cache ──┤
Request 2: [system prompt + tool definitions + user message 1 + reply 1 + user message 2]
             ├── Cache hit (low price) ──┤  ├── Normal billing ──┤
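
The diagram can be read as a prefix comparison between consecutive requests. The following is a minimal client-side sketch of that idea, not the server's actual matching logic: `shared_prefix_length` is a hypothetical helper, and each request is modeled as a simple list of labeled blocks.

```python
from typing import List


def shared_prefix_length(previous: List[str], current: List[str]) -> int:
    """Count the leading blocks that are identical in both requests."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


# The two requests from the diagram, modeled as ordered blocks.
request_1 = ["system prompt", "tool definitions", "user message 1"]
request_2 = request_1 + ["reply 1", "user message 2"]

cached = shared_prefix_length(request_1, request_2)
print(f"{cached} blocks match the cached prefix, "
      f"{len(request_2) - cached} blocks are billed normally")
```

Only the leading run of identical blocks counts: a change early in the request makes everything after it ineligible, even if later blocks are unchanged.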

Supported Protocols

CloudBase AI supports prompt caching through the Anthropic Messages API protocol:

Protocol	Cache Support	Description
Anthropic Messages API	✅ Supported	Explicit caching + automatic caching
Chat Completions API	—	No client-side cache control
CloudBase SDK	—	No client-side cache control

note

When using the Chat Completions protocol or CloudBase SDK, the server may have internal caching optimizations, but the client cannot explicitly control it. For precise cache control, use the Anthropic Messages API.

Usage

Explicit Caching

Add cache_control markers to content blocks that should be cached:

curl "https://<ENV_ID>.api.tcloudbasegateway.com/v1/ai/cloudbase/v1/messages" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "hy3",
    "max_tokens": 1024,
    "system": [
      {
        "type": "text",
        "text": "You are a legal document analysis expert. Here is a complete contract specification (50,000 words)...",
        "cache_control": {"type": "ephemeral"}
      }
    ],
    "messages": [
      {"role": "user", "content": "Find the penalty clauses in this contract"}
    ]
  }'

The first request writes the system prompt to cache. Subsequent requests with the same prefix will hit the cache, significantly reducing token costs.
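
To make the "same prefix" requirement concrete, here is a small sketch that builds the request body from the curl example for two different questions. `build_explicit_cache_request` and the second question are hypothetical, introduced only for illustration; the sketch builds the JSON body and does not send it.

```python
import json
from typing import Any, Dict


def build_explicit_cache_request(system_text: str, question: str) -> Dict[str, Any]:
    """Build a Messages API body whose system block carries a cache_control marker."""
    return {
        "model": "hy3",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_text,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": question}],
    }


contract = "You are a legal document analysis expert. Here is a complete contract specification..."
first = build_explicit_cache_request(contract, "Find the penalty clauses in this contract")
# Hypothetical follow-up question that reuses the same system block.
second = build_explicit_cache_request(contract, "Summarize the termination terms")

# The marked system block serializes identically in both requests,
# which is the precondition for the second request to hit the cache.
assert json.dumps(first["system"]) == json.dumps(second["system"])
print(json.dumps(first, indent=2))
```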

Automatic Caching

For automatic caching, declare cache_control at the top level of the request body, and the system automatically identifies repeated static prefixes for caching:

curl "https://<ENV_ID>.api.tcloudbasegateway.com/v1/ai/cloudbase/v1/messages" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "hy3",
    "max_tokens": 1024,
    "cache_control": {"type": "ephemeral"},
    "system": "You are a professional weather query assistant.",
    "tools": [
      {
        "name": "get_weather",
        "description": "Get real-time weather for a specified city",
        "input_schema": {
          "type": "object",
          "properties": {
            "city": {"type": "string", "description": "City name"}
          },
          "required": ["city"]
        }
      }
    ],
    "messages": [
      {"role": "user", "content": "What is the weather in Beijing today?"}
    ]
  }'

Automatic caching is suitable for multi-turn conversation scenarios where the system automatically determines which prefix content can be cached.
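
As an illustration of the multi-turn case, the sketch below builds two consecutive request bodies with top-level cache_control. The helper `build_auto_cache_request`, the assistant reply, and the follow-up question are assumptions made for this example, and the tools list from the curl example is omitted for brevity.

```python
from typing import Any, Dict, List

SYSTEM_PROMPT = "You are a professional weather query assistant."


def build_auto_cache_request(history: List[Dict[str, str]], user_message: str) -> Dict[str, Any]:
    """Build a request body with top-level cache_control and the full message history."""
    return {
        "model": "hy3",
        "max_tokens": 1024,
        "cache_control": {"type": "ephemeral"},
        "system": SYSTEM_PROMPT,
        "messages": history + [{"role": "user", "content": user_message}],
    }


turn_1 = build_auto_cache_request([], "What is the weather in Beijing today?")

# Hypothetical assistant reply, appended to the history before the next turn.
history = turn_1["messages"] + [{"role": "assistant", "content": "It is sunny in Beijing."}]
turn_2 = build_auto_cache_request(history, "And tomorrow?")

# Turn 2 begins with exactly the messages sent in turn 1, so that part is a reusable prefix.
prefix_is_stable = turn_2["messages"][: len(turn_1["messages"])] == turn_1["messages"]
print(prefix_is_stable)
```

Appending to the history rather than rewriting earlier messages keeps the prefix unchanged from one turn to the next.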

Checking Cache Hits

Determine cache status through the usage field in the response:

{
  "usage": {
    "input_tokens": 350,
    "output_tokens": 120,
    "cache_creation_input_tokens": 4200,
    "cache_read_input_tokens": 0
  }
}

Field	Description	Billing
`input_tokens`	Input tokens not cached	Normal price
`cache_creation_input_tokens`	Tokens newly written to cache	~1.25x normal price
`cache_read_input_tokens`	Tokens read from cache	~0.1x normal price

Logic:

cache_creation_input_tokens > 0: First write to cache (slightly more expensive)
cache_read_input_tokens > 0: Cache hit (only 10% of normal price)
Both are 0: Cache not used
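
The logic above can be sketched as a small helper that classifies a response's usage object. `cache_status` and its return labels are hypothetical. The case where both counters are positive is not covered by the rules above; treating it as a hit on an existing prefix plus a write of new content is an assumption.

```python
from typing import Dict


def cache_status(usage: Dict[str, int]) -> str:
    """Classify cache usage from the usage field of a response."""
    created = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    if created > 0 and read > 0:
        # Assumption: part of the prefix was read from cache and new content was written.
        return "hit_and_write"
    if read > 0:
        return "hit"
    if created > 0:
        return "write"
    return "unused"


# The usage object from the sample response above.
sample_usage = {
    "input_tokens": 350,
    "output_tokens": 120,
    "cache_creation_input_tokens": 4200,
    "cache_read_input_tokens": 0,
}
print(cache_status(sample_usage))
```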

Use Cases

Scenario	Cache Benefit	Description
Long system prompt reuse	⭐⭐⭐	Thousands of characters of role settings, rules
Long document analysis	⭐⭐⭐	Put document in system, ask multiple questions
Fixed tool definitions	⭐⭐	Same tools list in every request
Multi-turn conversation	⭐⭐	Previous rounds' history gets cached
Single-round short chat	⭐	No repeated prefix, limited benefit

Cache Invalidation

The following situations cause cache misses:

Condition	Description
Content change	Cached prefix must exactly match; any character difference causes a miss
Tool definition change	Any field modified in the tools list
Cache expiration	Cache has a TTL (typically 5 minutes); expired caches need rewriting
Breakpoint limit exceeded	Explicit caching supports up to 4 `cache_control` breakpoints

tip

To maximize cache hit rate:

Put fixed content (system prompt, tool definitions) at the beginning of the request, ahead of any content that changes between requests
Avoid inserting dynamic information (like timestamps) in fixed content
Keep tool definitions stable
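
One way to catch accidental prefix changes on the client side is to fingerprint the fixed content before sending. This is only a sketch: `prefix_fingerprint` is a hypothetical helper, and the hash is not the key the server uses for cache lookup. It simply shows that any character difference, such as an embedded timestamp, produces a different prefix.

```python
import hashlib
import json
from typing import Any, Dict, List


def prefix_fingerprint(system_text: str, tools: List[Dict[str, Any]]) -> str:
    """Hash the serialized fixed content so unintended changes are easy to spot."""
    serialized = json.dumps({"system": system_text, "tools": tools}, ensure_ascii=False)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


tools = [{"name": "get_weather", "description": "Get real-time weather for a specified city"}]
stable = "You are a professional weather query assistant."
# Anti-pattern: a timestamp inside the fixed content changes the prefix on every request.
dynamic = stable + " Current time: 2024-01-01T00:00:00Z"

same_prefix = prefix_fingerprint(stable, tools) == prefix_fingerprint(stable, tools)
changed_prefix = prefix_fingerprint(stable, tools) != prefix_fingerprint(dynamic, tools)
print(same_prefix, changed_prefix)
```

Logging this fingerprint per request makes it easy to see when a deployment or template change has silently invalidated the cached prefix.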

Cost Optimization Example

Suppose your app has a 4000-token system prompt with 10 conversation rounds:

Without caching:

Per round input cost = 4000 × normal price
10 rounds total = 4000 × 10 × normal price = 40000 × normal price

With caching:

Round 1 = 4000 × 1.25 (write to cache)
Rounds 2-10 = 4000 × 0.1 × 9 (read from cache)
Total = 5000 + 3600 = 8600 × normal price

This saves about 78.5% of the system prompt input cost: (40000 − 8600) / 40000 ≈ 0.785.
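
The same arithmetic can be written as a small calculator for other prompt sizes and round counts. This sketch uses the approximate multipliers from the billing table (1.25x for writes, 0.1x for reads) and assumes every round after the first hits the cache, meaning no expiration between rounds. The function names are hypothetical, and costs are expressed in units of the normal per-token price.

```python
def cost_without_cache(prefix_tokens: int, rounds: int) -> float:
    """System prompt input cost when every round is billed at the normal price."""
    return float(prefix_tokens * rounds)


def cost_with_cache(
    prefix_tokens: int,
    rounds: int,
    write_multiplier: float = 1.25,
    read_multiplier: float = 0.1,
) -> float:
    """System prompt input cost with one cache write followed by cache reads."""
    if rounds <= 0:
        return 0.0
    write_cost = prefix_tokens * write_multiplier
    # Assumption: every later round is a cache hit (the cache never expires in between).
    read_cost = prefix_tokens * read_multiplier * (rounds - 1)
    return write_cost + read_cost


baseline = cost_without_cache(4000, 10)
cached = cost_with_cache(4000, 10)
savings = 1 - cached / baseline
print(f"without cache: {baseline:.0f}, with cache: {cached:.0f}, savings: {savings:.1%}")
```

With a single round the cached cost is higher than the baseline (1.25x versus 1x), so caching only pays off once the prefix is reused at least once.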
