Multi-turn Conversation
Multi-turn conversation is one of the most common use cases for large language models. This document explains how to build multi-turn context, how messages are formatted, and how to manage conversation history in production environments.
How It Works
LLM APIs are stateless: the server does not save any conversation history. Each request is independent, and the model does not "remember" previous requests.
To implement multi-turn conversation, you need to pass the complete message history as input with each request:
Round 1: messages = [user message 1]
Round 2: messages = [user message 1, assistant reply 1, user message 2]
Round 3: messages = [user message 1, assistant reply 1, user message 2, assistant reply 2, user message 3]
Each round's messages array contains all previous conversation records, and the model generates responses based on the complete context.
This means that as conversation rounds increase, the number of input tokens grows continuously, directly affecting call costs and response speed. The history management strategies section below explains how to solve this problem.
Message Format
The messages parameter is an array in which each message has two core fields, role and content:
[
  { "role": "system", "content": "You are a professional translation assistant" },
  { "role": "user", "content": "Translate 'hello' to Chinese" },
  { "role": "assistant", "content": "你好" },
  { "role": "user", "content": "Now translate it to Japanese" }
]
Role Description
| Role | Description | Count / Usage |
|---|---|---|
| system | System prompt, defines model behavior and background | 0 or 1, placed first |
| user | User input message | At least 1 |
| assistant | Model's response | Generated by model, passed back in multi-turn |
| tool | Tool call execution result | Only used in tool calling scenarios |
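The constraints in the table can be checked before a request is sent. The following is a minimal sketch, not part of any SDK: `validateMessages` and `KNOWN_ROLES` are hypothetical names, and the only rules enforced are the ones stated in the table above.

```javascript
// Hypothetical helper: checks a messages array against the role rules in the table.
// Returns a list of problems; an empty list means the array passed these checks.
const KNOWN_ROLES = new Set(["system", "user", "assistant", "tool"]);

function validateMessages(messages) {
  if (!Array.isArray(messages) || messages.length === 0) {
    return ["messages must be a non-empty array"];
  }
  const problems = [];
  messages.forEach((msg, i) => {
    if (!msg || !KNOWN_ROLES.has(msg.role)) {
      problems.push(`message ${i}: unknown role "${msg && msg.role}"`);
    }
  });
  // Positions of all system messages in the array.
  const systemIndexes = messages
    .map((msg, i) => (msg && msg.role === "system" ? i : -1))
    .filter(i => i !== -1);
  if (systemIndexes.length > 1) problems.push("more than one system message");
  if (systemIndexes.length === 1 && systemIndexes[0] !== 0) {
    problems.push("system message must be placed first");
  }
  if (!messages.some(msg => msg && msg.role === "user")) {
    problems.push("at least one user message is required");
  }
  return problems;
}
```

A caller might log or throw when the returned list is non-empty, which surfaces malformed history before it costs an API call.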
Content Format
The content field supports two formats:
Plain text (most common):
{ "role": "user", "content": "Tell me about Li Bai" }
Multimodal content (for image understanding etc.):
{
  "role": "user",
  "content": [
    { "type": "text", "text": "What's in this image?" },
    { "type": "image_url", "image_url": { "url": "https://example.com/photo.png" } }
  ]
}
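Code that manages history often needs the text of a message regardless of which of the two formats it uses. The sketch below shows one way to normalize them. `contentToText` is a hypothetical helper, and it assumes only the `text` and `image_url` part types shown in the example above.

```javascript
// Hypothetical helper: returns the text of a message's content field,
// whether it is a plain string or an array of typed parts.
// Non-text parts (such as image_url) are skipped.
function contentToText(content) {
  if (typeof content === "string") return content;
  if (!Array.isArray(content)) return "";
  return content
    .filter(part => part && part.type === "text" && typeof part.text === "string")
    .map(part => part.text)
    .join("\n");
}

console.log(contentToText("Tell me about Li Bai"));
// Tell me about Li Bai
console.log(contentToText([
  { type: "text", text: "What's in this image?" },
  { type: "image_url", image_url: { url: "https://example.com/photo.png" } }
]));
// What's in this image?
```

A helper like this is useful wherever message length is measured, since calling `.length` on an array-valued content field would count parts rather than characters.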
Quick Start
The examples below use the Chat Completions protocol. The core mechanism of multi-turn conversation (maintaining a complete messages array) applies to all protocols, including the CloudBase SDK, OpenAI SDK, and Anthropic SDK compatible protocols. For details on each protocol, see the Access Methods documentation.
Implement multi-turn conversation using CloudBase SDK (Web / Node.js):
// `ai` is assumed to be the AI instance of an already initialized CloudBase app
// (for example `const ai = app.ai();`); initialization is not shown here.
const model = ai.createModel("cloudbase");

// Maintain message history
const messages = [
  { role: "system", content: "You are a poetry expert" }
];

async function chat(userInput) {
  // 1. Append user message
  messages.push({ role: "user", content: userInput });

  // 2. Call the model with the full history
  const result = await model.generateText({
    model: "deepseek-v4-flash",
    messages
  });

  // 3. Append assistant reply to history
  messages.push({ role: "assistant", content: result.text });
  return result.text;
}

// Multi-turn conversation
await chat("What is Li Bai's most famous poem?");
// → "One of Li Bai's most famous poems is 'Quiet Night Thought'..."
await chat("What is the background of this poem?");
// → "Quiet Night Thought was written in 726 AD..."
// The model understands "this poem" refers to the one mentioned in the previous round
Implement multi-turn conversation using OpenAI SDK:
const OpenAI = require("openai");

const client = new OpenAI({
  apiKey: "<YOUR_API_KEY>",
  baseURL: "https://<ENV_ID>.api.tcloudbasegateway.com/v1/ai/cloudbase"
});

const messages = [
  { role: "system", content: "You are a poetry expert" }
];

async function chat(userMessage) {
  messages.push({ role: "user", content: userMessage });

  const completion = await client.chat.completions.create({
    model: "deepseek-v4-flash",
    messages
  });

  // The returned message object already has role "assistant"
  const assistantMessage = completion.choices[0].message;
  messages.push(assistantMessage);
  return assistantMessage.content;
}

await chat("What is Li Bai's most famous poem?");
await chat("What is the background of this poem?");
When calling the HTTP API directly, each request must carry the complete history:
curl -X POST 'https://<ENV_ID>.api.tcloudbasegateway.com/v1/ai/cloudbase/chat/completions' \
  -H 'Authorization: Bearer <YOUR_API_KEY>' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [
      {"role": "system", "content": "You are a poetry expert"},
      {"role": "user", "content": "What is the most famous poem by Li Bai?"},
      {"role": "assistant", "content": "One of the most famous poems by Li Bai is Quiet Night Thought..."},
      {"role": "user", "content": "What is the background of this poem?"}
    ]
  }'
Implement multi-turn conversation in WeChat Mini Program:
Page({
  data: {
    chatHistory: [],
    inputValue: "",
    currentReply: ""
  },

  async sendMessage() {
    const { inputValue, chatHistory } = this.data;
    if (!inputValue.trim()) return;

    // Rebuild the full message list for this request
    const messages = [
      { role: "system", content: "You are a poetry expert" },
      ...chatHistory,
      { role: "user", content: inputValue }
    ];

    const model = wx.cloud.extend.AI.createModel("cloudbase");
    let assistantContent = "";
    const res = await model.streamText({
      data: { model: "deepseek-v4-flash", messages }
    });

    // Show the partial reply while the stream is in progress
    for await (const text of res.textStream) {
      assistantContent += text;
      this.setData({ currentReply: assistantContent });
    }

    // After the stream ends, commit both messages to history
    // and clear the in-progress reply so it is not displayed twice
    this.setData({
      chatHistory: [
        ...chatHistory,
        { role: "user", content: inputValue },
        { role: "assistant", content: assistantContent }
      ],
      inputValue: "",
      currentReply: ""
    });
  }
});
Streaming Multi-turn
In streaming scenarios, you need to wait for the stream to complete before appending the full assistant reply to history:
const model = ai.createModel("cloudbase");

const messages = [
  { role: "system", content: "You are a helpful assistant" }
];

async function chatStream(userInput) {
  messages.push({ role: "user", content: userInput });

  const res = await model.streamText({
    model: "deepseek-v4-flash",
    messages
  });

  let fullText = "";
  for await (const text of res.textStream) {
    fullText += text;
    process.stdout.write(text);
  }

  // After stream ends, append complete reply to history
  messages.push({ role: "assistant", content: fullText });
  return fullText;
}
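If a stream fails midway, the history is left with a user message that has no reply. One possible way to handle this, sketched below under stated assumptions, is to roll the user message back so that a retry starts from a consistent state. `chatStreamSafe`, `streamFn`, and `onChunk` are hypothetical names; `streamFn` stands in for a call such as `model.streamText` and is assumed to resolve to an object with an async-iterable `textStream`, as in the example above.

```javascript
// Sketch: keep history consistent when a stream fails midway.
// `streamFn(messages)` is a stand-in for the real streaming call and is assumed
// to resolve to an object exposing an async-iterable `textStream`.
async function chatStreamSafe(messages, userInput, streamFn, onChunk = () => {}) {
  messages.push({ role: "user", content: userInput });
  let fullText = "";
  try {
    const res = await streamFn(messages);
    for await (const text of res.textStream) {
      fullText += text;
      onChunk(text);
    }
  } catch (err) {
    // Roll back the user message so a retry does not duplicate it
    // and no partial assistant reply enters the history.
    messages.pop();
    throw err;
  }
  // Only a fully received reply is appended.
  messages.push({ role: "assistant", content: fullText });
  return fullText;
}
```

With the CloudBase SDK example above, the call might look like `chatStreamSafe(messages, input, msgs => model.streamText({ model: "deepseek-v4-flash", messages: msgs }), t => process.stdout.write(t))`. Whether to discard or keep a partial reply is a product decision; this sketch discards it.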
History Management Strategies
As conversation rounds increase, the messages array grows longer, causing two problems:
- Token consumption growth: Input tokens per round = all history tokens + current new message tokens
- Exceeding context window: When total history tokens exceed the model's maximum context length, the request will fail
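The growth described above can be made concrete with some rough arithmetic. The sketch below assumes every user message and every assistant reply has a fixed size; the function names and the token counts are made-up illustrations, not measurements.

```javascript
// Illustrative arithmetic only: the token counts are made-up assumptions.
// Round n sends the system prompt, n - 1 earlier rounds, and the new user message.
function inputTokensForRound(n, { system, user, assistant }) {
  return system + (n - 1) * (user + assistant) + user;
}

// Total input tokens billed over the first `rounds` rounds.
function cumulativeInputTokens(rounds, sizes) {
  let total = 0;
  for (let n = 1; n <= rounds; n++) total += inputTokensForRound(n, sizes);
  return total;
}

const sizes = { system: 50, user: 30, assistant: 200 };
console.log(inputTokensForRound(1, sizes));    // 80
console.log(inputTokensForRound(10, sizes));   // 2150
console.log(cumulativeInputTokens(10, sizes)); // 11150
```

Per-round input grows linearly with the number of rounds, so the total input over a whole conversation grows roughly quadratically. This is why unbounded history becomes expensive well before it hits the context limit.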
Strategy 1: Context Truncation
Keep only the most recent N rounds and discard older history. This is simple to implement and suitable for most scenarios.
const MAX_ROUNDS = 10; // Keep most recent 10 rounds

function trimMessages(messages) {
  // Always keep the system message
  const systemMsg = messages.find(m => m.role === "system");
  const history = messages.filter(m => m.role !== "system");

  // Each round = 1 user + 1 assistant = 2 messages
  const trimmed = history.slice(-MAX_ROUNDS * 2);
  return systemMsg ? [systemMsg, ...trimmed] : trimmed;
}

// Trim before each call
const trimmedMessages = trimMessages(messages);
const result = await model.generateText({
  model: "deepseek-v4-flash",
  messages: trimmedMessages
});
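The `slice(-MAX_ROUNDS * 2)` approach relies on strict user/assistant alternation. If the history also contains tool messages, the window can start in the middle of a round. The following sketch is one possible variant that counts rounds by user messages instead; `trimByRounds` is a hypothetical name, not an SDK function.

```javascript
// Hypothetical variant: a round starts at each user message, so the kept window
// always begins with a user message even when tool messages are present.
function trimByRounds(messages, maxRounds) {
  const systemMsg = messages.find(m => m.role === "system");
  const history = messages.filter(m => m.role !== "system");

  // Walk backwards until maxRounds user messages have been seen.
  // If no user message is found, `start` stays at the end and nothing is kept.
  let start = history.length;
  let rounds = 0;
  for (let i = history.length - 1; i >= 0 && rounds < maxRounds; i--) {
    if (history[i].role === "user") {
      rounds++;
      start = i;
    }
  }

  const trimmed = history.slice(start);
  return systemMsg ? [systemMsg, ...trimmed] : trimmed;
}
```

The trade-off is that the number of kept messages is no longer fixed: a round with several tool calls keeps all of its messages, so a token-based limit may still be needed on top.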
Strategy 2: Token Budget Control
Limit history by token count rather than by number of rounds. This tracks cost and the context window more precisely than counting rounds.
const MAX_INPUT_TOKENS = 8000; // Token budget for input messages

function trimByTokens(messages, maxTokens) {
  const systemMsg = messages.find(m => m.role === "system");
  const history = messages.filter(m => m.role !== "system");

  let totalTokens = estimateTokens(systemMsg?.content || "");
  const result = systemMsg ? [systemMsg] : [];

  // Start from the newest message, accumulate backwards
  for (let i = history.length - 1; i >= 0; i--) {
    const msgTokens = estimateTokens(history[i].content);
    if (totalTokens + msgTokens > maxTokens) break;
    totalTokens += msgTokens;
    // Insert right after the system message to preserve chronological order
    result.splice(systemMsg ? 1 : 0, 0, history[i]);
  }
  return result;
}

// Rough, conservative token estimate: 1.5 tokens per character.
// This roughly fits Chinese text. English averages closer to 4 characters per token,
// so English input is overestimated and trimmed earlier than strictly necessary.
function estimateTokens(text) {
  if (!text) return 0;
  return Math.ceil(text.length * 1.5);
}
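Because `estimateTokens` applies one per-character rate to all text, it overestimates English input considerably. The sketch below shows one possible refinement that counts CJK characters separately. `estimateTokensMixed` is a hypothetical name, and both ratios are rough assumptions (about 1.5 tokens per CJK character, as in the comment above, and about 4 other characters per token, a commonly quoted rule of thumb). Real counts depend on the model's tokenizer.

```javascript
// Rough heuristic only; real token counts depend on the model's tokenizer.
// Assumed ratios: ~1.5 tokens per CJK character, ~4 other characters per token.
function estimateTokensMixed(text) {
  if (!text) return 0;
  // Han ideographs, Hiragana, Katakana, and Hangul syllables.
  const cjk = (text.match(/[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]/g) || []).length;
  const other = text.length - cjk;
  return Math.ceil(cjk * 1.5 + other / 4);
}

console.log(estimateTokensMixed("你好"));          // 3
console.log(estimateTokensMixed("hello world")); // 3
```

When an exact figure matters, the token usage reported in the API response, if the service returns one, is a more reliable source than any local estimate.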
Strategy 3: Rolling Summary
When history is too long, use the model to generate a summary of earlier conversations, compressing history while preserving key information.
async function summarizeHistory(messages) {
  const model = ai.createModel("cloudbase");
  const result = await model.generateText({
    model: "deepseek-v4-flash",
    messages: [
      {
        role: "user",
        content: `Please briefly summarize the key information from this conversation:\n\n${
          messages.map(m => `${m.role}: ${m.content}`).join("\n")
        }`
      }
    ]
  });
  return result.text;
}

// When history exceeds threshold, compress early conversations into a summary
async function manageHistory(messages, maxRounds = 10) {
  const systemMsg = messages.find(m => m.role === "system");
  const history = messages.filter(m => m.role !== "system");

  if (history.length <= maxRounds * 2) return messages;

  // Split history into the part to summarize and the part to keep verbatim
  const earlyHistory = history.slice(0, -maxRounds * 2);
  const recentHistory = history.slice(-maxRounds * 2);

  const summary = await summarizeHistory(earlyHistory);

  return [
    ...(systemMsg ? [systemMsg] : []),
    { role: "system", content: `[Conversation history summary] ${summary}` },
    ...recentHistory
  ];
}
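Note that `manageHistory` filters out all system messages before splitting, so a summary produced by an earlier compression is not carried into the next one. The sketch below is one possible way to make the summary truly rolling. `rollHistory`, `SUMMARY_PREFIX`, and the injected `summarize` parameter are hypothetical names introduced for illustration; `summarize` can be `summarizeHistory` from above or a stub in tests.

```javascript
const SUMMARY_PREFIX = "[Conversation history summary] ";

// Sketch: like manageHistory, but an existing summary message is fed into the
// next summary instead of being dropped. `summarize` is an injected async
// function (for example summarizeHistory), which keeps this testable offline.
async function rollHistory(messages, summarize, maxRounds = 10) {
  const isSummary = m =>
    m.role === "system" &&
    typeof m.content === "string" &&
    m.content.startsWith(SUMMARY_PREFIX);

  const systemMsg = messages.find(m => m.role === "system" && !isSummary(m));
  const previous = messages.find(isSummary);
  const history = messages.filter(m => m.role !== "system");

  if (history.length <= maxRounds * 2) return messages;

  const early = history.slice(0, -maxRounds * 2);
  const recent = history.slice(-maxRounds * 2);

  // Present the old summary to the summarizer as the first "message".
  const toSummarize = previous
    ? [{ role: "system", content: previous.content.slice(SUMMARY_PREFIX.length) }, ...early]
    : early;
  const summary = await summarize(toSummarize);

  return [
    ...(systemMsg ? [systemMsg] : []),
    { role: "system", content: SUMMARY_PREFIX + summary },
    ...recent
  ];
}
```

Like `manageHistory`, this returns two system messages (the original prompt and the summary). If the target model accepts only one, the summary text could be appended to the system prompt instead.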
Strategy Comparison
| Strategy | Pros | Cons | Use Cases |
|---|---|---|---|
| Context Truncation | Simple, zero overhead | Loses early information | Simple Q&A, customer service |
| Token Budget Control | Precise cost control | Slightly complex | Cost-sensitive scenarios |
| Rolling Summary | Preserves core info | Extra API call needed | Long conversations, complex tasks |
Thinking Models in Multi-turn
When using deep thinking models (e.g., deepseek-r1), the model returns both reasoning_content (thinking process) and content (final answer).
Key rule: when updating messages, keep only content and leave out reasoning_content.
const OpenAI = require("openai");

const client = new OpenAI({
  apiKey: "<YOUR_API_KEY>",
  baseURL: "https://<ENV_ID>.api.tcloudbasegateway.com/v1/ai/cloudbase"
});

const messages = [];

async function chatWithThinking(userMessage) {
  messages.push({ role: "user", content: userMessage });

  const completion = await client.chat.completions.create({
    model: "deepseek-r1",
    messages
  });

  const choice = completion.choices[0];

  // ⚠️ Do NOT append reasoning_content to messages
  console.log("Thinking:", choice.message.reasoning_content);
  console.log("Answer:", choice.message.content);

  // ✅ Only append content to history
  messages.push({
    role: "assistant",
    content: choice.message.content
  });
  return choice.message.content;
}
Appending reasoning_content to messages will cause format errors or degraded response quality in subsequent requests.
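This matters in particular when the returned message object is pushed into history as-is, because the object then carries reasoning_content with it. A minimal sketch of one way to guard against that is shown below. `toHistoryMessage` is a hypothetical helper, and the field name `reasoning_content` is taken from the description above.

```javascript
// Hypothetical helper: copy an assistant message without its reasoning_content,
// leaving every other field (role, content, and so on) untouched.
function toHistoryMessage(message) {
  const { reasoning_content, ...rest } = message;
  return rest;
}

const reply = {
  role: "assistant",
  content: "The answer is 4.",
  reasoning_content: "2 + 2 equals 4."
};
console.log(toHistoryMessage(reply));
// { role: 'assistant', content: 'The answer is 4.' }
```

In the OpenAI SDK pattern this could be used as `messages.push(toHistoryMessage(completion.choices[0].message))`. The original object is not modified, so the reasoning text remains available for logging or display.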
Cost Optimization Tips
| Direction | Approach |
|---|---|
| Simplify system prompt | Use concise instructions instead of verbose descriptions |
| Control history length | Use truncation or summary strategies above |
| Choose appropriate model | Use deepseek-v4-flash for simple tasks, hy3-preview for complex reasoning |
| Use caching | For fixed long system prompts, use prompt caching to reduce costs |