Multi-turn Conversation

Multi-turn conversation is one of the most common use cases for large language models. This document explains how to build multi-turn context, specifies the message format, and describes history management strategies for production environments.

How It Works

LLM APIs are stateless: the server does not save any conversation history. Each request is independent, and the model does not "remember" earlier turns.

To implement multi-turn conversation, you need to pass the complete message history as input with each request:

Round 1: messages = [user message 1]
Round 2: messages = [user message 1, assistant reply 1, user message 2]
Round 3: messages = [user message 1, assistant reply 1, user message 2, assistant reply 2, user message 3]

Each round's messages array contains all previous conversation records, and the model generates responses based on the complete context.
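
The round-by-round growth above can be sketched with a stand-in for the model. `fakeModel` and `runRounds` are hypothetical names used only for illustration; no real API is called.

```javascript
// Hypothetical stand-in for a real model call: it only reports how many
// messages it was given, which makes the growing context visible.
function fakeModel(messages) {
  return `reply based on ${messages.length} message(s)`;
}

function runRounds(userInputs) {
  const messages = [];
  const sizes = []; // number of messages sent in each round
  for (const input of userInputs) {
    messages.push({ role: "user", content: input });
    sizes.push(messages.length);
    const reply = fakeModel(messages);
    messages.push({ role: "assistant", content: reply });
  }
  return { messages, sizes };
}

const { sizes } = runRounds(["question 1", "question 2", "question 3"]);
console.log(sizes); // [ 1, 3, 5 ]
```

The sizes 1, 3, and 5 match the three rounds listed above: each request carries every earlier user message and assistant reply plus the new user message.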

Note: As conversation rounds increase, the number of input tokens sent with each request keeps growing, which directly affects cost and response latency. The History Management Strategies section below describes ways to keep this under control.

Message Format

messages is an array in which each message has two core fields, role and content:

[
  { "role": "system", "content": "You are a professional translation assistant" },
  { "role": "user", "content": "Translate 'hello' to Chinese" },
  { "role": "assistant", "content": "你好" },
  { "role": "user", "content": "Now translate it to Japanese" }
]

Role Description

| Role | Description | Count / Usage |
| --- | --- | --- |
| system | System prompt; defines the model's behavior and background | 0 or 1, placed first |
| user | User input message | At least 1 |
| assistant | Model's response | Generated by the model; passed back in multi-turn requests |
| tool | Tool call execution result | Only used in tool-calling scenarios |
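
The role constraints in the table can be checked before a request is sent. The following is a minimal sketch: `validateMessages` is a hypothetical helper, and it encodes only the rules stated in the table, which a real endpoint may enforce more or less strictly.

```javascript
const KNOWN_ROLES = new Set(["system", "user", "assistant", "tool"]);

// Returns a list of problems; an empty list means the array passed these checks.
function validateMessages(messages) {
  if (!Array.isArray(messages) || messages.length === 0) {
    return ["messages must be a non-empty array"];
  }
  const problems = [];
  messages.forEach((m, i) => {
    if (!m || !KNOWN_ROLES.has(m.role)) {
      problems.push(`message ${i}: unknown role`);
    } else if (m.role === "system" && i !== 0) {
      problems.push(`message ${i}: system message must come first`);
    }
  });
  if (!messages.some(m => m && m.role === "user")) {
    problems.push("at least one user message is required");
  }
  return problems;
}
```

Note that the Rolling Summary strategy later in this document places a second system message after the first, so the system-message rule would need to be relaxed if that approach is used as written.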

Content Format

The content field supports two formats:

Plain text (most common):

{ "role": "user", "content": "Tell me about Li Bai" }

Multimodal content (for image understanding etc.):

{
  "role": "user",
  "content": [
    { "type": "text", "text": "What's in this image?" },
    { "type": "image_url", "image_url": { "url": "https://example.com/photo.png" } }
  ]
}
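
Code that inspects history, for logging or token estimation for example, has to cope with both shapes. The following is a minimal sketch that assumes only the two formats shown above; `contentToText` is a hypothetical helper name, not an SDK function.

```javascript
// Extracts the text portion of a message's content, whether it is a plain
// string or an array of typed parts. Non-text parts (e.g. image_url) are skipped.
function contentToText(content) {
  if (typeof content === "string") return content;
  if (!Array.isArray(content)) return "";
  return content
    .filter(part => part && part.type === "text" && typeof part.text === "string")
    .map(part => part.text)
    .join("\n");
}

console.log(contentToText("Tell me about Li Bai")); // Tell me about Li Bai
console.log(contentToText([
  { type: "text", text: "What's in this image?" },
  { type: "image_url", image_url: { url: "https://example.com/photo.png" } }
])); // What's in this image?
```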

Quick Start

Protocol Note

The examples below use the Chat Completions protocol. The core mechanism of multi-turn conversation (maintaining a complete messages array) applies to all supported protocols, including the CloudBase SDK and the OpenAI SDK and Anthropic SDK compatible protocols. For details on each protocol, see the Access Methods documentation.

Implement multi-turn conversation using CloudBase SDK (Web / Node.js):

// Assumes `ai` is an already-initialized CloudBase AI instance
// (see the Access Methods documentation for setup).
const model = ai.createModel("cloudbase");

// Maintain message history
const messages = [
  { role: "system", content: "You are a poetry expert" }
];

async function chat(userInput) {
  // 1. Append user message
  messages.push({ role: "user", content: userInput });

  // 2. Call the model with the full history
  const result = await model.generateText({
    model: "deepseek-v4-flash",
    messages
  });

  // 3. Append assistant reply to history
  messages.push({ role: "assistant", content: result.text });

  return result.text;
}

// Multi-turn conversation
// (top-level await needs an ES module or an enclosing async function)
await chat("What is Li Bai's most famous poem?");
// → "One of Li Bai's most famous poems is 'Quiet Night Thought'..."

await chat("What is the background of this poem?");
// → "Quiet Night Thought was written in 726 AD..."
// The model understands "this poem" refers to the one mentioned in the previous round
Streaming Multi-turn

In streaming scenarios, you need to wait for the stream to complete before appending the full assistant reply to history:

const model = ai.createModel("cloudbase");

const messages = [
  { role: "system", content: "You are a helpful assistant" }
];

async function chatStream(userInput) {
  messages.push({ role: "user", content: userInput });

  const res = await model.streamText({
    model: "deepseek-v4-flash",
    messages
  });

  let fullText = "";
  for await (const text of res.textStream) {
    fullText += text;
    process.stdout.write(text); // Node.js only; update the UI instead in a browser
  }

  // After stream ends, append complete reply to history
  messages.push({ role: "assistant", content: fullText });

  return fullText;
}
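
If the stream fails midway, the history is left with a user message that has no reply. One possible way to handle this is sketched below. `chatStreamSafe` and its `streamFn` parameter are hypothetical: `streamFn(messages)` stands in for a streaming model call and is assumed to return an async iterable of text chunks.

```javascript
// `streamFn(messages)` is a hypothetical stand-in for a streaming model call;
// it must return an async iterable of text chunks.
async function chatStreamSafe(messages, userInput, streamFn, onChunk = () => {}) {
  messages.push({ role: "user", content: userInput });
  let fullText = "";
  try {
    for await (const chunk of streamFn(messages)) {
      fullText += chunk;
      onChunk(chunk);
    }
  } catch (err) {
    // Roll back the user message so the history never holds an unanswered turn.
    messages.pop();
    throw err;
  }
  // Only a fully received reply is appended to history.
  messages.push({ role: "assistant", content: fullText });
  return fullText;
}
```

Discarding the partial reply is a design choice made here for simplicity; an application could instead keep the partial text and mark it as incomplete.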

History Management Strategies

As conversation rounds increase, the messages array grows longer, causing two problems:

  1. Token consumption growth: Input tokens per round = all history tokens + current new message tokens
  2. Exceeding context window: When total history tokens exceed the model's maximum context length, the request will fail
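
The first problem compounds over a conversation. As a rough illustration, assume every message costs the same number of tokens (an assumption made only to keep the arithmetic simple). Round n then sends 2n - 1 messages, so the input tokens summed over n rounds grow with the square of n. `cumulativeInputTokens` is a hypothetical helper.

```javascript
// Assumption for illustration only: every message (user or assistant)
// costs the same number of tokens.
function cumulativeInputTokens(rounds, tokensPerMessage) {
  let total = 0;
  for (let round = 1; round <= rounds; round++) {
    const messagesSent = 2 * round - 1; // all earlier turns + the new user message
    total += messagesSent * tokensPerMessage;
  }
  return total; // equals tokensPerMessage * rounds * rounds
}

console.log(cumulativeInputTokens(10, 100)); // 10000
console.log(cumulativeInputTokens(20, 100)); // 40000
```

Under this assumption, doubling the number of rounds quadruples the total input tokens billed over the conversation.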

Strategy 1: Context Truncation

Keep only the most recent N rounds and discard older history. This is simple to implement and suits most scenarios.

const MAX_ROUNDS = 10; // Keep most recent 10 rounds

function trimMessages(messages) {
  // Always keep the system message
  const systemMsg = messages.find(m => m.role === "system");
  const history = messages.filter(m => m.role !== "system");

  // Each round = 1 user + 1 assistant = 2 messages
  const trimmed = history.slice(-MAX_ROUNDS * 2);

  // When the newest user message has no reply yet, the slice can begin with
  // an orphaned assistant reply; drop it so the history starts with a user message
  while (trimmed.length > 0 && trimmed[0].role !== "user") trimmed.shift();

  return systemMsg ? [systemMsg, ...trimmed] : trimmed;
}

// Trim before each call
const trimmedMessages = trimMessages(messages);
const result = await model.generateText({
  model: "deepseek-v4-flash",
  messages: trimmedMessages
});
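
The snippet above assumes that each round is exactly two messages. When tool messages are present, a round can contain more. The following sketch groups history by user message instead; `trimByRounds` is a hypothetical helper, and treating a round as "a user message plus everything up to the next user message" is an assumption.

```javascript
// Keeps the first system message (if any) plus the last `maxRounds` rounds, where
// a round starts at a user message and runs until the next user message.
function trimByRounds(messages, maxRounds) {
  const system = messages.filter(m => m.role === "system").slice(0, 1);
  const history = messages.filter(m => m.role !== "system");

  // Indices in `history` where each round begins.
  const roundStarts = [];
  history.forEach((m, i) => {
    if (m.role === "user") roundStarts.push(i);
  });

  if (maxRounds <= 0 || roundStarts.length === 0) return system;

  const first = roundStarts[Math.max(0, roundStarts.length - maxRounds)];
  return [...system, ...history.slice(first)];
}
```

Because the cut always lands on a user message, a tool result is never separated from the assistant message that requested it.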

Strategy 2: Token Budget Control

Limit history by token count rather than by number of rounds. This tracks cost and context-window limits more precisely.

const MAX_INPUT_TOKENS = 8000; // Token budget for input messages

function trimByTokens(messages, maxTokens) {
  const systemMsg = messages.find(m => m.role === "system");
  const history = messages.filter(m => m.role !== "system");

  let totalTokens = estimateTokens(systemMsg?.content || "");
  const kept = [];

  // Start from the newest message, accumulate backwards
  for (let i = history.length - 1; i >= 0; i--) {
    const msgTokens = estimateTokens(history[i].content);
    if (totalTokens + msgTokens > maxTokens) break;
    totalTokens += msgTokens;
    kept.unshift(history[i]); // prepend to preserve chronological order
  }

  return systemMsg ? [systemMsg, ...kept] : kept;
}

// Rough token estimation. The 1.5 tokens/char factor is a conservative figure for
// Chinese text; English averages roughly 0.75 words per token, so this overestimates
// English input. Actual ratios depend on the model's tokenizer.
// Assumes plain-text content; multimodal (array) content needs separate handling.
function estimateTokens(text) {
  if (!text) return 0;
  return Math.ceil(text.length * 1.5);
}

// Trim before each call
const trimmedMessages = trimByTokens(messages, MAX_INPUT_TOKENS);
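
The estimator above applies one ratio to every character. A slightly finer sketch counts CJK characters and other characters separately. Both ratios below are assumptions: 1.5 tokens per CJK character is taken from the comment above, and 0.25 tokens per other character reflects the common rule of thumb of about four characters per English token. They should be calibrated against the token usage the API actually reports. `estimateTokensMixed` is a hypothetical helper.

```javascript
// Assumed ratios (tokenizer-dependent; calibrate against real usage data).
const TOKENS_PER_CJK_CHAR = 1.5;    // from the rule of thumb above
const TOKENS_PER_OTHER_CHAR = 0.25; // roughly 4 characters per token

function estimateTokensMixed(text) {
  if (!text) return 0;
  // Hiragana, Katakana, Han ideographs and Hangul syllables
  const cjkMatches = text.match(/[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/g) || [];
  const cjk = cjkMatches.length;
  const other = text.length - cjk;
  return Math.ceil(cjk * TOKENS_PER_CJK_CHAR + other * TOKENS_PER_OTHER_CHAR);
}

console.log(estimateTokensMixed("你好"));        // 3
console.log(estimateTokensMixed("hello world")); // 3
```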

Strategy 3: Rolling Summary

When history is too long, use the model to generate a summary of earlier conversations, compressing history while preserving key information.

async function summarizeHistory(messages) {
  const model = ai.createModel("cloudbase");

  // Flatten the messages into a "role: content" transcript
  const transcript = messages.map(m => `${m.role}: ${m.content}`).join("\n");

  const result = await model.generateText({
    model: "deepseek-v4-flash",
    messages: [
      {
        role: "user",
        content: `Please briefly summarize the key information from this conversation:\n\n${transcript}`
      }
    ]
  });

  return result.text;
}

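const SUMMARY_PREFIX = "[Conversation history summary]";
const isSummary = m =>
  m.role === "system" && typeof m.content === "string" && m.content.startsWith(SUMMARY_PREFIX);

// When history exceeds threshold, compress early conversations into a summary
async function manageHistory(messages, maxRounds = 10) {
  const systemMsg = messages.find(m => m.role === "system" && !isSummary(m));
  const prevSummary = messages.find(isSummary);
  const history = messages.filter(m => m.role !== "system");

  if (history.length <= maxRounds * 2) return messages;

  // Compress early conversations into a summary, folding in any earlier
  // summary so that it is not lost on later passes
  const earlyHistory = history.slice(0, -maxRounds * 2);
  const recentHistory = history.slice(-maxRounds * 2);
  const summary = await summarizeHistory(
    prevSummary ? [prevSummary, ...earlyHistory] : earlyHistory
  );

  return [
    ...(systemMsg ? [systemMsg] : []),
    // If the endpoint accepts only one system message, append this text to systemMsg instead
    { role: "system", content: `${SUMMARY_PREFIX} ${summary}` },
    ...recentHistory
  ];
}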
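
manageHistory returns a new array rather than modifying its argument, so the caller has to replace its stored history with the result. The following is a sketch of one way to wire this into a chat loop. `createChat` is a hypothetical helper, and its `callModel` and `compress` parameters are stand-ins for the model call and for a function such as manageHistory.

```javascript
// callModel(messages) -> Promise<string>   (stand-in for the model call)
// compress(messages)  -> Promise<message[]> (stand-in for manageHistory)
function createChat(systemPrompt, callModel, compress) {
  let messages = [{ role: "system", content: systemPrompt }];
  return {
    async send(userInput) {
      messages.push({ role: "user", content: userInput });
      const reply = await callModel(messages);
      messages.push({ role: "assistant", content: reply });
      // Compress after the reply so the next request starts from a bounded history.
      messages = await compress(messages);
      return reply;
    },
    // Returns a copy so callers cannot mutate the stored history.
    history: () => messages.slice()
  };
}
```

Compressing after the reply, rather than before the request, keeps the summarization call off the latency path of the current answer; whether that trade-off is appropriate depends on the application.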

Strategy Comparison

| Strategy | Pros | Cons | Use Cases |
| --- | --- | --- | --- |
| Context Truncation | Simple, zero overhead | Loses early information | Simple Q&A, customer service |
| Token Budget Control | Precise cost control | Slightly more complex | Cost-sensitive scenarios |
| Rolling Summary | Preserves core information | Requires an extra API call | Long conversations, complex tasks |

Thinking Models in Multi-turn

When using deep thinking models (e.g., deepseek-r1), the model returns both reasoning_content (thinking process) and content (final answer).

Key rule: when updating messages, keep only content and discard reasoning_content.

const OpenAI = require("openai");

const client = new OpenAI({
  apiKey: "<YOUR_API_KEY>",
  baseURL: "https://<ENV_ID>.api.tcloudbasegateway.com/v1/ai/cloudbase"
});

const messages = [];

async function chatWithThinking(userMessage) {
  messages.push({ role: "user", content: userMessage });

  const completion = await client.chat.completions.create({
    model: "deepseek-r1",
    messages
  });

  const choice = completion.choices[0];

  // ⚠️ Do NOT append reasoning_content to messages
  console.log("Thinking:", choice.message.reasoning_content);
  console.log("Answer:", choice.message.content);

  // ✅ Only append content to history
  messages.push({
    role: "assistant",
    content: choice.message.content
  });

  return choice.message.content;
}

Warning: Appending reasoning_content to messages will cause format errors or degraded response quality in subsequent requests.
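
When assistant messages are stored exactly as the API returned them, for logging for example, the rule can be applied at the point where a message enters the history. The following is a minimal sketch; `toHistoryMessage` is a hypothetical helper, not part of the OpenAI SDK.

```javascript
// Returns a copy of an assistant message that is safe to append to history:
// reasoning_content is dropped, every other field is kept unchanged.
function toHistoryMessage(message) {
  const { reasoning_content, ...rest } = message;
  return rest;
}

const stored = toHistoryMessage({
  role: "assistant",
  content: "Final answer",
  reasoning_content: "Step-by-step thinking..."
});
console.log(stored); // { role: 'assistant', content: 'Final answer' }
```

In the example above, this would replace the manual object with `messages.push(toHistoryMessage(choice.message))`. Unlike the manual version, it also carries over any other fields on the message, which may or may not be desirable.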

Cost Optimization Tips

| Direction | Approach |
| --- | --- |
| Simplify system prompt | Use concise instructions instead of verbose descriptions |
| Control history length | Use the truncation or summary strategies above |
| Choose appropriate model | Use deepseek-v4-flash for simple tasks, hy3-preview for complex reasoning |
| Use caching | For fixed long system prompts, use prompt caching to reduce costs |