Multi-turn Conversation

Multi-turn conversation is one of the most common use cases for large language models. This document explains how to build multi-turn context, how messages are formatted, and how to manage conversation history in production environments.

How It Works

LLM APIs are stateless: the server does not store any conversation history. Each request is independent, and the model does not "remember" earlier turns.

To implement multi-turn conversation, you need to pass the complete message history as input with each request:

Round 1: messages = [user message 1]
Round 2: messages = [user message 1, assistant reply 1, user message 2]
Round 3: messages = [user message 1, assistant reply 1, user message 2, assistant reply 2, user message 3]

Each round's messages array contains all previous conversation records, and the model generates responses based on the complete context.
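The growth pattern above can be sketched with a stand-in for the model call. `fakeModel` and `createConversation` are hypothetical names, not part of any SDK; the stub only reports how many messages it received, so the example runs without network access.

```javascript
// Hypothetical stub standing in for a real model call (not an SDK function).
function fakeModel(messages) {
  return `reply based on ${messages.length} message(s)`;
}

// Returns a chat function that owns its own history array.
function createConversation() {
  const messages = [];
  return function chat(userInput) {
    messages.push({ role: "user", content: userInput });
    const reply = fakeModel(messages); // the full history is sent every round
    messages.push({ role: "assistant", content: reply });
    // sentCount = number of messages that were sent for this round
    return { reply, sentCount: messages.length - 1 };
  };
}

const chat = createConversation();
const rounds = ["user message 1", "user message 2", "user message 3"].map(chat);
console.log(rounds.map(r => r.sentCount)); // [ 1, 3, 5 ]
```

The counts 1, 3, 5 match the three rounds listed above: each round sends every earlier user and assistant message plus the new user message.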

Note: As conversation rounds increase, the number of input tokens grows with every request, which directly raises call cost and response latency. The History Management Strategies section below describes ways to limit this growth.

Message Format

The messages list is an array in which each message contains two core fields, role and content:

[
  { "role": "system", "content": "You are a professional translation assistant" },
  { "role": "user", "content": "Translate 'hello' to Chinese" },
  { "role": "assistant", "content": "你好" },
  { "role": "user", "content": "Now translate it to Japanese" }
]

Role Description

Role	Description	Count / Usage
`system`	System prompt, defines model behavior and background	0 or 1, placed first
`user`	User input message	At least 1
`assistant`	Model's response	Generated by model, passed back in multi-turn
`tool`	Tool call execution result	Only used in tool calling scenarios
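The constraints in the table can be checked before a request is sent. The following is a minimal sketch; `validateMessages` is a hypothetical helper, not an SDK function, and it enforces only the rules listed above.

```javascript
const KNOWN_ROLES = new Set(["system", "user", "assistant", "tool"]);

// Returns a list of problems; an empty list means the array passes these checks.
function validateMessages(messages) {
  if (!Array.isArray(messages) || messages.length === 0) {
    return ["messages must be a non-empty array"];
  }
  const problems = [];
  messages.forEach((m, i) => {
    if (!m || !KNOWN_ROLES.has(m.role)) problems.push(`message ${i}: unknown role`);
  });
  // Positions of all system messages in the array
  const systemIndexes = messages
    .map((m, i) => (m && m.role === "system" ? i : -1))
    .filter(i => i !== -1);
  if (systemIndexes.length > 1) problems.push("more than one system message");
  if (systemIndexes.length === 1 && systemIndexes[0] !== 0) {
    problems.push("system message must be first");
  }
  if (!messages.some(m => m && m.role === "user")) {
    problems.push("at least one user message is required");
  }
  return problems;
}

console.log(validateMessages([
  { role: "system", content: "You are a professional translation assistant" },
  { role: "user", content: "Translate 'hello' to Chinese" }
])); // []
```

If you adopt the rolling summary strategy described later, which adds a second system message, the single-system check would need to be relaxed.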

Content Format

The content field supports two formats:

Plain text (most common):

{ "role": "user", "content": "Tell me about Li Bai" }

Multimodal content (for image understanding etc.):

{
  "role": "user",
  "content": [
    { "type": "text", "text": "What's in this image?" },
    { "type": "image_url", "image_url": { "url": "https://example.com/photo.png" } }
  ]
}
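Because content can be either a string or an array of parts, code that reads message text (for token estimation or summarization, for example) has to handle both shapes. A minimal sketch follows; `contentToText` is a hypothetical helper name, and it assumes text parts use the `type: "text"` shape shown above.

```javascript
// Collects the text of a message's content, whichever format it uses.
// Non-text parts (such as image_url) are skipped.
function contentToText(content) {
  if (typeof content === "string") return content;
  if (!Array.isArray(content)) return "";
  return content
    .filter(part => part && part.type === "text" && typeof part.text === "string")
    .map(part => part.text)
    .join("\n");
}

console.log(contentToText("Tell me about Li Bai"));
// Tell me about Li Bai

console.log(contentToText([
  { type: "text", text: "What's in this image?" },
  { type: "image_url", image_url: { url: "https://example.com/photo.png" } }
]));
// What's in this image?
```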

Quick Start

Protocol Note

The examples below use the Chat Completions protocol. The core mechanism of multi-turn conversation (maintaining a complete messages array) applies to all protocols, including the CloudBase SDK, OpenAI SDK, and Anthropic SDK compatible protocols. For details on each protocol, see the Access Methods documentation.

Implement multi-turn conversation using CloudBase SDK (Web / Node.js):

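// Assumes `ai` is an already-initialized CloudBase AI instance (see the Access Methods documentation)
const model = ai.createModel("cloudbase");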

// Maintain message history
const messages = [
  { role: "system", content: "You are a poetry expert" }
];

async function chat(userInput) {
  // 1. Append user message
  messages.push({ role: "user", content: userInput });

  // 2. Call the model
  const result = await model.generateText({
    model: "hy3",
    messages
  });

  // 3. Append assistant reply to history
  messages.push({ role: "assistant", content: result.text });

  return result.text;
}

// Multi-turn conversation
await chat("What is Li Bai's most famous poem?");
// → "One of Li Bai's most famous poems is 'Quiet Night Thought'..."

await chat("What is the background of this poem?");
// → "Quiet Night Thought was written in 726 AD..."
// The model understands "this poem" refers to the one mentioned in the previous round

Implement multi-turn conversation using OpenAI SDK:

const OpenAI = require("openai");

const client = new OpenAI({
  apiKey: "<YOUR_API_KEY>",
  baseURL: "https://<ENV_ID>.api.tcloudbasegateway.com/v1/ai/cloudbase"
});

const messages = [
  { role: "system", content: "You are a poetry expert" }
];

async function chat(userMessage) {
  messages.push({ role: "user", content: userMessage });

  const completion = await client.chat.completions.create({
    model: "hy3",
    messages
  });

  const assistantMessage = completion.choices[0].message;
  messages.push(assistantMessage);

  return assistantMessage.content;
}

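// CommonJS modules (require) do not support top-level await,
// so run the conversation inside an async function
async function main() {
  console.log(await chat("What is Li Bai's most famous poem?"));
  console.log(await chat("What is the background of this poem?"));
}

main().catch(console.error);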

When calling the HTTP API directly, each request must carry the complete history:

curl -X POST 'https://<ENV_ID>.api.tcloudbasegateway.com/v1/ai/cloudbase/chat/completions' \
  -H 'Authorization: Bearer <YOUR_API_KEY>' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "hy3",
    "messages": [
      {"role": "system", "content": "You are a poetry expert"},
      {"role": "user", "content": "What is the most famous poem by Li Bai?"},
      {"role": "assistant", "content": "One of the most famous poems by Li Bai is Quiet Night Thought..."},
      {"role": "user", "content": "What is the background of this poem?"}
    ]
  }'
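The same request can be assembled in JavaScript for use with `fetch`. The sketch below only builds the URL and request options, reusing the endpoint pattern and headers from the cURL example; `buildChatRequest` is a hypothetical helper name, and the network call itself is left to the caller.

```javascript
// Builds the URL and fetch options for a Chat Completions request.
// The endpoint pattern and headers mirror the cURL example; nothing is sent here.
function buildChatRequest(envId, apiKey, model, messages) {
  return {
    url: `https://${envId}.api.tcloudbasegateway.com/v1/ai/cloudbase/chat/completions`,
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json"
      },
      // The complete history goes into the body on every request
      body: JSON.stringify({ model, messages })
    }
  };
}

const request = buildChatRequest("<ENV_ID>", "<YOUR_API_KEY>", "hy3", [
  { role: "system", content: "You are a poetry expert" },
  { role: "user", content: "What is the background of this poem?" }
]);
console.log(request.url);

// To send it: const res = await fetch(request.url, request.options);
```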

Implement multi-turn conversation in WeChat Mini Program:

Page({
  data: {
    chatHistory: [],
    inputValue: "",
    currentReply: "" // partial assistant reply shown while streaming
  },

  async sendMessage() {
    const { inputValue, chatHistory } = this.data;
    if (!inputValue.trim()) return;

    // Rebuild the full message list for this request
    const messages = [
      { role: "system", content: "You are a poetry expert" },
      ...chatHistory,
      { role: "user", content: inputValue }
    ];

    const model = wx.cloud.extend.AI.createModel("cloudbase");
    let assistantContent = "";

    const res = await model.streamText({
      data: { model: "hy3", messages }
    });

    // Render the reply incrementally as chunks arrive
    for await (const text of res.textStream) {
      assistantContent += text;
      this.setData({ currentReply: assistantContent });
    }

    // Stream finished: commit both messages to history and reset transient state
    this.setData({
      chatHistory: [
        ...chatHistory,
        { role: "user", content: inputValue },
        { role: "assistant", content: assistantContent }
      ],
      inputValue: "",
      currentReply: ""
    });
  }
});

Streaming Multi-turn

In streaming scenarios, you need to wait for the stream to complete before appending the full assistant reply to history:

// Assumes `ai` is an already-initialized CloudBase AI instance
const model = ai.createModel("cloudbase");

const messages = [
  { role: "system", content: "You are a helpful assistant" }
];

async function chatStream(userInput) {
  messages.push({ role: "user", content: userInput });

  const res = await model.streamText({
    model: "hy3",
    messages
  });

  let fullText = "";
  for await (const text of res.textStream) {
    fullText += text;
    process.stdout.write(text);
  }

  // After stream ends, append complete reply to history
  messages.push({ role: "assistant", content: fullText });

  return fullText;
}
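One consequence of waiting for the stream to complete is that an interrupted stream leaves a user message in history with no reply after it. The sketch below shows one possible way to handle that case, by removing the pending user message when the stream fails. `streamFn` is a hypothetical parameter standing in for any function that returns an async iterable of text chunks; it is not an SDK API, and the two generator functions are stubs for demonstration only.

```javascript
// Appends the assistant reply only if the stream completes.
// On failure, the pending user message is removed so history stays consistent.
async function chatStreamSafe(messages, userInput, streamFn) {
  messages.push({ role: "user", content: userInput });
  let fullText = "";
  try {
    for await (const chunk of streamFn(messages)) {
      fullText += chunk;
    }
  } catch (err) {
    messages.pop(); // drop the user message that never got a complete reply
    throw err;
  }
  messages.push({ role: "assistant", content: fullText });
  return fullText;
}

// Stub streams (assumptions for this example, not real model output).
async function* okStream() { yield "Hello"; yield ", world"; }
async function* brokenStream() { yield "Hel"; throw new Error("connection lost"); }

async function demo() {
  const messages = [{ role: "system", content: "You are a helpful assistant" }];
  await chatStreamSafe(messages, "Hi", okStream);
  await chatStreamSafe(messages, "Again", brokenStream).catch(() => {});
  return messages; // system + user + assistant; the failed round left no trace
}
```

Discarding the partial reply is a design choice. An application could instead keep the partial text and mark it as incomplete in its UI, as long as it does not append a truncated reply to history as if it were the full answer.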

History Management Strategies

As conversation rounds increase, the messages array grows longer, causing two problems:

Token consumption growth: Input tokens per round = all history tokens + current new message tokens
Exceeding context window: When total history tokens exceed the model's maximum context length, the request will fail
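To make the first problem concrete, the sketch below works through the arithmetic under the simplifying assumption that every message costs the same number of tokens. The function names and the 100-token figure are illustrative, not measured values.

```javascript
// Input tokens sent in round n when every message costs `tokensPerMessage`.
// Round n carries (n - 1) earlier user/assistant pairs plus the new user message.
function inputTokensForRound(n, tokensPerMessage) {
  return (2 * (n - 1) + 1) * tokensPerMessage;
}

// Total input tokens across rounds 1..n.
function cumulativeInputTokens(n, tokensPerMessage) {
  let total = 0;
  for (let round = 1; round <= n; round++) {
    total += inputTokensForRound(round, tokensPerMessage);
  }
  return total;
}

console.log(inputTokensForRound(10, 100));   // 1900
console.log(cumulativeInputTokens(10, 100)); // 10000
```

Per-round input grows linearly with the number of rounds, so the cumulative total grows quadratically: under this assumption, n rounds cost n² × tokens per message in input alone.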

Strategy 1: Context Truncation

Keep only the most recent N rounds and discard older history. Simple to implement, suitable for most scenarios.

const MAX_ROUNDS = 10;  // Keep most recent 10 rounds

function trimMessages(messages) {
  // Always keep the system message
  const systemMsg = messages.find(m => m.role === "system");
  const history = messages.filter(m => m.role !== "system");

  // Each round = 1 user + 1 assistant = 2 messages
  const trimmed = history.slice(-MAX_ROUNDS * 2);

  return systemMsg ? [systemMsg, ...trimmed] : trimmed;
}

// Trim before each call
const trimmedMessages = trimMessages(messages);
const result = await model.generateText({
  model: "hy3",
  messages: trimmedMessages
});
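When trimming happens right before a call, the history ends with the new user message, so an even-sized slice starts with an assistant reply whose question has been cut off. The variant below is a sketch of one way to avoid that, not part of the original example: it drops leading non-user messages so the trimmed history starts with a user turn. `trimToUserBoundary` is a hypothetical name.

```javascript
// Keeps at most maxRounds * 2 recent messages, then drops anything before
// the first user message so the history never starts with an orphaned reply.
function trimToUserBoundary(messages, maxRounds) {
  const systemMsg = messages.find(m => m.role === "system");
  const history = messages.filter(m => m.role !== "system");

  // slice(-0) would return the whole array, so guard non-positive values
  let trimmed = maxRounds > 0 ? history.slice(-maxRounds * 2) : [];

  const firstUser = trimmed.findIndex(m => m.role === "user");
  trimmed = firstUser === -1 ? [] : trimmed.slice(firstUser);

  return systemMsg ? [systemMsg, ...trimmed] : trimmed;
}

const sample = [
  { role: "system", content: "You are a poetry expert" },
  { role: "user", content: "q1" },
  { role: "assistant", content: "a1" },
  { role: "user", content: "q2" },
  { role: "assistant", content: "a2" },
  { role: "user", content: "q3" } // pending question, not yet answered
];

console.log(trimToUserBoundary(sample, 2).map(m => m.content));
// [ 'You are a poetry expert', 'q2', 'a2', 'q3' ]
```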

Strategy 2: Token Budget Control

Control history length by token count rather than rounds for more precision.

const MAX_INPUT_TOKENS = 8000;  // Reserve 8000 tokens for input

function trimByTokens(messages, maxTokens) {
  const systemMsg = messages.find(m => m.role === "system");
  const history = messages.filter(m => m.role !== "system");

  let totalTokens = estimateTokens(systemMsg?.content || "");
  const result = systemMsg ? [systemMsg] : [];

  // Start from the newest message, accumulate backwards
  for (let i = history.length - 1; i >= 0; i--) {
    const msgTokens = estimateTokens(history[i].content);
    if (totalTokens + msgTokens > maxTokens) break;
    totalTokens += msgTokens;
    result.splice(systemMsg ? 1 : 0, 0, history[i]);
  }

  return result;
}

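// Rough, deliberately conservative estimate: 1.5 tokens per character.
// This follows the rule of thumb for Chinese text but overestimates English,
// which averages roughly one token per 4 characters. Actual counts depend on
// the model's tokenizer.
function estimateTokens(text) {
  if (!text) return 0;
  return Math.ceil(text.length * 1.5);
}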
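If a single per-character factor is too coarse for mixed-language conversations, a slightly finer sketch counts CJK characters and other characters separately. The ratios used here (1.5 tokens per CJK character, about 4 other characters per token) are rough rules of thumb that vary by tokenizer, and `estimateTokensMixed` is a hypothetical name. For exact figures, rely on the token usage reported in the API response, if the API returns one.

```javascript
// Estimates tokens by weighting CJK characters and other characters differently.
// The ratios are assumptions, not tokenizer output.
function estimateTokensMixed(text) {
  if (!text) return 0;
  // CJK unified ideographs plus Japanese hiragana and katakana
  const cjkMatches = text.match(/[\u4e00-\u9fff\u3040-\u30ff]/g);
  const cjkCount = cjkMatches ? cjkMatches.length : 0;
  const otherCount = text.length - cjkCount;
  return Math.ceil(cjkCount * 1.5 + otherCount / 4);
}

console.log(estimateTokensMixed("你好"));        // 3
console.log(estimateTokensMixed("hello world")); // 3
```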

Strategy 3: Rolling Summary

When history is too long, use the model to generate a summary of earlier conversations, compressing history while preserving key information.

async function summarizeHistory(messages) {
  const model = ai.createModel("cloudbase");

  const result = await model.generateText({
    model: "hy3",
    messages: [
      {
        role: "user",
        content: `Please briefly summarize the key information from this conversation:\n\n${
          messages.map(m => `${m.role}: ${m.content}`).join("\n")
        }`
      }
    ]
  });

  return result.text;
}

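// Prefix used to recognize a summary message produced by an earlier compression
const SUMMARY_PREFIX = "[Conversation history summary]";

// When history exceeds threshold, compress early conversations into a summary
async function manageHistory(messages, maxRounds = 10) {
  const isSummary = m =>
    m.role === "system" &&
    typeof m.content === "string" &&
    m.content.startsWith(SUMMARY_PREFIX);
  const systemMsg = messages.find(m => m.role === "system" && !isSummary(m));
  const previousSummary = messages.find(isSummary);
  const history = messages.filter(m => m.role !== "system");

  if (history.length <= maxRounds * 2) return messages;

  // Compress early conversations into a summary, folding in the previous
  // summary so that information from earlier compressions is not lost
  const earlyHistory = history.slice(0, -maxRounds * 2);
  const recentHistory = history.slice(-maxRounds * 2);
  const summary = await summarizeHistory(
    previousSummary ? [previousSummary, ...earlyHistory] : earlyHistory
  );

  return [
    ...(systemMsg ? [systemMsg] : []),
    { role: "system", content: `${SUMMARY_PREFIX} ${summary}` },
    ...recentHistory
  ];
}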

Strategy Comparison

Strategy	Pros	Cons	Use Cases
Context Truncation	Simple, zero overhead	Loses early information	Simple Q&A, customer service
Token Budget Control	Precise cost control	Slightly complex	Cost-sensitive scenarios
Rolling Summary	Preserves core info	Extra API call needed	Long conversations, complex tasks

Thinking Models in Multi-turn

When using deep thinking models (e.g., deepseek-r1), the model returns both reasoning_content (thinking process) and content (final answer).

Key rule: When updating messages, only keep content, ignore reasoning_content.

const OpenAI = require("openai");

const client = new OpenAI({
  apiKey: "<YOUR_API_KEY>",
  baseURL: "https://<ENV_ID>.api.tcloudbasegateway.com/v1/ai/cloudbase"
});

const messages = [];

async function chatWithThinking(userMessage) {
  messages.push({ role: "user", content: userMessage });

  const completion = await client.chat.completions.create({
    model: "deepseek-r1",
    messages
  });

  const choice = completion.choices[0];

  // ⚠️ Do NOT append reasoning_content to messages
  console.log("Thinking:", choice.message.reasoning_content);
  console.log("Answer:", choice.message.content);

  // ✅ Only append content to history
  messages.push({
    role: "assistant",
    content: choice.message.content
  });

  return choice.message.content;
}

Warning: Appending reasoning_content to messages can cause format errors or degraded response quality in subsequent requests.
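The OpenAI SDK example in Quick Start pushes the whole message object into history, which would carry reasoning_content along if the model were a thinking model. A small helper can strip the field before the message is stored. This is a sketch; `stripReasoning` is a hypothetical name.

```javascript
// Returns a copy of an assistant message without reasoning_content.
// Other fields (role, content, and any tool call data) are kept as they are.
function stripReasoning(message) {
  const { reasoning_content, ...rest } = message;
  return rest;
}

const reply = {
  role: "assistant",
  content: "The answer is 4.",
  reasoning_content: "2 + 2 = 4, so..."
};

console.log(stripReasoning(reply));
// { role: 'assistant', content: 'The answer is 4.' }
```

With this helper, `messages.push(stripReasoning(completion.choices[0].message))` is safe for both thinking and non-thinking models, since a message without the field passes through unchanged.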

Cost Optimization Tips

Direction	Approach
Simplify system prompt	Use concise instructions instead of verbose descriptions
Control history length	Use truncation or summary strategies above
Choose appropriate model	Use `hy3` for simple tasks, `deepseek-r1` for complex reasoning
Use caching	For fixed long system prompts, use prompt caching to reduce costs
