Multimodal Understanding
Multimodal understanding lets a model accept inputs other than text—images, videos, files—within a single conversation and answer questions about them. Common scenarios include image captioning, screenshot Q&A, video summarization, document extraction, and OCR.
- Multimodal understanding (this page): the input contains images / videos / files; the model returns text.
- Image generation: the input is a text prompt; the model returns an image. See Image Generation.
Supported Models
Multimodal capability varies considerably across models:
| Model | Image input | Video input | File input | Notes |
|---|---|---|---|---|
| glm-5v-turbo | ✅ URL / Base64 | ✅ URL | ✅ URL (PDF / TXT / DOC) | Image, video, and file CANNOT be mixed in one request |
| qwen3.5-plus | ✅ URL / Base64 | ✅ URL | — | Thinking is on by default; pair with enable_thinking |
| kimi-k2.6 | ✅ URL / Base64 | ✅ URL | — | The only model in the Kimi family that supports video |
| kimi-k2.5 | ✅ URL / Base64 | — | — | No video support |
| kimi-k2.7-code | ✅ Base64 only | — | — | Coding-specific; the URL form is not supported |
- Only commonly used multimodal models are listed. See Overview for the full model list.
- Pure text models (e.g. deepseek-v4-flash, hy3-preview) either silently ignore image/video inputs or reject the request.
Key Differences
- Image input: every model except kimi-k2.7-code accepts both URL and Base64. kimi-k2.7-code only accepts Base64.
- Video: only glm-5v-turbo, qwen3.5-plus, and kimi-k2.6 accept video, and only via a publicly reachable URL (Base64 not supported).
- File: only glm-5v-turbo accepts files (PDF / TXT / DOC), URL only.
- Mixing limit: glm-5v-turbo does not allow image, video, and file to be sent together in one request; split them into separate calls.
- Image size: when using Base64, keep each image under ~5 MB. URLs must be reachable from our backend (public direct link).
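The constraints above can be checked on the client before a request is sent. The following is a minimal sketch, not an SDK feature: `CAPABILITIES` and `validateContent` are hypothetical names, the table is transcribed from this page, and the sketch assumes that any two different media kinds in one glm-5v-turbo request count as mixing.

```javascript
// Capability table transcribed from the Supported Models table on this page.
// `CAPABILITIES` and `validateContent` are illustrative names, not SDK APIs.
const CAPABILITIES = {
  "glm-5v-turbo": { imageUrl: true, imageBase64: true, video: true, file: true, noMixing: true },
  "qwen3.5-plus": { imageUrl: true, imageBase64: true, video: true, file: false },
  "kimi-k2.6": { imageUrl: true, imageBase64: true, video: true, file: false },
  "kimi-k2.5": { imageUrl: true, imageBase64: true, video: false, file: false },
  "kimi-k2.7-code": { imageUrl: false, imageBase64: true, video: false, file: false }
};

// Returns a list of human-readable problems; an empty list means the content
// blocks look compatible with the chosen model.
function validateContent(model, blocks) {
  const caps = CAPABILITIES[model];
  if (!caps) return [`${model} is not in the multimodal capability table`];

  const problems = [];
  const kinds = new Set(); // media kinds seen in this request
  for (const block of blocks) {
    if (block.type === "image_url") {
      kinds.add("image");
      const isBase64 = block.image_url.url.startsWith("data:");
      if (isBase64 && !caps.imageBase64) problems.push(`${model} does not accept Base64 images`);
      if (!isBase64 && !caps.imageUrl) problems.push(`${model} only accepts Base64 images`);
    } else if (block.type === "video_url") {
      kinds.add("video");
      if (!caps.video) problems.push(`${model} does not accept video`);
      else if (block.video_url.url.startsWith("data:")) problems.push("video must be a public URL, not Base64");
    } else if (block.type === "file") {
      kinds.add("file");
      if (!caps.file) problems.push(`${model} does not accept files`);
    }
  }
  // Assumption: any two different media kinds count as "mixing".
  if (caps.noMixing && kinds.size > 1) {
    problems.push(`${model} cannot mix ${[...kinds].join(", ")} in one request`);
  }
  return problems;
}
```

Calling `validateContent(model, content)` before `generateText` and refusing to send when the list is non-empty turns a server-side rejection into an immediate local error.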
Message Format
Multimodal requests follow the OpenAI Chat Completions standard: replace messages[n].content with an array of content blocks—text, image, video, or file.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } }
  | { type: "video_url"; video_url: { url: string } }
  | { type: "file"; file: { file_url: string } };
Text block
{ type: "text", text: "Describe this image." }
Image block
URL:
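{
  type: "image_url",
  image_url: { url: "https://example.com/photo.jpg" }
}
Base64:
const fs = require("fs");

// Read the file and encode its bytes as Base64.
const base64 = fs.readFileSync("./photo.jpg").toString("base64");

// Wrap the payload in a data URL; the MIME type should match the file format.
const imageBlock = {
  type: "image_url",
  image_url: { url: `data:image/jpeg;base64,${base64}` }
};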
Video block
{
  type: "video_url",
  video_url: { url: "https://example.com/clip.mp4" }
}
File block (glm-5v-turbo only)
{
  type: "file",
  file: { file_url: "https://example.com/contract.pdf" }
}
Usage
Image Understanding
CloudBase SDK:
const model = ai.createModel("cloudbase");
const res = await model.generateText({
  model: "glm-5v-turbo",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "How many cats are there and what colors?" },
        {
          type: "image_url",
          image_url: { url: "https://example.com/cats.jpg" }
        }
      ]
    }
  ]
});
console.log(res.text);

OpenAI SDK:
const OpenAI = require("openai");
const client = new OpenAI({
  apiKey: "<YOUR_API_KEY>",
  baseURL: "https://<ENV_ID>.api.tcloudbasegateway.com/v1/ai/cloudbase"
});
const completion = await client.chat.completions.create({
  model: "glm-5v-turbo",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "How many cats are there and what colors?" },
        {
          type: "image_url",
          image_url: { url: "https://example.com/cats.jpg" }
        }
      ]
    }
  ]
});
console.log(completion.choices[0].message.content);

Mini Program:
const model = wx.cloud.extend.AI.createModel("cloudbase");
const res = await model.generateText({
  data: {
    model: "glm-5v-turbo",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "How many cats are there and what colors?" },
          {
            type: "image_url",
            image_url: { url: "https://example.com/cats.jpg" }
          }
        ]
      }
    ]
  }
});
console.log(res.text);

cURL:
curl -X POST 'https://<ENV_ID>.api.tcloudbasegateway.com/v1/ai/cloudbase/chat/completions' \
  -H 'Authorization: Bearer <YOUR_API_KEY>' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "glm-5v-turbo",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "How many cats are there and what colors?"},
          {"type": "image_url", "image_url": {"url": "https://example.com/cats.jpg"}}
        ]
      }
    ]
  }'
Sending a Local Image as Base64
Useful when the image is not on the public internet:
const fs = require("fs");
const path = require("path");

const filePath = path.resolve(__dirname, "./local.jpg");
const base64 = fs.readFileSync(filePath).toString("base64");
const dataUrl = `data:image/jpeg;base64,${base64}`;

const model = ai.createModel("cloudbase");
const res = await model.generateText({
  model: "kimi-k2.6",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Extract the text in the image." },
        { type: "image_url", image_url: { url: dataUrl } }
      ]
    }
  ]
});
console.log(res.text);
- Mini program: wx.getFileSystemManager().readFile({ filePath, encoding: "base64" }) returns the Base64 payload; prepend the data:image/...;base64, prefix yourself.
- Web: use FileReader.readAsDataURL to obtain the full data:image/...;base64,... string.
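The Node.js example above hard-codes image/jpeg and does not check the size guideline. A minimal sketch of a reusable helper follows. `imageBlockFromFile` and `MIME_BY_EXT` are hypothetical names, the extension list is an assumption (this page does not state which image formats each model accepts), and the ~5 MB guideline is applied here to the raw file size, since the page does not say whether it is measured before or after encoding.

```javascript
const fs = require("fs");
const path = require("path");

// Extension-to-MIME map for a few common formats (an assumption, not exhaustive).
const MIME_BY_EXT = {
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".png": "image/png",
  ".webp": "image/webp",
  ".gif": "image/gif"
};

// The ~5 MB guideline from this page, treated as a hard limit on the raw bytes.
const MAX_IMAGE_BYTES = 5 * 1024 * 1024;

// Hypothetical helper: reads a local image and returns an image_url content block.
function imageBlockFromFile(filePath) {
  const mime = MIME_BY_EXT[path.extname(filePath).toLowerCase()];
  if (!mime) throw new Error(`Unsupported image extension: ${filePath}`);

  const bytes = fs.readFileSync(filePath);
  if (bytes.length > MAX_IMAGE_BYTES) {
    throw new Error(`Image is ${bytes.length} bytes; keep it under ${MAX_IMAGE_BYTES}`);
  }
  return {
    type: "image_url",
    image_url: { url: `data:${mime};base64,${bytes.toString("base64")}` }
  };
}
```

The returned object can be placed directly in a message's content array alongside a text block.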
Multiple Images in One Turn
Append several image_url blocks to the same content array:
const res = await model.generateText({
  model: "glm-5v-turbo",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Compare these two images and list 3 differences." },
        { type: "image_url", image_url: { url: "https://example.com/before.jpg" } },
        { type: "image_url", image_url: { url: "https://example.com/after.jpg" } }
      ]
    }
  ]
});
Video Understanding
The video must be supplied as a public URL, and the model must be one of glm-5v-turbo, qwen3.5-plus, or kimi-k2.6:
const model = ai.createModel("cloudbase");
const res = await model.generateText({
  model: "kimi-k2.6",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Summarize this video in 100 words." },
        {
          type: "video_url",
          video_url: { url: "https://example.com/clip.mp4" }
        }
      ]
    }
  ]
});
console.log(res.text);
Video processing can take a while; combine with Streaming to avoid request timeouts.
File Understanding (PDF / Text)
Only glm-5v-turbo supports file input, and only via URL:
const res = await model.generateText({
  model: "glm-5v-turbo",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Summarize the key terms and list the risks." },
        {
          type: "file",
          file: { file_url: "https://example.com/contract.pdf" }
        }
      ]
    }
  ]
});
glm-5v-turbo does not allow image, video, and file in one request—split into separate calls.
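One way to do that splitting is to group the media blocks by kind and repeat the text prompt in each group. The sketch below assumes that any two different media kinds count as mixing; `splitByModality` is a hypothetical helper, not part of the SDK.

```javascript
// Maps a content block type to its media kind; text blocks have no kind.
const KIND_BY_TYPE = { image_url: "image", video_url: "video", file: "file" };

// Hypothetical helper: returns one content array per media kind, each
// prefixed with the shared text blocks from the original content.
function splitByModality(blocks) {
  const textBlocks = blocks.filter((b) => b.type === "text");
  const groups = new Map(); // kind -> media blocks, in first-seen order
  for (const block of blocks) {
    const kind = KIND_BY_TYPE[block.type];
    if (!kind) continue; // text or unknown block types are not media
    if (!groups.has(kind)) groups.set(kind, []);
    groups.get(kind).push(block);
  }
  // Text-only content needs no splitting.
  if (groups.size === 0) return [textBlocks];
  return [...groups.values()].map((media) => [...textBlocks, ...media]);
}
```

Each returned array would then be sent as the content of its own request, and the answers combined by the caller.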
Combining with Streaming
Multimodal requests are often slow; streaming is recommended:
const res = await model.streamText({
  model: "glm-5v-turbo",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Describe the composition and color of this photo." },
        { type: "image_url", image_url: { url: "https://example.com/photo.jpg" } }
      ]
    }
  ]
});
for await (const text of res.textStream) {
  process.stdout.write(text);
}
See Streaming for more.
Combining with Deep Thinking
Some models support deep thinking on top of multimodal input—useful for chart analysis, "look-at-image-and-solve-math" tasks, etc.
const res = await model.generateText({
  model: "qwen3.5-plus",
  reasoning_effort: "high",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Which expression matches the curve in the chart? Show the derivation." },
        { type: "image_url", image_url: { url: "https://example.com/chart.png" } }
      ]
    }
  ]
});
const raw = res.rawResponses[0].choices[0].message;
console.log("Thinking:", raw.reasoning_content);
console.log("Answer:", res.text);
See Deep Thinking for reasoning_effort semantics and model differences.
Multimodal in Multi-turn Conversations
Multimodal blocks work in conversation history, but watch out for:
- URL longevity: signed/temporary URLs may expire, breaking the history. Prefer Base64 or permanent object-storage links.
- Token cost: an image is encoded into hundreds to thousands of tokens. Pair with Context Management to trim history.
- Role restriction: only user messages may use the array content form. assistant replies remain plain text.
const messages = [
  {
    role: "user",
    content: [
      { type: "text", text: "What flower is this?" },
      { type: "image_url", image_url: { url: "https://example.com/flower.jpg" } }
    ]
  },
  { role: "assistant", content: "It's a sunflower." },
  { role: "user", content: "When does it usually bloom?" }
];
const res = await model.generateText({ model: "glm-5v-turbo", messages });
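To limit the token cost noted above, one option is to drop media blocks from older turns before resending the history. The sketch below is one possible trimming policy, not the Context Management feature itself; `stripOldMedia` and the "[media omitted]" placeholder are hypothetical.

```javascript
// Hypothetical helper: keeps media blocks only in the most recent `keepRecent`
// messages. In older messages the text is preserved, media blocks are dropped,
// and a placeholder marks the omission. The input array is not mutated.
function stripOldMedia(messages, keepRecent = 2) {
  const cutoff = messages.length - keepRecent;
  return messages.map((msg, i) => {
    // Recent messages and plain-string messages pass through unchanged.
    if (i >= cutoff || !Array.isArray(msg.content)) return msg;
    const text = msg.content
      .filter((b) => b.type === "text")
      .map((b) => b.text)
      .join("\n");
    return { ...msg, content: `${text}\n[media omitted]`.trim() };
  });
}
```

The trade-off is that the model can no longer look at the dropped image, so this fits conversations where earlier assistant replies already capture what matters about it.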
FAQ
Error: "model does not support image input"
The selected model is text-only. Pick a model from the supported list.
Model says "I cannot see the image" with a URL
- Verify the URL is publicly reachable and not behind auth. Watch out for signed URLs that expire.
- Some models cannot access domains in mainland China. Consider uploading to CloudBase Storage and using its download URL.
- Some models (e.g. kimi-k2.7-code) only accept Base64; URLs are rejected.
Video recognition quality is poor
- Shorter and clearer videos work better. Aim for under 30 seconds and around 720p.
- For complex videos, extract key frames into images and send them as multi-image input instead.
How do I control image fidelity (detail parameter)?
Some models support image_url.detail (low / high / auto) to trade off cost versus fidelity. low is cheaper but less accurate; high is the opposite. Whether it takes effect depends on the model.
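As a sketch of how the hint could be attached, the helper below builds an image block with an optional detail field. `imageBlock` is a hypothetical name; the field name and its three values come from the answer above, and a model that does not support the field may simply ignore it.

```javascript
// Values listed in the FAQ answer above.
const DETAIL_LEVELS = ["low", "high", "auto"];

// Hypothetical helper: builds an image_url block, adding `detail` only when given.
function imageBlock(url, detail) {
  if (detail !== undefined && !DETAIL_LEVELS.includes(detail)) {
    throw new Error(`detail must be one of ${DETAIL_LEVELS.join(", ")}`);
  }
  const image_url = detail === undefined ? { url } : { url, detail };
  return { type: "image_url", image_url };
}
```

For example, `imageBlock("https://example.com/receipt.jpg", "high")` could be used when small print matters, and "low" for coarse questions such as counting objects.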