Multimodal Understanding

Multimodal understanding lets a model accept inputs other than text—images, videos, files—within a single conversation and answer questions about them. Common scenarios include image captioning, screenshot Q&A, video summarization, document extraction, and OCR.

Difference from image generation
  • Multimodal understanding (this page): the input contains images / videos / files; the model returns text.
  • Image generation: the input is a text prompt; the model returns an image. See Image Generation.

Supported Models

Multimodal capability varies considerably across models:

| Model | Image input | Video input | File input | Notes |
|---|---|---|---|---|
| glm-5v-turbo | ✅ URL / Base64 | ✅ URL | ✅ URL (PDF / TXT / DOC) | Image, video, and file cannot be mixed in one request |
| qwen3.5-plus | ✅ URL / Base64 | ✅ URL | ❌ | Thinking is on by default; pair with enable_thinking |
| kimi-k2.6 | ✅ URL / Base64 | ✅ URL | ❌ | The only model in the Kimi family that supports video |
| kimi-k2.5 | ✅ URL / Base64 | ❌ | ❌ | No video support |
| kimi-k2.7-code | ✅ Base64 only | ❌ | ❌ | Coding-specific; URL form is not supported |

Note:
  • Only commonly used multimodal models are listed. See Overview for the full model list.
  • Pure text models (e.g. deepseek-v4-flash, hy3-preview) silently ignore or reject image/video inputs.

Key Differences

  • Image input: every model except kimi-k2.7-code accepts both URL and Base64. kimi-k2.7-code only accepts Base64.
  • Video: only glm-5v-turbo, qwen3.5-plus, and kimi-k2.6 accept video, and only via a publicly reachable URL (Base64 not supported).
  • File: only glm-5v-turbo accepts files (PDF / TXT / DOC), URL only.
  • Mixing limit: glm-5v-turbo does not allow image, video, and file to be sent together in one request—split into separate calls.
  • Image size: when using Base64, keep each image under ~5 MB. URLs must be reachable from our backend (public direct link).
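
The rules above can be encoded as a lookup table and checked before a request is sent. This is a minimal sketch: `CAPABILITIES`, `describeBlock`, and `checkContent` are hypothetical names, not part of any SDK, and the table only mirrors the models listed on this page. It assumes the mixing limit means at most one media kind per request.

```javascript
// Capability table transcribed from the "Supported Models" section.
// Hypothetical helper data: not part of any SDK.
// Assumption: noMixing means at most one media kind (image, video, or file) per request.
const CAPABILITIES = {
  "glm-5v-turbo": { image: ["url", "base64"], video: ["url"], file: ["url"], noMixing: true },
  "qwen3.5-plus": { image: ["url", "base64"], video: ["url"], file: [] },
  "kimi-k2.6": { image: ["url", "base64"], video: ["url"], file: [] },
  "kimi-k2.5": { image: ["url", "base64"], video: [], file: [] },
  "kimi-k2.7-code": { image: ["base64"], video: [], file: [] },
};

// Map a content block to its media kind and URL string (null for text blocks).
function describeBlock(block) {
  if (block.type === "image_url") return { kind: "image", url: block.image_url.url };
  if (block.type === "video_url") return { kind: "video", url: block.video_url.url };
  if (block.type === "file") return { kind: "file", url: block.file.file_url };
  return null;
}

// Return a list of human-readable problems; an empty list means no rule was violated.
function checkContent(model, content) {
  const caps = CAPABILITIES[model];
  if (!caps) return [`${model} is not in the multimodal capability table`];

  const problems = [];
  const kinds = new Set();
  for (const block of content) {
    const info = describeBlock(block);
    if (!info) continue;
    kinds.add(info.kind);
    // Data URLs carry Base64 payloads; everything else is treated as a plain URL.
    const form = info.url.startsWith("data:") ? "base64" : "url";
    if (!caps[info.kind].includes(form)) {
      problems.push(`${model} does not accept ${info.kind} input as ${form}`);
    }
  }
  if (caps.noMixing && kinds.size > 1) {
    problems.push(`${model} cannot mix ${[...kinds].join(" and ")} in one request`);
  }
  return problems;
}
```

Calling `checkContent(model, content)` before `generateText` turns a server-side rejection into a local, readable message. The check does not cover image size or URL reachability.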

Message Format

Multimodal requests follow the OpenAI Chat Completions message format: instead of a plain string, set messages[n].content to an array of content blocks, each of type text, image_url, video_url, or file.

type ContentBlock =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } }
  | { type: "video_url"; video_url: { url: string } }
  | { type: "file"; file: { file_url: string } };

Text block

{ type: "text", text: "Describe this image." }

Image block

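URL:

{
  type: "image_url",
  image_url: { url: "https://example.com/photo.jpg" }
}

Base64 (wrap the encoded bytes in a data URL whose MIME type matches the file):

const fs = require("fs");

// Read the file and encode its bytes as Base64.
const base64 = fs.readFileSync("./photo.jpg").toString("base64");

const imageBlock = {
  type: "image_url",
  image_url: { url: `data:image/jpeg;base64,${base64}` }
};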

Video block

{
  type: "video_url",
  video_url: { url: "https://example.com/clip.mp4" }
}

File block (glm-5v-turbo only)

{
  type: "file",
  file: { file_url: "https://example.com/contract.pdf" }
}
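
To avoid hand-writing the nested objects, the four block shapes can be wrapped in small constructors. The names below (`textBlock`, `imageBlock`, `videoBlock`, `fileBlock`, `userMessage`) are hypothetical conveniences for illustration, not SDK functions.

```javascript
// Hypothetical constructors for the four content block shapes described above.
const textBlock = (text) => ({ type: "text", text });
const imageBlock = (url) => ({ type: "image_url", image_url: { url } });
const videoBlock = (url) => ({ type: "video_url", video_url: { url } });
const fileBlock = (url) => ({ type: "file", file: { file_url: url } });

// Build a user message: a text prompt followed by any number of media blocks.
function userMessage(prompt, ...mediaBlocks) {
  return { role: "user", content: [textBlock(prompt), ...mediaBlocks] };
}

const message = userMessage(
  "Compare these two images and list 3 differences.",
  imageBlock("https://example.com/before.jpg"),
  imageBlock("https://example.com/after.jpg")
);
console.log(JSON.stringify(message, null, 2));
```

The resulting `message` can be placed directly in the `messages` array of a request.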

Usage

Image Understanding

// `ai` is assumed to be an already-initialized AI SDK instance.
const model = ai.createModel("cloudbase");

const res = await model.generateText({
  model: "glm-5v-turbo",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "How many cats are there and what colors?" },
        {
          type: "image_url",
          image_url: { url: "https://example.com/cats.jpg" }
        }
      ]
    }
  ]
});

console.log(res.text);

Sending a Local Image as Base64

Useful when the image is not on the public internet:

const fs = require("fs");
const path = require("path");

const filePath = path.resolve(__dirname, "./local.jpg");
const base64 = fs.readFileSync(filePath).toString("base64");
const dataUrl = `data:image/jpeg;base64,${base64}`;

const model = ai.createModel("cloudbase");

const res = await model.generateText({
  model: "kimi-k2.6",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Extract the text in the image." },
        { type: "image_url", image_url: { url: dataUrl } }
      ]
    }
  ]
});

console.log(res.text);

Getting Base64 in a mini program or on the Web
  • Mini program: wx.getFileSystemManager().readFile({ filePath, encoding: "base64" })
  • Web: use FileReader.readAsDataURL to obtain the full data:image/...;base64,... string.
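
In Node.js, the hard-coded `image/jpeg` prefix can be replaced by a lookup on the file extension, and the size guideline can be checked before encoding. The sketch below is illustrative: `imageFileToDataUrl` and `MIME_BY_EXT` are hypothetical names, the ~5 MB guideline is assumed to apply to the raw file size, and which image formats a given model accepts is not stated on this page.

```javascript
const fs = require("fs");
const path = require("path");

// Extension-to-MIME map for common image formats (extend as needed).
const MIME_BY_EXT = {
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".png": "image/png",
  ".webp": "image/webp",
  ".gif": "image/gif",
};

// Assumption: the ~5 MB guideline is applied to the raw file size.
// The Base64 text itself is about 4/3 the size of the raw bytes.
const MAX_IMAGE_BYTES = 5 * 1024 * 1024;

// Hypothetical helper: read a local image and return it as a data URL.
function imageFileToDataUrl(filePath, maxBytes = MAX_IMAGE_BYTES) {
  const ext = path.extname(filePath).toLowerCase();
  const mime = MIME_BY_EXT[ext];
  if (!mime) throw new Error(`Unsupported image extension: ${ext || "(none)"}`);

  const bytes = fs.readFileSync(filePath);
  if (bytes.length > maxBytes) {
    throw new Error(`Image is ${bytes.length} bytes; limit is ${maxBytes}`);
  }
  return `data:${mime};base64,${bytes.toString("base64")}`;
}
```

The returned string can be used directly as `image_url.url`. Failing early on an oversized file is cheaper than waiting for the request to be rejected.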

Multiple Images in One Turn

Append several image_url blocks to the same content array:

const res = await model.generateText({
  model: "glm-5v-turbo",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Compare these two images and list 3 differences." },
        { type: "image_url", image_url: { url: "https://example.com/before.jpg" } },
        { type: "image_url", image_url: { url: "https://example.com/after.jpg" } }
      ]
    }
  ]
});

Video Understanding

Video must be a public URL, and the model must be glm-5v-turbo / qwen3.5-plus / kimi-k2.6:

const model = ai.createModel("cloudbase");

const res = await model.generateText({
  model: "kimi-k2.6",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Summarize this video in 100 words." },
        {
          type: "video_url",
          video_url: { url: "https://example.com/clip.mp4" }
        }
      ]
    }
  ]
});

console.log(res.text);

Note: Video processing can take a while; combine with Streaming to avoid request timeouts.

File Understanding (PDF / Text)

Only glm-5v-turbo supports it, and only via URL:

const res = await model.generateText({
  model: "glm-5v-turbo",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Summarize the key terms and list the risks." },
        {
          type: "file",
          file: { file_url: "https://example.com/contract.pdf" }
        }
      ]
    }
  ]
});

Warning: glm-5v-turbo does not accept image, video, and file inputs in the same request. Send each input type in a separate call.
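
One way to respect this limit is to group the blocks by type and issue one request per group. The sketch below assumes the limit means at most one media type per request; `splitByMediaKind` is a hypothetical helper, not an SDK function.

```javascript
// Hypothetical helper: group a mixed content array into one content array per
// media type, repeating the text blocks at the start of each group.
function splitByMediaKind(content) {
  const textBlocks = content.filter((b) => b.type === "text");
  const groups = new Map(); // block type -> media blocks of that type, in first-seen order

  for (const block of content) {
    if (block.type === "text") continue;
    if (!groups.has(block.type)) groups.set(block.type, []);
    groups.get(block.type).push(block);
  }
  // No media at all: the original content is already a valid single request.
  if (groups.size === 0) return [content];
  return [...groups.values()].map((media) => [...textBlocks, ...media]);
}

const mixed = [
  { type: "text", text: "Summarize what you see." },
  { type: "image_url", image_url: { url: "https://example.com/scan-1.jpg" } },
  { type: "file", file: { file_url: "https://example.com/contract.pdf" } },
  { type: "image_url", image_url: { url: "https://example.com/scan-2.jpg" } },
];
const requests = splitByMediaKind(mixed);
console.log(requests.length); // one content array per media type
```

Each entry in `requests` becomes the `content` of its own call. The model answers each call independently, so any cross-referencing between the image answer and the file answer has to happen in a follow-up text-only turn.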

Combining with Streaming

Multimodal requests are often slow; streaming is recommended:

const res = await model.streamText({
  model: "glm-5v-turbo",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Describe the composition and color of this photo." },
        { type: "image_url", image_url: { url: "https://example.com/photo.jpg" } }
      ]
    }
  ]
});

for await (const text of res.textStream) {
  process.stdout.write(text);
}

See Streaming for more.
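
When the full answer is also needed after streaming (for example, to store it in conversation history), the chunks can be accumulated while they are displayed. This is a sketch under the assumption that `textStream` is an async iterable of strings, as the loop above suggests; `collectStream` and `fakeStream` are hypothetical names, and `fakeStream` stands in for a real model stream.

```javascript
// Hypothetical helper: forward each chunk to a callback and return the full text.
// Works with any async iterable of strings.
async function collectStream(textStream, onChunk = () => {}) {
  let full = "";
  for await (const chunk of textStream) {
    onChunk(chunk);
    full += chunk;
  }
  return full;
}

// Stand-in for a real stream so the sketch runs without the SDK.
async function* fakeStream() {
  yield "A warm, ";
  yield "centered composition.";
}

collectStream(fakeStream(), (chunk) => process.stdout.write(chunk)).then((full) => {
  console.log(`\n[${full.length} characters received]`);
});
```

With a real request, `fakeStream()` would be replaced by `res.textStream`.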

Combining with Deep Thinking

Some models support deep thinking on top of multimodal input—useful for chart analysis, "look-at-image-and-solve-math" tasks, etc.

const res = await model.generateText({
  model: "qwen3.5-plus",
  reasoning_effort: "high",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Which expression matches the curve in the chart? Show the derivation." },
        { type: "image_url", image_url: { url: "https://example.com/chart.png" } }
      ]
    }
  ]
});

const raw = res.rawResponses[0].choices[0].message;
console.log("Thinking:", raw.reasoning_content);
console.log("Answer:", res.text);

See Deep Thinking for reasoning_effort semantics and model differences.
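
Since not every model returns reasoning text, reading `rawResponses[0].choices[0].message` directly can throw when part of that path is missing. A defensive accessor is sketched below; `extractReasoning` is a hypothetical helper, and the two result objects are mocks shaped like the example above, not real model output.

```javascript
// Hypothetical helper: pull reasoning text out of a result, returning null when
// rawResponses, choices, or reasoning_content is missing.
function extractReasoning(res) {
  const message = res?.rawResponses?.[0]?.choices?.[0]?.message;
  return message?.reasoning_content ?? null;
}

// Mocked result objects (not real model output).
const withThinking = {
  text: "y = 2x + 1",
  rawResponses: [{ choices: [{ message: { reasoning_content: "Slope is 2..." } }] }],
};
const withoutThinking = { text: "y = 2x + 1", rawResponses: [] };

console.log(extractReasoning(withThinking));
console.log(extractReasoning(withoutThinking));
```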

Multimodal in Multi-turn Conversations

Multimodal blocks work in conversation history, but watch out for:

  • URL longevity: signed/temporary URLs may expire, breaking the history. Prefer Base64 or permanent object-storage links.
  • Token cost: an image is encoded into hundreds to thousands of tokens. Pair with Context Management to trim history.
  • Role restriction: only user messages may use the array content form. assistant replies remain plain text.

const messages = [
  {
    role: "user",
    content: [
      { type: "text", text: "What flower is this?" },
      { type: "image_url", image_url: { url: "https://example.com/flower.jpg" } }
    ]
  },
  { role: "assistant", content: "It's a sunflower." },
  { role: "user", content: "When does it usually bloom?" }
];

const res = await model.generateText({ model: "glm-5v-turbo", messages });
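
To limit the token cost of long conversations, older media blocks can be replaced with short text placeholders before the history is resent. This is one possible trimming strategy, not a documented feature; `stripOldMedia` is a hypothetical helper.

```javascript
// Hypothetical helper: keep media blocks only in the last `keepLast` messages
// that contain any; in older messages, swap each media block for a text placeholder.
// The input array and its messages are not mutated.
function stripOldMedia(messages, keepLast = 1) {
  let remaining = keepLast;
  const result = [];
  // Walk backwards so the most recent media-bearing messages are kept intact.
  for (let i = messages.length - 1; i >= 0; i--) {
    const msg = messages[i];
    const hasMedia = Array.isArray(msg.content) && msg.content.some((b) => b.type !== "text");
    if (!hasMedia) {
      result.unshift(msg);
    } else if (remaining > 0) {
      remaining--;
      result.unshift(msg);
    } else {
      const content = msg.content.map((b) =>
        b.type === "text" ? b : { type: "text", text: `[${b.type} omitted]` }
      );
      result.unshift({ ...msg, content });
    }
  }
  return result;
}

const history = [
  {
    role: "user",
    content: [
      { type: "text", text: "What flower is this?" },
      { type: "image_url", image_url: { url: "https://example.com/flower.jpg" } },
    ],
  },
  { role: "assistant", content: "It's a sunflower." },
  {
    role: "user",
    content: [
      { type: "text", text: "And this one?" },
      { type: "image_url", image_url: { url: "https://example.com/rose.jpg" } },
    ],
  },
];
const trimmed = stripOldMedia(history, 1);
console.log(JSON.stringify(trimmed[0].content));
```

The trade-off is that the model can no longer see a stripped image, so follow-up questions about it depend on what earlier assistant replies already said.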

FAQ

Error: "model does not support image input"

The selected model is text-only. Pick a model from the supported list.

Model says "I cannot see the image" with a URL

  • Verify the URL is publicly reachable and not behind auth. Watch out for signed URLs that expire.
  • Some models cannot access domains in mainland China. Consider uploading to CloudBase Storage and using its download URL.
  • Some models (e.g. kimi-k2.7-code) only accept Base64; URLs are rejected.
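
Some of these causes can be caught locally before the request is sent. The sketch below is a heuristic pre-check only: `preflightImageUrl` is a hypothetical helper, the list of signed-URL query parameters is an illustrative guess rather than an authoritative one, and the check cannot tell whether a domain is actually reachable from the backend.

```javascript
// Query parameter names that commonly indicate a signed, expiring URL.
// Heuristic list for illustration; not exhaustive.
const SIGNED_URL_PARAMS = ["x-amz-signature", "x-amz-expires", "signature", "expires", "sign", "token"];

// Models documented on this page as Base64-only for images.
const BASE64_ONLY_MODELS = new Set(["kimi-k2.7-code"]);

// Hypothetical helper: return warnings for an image URL before it is sent.
function preflightImageUrl(model, url) {
  if (url.startsWith("data:")) return []; // Base64 data URLs need no network access

  const warnings = [];
  if (BASE64_ONLY_MODELS.has(model)) {
    warnings.push(`${model} only accepts Base64 images`);
  }
  let parsed;
  try {
    parsed = new URL(url);
  } catch {
    return [...warnings, "URL cannot be parsed"];
  }
  if (!["http:", "https:"].includes(parsed.protocol)) {
    warnings.push(`Unsupported protocol ${parsed.protocol}`);
  }
  if (["localhost", "127.0.0.1"].includes(parsed.hostname)) {
    warnings.push("URL points at the local machine and is not publicly reachable");
  }
  const names = [...parsed.searchParams.keys()].map((k) => k.toLowerCase());
  if (names.some((n) => SIGNED_URL_PARAMS.includes(n))) {
    warnings.push("URL looks signed and may expire");
  }
  return warnings;
}
```

An empty result does not guarantee the model can fetch the image; it only rules out the mistakes that are visible from the URL itself.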

Video recognition quality is poor

  • Shorter and clearer videos work better. Aim for under 30 seconds and around 720p.
  • For complex videos, extract key frames into images and send them as multi-image input instead.
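
For the key-frame approach, the number of images per request is worth bounding, since each image adds tokens. The sketch below picks evenly spaced frames from a list that is assumed to be already extracted and uploaded; `sampleEvenly` is a hypothetical helper and the frame URLs are placeholders.

```javascript
// Hypothetical helper: pick up to `count` evenly spaced items from a list,
// always including the first and last item when count >= 2.
function sampleEvenly(items, count) {
  if (count <= 0) return [];
  if (items.length <= count) return [...items];
  if (count === 1) return [items[0]];
  const step = (items.length - 1) / (count - 1);
  return Array.from({ length: count }, (_, i) => items[Math.round(i * step)]);
}

// Placeholder frame URLs; frame extraction and upload happen elsewhere.
const frames = Array.from({ length: 10 }, (_, i) => `https://example.com/frames/${i}.jpg`);
const content = [
  { type: "text", text: "These are frames from one video, in order. Summarize what happens." },
  ...sampleEvenly(frames, 4).map((url) => ({ type: "image_url", image_url: { url } })),
];
```

Telling the model that the images are ordered frames of a single video, as the prompt here does, gives it the temporal context that separate images otherwise lack.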

How do I control image fidelity (detail parameter)?

Some models support image_url.detail (low / high / auto) to trade off cost versus fidelity. low is cheaper but less accurate; high is the opposite. Whether it takes effect depends on the model.
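
As a sketch of where the parameter goes, `detail` sits next to `url` inside `image_url`. The helper name `imageBlockWithDetail` is hypothetical, and a model that does not support the field may ignore it.

```javascript
// Allowed values for image_url.detail, as listed above.
const DETAIL_LEVELS = ["low", "high", "auto"];

// Hypothetical constructor: image block with an optional detail hint.
function imageBlockWithDetail(url, detail) {
  if (detail === undefined) return { type: "image_url", image_url: { url } };
  if (!DETAIL_LEVELS.includes(detail)) {
    throw new Error(`detail must be one of ${DETAIL_LEVELS.join(", ")}`);
  }
  return { type: "image_url", image_url: { url, detail } };
}

// A cheaper low-detail block and a high-fidelity block for the same image.
const quick = imageBlockWithDetail("https://example.com/receipt.jpg", "low");
const careful = imageBlockWithDetail("https://example.com/receipt.jpg", "high");
console.log(quick, careful);
```

A low-detail pass may be enough for coarse questions such as "is this a receipt?", while reading small print usually calls for high detail.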