Multimodal Understanding

Multimodal understanding lets a model accept inputs other than text—images, videos, files—within a single conversation and answer questions about them. Common scenarios include image captioning, screenshot Q&A, video summarization, document extraction, and OCR.

Difference from image generation

Multimodal understanding (this page): the input contains images / videos / files; the model returns text.
Image generation: the input is a text prompt; the model returns an image. See Image Generation.

Supported Models

Multimodal capability varies considerably across models:

| Model | Image input | Video input | File input | Notes |
| --- | --- | --- | --- | --- |
| `glm-5v-turbo` | ✅ URL / Base64 | ✅ URL | ✅ URL (PDF / TXT / DOC) | Image, video, and file CANNOT be mixed in one request |
| `qwen3.5-plus` | ✅ URL / Base64 | ✅ URL | — | Thinking is on by default; pair with `enable_thinking` |
| `kimi-k2.6` | ✅ URL / Base64 | ✅ URL | — | The only model in the Kimi family that supports video |
| `kimi-k2.5` | ✅ URL / Base64 | — | — | No video support |
| `kimi-k2.7-code` | ✅ Base64 only | — | — | Coding-specific; URL form is not supported |

Note

Only commonly used multimodal models are listed here. See Overview for the full model list.
Text-only models (e.g. deepseek-v4-flash, hy3) either silently ignore image/video inputs or reject the request.

Key Differences

Image input: every model except kimi-k2.7-code accepts both URL and Base64. kimi-k2.7-code only accepts Base64.
Video: only glm-5v-turbo, qwen3.5-plus, and kimi-k2.6 accept video, and only via a publicly reachable URL (Base64 not supported).
File: only glm-5v-turbo accepts files (PDF / TXT / DOC), URL only.
Mixing limit: glm-5v-turbo does not allow image, video, and file to be sent together in one request—split into separate calls.
Image size and reachability: when using Base64, keep each image under ~5 MB. When using a URL, it must be a public direct link that our backend can fetch.
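The rules above can be checked on the client before a request is sent. The following is a minimal sketch, not part of any SDK: `CAPABILITIES`, `describeBlock`, and `checkContent` are hypothetical names, and the table is transcribed by hand from this page, so it needs updating whenever model support changes.

```javascript
// Capability table transcribed from "Supported Models" on this page (assumption: kept in sync manually).
// Each entry lists the accepted transports per input kind; an empty list means unsupported.
const CAPABILITIES = {
  "glm-5v-turbo": { image: ["url", "base64"], video: ["url"], file: ["url"] },
  "qwen3.5-plus": { image: ["url", "base64"], video: ["url"], file: [] },
  "kimi-k2.6": { image: ["url", "base64"], video: ["url"], file: [] },
  "kimi-k2.5": { image: ["url", "base64"], video: [], file: [] },
  "kimi-k2.7-code": { image: ["base64"], video: [], file: [] }
};

// Map a content block to its input kind and the URL string it carries.
function describeBlock(block) {
  switch (block.type) {
    case "image_url":
      return { kind: "image", url: block.image_url.url };
    case "video_url":
      return { kind: "video", url: block.video_url.url };
    case "file":
      return { kind: "file", url: block.file.file_url };
    default:
      return null; // text blocks need no check
  }
}

// Hypothetical helper: returns a list of problems, or an empty list if none were found.
function checkContent(modelName, content) {
  const caps = CAPABILITIES[modelName];
  if (!caps) return [`${modelName} is not in the local capability table`];

  const problems = [];
  for (const block of content) {
    const info = describeBlock(block);
    if (!info) continue;
    // A data: URL means the payload is inlined as Base64; anything else is treated as a remote URL.
    const transport = info.url.startsWith("data:") ? "base64" : "url";
    if (!caps[info.kind].includes(transport)) {
      problems.push(`${modelName} does not accept ${info.kind} input via ${transport}`);
    }
  }
  return problems;
}
```

Calling `checkContent("kimi-k2.7-code", content)` with an `https://` image block returns one problem string. The server remains the source of truth; this only catches obvious mismatches early.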

Message Format

Multimodal requests follow the OpenAI Chat Completions standard: replace messages[n].content with an array of content blocks—text, image, video, or file.

type ContentBlock =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } }
  | { type: "video_url"; video_url: { url: string } }
  | { type: "file"; file: { file_url: string } };

Text block

{ type: "text", text: "Describe this image." }

Image block

URL:

{
  type: "image_url",
  image_url: { url: "https://example.com/photo.jpg" }
}

Base64:

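const fs = require("fs");

// Read the image from disk and encode it as Base64.
const base64 = fs.readFileSync("./photo.jpg").toString("base64");

// The MIME type in the data URL should match the actual image format.
const imageBlock = {
  type: "image_url",
  image_url: { url: `data:image/jpeg;base64,${base64}` }
};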

Video block

{
  type: "video_url",
  video_url: { url: "https://example.com/clip.mp4" }
}

File block (`glm-5v-turbo` only)

{
  type: "file",
  file: { file_url: "https://example.com/contract.pdf" }
}

Usage

Image Understanding

The same request is shown below four ways, in this order: CloudBase SDK, OpenAI SDK, Mini Program, and cURL.

const model = ai.createModel("cloudbase");

const res = await model.generateText({
  model: "glm-5v-turbo",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "How many cats are there and what colors?" },
        {
          type: "image_url",
          image_url: { url: "https://example.com/cats.jpg" }
        }
      ]
    }
  ]
});

console.log(res.text);

const OpenAI = require("openai");

const client = new OpenAI({
  apiKey: "<YOUR_API_KEY>",
  baseURL: "https://<ENV_ID>.api.tcloudbasegateway.com/v1/ai/cloudbase"
});

// CommonJS has no top-level await, so the call is wrapped in an async function.
async function main() {
  const completion = await client.chat.completions.create({
    model: "glm-5v-turbo",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "How many cats are there and what colors?" },
          {
            type: "image_url",
            image_url: { url: "https://example.com/cats.jpg" }
          }
        ]
      }
    ]
  });

  // The reply text is in the first choice.
  console.log(completion.choices[0].message.content);
}

main().catch(console.error);

const model = wx.cloud.extend.AI.createModel("cloudbase");

const res = await model.generateText({
  data: {
    model: "glm-5v-turbo",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "How many cats are there and what colors?" },
          {
            type: "image_url",
            image_url: { url: "https://example.com/cats.jpg" }
          }
        ]
      }
    ]
  }
});

console.log(res.text);

curl -X POST 'https://<ENV_ID>.api.tcloudbasegateway.com/v1/ai/cloudbase/chat/completions' \
  -H 'Authorization: Bearer <YOUR_API_KEY>' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "glm-5v-turbo",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "How many cats are there and what colors?"},
          {"type": "image_url", "image_url": {"url": "https://example.com/cats.jpg"}}
        ]
      }
    ]
  }'

Sending a Local Image as Base64

Useful when the image is not on the public internet:

const fs = require("fs");
const path = require("path");

const filePath = path.resolve(__dirname, "./local.jpg");
const base64 = fs.readFileSync(filePath).toString("base64");
const dataUrl = `data:image/jpeg;base64,${base64}`;

const model = ai.createModel("cloudbase");

const res = await model.generateText({
  model: "kimi-k2.6",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Extract the text in the image." },
        { type: "image_url", image_url: { url: dataUrl } }
      ]
    }
  ]
});

console.log(res.text);
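The example above hard-codes `image/jpeg` and does not check the file size. Below is a small sketch of a helper that infers the MIME type from the file extension and enforces the ~5 MB guideline before encoding. `bufferToDataUrl`, `fileToDataUrl`, `MIME_BY_EXT`, and `MAX_IMAGE_BYTES` are illustrative names, and measuring the limit against the raw file bytes (rather than the encoded string) is an assumption.

```javascript
const fs = require("fs");
const path = require("path");

// Assumed extension-to-MIME mapping; extend it for other formats you need.
const MIME_BY_EXT = {
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".png": "image/png",
  ".webp": "image/webp",
  ".gif": "image/gif"
};

// Approximate limit taken from this page; the exact server-side threshold is not documented here.
const MAX_IMAGE_BYTES = 5 * 1024 * 1024;

// Build a data URL from an in-memory image, using the file name only to pick the MIME type.
function bufferToDataUrl(buffer, fileName) {
  const mime = MIME_BY_EXT[path.extname(fileName).toLowerCase()];
  if (!mime) {
    throw new Error(`Unrecognized image extension: ${fileName}`);
  }
  if (buffer.length > MAX_IMAGE_BYTES) {
    throw new Error(`Image is ${buffer.length} bytes; keep it under ${MAX_IMAGE_BYTES}`);
  }
  return `data:${mime};base64,${buffer.toString("base64")}`;
}

// Convenience wrapper for files on disk.
function fileToDataUrl(filePath) {
  return bufferToDataUrl(fs.readFileSync(filePath), filePath);
}
```

The result of `fileToDataUrl("./local.png")` can be passed directly as `image_url.url`. Oversized images are better resized or uploaded to storage and referenced by URL.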

Getting Base64 in mini program / Web

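Mini program: call wx.getFileSystemManager().readFile({ filePath, encoding: "base64", success }). The Base64 string arrives as res.data in the success callback, and you need to prepend the data:image/...;base64, prefix yourself.
Web: use FileReader.readAsDataURL, which yields the full data:image/...;base64,... string.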

Multiple Images in One Turn

Append several image_url blocks to the same content array:

const res = await model.generateText({
  model: "glm-5v-turbo",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Compare these two images and list 3 differences." },
        { type: "image_url", image_url: { url: "https://example.com/before.jpg" } },
        { type: "image_url", image_url: { url: "https://example.com/after.jpg" } }
      ]
    }
  ]
});
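When the number of images is not fixed, the content array can be built from a list. A minimal sketch, where `buildImageMessage` is a hypothetical helper name rather than an SDK function:

```javascript
// Build one user message with a text prompt followed by one image_url block per URL.
// Both https:// links and data: URLs work as entries, subject to the model's support.
function buildImageMessage(question, imageUrls) {
  return {
    role: "user",
    content: [
      { type: "text", text: question },
      ...imageUrls.map((url) => ({ type: "image_url", image_url: { url } }))
    ]
  };
}
```

The returned object can be placed directly in the `messages` array, for example `messages: [buildImageMessage("Compare these images.", urls)]`.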

Video Understanding

The video must be supplied as a publicly reachable URL, and the model must be one of glm-5v-turbo, qwen3.5-plus, or kimi-k2.6:

const model = ai.createModel("cloudbase");

const res = await model.generateText({
  model: "kimi-k2.6",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Summarize this video in 100 words." },
        {
          type: "video_url",
          video_url: { url: "https://example.com/clip.mp4" }
        }
      ]
    }
  ]
});

console.log(res.text);

Note

Video processing can take a while; combine with Streaming to avoid request timeouts.

File Understanding (PDF / Text)

Only glm-5v-turbo accepts file input, and only via URL:

const res = await model.generateText({
  model: "glm-5v-turbo",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Summarize the key terms and list the risks." },
        {
          type: "file",
          file: { file_url: "https://example.com/contract.pdf" }
        }
      ]
    }
  ]
});

Warning

glm-5v-turbo does not allow image, video, and file blocks in one request. Split them into separate calls.
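One way to respect this limit is to split a mixed content array into one array per media type before sending. The sketch below assumes the restriction means at most one media type per request, and that the text prompt should accompany each group. `splitByMediaType` is a hypothetical helper name.

```javascript
// Media block types in the order they are emitted.
const MEDIA_TYPES = ["image_url", "video_url", "file"];

// Split a mixed content array into one content array per media type.
// Text blocks are copied into every resulting array so each call keeps the prompt.
function splitByMediaType(content) {
  const textBlocks = content.filter((block) => block.type === "text");
  const requests = [];
  for (const mediaType of MEDIA_TYPES) {
    const media = content.filter((block) => block.type === mediaType);
    if (media.length > 0) {
      requests.push([...textBlocks, ...media]);
    }
  }
  // No media at all: return the text-only content unchanged as a single request.
  return requests.length > 0 ? requests : [textBlocks];
}
```

Each inner array can then be used as the `content` of its own request, and the individual answers combined by the caller.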

Combining with Streaming

Multimodal requests are often slow; streaming is recommended:

const res = await model.streamText({
  model: "glm-5v-turbo",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Describe the composition and color of this photo." },
        { type: "image_url", image_url: { url: "https://example.com/photo.jpg" } }
      ]
    }
  ]
});

for await (const text of res.textStream) {
  process.stdout.write(text);
}

See Streaming for more.
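If the full answer is also needed after streaming, the chunks can be accumulated while they are displayed. A minimal sketch that works with any async iterable of strings, which is how `res.textStream` is consumed in the example above. `collectText` and `fakeStream` are illustrative names; `fakeStream` only stands in for a real response so the helper can be tried without calling the API.

```javascript
// Consume an async iterable of text chunks, forwarding each chunk to a callback
// and returning the concatenated text once the stream ends.
async function collectText(textStream, onChunk = () => {}) {
  let full = "";
  for await (const chunk of textStream) {
    onChunk(chunk);
    full += chunk;
  }
  return full;
}

// Stand-in stream for local experiments; a real call would pass res.textStream instead.
async function* fakeStream() {
  yield "Hel";
  yield "lo";
}
```

With a real response this becomes `const answer = await collectText(res.textStream, (chunk) => process.stdout.write(chunk));`.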

Combining with Deep Thinking

Some models support deep thinking on top of multimodal input—useful for chart analysis, "look-at-image-and-solve-math" tasks, etc.

const res = await model.generateText({
  model: "qwen3.5-plus",
  reasoning_effort: "high",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Which expression matches the curve in the chart? Show the derivation." },
        { type: "image_url", image_url: { url: "https://example.com/chart.png" } }
      ]
    }
  ]
});

const raw = res.rawResponses[0].choices[0].message;
console.log("Thinking:", raw.reasoning_content);
console.log("Answer:", res.text);

See Deep Thinking for reasoning_effort semantics and model differences.

Multimodal in Multi-turn Conversations

Multimodal blocks work in conversation history, but watch out for:

URL longevity: signed/temporary URLs may expire, breaking the history. Prefer Base64 or permanent object-storage links.
Token cost: an image is encoded into hundreds to thousands of tokens. Pair with Context Management to trim history.
Role restriction: only user messages may use the array content form; assistant replies remain plain text.

const messages = [
  {
    role: "user",
    content: [
      { type: "text", text: "What flower is this?" },
      { type: "image_url", image_url: { url: "https://example.com/flower.jpg" } }
    ]
  },
  { role: "assistant", content: "It's a sunflower." },
  { role: "user", content: "When does it usually bloom?" }
];

const res = await model.generateText({ model: "glm-5v-turbo", messages });
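To keep token cost down in long conversations, one option is to replace media blocks in older turns with short text placeholders. This is only a sketch of that idea, not the Context Management feature itself: `stripOldMedia` is a hypothetical helper, and once a block is replaced the model can no longer see that image, so later questions rely on what earlier replies already said about it.

```javascript
// Return a copy of the history in which media blocks are replaced by text placeholders,
// except in the most recent `keepLast` messages, which are left untouched.
function stripOldMedia(messages, keepLast = 1) {
  const cutoff = messages.length - keepLast;
  return messages.map((message, index) => {
    // Recent messages and plain-string content are returned as they are.
    if (index >= cutoff || !Array.isArray(message.content)) {
      return message;
    }
    const content = message.content.map((block) =>
      block.type === "text"
        ? block
        : { type: "text", text: `[${block.type} omitted]` }
    );
    return { ...message, content };
  });
}
```

Applied to the history above with the default `keepLast`, the flower image in the first turn becomes `[image_url omitted]`, while the assistant reply and the latest question are unchanged. The input array is not mutated.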

FAQ

Error: "model does not support image input"

The selected model is text-only. Pick a model from the supported list.

Model says "I cannot see the image" with a URL

Verify the URL is publicly reachable and not behind auth. Watch out for signed URLs that expire.
Some models cannot access domains in mainland China. Consider uploading to CloudBase Storage and using its download URL.
Some models (e.g. kimi-k2.7-code) only accept Base64; URLs are rejected.

Video recognition quality is poor

Shorter and clearer videos work better. Aim for under 30 seconds and around 720p.
For complex videos, extract key frames into images and send them as multi-image input instead.

How do I control image fidelity (`detail` parameter)?

Some models support image_url.detail (low / high / auto) to trade off cost versus fidelity. low is cheaper but less accurate; high is the opposite. Whether it takes effect depends on the model.
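In the OpenAI Chat Completions format, `detail` sits next to `url` inside `image_url`. A small sketch of a builder that adds it only when requested; `imageBlock` is an illustrative helper name, and, as noted above, a given model may ignore the field.

```javascript
// Allowed values for image_url.detail, as listed above.
const DETAIL_LEVELS = ["low", "high", "auto"];

// Build an image block, optionally with a detail hint.
function imageBlock(url, detail) {
  const image_url = { url };
  if (detail !== undefined) {
    if (!DETAIL_LEVELS.includes(detail)) {
      throw new Error(`detail must be one of ${DETAIL_LEVELS.join(", ")}`);
    }
    image_url.detail = detail;
  }
  return { type: "image_url", image_url };
}
```

For example, `imageBlock("https://example.com/photo.jpg", "low")` produces a block that asks for the cheaper, lower-fidelity processing.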
