Multimodal Image Understanding with DeepSeek V4-Pro in CloudBase AI

In one sentence: A Next.js Route Handler receives user-uploaded images, converts them to base64 Data URLs, and calls @cloudbase/node-sdk's app.ai().createModel('cloudbase').generateText with model: 'deepseek-v4-pro' — messages[].content follows the OpenAI-compatible multimodal array structure (type: 'image_url' + type: 'text') — returning AI descriptions, OCR results, and content analysis in a single call.

Estimated time: 20 minutes | Difficulty: Advanced

Applicable Scenarios

User uploads a product photo; AI generates e-commerce detail page copy or auto-produces alt text
User uploads an invoice, business card, or form photo; AI extracts fields and fills them into a structured form
User uploads a screenshot or chart; AI extracts data points or interprets chart trends
Basic content moderation (checking for prohibited elements in an image) as a coarse pre-filter before a dedicated content safety service
Adding an "upload a photo and ask anything" entry point to a Mini Program or web app without introducing a third-party multimodal SDK

Not applicable:

Real-time video stream understanding (dozens of frames per second) — use a dedicated video API or self-hosted VLM; generateText is a single call and is unsuitable for the latency and cost requirements of video
Bulk batch processing (millions of images at once) — use add-cos-upload-from-cloudbase-app to ingest into COS first, then run Cloud Functions in batch with scheduled tasks; this recipe covers the synchronous path where users upload and receive results immediately
Multi-turn image conversations (user sends an image, then sends follow-up questions, and the AI must remember the previous image) — the frontend must accumulate the full messages array and include historical images in each request; this recipe covers only single calls; for multi-turn, extend using the multi-turn template from Step 5 of add-ai-nextjs
General text-only conversation (no image, just a question) — switch to deepseek-v4-flash for better cost-efficiency; see add-ai-nextjs

Prerequisites

Item	Requirement
Next.js	14+ (App Router, stable Route Handler)
`@cloudbase/node-sdk`	`3.16.0` or higher (required by the AI module; multimodal uses the same interface)
Node.js	`18.17+` (required by Next.js 14)
Route Handler runtime	Must be `nodejs` — `edge` is not supported (the SDK depends on Node APIs)
CloudBase environment	Provisioned, with AI+ enabled in the Console, and `deepseek-v4-pro` visible in Model Management
Single image size	Recommended ≤ 4 MB per image; approximately 5.3 MB after base64 encoding — compress large images on the frontend first
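
The 4 MB and 5.3 MB figures in the last row follow from base64 emitting 4 output characters for every 3 input bytes. A minimal sketch of that arithmetic, with illustrative function names that are not part of any SDK:

```typescript
// Base64 emits 4 output characters for every 3 input bytes, padding the last group.
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

// Length of a Data URL: the "data:<mime>;base64," prefix plus the encoded payload.
function dataUrlLength(byteLength: number, mime = 'image/jpeg'): number {
  return `data:${mime};base64,`.length + base64Length(byteLength);
}

const MB = 1024 * 1024;
const encodedMb = base64Length(4 * MB) / MB;
console.log(encodedMb.toFixed(2)); // "5.33"
```

The Data URL prefix adds only a few dozen characters, so the encoded size is dominated by the 4/3 expansion of the image bytes.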

The server side must use @cloudbase/node-sdk. Do not use the web-side pattern of @cloudbase/js-sdk with signInAnonymously(): anonymous login is aggressively rate-limited and is not suitable for production. Production traffic must go through a backend proxy that uses the Node SDK with environment-level credentials; see add-ai-nextjs.

Step 1: Confirm deepseek-v4-pro is Available in the Console

Open the CloudBase Console → select your environment → AI+ → Model Management
Confirm that deepseek-v4-pro is online and available in the model list
If you only see deepseek-v4-flash and not deepseek-v4-pro, the model is not yet enabled for this environment — click Quick Setup → check DeepSeek V4-Pro; it takes effect within seconds

Why use pro instead of flash:

deepseek-v4-pro: supports image input (multimodal); only the pro series accepts type: 'image_url' entries inside a messages.content array
deepseek-v4-flash: accepts only plain-text messages; images passed to it are either silently ignored or rejected with an error

For the complete model matrix, refer to the current Model Access documentation; all examples in this recipe use deepseek-v4-pro.
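
One way to apply this split in code is to choose the model ID per request, based on whether any image was uploaded. This is a sketch only: the helper name pickModel is hypothetical, and the two model IDs are the ones named in this recipe.

```typescript
type ModelId = 'deepseek-v4-pro' | 'deepseek-v4-flash';

// Route requests that carry images to the multimodal model,
// and text-only requests to the more cost-efficient one.
function pickModel(imageCount: number): ModelId {
  return imageCount > 0 ? 'deepseek-v4-pro' : 'deepseek-v4-flash';
}
```

The Route Handler in this recipe always receives at least one image, so it hard-codes deepseek-v4-pro; a helper like this only matters if one endpoint serves both kinds of request.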

Step 2: Install the SDK and Configure Environment Variables

npm install @cloudbase/node-sdk

.env.local:

CLOUDBASE_ENV=your-env-id
TENCENTCLOUD_SECRETID=your-secret-id
TENCENTCLOUD_SECRETKEY=your-secret-key

None of these have a NEXT_PUBLIC_ prefix — the SDK call runs on the server side, and the env ID and credentials must not be exposed in the client bundle.

SECRETID/SECRETKEY are generated at Tencent Cloud Console → API Keys. For production, use a sub-account key with a CAM policy scoped to the current CloudBase environment. If your Next.js app is deployed to CloudBase Cloud Run or Cloud Functions, these two variables are auto-injected and can be omitted.
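
A missing variable otherwise surfaces only at the first request, so it can help to check the configuration once at startup. The sketch below assumes a hypothetical helper named missingEnv; it is not part of @cloudbase/node-sdk.

```typescript
// Returns the names of required variables that are missing or blank.
function missingEnv(
  env: Record<string, string | undefined>,
  required: string[],
): string[] {
  return required.filter((key) => !env[key]?.trim());
}

// On Vercel or a self-managed server all three are needed. On CloudBase Cloud Run
// or Cloud Functions the two credential variables are injected by the platform,
// so only CLOUDBASE_ENV would need checking there.
const REQUIRED_VARS = ['CLOUDBASE_ENV', 'TENCENTCLOUD_SECRETID', 'TENCENTCLOUD_SECRETKEY'];
```

A typical use would be to call missingEnv(process.env, REQUIRED_VARS) when the server starts and log or throw if the returned array is not empty.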

Step 3: Write the Route Handler — FormData → base64 → generateText

Create app/api/vision/route.ts:

import tcb from '@cloudbase/node-sdk';

export const runtime = 'nodejs'; // Required: edge is not supported

let app: ReturnType<typeof tcb.init> | null = null;

function getAi() {
  if (!app) {
    // timeout 60s: multimodal inference is slower than plain text; the default 15s will time out
    app = tcb.init({ env: process.env.CLOUDBASE_ENV!, timeout: 60000 });
  }
  return app.ai();
}

// File -> data:image/xxx;base64,xxxxxx
async function fileToDataUrl(file: File): Promise<string> {
  const buf = Buffer.from(await file.arrayBuffer());
  const mime = file.type || 'image/jpeg';
  return `data:${mime};base64,${buf.toString('base64')}`;
}

export async function POST(req: Request) {
  const form = await req.formData();
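  // form.get() returns string | File | null, so use the default unless a non-empty string was sent
  const rawPrompt = form.get('prompt');
  const prompt =
    typeof rawPrompt === 'string' && rawPrompt.trim() ? rawPrompt : 'Describe the contents of this image';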
  const files = form.getAll('images').filter((v): v is File => v instanceof File);

  if (files.length === 0) {
    return Response.json({ error: 'no image uploaded' }, { status: 400 });
  }

  // Convert all images to data URLs, preserving the order of upload
  const imageContents = await Promise.all(
    files.map(async (file) => ({
      type: 'image_url' as const,
      image_url: { url: await fileToDataUrl(file) },
    })),
  );

  const ai = getAi();
  const model = ai.createModel('cloudbase');

  const result = await model.generateText({
    model: 'deepseek-v4-pro',
    messages: [
      {
        role: 'user',
        content: [
          ...imageContents,
          { type: 'text', text: prompt },
        ],
      },
    ],
  });

  return Response.json({ text: result.text });
}

Key points:

app is cached as a module-level variable — every request enters the POST function, but tcb.init() is only called once
The server SDK carries an environment-level identity — signInAnonymously() is not needed
content is an array, and order matters — images before text is the officially recommended order; the model treats images as context and text as the question about those images; reversing the order can work but occasionally causes the model to treat the text as the primary task and ignore the images as attachments
For multiple images, add more { type: 'image_url', ... } objects to imageContents — maximum 4 images (beyond that, the pro series context window overflows)
Images are passed as base64 Data URL strings, not as COS / OSS public URLs — public URLs can be blocked at certain regional egress points; inline Data URLs are the most reliable approach; the trade-off is a larger request body, so for production it is recommended to compress images to under 1 MB on the frontend
Use generateText rather than streamText: image understanding typically outputs 50–300 characters in one shot, making streaming unnecessary; getting the complete result.text directly is simpler. For streaming, replace generateText with streamText and consume result.textStream, following the streaming pattern in Step 3 of add-ai-nextjs
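
The handler above only rejects an empty upload. The limits mentioned in this recipe (at most 4 images, roughly 4 MB each) could also be enforced on the server before any base64 conversion happens. A minimal sketch, assuming a hypothetical validateUploads helper and treating both limits as tunable constants:

```typescript
interface UploadLike {
  type: string; // MIME type reported by the browser, e.g. "image/png"
  size: number; // size in bytes
}

const MAX_IMAGES = 4;              // image-count limit stated in this recipe
const MAX_BYTES = 4 * 1024 * 1024; // recommended 4 MB ceiling per image

// Returns a message for the first violated rule, or null when the upload is acceptable.
function validateUploads(files: UploadLike[]): string | null {
  if (files.length === 0) return 'no image uploaded';
  if (files.length > MAX_IMAGES) return `at most ${MAX_IMAGES} images per request`;
  for (let i = 0; i < files.length; i++) {
    if (!files[i].type.startsWith('image/')) return `file ${i + 1} is not an image`;
    if (files[i].size > MAX_BYTES) return `file ${i + 1} exceeds 4 MB`;
  }
  return null;
}
```

File objects satisfy UploadLike structurally, so the array returned by form.getAll('images') after filtering can be passed in directly, and a non-null result can be returned to the client as a 400 response.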

Step 4: Frontend Client Component — Upload Images and Display Results

Create app/vision/page.tsx:

'use client';

import { useState } from 'react';

export default function Vision() {
  const [files, setFiles] = useState<File[]>([]);
  const [prompt, setPrompt] = useState('Describe this image');
  const [result, setResult] = useState('');
  const [loading, setLoading] = useState(false);

  async function send() {
    if (files.length === 0 || loading) return;

    setLoading(true);
    setResult('');

    try {
      const form = new FormData();
      form.append('prompt', prompt);
      // All files share the same key 'images'; the Route Handler uses form.getAll('images') to receive them as an array
      files.forEach((f) => form.append('images', f));

      const res = await fetch('/api/vision', {
        method: 'POST',
        body: form,
      });

      if (!res.ok) {
        const err = await res.json().catch(() => ({ error: `HTTP ${res.status}` }));
        throw new Error(err.error || `HTTP ${res.status}`);
      }

      const data = await res.json();
      setResult(data.text);
    } catch (err) {
      setResult(`[Error] ${err instanceof Error ? err.message : String(err)}`);
    } finally {
      setLoading(false);
    }
  }

  return (
    <div style={{ maxWidth: 720, margin: '40px auto', padding: 16 }}>
      <input
        type="file"
        accept="image/*"
        multiple
        onChange={(e) => setFiles(Array.from(e.target.files ?? []))}
        disabled={loading}
      />

      <div style={{ margin: '12px 0' }}>
        {files.map((f) => (
          <img
            key={f.name}
            src={URL.createObjectURL(f)}
            alt={f.name}
            style={{ width: 120, height: 120, objectFit: 'cover', marginRight: 8 }}
          />
        ))}
      </div>

      <textarea
        rows={2}
        style={{ width: '100%', padding: 8 }}
        value={prompt}
        onChange={(e) => setPrompt(e.target.value)}
        placeholder="Ask something about this image"
        disabled={loading}
      />

      <button
        onClick={send}
        disabled={loading || files.length === 0}
        style={{ marginTop: 8 }}
      >
        {loading ? 'Analyzing...' : 'Send'}
      </button>

      {result && (
        <div
          style={{
            marginTop: 16,
            padding: 12,
            background: '#f5f5f5',
            whiteSpace: 'pre-wrap',
          }}
        >
          {result}
        </div>
      )}
    </div>
  );
}

Details worth noting:

The multiple attribute allows selecting multiple files at once; the Route Handler receives them as a File array
Multiple files all use the same form key 'images' appended repeatedly; the backend uses form.getAll('images') to collect them into an array. Do not use indexed names like images[0] / images[1] — FormData does not have that semantics
URL.createObjectURL(f) produces a blob URL that should ideally be revoked with revokeObjectURL on unmount; this example omits that for brevity — uploading many images over an extended session will gradually consume memory
Whether to compress on the frontend before uploading: the browser can draw the image onto a <canvas> and call canvas.toBlob(callback, 'image/jpeg', 0.8) to compress large images to under 1 MB, which significantly reduces the Data URL size and token consumption. Skipping compression still works, but a 4 MB raw image becomes a payload of roughly 5.3 MB after base64 encoding, which makes the API request body large
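
The repeated-key behaviour of FormData can be checked in isolation. The sketch below uses a hypothetical buildVisionForm helper and placeholder Blob contents; it runs in a browser or in Node.js 18+, where FormData and Blob are globals.

```typescript
// Builds the multipart body the Route Handler expects: one 'prompt' field and
// every image appended under the same 'images' key.
function buildVisionForm(prompt: string, images: Blob[]): FormData {
  const form = new FormData();
  form.append('prompt', prompt);
  images.forEach((image, i) => form.append('images', image, `image-${i + 1}`));
  return form;
}

const demoForm = buildVisionForm('Describe this image', [
  new Blob(['a'], { type: 'image/png' }),
  new Blob(['b'], { type: 'image/jpeg' }),
]);
console.log(demoForm.getAll('images').length);    // 2
console.log(demoForm.getAll('images[0]').length); // 0
```

The second lookup returns nothing because 'images[0]' is just a different key name; FormData does not interpret bracket suffixes as array indices.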

Step 5: Prompt Patterns for Three Common Scenarios

The same Route Handler handles all three — only the prompt changes:

Scenario A — Single Image Description / Copywriting

You are an e-commerce product copywriter. Examine this product image and write a concise description
of no more than 50 words in English, highlighting the product category, color, and key selling points.
Output the copy directly without any introductory remarks.

Scenario B — Single Image OCR + Structured Extraction

This is an invoice. Extract the following fields and return them as JSON:
`{ "vendor": "", "amount": 0, "date": "YYYY-MM-DD", "items": [] }`
Return an empty string or 0 for any field not found. Return only JSON — no markdown code fences.

Remember to JSON.parse(result.text) after receiving the response in the Route Handler. The model occasionally wraps the output in ```json ... ``` fences — stripping the code block markers before parsing is more reliable in production.
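
One possible way to make that parsing tolerant of a Markdown code fence is shown below. The helper names stripCodeFence and parseModelJson are hypothetical, and the sketch handles only a single fence wrapping the whole reply.

```typescript
// Built with repeat() so this snippet contains no literal fence marker.
const FENCE = '`'.repeat(3);

// Removes one leading fence (with an optional language label) and one trailing fence.
function stripCodeFence(raw: string): string {
  let text = raw.trim();
  if (text.startsWith(FENCE)) {
    // Drop the opening fence and a label such as "json" up to the first newline.
    text = text.slice(FENCE.length).replace(/^[a-z]*\r?\n/i, '');
  }
  if (text.endsWith(FENCE)) {
    text = text.slice(0, -FENCE.length);
  }
  return text.trim();
}

// Returns the parsed value, or null when the reply is not valid JSON.
function parseModelJson<T>(raw: string): T | null {
  try {
    return JSON.parse(stripCodeFence(raw)) as T;
  } catch {
    return null;
  }
}
```

Returning null instead of throwing lets the Route Handler decide whether to retry, fall back to the raw text, or report an extraction failure to the client.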

Scenario C — Multi-Image Comparison

Upload 2–4 product images, with a prompt such as:

These are N product images of the same category. Provide a brief comparison across three dimensions:
color, style, and suitable use case. Conclude with a recommendation on which image would work best
as the main product photo.

generateText processes multiple images in messages order; the model typically refers to them as "Image 1", "Image 2", and so on. To enforce explicit labels, include identifiers in the prompt such as "the first image is Model A, the second image is Model B".
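
Explicit labels can be generated from the upload order rather than typed by hand. A small sketch, assuming a hypothetical labelImages helper whose label list matches the order of the image_url entries:

```typescript
// Prefixes the question with one sentence per image, in upload order.
function labelImages(labels: string[], question: string): string {
  const intro = labels
    .map((label, i) => `Image ${i + 1} is ${label}.`)
    .join(' ');
  return intro ? `${intro} ${question}` : question;
}

console.log(labelImages(['Model A', 'Model B'], 'Which one suits outdoor use?'));
// Image 1 is Model A. Image 2 is Model B. Which one suits outdoor use?
```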

Step 6: Verification

Start the development server: npm run dev
Open http://localhost:3000/vision in a browser
Select a local image, leave the default prompt, and click Send
After a few seconds, the AI description should appear in the grey box below
In browser DevTools Network panel, the /api/vision request payload should be multipart/form-data and the response should be { "text": "..." }
The server terminal should not show cloudbase.init is not a function / model not found / image format not supported
CloudBase Console → AI+ → Call Records should show the call, with a token count noticeably larger than a text-only call (images consume many input tokens)

If the description in item 4 takes more than 30 seconds to appear, the image is likely too large. Check the request body size in the Network panel; if it exceeds 6 MB, compress the image before uploading.

Common Errors

Error / Symptom	Cause	Fix
`model not found` / `model 'deepseek-v4-pro' is not supported`	`deepseek-v4-pro` is not enabled in the current environment, or the name is misspelled (e.g. `deepseek-vl-v4-pro` / `deepseek-v4`)	Check the exact model ID in Console → AI+ → Model Management and copy it verbatim; if not yet enabled, click Quick Setup and check it
After upload: `image format not supported` / `invalid image_url`	The `image_url` value is a `blob:` URL or a frontend File object `path`, not a Data URL	Must pass the `data:image/jpeg;base64,xxx` format — see `fileToDataUrl` in the Route Handler; blob URLs are browser-local and inaccessible on the server
Multiple images uploaded but the model only answers about the first one	The `content` array order is reversed — text appears before images, causing the model to treat text as the primary instruction and images as attachments; or the number of images exceeds the model limit	Reorder the content array so all `type: 'image_url'` entries come before `type: 'text'`; keep the image count to 4 or fewer
`cannot find module '@cloudbase/node-sdk'` or `XMLHttpRequest is not defined` (after deployment)	The Route Handler uses `export const runtime = 'edge'`; the Edge Runtime lacks the full Node.js API	Change to `export const runtime = 'nodejs'`; the SDK must run on the Node Runtime
`secretId or secretKey not found` / `getCredential failed` (after deployment)	Server-side credentials were not injected. Vercel and self-managed servers require explicit configuration of `TENCENTCLOUD_SECRETID` + `TENCENTCLOUD_SECRETKEY` (Cloud Run / Cloud Functions inject them automatically)	Add both variables in the deployment platform's environment settings; values come from Tencent Cloud Console → API Keys. Use a sub-account key scoped via CAM to the current CloudBase environment for best security
Request hangs until the SDK timeout and then fails	The Node SDK default `timeout: 15000` is too short; combined with large images and slow multimodal inference, it is regularly exceeded; or the image is too large and uploading is slow	Use `tcb.init({ env, timeout: 60000 })` to explicitly raise the timeout to 60s or more (already done in the example above); compress large images to under 1 MB on the frontend
In OCR scenario, AI output is wrapped in ```json ... ``` fences, causing `JSON.parse` to fail	The model adds markdown formatting on its own	Strip the fences before parsing: `text.replace(/^```json\s*
Single image is clear but fine-detail recognition fails (e.g., OCR reads 8 as 0)	Multimodal image understanding has limited fine-detail accuracy and cannot match dedicated OCR or document AI	For critical documents (invoices, receipts, ID cards), use Tencent Cloud OCR; this recipe is suitable only for coarse filtering and descriptive tasks
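
For the content-order row above, the reordering can be done mechanically before the request is sent. The sketch below assumes a hypothetical imagesFirst helper and a ContentPart type that mirrors the two entry shapes used in this recipe.

```typescript
type ContentPart =
  | { type: 'image_url'; image_url: { url: string } }
  | { type: 'text'; text: string };

// Moves every image part ahead of every text part,
// keeping the relative order within each group.
function imagesFirst(parts: ContentPart[]): ContentPart[] {
  const images = parts.filter((p) => p.type === 'image_url');
  const texts = parts.filter((p) => p.type === 'text');
  return [...images, ...texts];
}
```

Because filter preserves order, the images still reach the model in upload order, which keeps references such as "Image 1" and "Image 2" stable.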

For the complete error code reference, see https://docs.cloudbase.net/error-code/.

Billing Notes

Multimodal image input consumes a large number of input tokens: a 512×512 image consumes approximately 800–1,500 tokens (subject to the model's current image tiling rules), which is significantly more expensive than an equivalent text prompt. When estimating costs for batch scenarios, calculate the image input cost and the output text cost separately
deepseek-v4-pro is priced higher than deepseek-v4-flash — use pro only for image scenarios and switch back to flash for text-only conversations. See Model Access
The Route Handler is the backend proxy layer and credentials never reach the browser, but /api/vision is still exposed to the public internet. Before going live, verify the caller in the POST handler: validate your own login session (see add-auth-web-with-cloudbase-sdk), apply per-UID / per-IP rate limiting, or use CloudBase Security Controls for domain whitelisting to prevent API abuse

Related Documentation

add-ai-nextjs — the text-only conversation version (streamText + deepseek-v4-flash); this recipe is its multimodal extension
add-ai-wechat-miniprogram — the same CloudBase AI capability in a Mini Program (wx.cloud.extend.AI); Mini Program image upload uses wx.chooseMedia + wx.getFileSystemManager().readFile to convert to base64; the subsequent messages structure is identical to this recipe
add-cos-upload-from-cloudbase-app — the alternative approach for bulk image ingestion into COS followed by Cloud Function batch multimodal calls, suitable for asynchronous "store first, analyze later" pipelines
connect-tavily-search-cloud-function — combining image understanding with web search (the AI analyzes an image and then queries real-time web pages); reference its search call and splice the results into the messages array from this recipe
CloudBase AI Toolkit — integration paths for AI IDEs such as Cursor, Windsurf, and CodeBuddy
Model Access — complete model matrix including capabilities per model (multimodal / Function Calling / JSON mode)
SDK API Reference — complete signatures for createModel / generateText / streamText and the multimodal schema for messages.content
