Multimodal Image Understanding with DeepSeek V4-Pro in CloudBase AI

In one sentence: A Next.js Route Handler receives user-uploaded images, converts them to base64 Data URLs, and calls @cloudbase/node-sdk's app.ai().createModel('cloudbase').generateText with model: 'deepseek-v4-pro'; messages[].content follows the OpenAI-compatible multimodal array structure (type: 'image_url' + type: 'text'), so a single call returns AI descriptions, OCR results, and content analysis.

Estimated time: 20 minutes | Difficulty: Advanced

Applicable Scenarios

  • User uploads a product photo; AI generates e-commerce detail page copy or auto-produces alt text
  • User uploads an invoice, business card, or form photo; AI extracts fields and fills them into a structured form
  • User uploads a screenshot or chart; AI extracts data points or interprets chart trends
  • Basic content moderation (checking for prohibited elements in an image) as a coarse pre-filter before a dedicated content safety service
  • Adding an "upload a photo and ask anything" entry point to a Mini Program or web app without introducing a third-party multimodal SDK

Not applicable:

  • Real-time video stream understanding (dozens of frames per second) — use a dedicated video API or self-hosted VLM; generateText is a single call and is unsuitable for the latency and cost requirements of video
  • Bulk batch processing (millions of images at once) — use add-cos-upload-from-cloudbase-app to ingest into COS first, then run Cloud Functions in batch with scheduled tasks; this recipe covers the synchronous path where users upload and receive results immediately
  • Multi-turn image conversations (user sends an image, then sends follow-up questions, and the AI must remember the previous image) — the frontend must accumulate the full messages array and include historical images in each request; this recipe covers only single calls; for multi-turn, extend using the multi-turn template from Step 5 of add-ai-nextjs
  • General text-only conversation (no image, just a question) — switch to deepseek-v4-flash for better cost-efficiency; see add-ai-nextjs

Prerequisites

| Dependency | Version |
| --- | --- |
| Next.js | 14+ (App Router, stable Route Handler) |
| @cloudbase/node-sdk | 3.16.0 or higher (required by the AI module; multimodal uses the same interface) |
| Node.js | 18.17+ (required by Next.js 14) |
| Route Handler runtime | Must be nodejs; edge is not supported (the SDK depends on Node APIs) |
| CloudBase environment | Provisioned, with AI+ enabled in the Console, and deepseek-v4-pro visible in Model Management |
| Single image size | Recommended ≤ 4 MB per image; approximately 5.3 MB after base64 encoding. Compress large images on the frontend first |
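
The 4 MB to roughly 5.3 MB figure follows from how base64 works: every 3 input bytes become 4 output characters. A minimal sketch of that arithmetic (the helper name base64Length is hypothetical, not part of any SDK):

```typescript
// Base64 encodes every 3 input bytes as 4 output characters (with padding).
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

const MIB = 1024 * 1024;
const rawBytes = 4 * MIB; // the recommended per-image ceiling from the table
const encodedBytes = base64Length(rawBytes);

console.log((encodedBytes / MIB).toFixed(2)); // "5.33"
```

The Data URL prefix (for example data:image/jpeg;base64,) adds a few dozen characters on top of this, which is negligible at these sizes.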

The server side must use @cloudbase/node-sdk. Do not use the @cloudbase/js-sdk + signInAnonymously() web-side pattern — anonymous login is aggressively rate-limited and cannot be used in production. Production must go through a Node SDK + environment-level credential backend proxy; see add-ai-nextjs.

Step 1: Confirm deepseek-v4-pro is Available in the Console

  1. Open the CloudBase Console → select your environment → AI+ → Model Management
  2. Confirm that deepseek-v4-pro is online and available in the model list
  3. If you only see deepseek-v4-flash and not deepseek-v4-pro, the model is not yet enabled for this environment — click Quick Setup → check DeepSeek V4-Pro; it takes effect within seconds

Why use pro instead of flash:

  • deepseek-v4-pro — supports image input (multimodal); the pro series is the only model that accepts type: 'image_url' inside messages.content arrays
  • deepseek-v4-flash — accepts only plain-text messages; passing images will be silently ignored or cause an error

For the complete model matrix, refer to the current Model Access documentation; all examples in this recipe use deepseek-v4-pro.
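
If one backend serves both image and text-only requests, the pro/flash split described above can be captured in a small helper. This is a sketch only; pickModel is a hypothetical name, and the two model IDs are the ones used in this recipe:

```typescript
type ModelId = 'deepseek-v4-pro' | 'deepseek-v4-flash';

// Hypothetical helper: use pro only when the request actually carries images,
// and fall back to the cheaper flash model for text-only requests.
function pickModel(imageCount: number): ModelId {
  return imageCount > 0 ? 'deepseek-v4-pro' : 'deepseek-v4-flash';
}
```

The examples in this recipe always send at least one image, so they hard-code deepseek-v4-pro instead.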

Step 2: Install the SDK and Configure Environment Variables

npm install @cloudbase/node-sdk

.env.local:

CLOUDBASE_ENV=your-env-id
TENCENTCLOUD_SECRETID=your-secret-id
TENCENTCLOUD_SECRETKEY=your-secret-key

None of these have a NEXT_PUBLIC_ prefix — the SDK call runs on the server side, and the env ID and credentials must not be exposed in the client bundle.

SECRETID/SECRETKEY are generated at Tencent Cloud Console → API Keys. For production, use a sub-account key with a CAM policy scoped to the current CloudBase environment. If your Next.js app is deployed to CloudBase Cloud Run or Cloud Functions, these two variables are auto-injected and can be omitted.
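
A missing variable otherwise only surfaces at the first SDK call. One way to fail earlier is a small check at startup; the sketch below assumes the three variable names from .env.local above, and missingEnvVars is a hypothetical helper:

```typescript
const REQUIRED_VARS: readonly string[] = [
  'CLOUDBASE_ENV',
  'TENCENTCLOUD_SECRETID',
  'TENCENTCLOUD_SECRETKEY',
];

// Returns the names of variables that are unset or blank.
// Takes the environment as a parameter so it can be exercised without touching process.env.
function missingEnvVars(
  env: Record<string, string | undefined>,
  required: readonly string[] = REQUIRED_VARS,
): string[] {
  return required.filter((name) => !env[name]?.trim());
}

// Typical use on the server: missingEnvVars(process.env)
```

On CloudBase Cloud Run or Cloud Functions, where the two credential variables are auto-injected, passing only ['CLOUDBASE_ENV'] as the required list would be the closer fit.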

Step 3: Write the Route Handler — FormData → base64 → generateText

Create app/api/vision/route.ts:

import tcb from '@cloudbase/node-sdk';

export const runtime = 'nodejs'; // Required: edge is not supported

let app: ReturnType<typeof tcb.init> | null = null;

function getAi() {
  if (!app) {
    // timeout 60s: multimodal inference is slower than plain text; the default 15s will time out
    app = tcb.init({ env: process.env.CLOUDBASE_ENV!, timeout: 60000 });
  }
  return app.ai();
}

// File -> data:image/xxx;base64,xxxxxx
async function fileToDataUrl(file: File): Promise<string> {
  const buf = Buffer.from(await file.arrayBuffer());
  const mime = file.type || 'image/jpeg';
  return `data:${mime};base64,${buf.toString('base64')}`;
}

export async function POST(req: Request) {
  const form = await req.formData();
  const prompt = (form.get('prompt') as string) || 'Describe the contents of this image';
  const files = form.getAll('images').filter((v): v is File => v instanceof File);

  if (files.length === 0) {
    return Response.json({ error: 'no image uploaded' }, { status: 400 });
  }

  // Convert all images to data URLs, preserving the order of upload
  const imageContents = await Promise.all(
    files.map(async (file) => ({
      type: 'image_url' as const,
      image_url: { url: await fileToDataUrl(file) },
    })),
  );

  const ai = getAi();
  const model = ai.createModel('cloudbase');

  const result = await model.generateText({
    model: 'deepseek-v4-pro',
    messages: [
      {
        role: 'user',
        content: [
          ...imageContents,
          { type: 'text', text: prompt },
        ],
      },
    ],
  });

  return Response.json({ text: result.text });
}

Key points:

  • app is cached as a module-level variable — every request enters the POST function, but tcb.init() is only called once
  • The server SDK carries an environment-level identity — signInAnonymously() is not needed
  • content is an array, and order matters — images before text is the officially recommended order; the model treats images as context and text as the question about those images; reversing the order can work but occasionally causes the model to treat the text as the primary task and ignore the images as attachments
  • For multiple images, add more { type: 'image_url', ... } objects to imageContents, up to a maximum of 4 images (beyond that, the pro series context window overflows)
  • Images are passed as base64 Data URL strings, not as COS / OSS public URLs — public URLs can be blocked at certain regional egress points; inline Data URLs are the most reliable approach; the trade-off is a larger request body, so for production it is recommended to compress images to under 1 MB on the frontend
  • Use generateText rather than streamText: image understanding typically outputs 50–300 characters in one shot, making streaming unnecessary; getting the complete result.text directly is simpler. For streaming, replace generateText with streamText and consume result.textStream, following the streaming pattern in Step 3 of add-ai-nextjs
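
The handler above only checks that at least one file arrived. The limits mentioned in this recipe (at most 4 images, 4 MB each) could also be enforced on the server before any base64 conversion. A minimal sketch, where validateUploads and UploadLike are hypothetical names and a File from form.getAll('images') satisfies the interface structurally:

```typescript
// Structural subset of File: only the fields the check needs.
interface UploadLike {
  type: string; // MIME type reported by the browser
  size: number; // size in bytes
}

const MAX_IMAGES = 4;               // image-count ceiling stated in this recipe
const MAX_BYTES = 4 * 1024 * 1024;  // recommended per-image ceiling

// Returns an error message for the first violated rule, or null when the upload is acceptable.
function validateUploads(files: UploadLike[]): string | null {
  if (files.length === 0) return 'no image uploaded';
  if (files.length > MAX_IMAGES) return `too many images (max ${MAX_IMAGES})`;
  for (const f of files) {
    if (!f.type.startsWith('image/')) return `unsupported type: ${f.type || 'unknown'}`;
    if (f.size > MAX_BYTES) return `image too large (max ${MAX_BYTES} bytes)`;
  }
  return null;
}
```

In the Route Handler, a non-null result would map naturally to a 400 response with the message in the error field, which the frontend in Step 4 already displays.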

Step 4: Frontend Client Component — Upload Images and Display Results

Create app/vision/page.tsx:

'use client';

import { useState } from 'react';

export default function Vision() {
  const [files, setFiles] = useState<File[]>([]);
  const [prompt, setPrompt] = useState('Describe this image');
  const [result, setResult] = useState('');
  const [loading, setLoading] = useState(false);

  async function send() {
    if (files.length === 0 || loading) return;

    setLoading(true);
    setResult('');

    try {
      const form = new FormData();
      form.append('prompt', prompt);
      // All files share the same key 'images'; the Route Handler uses form.getAll('images') to receive them as an array
      files.forEach((f) => form.append('images', f));

      const res = await fetch('/api/vision', {
        method: 'POST',
        body: form,
      });

      if (!res.ok) {
        const err = await res.json().catch(() => ({ error: `HTTP ${res.status}` }));
        throw new Error(err.error || `HTTP ${res.status}`);
      }

      const data = await res.json();
      setResult(data.text);
    } catch (err) {
      setResult(`[Error] ${err instanceof Error ? err.message : String(err)}`);
    } finally {
      setLoading(false);
    }
  }

  return (
    <div style={{ maxWidth: 720, margin: '40px auto', padding: 16 }}>
      <input
        type="file"
        accept="image/*"
        multiple
        onChange={(e) => setFiles(Array.from(e.target.files ?? []))}
        disabled={loading}
      />

      <div style={{ margin: '12px 0' }}>
        {files.map((f) => (
          <img
            key={f.name}
            src={URL.createObjectURL(f)}
            alt={f.name}
            style={{ width: 120, height: 120, objectFit: 'cover', marginRight: 8 }}
          />
        ))}
      </div>

      <textarea
        rows={2}
        style={{ width: '100%', padding: 8 }}
        value={prompt}
        onChange={(e) => setPrompt(e.target.value)}
        placeholder="Ask something about this image"
        disabled={loading}
      />

      <button
        onClick={send}
        disabled={loading || files.length === 0}
        style={{ marginTop: 8 }}
      >
        {loading ? 'Analyzing...' : 'Send'}
      </button>

      {result && (
        <div
          style={{
            marginTop: 16,
            padding: 12,
            background: '#f5f5f5',
            whiteSpace: 'pre-wrap',
          }}
        >
          {result}
        </div>
      )}
    </div>
  );
}

Details worth noting:

  • The multiple attribute allows selecting multiple files at once; the Route Handler receives them as a File array
  • Multiple files all use the same form key 'images' appended repeatedly; the backend uses form.getAll('images') to collect them into an array. Do not use indexed names like images[0] / images[1] — FormData does not have that semantics
  • URL.createObjectURL(f) produces a blob URL that should ideally be revoked with revokeObjectURL on unmount; this example omits that for brevity — uploading many images over an extended session will gradually consume memory
  • Whether to compress on the frontend before uploading: the browser can draw the image onto a <canvas> and call canvas.toBlob(callback, 'image/jpeg', 0.8) to compress large images to under 1 MB, significantly reducing the Data URL size and token consumption. Skipping compression still works, but a 4 MB raw image becomes a 5.3 MB base64 payload, which makes the API request body considerably larger
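
The canvas compression mentioned in the last point could look roughly like the sketch below. It is browser-only and not part of the recipe's code: compressImage, fitWithin, and the 1600 px maximum side are assumptions for illustration, and the resulting size depends on image content, so landing under 1 MB is not guaranteed.

```typescript
// Pure helper: scale dimensions so the longer side is at most maxSide, preserving aspect ratio.
// Images that are already small enough are left at their original size.
function fitWithin(width: number, height: number, maxSide: number): { width: number; height: number } {
  const scale = Math.min(1, maxSide / Math.max(width, height));
  return { width: Math.round(width * scale), height: Math.round(height * scale) };
}

// Browser-only: decode the file, redraw it on a canvas at the reduced size, and re-encode as JPEG.
async function compressImage(file: File, maxSide = 1600, quality = 0.8): Promise<Blob> {
  const bitmap = await createImageBitmap(file);
  const { width, height } = fitWithin(bitmap.width, bitmap.height, maxSide);

  const canvas = document.createElement('canvas');
  canvas.width = width;
  canvas.height = height;
  const ctx = canvas.getContext('2d');
  if (!ctx) throw new Error('2D canvas context unavailable');
  ctx.drawImage(bitmap, 0, 0, width, height);
  bitmap.close(); // release the decoded image memory

  // toBlob is callback-based, so wrap it in a Promise.
  return new Promise<Blob>((resolve, reject) => {
    canvas.toBlob(
      (blob) => (blob ? resolve(blob) : reject(new Error('toBlob returned null'))),
      'image/jpeg',
      quality,
    );
  });
}
```

In send(), each file could be passed through compressImage before form.append; because the result is a Blob rather than a File, the three-argument form.append('images', blob, f.name) keeps the original filename.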

Step 5: Prompt Patterns for Three Common Scenarios

The same Route Handler handles all three — only the prompt changes:

Scenario A — Single Image Description / Copywriting

You are an e-commerce product copywriter. Examine this product image and write a concise description
of no more than 50 words in English, highlighting the product category, color, and key selling points.
Output the copy directly without any introductory remarks.

Scenario B — Single Image OCR + Structured Extraction

This is an invoice. Extract the following fields and return them as JSON:
`{ "vendor": "", "amount": 0, "date": "YYYY-MM-DD", "items": [] }`
Return an empty string or 0 for any field not found. Return only JSON — no markdown code fences.

Remember to JSON.parse(result.text) after receiving the response in the Route Handler. The model occasionally wraps the output in ```json ... ``` fences — stripping the code block markers before parsing is more reliable in production.
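
One possible shape for that fence stripping is sketched below. stripCodeFence and parseModelJson are hypothetical helper names, and the regular expression only handles a single fence wrapping the whole reply, which is an assumption about how the model formats its output:

```typescript
// Removes one surrounding markdown code fence (with or without a language tag), if present.
function stripCodeFence(text: string): string {
  const trimmed = text.trim();
  const match = trimmed.match(/^```[a-zA-Z]*\s*\n?([\s\S]*?)\n?```$/);
  return match ? match[1].trim() : trimmed;
}

// Parses model output as JSON; returns null instead of throwing when the text is not valid JSON.
function parseModelJson<T>(text: string): T | null {
  try {
    return JSON.parse(stripCodeFence(text)) as T;
  } catch {
    return null;
  }
}
```

Returning null rather than throwing lets the Route Handler decide whether to retry, fall back to the raw text, or report an extraction failure.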

Scenario C — Multi-Image Comparison

Upload 2–4 product images, with a prompt such as:

These are N product images of the same category. Provide a brief comparison across three dimensions:
color, style, and suitable use case. Conclude with a recommendation on which image would work best
as the main product photo.

generateText processes multiple images in messages order; the model typically refers to them as "Image 1", "Image 2", and so on. To enforce explicit labels, include identifiers in the prompt such as "the first image is Model A, the second image is Model B".
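
When the labels come from application data rather than being typed by hand, the prompt prefix can be generated in upload order. A small sketch; buildLabeledPrompt is a hypothetical helper, and the wording of the legend is only one option:

```typescript
// Prepends an explicit "Image N is <label>." legend, in the same order the images are uploaded.
function buildLabeledPrompt(labels: string[], question: string): string {
  const legend = labels.map((label, i) => `Image ${i + 1} is ${label}.`).join(' ');
  return `${legend} ${question}`.trim();
}

console.log(buildLabeledPrompt(['Model A', 'Model B'], 'Compare color and style.'));
// prints "Image 1 is Model A. Image 2 is Model B. Compare color and style."
```

The labels array must follow the same order as the files appended under the 'images' key, since the Route Handler preserves upload order.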

Step 6: Verification

  1. Start the development server: npm run dev
  2. Open http://localhost:3000/vision in a browser
  3. Select a local image, leave the default prompt, and click Send
  4. After a few seconds, the AI description should appear in the grey box below
  5. In browser DevTools Network panel, the /api/vision request payload should be multipart/form-data and the response should be { "text": "..." }
  6. The server terminal should not show cloudbase.init is not a function / model not found / image format not supported
  7. CloudBase Console → AI+ → Call Records should show the call, with a token count noticeably larger than a text-only call (images consume many input tokens)

If item 4 above (waiting for the description to appear) takes more than 30 seconds, the image is likely too large. Check the request body size in the Network panel; if it exceeds 6 MB, compress the image before uploading.
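
The same 6 MB threshold can be applied as a coarse server-side guard by inspecting the Content-Length header before reading the form. This is a sketch under assumptions: exceedsBodyLimit is a hypothetical helper, and because the header is client-supplied and may be absent, it complements rather than replaces per-file size checks.

```typescript
const MAX_BODY_BYTES = 6 * 1024 * 1024; // the 6 MB threshold mentioned above

// Returns true only when the declared body size is a valid number above the limit.
// A missing or malformed header returns false, leaving the decision to later checks.
function exceedsBodyLimit(contentLength: string | null, limit: number = MAX_BODY_BYTES): boolean {
  if (contentLength === null) return false;
  const declared = Number(contentLength);
  return Number.isFinite(declared) && declared > limit;
}

// Typical use at the top of POST:
//   if (exceedsBodyLimit(req.headers.get('content-length'))) {
//     return Response.json({ error: 'request body too large' }, { status: 413 });
//   }
```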

Common Errors

| Error / Symptom | Cause | Fix |
| --- | --- | --- |
| model not found / model 'deepseek-v4-pro' is not supported | deepseek-v4-pro is not enabled in the current environment, or the name is misspelled (e.g. deepseek-vl-v4-pro / deepseek-v4) | Check the exact model ID in Console → AI+ → Model Management and copy it verbatim; if not yet enabled, click Quick Setup and check it |
| After upload: image format not supported / invalid image_url | The image_url value is a blob: URL or a frontend File object path, not a Data URL | Must pass the data:image/jpeg;base64,xxx format (see fileToDataUrl in the Route Handler); blob URLs are browser-local and inaccessible on the server |
| Multiple images uploaded but the model only answers about the first one | The content array order is reversed: text appears before images, causing the model to treat text as the primary instruction and images as attachments; or the number of images exceeds the model limit | Reorder the content array so all type: 'image_url' entries come before type: 'text'; keep the image count to 4 or fewer |
| cannot find module '@cloudbase/node-sdk' or XMLHttpRequest is not defined (after deployment) | The Route Handler uses export const runtime = 'edge'; the Edge Runtime lacks the full Node.js API | Change to export const runtime = 'nodejs'; the SDK must run on the Node Runtime |
| secretId or secretKey not found / getCredential failed (after deployment) | Server-side credentials were not injected. Vercel and self-managed servers require explicit configuration of TENCENTCLOUD_SECRETID + TENCENTCLOUD_SECRETKEY (Cloud Run / Cloud Functions inject them automatically) | Add both variables in the deployment platform's environment settings; values come from Tencent Cloud Console → API Keys. Use a sub-account key scoped via CAM to the current CloudBase environment for best security |
| Request hangs for 60 seconds before timing out | The Node SDK default timeout: 15000 is too short; combined with large images and slow multimodal inference, it is regularly exceeded; or the image is too large and uploading is slow | Use tcb.init({ env, timeout: 60000 }) to explicitly raise the timeout to 60s or more (already done in the example above); compress large images to under 1 MB on the frontend |
| In OCR scenario, AI output is wrapped in ```json ... ``` fences, causing JSON.parse to fail | The model adds markdown formatting on its own | Strip the fences before parsing: `text.replace(/^```json\s* |
| Single image is clear but fine-detail recognition fails (e.g., OCR reads 8 as 0) | Multimodal image understanding has limited fine-detail accuracy and cannot match dedicated OCR or document AI | For critical documents (invoices, receipts, ID cards), use Tencent Cloud OCR; this recipe is suitable only for coarse filtering and descriptive tasks |

For the complete error code reference, see https://docs.cloudbase.net/error-code/.
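
The Route Handler in Step 3 lets SDK errors propagate, so the frontend only sees a generic HTTP 500. One way to surface the messages from the table above is to convert a caught error into a status code and a JSON body. This is a sketch: toErrorPayload is a hypothetical helper, and matching on the word "timeout" is an assumption, since the exact shape of SDK errors is not specified here.

```typescript
interface ErrorPayload {
  status: number;
  body: { error: string };
}

// Maps any thrown value to an HTTP status plus the { error } body the Step 4 frontend reads.
// Timeouts are reported as 504; everything else from the upstream model call as 502.
function toErrorPayload(err: unknown): ErrorPayload {
  const message = err instanceof Error ? err.message : String(err);
  const status = /timeout|timed out/i.test(message) ? 504 : 502;
  return { status, body: { error: message } };
}

// Typical use around the generateText call:
//   try { ... } catch (err) {
//     const { status, body } = toErrorPayload(err);
//     return Response.json(body, { status });
//   }
```

Forwarding raw error messages is convenient during development; a production handler may prefer to log the original message and return a more generic one.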

Billing Notes

  • Multimodal image input consumes a large number of input tokens — a 512×512 image consumes approximately 800–1,500 tokens (subject to the model's current image tiling rules), which is significantly more expensive than an equivalent text prompt. When estimating costs for batch scenarios, calculate "image unit price + output text unit price" separately
  • deepseek-v4-pro is priced higher than deepseek-v4-flash — use pro only for image scenarios and switch back to flash for text-only conversations. See Model Access
  • The Route Handler is the backend proxy layer and credentials never reach the browser, but /api/vision is still exposed to the public internet. Before going live, verify the caller in the POST handler: validate your own login session (see add-auth-web-with-cloudbase-sdk), apply per-UID / per-IP rate limiting, or use CloudBase Security Controls for domain whitelisting to prevent API abuse
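
The per-IP rate limiting suggested in the last point could start from something like the following fixed-window counter. It is a sketch under stated assumptions: FixedWindowLimiter is a hypothetical class, state lives in the memory of a single process, and a deployment with several instances or serverless cold starts would need a shared store instead.

```typescript
// Fixed-window limiter: at most `limit` calls per key within each window of `windowMs` milliseconds.
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly hits = new Map<string, { windowStart: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // `now` is injectable so the behavior can be checked without waiting for real time to pass.
  allow(key: string, now: number = Date.now()): boolean {
    const entry = this.hits.get(key);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.hits.set(key, { windowStart: now, count: 1 }); // start a fresh window
      return true;
    }
    if (entry.count >= this.limit) return false; // budget for this window is used up
    entry.count += 1;
    return true;
  }
}
```

In the POST handler, the key could be the authenticated UID or the client IP, with a 429 response when allow returns false. The limit and window values are left to the application.
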

Related Recipes and Documentation

  • add-ai-nextjs: the text-only conversation version (streamText + deepseek-v4-flash); this recipe is its multimodal extension
  • add-ai-wechat-miniprogram — the same CloudBase AI capability in a Mini Program (wx.cloud.extend.AI); Mini Program image upload uses wx.chooseMedia + wx.getFileSystemManager().readFile to convert to base64; the subsequent messages structure is identical to this recipe
  • add-cos-upload-from-cloudbase-app — the alternative approach for bulk image ingestion into COS followed by Cloud Function batch multimodal calls, suitable for asynchronous "store first, analyze later" pipelines
  • connect-tavily-search-cloud-function — combining image understanding with web search (the AI analyzes an image and then queries real-time web pages); reference its search call and splice the results into the messages array from this recipe
  • CloudBase AI Toolkit — integration paths for AI IDEs such as Cursor, Windsurf, and CodeBuddy
  • Model Access — complete model matrix including capabilities per model (multimodal / Function Calling / JSON mode)
  • SDK API Reference — complete signatures for createModel / generateText / streamText and the multimodal schema for messages.content