Multimodal Image Understanding with DeepSeek V4-Pro in CloudBase AI
In one sentence: A Next.js Route Handler receives user-uploaded images, converts them to base64 Data URLs, and calls `@cloudbase/node-sdk`'s `app.ai().createModel('cloudbase').generateText` with `model: 'deepseek-v4-pro'`. `messages[].content` follows the OpenAI-compatible multimodal array structure (`type: 'image_url'` + `type: 'text'`), returning AI descriptions, OCR results, and content analysis in a single call.
Estimated time: 20 minutes | Difficulty: Advanced
Applicable Scenarios
- User uploads a product photo; AI generates e-commerce detail page copy or auto-produces alt text
- User uploads an invoice, business card, or form photo; AI extracts fields and fills them into a structured form
- User uploads a screenshot or chart; AI extracts data points or interprets chart trends
- Basic content moderation (checking for prohibited elements in an image) as a coarse pre-filter before a dedicated content safety service
- Adding an "upload a photo and ask anything" entry point to a Mini Program or web app without introducing a third-party multimodal SDK
Not applicable:
- Real-time video stream understanding (dozens of frames per second): use a dedicated video API or a self-hosted VLM; `generateText` is a single call and is unsuitable for the latency and cost requirements of video
- Bulk batch processing (millions of images at once): use add-cos-upload-from-cloudbase-app to ingest into COS first, then run Cloud Functions in batch with scheduled tasks; this recipe covers the synchronous path where users upload and receive results immediately
- Multi-turn image conversations (user sends an image, then sends follow-up questions, and the AI must remember the previous image): the frontend must accumulate the full `messages` array and include historical images in each request; this recipe covers only single calls; for multi-turn, extend using the multi-turn template from Step 5 of add-ai-nextjs
- General text-only conversation (no image, just a question): switch to `deepseek-v4-flash` for better cost-efficiency; see add-ai-nextjs
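The multi-turn case above amounts to keeping every earlier turn, images included, in the `messages` array and resending it each time. A minimal sketch, assuming the OpenAI-compatible message shape used later in this recipe; the `ContentPart` and `ChatMessage` types and both helper functions are hypothetical names for illustration, not part of the SDK:

```typescript
// Hypothetical types mirroring the OpenAI-compatible multimodal shape used in this recipe.
type ContentPart =
  | { type: 'image_url'; image_url: { url: string } }
  | { type: 'text'; text: string };

type ChatMessage =
  | { role: 'user'; content: ContentPart[] }
  | { role: 'assistant'; content: string };

// Returns a new history with one more user turn; images (Data URLs) go before the text.
function appendUserTurn(
  history: ChatMessage[],
  imageDataUrls: string[],
  text: string,
): ChatMessage[] {
  const parts: ContentPart[] = [
    ...imageDataUrls.map((url) => ({ type: 'image_url' as const, image_url: { url } })),
    { type: 'text' as const, text },
  ];
  return [...history, { role: 'user', content: parts }];
}

// Records the model's reply so the next request carries the full conversation.
function appendAssistantTurn(history: ChatMessage[], reply: string): ChatMessage[] {
  return [...history, { role: 'assistant', content: reply }];
}
```

Because the first turn's Data URL stays in the history, every follow-up request pays for that image's input tokens again, which is one reason this recipe stays with single calls.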
Prerequisites
| Dependency | Version |
|---|---|
| Next.js | 14+ (App Router, stable Route Handler) |
| @cloudbase/node-sdk | 3.16.0 or higher (required by the AI module; multimodal uses the same interface) |
| Node.js | 18.17+ (required by Next.js 14) |
| Route Handler runtime | Must be nodejs — edge is not supported (the SDK depends on Node APIs) |
| CloudBase environment | Provisioned, with AI+ enabled in the Console, and deepseek-v4-pro visible in Model Management |
| Single image size | Recommended ≤ 4 MB per image; approximately 5.3 MB after base64 encoding — compress large images on the frontend first |
The server side must use `@cloudbase/node-sdk`. Do not use the `@cloudbase/js-sdk` + `signInAnonymously()` web-side pattern: anonymous login is aggressively rate-limited and cannot be used in production. Production must go through a backend proxy that uses the Node SDK with environment-level credentials; see add-ai-nextjs.
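The size figure in the prerequisites table follows from base64 arithmetic: every 3 input bytes become 4 output characters. A small sketch of that calculation, assuming "4 MB" means 4 × 1024 × 1024 bytes; both function names are illustrative:

```typescript
// Base64 emits 4 output characters for every 3 input bytes, padding the final group.
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

// Length of the full Data URL string: the "data:<mime>;base64," prefix plus the payload.
function dataUrlLength(byteLength: number, mime = 'image/jpeg'): number {
  return `data:${mime};base64,`.length + base64Length(byteLength);
}
```

For a 4,194,304-byte image, `base64Length` gives 5,592,408 characters, roughly 5.33 MiB, which is where the "approximately 5.3 MB" figure comes from.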
Step 1: Confirm deepseek-v4-pro is Available in the Console
- Open the CloudBase Console → select your environment → AI+ → Model Management
- Confirm that `deepseek-v4-pro` is online and available in the model list
- If you only see `deepseek-v4-flash` and not `deepseek-v4-pro`, the model is not yet enabled for this environment: click Quick Setup → check DeepSeek V4-Pro; it takes effect within seconds
Why use pro instead of flash:
- `deepseek-v4-pro`: supports image input (multimodal); the pro series is the only one that accepts `type: 'image_url'` inside `messages.content` arrays
- `deepseek-v4-flash`: accepts only plain-text messages; images passed to it are silently ignored or cause an error
For the complete model matrix, refer to the current Model Access documentation; all examples in this recipe use deepseek-v4-pro.
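If one endpoint ever serves both image and text-only requests, the pro/flash split above can be expressed as a one-line choice. This is a sketch only: `pickModel` is a hypothetical helper, and the model IDs are the ones named in this recipe, which should be confirmed in Model Management:

```typescript
// Model IDs as named in this recipe; confirm the exact strings in Model Management.
const VISION_MODEL = 'deepseek-v4-pro';
const TEXT_MODEL = 'deepseek-v4-flash';

// Hypothetical helper: use pro only when the request actually carries images.
function pickModel(imageCount: number): string {
  return imageCount > 0 ? VISION_MODEL : TEXT_MODEL;
}
```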
Step 2: Install the SDK and Configure Environment Variables
npm install @cloudbase/node-sdk
.env.local:
CLOUDBASE_ENV=your-env-id
TENCENTCLOUD_SECRETID=your-secret-id
TENCENTCLOUD_SECRETKEY=your-secret-key
None of these have a NEXT_PUBLIC_ prefix — the SDK call runs on the server side, and the env ID and credentials must not be exposed in the client bundle.
SECRETID/SECRETKEY are generated at Tencent Cloud Console → API Keys. For production, use a sub-account key with a CAM policy scoped to the current CloudBase environment. If your Next.js app is deployed to CloudBase Cloud Run or Cloud Functions, these two variables are auto-injected and can be omitted.
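A missing variable usually surfaces only as a credential error on the first request, so it can help to check the configuration early. A minimal sketch, assuming the three variable names from `.env.local` above; `missingCloudbaseEnv` and its `autoInjectedCredentials` flag are hypothetical, not SDK features:

```typescript
// Hypothetical startup check; the variable names are the ones from .env.local above.
type EnvSource = Record<string, string | undefined>;

// Returns the names of required variables that are unset or blank.
// Pass autoInjectedCredentials = true on platforms that inject the secret pair themselves.
function missingCloudbaseEnv(env: EnvSource, autoInjectedCredentials = false): string[] {
  const required = autoInjectedCredentials
    ? ['CLOUDBASE_ENV']
    : ['CLOUDBASE_ENV', 'TENCENTCLOUD_SECRETID', 'TENCENTCLOUD_SECRETKEY'];
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === '';
  });
}
```

On the server this would be called as `missingCloudbaseEnv(process.env)`, logging or throwing when the returned array is non-empty.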
Step 3: Write the Route Handler — FormData → base64 → generateText
Create app/api/vision/route.ts:
import tcb from '@cloudbase/node-sdk';

export const runtime = 'nodejs'; // Required: edge is not supported

let app: ReturnType<typeof tcb.init> | null = null;

function getAi() {
  if (!app) {
    // timeout 60s: multimodal inference is slower than plain text; the default 15s will time out
    app = tcb.init({ env: process.env.CLOUDBASE_ENV!, timeout: 60000 });
  }
  return app.ai();
}

// File -> data:image/xxx;base64,xxxxxx
async function fileToDataUrl(file: File): Promise<string> {
  const buf = Buffer.from(await file.arrayBuffer());
  const mime = file.type || 'image/jpeg';
  return `data:${mime};base64,${buf.toString('base64')}`;
}

export async function POST(req: Request) {
  const form = await req.formData();
  const prompt = (form.get('prompt') as string) || 'Describe the contents of this image';
  const files = form.getAll('images').filter((v): v is File => v instanceof File);

  if (files.length === 0) {
    return Response.json({ error: 'no image uploaded' }, { status: 400 });
  }

  // Convert all images to data URLs, preserving the order of upload
  const imageContents = await Promise.all(
    files.map(async (file) => ({
      type: 'image_url' as const,
      image_url: { url: await fileToDataUrl(file) },
    })),
  );

  const ai = getAi();
  const model = ai.createModel('cloudbase');
  const result = await model.generateText({
    model: 'deepseek-v4-pro',
    messages: [
      {
        role: 'user',
        content: [
          ...imageContents,
          { type: 'text', text: prompt },
        ],
      },
    ],
  });

  return Response.json({ text: result.text });
}
Key points:
- `app` is cached as a module-level variable: every request enters the `POST` function, but `tcb.init()` is only called once
- The server SDK carries an environment-level identity, so `signInAnonymously()` is not needed
- `content` is an array, and order matters: images before text is the officially recommended order; the model treats images as context and text as the question about those images; reversing the order can work but occasionally causes the model to treat the text as the primary task and disregard the images as mere attachments
- For multiple images, add more `{ type: 'image_url', ... }` objects to `imageContents`, up to a maximum of 4 images (beyond that, the pro series context window overflows)
- Images are passed as base64 Data URL strings, not as COS / OSS public URLs: public URLs can be blocked at certain regional egress points, so inline Data URLs are the most reliable approach; the trade-off is a larger request body, so for production it is recommended to compress images to under 1 MB on the frontend
- Use `generateText` rather than `streamText`: image understanding typically outputs 50–300 characters in one shot, making streaming unnecessary; getting the complete `result.text` directly is simpler. For streaming, replace `generateText` with `streamText` and consume `result.textStream`, following the streaming pattern in Step 3 of add-ai-nextjs
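The Route Handler above only rejects an empty upload. The limits mentioned in this recipe (at most 4 images, 4 MB each) could be enforced before any base64 conversion with a check along these lines. This is a sketch under stated assumptions: `validateImages` and `UploadedImage` are hypothetical names, and 4 MB is taken as 4 × 1024 × 1024 bytes:

```typescript
// Limits taken from this recipe: at most 4 images, 4 MB each (assumed to mean 4 * 1024 * 1024 bytes).
const MAX_IMAGES = 4;
const MAX_IMAGE_BYTES = 4 * 1024 * 1024;

// Structural subset of the Web File interface, so the check can be tested without a browser.
interface UploadedImage {
  name: string;
  type: string;
  size: number;
}

// Returns an error message for the first problem found, or null when the upload is acceptable.
function validateImages(files: UploadedImage[]): string | null {
  if (files.length === 0) return 'no image uploaded';
  if (files.length > MAX_IMAGES) return `too many images: ${files.length} > ${MAX_IMAGES}`;
  for (const file of files) {
    if (!file.type.startsWith('image/')) {
      return `${file.name}: not an image (${file.type || 'unknown type'})`;
    }
    if (file.size > MAX_IMAGE_BYTES) return `${file.name}: larger than 4 MB`;
  }
  return null;
}
```

Since a `File` already has `name`, `type`, and `size`, the `files` array from `form.getAll('images')` can be passed in directly, and a non-null result can be returned to the client with status 400.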
Step 4: Frontend Client Component — Upload Images and Display Results
Create app/vision/page.tsx:
'use client';

import { useState } from 'react';

export default function Vision() {
  const [files, setFiles] = useState<File[]>([]);
  const [prompt, setPrompt] = useState('Describe this image');
  const [result, setResult] = useState('');
  const [loading, setLoading] = useState(false);

  async function send() {
    if (files.length === 0 || loading) return;
    setLoading(true);
    setResult('');
    try {
      const form = new FormData();
      form.append('prompt', prompt);
      // All files share the same key 'images'; the Route Handler uses form.getAll('images') to receive them as an array
      files.forEach((f) => form.append('images', f));
      const res = await fetch('/api/vision', {
        method: 'POST',
        body: form,
      });
      if (!res.ok) {
        const err = await res.json().catch(() => ({ error: `HTTP ${res.status}` }));
        throw new Error(err.error || `HTTP ${res.status}`);
      }
      const data = await res.json();
      setResult(data.text);
    } catch (err) {
      setResult(`[Error] ${err instanceof Error ? err.message : String(err)}`);
    } finally {
      setLoading(false);
    }
  }

  return (
    <div style={{ maxWidth: 720, margin: '40px auto', padding: 16 }}>
      <input
        type="file"
        accept="image/*"
        multiple
        onChange={(e) => setFiles(Array.from(e.target.files ?? []))}
        disabled={loading}
      />
      <div style={{ margin: '12px 0' }}>
        {files.map((f) => (
          <img
            key={f.name}
            src={URL.createObjectURL(f)}
            alt={f.name}
            style={{ width: 120, height: 120, objectFit: 'cover', marginRight: 8 }}
          />
        ))}
      </div>
      <textarea
        rows={2}
        style={{ width: '100%', padding: 8 }}
        value={prompt}
        onChange={(e) => setPrompt(e.target.value)}
        placeholder="Ask something about this image"
        disabled={loading}
      />
      <button
        onClick={send}
        disabled={loading || files.length === 0}
        style={{ marginTop: 8 }}
      >
        {loading ? 'Analyzing...' : 'Send'}
      </button>
      {result && (
        <div
          style={{
            marginTop: 16,
            padding: 12,
            background: '#f5f5f5',
            whiteSpace: 'pre-wrap',
          }}
        >
          {result}
        </div>
      )}
    </div>
  );
}
Details worth noting:
- The `multiple` attribute allows selecting multiple files at once; the Route Handler receives them as a File array
- Multiple files all use the same form key `'images'`, appended repeatedly; the backend uses `form.getAll('images')` to collect them into an array. Do not use indexed names like `images[0]` / `images[1]`; FormData has no such semantics
- `URL.createObjectURL(f)` produces a blob URL that should ideally be revoked with `revokeObjectURL` on unmount; this example omits that for brevity, so uploading many images over an extended session will gradually consume memory
- Whether to compress on the frontend before uploading: the browser can draw the image onto a `<canvas>` and call `canvas.toBlob(callback, 'image/jpeg', 0.8)` to compress large images to under 1 MB, significantly reducing the Data URL size and token consumption. Skipping compression still works, but a 4 MB raw image becomes a 5.3 MB base64 payload, which makes the API request body considerably larger
Step 5: Prompt Patterns for Three Common Scenarios
The same Route Handler handles all three — only the prompt changes:
Scenario A — Single Image Description / Copywriting
You are an e-commerce product copywriter. Examine this product image and write a concise description
of no more than 50 words in English, highlighting the product category, color, and key selling points.
Output the copy directly without any introductory remarks.
Scenario B — Single Image OCR + Structured Extraction
This is an invoice. Extract the following fields and return them as JSON:
`{ "vendor": "", "amount": 0, "date": "YYYY-MM-DD", "items": [] }`
Return an empty string or 0 for any field not found. Return only JSON — no markdown code fences.
Remember to JSON.parse(result.text) after receiving the response in the Route Handler. The model occasionally wraps the output in ```json ... ``` fences — stripping the code block markers before parsing is more reliable in production.
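The fence-stripping step described above might look like the following. It is a sketch rather than a definitive implementation: `stripCodeFences` and `parseModelJson` are hypothetical helper names, and the sketch assumes the model wraps the whole response in at most one fence pair:

```typescript
// Three backticks, built indirectly so this sample never contains a literal fence.
const FENCE = '`'.repeat(3);

// Removes one leading fence (with an optional language tag such as "json")
// and one trailing fence, if present.
function stripCodeFences(text: string): string {
  const opening = new RegExp('^' + FENCE + '[a-zA-Z]*\\s*');
  const closing = new RegExp('\\s*' + FENCE + '$');
  return text.trim().replace(opening, '').replace(closing, '').trim();
}

// Parses the model output; returns null instead of throwing when it is not valid JSON.
function parseModelJson<T>(text: string): T | null {
  try {
    return JSON.parse(stripCodeFences(text)) as T;
  } catch {
    return null;
  }
}
```

In the Route Handler this would replace a bare `JSON.parse(result.text)`; a `null` result can be reported to the client as a parse failure instead of surfacing as an unhandled exception.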
Scenario C — Multi-Image Comparison
Upload 2–4 product images, with a prompt such as:
These are N product images of the same category. Provide a brief comparison across three dimensions:
color, style, and suitable use case. Conclude with a recommendation on which image would work best
as the main product photo.
generateText processes multiple images in messages order; the model typically refers to them as "Image 1", "Image 2", and so on. To enforce explicit labels, include identifiers in the prompt such as "the first image is Model A, the second image is Model B".
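Explicit labels can be generated from the upload order, so the prompt and the `content` array stay in step. A minimal sketch; `buildComparisonPrompt` is a hypothetical helper and the wording simply adapts the Scenario C prompt above:

```typescript
// Hypothetical helper: turns per-image labels into an explicit prompt,
// in the same order as the image_url entries in the content array.
function buildComparisonPrompt(labels: string[]): string {
  const ordinals = ['first', 'second', 'third', 'fourth'];
  if (labels.length < 2 || labels.length > ordinals.length) {
    throw new Error('expected between 2 and 4 image labels');
  }
  const naming = labels
    .map((label, i) => `the ${ordinals[i]} image is ${label}`)
    .join(', ');
  return (
    `These are ${labels.length} product images of the same category; ${naming}. ` +
    'Provide a brief comparison across three dimensions: color, style, and suitable use case. ' +
    'Conclude with a recommendation on which image would work best as the main product photo.'
  );
}
```

The returned string is sent as the `prompt` form field, with the images appended in the same order as the labels.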
Step 6: Verification
- Start the development server: `npm run dev`
- Open `http://localhost:3000/vision` in a browser
- Select a local image, leave the default prompt, and click Send
- After a few seconds, the AI description should appear in the grey box below
- In the browser DevTools Network panel, the `/api/vision` request payload should be `multipart/form-data` and the response should be `{ "text": "..." }`
- The server terminal should not show `cloudbase.init is not a function` / `model not found` / `image format not supported`
- CloudBase Console → AI+ → Call Records should show the call, with a token count noticeably larger than a text-only call (images consume many input tokens)
If the response in the fourth check takes more than 30 seconds, the image is likely too large: check the request body size in Network; if it exceeds 6 MB, compress the image before uploading.
Common Errors
| Error / Symptom | Cause | Fix |
|---|---|---|
| `model not found` / `model 'deepseek-v4-pro' is not supported` | `deepseek-v4-pro` is not enabled in the current environment, or the name is misspelled (e.g. `deepseek-vl-v4-pro` / `deepseek-v4`) | Check the exact model ID in Console → AI+ → Model Management and copy it verbatim; if not yet enabled, click Quick Setup and check it |
| After upload: `image format not supported` / `invalid image_url` | The `image_url` value is a `blob:` URL or a frontend File object path, not a Data URL | Pass the `data:image/jpeg;base64,xxx` format (see `fileToDataUrl` in the Route Handler); blob URLs are browser-local and inaccessible on the server |
| Multiple images uploaded but the model only answers about the first one | The content array order is reversed — text appears before images, causing the model to treat text as the primary instruction and images as attachments; or the number of images exceeds the model limit | Reorder the content array so all type: 'image_url' entries come before type: 'text'; keep the image count to 4 or fewer |
| `cannot find module '@cloudbase/node-sdk'` or `XMLHttpRequest is not defined` (after deployment) | The Route Handler uses `export const runtime = 'edge'`; the Edge Runtime lacks the full Node.js API | Change to `export const runtime = 'nodejs'`; the SDK must run on the Node Runtime |
| `secretId or secretKey not found` / `getCredential failed` (after deployment) | Server-side credentials were not injected. Vercel and self-managed servers require explicit configuration of `TENCENTCLOUD_SECRETID` + `TENCENTCLOUD_SECRETKEY` (Cloud Run / Cloud Functions inject them automatically) | Add both variables in the deployment platform's environment settings; values come from Tencent Cloud Console → API Keys. Use a sub-account key scoped via CAM to the current CloudBase environment for best security |
| Request hangs for 60 seconds before timing out | The Node SDK default timeout: 15000 is too short; combined with large images and slow multimodal inference, it is regularly exceeded; or the image is too large and uploading is slow | Use tcb.init({ env, timeout: 60000 }) to explicitly raise the timeout to 60s or more (already done in the example above); compress large images to under 1 MB on the frontend |
| In the OCR scenario, AI output is wrapped in ```json ... ``` fences, causing JSON.parse to fail | The model adds markdown formatting on its own | Strip the fences before parsing: `text.replace(/^```json\s* |
| Single image is clear but fine-detail recognition fails (e.g., OCR reads 8 as 0) | Multimodal image understanding has limited fine-detail accuracy and cannot match dedicated OCR or document AI | For critical documents (invoices, receipts, ID cards), use Tencent Cloud OCR; this recipe is suitable only for coarse filtering and descriptive tasks |
For the complete error code reference, see https://docs.cloudbase.net/error-code/.
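The `invalid image_url` row above comes down to sending something other than an inline Data URL. A small guard can catch that before the model call. This is a sketch: `isImageDataUrl` is a hypothetical helper, and the pattern assumes standard (non-URL-safe) base64 without whitespace:

```typescript
// Accepts only inline image Data URLs of the form data:image/<subtype>;base64,<payload>.
const IMAGE_DATA_URL = /^data:image\/[a-z0-9.+-]+;base64,[A-Za-z0-9+/]+={0,2}$/i;

// Returns false for blob: URLs, http(s) URLs, file paths, and non-image Data URLs.
function isImageDataUrl(url: string): boolean {
  return IMAGE_DATA_URL.test(url);
}
```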
Billing Notes
- Multimodal image input consumes a large number of input tokens — a 512×512 image consumes approximately 800–1,500 tokens (subject to the model's current image tiling rules), which is significantly more expensive than an equivalent text prompt. When estimating costs for batch scenarios, calculate "image unit price + output text unit price" separately
- `deepseek-v4-pro` is priced higher than `deepseek-v4-flash`: use pro only for image scenarios and switch back to flash for text-only conversations. See Model Access
- The Route Handler is the backend proxy layer and credentials never reach the browser, but `/api/vision` is still exposed to the public internet. Before going live, verify the caller in the `POST` handler: validate your own login session (see add-auth-web-with-cloudbase-sdk), apply per-UID / per-IP rate limiting, or use CloudBase Security Controls for domain whitelisting to prevent API abuse
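The per-UID / per-IP rate limiting mentioned above can start as a simple in-memory counter. A minimal sketch, assuming a single long-lived Node process; `FixedWindowLimiter` is a hypothetical class, and serverless or multi-instance deployments would need shared storage instead:

```typescript
// In-memory fixed-window limiter keyed by caller ID (UID or IP).
// Assumption: a single Node process; state is lost on restart and not shared across instances.
class FixedWindowLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly now: () => number;

  constructor(limit: number, windowMs: number, now: () => number = Date.now) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now;
  }

  // Returns true when the call is allowed (and counts it); false when the caller is over the limit.
  allow(key: string): boolean {
    const t = this.now();
    const current = this.windows.get(key);
    if (!current || t - current.start >= this.windowMs) {
      this.windows.set(key, { start: t, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}
```

A module-level instance such as `new FixedWindowLimiter(10, 60_000)` could be consulted at the top of `POST`, returning status 429 when `allow(callerId)` is false; the limit of 10 calls per minute is an arbitrary example value.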
Related Documentation
- add-ai-nextjs: the text-only conversation version (`streamText` + `deepseek-v4-flash`); this recipe is its multimodal extension
- add-ai-wechat-miniprogram: the same CloudBase AI capability in a Mini Program (`wx.cloud.extend.AI`); Mini Program image upload uses `wx.chooseMedia` + `wx.getFileSystemManager().readFile` to convert to base64; the subsequent `messages` structure is identical to this recipe
- add-cos-upload-from-cloudbase-app: the alternative approach for bulk image ingestion into COS followed by Cloud Function batch multimodal calls, suitable for asynchronous "store first, analyze later" pipelines
- connect-tavily-search-cloud-function: combining image understanding with web search (the AI analyzes an image and then queries real-time web pages); reference its search call and splice the results into the `messages` array from this recipe
- CloudBase AI Toolkit: integration paths for AI IDEs such as Cursor, Windsurf, and CodeBuddy
- Model Access: complete model matrix including capabilities per model (multimodal / Function Calling / JSON mode)
- SDK API Reference: complete signatures for `createModel` / `generateText` / `streamText` and the multimodal schema for `messages.content`