Long Document Q&A with DeepSeek V4 1M-Token Context (No RAG)
In one sentence: the user uploads a PDF / entire codebase / Excel file; the backend parses it into plain text with pdf-parse and stuffs the full content into the prompt, then calls CloudBase AI app.ai().createModel('cloudbase').streamText({ model: 'deepseek-v4-pro' }) to answer in one shot. There is no RAG embedding pipeline, no vector database, and no retrieval step.
Estimated time: 30 minutes | Difficulty: Advanced
Applicable Scenarios
- One-off Q&A on a single long document: annual report PDFs, multi-page academic papers, legal contracts, white papers — ask and move on
- Having AI read an entire repository (tens of thousands of lines) for architecture understanding, bug finding, or documentation generation
- Data Q&A on large Excel / CSV tables (thousands of rows) — serialize the table as plain text and feed it in
- Early-stage internal knowledge base where the total document count is stable at a few dozen files and a vector database is not yet necessary
Not applicable:
- Short documents (under a few thousand tokens): using deepseek-v4-pro's 1M context for this is wasteful; use deepseek-v4-flash from add-ai-nextjs instead
- Continuously growing documents / multi-document retrieval / need for fine-grained chunk citations: use add-rag-with-pgvector-cloudbase; embedding + vector store + retrieval is designed for "large knowledge bases + precise attribution"
- Documents exceeding the 800K-token soft limit (rough estimate: 1 token ≈ 1.5 Chinese characters / 4 English characters) — forcing documents beyond this limit results in truncation or errors; a fallback strategy is required
- A single document that is queried at high frequency (the same PDF queried hundreds of times per day) — re-stuffing the full text every time causes token costs to explode; RAG is more economical in this case
Which Path to Take: Decision Tree
| Your Scenario | Recommended Approach |
|---|---|
| Total document size < 800K tokens, one-off queries | This recipe (long-context all-in-one) |
| Total document size < 800K tokens, but a single document is queried hundreds of times | RAG (add-rag-with-pgvector-cloudbase) |
| Total document size > 800K tokens, or documents are continuously growing | RAG |
| Need precise citations ("this sentence is from page X, paragraph Y") | RAG (retrieved chunks carry their own index) |
| Very short documents (under a few thousand tokens) | Go directly to add-ai-nextjs with a flash model; 1M context is not needed |
In short: this recipe is simpler engineering but costs more per query; RAG is more complex engineering but costs less per query. Choose based on your document reuse rate.
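The reuse-rate trade-off can be made concrete with simple token arithmetic. The sketch below compares daily input-token volume for the two approaches. The function name and the 4,000-token retrieval size are illustrative assumptions (the text only says "a few thousand tokens"), and actual cost depends on the unit prices shown in the Console.

```typescript
// Hypothetical helper: compares daily input-token volume for long-context vs. RAG.
// retrievedTokens = 4_000 is an assumed stand-in for "a few thousand tokens".
function dailyInputTokens(
  docTokens: number,
  queriesPerDay: number,
  retrievedTokens = 4_000,
): { longContext: number; rag: number; ratio: number } {
  const longContext = docTokens * queriesPerDay; // full text is re-sent on every query
  const rag = retrievedTokens * queriesPerDay; // only the retrieved chunks are sent
  return { longContext, rag, ratio: longContext / rag };
}

const oneOff = dailyInputTokens(500_000, 1);
const hot = dailyInputTokens(500_000, 300);
console.log(oneOff.longContext); // 500000
console.log(hot.longContext); // 150000000
console.log(hot.rag); // 1200000
```

The per-query ratio stays the same; what grows with reuse is the absolute volume, which is why a document queried hundreds of times per day belongs on the RAG path.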
Prerequisites
| Dependency | Version |
|---|---|
| Next.js | 14+ (App Router) |
| @cloudbase/node-sdk | 3.16.0 or higher (required by AI module) |
| pdf-parse | ^1.1.1 (parses PDF buffer into plain text) |
| Node.js | 18.17+ |
| Route Handler runtime | Must be nodejs — edge is not supported |
| CloudBase environment | Provisioned, with "AI+" capability enabled in the Console |
| Model | deepseek-v4-pro (1M context is exclusive to the pro tier; flash does not support it) |
The server side must use @cloudbase/node-sdk. Do not use @cloudbase/js-sdk + signInAnonymously(). Anonymous Web SDK calls are aggressively rate-limited by default (see Web SDK Security Policy), and long-document requests run for tens of seconds, so the Web SDK path simply does not work here.
Step 1: Confirm deepseek-v4-pro Is Available in the Console
- Open the CloudBase Console → select your environment → AI+ → Model Management
- Confirm that deepseek-v4-pro appears in the available model list (see full list at Model Access)
- deepseek-v4-pro is the pro tier; 1M context is its exclusive capability. deepseek-v4-flash is the short-context, cost-optimized variant and cannot be used for this recipe
- Token billing is calculated separately for "input + output"; the pro tier is priced higher per token than flash. A single 1M-token input can consume the equivalent of dozens of flash conversations. Check the current unit prices in the Console before deploying
Step 2: Install Dependencies and Configure Environment Variables
npm install @cloudbase/node-sdk pdf-parse
# Types (optional): npm install -D @types/pdf-parse
.env.local:
CLOUDBASE_ENV=your-env-id
TENCENTCLOUD_SECRETID=your-secret-id
TENCENTCLOUD_SECRETKEY=your-secret-key
Identical to Step 2 in add-ai-nextjs — none of the three variables have a NEXT_PUBLIC_ prefix; the SDK only runs inside the server-side Route Handler. Credentials come from Tencent Cloud Console → API Keys; for production, use a sub-account key with a CAM policy scoped to the current environment.
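A missing variable otherwise only surfaces on the first request, often as an opaque authentication error. The following is a minimal sketch of a fail-fast check; missingEnv and assertEnv are hypothetical helper names, not part of the SDK, and the variable list simply mirrors .env.local above.

```typescript
// Hypothetical startup check; the variable names match .env.local above.
const REQUIRED_ENV = ['CLOUDBASE_ENV', 'TENCENTCLOUD_SECRETID', 'TENCENTCLOUD_SECRETKEY'] as const;

// Returns the names of variables that are unset or contain only whitespace.
function missingEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV.filter((name) => !env[name] || env[name]!.trim() === '');
}

// Throws a descriptive error listing every missing variable at once.
function assertEnv(env: Record<string, string | undefined>): void {
  const missing = missingEnv(env);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
}
```

One possible placement is to call assertEnv(process.env) once, just before the SDK is initialized in the Route Handler.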
Step 3: Write the Route Handler — Parse PDF and Stuff the Full Text into the Prompt
Create app/api/longdoc/route.ts:
import tcb from '@cloudbase/node-sdk';
import pdfParse from 'pdf-parse';
export const runtime = 'nodejs'; // Required: edge cannot run the SDK or pdf-parse
export const maxDuration = 180; // Next.js 14+ Route Handler defaults to 60s; not enough for long documents
let app: ReturnType<typeof tcb.init> | null = null;
function getAi() {
  if (!app) {
    // timeout 120000 (120s): time-to-first-token for million-scale inputs can exceed 30s; default 15s will always time out
    app = tcb.init({ env: process.env.CLOUDBASE_ENV!, timeout: 120000 });
  }
  return app.ai();
}
// 1 token ≈ 1.5 Chinese characters / 4 English characters; leave 20% headroom for the 1M context, soft limit at 800K tokens
const MAX_INPUT_TOKENS = 800_000;
function estimateTokens(text: string): number {
  // Simplified estimate: CJK at 1.5 chars/token, other text at 4 chars/token; mixed text uses an average
  const cjkChars = (text.match(/[\u4e00-\u9fff]/g) || []).length;
  const otherChars = text.length - cjkChars;
  return Math.ceil(cjkChars / 1.5 + otherChars / 4);
}
export async function POST(req: Request) {
  const form = await req.formData();
  const file = form.get('file') as File | null;
  const question = (form.get('question') as string) || 'Please summarize this document';
  if (!file) {
    return new Response(JSON.stringify({ error: 'file required' }), { status: 400 });
  }
  // 1. Convert File to Buffer; pdf-parse accepts a Buffer
  const buf = Buffer.from(await file.arrayBuffer());
  // 2. PDF → plain text (pdf-parse concatenates pages automatically; layout is lost but all text is preserved)
  const parsed = await pdfParse(buf);
  const fullText = parsed.text;
  // 3. Token estimate + truncation fallback
  const tokens = estimateTokens(fullText);
  let docText = fullText;
  if (tokens > MAX_INPUT_TOKENS) {
    // Simple fallback: keep head + tail (assuming key information is at the beginning and end)
    // A smarter approach would use LLM-based summarization for the middle; for more complexity, use RAG
    // Use this document's own average chars-per-token: a fixed 4 chars/token would
    // overshoot the budget for mixed text that is, say, 40% CJK
    const charsPerToken = fullText.length / tokens;
    const keepChars = Math.floor(MAX_INPUT_TOKENS * charsPerToken * 0.45); // keep 45% from head and tail each
    docText =
      fullText.slice(0, keepChars) +
      '\n\n[...middle content omitted because it exceeds the context window...]\n\n' +
      fullText.slice(-keepChars);
  }
  // 4. Build prompt + streamText
  const ai = getAi();
  const model = ai.createModel('cloudbase');
  const result = await model.streamText({
    model: 'deepseek-v4-pro', // 1M context requires pro; flash has a short context window
    messages: [
      {
        role: 'system',
        content:
          'You are a rigorous long-document analysis assistant. Answer questions strictly based on the provided "Reference Document". ' +
          'If information is not present in the document, say "Not mentioned in the document" — do not fabricate. ' +
          'Answer in the same language as the question. When referencing numbers, dates, or clauses, quote the original wording.',
      },
      {
        role: 'user',
        content: `Reference Document (${file.name}, approximately ${tokens} tokens):\n\n${docText}\n\nQuestion: ${question}`,
      },
    ],
  });
  // 5. AsyncIterable → ReadableStream; frontend reads with bare fetch
  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      try {
        for await (const chunk of result.textStream) {
          controller.enqueue(encoder.encode(chunk));
        }
        controller.close();
      } catch (err) {
        controller.error(err);
      }
    },
  });
  return new Response(stream, {
    headers: {
      'Content-Type': 'text/plain; charset=utf-8',
      'X-Doc-Tokens': String(tokens),
      'X-Doc-Truncated': tokens > MAX_INPUT_TOKENS ? '1' : '0',
    },
  });
}
Key points:
- tcb.init({ ..., timeout: 120000 }) must be set explicitly to 120s. Time-to-first-token for million-scale inputs can reach 30s+; the SDK default of 15s will always time out in long-document scenarios, and 120s provides comfortable headroom
- MAX_INPUT_TOKENS = 800_000 is a soft limit. deepseek-v4-pro nominally supports 1M tokens, but in practice you should leave 20% for output tokens + system prompt + safety margin
- The fallback strategy here is the simplest possible "head + tail" truncation. In production you can layer in: summarizing the middle sections (call flash once to generate a summary), or using embedding to coarsely filter to the top-N most relevant sections for the user query (at which point this is already a simplified RAG, so consider going directly to add-rag-with-pgvector-cloudbase)
- Do not use result.dataStream: it carries chunk metadata intended for the Vercel AI SDK protocol. This recipe uses bare fetch on the frontend and only needs plain text increments
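Because the token count is only an estimate, a head + tail cut sized from an average density can still land over budget when the kept ends are denser than the middle (for example, CJK-heavy front and back matter around English body text). The sketch below factors the fallback into a pure function that re-checks the estimate after cutting. truncateHeadTail is a hypothetical helper name, and the 10% shrink step is an arbitrary assumption.

```typescript
// Same estimate as the Route Handler: CJK at 1.5 chars/token, everything else at 4.
function estimateTokens(text: string): number {
  const cjkChars = (text.match(/[\u4e00-\u9fff]/g) || []).length;
  return Math.ceil(cjkChars / 1.5 + (text.length - cjkChars) / 4);
}

const OMITTED = '\n\n[...middle content omitted because it exceeds the context window...]\n\n';

// Hypothetical helper: keep head + tail, then shrink until the estimate fits the budget.
function truncateHeadTail(text: string, maxTokens: number): { text: string; truncated: boolean } {
  const total = estimateTokens(text);
  if (total <= maxTokens) return { text, truncated: false };
  // Join head and tail around the marker; slice(length - keep) is safe when keep is 0.
  const cut = (keep: number) => text.slice(0, keep) + OMITTED + text.slice(text.length - keep);
  // Start from the document's average density, spending 45% of the budget on each side.
  let keep = Math.floor((maxTokens * 0.45 * text.length) / total);
  let out = cut(keep);
  // The kept ends may be denser than average; shrink by 10% per round until it fits.
  while (keep > 0 && estimateTokens(out) > maxTokens) {
    keep = Math.floor(keep * 0.9);
    out = cut(keep);
  }
  return { text: out, truncated: true };
}
```

For budgets smaller than the omission marker itself the result can still exceed the limit; at the 800K scale used here that edge case does not arise.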
Step 4: Excel / CSV / Code Repositories
PDF uses pdf-parse. For other file types, only replace the pdfParse(buf) step in Step 3 with the corresponding parser — the prompt construction and streamText call are identical.
Excel / .xlsx — install xlsx to parse:
import * as XLSX from 'xlsx';
const wb = XLSX.read(buf, { type: 'buffer' });
const fullText = wb.SheetNames.map((name) => {
  const sheet = wb.Sheets[name];
  return `## Sheet: ${name}\n${XLSX.utils.sheet_to_csv(sheet)}`;
}).join('\n\n');
Convert each sheet to a CSV string and concatenate them. LLMs handle CSV format well. Tables with thousands of rows are well within deepseek-v4-pro's capacity.
Entire code repository — traverse the file tree and filter out node_modules / .git / dist:
import { promises as fs } from 'node:fs';
import path from 'node:path';
const SKIP_DIRS = new Set(['node_modules', '.git', 'dist', '.next', 'build']);
const ALLOW_EXT = new Set(['.ts', '.tsx', '.js', '.jsx', '.py', '.go', '.md', '.json', '.yaml']);
async function readRepo(root: string): Promise<string> {
  const parts: string[] = [];
  async function walk(dir: string) {
    const entries = await fs.readdir(dir, { withFileTypes: true });
    for (const e of entries) {
      const full = path.join(dir, e.name);
      if (e.isDirectory()) {
        if (!SKIP_DIRS.has(e.name)) await walk(full);
      } else if (ALLOW_EXT.has(path.extname(e.name))) {
        const content = await fs.readFile(full, 'utf-8');
        parts.push(`### ${path.relative(root, full)}\n\`\`\`\n${content}\n\`\`\``);
      }
    }
  }
  await walk(root);
  return parts.join('\n\n');
}
A medium-sized repository (a few thousand files, tens of thousands of lines of code) is approximately 300K–500K tokens, comfortably within deepseek-v4-pro's range.
CSV / TXT / Markdown — use buf.toString('utf-8') directly; no parsing needed.
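If one endpoint is to accept all of these file types, the parser choice has to be made per upload. A minimal sketch of that dispatch follows; pickParser and ParserKind are hypothetical names, and rejecting unknown extensions is an assumed policy rather than something the recipe prescribes.

```typescript
type ParserKind = 'pdf' | 'xlsx' | 'text' | 'unsupported';

// Plain-text formats that need no parsing step.
const TEXT_EXT = new Set(['.csv', '.txt', '.md']);

// Hypothetical mapping from file extension to the parser branches in Step 3 and Step 4.
function pickParser(filename: string): ParserKind {
  const dot = filename.lastIndexOf('.');
  const ext = dot >= 0 ? filename.slice(dot).toLowerCase() : '';
  if (ext === '.pdf') return 'pdf'; // pdfParse(buf)
  if (ext === '.xlsx') return 'xlsx'; // XLSX.read(buf, { type: 'buffer' }) + sheet_to_csv
  if (TEXT_EXT.has(ext)) return 'text'; // buf.toString('utf-8')
  return 'unsupported'; // assumed policy: reject with HTTP 400
}
```

In the Route Handler this would be called as pickParser(file.name). An extension is only a hint supplied by the client, so it should be paired with a content check before parsing.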
Step 5: Frontend — Upload and Stream the Answer
Create app/longdoc/page.tsx:
'use client';
import { useState } from 'react';
export default function LongDocQA() {
  const [file, setFile] = useState<File | null>(null);
  const [question, setQuestion] = useState('Summarize the key conclusions of this document');
  const [answer, setAnswer] = useState('');
  const [loading, setLoading] = useState(false);
  const [meta, setMeta] = useState<{ tokens?: string; truncated?: string }>({});
  async function send() {
    if (!file || loading) return;
    setAnswer('');
    setMeta({});
    setLoading(true);
    try {
      const form = new FormData();
      form.append('file', file);
      form.append('question', question);
      const res = await fetch('/api/longdoc', { method: 'POST', body: form });
      if (!res.ok || !res.body) throw new Error(`HTTP ${res.status}`);
      setMeta({
        tokens: res.headers.get('X-Doc-Tokens') || undefined,
        truncated: res.headers.get('X-Doc-Truncated') || undefined,
      });
      const reader = res.body.getReader();
      const decoder = new TextDecoder();
      let acc = '';
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        // stream: true is required; without it, multi-byte characters split across chunk boundaries produce garbled output
        acc += decoder.decode(value, { stream: true });
        setAnswer(acc);
      }
    } catch (err) {
      setAnswer(`[Error] ${err instanceof Error ? err.message : String(err)}`);
    } finally {
      setLoading(false);
    }
  }
  return (
    <div style={{ maxWidth: 800, margin: '40px auto', padding: 16 }}>
      <h1>Long Document Q&A (DeepSeek V4 Pro)</h1>
      <input
        type="file"
        accept=".pdf"
        onChange={(e) => setFile(e.target.files?.[0] || null)}
        style={{ display: 'block', margin: '12px 0' }}
      />
      <textarea
        value={question}
        onChange={(e) => setQuestion(e.target.value)}
        rows={3}
        style={{ width: '100%', padding: 8 }}
        placeholder="Ask something about this document"
      />
      <button onClick={send} disabled={!file || loading} style={{ marginTop: 12 }}>
        {loading ? 'Generating... (first token may take 30s+ for long documents)' : 'Send'}
      </button>
      {meta.tokens && (
        <div style={{ marginTop: 12, color: '#666', fontSize: 13 }}>
          Document: ~{meta.tokens} tokens
          {meta.truncated === '1' && ' · Exceeded limit — automatically truncated (head + tail kept)'}
        </div>
      )}
      <div
        style={{
          marginTop: 16,
          padding: 12,
          background: '#f5f5f5',
          whiteSpace: 'pre-wrap',
          minHeight: 200,
        }}
      >
        {answer || '(waiting for response)'}
      </div>
    </div>
  );
}
Key details:
- accept=".pdf" is only a UI hint, not a security control. The backend Route Handler must also validate MIME type / magic bytes; this is mandatory for production
- Time-to-first-token for long documents can reach 30s+ (the model must ingest the full text before generating output). The loading button label should explicitly say "first token may take 30s+" so users don't think the app is frozen
- decoder.decode(value, { stream: true }): stream: true must not be omitted; the reason is the same as Step 4 in add-ai-nextjs
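The magic-bytes validation mentioned above can be as small as comparing the first five bytes of the upload against the PDF signature. This is a sketch under stated assumptions: looksLikePdf is a hypothetical helper, and the check is strict, so a PDF with stray bytes before its header would be rejected.

```typescript
// PDF files begin with the ASCII bytes "%PDF-" (0x25 0x50 0x44 0x46 0x2D).
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Hypothetical helper: true when the buffer starts with the PDF signature.
function looksLikePdf(bytes: Uint8Array): boolean {
  if (bytes.length < PDF_MAGIC.length) return false;
  return PDF_MAGIC.every((b, i) => bytes[i] === b);
}
```

A Node Buffer is a Uint8Array, so the Route Handler could call looksLikePdf(buf) right after building the buffer and return HTTP 400 before handing anything to pdf-parse.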
Verification
- Start npm run dev and open http://localhost:3000/longdoc in a browser
- Upload a multi-page PDF (e.g. a company earnings report) and ask "What are the key revenue changes in this report?"
- In the Network panel, the /api/longdoc response headers should show X-Doc-Tokens with the token count, and X-Doc-Truncated: 0 indicating no truncation was triggered
- After 10–40 seconds, streaming output should begin, and the answer should appear incrementally in the UI
- In CloudBase Console → AI+ → Call Records, you should see the input + output token count for that call, with the model field showing deepseek-v4-pro
- Upload a very large document that exceeds 800K tokens (e.g. an unpacked source code directory or a long novel): you should see X-Doc-Truncated: 1 and the UI should display "Exceeded limit — automatically truncated"
Common Errors
| Error / Symptom | Cause | Fix |
|---|---|---|
| timeout / request hangs around 60s | Node SDK default timeout: 15000 is always exceeded in long-document scenarios; or the Next.js Route Handler's own maxDuration was not increased | Both tcb.init({ env, timeout: 120000 }) and export const maxDuration = 180 must be set |
| model not found / model deepseek-v4-pro is not supported | The pro model is not enabled in this environment, or it was misspelled as deepseek-v4-flash | Check "Model Management" in the Console against the exact name; see Model Access. Only the pro tier supports 1M context |
| context length exceeded / prompt too long | Token estimate was too low and the actual content exceeded the model limit | Lower MAX_INPUT_TOKENS from 800K to 700K and observe; the estimate function is lenient for pure-English documents, so for English-only content use 3.5 characters/token |
| pdf-parse returns an empty text field | The uploaded PDF is a scanned / image-based PDF with no text layer | pdf-parse can only read text layers; scanned documents require OCR first (use an external OCR service or Tencent Cloud OCR API) |
| After deploying to Vercel / CloudBase Run: cannot find module '@cloudbase/node-sdk' or XMLHttpRequest is not defined | The Route Handler uses export const runtime = 'edge'; the Edge Runtime lacks the full Node.js API | Change to runtime = 'nodejs'; both the SDK and pdf-parse depend on Node Buffer / fs, which are unavailable in the Edge Runtime |
| Streaming response breaks midway | An exception thrown by streamText inside for await was not forwarded to controller.error(err) | The try/catch must convert exceptions to controller.error(err) — do not throw (once the Response has been sent, a throw cannot reach the client) |
| Garbled characters (���) in the frontend stream | TextDecoder.decode(value) was called without { stream: true }, so multi-byte characters split across chunk boundaries are corrupted | Change to decoder.decode(value, { stream: true }) |
| Answer is clearly unrelated to the document | Truncation removed the middle sections where the relevant information was located | If the user's question is strongly related to the middle of the document, do not use this recipe; use add-rag-with-pgvector-cloudbase to let the retriever precisely recall the middle sections |
| Excel file produces garbled output for the LLM | The xlsx library outputs binary by default; sheet_to_csv / sheet_to_json must be called explicitly | See Step 4 code; use XLSX.utils.sheet_to_csv(sheet) |
| Token cost per call is ten times higher than expected | Stuffing 800K tokens as input in one call, with deepseek-v4-pro priced several times higher per token than flash | This is expected behavior; if cost is unacceptable, switch to RAG — only the retrieved few thousand tokens are sent to the prompt |
For the complete error code reference, see https://docs.cloudbase.net/error-code/.
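The garbled-characters row is easy to reproduce in isolation. The sketch below splits the three UTF-8 bytes of "中" across two chunks and decodes them with and without { stream: true }; decodeChunks is a hypothetical helper written only for this demonstration.

```typescript
// "中" is three bytes in UTF-8: 0xE4 0xB8 0xAD. Split it across two chunks.
const chunk1 = new Uint8Array([0x41, 0xe4]); // "A" + first byte of "中"
const chunk2 = new Uint8Array([0xb8, 0xad, 0x42]); // remaining two bytes + "B"

// Decodes the chunks in order, optionally keeping partial sequences between calls.
function decodeChunks(chunks: Uint8Array[], streaming: boolean): string {
  const decoder = new TextDecoder();
  let acc = '';
  for (const c of chunks) {
    acc += streaming ? decoder.decode(c, { stream: true }) : decoder.decode(c);
  }
  return acc;
}

const good = decodeChunks([chunk1, chunk2], true); // "A中B"
const bad = decodeChunks([chunk1, chunk2], false); // contains U+FFFD replacement characters
console.log(good, bad.includes('\uFFFD'));
```

With streaming enabled the decoder buffers the incomplete byte until the next chunk arrives; without it, each call is treated as a complete input and the partial sequence is replaced.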
Billing Notes
- deepseek-v4-pro input + output token unit prices are both higher than deepseek-v4-flash. A single 1M-token input call can consume the equivalent of dozens of ordinary flash conversations. Check the current unit prices disclosed in the Console before deploying
- Newly provisioned environments receive free token trial credits for the first month (see the Console billing page for the actual quota), enough to test a dozen or more million-scale requests
- Querying the same long document at high frequency causes token costs to accumulate rapidly, since the full text must be re-stuffed on every request. RAG is the correct solution for this scenario (only the retrieved few thousand tokens are sent per query)
- The Route Handler is itself the backend proxy layer, so credentials never reach the browser, but the /api/longdoc endpoint is still exposed to the public internet. Long-document endpoints are major token consumers, so you must add authentication (see add-auth-web-with-cloudbase-sdk) and per-UID rate limiting (e.g. maximum 5 large-document calls per user per hour) to avoid abuse
- Add .env.local to .gitignore; never commit credentials to the repository. In production, store secrets in your organization's secrets management service
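The per-UID limit mentioned above ("maximum 5 large-document calls per user per hour") could be prototyped with a sliding window kept in memory. This is a sketch only: PerUserRateLimiter is a hypothetical class, and an in-memory map works only within a single long-lived process, so a deployment with multiple or short-lived instances would need a shared store instead.

```typescript
// Hypothetical in-memory sliding-window limiter: at most `limit` calls per `windowMs` per user.
class PerUserRateLimiter {
  private readonly calls = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit = 5, windowMs = 60 * 60 * 1000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the call if the user is under the limit; false otherwise.
  tryAcquire(uid: string, now: number = Date.now()): boolean {
    // Drop timestamps that have fallen out of the window
    const recent = (this.calls.get(uid) ?? []).filter((t) => now - t < this.windowMs);
    if (recent.length >= this.limit) {
      this.calls.set(uid, recent);
      return false;
    }
    recent.push(now);
    this.calls.set(uid, recent);
    return true;
  }
}
```

In the Route Handler, a rejected tryAcquire(uid) would typically map to an HTTP 429 response returned before the file is parsed, so a blocked request costs no tokens.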
Related Documentation
- add-rag-with-pgvector-cloudbase: the alternative path. When documents are continuously growing / queries are high-frequency / precise attribution is required, use embedding + vector store + retrieval; only the retrieved chunks are sent to the prompt
- add-ai-nextjs: basic conversational AI with Next.js + CloudBase AI; the Route Handler pattern in this recipe is inherited directly from that one. For short conversations, use deepseek-v4-flash via that recipe
- add-ai-wechat-miniprogram: the equivalent CloudBase AI implementation for WeChat Mini Programs
- Model Access: full list of deepseek-v4-pro / deepseek-v4-flash / Hunyuan / Kimi / GLM and other models with context lengths
- SDK Initialization and Invocation: official initialization guide for app.ai() on the Node.js server side
- SDK API Reference: complete signatures for createModel / streamText
- pdf-parse npm: Node.js PDF text extraction; the parser used in this recipe