
Long Document Q&A with DeepSeek V4 1M-Token Context (No RAG)

In one sentence: the user uploads a PDF / entire codebase / Excel file; the backend parses it into plain text with pdf-parse and stuffs the full content into the prompt; then calls CloudBase AI app.ai().createModel('cloudbase').streamText({ model: 'deepseek-v4-pro' }) to answer in one shot — no RAG embedding pipeline, no vector database, no retrieval step.

Estimated time: 30 minutes | Difficulty: Advanced

Applicable Scenarios

  • One-off Q&A on a single long document: annual report PDFs, multi-page academic papers, legal contracts, white papers — ask and move on
  • Having AI read an entire repository (tens of thousands of lines) for architecture understanding, bug finding, or documentation generation
  • Data Q&A on large Excel / CSV tables (thousands of rows) — serialize the table as plain text and feed it in
  • Early-stage internal knowledge base where the total document count is stable at a few dozen files and a vector database is not yet necessary

Not applicable:

  • Short documents (under a few thousand tokens): paying for deepseek-v4-pro's 1M context here is wasteful; use deepseek-v4-flash via the add-ai-nextjs recipe instead
  • Continuously growing documents / multi-document retrieval / need for fine-grained chunk citations — use add-rag-with-pgvector-cloudbase; embedding + vector store + retrieval is designed for "large knowledge bases + precise attribution"
  • Documents exceeding the 800K-token soft limit (rough estimate: 1 token ≈ 1.5 Chinese characters / 4 English characters): sending documents beyond this limit leads to truncation or errors, so a fallback strategy is required
  • A single document that is queried at high frequency (the same PDF queried hundreds of times per day) — re-stuffing the full text every time causes token costs to explode; RAG is more economical in this case
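Applying the rough ratios above, the 800K-token soft limit corresponds approximately to the following character budgets. The first line is for pure Chinese text and the second for pure English text; mixed text falls in between. These are estimates only, because real tokenizers vary.

```latex
\begin{aligned}
800{,}000\ \text{tokens} \times 1.5\ \tfrac{\text{chars}}{\text{token}} &= 1{,}200{,}000\ \text{Chinese characters} \\
800{,}000\ \text{tokens} \times 4\ \tfrac{\text{chars}}{\text{token}} &= 3{,}200{,}000\ \text{English characters}
\end{aligned}
```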

Which Path to Take: Decision Tree

| Your Scenario | Recommended Approach |
|---|---|
| Total document size < 800K tokens, one-off queries | This recipe (long-context all-in-one) |
| Total document size < 800K tokens, but a single document is queried hundreds of times | RAG (add-rag-with-pgvector-cloudbase) |
| Total document size > 800K tokens, or documents are continuously growing | RAG |
| Need precise citations ("this sentence is from page X, paragraph Y") | RAG (retrieved chunks carry their own index) |
| Very short documents (under a few thousand tokens) | Go directly to add-ai-nextjs with a flash model; 1M context is not needed |
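To make the decision table above concrete, here is a minimal sketch that encodes it as a function. `chooseApproach`, the `Scenario` fields, and both thresholds are hypothetical. "A few thousand tokens" is approximated as 5,000, and "hundreds of times" is approximated as 100 or more queries. The table does not say which row wins when several apply, so the order of checks is only one reading of it.

```typescript
// Hypothetical encoding of the decision table; thresholds are assumptions, not official values.
type Approach = 'long-context' | 'rag' | 'flash-short';

interface Scenario {
  totalTokens: number;        // estimated size of all documents together
  queriesPerDocument: number; // expected number of queries against the same document
  needsCitations: boolean;    // "page X, paragraph Y" style attribution required
  growing: boolean;           // documents are continuously added
}

const SOFT_LIMIT_TOKENS = 800_000;
const SHORT_DOC_TOKENS = 5_000;  // assumption for "a few thousand tokens"
const HIGH_REUSE_QUERIES = 100;  // assumption for "hundreds of times"

function chooseApproach(s: Scenario): Approach {
  // Size and growth rule out long context first
  if (s.totalTokens > SOFT_LIMIT_TOKENS || s.growing) return 'rag';
  // Short documents never need the 1M context
  if (s.totalTokens < SHORT_DOC_TOKENS) return 'flash-short';
  // Precise attribution needs retrieved chunks with their own index
  if (s.needsCitations) return 'rag';
  // Re-sending the full text hundreds of times is where RAG becomes cheaper
  if (s.queriesPerDocument >= HIGH_REUSE_QUERIES) return 'rag';
  return 'long-context';
}
```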

In short: this recipe is simpler engineering but costs more per query; RAG is more complex engineering but costs less per query. Choose based on your document reuse rate.
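The "choose based on reuse rate" trade-off can be checked with simple token arithmetic. The sketch below is a hypothetical cost model that counts input tokens only. Long context re-sends the whole document on every query. RAG is assumed to embed the document once and then send only the retrieved chunks (`retrievedTokens`, assumed to be a few thousand) per query.

```typescript
// Hypothetical token-only cost model; it ignores unit-price differences between
// embedding and chat tokens, as well as RAG's storage and engineering costs.
interface CostInputs {
  docTokens: number;       // size of the document
  queries: number;         // how many times the document is queried
  retrievedTokens: number; // prompt tokens per RAG query (assumption: a few thousand)
}

// Long context: the full document is stuffed into every prompt
function longContextInputTokens({ docTokens, queries }: CostInputs): number {
  return docTokens * queries;
}

// RAG: one embedding pass over the document, then only retrieved chunks per query
function ragInputTokens({ docTokens, queries, retrievedTokens }: CostInputs): number {
  return docTokens + retrievedTokens * queries;
}
```

For a 500,000-token document with 4,000 retrieved tokens per query, a single query costs 500,000 tokens with long context and 504,000 with RAG. At 100 queries, the totals are 50,000,000 and 900,000. This is why reuse rate, not document size alone, decides the approach.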

Prerequisites

| Dependency | Version |
|---|---|
| Next.js | 14+ (App Router) |
| @cloudbase/node-sdk | 3.16.0 or higher (required by the AI module) |
| pdf-parse | ^1.1.1 (parses a PDF buffer into plain text) |
| Node.js | 18.17+ |
| Route Handler runtime | Must be nodejs; edge is not supported |
| CloudBase environment | Provisioned, with "AI+" capability enabled in the Console |
| Model | deepseek-v4-pro (1M context is exclusive to the pro tier; flash does not support it) |

The server side must use @cloudbase/node-sdk. Do not use @cloudbase/js-sdk + signInAnonymously(). Anonymous Web SDK calls are aggressively rate-limited by default (see Web SDK Security Policy), and the long-document scenario involves requests that run for tens of seconds — the Web SDK path simply does not work here.

Step 1: Confirm deepseek-v4-pro Is Available in the Console

  1. Open the CloudBase Console → select your environment → AI+ → Model Management
  2. Confirm that deepseek-v4-pro appears in the available model list (see full list at Model Access)
  3. deepseek-v4-pro is the pro tier; 1M context is its exclusive capability. deepseek-v4-flash is the short-context, cost-optimized variant and cannot be used for this recipe
  4. Token billing is calculated separately for "input + output"; the pro tier is priced higher per token than flash. A single 1M-token input can consume the equivalent of dozens of flash conversations. Check the current unit prices in the Console before deploying

Step 2: Install Dependencies and Configure Environment Variables

npm install @cloudbase/node-sdk pdf-parse
# Types (optional): npm install -D @types/pdf-parse

.env.local:

CLOUDBASE_ENV=your-env-id
TENCENTCLOUD_SECRETID=your-secret-id
TENCENTCLOUD_SECRETKEY=your-secret-key

Identical to Step 2 in add-ai-nextjs — none of the three variables have a NEXT_PUBLIC_ prefix; the SDK only runs inside the server-side Route Handler. Credentials come from Tencent Cloud Console → API Keys; for production, use a sub-account key with a CAM policy scoped to the current environment.
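A missing variable otherwise surfaces only as an obscure SDK error on the first request. The following is a minimal fail-fast sketch that checks for the three variables. `assertServerEnv` and `missingEnv` are hypothetical helpers, not part of the CloudBase SDK. You would call `assertServerEnv(process.env)` in the Route Handler module, for example at the top of `getAi()`.

```typescript
// Hypothetical startup check for the server-only variables listed in .env.local.
const REQUIRED_ENV = ['CLOUDBASE_ENV', 'TENCENTCLOUD_SECRETID', 'TENCENTCLOUD_SECRETKEY'] as const;

type Env = Record<string, string | undefined>;

// Returns the names of required variables that are unset or empty
function missingEnv(env: Env): string[] {
  return REQUIRED_ENV.filter((name) => !env[name]);
}

function assertServerEnv(env: Env): void {
  const missing = missingEnv(env);
  if (missing.length > 0) {
    throw new Error(`Missing server-side env vars: ${missing.join(', ')}`);
  }
  // Next.js inlines NEXT_PUBLIC_ variables into the browser bundle, so these must never use that prefix
  const exposed = REQUIRED_ENV.filter((name) => env[`NEXT_PUBLIC_${name}`] !== undefined);
  if (exposed.length > 0) {
    throw new Error(`Remove the NEXT_PUBLIC_ prefix from: ${exposed.join(', ')}`);
  }
}
```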

Step 3: Write the Route Handler — Parse PDF and Stuff the Full Text into the Prompt

Create app/api/longdoc/route.ts:

import tcb from '@cloudbase/node-sdk';
import pdfParse from 'pdf-parse';

export const runtime = 'nodejs'; // Required: edge cannot run the SDK or pdf-parse
export const maxDuration = 180; // Max function duration in seconds, honored by hosts such as Vercel; the platform default may be too short for long documents

let app: ReturnType<typeof tcb.init> | null = null;

function getAi() {
  if (!app) {
    // timeout 120000 (120s): time-to-first-token for million-scale inputs can exceed 30s; the SDK default of 15s will always time out
    app = tcb.init({ env: process.env.CLOUDBASE_ENV!, timeout: 120000 });
  }
  return app.ai();
}

// 1 token ≈ 1.5 Chinese characters / 4 English characters; leave 20% headroom in the 1M context, so the soft limit is 800K tokens
const MAX_INPUT_TOKENS = 800_000;

function estimateTokens(text: string): number {
  // Simplified estimate: CJK characters at 1.5 chars/token, all other characters at 4 chars/token; the two counts are summed
  const cjkChars = (text.match(/[\u4e00-\u9fff]/g) || []).length;
  const otherChars = text.length - cjkChars;
  return Math.ceil(cjkChars / 1.5 + otherChars / 4);
}

export async function POST(req: Request) {
  const form = await req.formData();
  const file = form.get('file') as File | null;
  const question = (form.get('question') as string) || 'Please summarize this document';

  if (!file) {
    return new Response(JSON.stringify({ error: 'file required' }), {
      status: 400,
      headers: { 'Content-Type': 'application/json' },
    });
  }

  // 1. Convert File to Buffer; pdf-parse accepts a Buffer
  const buf = Buffer.from(await file.arrayBuffer());

  // 2. PDF → plain text (pdf-parse concatenates pages automatically; layout is lost but the text layer is preserved)
  const parsed = await pdfParse(buf);
  const fullText = parsed.text;

  // 3. Token estimate + truncation fallback
  const tokens = estimateTokens(fullText);
  const truncated = tokens > MAX_INPUT_TOKENS;
  let docText = fullText;
  if (truncated) {
    // Simple fallback: keep head + tail (assuming key information is at the beginning and end)
    // A smarter approach would summarize the middle with an LLM; for anything more elaborate, use RAG
    const cjkRatio = (fullText.match(/[\u4e00-\u9fff]/g) || []).length / fullText.length;
    const charsPerToken = cjkRatio > 0.5 ? 1.5 : 4;
    const keepChars = Math.floor(MAX_INPUT_TOKENS * charsPerToken * 0.45); // keep 45% of the budget from the head and 45% from the tail
    docText =
      fullText.slice(0, keepChars) +
      '\n\n[...middle content omitted because it exceeds the context window...]\n\n' +
      fullText.slice(-keepChars);
  }

  // 4. Build prompt + streamText
  const ai = getAi();
  const model = ai.createModel('cloudbase');

  const result = await model.streamText({
    model: 'deepseek-v4-pro', // 1M context requires pro; flash has a short context window
    messages: [
      {
        role: 'system',
        content:
          'You are a rigorous long-document analysis assistant. Answer questions strictly based on the provided "Reference Document". ' +
          'If information is not present in the document, say "Not mentioned in the document" — do not fabricate. ' +
          'Answer in the same language as the question. When referencing numbers, dates, or clauses, quote the original wording.',
      },
      {
        role: 'user',
        content: `Reference Document (${file.name}, approximately ${tokens} tokens):\n\n${docText}\n\nQuestion: ${question}`,
      },
    ],
  });

  // 5. AsyncIterable → ReadableStream; the frontend reads it with bare fetch
  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      try {
        for await (const chunk of result.textStream) {
          controller.enqueue(encoder.encode(chunk));
        }
        controller.close();
      } catch (err) {
        // Forward errors to the stream; a throw here cannot reach the client once the Response is sent
        controller.error(err);
      }
    },
  });

  return new Response(stream, {
    headers: {
      'Content-Type': 'text/plain; charset=utf-8',
      'X-Doc-Tokens': String(tokens),
      'X-Doc-Truncated': truncated ? '1' : '0',
    },
  });
}

Key points:

  • tcb.init({ ..., timeout: 120000 }) must be set explicitly to 120s. Time-to-first-token for million-scale inputs can reach 30s+, so the SDK's default 15s timeout always expires in long-document scenarios; 120s provides comfortable headroom
  • MAX_INPUT_TOKENS = 800_000 is a soft limit. deepseek-v4-pro nominally supports 1M tokens, but in practice you should leave 20% for output tokens + system prompt + safety margin
  • The fallback strategy here is the simplest possible "head + tail" truncation. In production you can layer in: summarizing the middle sections (call flash once to generate a summary), or using embedding to coarsely filter to the top-N most relevant sections for the user query (at which point this is already a simplified RAG — consider going directly to add-rag-with-pgvector-cloudbase)
  • Do not use result.dataStream — it carries chunk metadata intended for the Vercel AI SDK protocol. This recipe uses bare fetch on the frontend and only needs plain text increments
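The fallback point above mentions coarsely filtering to the sections most relevant to the user's question. Below is a minimal sketch of that idea. It deliberately **swaps in lexical term matching for embeddings**, so it has no notion of synonyms. Because `\W+` splits away CJK characters, it only works for space-separated languages such as English. `selectRelevantSections` and its default 4,000-character section size are hypothetical choices.

```typescript
// Hypothetical alternative to head + tail truncation: score fixed-size sections by how many
// question terms they contain, keep the best ones within a character budget, then restore order.
function selectRelevantSections(
  text: string,
  question: string,
  budgetChars: number,
  sectionChars = 4_000,
): string {
  // Distinct lowercase question terms of 3+ characters (English-style tokenization only)
  const terms = Array.from(
    new Set(question.toLowerCase().split(/\W+/).filter((t) => t.length >= 3)),
  );
  const sections: { index: number; body: string; score: number }[] = [];
  for (let i = 0; i < text.length; i += sectionChars) {
    const body = text.slice(i, i + sectionChars);
    const lower = body.toLowerCase();
    const score = terms.reduce((n, t) => n + (lower.includes(t) ? 1 : 0), 0);
    sections.push({ index: sections.length, body, score });
  }
  // Highest score first; ties go to the earlier section
  const ranked = [...sections].sort((a, b) => b.score - a.score || a.index - b.index);
  const kept: typeof sections = [];
  let used = 0;
  for (const s of ranked) {
    if (used + s.body.length > budgetChars) continue; // skip sections that no longer fit
    kept.push(s);
    used += s.body.length;
  }
  // Restore document order so the model reads the kept sections in sequence
  kept.sort((a, b) => a.index - b.index);
  return kept.map((s) => s.body).join('\n\n[...]\n\n');
}
```

In the route, this could replace the head + tail branch with `selectRelevantSections(fullText, question, keepChars * 2)`. If you find yourself tuning this further, that is the point at which add-rag-with-pgvector-cloudbase is the better tool.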

Step 4: Excel / CSV / Code Repositories

PDF uses pdf-parse. For other file types, only replace the pdfParse(buf) step in Step 3 with the corresponding parser — the prompt construction and streamText call are identical.

Excel / .xlsx — install xlsx to parse:

import * as XLSX from 'xlsx';

// buf is the Buffer built from the uploaded file in Step 3
const wb = XLSX.read(buf, { type: 'buffer' });
const fullText = wb.SheetNames.map((name) => {
  const sheet = wb.Sheets[name];
  return `## Sheet: ${name}\n${XLSX.utils.sheet_to_csv(sheet)}`;
}).join('\n\n');

Convert each sheet to a CSV string and concatenate them. LLMs handle CSV format well. Tables with thousands of rows are well within deepseek-v4-pro's capacity.

Entire code repository — traverse the file tree and filter out node_modules / .git / dist:

import { promises as fs } from 'node:fs';
import path from 'node:path';

const SKIP_DIRS = new Set(['node_modules', '.git', 'dist', '.next', 'build']);
const ALLOW_EXT = new Set(['.ts', '.tsx', '.js', '.jsx', '.py', '.go', '.md', '.json', '.yaml']);

// Recursively collects allowed source files under root, each wrapped in a path heading + fenced block
async function readRepo(root: string): Promise<string> {
  const parts: string[] = [];
  async function walk(dir: string) {
    const entries = await fs.readdir(dir, { withFileTypes: true });
    for (const e of entries) {
      const full = path.join(dir, e.name);
      if (e.isDirectory()) {
        if (!SKIP_DIRS.has(e.name)) await walk(full);
      } else if (ALLOW_EXT.has(path.extname(e.name))) {
        const content = await fs.readFile(full, 'utf-8');
        parts.push(`### ${path.relative(root, full)}\n\`\`\`\n${content}\n\`\`\``);
      }
    }
  }
  await walk(root);
  return parts.join('\n\n');
}

A medium-sized repository (a few thousand files, tens of thousands of lines of code) is approximately 300K–500K tokens, comfortably within deepseek-v4-pro's range.
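Repository sizes vary widely, so it can help to check the budget before concatenating everything. The following is a hypothetical pre-check that you could feed with paths and sizes, for example from `(await fs.stat(full)).size` during the same walk. It assumes source files are mostly ASCII, so that byte size ≈ character count, and it ignores the small overhead of the per-file headings and fences. `selectFilesWithinBudget` is not part of any library.

```typescript
// Hypothetical greedy budget check: take files in the order given and skip any file
// that would push the running total past the character budget.
interface RepoFile {
  path: string;  // path relative to the repository root
  bytes: number; // file size in bytes (≈ characters for ASCII source)
}

interface BudgetResult {
  included: string[];
  skipped: string[];
  estimatedTokens: number;
}

function selectFilesWithinBudget(
  files: RepoFile[],
  maxTokens: number,
  charsPerToken = 4, // English/code ratio from Step 3
): BudgetResult {
  const maxChars = maxTokens * charsPerToken;
  const included: string[] = [];
  const skipped: string[] = [];
  let usedChars = 0;
  for (const f of files) {
    if (usedChars + f.bytes <= maxChars) {
      included.push(f.path);
      usedChars += f.bytes;
    } else {
      skipped.push(f.path); // too large for the remaining budget; later, smaller files may still fit
    }
  }
  return { included, skipped, estimatedTokens: Math.ceil(usedChars / charsPerToken) };
}
```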

CSV / TXT / Markdown — use buf.toString('utf-8') directly; no parsing needed.
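Because only the parsing step changes between file types, one option is to branch on the file extension before parsing. Below is a minimal sketch of that dispatch decision. `pickParser` and the extension list are hypothetical, and an extension check is a convenience only, not a security control.

```typescript
// Hypothetical dispatcher: maps an uploaded file name to the parser branch from Step 3 / Step 4.
type ParserKind = 'pdf' | 'xlsx' | 'text' | 'unsupported';

const TEXT_EXT = new Set(['.csv', '.txt', '.md']);

function pickParser(fileName: string): ParserKind {
  const dot = fileName.lastIndexOf('.');
  const ext = dot >= 0 ? fileName.slice(dot).toLowerCase() : '';
  if (ext === '.pdf') return 'pdf';          // pdfParse(buf).text
  if (ext === '.xlsx') return 'xlsx';        // XLSX.read + sheet_to_csv per sheet
  if (TEXT_EXT.has(ext)) return 'text';      // buf.toString('utf-8')
  return 'unsupported';                      // respond with 400
}
```

In the Route Handler, `switch (pickParser(file.name))` would then select one of the three text-extraction paths. The rest of the handler stays the same. The frontend's accept attribute would also need to list the extra extensions.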

Step 5: Frontend — Upload and Stream the Answer

Create app/longdoc/page.tsx:

'use client';

import { useState } from 'react';

export default function LongDocQA() {
  const [file, setFile] = useState<File | null>(null);
  const [question, setQuestion] = useState('Summarize the key conclusions of this document');
  const [answer, setAnswer] = useState('');
  const [loading, setLoading] = useState(false);
  const [meta, setMeta] = useState<{ tokens?: string; truncated?: string }>({});

  async function send() {
    if (!file || loading) return;
    setAnswer('');
    setMeta({});
    setLoading(true);

    try {
      const form = new FormData();
      form.append('file', file);
      form.append('question', question);

      const res = await fetch('/api/longdoc', { method: 'POST', body: form });
      if (!res.ok || !res.body) throw new Error(`HTTP ${res.status}`);

      setMeta({
        tokens: res.headers.get('X-Doc-Tokens') || undefined,
        truncated: res.headers.get('X-Doc-Truncated') || undefined,
      });

      const reader = res.body.getReader();
      const decoder = new TextDecoder();
      let acc = '';
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        // stream: true is required; without it, multi-byte characters split across chunk boundaries produce garbled output
        acc += decoder.decode(value, { stream: true });
        setAnswer(acc);
      }
      // Flush any bytes still buffered by the decoder
      acc += decoder.decode();
      setAnswer(acc);
    } catch (err) {
      setAnswer(`[Error] ${err instanceof Error ? err.message : String(err)}`);
    } finally {
      setLoading(false);
    }
  }

  return (
    <div style={{ maxWidth: 800, margin: '40px auto', padding: 16 }}>
      <h1>Long Document Q&amp;A (DeepSeek V4 Pro)</h1>
      <input
        type="file"
        accept=".pdf"
        onChange={(e) => setFile(e.target.files?.[0] || null)}
        style={{ display: 'block', margin: '12px 0' }}
      />
      <textarea
        value={question}
        onChange={(e) => setQuestion(e.target.value)}
        rows={3}
        style={{ width: '100%', padding: 8 }}
        placeholder="Ask something about this document"
      />
      <button onClick={send} disabled={!file || loading} style={{ marginTop: 12 }}>
        {loading ? 'Generating... (first token may take 30s+ for long documents)' : 'Send'}
      </button>
      {meta.tokens && (
        <div style={{ marginTop: 12, color: '#666', fontSize: 13 }}>
          Document: ~{meta.tokens} tokens
          {meta.truncated === '1' && ' · Exceeded limit — automatically truncated (head + tail kept)'}
        </div>
      )}
      <div
        style={{
          marginTop: 16,
          padding: 12,
          background: '#f5f5f5',
          whiteSpace: 'pre-wrap',
          minHeight: 200,
        }}
      >
        {answer || '(waiting for response)'}
      </div>
    </div>
  );
}

Key details:

  • accept=".pdf" is only a UI hint, not a security control — the backend Route Handler must also validate MIME type / magic bytes; this is mandatory for production
  • Time-to-first-token for long documents can reach 30s+ (the model must ingest the full text before generating output). The loading button label should explicitly say "first token may take 30s+" so users don't think the app is frozen
  • In decoder.decode(value, { stream: true }), the stream: true option must not be omitted. Without it, multi-byte characters split across chunk boundaries are decoded as garbage (the same reason as Step 4 in add-ai-nextjs)
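Since the server-side validation mentioned in the first point is mandatory, here is a minimal sketch of a magic-byte check. It is meant to run in the Step 3 handler right after `const buf = Buffer.from(await file.arrayBuffer())`. It relies on the fact that PDF files begin with the ASCII header `%PDF-`. The check is strict, requiring the header at byte 0, so it may reject the occasional file with leading junk that lenient readers accept. The 50 MB cap and both function names are hypothetical.

```typescript
// Hypothetical upload validation; Buffer is a Uint8Array subclass, so it can be passed directly.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d]; // ASCII "%PDF-"
const MAX_UPLOAD_BYTES = 50 * 1024 * 1024;        // assumption: choose a limit that suits your deployment

function looksLikePdf(buf: Uint8Array): boolean {
  // Strict check: the header must start at byte 0
  return buf.length >= PDF_MAGIC.length && PDF_MAGIC.every((byte, i) => buf[i] === byte);
}

// Returns an error message for the client, or null if the upload looks acceptable
function validateUpload(buf: Uint8Array): string | null {
  if (buf.length === 0) return 'empty file';
  if (buf.length > MAX_UPLOAD_BYTES) return 'file too large';
  if (!looksLikePdf(buf)) return 'not a PDF (missing %PDF- header)';
  return null;
}
```

In the handler, `const uploadError = validateUpload(buf)` followed by an early `400` response keeps non-PDF uploads away from pdf-parse. A matching header does not guarantee the rest of the file is well-formed, so pdf-parse can still throw and is worth wrapping in try/catch.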

Verification

  1. Start npm run dev and open http://localhost:3000/longdoc in a browser
  2. Upload a multi-page PDF (e.g. a company earnings report) and ask "What are the key revenue changes in this report?"
  3. In the Network panel, the /api/longdoc response headers should show X-Doc-Tokens with the token count, and X-Doc-Truncated: 0 indicating no truncation was triggered
  4. After 10–40 seconds, streaming output should begin, and the answer should appear incrementally in the UI
  5. In CloudBase Console → AI+ → Call Records, you should see the input + output token count for that call, with the model field showing deepseek-v4-pro
  6. Upload a PDF large enough to exceed 800K tokens (e.g. a PDF that bundles several full-length books; this route only parses PDFs). You should see X-Doc-Truncated: 1, and the UI should display "Exceeded limit — automatically truncated"
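Steps 3 and 6 can also be checked from a script instead of the Network panel. The sketch below is a hypothetical pair of helpers for Node 18+, which provides global fetch, FormData, Blob, and Headers. `parseDocMeta` turns the two custom response headers into typed values. `checkLongDoc` posts a file the same way the Step 5 page does.

```typescript
// Hypothetical verification helpers for the /api/longdoc endpoint.
interface DocMeta {
  tokens: number | null; // X-Doc-Tokens as a number, or null if missing / malformed
  truncated: boolean;    // true when X-Doc-Truncated is "1"
}

function parseDocMeta(headers: Headers): DocMeta {
  const raw = headers.get('X-Doc-Tokens');
  const tokens = raw !== null && /^\d+$/.test(raw) ? Number(raw) : null;
  return { tokens, truncated: headers.get('X-Doc-Truncated') === '1' };
}

// Posts a PDF, waits for the full streamed answer, and returns header metadata plus the answer text
async function checkLongDoc(baseUrl: string, pdf: Blob, question: string) {
  const form = new FormData();
  form.append('file', pdf, 'sample.pdf');
  form.append('question', question);
  const res = await fetch(`${baseUrl}/api/longdoc`, { method: 'POST', body: form });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const meta = parseDocMeta(res.headers); // headers arrive before the streamed body
  const answer = await res.text();        // may take tens of seconds for long documents
  return { meta, answer };
}
```

For example, `checkLongDoc('http://localhost:3000', new Blob([await readFile('report.pdf')]), 'What are the key revenue changes in this report?')` exercises both steps. Here `readFile` is from `node:fs/promises`.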

Common Errors

| Error / Symptom | Cause | Fix |
|---|---|---|
| timeout / request hangs around 60s | The Node SDK's default timeout (15000 ms) is always exceeded in long-document scenarios, or the Next.js Route Handler's own maxDuration was not increased | Set both tcb.init({ env, timeout: 120000 }) and export const maxDuration = 180 |
| model not found / model deepseek-v4-pro is not supported | The pro model is not enabled in this environment, or the name was misspelled (or deepseek-v4-flash was used by mistake) | Check "Model Management" in the Console against the exact name; see Model Access. Only the pro tier supports 1M context |
| context length exceeded / prompt too long | The token estimate was too low, and the actual content exceeded the model limit | Lower MAX_INPUT_TOKENS from 800K to 700K and observe. The estimate function is lenient for pure-English documents; for English-only content, use 3.5 characters/token |
| pdf-parse returns an empty text field | The uploaded PDF is a scanned / image-based PDF with no text layer | pdf-parse can only read text layers; scanned documents require OCR first (an external OCR service or the Tencent Cloud OCR API) |
| After deploying to Vercel / CloudBase Run: cannot find module '@cloudbase/node-sdk' or XMLHttpRequest is not defined | The Route Handler uses export const runtime = 'edge'; the Edge Runtime lacks the full Node.js API | Change to runtime = 'nodejs'; both the SDK and pdf-parse depend on Node Buffer / fs, which are unavailable in the Edge Runtime |
| Streaming response breaks midway | An exception thrown by streamText inside for await was not forwarded to controller.error(err) | The try/catch must convert exceptions to controller.error(err) rather than rethrowing (once the Response has been sent, a throw cannot reach the client) |
| Garbled characters (���) in the frontend stream | TextDecoder.decode(value) was called without { stream: true }, so multi-byte characters split across chunk boundaries are corrupted | Change to decoder.decode(value, { stream: true }) |
| Answer is clearly unrelated to the document | Truncation removed the middle sections where the relevant information was located | If questions target the middle of the document, do not use this recipe; use add-rag-with-pgvector-cloudbase so the retriever can recall the middle sections precisely |
| Excel file produces garbled output for the LLM | The raw .xlsx buffer (a zipped binary format) or the workbook object was put into the prompt instead of text; sheet_to_csv / sheet_to_json must be called explicitly | See the Step 4 code; use XLSX.utils.sheet_to_csv(sheet) |
| Token cost per call is ten times higher than expected | 800K input tokens are sent in one call, and deepseek-v4-pro is priced several times higher per token than flash | This is expected behavior. If the cost is unacceptable, switch to RAG, which sends only the retrieved few thousand tokens to the prompt |
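To apply the 3.5 characters/token suggestion from the table, here is a minimal variant of Step 3's `estimateTokens` with the ratios exposed as parameters. `estimateTokensWithRatio` is a hypothetical name, and its defaults reproduce the original behavior.

```typescript
// Hypothetical variant of estimateTokens() from Step 3 with configurable ratios.
function estimateTokensWithRatio(
  text: string,
  otherCharsPerToken = 4,  // use 3.5 for English-only content
  cjkCharsPerToken = 1.5,
): number {
  // Same CJK range as Step 3 (CJK Unified Ideographs only)
  const cjkChars = (text.match(/[\u4e00-\u9fff]/g) || []).length;
  const otherChars = text.length - cjkChars;
  return Math.ceil(cjkChars / cjkCharsPerToken + otherChars / otherCharsPerToken);
}
```

Calling `estimateTokensWithRatio(fullText, 3.5)` yields an estimate about 14% higher (4 / 3.5 ≈ 1.14) for English text than the default. As a result, truncation triggers earlier and leaves more real headroom.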

For the complete error code reference, see https://docs.cloudbase.net/error-code/.

Billing Notes

  • deepseek-v4-pro input + output token unit prices are both higher than deepseek-v4-flash. A single 1M-token input call can consume the equivalent of dozens of ordinary flash conversations. Check the current unit prices disclosed in the Console before deploying
  • Newly provisioned environments receive free token trial credits for the first month (see the Console billing page for the actual quota) — enough to test a dozen or more million-scale requests
  • Querying the same long document at high frequency causes token costs to accumulate rapidly, since the full text must be re-stuffed on every request. RAG is the correct solution for this scenario (only the retrieved few thousand tokens are sent per query)
  • The Route Handler is itself the backend proxy layer, so credentials never reach the browser, but the /api/longdoc endpoint is still exposed to the public internet. Long-document endpoints are major token consumers — you must add authentication (see add-auth-web-with-cloudbase-sdk) and per-UID rate limiting (e.g. maximum 5 large-document calls per user per hour) to avoid abuse
  • Add .env.local to .gitignore; never commit credentials to the repository. In production, store secrets in your organization's secrets management service
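The per-UID limit recommended above (e.g. at most 5 large-document calls per user per hour) can be sketched as a sliding-window limiter. This version is hypothetical and in-memory only. Each server instance keeps its own counts, so serverless or multi-instance deployments would need a shared store such as a database instead. The `uid` is assumed to come from whatever authentication layer you add.

```typescript
// Hypothetical in-memory sliding-window rate limiter keyed by user ID.
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  /** Records the call and returns true if `uid` is still under the limit at time `now`. */
  tryAcquire(uid: string, now: number = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    // Drop timestamps that have left the window
    const recent = (this.hits.get(uid) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.hits.set(uid, recent);
      return false;
    }
    recent.push(now);
    this.hits.set(uid, recent);
    return true;
  }
}
```

For example, a module-level `new SlidingWindowLimiter(5, 60 * 60 * 1000)` checked at the top of `POST` could return a `429` response when `tryAcquire(uid)` is false. This check would happen before any parsing or model call.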

Related Resources

  • add-rag-with-pgvector-cloudbase: the alternative path when documents are continuously growing, queries are high-frequency, or precise attribution is required. It uses embedding + vector store + retrieval, so only the retrieved chunks are sent to the prompt
  • add-ai-nextjs — basic conversational AI with Next.js + CloudBase AI; the Route Handler pattern in this recipe is inherited directly from that one. For short conversations, use deepseek-v4-flash via that recipe
  • add-ai-wechat-miniprogram — the equivalent CloudBase AI implementation for WeChat Mini Programs
  • Model Access — full list of deepseek-v4-pro / deepseek-v4-flash / Hunyuan / Kimi / GLM and other models with context lengths
  • SDK Initialization and Invocation — official initialization guide for app.ai() on the Node.js server side
  • SDK API Reference — complete signatures for createModel / streamText
  • pdf-parse npm — Node.js PDF text extraction; the parser used in this recipe