Build a Browser Voice Chatbot with Deepgram + CloudBase AI
In one sentence: the browser records a voice clip with `getUserMedia` + `MediaRecorder` → uploads it to a Cloud Function that calls Deepgram `nova-3` for STT → the transcript is fed to CloudBase AI `streamText({ model: 'deepseek-v4-flash' })` for a streamed reply → the frontend splits the stream on sentence-ending punctuation and sends each sentence to `SpeechSynthesisUtterance` to be read aloud as it arrives. A push-to-talk voice chatbot with no API keys ever touching the browser.
Estimated time: 60 minutes | Difficulty: Advanced
Applicable Scenarios
- Customer service / AI assistant products with voice interaction: user holds a button to speak and releases to wait for an answer — classic push-to-talk
- Accessibility: visually impaired users ask questions by voice and receive answers by voice, no screen required
- Hands-free scenarios: in-car, kitchen, gym — user cannot type
- You have already completed connect-deepgram-speech-to-text-cloud-function for STT and add-ai-nextjs for streamText — this recipe combines both into a production-ready end-to-end loop
Not applicable:
- Phone-grade real-time bidirectional voice (interruptions, sub-300 ms latency) — that requires WebRTC + a real-time voice API (Deepgram Live, ElevenLabs Conversational AI, OpenAI Realtime). This recipe is push-to-talk; STT uses the batch API.
- Extremely long conversations (200+ TTS turns continuously): the browser `speechSynthesis` queue has an implicit limit of around 200 entries; if it is never flushed it will start dropping items. Either call `cancel()` periodically to clear the queue or switch to a third-party TTS HTTP API.
- High-quality Chinese TTS: the browser's built-in `SpeechSynthesisUtterance` Chinese voice quality varies by OS and is generally mediocre. For better quality, proxy a third-party TTS API (Azure / Volcengine / MiniMax) from a Cloud Function.
- Word-level timestamp sync for karaoke-style highlighting: this recipe's TTS is browser speech synthesis, so there is no audio file. For subtitle sync, use Deepgram `utterances` and render them yourself.
Prerequisites
| Dependency | Version |
|---|---|
| Browser | Chrome / Edge / Safari 14.1+ (both `MediaRecorder` and `speechSynthesis` are supported); Safari may record `audio/mp4` rather than webm, so check `MediaRecorder.isTypeSupported` before requesting a `mimeType` |
| `@cloudbase/node-sdk` | 3.16.0 or above (required for the AI module; also raise the `timeout` option to 60 s) |
| `@deepgram/sdk` | ^4.x (this recipe uses the v4 API: `createClient` + `listen.prerecorded.transcribeFile`) |
| Node.js | Cloud Function runtime ≥ 18 |
| CloudBase Environment | Already created and the AI+ capability enabled in the Console |
| Deepgram account | One API key (new accounts receive $200 in free credits) |
You will need:
- A device with a microphone (a laptop's built-in mic is fine)
- An HTTPS domain or `localhost`: `getUserMedia` is blocked by browsers in non-secure contexts
- A CloudBase Environment ID (`envId`) and a Tencent Cloud sub-account key pair (SecretId / SecretKey); scoping the key to just this environment via CAM is the safest approach
End-to-End Flow
[Browser]
getUserMedia → MediaRecorder(webm/opus)
↓ Blob → base64 / FormData
↓ HTTP POST
[Web Cloud Function: voice-stt]
Buffer → Deepgram nova-3 transcribeFile
↓ transcript text
↓ HTTP response
[Browser]
fetch('/api/chat')
↓ POST { messages: [...history, { role: 'user', content: transcript }] }
[Web Cloud Function: voice-chat]
CloudBase AI streamText(deepseek-v4-flash)
↓ textStream(AsyncIterable<string>)
↓ ReadableStream(text/plain)
[Browser]
reader.read() accumulates chunks
↓ splits on sentence-ending punctuation (。 ! ? ! ?) or newline
↓ each complete sentence enqueued to TTS
SpeechSynthesisUtterance + speechSynthesis.speak()
Key design decision: STT is batch (short recording sent in one shot), AI output is streamed, TTS is also streamed. STT is not streamed because a short client recording (typically under 30 s) is simpler to upload and transcribe in one batch than to maintain a WebSocket connection. AI must stream — otherwise the user waits several seconds before hearing the first word. TTS must split on sentences — waiting for the full model output before starting TTS defeats the purpose of streaming.
Step 1: Enable AI+ in CloudBase and Obtain a Deepgram Key
This is identical to Step 1 in add-ai-nextjs and connect-deepgram-speech-to-text-cloud-function:
- CloudBase Console → Environment → AI+ → Quick Start. First-time users will see an "Enable Now" button; enabling is free, and usage is billed per token.
- In Model Management, confirm `deepseek-v4-flash` is listed (this is CloudBase's current recommended conversational model; see Model Access for the full list).
- Log in to the Deepgram Console → API Keys → Create a New API Key, set Scope to `Member`, and copy the key (it is only shown once).
This recipe uses two Web Cloud Functions:
- `voice-stt`: HTTP-triggered Web Cloud Function; receives the base64-encoded audio from the frontend and calls Deepgram to return a transcript.
- `voice-chat`: HTTP-triggered Web Cloud Function; receives the message history from the frontend and calls CloudBase AI for a streamed reply.
You can also merge them into a single function and route by path, but keeping them separate is clearer.
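If you do merge the two functions, the routing could look roughly like the sketch below. It is only an illustration: the handler bodies are placeholders, the `/stt` and `/chat` suffixes are made up, and reading the request path from `event.path` is an assumption about the runtime's event shape that you should verify.

```javascript
// Hypothetical stand-ins for the voice-stt and voice-chat handler bodies.
async function handleStt(event) {
  return { statusCode: 200, body: "stt handler" };
}
async function handleChat(event) {
  return { statusCode: 200, body: "chat handler" };
}

// Path suffix -> handler. The suffixes are illustrative.
const ROUTES = [
  ["/stt", handleStt],
  ["/chat", handleChat],
];

// Pick a handler for a request path, or null when nothing matches.
function resolveHandler(path) {
  const entry = ROUTES.find(([suffix]) => String(path || "").endsWith(suffix));
  return entry ? entry[1] : null;
}

// Single entry point. Assumption: the runtime passes the request path as event.path.
async function main(event) {
  const handler = resolveHandler(event.path);
  if (!handler) return { statusCode: 404, body: "not found" };
  return handler(event);
}
```

In a real function you would export `main` as `exports.main`. Keeping the lookup in a separate synchronous function makes the routing easy to unit-test without invoking either handler.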
Step 2: Write voice-stt (Web Cloud Function, Deepgram STT)
Create a new function directory:
mkdir voice-stt && cd voice-stt
npm init -y
npm install --save @deepgram/sdk
index.js:
const { createClient } = require("@deepgram/sdk");

const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

// CORS headers go on every response, including errors, so the browser can read them.
const CORS_HEADERS = {
  "Access-Control-Allow-Origin": "*", // replace with your frontend domain in production
  "Access-Control-Allow-Headers": "Content-Type",
  "Access-Control-Allow-Methods": "POST,OPTIONS",
};

// Build a JSON response with CORS headers attached.
function json(statusCode, payload) {
  return {
    statusCode,
    headers: { "Content-Type": "application/json", ...CORS_HEADERS },
    body: JSON.stringify(payload),
  };
}

// Web Cloud Function entry: event looks like { httpMethod, body, headers, isBase64Encoded, ... }
exports.main = async (event) => {
  // The JSON Content-Type makes the browser send an OPTIONS preflight before the POST.
  if (event.httpMethod === "OPTIONS") {
    return { statusCode: 204, headers: CORS_HEADERS };
  }
  if (event.httpMethod !== "POST") {
    return { statusCode: 405, headers: CORS_HEADERS, body: "method not allowed" };
  }

  // The browser sends audio as a base64 string; the Cloud Function decodes it back to a Buffer.
  // Using base64 + JSON avoids the size and Content-Type restrictions on raw binary bodies
  // in Web Cloud Functions.
  let audioBuffer;
  let language = "zh-CN";
  try {
    const payload = JSON.parse(event.body || "{}");
    if (!payload.audioBase64) {
      return json(400, { ok: false, error: "missing_audioBase64" });
    }
    audioBuffer = Buffer.from(payload.audioBase64, "base64");
    if (payload.language) language = payload.language;
  } catch (err) {
    return json(400, { ok: false, error: "invalid_body", message: err.message });
  }

  try {
    // SDK v4 resolves to { result, error } rather than throwing on API errors.
    const { result, error } = await deepgram.listen.prerecorded.transcribeFile(audioBuffer, {
      model: "nova-3",
      smart_format: true, // auto punctuation + number formatting
      language, // "zh-CN" / "en" / "multi" etc.
      // diarize / utterances not needed for single-speaker short questions; omitting saves 200-500 ms
    });
    if (error) throw error;

    const transcript = result?.results?.channels?.[0]?.alternatives?.[0]?.transcript || "";
    return json(200, { ok: true, transcript });
  } catch (err) {
    console.error("deepgram failed", err);
    return json(500, {
      ok: false,
      error: "deepgram_failed",
      statusCode: err.statusCode,
      message: err.message,
    });
  }
};
package.json:
{
  "name": "voice-stt",
  "main": "index.js",
  "dependencies": {
    "@deepgram/sdk": "^4.0.0"
  }
}
A few notes:
- Audio travels as base64, not multipart: after `MediaRecorder` produces a Blob, the browser calls `FileReader.readAsDataURL` to get a base64 string and POSTs it as JSON. The Cloud Function decodes it with `Buffer.from(payload.audioBase64, 'base64')`. Parsing multipart/form-data manually in a Web Cloud Function is more work than this simple JSON approach.
- Keep recordings short: Web Cloud Function request bodies are typically capped around 6 MB. base64 adds ~33% overhead, but 30 seconds of Opus at 32 kbps mono is only about 120 KB, far under the limit. For minute-long recordings, switch to "upload to Cloud Storage first → Cloud Function reads by fileID", as shown in connect-deepgram-speech-to-text-cloud-function.
- Always pass `language: "zh-CN"`: nova-3 defaults to `en`, which phonetically transcribes Chinese into meaningless English letters. For mixed Chinese/English, pass `"multi"`.
- No `diarize` / `utterances`: a single-speaker short question does not need them; omitting saves 200–500 ms.
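The size arithmetic in the notes above can be checked with a small helper. This is a rough sketch: it ignores container overhead, the 6 MB constant is the approximate figure quoted above rather than a documented limit, and the bitrate your browser's `MediaRecorder` actually uses may differ from 32 kbps.

```javascript
// Estimate the raw and base64-encoded size of a recording.
// bitrateKbps is the audio bitrate in kilobits per second.
function estimateUploadBytes(durationSeconds, bitrateKbps) {
  const rawBytes = (durationSeconds * bitrateKbps * 1000) / 8;
  // base64 encodes every 3 bytes as 4 characters (padded to a multiple of 4).
  const base64Bytes = Math.ceil(rawBytes / 3) * 4;
  return { rawBytes, base64Bytes };
}

// Assumed limit, taken from the "around 6 MB" figure; confirm it for your environment.
const ASSUMED_BODY_LIMIT_BYTES = 6 * 1024 * 1024;

function fitsInRequestBody(durationSeconds, bitrateKbps) {
  const { base64Bytes } = estimateUploadBytes(durationSeconds, bitrateKbps);
  return base64Bytes < ASSUMED_BODY_LIMIT_BYTES;
}

console.log(estimateUploadBytes(30, 32)); // { rawBytes: 120000, base64Bytes: 160000 }
console.log(fitsInRequestBody(30, 32)); // true
console.log(fitsInRequestBody(1800, 32)); // false: a 30-minute recording is too large
```

The same check could run in the browser before uploading, using `blob.size` in place of the estimate.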
Step 3: Write voice-chat (Web Cloud Function, CloudBase AI Streaming)
Create a new function directory:
mkdir voice-chat && cd voice-chat
npm init -y
npm install --save @cloudbase/node-sdk
index.js:
const tcb = require("@cloudbase/node-sdk");

// Module-level cache: reused across warm invocations of the same Web Cloud Function instance
let app = null;
function getApp() {
  if (!app) {
    // timeout 60 s: long streaming output from the model can easily exceed the 15 s default
    app = tcb.init({ env: process.env.TCB_ENV, timeout: 60000 });
  }
  return app;
}

// CORS headers go on every response, including errors, so the browser can read them.
const CORS_HEADERS = {
  "Access-Control-Allow-Origin": "*", // replace with your frontend domain in production
  "Access-Control-Allow-Headers": "Content-Type",
  "Access-Control-Allow-Methods": "POST,OPTIONS",
};

function jsonError(statusCode, payload) {
  return {
    statusCode,
    headers: { "Content-Type": "application/json", ...CORS_HEADERS },
    body: JSON.stringify(payload),
  };
}

// Web Cloud Function entry
exports.main = async (event) => {
  if (event.httpMethod === "OPTIONS") {
    return { statusCode: 204, headers: CORS_HEADERS };
  }
  if (event.httpMethod !== "POST") {
    return { statusCode: 405, headers: CORS_HEADERS, body: "method not allowed" };
  }

  let messages;
  try {
    const payload = JSON.parse(event.body || "{}");
    messages = payload.messages;
    if (!Array.isArray(messages) || messages.length === 0) {
      return jsonError(400, { ok: false, error: "missing_messages" });
    }
  } catch (err) {
    return jsonError(400, { ok: false, error: "invalid_body", message: err.message });
  }

  const ai = getApp().ai();
  const model = ai.createModel("cloudbase");

  // System prompt tuned for voice: short sentences, conversational, no markdown
  const systemMsg = {
    role: "system",
    content:
      "You are a voice assistant. The user talks to you by voice. Keep responses short and conversational. Use complete sentences ending with a period, question mark, or exclamation point. Do not use markdown, bullet points, or code blocks; plain text only.",
  };

  let result;
  try {
    result = await model.streamText({
      model: "deepseek-v4-flash",
      messages: [systemMsg, ...messages],
    });
  } catch (err) {
    // Fail with a readable JSON error if the model call cannot even start
    console.error("streamText failed to start", err);
    return jsonError(500, { ok: false, error: "ai_failed", message: err.message });
  }

  // Web Cloud Functions support returning a ReadableStream directly as body in recent runtimes
  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      try {
        for await (const chunk of result.textStream) {
          controller.enqueue(encoder.encode(chunk));
        }
        controller.close();
      } catch (err) {
        console.error("streamText failed", err);
        controller.error(err);
      }
    },
  });

  return {
    statusCode: 200,
    headers: {
      "Content-Type": "text/plain; charset=utf-8",
      "Cache-Control": "no-cache",
      ...CORS_HEADERS,
    },
    body: stream,
  };
};
package.json:
{
  "name": "voice-chat",
  "main": "index.js",
  "dependencies": {
    "@cloudbase/node-sdk": "^3.16.0"
  }
}
Notes:
- Server SDK + environment credentials: do not use `@cloudbase/js-sdk` with `signInAnonymously()` here. Anonymous identity in a Web Cloud Function is subject to strict rate limiting; see Web SDK Security Policy.
- `provider: 'cloudbase'`: the `createModel('cloudbase')` call is what routes billing and auth through CloudBase's unified AI gateway.
- `timeout: 60000`: streaming output over 30 seconds is normal; the SDK default of 15 s will cut the connection.
- CORS: both the OPTIONS preflight and the real response need `Access-Control-Allow-Origin`. Replace `*` with your specific frontend domain in production.
- If your runtime does not support returning a `ReadableStream` directly, see the compatibility fallback at the end of this recipe.
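One way to replace the wildcard origin in production is to echo back the request's `Origin` only when it is on an allowlist. The sketch below assumes the request headers arrive as a plain object on `event.headers`; the origins listed are placeholders, and `corsHeaders` is a hypothetical helper, not part of any SDK.

```javascript
// Hypothetical allowlist; replace with your real frontend origins.
const ALLOWED_ORIGINS = ["https://app.example.com", "http://localhost:8000"];

// Build CORS headers for a request.
// Unknown origins get no Allow-Origin header, so the browser blocks the response.
function corsHeaders(requestHeaders, allowed = ALLOWED_ORIGINS) {
  // Header names may arrive in any case, so normalise before the lookup.
  const lower = {};
  for (const [name, value] of Object.entries(requestHeaders || {})) {
    lower[name.toLowerCase()] = value;
  }
  const origin = lower["origin"];
  if (!origin || !allowed.includes(origin)) return {};
  return {
    "Access-Control-Allow-Origin": origin,
    "Access-Control-Allow-Headers": "Content-Type",
    "Access-Control-Allow-Methods": "POST,OPTIONS",
    Vary: "Origin", // tells caches that the response depends on the Origin header
  };
}

console.log(corsHeaders({ Origin: "https://app.example.com" }));
console.log(corsHeaders({ Origin: "https://unknown.example" })); // {}
```

Inside a handler you would spread the result into every response's `headers`, for the OPTIONS preflight as well as the POST.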
Step 4: Frontend — Record, Transcribe, Stream, and Speak
Complete single-file HTML demo (split into React/Vue components for production):
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8" />
    <title>Voice Chatbot</title>
  </head>
  <body>
    <button id="btn">Hold to Talk</button>
    <div id="status">Ready</div>
    <div id="dialog"></div>
    <script type="module">
      // Replace with your Web Cloud Function HTTPS URLs after deployment
      const STT_URL = "https://your-env.service.tcloudbase.com/voice-stt";
      const CHAT_URL = "https://your-env.service.tcloudbase.com/voice-chat";

      const btn = document.getElementById("btn");
      const statusEl = document.getElementById("status");
      const dialogEl = document.getElementById("dialog");
      let mediaRecorder = null;
      let chunks = [];
      let holding = false; // true while the button is held down
      const messages = []; // multi-turn conversation history

      // ========== Recording ==========
      btn.addEventListener("mousedown", startRecord);
      btn.addEventListener("mouseup", stopRecord);
      btn.addEventListener("mouseleave", stopRecord); // pointer dragged off the button
      btn.addEventListener("touchstart", (e) => {
        e.preventDefault();
        startRecord();
      });
      btn.addEventListener("touchend", (e) => {
        e.preventDefault();
        stopRecord();
      });

      async function startRecord() {
        holding = true;
        try {
          const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
          // The button may have been released while the permission prompt was open.
          if (!holding) {
            stream.getTracks().forEach((t) => t.stop());
            return;
          }
          chunks = [];
          // Not every browser can record webm/opus; fall back to the browser default if not.
          const preferred = "audio/webm;codecs=opus";
          const options = MediaRecorder.isTypeSupported(preferred)
            ? { mimeType: preferred }
            : undefined;
          mediaRecorder = new MediaRecorder(stream, options);
          mediaRecorder.ondataavailable = (e) => {
            if (e.data.size > 0) chunks.push(e.data);
          };
          mediaRecorder.onstop = onRecordStop;
          mediaRecorder.start();
          statusEl.textContent = "Recording...";
        } catch (err) {
          console.error("getUserMedia failed", err);
          statusEl.textContent = "Microphone error: " + err.message;
        }
      }

      function stopRecord() {
        holding = false;
        if (mediaRecorder && mediaRecorder.state !== "inactive") {
          mediaRecorder.stop();
          mediaRecorder.stream.getTracks().forEach((t) => t.stop());
        }
      }
      // ========== STT ==========
      async function onRecordStop() {
        // Use the recorder's actual container type (Safari may record audio/mp4)
        const blob = new Blob(chunks, { type: mediaRecorder.mimeType || "audio/webm" });
        if (blob.size < 1000) {
          statusEl.textContent = "Recording too short";
          return;
        }
        statusEl.textContent = "Transcribing...";

        let transcript = "";
        try {
          const audioBase64 = await blobToBase64(blob);
          const res = await fetch(STT_URL, {
            method: "POST",
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify({ audioBase64, language: "zh-CN" }),
          });
          const data = await res.json();
          if (!data.ok) throw new Error(data.error || data.message);
          transcript = data.transcript || "";
        } catch (err) {
          statusEl.textContent = "STT error: " + err.message;
          return;
        }

        if (!transcript.trim()) {
          statusEl.textContent = "Nothing heard, please try again";
          return;
        }
        appendBubble("user", transcript);
        messages.push({ role: "user", content: transcript });
        statusEl.textContent = "AI generating...";
        await streamChatAndSpeak();
      }

      function blobToBase64(blob) {
        return new Promise((resolve, reject) => {
          const reader = new FileReader();
          reader.onload = () => {
            // data:audio/webm;base64,xxxxx -> strip everything up to the comma
            const result = reader.result;
            const idx = result.indexOf(",");
            resolve(result.slice(idx + 1));
          };
          reader.onerror = () => reject(reader.error);
          reader.readAsDataURL(blob);
        });
      }
      // ========== AI streaming + sentence-by-sentence TTS ==========
      async function streamChatAndSpeak() {
        // Cancel any TTS still playing from the previous turn before starting a new one
        window.speechSynthesis.cancel();

        let res;
        try {
          res = await fetch(CHAT_URL, {
            method: "POST",
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify({ messages }),
          });
        } catch (err) {
          statusEl.textContent = "AI request failed: " + err.message;
          return;
        }
        if (!res.ok || !res.body) {
          statusEl.textContent = "AI request failed: " + res.status;
          return;
        }

        const reader = res.body.getReader();
        const decoder = new TextDecoder();
        let acc = ""; // full reply text for display
        let sentenceBuf = ""; // buffer for the current in-progress sentence
        const aiBubble = appendBubble("assistant", "");

        // Chinese and English sentence-ending punctuation + newlines
        const sentenceEnd = /[。!?!?\n]/;

        while (true) {
          const { done, value } = await reader.read();
          if (done) break;
          // stream: true is essential: without it, multi-byte UTF-8 characters split across
          // chunk boundaries produce replacement characters (U+FFFD)
          const text = decoder.decode(value, { stream: true });
          acc += text;
          aiBubble.textContent = acc;
          sentenceBuf += text;

          // Repeatedly slice complete sentences out and queue them for TTS
          let m;
          while ((m = sentenceBuf.match(sentenceEnd))) {
            const cut = m.index + 1;
            const sentence = sentenceBuf.slice(0, cut).trim();
            sentenceBuf = sentenceBuf.slice(cut);
            if (sentence) speak(sentence);
          }
        }

        // Stream ended: speak any remaining text that has no terminal punctuation
        if (sentenceBuf.trim()) speak(sentenceBuf.trim());
        messages.push({ role: "assistant", content: acc });
        statusEl.textContent = "Ready";
      }

      function speak(text) {
        const u = new SpeechSynthesisUtterance(text);
        u.lang = "zh-CN";
        u.rate = 1.05; // slightly faster, which sounds more natural for conversational voice
        u.pitch = 1;
        window.speechSynthesis.speak(u);
      }
      // ========== UI helpers ==========
      function appendBubble(role, text) {
        const div = document.createElement("div");
        div.style.padding = "8px";
        div.style.margin = "8px 0";
        div.style.background = role === "user" ? "#eef" : "#f5f5f5";
        div.style.whiteSpace = "pre-wrap";
        div.textContent = (role === "user" ? "You: " : "AI: ") + text;
        dialogEl.appendChild(div);
        // Return a small handle so callers can update the text without losing the prefix
        return {
          set textContent(t) {
            div.textContent = (role === "user" ? "You: " : "AI: ") + t;
          },
          get textContent() {
            return div.textContent;
          },
        };
      }
    </script>
  </body>
</html>
A few details worth calling out:
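Sentence-splitting regex: `/[。!?!?\n]/` matches the Chinese full stop, full-width and ASCII exclamation and question marks, and newlines. English commas are deliberately excluded, because splitting on commas produces choppy TTS output. The ASCII period is not in the set either, so a reply written entirely in English is only split at `!`, `?`, or a newline. If the conversation is primarily English, consider adding the period (taking care not to split decimals such as 3.14); adding `;` (semicolons) can also help with natural pausing.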
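The splitting logic is easier to test when pulled out into a pure function. The sketch below is one possible shape, with the hypothetical name `takeSentences`; it assumes the terminator set includes the full-width `!` and `?` alongside the ASCII ones, and it simulates a stream that breaks the text at arbitrary points.

```javascript
// Sentence terminators: CJK full stop, full-width and ASCII "!" / "?", and newline.
const SENTENCE_END = /[。!?!?\n]/;

// Pull every complete sentence out of `buffer`.
// Returns the sentences found plus the unfinished tail to keep for the next chunk.
function takeSentences(buffer) {
  const sentences = [];
  let rest = buffer;
  let match;
  while ((match = rest.match(SENTENCE_END))) {
    const cut = match.index + 1;
    const sentence = rest.slice(0, cut).trim();
    rest = rest.slice(cut);
    if (sentence) sentences.push(sentence);
  }
  return { sentences, rest };
}

// Simulate a stream whose chunk boundaries fall in the middle of sentences.
let pending = "";
const spoken = [];
for (const chunk of ["你好!今天", "天气不错。要出", "门吗?好"]) {
  const { sentences, rest } = takeSentences(pending + chunk);
  spoken.push(...sentences);
  pending = rest;
}
console.log(spoken); // [ '你好!', '今天天气不错。', '要出门吗?' ]
console.log(pending); // 好
```

Whatever is left in `pending` when the stream ends still has to be spoken, which is why the frontend flushes the buffer after the read loop.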
TTS queue is speechSynthesis itself: calling speak() multiple times enqueues them automatically — you do not need to await the previous sentence. speechSynthesis.cancel() clears the entire queue, so call it at the start of each new turn — otherwise leftover audio from the previous turn competes with the new reply.
`stream: true` is non-negotiable for multi-byte characters: a single CJK character is 3 bytes in UTF-8, and streaming chunk boundaries regularly split characters in half. Without `stream: true`, `TextDecoder` emits U+FFFD (the replacement character) for each broken sequence. With it, the decoder holds the partial bytes and completes them in the next call.
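The effect is easy to reproduce outside the browser. The following sketch, which runs in Node as well, cuts the UTF-8 bytes of a two-character string in the middle of the second character and decodes the pieces both ways.

```javascript
// "你好" is 6 bytes in UTF-8; cutting after byte 4 splits the second character.
const bytes = new TextEncoder().encode("你好");
const pieces = [bytes.slice(0, 4), bytes.slice(4)];

// Naive: each piece is decoded on its own, so the split character is lost.
const naiveDecoder = new TextDecoder();
const naive = pieces.map((p) => naiveDecoder.decode(p)).join("");

// Streaming: the decoder buffers the incomplete bytes until the next call.
const streamDecoder = new TextDecoder();
let streamed = "";
for (const p of pieces) {
  streamed += streamDecoder.decode(p, { stream: true });
}
streamed += streamDecoder.decode(); // flush anything still buffered at end of stream

console.log(naive.includes("\uFFFD")); // true
console.log(streamed === "你好"); // true
```

The final argument-less `decode()` call is the flush step; it returns an empty string here because the input ended on a character boundary.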
Multi-turn context: the messages array carries the full history in every request. The model follows OpenAI-style role: user / assistant alternation to maintain context. Because the voice system prompt enforces short, conversational answers, token consumption per turn is typically lower than text-only chat.
Push-to-talk vs. VAD: this example uses `mousedown` / `touchstart` to start and `mouseup` / `touchend` to stop, which is the most reliable approach on mobile. For hands-free turn detection with VAD (Voice Activity Detection), either include a vad-web package on the frontend or use Deepgram's WebSocket streaming API with built-in endpointing.
Step 5: Deploy
Deploy both Web Cloud Functions:
cd voice-stt
tcb fn deploy voice-stt -e your-env-id --type http
cd ../voice-chat
tcb fn deploy voice-chat -e your-env-id --type http
In the Console:
- voice-stt environment variables: `DEEPGRAM_API_KEY`; timeout 30 s; memory 256 MB
- voice-chat environment variables: `TCB_ENV` = your Environment ID; if deploying outside CloudBase (Vercel / self-hosted), also set `TENCENTCLOUD_SECRETID` + `TENCENTCLOUD_SECRETKEY` (CloudBase's own runtime injects them automatically); timeout 60 s minimum; memory 512 MB
- Both functions need an HTTPS URL generated under "Function Config → HTTP Trigger", in the form `https://your-env.service.tcloudbase.com/voice-stt` (copy the exact domain shown in the Console). Use these URLs for `STT_URL` and `CHAT_URL` in the frontend.
The frontend HTML can be deployed to CloudBase Static Hosting, or run locally with python3 -m http.server at localhost (note that getUserMedia requires HTTPS or localhost).
Running Verification
- Open the deployed HTML in a browser. It will prompt for microphone permission — click Allow.
- Hold the "Hold to Talk" button and say a question, e.g. "What is CloudBase?", then release.
- The status bar should cycle through: `Recording...` → `Transcribing...` → `AI generating...` → `Ready`.
- The dialog should show "You: What is CloudBase?" and then "AI: CloudBase is…" growing character by character.
- The browser reads the first complete sentence aloud, then continues reading as more sentences arrive. The first word should be audible within 1.5–3 seconds.
- Ask a follow-up question (e.g. "What can it do?") — the AI should answer in context, confirming multi-turn history works.
- Open DevTools → Network: `/voice-stt` returns JSON with a `transcript` field; `/voice-chat` shows a streaming response (plain text accumulating, not a single JSON blob).
- CloudBase Console → AI+ → Call Records shows token counts; Deepgram Console → Usage shows the audio duration billed.
Common Errors
| Error / Symptom | Cause | Fix |
|---|---|---|
| `getUserMedia is not a function` / permission denied | Page is not served over HTTPS or localhost; browser refuses microphone access | Use an HTTPS domain (CloudBase Static Hosting provides HTTPS automatically), or debug on localhost; corporate self-signed certs also work but must be trusted first |
| Recording Blob `size === 0` or under 1 KB | `MediaRecorder.start()` followed immediately by `stop()` before data flushes; or the browser does not support `mimeType` | Wait at least 500 ms before stopping; if `audio/webm;codecs=opus` is unsupported, omit `mimeType` and let the browser choose (Safari will produce `audio/mp4`, which Deepgram also accepts) |
| Deepgram returns `Audio decode failed` | Corrupted webm header, or `MediaRecorder` did not fully flush on stop | Assemble the Blob inside `mediaRecorder.onstop` from the `chunks` array; do not send a single raw chunk from `ondataavailable`; as a last resort, transcode with ffmpeg before uploading |
| Chinese audio transcribed as transliteration (e.g. "Da jia hao") | STT request missing `language: "zh-CN"`; nova-3 defaults to `en` | Explicitly pass `language: "zh-CN"` in both the Cloud Function and the frontend request; use `"multi"` for mixed Chinese/English |
| Chinese text in the AI stream shows `\uFFFD` (replacement characters) | `TextDecoder.decode(value)` called without `{ stream: true }`; multi-byte characters split across chunk boundaries | Change to `decoder.decode(value, { stream: true })` as shown in this recipe |
| TTS produces no sound / only the first sentence plays | `speechSynthesis` suspended after inactivity or too many `cancel()` calls; or queue interrupted | Call `speechSynthesis.cancel()` once at the start of each new turn to reset; on Chrome, `speechSynthesis` suspends after ~14 s of silence, so call `speechSynthesis.resume()` before each `speak()` if there are long pauses between turns |
| TTS cuts off mid-sentence; Chinese text breaks inside a word | Sentence-splitting regex is too aggressive (e.g. treating commas as sentence ends), or the model output contains a long uninterrupted passage without punctuation | Use only `[。!?!?\n]` in the regex; add "end each sentence with terminal punctuation" to the model's system prompt as shown in the voice-chat function above |
| CORS error after deployment | Web Cloud Function response missing `Access-Control-Allow-Origin`, or OPTIONS preflight not handled | Return 204 + CORS headers for OPTIONS; add `Access-Control-Allow-Origin: *` (or a specific domain) to POST responses as shown in this recipe |
| `streamText` times out after about 15 s / stream drops mid-way | Node SDK default `timeout: 15000` is too short; or the Web Cloud Function runtime has its own response time limit | Set `tcb.init({ env, timeout: 60000 })`; also raise the function timeout in the Console to at least 60 s; for very long outputs consider chunking |
| Second turn: AI does not remember the first turn | Frontend messages array did not include the assistant reply from the previous turn | After reading the stream, push { role: "assistant", content: acc } into messages as shown in this recipe |
| After many turns, TTS stops responding | speechSynthesis queue has an implicit limit (around 200 entries); a saturated queue silently drops new items | Call speechSynthesis.cancel() at the start of each user turn; for very long sessions, consider switching to a third-party TTS HTTP API |
For CloudBase-side error codes see https://docs.cloudbase.net/error-code/. For Deepgram error codes, see the Deepgram documentation.
Pricing Notes
- Deepgram STT: `nova-3` is billed per audio minute (not tokens). Pay-As-You-Go is approximately $0.0043 per minute. A 5-second question costs roughly $0.0004; 1,000 questions cost about $0.36.
- CloudBase AI: `deepseek-v4-flash` is billed per input + output token. Because the system prompt enforces short conversational answers, token cost per turn is typically lower than text-only chat. New environments receive 1 million tokens free for the first month (check the Console billing page for current figures).
- Web Cloud Functions: billed per invocation + resource consumption (GB-seconds). Each question involves 2 function calls (stt + chat), billed at standard CloudBase rates.
- TTS is free: `SpeechSynthesisUtterance` uses the speech engine provided by the browser or OS, so no per-request billing applies. Some voices (for example Chrome's Google voices) are synthesized remotely rather than on the device, but they are still free to use.
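The STT figures can be reproduced with a few lines of arithmetic. The price constant below is the approximate Pay-As-You-Go rate quoted above, not an authoritative number, and the sketch assumes billing is strictly proportional to audio length with no per-request minimum.

```javascript
// Assumed price from the text above; check Deepgram's pricing page for the current figure.
const ASSUMED_USD_PER_MINUTE = 0.0043;

// Cost of transcribing a clip, assuming billing is proportional to duration.
function sttCostUsd(audioSeconds, usdPerMinute = ASSUMED_USD_PER_MINUTE) {
  return (audioSeconds / 60) * usdPerMinute;
}

const perQuestion = sttCostUsd(5);
console.log(perQuestion.toFixed(5)); // 0.00036 (one 5-second question)
console.log((perQuestion * 1000).toFixed(2)); // 0.36 (1,000 such questions)
```

Rounding the per-question cost before multiplying gives a slightly higher total, so keep full precision until the final step when estimating a monthly bill.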
Web Cloud Function Streaming Compatibility Fallback
If your Web Cloud Function runtime does not support returning a ReadableStream directly, use Node's Readable stream instead:
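// Node Readable style: replaces the ReadableStream block inside exports.main
// (supported by some runtimes as a streaming response body)
const { Readable } = require("stream");

const stream = new Readable({
  read() {}, // data is pushed from the loop below, so read() has nothing to do
});

// Pump the model output into the stream without blocking the return below
(async () => {
  try {
    for await (const chunk of result.textStream) {
      stream.push(chunk);
    }
    stream.push(null); // signal end of stream
  } catch (err) {
    stream.destroy(err);
  }
})();

return {
  statusCode: 200,
  headers: { "Content-Type": "text/plain; charset=utf-8" },
  body: stream,
};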
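Alternatively, use SSE (`text/event-stream`): wrap each chunk as `data: xxx\n\n`. Note that `EventSource` only issues GET requests, so it cannot send the `messages` body this recipe POSTs; either keep the POST `fetch` and parse the SSE frames from the response stream yourself, or pass the conversation to the function another way. SSE adds protocol overhead, and the automatic browser reconnection that makes it more robust for long conversations applies only when the stream is consumed through `EventSource`.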
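A minimal sketch of the framing on both sides is shown below. The function names are hypothetical, and the parser is deliberately simplified: it only handles `\n` line endings and `data:` fields, and ignores the `event:`, `id:` and `retry:` fields.

```javascript
// Server side: wrap one text chunk as an SSE event.
// A chunk containing newlines must become several "data:" lines, or the frame breaks.
function toSseEvent(chunk) {
  return chunk.split("\n").map((line) => "data: " + line).join("\n") + "\n\n";
}

// Client side: pull complete events out of a buffer of received text.
// Returns the decoded payloads plus the unfinished tail to keep for the next read.
function parseSseEvents(buffer) {
  const events = [];
  let rest = buffer;
  let end;
  while ((end = rest.indexOf("\n\n")) !== -1) {
    const frame = rest.slice(0, end);
    rest = rest.slice(end + 2);
    const dataLines = frame.split("\n").filter((line) => line.startsWith("data:"));
    if (dataLines.length > 0) {
      // Drop the "data:" prefix and the single optional space that follows it.
      events.push(dataLines.map((line) => line.slice(5).replace(/^ /, "")).join("\n"));
    }
  }
  return { events, rest };
}

const wire = toSseEvent("你好。") + toSseEvent("line1\nline2");
console.log(parseSseEvents(wire).events); // [ '你好。', 'line1\nline2' ]
```

On the frontend, `parseSseEvents` would be fed the output of `decoder.decode(value, { stream: true })`, with `rest` carried over between reads in the same way as the sentence buffer.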
Related Documentation
- connect-deepgram-speech-to-text-cloud-function: STT deep dive (pull audio from Cloud Storage, batch transcription with diarization and utterances). The STT portion of this recipe is a "short recording direct upload" variant of that one.
- add-ai-nextjs: CloudBase AI `streamText` in Next.js (Route Handler + ReadableStream). The chat portion of this recipe is a direct port of that approach to a Web Cloud Function.
- add-ai-wechat-miniprogram: The same CloudBase AI integration in a Mini Program; Mini Program's built-in identity removes the need for a backend proxy.
- connect-tavily-search-cloud-function: Add live web search to the voice chatbot: call Tavily before `streamText`, inject the results into the system prompt, and build a search-augmented voice chatbot.
- add-realtime-notifications-database-watch: Persist each conversation turn (transcript + AI reply) to the database and use `watch` for multi-device sync of voice conversation history.
- secure-secrets-in-cloud-function: Layered management of `DEEPGRAM_API_KEY` and `TENCENTCLOUD_SECRETKEY` across local dev, CI, and production.
- CloudBase AI SDK - Init and Invocation: Official reference for `app.ai()` initialization.
- Model Access: Full list of models available beyond `deepseek-v4-flash`.