Build a Browser Voice Chatbot with Deepgram + CloudBase AI

In one sentence: Browser getUserMedia + MediaRecorder records a voice clip → uploads to a Cloud Function that calls Deepgram nova-3 for STT → the transcript is fed to CloudBase AI streamText({ model: 'deepseek-v4-flash' }) for a streamed reply → the frontend splits the stream on sentence-ending punctuation and sends each sentence to SpeechSynthesisUtterance to be read aloud as it arrives. A push-to-talk voice chatbot with no API keys ever touching the browser.

Estimated time: 60 minutes | Difficulty: Advanced

Applicable Scenarios

Customer service / AI assistant products with voice interaction: user holds a button to speak and releases to wait for an answer — classic push-to-talk
Accessibility: visually impaired users ask questions by voice and receive answers by voice, no screen required
Hands-free scenarios: in-car, kitchen, gym — user cannot type
You have already completed connect-deepgram-speech-to-text-cloud-function for STT and add-ai-nextjs for streamText — this recipe combines both into a production-ready end-to-end loop

Not applicable:

Phone-grade real-time bidirectional voice (interruptions, sub-300 ms latency) — that requires WebRTC + a real-time voice API (Deepgram Live, ElevenLabs Conversational AI, OpenAI Realtime). This recipe is push-to-talk; STT uses the batch API.
Extremely long conversations (200+ TTS turns continuously) — the browser speechSynthesis queue has an implicit limit around 200 entries; if it is never flushed it will start dropping items. Either call cancel() periodically to clear the queue or switch to a third-party TTS HTTP API.
High-quality Chinese TTS — the browser's built-in SpeechSynthesisUtterance Chinese voice quality varies by OS and is generally mediocre. For better quality, proxy a third-party TTS API (Azure / Volcengine / MiniMax) from a Cloud Function.
Word-level timestamp sync for karaoke-style highlighting — this recipe's TTS is browser speech synthesis; there is no audio file. For subtitle sync, use Deepgram utterances and render them yourself.

Prerequisites

Dependency	Version
Browser	Chrome / Edge / Safari 14+ (both `MediaRecorder` and `speechSynthesis` are supported); Safari on macOS before version 14 does not support `webm` in `MediaRecorder`
`@cloudbase/node-sdk`	`3.16.0` or above (required for the AI module; also raise the `timeout` option to 60 s)
`@deepgram/sdk`	`^4.x` (this recipe uses the v4 `createClient` API: `listen.prerecorded.transcribeFile`)
Node.js	Cloud Function runtime ≥ 18
CloudBase Environment	Already created and the AI+ capability enabled in the Console
Deepgram account	One API key (new accounts receive $200 in free credits)

You will need:

A device with a microphone (a laptop's built-in mic is fine)
An HTTPS domain or localhost — getUserMedia is blocked by browsers in non-secure contexts
A CloudBase Environment ID (envId) and a Tencent Cloud sub-account key pair (SecretId / SecretKey); scoping the key to just this environment via CAM is the safest approach

End-to-End Flow

[Browser]
  getUserMedia → MediaRecorder(webm/opus)
   ↓ Blob → base64 / FormData
   ↓ HTTP POST
[Web Cloud Function: voice-stt]
  Buffer → Deepgram nova-3 transcribeFile
   ↓ transcript text
   ↓ HTTP response
[Browser]
  fetch('/api/chat')
   ↓ POST { messages: [...history, { role: 'user', content: transcript }] }
[Web Cloud Function: voice-chat]
  CloudBase AI streamText(deepseek-v4-flash)
   ↓ textStream(AsyncIterable<string>)
   ↓ ReadableStream(text/plain)
[Browser]
  reader.read() accumulates chunks
   ↓ splits on 。 / ! / ? / ! / ? / newline
   ↓ each complete sentence enqueued to TTS
  SpeechSynthesisUtterance + speechSynthesis.speak()

Key design decision: STT is batch (short recording sent in one shot), AI output is streamed, TTS is also streamed. STT is not streamed because a short client recording (typically under 30 s) is simpler to upload and transcribe in one batch than to maintain a WebSocket connection. AI must stream — otherwise the user waits several seconds before hearing the first word. TTS must split on sentences — waiting for the full model output before starting TTS defeats the purpose of streaming.

Step 1: Enable AI+ in CloudBase and Obtain a Deepgram Key

This is identical to Step 1 in add-ai-nextjs and connect-deepgram-speech-to-text-cloud-function:

CloudBase Console → Environment → AI+ → Quick Start. First-time users will see an "Enable Now" button; enabling is free, and usage is billed per token.
In Model Management, confirm deepseek-v4-flash is listed (this is CloudBase's current recommended conversational model; see Model Access for the full list).
Log in to the Deepgram Console → API Keys → Create a New API Key, set Scope to Member, and copy the key (it is only shown once).

This recipe uses two Web Cloud Functions:

voice-stt — HTTP-triggered Web Cloud Function; receives the audio buffer from the frontend and calls Deepgram to return a transcript.
voice-chat — HTTP-triggered Web Cloud Function; receives the message history from the frontend and calls CloudBase AI for a streamed reply.

You can also merge them into a single function and route by path, but keeping them separate is clearer.

Step 2: Write voice-stt (Web Cloud Function, Deepgram STT)

Create a new function directory:

mkdir voice-stt && cd voice-stt
npm init -y
npm install --save @deepgram/sdk

index.js:

const { createClient } = require("@deepgram/sdk");

const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

// Web Cloud Function entry: event looks like { httpMethod, body, headers, isBase64Encoded, ... }
exports.main = async (event) => {
  // The browser sends audio as a base64 string; the Cloud Function decodes it back to a Buffer.
  // Using base64 + JSON avoids the size and Content-Type restrictions on raw binary bodies
  // in Web Cloud Functions.
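  // Answer the CORS preflight that browsers send before a cross-origin JSON POST.
  if (event.httpMethod === "OPTIONS") {
    return {
      statusCode: 204,
      headers: {
        "Access-Control-Allow-Origin": "*",
        "Access-Control-Allow-Headers": "Content-Type",
        "Access-Control-Allow-Methods": "POST,OPTIONS",
      },
    };
  }
  if (event.httpMethod !== "POST") {
    return { statusCode: 405, body: "method not allowed" };
  }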

  let audioBuffer;
  let language = "zh-CN";
  try {
    const payload = JSON.parse(event.body || "{}");
    if (!payload.audioBase64) {
      return {
        statusCode: 400,
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ ok: false, error: "missing_audioBase64" }),
      };
    }
    audioBuffer = Buffer.from(payload.audioBase64, "base64");
    if (payload.language) language = payload.language;
  } catch (err) {
    return {
      statusCode: 400,
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ ok: false, error: "invalid_body", message: err.message }),
    };
  }

  try {
    // With createClient (SDK v3/v4) the call resolves to { result, error }
    // rather than throwing on API errors, so re-throw to reach the catch block.
    const { result, error } = await deepgram.listen.prerecorded.transcribeFile(audioBuffer, {
      model: "nova-3",
      smart_format: true, // auto punctuation + number formatting
      language,           // "zh-CN" / "en" / "multi" etc.
      // diarize / utterances not needed for single-speaker short questions; omitting saves 200-500 ms
    });
    if (error) throw error;

    const transcript =
      result?.results?.channels?.[0]?.alternatives?.[0]?.transcript || "";

    return {
      statusCode: 200,
      headers: {
        "Content-Type": "application/json",
        "Access-Control-Allow-Origin": "*", // replace with your frontend domain in production
      },
      body: JSON.stringify({ ok: true, transcript }),
    };
  } catch (err) {
    console.error("deepgram failed", err);
    return {
      statusCode: 500,
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        ok: false,
        error: "deepgram_failed",
        statusCode: err.statusCode,
        message: err.message,
      }),
    };
  }
};

package.json:

{
  "name": "voice-stt",
  "main": "index.js",
  "dependencies": {
    "@deepgram/sdk": "^4.0.0"
  }
}

A few notes:

Audio travels as base64, not multipart: After MediaRecorder produces a Blob, the browser calls FileReader.readAsDataURL to get a base64 string and POSTs it as JSON. The Cloud Function decodes it with Buffer.from(payload.audioBase64, 'base64'). Parsing multipart/form-data manually in a Web Cloud Function is more work than this simple JSON approach.
Keep recordings short: Web Cloud Function request bodies are typically capped around 6 MB. base64 adds ~33% overhead, but 30 seconds of Opus at 32 kbps mono is only about 120 KB — far under the limit. For minute-long recordings, switch to "upload to Cloud Storage first → Cloud Function reads by fileID", as shown in connect-deepgram-speech-to-text-cloud-function.
Always pass language: "zh-CN": nova-3 defaults to en, which phonetically transcribes Chinese into meaningless English letters. For mixed Chinese/English, pass "multi".
No diarize / utterances: a single-speaker short question does not need them; omitting saves 200–500 ms.
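
The size estimate in the notes above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a constant bitrate and ignoring webm container overhead; `estimateUploadBytes` is a hypothetical helper name, not part of any SDK.

```javascript
// Rough upload-size estimate for a recording sent as base64 inside JSON.
// Assumes constant bitrate and ignores container (webm) overhead.
function estimateUploadBytes(seconds, kbps) {
  const rawBytes = (seconds * kbps * 1000) / 8;    // kbit/s -> bytes
  const base64Bytes = Math.ceil(rawBytes / 3) * 4; // base64 emits 4 chars per 3 bytes
  return { rawBytes, base64Bytes };
}

const clip = estimateUploadBytes(30, 32);
console.log(clip.rawBytes);    // 120000
console.log(clip.base64Bytes); // 160000
```

Even after base64 encoding, a 30-second clip stays around 160 KB, which leaves plenty of headroom under a request-body cap of a few megabytes.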

Step 3: Write voice-chat (Web Cloud Function, CloudBase AI Streaming)

Create a new function directory:

mkdir voice-chat && cd voice-chat
npm init -y
npm install --save @cloudbase/node-sdk

index.js:

const tcb = require("@cloudbase/node-sdk");

// Module-level cache — reused across warm invocations of the same Web Cloud Function instance
let app = null;
function getApp() {
  if (!app) {
    // timeout 60 s: long streaming output from the model can easily exceed the 15 s default
    app = tcb.init({ env: process.env.TCB_ENV, timeout: 60000 });
  }
  return app;
}

// Web Cloud Function entry
exports.main = async (event) => {
  if (event.httpMethod === "OPTIONS") {
    return {
      statusCode: 204,
      headers: {
        "Access-Control-Allow-Origin": "*",
        "Access-Control-Allow-Headers": "Content-Type",
        "Access-Control-Allow-Methods": "POST,OPTIONS",
      },
    };
  }
  if (event.httpMethod !== "POST") {
    return { statusCode: 405, body: "method not allowed" };
  }

  let messages;
  try {
    const payload = JSON.parse(event.body || "{}");
    messages = payload.messages;
    if (!Array.isArray(messages) || messages.length === 0) {
      return {
        statusCode: 400,
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ ok: false, error: "missing_messages" }),
      };
    }
  } catch (err) {
    return {
      statusCode: 400,
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ ok: false, error: "invalid_body", message: err.message }),
    };
  }

  const ai = getApp().ai();
  const model = ai.createModel("cloudbase");

  // System prompt tuned for voice: short sentences, conversational, no markdown
  const systemMsg = {
    role: "system",
    content:
      "You are a voice assistant. The user talks to you by voice. Keep responses short and conversational. Use complete sentences ending with a period, question mark, or exclamation point. Do not use markdown, bullet points, or code blocks — plain text only.",
  };

  const result = await model.streamText({
    model: "deepseek-v4-flash",
    messages: [systemMsg, ...messages],
  });

  // Web Cloud Functions support returning a ReadableStream directly as body in recent runtimes
  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      try {
        for await (const chunk of result.textStream) {
          controller.enqueue(encoder.encode(chunk));
        }
        controller.close();
      } catch (err) {
        console.error("streamText failed", err);
        controller.error(err);
      }
    },
  });

  return {
    statusCode: 200,
    headers: {
      "Content-Type": "text/plain; charset=utf-8",
      "Access-Control-Allow-Origin": "*",
      "Cache-Control": "no-cache",
    },
    body: stream,
  };
};

package.json:

{
  "name": "voice-chat",
  "main": "index.js",
  "dependencies": {
    "@cloudbase/node-sdk": "^3.16.0"
  }
}

Notes:

Server SDK + environment credentials: Do not use @cloudbase/js-sdk with signInAnonymously() here. Anonymous identity in a Web Cloud Function is subject to strict rate limiting; see Web SDK Security Policy.
createModel('cloudbase'): The 'cloudbase' provider argument is what routes billing and auth through CloudBase's unified AI gateway.
timeout: 60000: Streaming output over 30 seconds is normal; the SDK default of 15 s will cut the connection.
CORS: Both the OPTIONS preflight and the real response need Access-Control-Allow-Origin. Replace * with your specific frontend domain in production.
If your runtime does not support returning a ReadableStream directly, see the compatibility fallback at the end of this recipe.
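
For the CORS note above, one way to replace `*` with specific domains is to echo the request origin only when it appears in an allowlist. The sketch below is illustrative: `ALLOWED_ORIGINS` and `corsHeaders` are hypothetical names, the example origins are placeholders, and it assumes the request headers are available on `event.headers` as in the entry comment of the function.

```javascript
// Hypothetical allowlist; replace with your real frontend origins.
const ALLOWED_ORIGINS = ["https://app.example.com", "http://localhost:8000"];

// Build CORS headers for one request, echoing the origin only when it is allowed.
function corsHeaders(requestHeaders = {}) {
  // Header names may arrive in any letter case, so match case-insensitively.
  const key = Object.keys(requestHeaders).find((k) => k.toLowerCase() === "origin");
  const origin = key ? requestHeaders[key] : undefined;
  const headers = {
    "Access-Control-Allow-Headers": "Content-Type",
    "Access-Control-Allow-Methods": "POST,OPTIONS",
    Vary: "Origin", // responses differ per origin, so caches must key on it
  };
  if (origin && ALLOWED_ORIGINS.includes(origin)) {
    headers["Access-Control-Allow-Origin"] = origin;
  }
  return headers;
}

console.log(corsHeaders({ Origin: "https://app.example.com" }));
```

In a handler this could be spread into each response, for example `headers: { ...corsHeaders(event.headers), "Content-Type": "application/json" }`. Requests from unlisted origins simply receive no `Access-Control-Allow-Origin` header, so the browser blocks them.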

Step 4: Frontend — Record, Transcribe, Stream, and Speak

Complete single-file HTML demo (split into React/Vue components for production):

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8" />
    <title>Voice Chatbot</title>
  </head>
  <body>
    <button id="btn">Hold to Talk</button>
    <div id="status">Ready</div>
    <div id="dialog"></div>

    <script type="module">
      // Replace with your Web Cloud Function HTTPS URLs after deployment
      const STT_URL = "https://your-env.service.tcloudbase.com/voice-stt";
      const CHAT_URL = "https://your-env.service.tcloudbase.com/voice-chat";

      const btn = document.getElementById("btn");
      const statusEl = document.getElementById("status");
      const dialogEl = document.getElementById("dialog");

      let mediaRecorder = null;
      let chunks = [];
      const messages = []; // multi-turn conversation history

      // ========== Recording ==========
      btn.addEventListener("mousedown", startRecord);
      btn.addEventListener("mouseup", stopRecord);
      btn.addEventListener("touchstart", (e) => {
        e.preventDefault();
        startRecord();
      });
      btn.addEventListener("touchend", (e) => {
        e.preventDefault();
        stopRecord();
      });

      async function startRecord() {
        try {
          const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
          chunks = [];
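          // Prefer webm/opus, but fall back to the browser default when it is
          // unsupported (Safari records audio/mp4, which Deepgram also accepts)
          const preferred = "audio/webm;codecs=opus";
          mediaRecorder = new MediaRecorder(
            stream,
            MediaRecorder.isTypeSupported(preferred) ? { mimeType: preferred } : undefined
          );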
          mediaRecorder.ondataavailable = (e) => {
            if (e.data.size > 0) chunks.push(e.data);
          };
          mediaRecorder.onstop = onRecordStop;
          mediaRecorder.start();
          statusEl.textContent = "Recording...";
        } catch (err) {
          console.error("getUserMedia failed", err);
          statusEl.textContent = "Microphone error: " + err.message;
        }
      }

      function stopRecord() {
        if (mediaRecorder && mediaRecorder.state !== "inactive") {
          mediaRecorder.stop();
          mediaRecorder.stream.getTracks().forEach((t) => t.stop());
        }
      }

      // ========== STT ==========
      async function onRecordStop() {
        // Label the Blob with the container the recorder actually produced
        const blob = new Blob(chunks, { type: mediaRecorder?.mimeType || "audio/webm" });
        if (blob.size < 1000) {
          statusEl.textContent = "Recording too short";
          return;
        }
        statusEl.textContent = "Transcribing...";
        const audioBase64 = await blobToBase64(blob);

        let transcript = "";
        try {
          const res = await fetch(STT_URL, {
            method: "POST",
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify({ audioBase64, language: "zh-CN" }),
          });
          const data = await res.json();
          if (!data.ok) throw new Error(data.error || data.message);
          transcript = data.transcript;
        } catch (err) {
          statusEl.textContent = "STT error: " + err.message;
          return;
        }

        if (!transcript.trim()) {
          statusEl.textContent = "Nothing heard, please try again";
          return;
        }

        appendBubble("user", transcript);
        messages.push({ role: "user", content: transcript });
        statusEl.textContent = "AI generating...";
        await streamChatAndSpeak();
      }

      function blobToBase64(blob) {
        return new Promise((resolve, reject) => {
          const reader = new FileReader();
          reader.onloadend = () => {
            // data:audio/webm;base64,xxxxx — strip the prefix
            const result = reader.result;
            const idx = result.indexOf(",");
            resolve(result.slice(idx + 1));
          };
          reader.onerror = reject;
          reader.readAsDataURL(blob);
        });
      }

      // ========== AI streaming + sentence-by-sentence TTS ==========
      async function streamChatAndSpeak() {
        // Cancel any TTS still playing from the previous turn before starting a new one
        window.speechSynthesis.cancel();

        const res = await fetch(CHAT_URL, {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({ messages }),
        });
        if (!res.ok || !res.body) {
          statusEl.textContent = "AI request failed: " + res.status;
          return;
        }

        const reader = res.body.getReader();
        const decoder = new TextDecoder();
        let acc = "";        // full transcript for display
        let sentenceBuf = ""; // buffer for the current in-progress sentence
        const aiBubble = appendBubble("assistant", "");

        // Chinese and English sentence-ending punctuation + newlines
        const sentenceEnd = /[。!?!?\n]/;

        while (true) {
          const { done, value } = await reader.read();
          if (done) break;
          // stream: true is essential — without it, multi-byte UTF-8 characters split across
          // chunk boundaries produce replacement characters (U+FFFD)
          const text = decoder.decode(value, { stream: true });
          acc += text;
          aiBubble.textContent = acc;
          sentenceBuf += text;

          // Repeatedly slice complete sentences out and queue them for TTS
          let m;
          while ((m = sentenceBuf.match(sentenceEnd))) {
            const cut = m.index + 1;
            const sentence = sentenceBuf.slice(0, cut).trim();
            sentenceBuf = sentenceBuf.slice(cut);
            if (sentence) speak(sentence);
          }
        }
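        // Stream ended: flush any bytes still buffered in the decoder,
        // then speak remaining text that has no terminal punctuation
        const tail = decoder.decode();
        if (tail) {
          acc += tail;
          aiBubble.textContent = acc;
          sentenceBuf += tail;
        }
        if (sentenceBuf.trim()) speak(sentenceBuf.trim());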

        messages.push({ role: "assistant", content: acc });
        statusEl.textContent = "Ready";
      }

      function speak(text) {
        const u = new SpeechSynthesisUtterance(text);
        u.lang = "zh-CN";
        u.rate = 1.05; // slightly faster — more natural for conversational voice
        u.pitch = 1;
        window.speechSynthesis.speak(u);
      }

      // ========== UI helpers ==========
      function appendBubble(role, text) {
        const div = document.createElement("div");
        div.style.padding = "8px";
        div.style.margin = "8px 0";
        div.style.background = role === "user" ? "#eef" : "#f5f5f5";
        div.style.whiteSpace = "pre-wrap";
        div.textContent = (role === "user" ? "You: " : "AI: ") + text;
        dialogEl.appendChild(div);
        return {
          set textContent(t) {
            div.textContent = (role === "user" ? "You: " : "AI: ") + t;
          },
          get textContent() {
            return div.textContent;
          },
        };
      }
    </script>
  </body>
</html>

A few details worth calling out:

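Sentence-splitting regex: /[。!?!?\n]/ matches fullwidth (Chinese) and ASCII sentence-ending punctuation plus newlines. Commas are deliberately excluded because splitting on them produces choppy TTS output. The ASCII period is not in the set either, so an English sentence ending in "." is only spoken at the next match or when the stream ends. If the conversation is primarily English, add the period to the set (ideally only when followed by whitespace, so decimals such as 3.14 are not split); adding ; (semicolons) can also help with natural pausing.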
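
The splitting loop can be pulled out into a pure function, which makes it easy to exercise without a network stream. A minimal sketch, assuming the character class is meant to cover fullwidth 。!? as well as ASCII ! and ?; `extractSentences` is a hypothetical helper name.

```javascript
// Terminal punctuation: fullwidth (。!?), ASCII (! ?), and newline.
const SENTENCE_END = /[。!?!?\n]/;

// Pull every complete sentence out of `buffer`; return them plus the unfinished tail.
function extractSentences(buffer) {
  const sentences = [];
  let rest = buffer;
  let m;
  while ((m = rest.match(SENTENCE_END))) {
    const cut = m.index + 1; // include the punctuation mark itself
    const sentence = rest.slice(0, cut).trim();
    rest = rest.slice(cut);
    if (sentence) sentences.push(sentence);
  }
  return { sentences, rest };
}

// Simulate two streamed chunks arriving one after another.
let pending = "";
for (const chunk of ["你好!今天", "天气不错。Anything else?"]) {
  const { sentences, rest } = extractSentences(pending + chunk);
  pending = rest; // carry the unfinished tail into the next chunk
  console.log(sentences, JSON.stringify(pending));
}
```

The first chunk yields one sentence and leaves 今天 pending; the second chunk completes it and yields two more. Each returned sentence would be handed to `speak()` in the page.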

TTS queue is speechSynthesis itself: each call to speak() appends an utterance to the browser's queue, so you do not need to await the previous sentence. speechSynthesis.cancel() clears the entire queue; call it at the start of each new turn, otherwise leftover audio from the previous turn competes with the new reply.

stream: true is non-negotiable for multi-byte characters: A single CJK character is typically 3 bytes in UTF-8, and streaming chunk boundaries regularly split characters in half. Without stream: true, TextDecoder emits the replacement character U+FFFD for each broken sequence. With it, the decoder holds the partial bytes and completes them in the next call.
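
The effect is easy to reproduce by splitting the bytes of a two-character string by hand. This sketch uses only the standard `TextEncoder` / `TextDecoder` globals (available in browsers and in current Node.js); the split position is chosen arbitrarily to land inside the first character.

```javascript
const bytes = new TextEncoder().encode("你好"); // 6 bytes, 3 per character
const first = bytes.slice(0, 2); // cuts the first character in half
const second = bytes.slice(2);

// Without { stream: true } each call is treated as a complete input,
// so the broken halves decode to replacement characters.
const naive = new TextDecoder().decode(first) + new TextDecoder().decode(second);

// With { stream: true } the decoder keeps the partial bytes until the next call.
const streaming = new TextDecoder();
const joined =
  streaming.decode(first, { stream: true }) +
  streaming.decode(second, { stream: true }) +
  streaming.decode(); // final call with no arguments flushes anything still buffered

console.log(naive.includes("\uFFFD")); // true
console.log(joined === "你好");        // true
```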

Multi-turn context: the messages array carries the full history in every request. The model follows OpenAI-style role: user / assistant alternation to maintain context. Because the voice system prompt enforces short, conversational answers, token consumption per turn is typically lower than text-only chat.
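
Because the full history is resent on every request, long sessions grow the prompt without bound. One possible mitigation, not part of the original recipe, is to cap the history on the client before sending it. `trimHistory` and the default of 10 turns are hypothetical choices for illustration.

```javascript
// Keep roughly the `maxTurns` most recent user/assistant pairs.
function trimHistory(messages, maxTurns = 10) {
  const maxMessages = maxTurns * 2;
  if (messages.length <= maxMessages) return messages.slice();
  let start = messages.length - maxMessages;
  // Make sure the window opens on a user message so roles keep alternating.
  while (start < messages.length && messages[start].role !== "user") start++;
  return messages.slice(start);
}

// Example: seven messages (u, a, u, a, u, a, u) trimmed to the last two turns.
const history = Array.from({ length: 7 }, (_, i) => ({
  role: i % 2 === 0 ? "user" : "assistant",
  content: "message " + i,
}));
console.log(trimHistory(history, 2).map((m) => m.content));
```

In the page this would be applied when building the request body, for example `JSON.stringify({ messages: trimHistory(messages) })`, while the local `messages` array keeps the full transcript for display.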

Push-to-talk vs VAD: this example uses mousedown / touchstart to start and mouseup / touchend to stop, which is the most reliable approach on mobile. To detect the end of speech automatically with VAD (Voice Activity Detection), either include a vad-web package on the frontend or use Deepgram's WebSocket streaming API with built-in endpointing.

Step 5: Deploy

Deploy both Web Cloud Functions:

cd voice-stt
tcb fn deploy voice-stt -e your-env-id --type http

cd ../voice-chat
tcb fn deploy voice-chat -e your-env-id --type http

In the Console:

voice-stt environment variables: DEEPGRAM_API_KEY; timeout 30 s; memory 256 MB
voice-chat environment variables: TCB_ENV = your Environment ID; if deploying outside CloudBase (Vercel / self-hosted), also set TENCENTCLOUD_SECRETID + TENCENTCLOUD_SECRETKEY (CloudBase's own runtime injects them automatically); timeout 60 s minimum; memory 512 MB
Both functions need an HTTPS URL generated under "Function Config → HTTP Trigger", in the form https://your-env.service.tcloudbase.com/voice-stt (the same host pattern used in the frontend code). Use these URLs for STT_URL and CHAT_URL in the frontend.

The frontend HTML can be deployed to CloudBase Static Hosting, or run locally with python3 -m http.server at localhost (note that getUserMedia requires HTTPS or localhost).

Running Verification

Open the deployed HTML in a browser. It will prompt for microphone permission — click Allow.
Hold the "Hold to Talk" button and say a question, e.g. "What is CloudBase?", then release.
The status bar should cycle through: Recording... → Transcribing... → AI generating... → Ready.
The dialog should show "You: What is CloudBase?" and then "AI: CloudBase is…" growing character by character.
The browser reads the first complete sentence aloud, then continues reading as more sentences arrive. The first word should be audible within 1.5–3 seconds.
Ask a follow-up question (e.g. "What can it do?") — the AI should answer in context, confirming multi-turn history works.
Open DevTools → Network: /voice-stt returns JSON with a transcript field; /voice-chat shows a streaming response (plain text accumulating, not a single JSON blob).
CloudBase Console → AI+ → Call Records shows token counts; Deepgram Console → Usage shows the audio duration billed.

Common Errors

Error / Symptom	Cause	Fix
`getUserMedia is not a function` / `permission denied`	Page is not served over HTTPS or `localhost`; browser refuses microphone access	Use an HTTPS domain (CloudBase Static Hosting provides HTTPS automatically), or debug on `localhost`; corporate self-signed certs also work but must be trusted first
Recording Blob `size === 0` or under 1 KB	`MediaRecorder.start()` followed immediately by `stop()` before data flushes; or the browser does not support `mimeType`	Wait at least 500 ms before stopping; if `audio/webm;codecs=opus` is unsupported, omit `mimeType` and let the browser choose (Safari will produce `audio/mp4`, which Deepgram also accepts)
Deepgram returns `Audio decode failed`	Corrupted webm header, or `MediaRecorder` did not fully flush on `stop`	Assemble the Blob inside `mediaRecorder.onstop` from the `chunks` array — do not send a single raw chunk from `ondataavailable`; as a last resort, transcode with ffmpeg before uploading
Chinese audio transcribed as transliteration (e.g. "Da jia hao")	STT request missing `language: "zh-CN"`; nova-3 defaults to `en`	Explicitly pass `language: "zh-CN"` in both the Cloud Function and the frontend request; use `"multi"` for mixed Chinese/English
Chinese text in the AI stream shows `\uFFFD` (replacement characters)	`TextDecoder.decode(value)` called without `{ stream: true }`; multi-byte characters split across chunk boundaries	Change to `decoder.decode(value, { stream: true })` as shown in this recipe
TTS produces no sound / only the first sentence plays	`speechSynthesis` suspended after inactivity or too many `cancel()` calls; or queue interrupted	Call `speechSynthesis.cancel()` once at the start of each new turn to reset; on Chrome, `speechSynthesis` suspends after ~14 s of silence — call `speechSynthesis.resume()` before each `speak()` if there are long pauses between turns
TTS cuts off mid-sentence; Chinese text breaks inside a word	Sentence-splitting regex is too aggressive (e.g. treating commas as sentence ends), or the model output contains a long uninterrupted passage without punctuation	Use only `[。!?!?\n]` in the regex; add `"end each sentence with terminal punctuation"` to the model's system prompt as shown in the `voice-chat` function above
`CORS error` after deployment	Web Cloud Function response missing `Access-Control-Allow-Origin`, or OPTIONS preflight not handled	Return `204` + CORS headers for OPTIONS; add `Access-Control-Allow-Origin: *` (or a specific domain) to POST responses as shown in this recipe
`streamText` times out / stream drops mid-way	Node SDK default `timeout: 15000` is too short; or the Web Cloud Function runtime has its own response time limit	Set `tcb.init({ env, timeout: 60000 })`; also raise the function timeout in the Console to at least 60 s; for very long outputs consider chunking
Second turn: AI does not remember the first turn	Frontend `messages` array did not include the `assistant` reply from the previous turn	After reading the stream, push `{ role: "assistant", content: acc }` into `messages` as shown in this recipe
After many turns, TTS stops responding	`speechSynthesis` queue has an implicit limit (around 200 entries); a saturated queue silently drops new items	Call `speechSynthesis.cancel()` at the start of each user turn; for very long sessions, consider switching to a third-party TTS HTTP API

For CloudBase-side error codes see https://docs.cloudbase.net/error-code/. For Deepgram error codes, see the Deepgram documentation.

Pricing Notes

Deepgram STT: nova-3 is billed per audio minute (not tokens). Pay-As-You-Go is approximately $0.0043 per minute. A 5-second question costs roughly $0.00036; 1,000 questions cost about $0.36.
CloudBase AI: deepseek-v4-flash is billed per input + output token. Because the system prompt enforces short conversational answers, token cost per turn is typically lower than text-only chat. New environments receive 1 million tokens free for the first month (check the Console billing page for current figures).
Web Cloud Functions: billed per invocation + resource consumption (GB-seconds). Each question involves 2 function calls (stt + chat), billed at standard CloudBase rates.
TTS is completely free: SpeechSynthesisUtterance runs the browser's local speech engine and makes no network requests. No billing applies.
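
The STT figures above follow from simple proportional arithmetic. A minimal sketch, assuming cost scales linearly with audio duration and that the quoted rate of $0.0043 per minute still applies (check the Deepgram pricing page for current numbers); `sttCostUsd` is a hypothetical helper.

```javascript
// Assumed rate, taken from the figure quoted in this section.
const USD_PER_AUDIO_MINUTE = 0.0043;

// Estimated STT cost for `questions` recordings of `secondsPerQuestion` each.
function sttCostUsd(secondsPerQuestion, questions = 1) {
  return (secondsPerQuestion / 60) * USD_PER_AUDIO_MINUTE * questions;
}

console.log(sttCostUsd(5).toFixed(5));       // "0.00036"
console.log(sttCostUsd(5, 1000).toFixed(2)); // "0.36"
```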

Web Cloud Function Streaming Compatibility Fallback

If your Web Cloud Function runtime does not support returning a ReadableStream directly, use Node's Readable stream instead:

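// Node.js Readable style (accepted as a streaming body by some runtimes).
// This fragment replaces the ReadableStream section inside exports.main;
// `result` is the value returned by model.streamText().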
const { Readable } = require("stream");

const stream = new Readable({
  read() {},
});

(async () => {
  try {
    for await (const chunk of result.textStream) {
      stream.push(chunk);
    }
    stream.push(null);
  } catch (err) {
    stream.destroy(err);
  }
})();

return {
  statusCode: 200,
  headers: { "Content-Type": "text/plain; charset=utf-8" },
  body: stream,
};

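Alternatively, use SSE (text/event-stream): wrap each chunk as data: xxx\n\n. Note that EventSource only issues GET requests, so with this recipe's POST body you would either parse the SSE frames from a fetch() stream or expose the conversation behind a GET endpoint. SSE adds protocol overhead, but where EventSource is usable it provides automatic browser reconnection, which is more robust for long conversations.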
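
The SSE framing mentioned above can be sketched as a pair of pure functions: one for the Cloud Function side and one for a client that reads the response with fetch(). Both `toSseFrame` and `parseSseFrames` are hypothetical helpers; they handle only `data:` fields and ignore the `event:`, `id:`, and `retry:` fields of the full SSE format.

```javascript
// Server side: encode one text chunk as an SSE event.
// Each line of the chunk gets its own "data:" field, as the SSE format requires.
function toSseFrame(chunk) {
  return chunk.split("\n").map((line) => `data: ${line}`).join("\n") + "\n\n";
}

// Client side: pull complete events out of `buffer`;
// return the decoded payloads plus the unfinished tail.
function parseSseFrames(buffer) {
  const events = [];
  let rest = buffer;
  let end;
  while ((end = rest.indexOf("\n\n")) !== -1) {
    const frame = rest.slice(0, end);
    rest = rest.slice(end + 2);
    const data = frame
      .split("\n")
      .filter((line) => line.startsWith("data:"))
      .map((line) => line.slice(5).replace(/^ /, "")) // drop "data:" and one optional space
      .join("\n");
    events.push(data);
  }
  return { events, rest };
}

// Two complete events followed by a partial one still in flight.
const wire = toSseFrame("你好。") + toSseFrame("line1\nline2") + "data: par";
console.log(parseSseFrames(wire));
```

The `rest` value plays the same role as `sentenceBuf` in the page: it is carried over and prepended to the next decoded chunk.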

Related Documentation

connect-deepgram-speech-to-text-cloud-function — STT deep dive (pull audio from Cloud Storage, batch transcription with diarization and utterances). The STT portion of this recipe is a "short recording direct upload" variant of that one.
add-ai-nextjs — CloudBase AI streamText in Next.js (Route Handler + ReadableStream). The chat portion of this recipe is a direct port of that approach to a Web Cloud Function.
add-ai-wechat-miniprogram — The same CloudBase AI integration in a Mini Program; Mini Program's built-in identity removes the need for a backend proxy.
connect-tavily-search-cloud-function — Add live web search to the voice chatbot: call Tavily before streamText, inject the results into the system prompt, and build a search-augmented voice chatbot.
add-realtime-notifications-database-watch — Persist each conversation turn (transcript + AI reply) to the database and use watch for multi-device sync of voice conversation history.
secure-secrets-in-cloud-function — Layered management of DEEPGRAM_API_KEY and TENCENTCLOUD_SECRETKEY across local dev, CI, and production.
CloudBase AI SDK — Init and Invocation — Official reference for app.ai() initialization.
Model Access — Full list of models available beyond deepseek-v4-flash.
