Proxy Deepgram Speech-to-Text via CloudBase Cloud Function

In one sentence: Use @deepgram/sdk inside a CloudBase Cloud Function to call Deepgram's nova-3 model, pull an audio file from Cloud Storage, perform STT, and get a transcription with punctuation, diarization, and timestamps written directly to the database.

Estimated time: 30 minutes | Difficulty: Advanced

Applicable Scenarios

  • You have meeting recordings, voice memos, or customer service call audio that you want to batch-transcribe without maintaining your own Whisper inference machine.
  • You need "who said what and when" — Deepgram's diarize + utterances gives you speaker IDs and sentence-level timestamps out of the box; building the same thing on top of Whisper requires significant post-processing.
  • Your audio is mixed Chinese and English — nova-3 is Deepgram's current SOTA model and covers both languages.

Not applicable:

  • Real-time low-latency scenarios (live captions, voice assistants): those require a WebSocket streaming connection. This recipe covers the batch (prerecorded) API. A minimal streaming code snippet is shown at the end, but running it inside a Cloud Function is not cost-effective — use a Web Cloud Function (HTTP trigger) or connect directly from the client instead.
  • Extremely high transcription volume (tens of thousands of hours per day): at that scale, Deepgram's per-minute pricing adds up fast — it is worth comparing against self-hosted Whisper Large v3 on GPU.

Prerequisites

  • Node.js (Cloud Function runtime): ≥ 18
  • @cloudbase/node-sdk: latest
  • @deepgram/sdk: ^4.x (this recipe uses the v4 API)
  • Cloud Function type: standard event-driven function; batch transcription does not require a persistent connection
  • Public network egress: Cloud Functions can reach the public internet by default; Deepgram's API is at api.deepgram.com, which is reachable from mainland China

You will need:

  • A Deepgram account and an API key (new accounts receive $200 in free credits — enough to get started).
  • A CloudBase Environment with Cloud Storage and a database.
  • A test audio file in Chinese or English; start with something under 30 seconds for quick iteration.

Step 1: Get a Deepgram Key and Configure Environment Variables in CloudBase

  1. Log in to the Deepgram Console, go to API Keys, click Create a New API Key, and set the Scope to Member (transcription calls only).
  2. Copy the generated key — it is only shown once; if you lose it you will need to create a new one.
  3. Open the CloudBase Console → your Environment → Cloud Functions. Note your Environment ID (needed for tcb fn deploy -e later).
  4. After creating the function, add the following under "Function Config → Environment Variables":
    • DEEPGRAM_API_KEY: the key you just copied
    • TCB_ENV: your Environment ID (used by tcb.init({ env }) in the code)

Why not hardcode the key? Beyond the fundamental rule of keeping secrets out of git, environment variables in Cloud Functions can be updated without repackaging. Rotating a key or switching environments only requires a Console update and an instance restart — much faster than a code change.
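
Before wiring the key into the function, a ten-second sanity check is worthwhile. The sketch below runs locally under Node 18+ (which ships a global fetch) and calls Deepgram's projects endpoint with the key; the file name check-key.mjs is just a suggestion, and if your key's scope cannot list projects, a 403 (rather than 401) still tells you the key itself is recognized.

// check-key.mjs: verify the Deepgram key before deploying
// Run with: DEEPGRAM_API_KEY=your-key node check-key.mjs
const res = await fetch("https://api.deepgram.com/v1/projects", {
  headers: { Authorization: `Token ${process.env.DEEPGRAM_API_KEY}` },
});
console.log(res.ok ? "key OK" : `key rejected: HTTP ${res.status}`);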

Step 2: Upload Audio to Cloud Storage

Two quick options:

A. Manual upload via Console: Go to CloudBase Console → Cloud Storage → Upload File, select an .mp3 or .wav file, and note the generated fileID in the form cloud://your-env.xxx/uploads/audio.mp3.

B. Frontend SDK upload: If you already have an upload flow, see add-file-upload-wechat-miniprogram. After the Mini Program or web client calls app.uploadFile, write the fileID to the database; the Cloud Function reads it from there.

Deepgram supports a wide range of formats: mp3, wav, m4a, flac, ogg, webm, mp4, and more — essentially anything ffmpeg can decode. The single-file batch limit is around 2 GB; only consider chunking for truly large files.
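
For option B, a minimal web-side sketch looks like the following. It assumes @cloudbase/js-sdk with a user already signed in; uploadAudio, the cloudPath pattern, and the status field are conventions invented for this sketch, and file is whatever your file input provides.

// Web client: upload the audio, then create a pending record for the Cloud Function to fill in
const app = cloudbase.init({ env: "your-env-id" });
const db = app.database();

async function uploadAudio(file) {
  // In the web SDK, filePath accepts a File/Blob object
  const { fileID } = await app.uploadFile({
    cloudPath: `uploads/${Date.now()}-${file.name}`,
    filePath: file,
  });
  const { id } = await db.collection("transcripts").add({ fileID, status: "pending" });
  return { fileID, recordId: id };
}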

Step 3: Write the Cloud Function (Download Audio → Call Deepgram → Save to Database)

Create a new function directory:

mkdir transcribe-audio && cd transcribe-audio
npm init -y
npm install --save @cloudbase/node-sdk @deepgram/sdk

index.js:

const tcb = require("@cloudbase/node-sdk");
const { createClient } = require("@deepgram/sdk");

const app = tcb.init({ env: process.env.TCB_ENV });
const db = app.database();
const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

// event: { fileID: "cloud://...", language?: "zh-CN" | "en", recordId?: string }
exports.main = async (event) => {
  const { fileID, language = "zh-CN", recordId } = event;
  if (!fileID) {
    return { ok: false, error: "missing_fileID" };
  }

  // 1. Download audio from Cloud Storage into a buffer
  let audioBuffer;
  try {
    const downloadResult = await app.downloadFile({ fileID });
    audioBuffer = downloadResult.fileContent; // Buffer
  } catch (err) {
    console.error("download failed", err);
    return { ok: false, error: "download_failed", message: err.message };
  }

  // 2. Call Deepgram batch transcription
  let response;
  try {
    response = await deepgram.listen.v1.media.transcribeFile(audioBuffer, {
      model: "nova-3",
      smart_format: true, // auto punctuation + number formatting
      punctuate: true,
      diarize: true, // speaker diarization; result in utterances[i].speaker
      utterances: true, // sentence-level segmentation + timestamps
      language, // "zh-CN" / "en" / "auto" etc.
    });
  } catch (err) {
    // Deepgram SDK throws DeepgramError with statusCode/body
    console.error("deepgram failed", err);
    return {
      ok: false,
      error: "deepgram_failed",
      statusCode: err.statusCode,
      message: err.message,
    };
  }

  // Defensive check: the SDK's { result, error } wrapper can also surface API failures here
  if (response?.error) {
    console.error("deepgram failed", response.error);
    return { ok: false, error: "deepgram_failed", message: response.error.message };
  }

  const transcript =
    response?.result?.results?.channels?.[0]?.alternatives?.[0]?.transcript || "";
  const utterances = response?.result?.results?.utterances || [];

  // 3. Write to database; update if recordId provided, otherwise insert
  const payload = {
    fileID,
    language,
    transcript,
    utterances, // [{ start, end, speaker, transcript, confidence }]
    durationSec: response?.result?.metadata?.duration,
    model: "nova-3",
    updatedAt: new Date(),
  };

  if (recordId) {
    await db.collection("transcripts").doc(recordId).update(payload);
    return { ok: true, recordId, transcript };
  } else {
    const { id } = await db.collection("transcripts").add({
      ...payload,
      createdAt: new Date(),
    });
    return { ok: true, recordId: id, transcript };
  }
};

A few common pitfalls:

  • app.downloadFile returns { fileContent: Buffer }. Pass the buffer directly to transcribeFile — no need to write to disk. The Cloud Function /tmp directory is writable, but keeping everything in memory is faster.
  • For Chinese audio you must pass language: "zh-CN". Without it, nova-3 defaults to en and transcribes the audio phonetically, producing transliterated gibberish like "Da jia hao".
  • smart_format: true already implies punctuate, but being explicit does not cause conflicts and makes the intent clear if you later swap models.
  • If you need speaker-level granularity, transcript alone is not enough — use the utterances array, where each entry has speaker (integer, starting from 0), start, and end (in seconds); a small formatting helper is sketched after this list.
  • response.result is the SDK v4 wrapper layer. Accessing response.results directly returns undefined — this is the most common v3 → v4 migration mistake.
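
To turn the utterances array into a readable, speaker-labeled script, a plain helper is enough. It assumes only the utterance shape described above; formatBySpeaker is a name invented for this sketch.

// Render utterances as "[12.3s] Speaker 0: ..." lines
function formatBySpeaker(utterances) {
  return utterances
    .map((u) => `[${u.start.toFixed(1)}s] Speaker ${u.speaker}: ${u.transcript}`)
    .join("\n");
}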

package.json dependencies should look like:

{
  "name": "transcribe-audio",
  "main": "index.js",
  "dependencies": {
    "@cloudbase/node-sdk": "^3.0.0",
    "@deepgram/sdk": "^4.0.0"
  }
}

Step 4: Deploy and Invoke

Deploy:

tcb login
tcb fn deploy transcribe-audio -e your-env-id

After deployment, go to the Console and do three things:

  1. Function Config → Environment Variables: add DEEPGRAM_API_KEY and TCB_ENV.
  2. Function Config → Execution Timeout: the default 3 seconds is too short — set it to 60 seconds (end-to-end transcription of a 1-minute audio file takes roughly 5–10 seconds; the extra headroom covers download and database write).
  3. Function Config → Memory: 512 MB is sufficient. If your files frequently exceed 100 MB, increase to 1024 MB.
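
If you would rather keep this configuration in code than click through the Console, the tcb CLI can read it from a cloudbaserc.json in your project root. The sketch below shows the relevant fields; verify the exact key names against your CLI version, and note that committing the real DEEPGRAM_API_KEY here would contradict the advice in Step 1, so keep the secret in the Console.

{
  "envId": "your-env-id",
  "functionRoot": ".",
  "functions": [
    {
      "name": "transcribe-audio",
      "timeout": 60,
      "memorySize": 512,
      "envVariables": {
        "TCB_ENV": "your-env-id"
      }
    }
  ]
}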

Invoke once locally with tcb:

tcb fn invoke transcribe-audio -e your-env-id \
--params '{"fileID":"cloud://your-env.xxx/uploads/audio.mp3","language":"zh-CN"}'

Expected response:

{
  "ok": true,
  "recordId": "xxxx",
  "transcript": "Hello, this is a test audio file..."
}
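
In production you would normally trigger the function from the client rather than the CLI. A minimal sketch with @cloudbase/js-sdk, reusing the app instance and the fileID/recordId from the upload sketch in Step 2:

// Web client: kick off transcription for an uploaded file
const { result } = await app.callFunction({
  name: "transcribe-audio",
  data: { fileID, language: "zh-CN", recordId },
});
console.log(result.ok ? result.transcript : result.error);

Note that this call waits for the full transcription; for long audio, consider returning to the UI immediately and letting the database watch described under Running Verification deliver the result.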

Running Verification

  1. Prepare a Chinese or English audio clip under 30 seconds with known content (reading a passage aloud works well), in .mp3 format.
  2. Upload it to Cloud Storage and note the fileID.
  3. Run tcb fn invoke as shown above and check that the transcript field matches what was spoken.
  4. Open the transcripts collection in the database and verify:
    • transcript contains the full text.
    • utterances is an array where each entry has start, end, speaker, and transcript.
    • durationSec approximately matches the audio length (within ±1 second).
  5. Run again with a multi-speaker audio file and confirm that utterances[*].speaker includes at least two distinct integers (0 and 1); a one-line check is sketched below.
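
The speaker check in step 5 is one line against the record read back from the transcripts collection (record is assumed to be that document):

// Count distinct speaker IDs from diarization; expect >= 2 for multi-speaker audio
const speakers = new Set(record.utterances.map((u) => u.speaker));
console.log(`distinct speakers: ${speakers.size}`);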

To show results in the frontend in real time, set up add-realtime-notifications-database-watch and watch the record — the frontend will receive the update the moment the function writes to the database.
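
A minimal watch sketch, assuming the db handle and recordId from the Step 2 upload sketch; onTranscript is a placeholder for your own rendering code.

// Subscribe to the transcript record; onChange fires when the function writes the result
const watcher = db
  .collection("transcripts")
  .where({ _id: recordId })
  .watch({
    onChange: (snapshot) => {
      const doc = snapshot.docs[0];
      if (doc && doc.transcript) onTranscript(doc.transcript);
    },
    onError: (err) => console.error("watch error", err),
  });

// When the page is torn down:
// watcher.close();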

Common Errors

  • 401 Unauthorized. Cause: DEEPGRAM_API_KEY not set or entered incorrectly. Fix: re-paste the key in the Console environment variables, confirm there are no leading/trailing spaces, and redeploy or manually restart the instance after changing the value.
  • Deepgram returns "Audio decode failed". Cause: the uploaded file is not valid audio, or the format is corrupted. Fix: verify the file with ffprobe; browser-recorded webm files often have encoding issues, so transcode with ffmpeg -i in.webm out.mp3.
  • Chinese audio transcribed as English transliteration (e.g. "Da jia hao"). Cause: language: "zh-CN" was not passed; the model defaults to en and transcribes phonetically. Fix: explicitly pass language: "zh-CN"; for mixed Chinese/English try language: "auto", though accuracy is lower than specifying a language.
  • Function times out after 3 seconds; logs show the Deepgram call was just sent. Cause: the default Cloud Function timeout is 3 seconds, while transcribing 1 minute of audio takes 5–10 seconds. Fix: Console → Function Config → Timeout, set 60 seconds or more; for audio over 5 minutes set 300 seconds and increase memory.
  • Cannot read property 'channels' of undefined. Cause: the SDK changed from v3 to v4; response.results became response.result.results. Fix: use response?.result?.results?.channels?.[0] as shown in this recipe, and check the Deepgram SDK CHANGELOG before upgrading.
  • Large audio file causes "Request body too large". Cause: event-driven Cloud Functions have a small request body limit, and passing audio bytes directly in the event exceeds it. Fix: always pass a fileID and let the function download from Cloud Storage; never put audio bytes in the event payload.
  • Transcript accuracy is noticeably low. Cause: low audio bitrate, background noise, or overlapping speakers. Fix: record at ≥ 16 kHz mono and consider pre-processing to reduce noise; Deepgram's redact and filler_words parameters are also worth tuning.

Full Deepgram error codes are in the Deepgram documentation. For CloudBase-side error codes see https://docs.cloudbase.net/error-code/.

Real-Time Streaming Transcription (Optional)

To transcribe audio as it is spoken (live captions, voice assistants), replace transcribeFile with a WebSocket connection. Standard event-driven functions cannot hold persistent connections — use a Web Cloud Function (HTTP trigger) or connect directly from the client using a temporary token:

const connection = await deepgram.listen.v1.connect({
  model: "nova-3",
  interim_results: true, // partial results as the speaker talks
  punctuate: true,
  language: "zh-CN",
});

connection.on("open", () => {
  // client starts pushing PCM/Opus audio frames to connection
});

connection.on("message", (data) => {
  if (data.type === "Results") {
    const partial = data.channel.alternatives[0].transcript;
    // push to frontend
  }
});

connection.connect();

Deepgram streaming latency is under 300 ms, but the deployment model and authentication approach differ from batch transcription. That warrants a separate recipe — this one does not go further.