Audio Transcription v1.0.0+
Convert speech to text using specialized models like Whisper or leverage multimodal models for native audio understanding and analysis.
Convert audio files to text using models like OpenAI’s Whisper or Google’s Gemini. NodeLLM supports both raw transcription and multimodal chat analysis.
Basic Transcription
Use NodeLLM.transcribe() for direct speech-to-text conversion.
const text = await NodeLLM.transcribe("meeting.mp3", {
  model: "whisper-1"
});
console.log(text.toString());
Advanced Options
Speed vs Accuracy
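Choose the model and parameters that suit your needs. The language and prompt options shown here are hints: the first tells the model which language to expect, and the second supplies domain-specific terms it might otherwise misspell.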
await NodeLLM.transcribe("audio.mp3", {
  model: "whisper-1",
  language: "en", // ISO-639-1 code hint to improve accuracy
  prompt: "ZyntriQix, API" // Guide the model with domain-specific terms
});
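When the glossary or language comes from configuration, the options object can be assembled programmatically. The following is a minimal sketch, not part of NodeLLM: buildTranscribeOptions is a hypothetical helper that joins a list of terms into a comma-separated prompt (mirroring the example above) and checks only that the language hint has the shape of a two-letter ISO-639-1 code, not that the code actually exists.

```javascript
// Hypothetical helper (not part of NodeLLM): builds an options object
// shaped like the one passed to NodeLLM.transcribe() above.
function buildTranscribeOptions({ model = "whisper-1", language, terms = [] } = {}) {
  const options = { model };

  if (language !== undefined) {
    // Shape check only: two lowercase letters, e.g. "en" or "de".
    if (!/^[a-z]{2}$/.test(language)) {
      throw new Error(`Expected a two-letter ISO-639-1 code, got "${language}"`);
    }
    options.language = language;
  }

  // Drop empty entries and join the remaining terms into one prompt string.
  const cleaned = terms.map((t) => t.trim()).filter((t) => t.length > 0);
  if (cleaned.length > 0) {
    options.prompt = cleaned.join(", ");
  }

  return options;
}

const options = buildTranscribeOptions({
  language: "en",
  terms: ["ZyntriQix", " API ", ""],
});
console.log(options);
// The result could then be passed along, e.g.:
//   await NodeLLM.transcribe("audio.mp3", options);
```

Validating the hint before the request is sent surfaces a malformed value locally instead of as a provider error.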
Diarization & Native Word Timestamps v1.16.0
NodeLLM supports speaker identification (diarization) and word-level timestamps.
const response = await NodeLLM.transcribe("meeting.mp3", {
  model: "whisper-1", // or "gpt-4o-transcribe-diarize"
  timestamp_granularities: ["word", "segment"],
  speakerNames: ["Alice", "Bob"]
});
Accessing Detailed Metadata
The transcribe method returns a Transcription object that provides rich metadata for analysis and persistence.
console.log(`Duration: ${response.duration}s`);
// 1. Iterating through segments
for (const segment of response.segments) {
  const speaker = segment.speaker ? `${segment.speaker}: ` : "";
  console.log(`[${segment.start}s - ${segment.end}s] ${speaker}${segment.text}`);
}
// 2. Word-level precision (if requested)
console.log(response.words[0]);
// => { word: "Hello", start: 0.5, end: 0.8 }
// 3. Database Persistence
// Every transcription has a .meta property for easy storage
const record = {
  audio_id: "audio_123",
  transcript: response.text,
  metadata: response.meta // Full serializable object
};
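Diarized segments often arrive as many short pieces from the same speaker. As a rough sketch of post-processing, the helpers below merge consecutive segments by speaker and print readable timestamps. mergeBySpeaker and formatTime are hypothetical names, not part of NodeLLM, and the code assumes each segment is a plain object with the start, end, text, and optional speaker fields used in the loop above.

```javascript
// Hypothetical helpers (not part of NodeLLM). They assume segments shaped
// like { start, end, text, speaker? } as in the iteration example above.

// Format a time in seconds as mm:ss, truncating fractional seconds.
function formatTime(seconds) {
  const total = Math.floor(seconds);
  const mm = String(Math.floor(total / 60)).padStart(2, "0");
  const ss = String(total % 60).padStart(2, "0");
  return `${mm}:${ss}`;
}

// Merge consecutive segments from the same speaker into a single turn.
function mergeBySpeaker(segments) {
  const turns = [];
  for (const seg of segments) {
    const speaker = seg.speaker ?? "Unknown";
    const last = turns[turns.length - 1];
    if (last && last.speaker === speaker) {
      // Same speaker continues: extend the current turn.
      last.end = seg.end;
      last.text += " " + seg.text.trim();
    } else {
      turns.push({ speaker, start: seg.start, end: seg.end, text: seg.text.trim() });
    }
  }
  return turns;
}

// Illustrative sample data, standing in for response.segments.
const sampleSegments = [
  { start: 0, end: 2.5, speaker: "Alice", text: "Hello everyone." },
  { start: 2.5, end: 5, speaker: "Alice", text: " Let's begin." },
  { start: 5, end: 65.2, speaker: "Bob", text: "Thanks." },
];

for (const turn of mergeBySpeaker(sampleSegments)) {
  console.log(`[${formatTime(turn.start)} - ${formatTime(turn.end)}] ${turn.speaker}: ${turn.text}`);
}
// [00:00 - 00:05] Alice: Hello everyone. Let's begin.
// [00:05 - 01:05] Bob: Thanks.
```

Segments without a speaker are grouped under "Unknown", so the same helper also works on transcriptions made without diarization.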
Multimodal Chat vs. Transcription
There are two ways to work with audio:
- Transcription (NodeLLM.transcribe): Best when you need the verbatim text.
  - Result: “Hello everyone today we are…”
- Multimodal Chat (chat.ask): Best when you need to analyze or summarize the audio directly, without seeing the raw text first. Supported by models like gemini-1.5-pro and gpt-4o.
// Multimodal Chat Example
const chat = NodeLLM.chat("gemini-1.5-pro");
await chat.ask("What is the main topic of this podcast?", {
  files: ["podcast.mp3"]
});
Error Handling
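Audio files can be large, which makes transcription requests prone to timeouts. Wrap calls in try/catch so a failed request does not crash your application.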
try {
  await NodeLLM.transcribe("large-file.mp3");
} catch (error) {
  console.error("Transcription failed:", error.message);
}
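For transient failures such as timeouts, one common approach is to retry with exponential backoff. The wrapper below is a minimal sketch under stated assumptions: withRetry is a hypothetical helper, not a NodeLLM API, and it retries on every error, whereas a real application would likely retry only errors it knows to be transient.

```javascript
// Hypothetical helper (not part of NodeLLM): retries an async task with
// exponential backoff. It retries on ANY error, which is a simplification.
async function withRetry(task, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt === retries) break; // out of attempts, give up
      // Wait baseDelayMs, then 2x, 4x, ... before the next attempt.
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}

// Stand-in for a flaky request: fails twice, then succeeds.
let demoCalls = 0;
const flakyTask = async () => {
  demoCalls += 1;
  if (demoCalls < 3) throw new Error("simulated timeout");
  return "transcript text";
};

withRetry(flakyTask, { retries: 3, baseDelayMs: 10 })
  .then((result) => console.log(`Succeeded after ${demoCalls} attempts:`, result))
  .catch((error) => console.error("Transcription failed:", error.message));

// With NodeLLM, the task would wrap the real call, e.g.:
//   await withRetry(() => NodeLLM.transcribe("large-file.mp3", { model: "whisper-1" }));
```

Passing the call as a function rather than a promise matters here, because each retry needs to start a fresh request.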