Whisper Turbo
Getting Started

Whisper Turbo is a fast, cross-platform implementation of the open source speech-to-text model Whisper.

Installation

npm install whisper-turbo

You can check out a demo of Whisper Turbo here.

Usage

Session

A session is a single instance of a model used for inference.

A SessionManager is used to spawn an instance of any of the following available models:

enum AvailableModels {
    WHISPER_TINY, // 51.4 MB
    WHISPER_BASE, // 96.8 MB
    WHISPER_SMALL, // 313 MB
    WHISPER_MEDIUM, // 972 MB
}

Try out the models on the demo site and choose the one that best fits your needs.

The SessionManager has a single method, loadModel, which returns a promise that resolves to a Result type.

class SessionManager {
    async loadModel(
        selectedModel: AvailableModels,
        onLoaded: (result: any) => void,
        onProgress: (progress: number) => void
    ): Promise<Result<InferenceSession, Error>>;
}
💡 Before calling any methods, you must call await initialize() to ensure the WASM module is loaded.

Decoding Options

Decoding options enable you to configure your transcription request. These options are an exact match for those used by the official Whisper CLI.

Some of these options can be crucial to getting the best results from your model.

interface DecodingOptions {
    task: string; // "transcribe" | "translate"
    language?: string; // "en" | "fr" | "es" etc.
    temperature?: number; // 0.0 - 1.0
    sample_len?: number; // 0 - 1000
    best_of?: number;
    beam_size?: number;
    patience?: number;
    length_penalty?: number;
    prompt?: string;
    prefix?: string;
    suppress_tokens?: number[];
    suppress_blank?: boolean;
    without_timestamps?: boolean;
    max_initial_timestamp?: number;
    time_offset?: number;
}

You can see a list of supported languages here.

💡 Some of these options are present but not yet implemented; they will be implemented in the near future.

Because this interface has so many fields, you can use the DecodingOptionsBuilder to set only the fields you need and leave the rest at their default values.

let options = new DecodingOptionsBuilder()
    .setTask("translate")
    .setLanguage("fr")
    .build();

Segment

A segment is a single chunk of the transcription. When a segment is generated, it is passed back to the callback function you provide to the transcribe method.

interface Segment {
    text: string;
    start: number;
    end: number;
    last: boolean;
}
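As a sketch of how the callback might be used (makeCollector is a hypothetical helper, not part of the library; the Segment shape is the interface above), streamed segments can be stitched into a full transcript:

```typescript
interface Segment {
    text: string;
    start: number; // segment start time
    end: number;   // segment end time
    last: boolean; // true on the final segment
}

// Hypothetical helper: accumulate streamed segments into one transcript.
function makeCollector() {
    const parts: string[] = [];
    return {
        // Pass this as the callback to transcribe().
        onSegment(s: Segment): void {
            parts.push(s.text);
            if (s.last) {
                console.log("Transcript:", parts.join("").trim());
            }
        },
        transcript(): string {
            return parts.join("").trim();
        },
    };
}
```

This keeps the UI responsive: each segment can be rendered as it arrives, and the full transcript is available once a segment with last set to true comes through.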

Transcribe

Transcribe is the main entry point for the library. It accepts a Uint8Array of audio data from a file or microphone, and returns a promise that resolves to a Result.

function transcribe(
    audio: Uint8Array,
    raw_audio: boolean,
    options: any,
    callback: (decoded: Segment) => void
): Promise<Result<void, Error>>;

It accepts any of the following file formats:

wav, mp3, m4a, mp4, aac.

⚠️ Transcoding can seriously slow down your inference - use WAV files where possible!
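For file-based input, one way to produce the Uint8Array that transcribe expects is to fetch the file and wrap its bytes. This is a sketch, not part of the library; fetchAudioBytes and the URL you pass it are hypothetical:

```typescript
// Hypothetical helper: fetch an audio file (e.g. a WAV) and return its
// bytes as the Uint8Array that transcribe() accepts.
async function fetchAudioBytes(url: string): Promise<Uint8Array> {
    const resp = await fetch(url);
    if (!resp.ok) {
        throw new Error(`Failed to fetch ${url}: ${resp.status}`);
    }
    return new Uint8Array(await resp.arrayBuffer());
}
```

In a browser you could equally build the Uint8Array from a File chosen via an input element, using File.arrayBuffer().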

The raw_audio parameter indicates whether the audio data is raw PCM data (i.e., straight from the microphone) rather than an encoded file.
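When you do have raw PCM (for example, Float32 samples captured with the Web Audio API), the samples must be handed over as bytes. The helper below is a sketch, not part of the library, and the sample format the model expects (f32 vs. 16-bit integer, sample rate) is an assumption you should verify against the demo source:

```typescript
// Sketch only: reinterpret Float32 PCM samples (e.g. one channel of an
// AudioBuffer, or microphone capture) as the Uint8Array that transcribe()
// takes with raw_audio = true. The exact sample format expected by the
// model is an assumption -- verify against the demo source.
function floatSamplesToBytes(samples: Float32Array): Uint8Array {
    // Zero-copy byte view over the same underlying buffer.
    return new Uint8Array(samples.buffer, samples.byteOffset, samples.byteLength);
}
```

In a browser you would typically obtain such samples via AudioContext.decodeAudioData or an AudioWorklet, downmixed to mono.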

Worked example

import {
    initialize,
    SessionManager,
    DecodingOptionsBuilder,
    Segment,
    AvailableModels,
} from "whisper-turbo";
 
async function main() {
    await initialize();
    const loadResult = await new SessionManager().loadModel(
        AvailableModels.WHISPER_TINY,
        () => {
            console.log("Model loaded successfully");
        },
        (p: number) => {
            console.log(`Loading: ${p}%`);
        }
    );
    // loadModel resolves to a Result; unwrap it before use.
    if (loadResult.isErr) {
        console.error(loadResult.error);
        return;
    }
    const session = loadResult.value;
 
    // Fetch the audio to transcribe; "audio.wav" is a placeholder path.
    const resp = await fetch("audio.wav");
    const audioData = new Uint8Array(await resp.arrayBuffer());
 
    const options = new DecodingOptionsBuilder().setTask("translate").build();
 
    // raw_audio is false because this is an encoded WAV file, not raw PCM.
    await session.transcribe(audioData, false, options, (segment: Segment) => {
        console.log(segment);
    });
}
 
main();