Adding Voice Transcription to a Chatbot: Part 1
Voice transcription has improved in quality and performance. In this post I explore how I implemented it using the Whisper model.
Before the ChatGPT boom I remember hearing about OpenAI’s Whisper. Personally I’ve been a skeptic of voice-to-text features, but demos and products like ChatGPT’s new voice chat feature have convinced me the technology has improved.
I wanted to see if I could add it to my web chat project.
Where to run the model?
In previous posts I explored on-device and edge-based LLMs. This paradigm makes even more sense for speech models like Whisper.
Whisper Web by Joshua Lochner (xenova, the maintainer of transformers.js) demonstrates that voice transcription can run entirely in the browser, even on a phone, using a small model.
If you try the demo, you’ll notice it requires recording a voice note and then manually triggering transcription.
MediaRecorder
The MediaRecorder API is built into browsers and provides access to audio and video recording. After recording, the resulting blob is passed to the transformers.js Whisper model. Here is a code snippet showing the basic state management for recording:
```jsx
const handleRecording = async () => {
  if (isRecording) {
    // Stop streaming updates from the transcriber
    stopListening();
    // Finalize the recording and get the audio as a Blob
    const recordedBlob = await stopRecording();
    setAudioFromRecording(recordedBlob);
    // Decode the Blob into an AudioBuffer the model can consume
    const audioBuffer = await convertBlobToAudioBuffer(recordedBlob);
    startListening(audioBuffer);
  } else {
    startRecording();
  }
  setIsRecording(!isRecording);
};
```
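For reference, here is roughly how those helpers could look. This is a sketch under assumptions: `startRecording`, `stopRecording`, and `convertBlobToAudioBuffer` are the names used in the snippet above, but their bodies here are my own guess at one reasonable implementation.

```js
let mediaRecorder;
let chunks = [];

const startRecording = async () => {
  // Ask for microphone access and start buffering audio chunks
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  mediaRecorder = new MediaRecorder(stream);
  chunks = [];
  mediaRecorder.ondataavailable = (e) => chunks.push(e.data);
  mediaRecorder.start();
};

const stopRecording = () =>
  new Promise((resolve) => {
    mediaRecorder.onstop = () => {
      // Release the microphone and hand back the recorded audio
      mediaRecorder.stream.getTracks().forEach((track) => track.stop());
      resolve(new Blob(chunks, { type: mediaRecorder.mimeType }));
    };
    mediaRecorder.stop();
  });

const convertBlobToAudioBuffer = async (blob) => {
  // Whisper checkpoints expect 16 kHz audio, so decode at that rate
  const audioContext = new AudioContext({ sampleRate: 16000 });
  const arrayBuffer = await blob.arrayBuffer();
  return audioContext.decodeAudioData(arrayBuffer);
};
```

The component’s JSX then wires the handler to a record button next to the text input: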
```jsx
return (
  <div>
    <form onSubmit={handleSubmit} className={styles.form}>
      <input
        type="text"
        value={input}
        className={styles.input}
        onChange={(e) => setInput(e.target.value)}
        placeholder="Speak or type..."
      />
    </form>
    <button onClick={handleRecording} className={styles.button}>
      {isRecording ? <StopIcon /> : <MicIcon />}
    </button>
  </div>
);
```

Emitting transcription updates
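Under the hood, the `transcriber` object used below comes from a hook that wraps the transformers.js Whisper pipeline. Here is a minimal sketch of what such a hook might look like; the hook shape and model choice are my assumptions, a real implementation would likely run the model in a Web Worker, and `stopListening` is omitted for brevity:

```jsx
import { useRef, useState } from "react";
import { pipeline } from "@xenova/transformers";

// Hypothetical transcriber hook wrapping the transformers.js Whisper
// pipeline; the real project may structure this differently.
function useTranscriber() {
  const modelRef = useRef(null);
  const [output, setOutput] = useState(null);
  const [isBusy, setIsBusy] = useState(false);

  const startListening = async (audioBuffer) => {
    setIsBusy(true);
    // Lazily load the Whisper pipeline on first use
    if (!modelRef.current) {
      modelRef.current = await pipeline(
        "automatic-speech-recognition",
        "Xenova/whisper-tiny.en"
      );
    }
    // Whisper expects mono 16 kHz samples as a Float32Array
    const result = await modelRef.current(audioBuffer.getChannelData(0));
    setOutput(result); // e.g. { text: "..." }
    setIsBusy(false);
  };

  return { output, isBusy, startListening };
}
```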
React’s `useEffect` lets us run updates to UI state whenever specific values change. So once we have output from the model, we update state:
```jsx
useEffect(() => {
  if (transcriber.output && !transcriber.isBusy) {
    setRecognizedText(transcriber.output.text);
  }
}, [transcriber.output, transcriber.isBusy]);
```

Then we can add an effect on the Input component to update the UI state shared with the keyboard input:
```jsx
useEffect(() => {
  if (recognizedText) {
    setInput(recognizedText);
  }
}, [recognizedText, setInput]);
```

Voice Activity Detection
This demo mostly works, though I’m still running into some issues with my version. Having to record, transcribe, and send is fine, but ideally there would be a conversational flow. To get there, we need to detect whether the user is still speaking.
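As a crude stopgap, you could watch the microphone’s energy level with the Web Audio API and treat a stretch of quiet frames as “done speaking.” This is a naive sketch of my own, not what a real voice activity detection model does; `watchForSilence` and its thresholds are made up:

```js
// Naive, energy-based placeholder for end-of-speech detection — not a
// real VAD model. The threshold and frame count are guesses.
const watchForSilence = (stream, onSilence) => {
  const audioContext = new AudioContext();
  const source = audioContext.createMediaStreamSource(stream);
  const analyser = audioContext.createAnalyser();
  analyser.fftSize = 2048;
  source.connect(analyser);

  const samples = new Uint8Array(analyser.fftSize);
  let quietFrames = 0;

  const check = () => {
    analyser.getByteTimeDomainData(samples);
    // RMS energy of the frame; byte samples are centered around 128
    let sumSquares = 0;
    for (const s of samples) {
      const v = (s - 128) / 128;
      sumSquares += v * v;
    }
    const rms = Math.sqrt(sumSquares / samples.length);
    // Require roughly a second of consecutive quiet frames
    quietFrames = rms < 0.01 ? quietFrames + 1 : 0;
    if (quietFrames > 60) {
      onSilence();
    } else {
      requestAnimationFrame(check);
    }
  };
  requestAnimationFrame(check);
};
```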
I’m relatively new to researching this space, but I’ve found some voice activity detection (VAD) models to try. Stay tuned for that in part 2!
