In part 1 I covered the basics of how web browsers support recording audio, along with my approach to using an on-device voice-to-text model: OpenAI’s Whisper, optimized for the edge with transformers.js.
Transcribing voice into text is convenient, but it is also built into phones’ and computers’ native keyboards. How can we make the chat conversational, without having to press a button before sending?
Voice Activity Detection (VAD)
In order to create a seamless chat experience we need to detect silence. Once the user stops speaking, we can stop the recording and send the transcribed text to the chatbot.
I will admit to not knowing much about these models, but the Silero VAD model worked very well with little configuration. I found a useful library (`npm install @ricky0123/vad-react`) that has instructions for different environments like Node and React. Like the Whisper model, it runs on-device via a converted ONNX model. The library provides a `useMicVAD` hook to load the model:
```ts
const vad = useMicVAD({
  modelURL: "/_next/static/chunks/silero_vad.onnx",
  workletURL: "/_next/static/chunks/vad.worklet.bundle.min.js",
  startOnLoad: false,
  onSpeechEnd: async () => {
    if (recording) {
      await stopRecording(); // Stop the recording
      setRecording(!recording); // Update the recording state
    }
  },
});
```

The VAD library provides an `onSpeechEnd` callback where we can stop recording and update the recording state.
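Because `startOnLoad` is false, the VAD still has to be started explicitly. One option is to toggle it alongside the recording state when the user taps the record button. Here's a rough sketch of that wiring; `recording`, `startRecording`, and `stopRecording` are the state and helpers from part 1, and `start`/`pause` are methods on the object `useMicVAD` returns:

```ts
// Hypothetical record-button handler — names besides vad.start/vad.pause
// are assumptions carried over from part 1's recording setup.
const toggleRecording = async () => {
  if (recording) {
    vad.pause(); // Stop listening for speech
    await stopRecording();
  } else {
    vad.start(); // Begin listening so onSpeechEnd can fire
    await startRecording();
  }
  setRecording(!recording);
};
```

After the first tap the flow is hands-free: `onSpeechEnd` stops the recording, and the transcription pipeline below takes over.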
Auto-Sending Transcribed Text
The final piece of the puzzle is to actually send our text to ChatGPT once transcription is complete. To do this I’m adding a simple state variable, set in the completion handler of the `useTranscriber` hook (borrowed heavily from the whisper-web example).
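The snippet below only shows the new flag being set, so for context, here is a rough sketch of where that state could live inside the hook. The `isComplete` name is my addition; the rest loosely mirrors the whisper-web `useTranscriber` structure:

```ts
// Inside useTranscriber — a sketch, not the exact whisper-web code.
const [isComplete, setIsComplete] = useState(false);

// ...existing worker setup and transcript state...

return {
  // ...the hook's existing fields (transcript, isBusy, start, etc.)...
  isComplete, // Expose the new flag so components can react to it
};
```

With that in place, the worker message handler’s `complete` case can flip the flag: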
case "complete":
const completeMessage = message as TranscriberCompleteData;
setTranscript({
isBusy: false,
text: completeMessage.data.text,
chunks: completeMessage.data.chunks,
});
setIsBusy(false);
setIsComplete(true);
break;Adding `setIsComplete` gives us the data point for a new React effect to listen for that data and send it off to ChatGPT:
```ts
useEffect(() => {
  if (transcriber.isComplete) {
    handleTranscriptionComplete();
  }
  // Depend on the flag itself so the effect fires once when transcription
  // finishes, rather than on every change to the transcriber object.
}, [transcriber.isComplete]);
```

Final Thoughts
If you tried the demo, or Whisper in general, you know the quality is great. Being able to run this without a server to process audio makes things much easier and more privacy-friendly. You will notice a little slowness when the model first loads, but it is cached on subsequent visits.
I’m finding myself using voice chat more and more; there is less cognitive effort involved, and I imagine we’ll see it built into more applications soon.
My demo is accessible in my “web chat” project; check it out here:
