Local voice transcription.
Because of how the Whisper model works (it processes audio in fixed 30-second windows rather than as a continuous stream), real-time transcription seems to be ineffective, so I shouldn’t bother looking for it. To get transcriptions practical for real-time communication, voice activity detection (VAD) should be used to separate speech from silence.
Whisper tends to hallucinate phrases when given insufficient data, such as silence or very short phrases. This includes:
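As a rough illustration of the voice activity idea, here is a minimal energy-based sketch in Python. It is not how any particular VAD library works; the function names, the 30 ms frame size, the RMS threshold of 0.02 and the hangover count are all assumptions chosen for the example and would need tuning per microphone. Samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math
from typing import List, Sequence, Tuple


def frame_rms(samples: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def find_speech_segments(
    samples: Sequence[float],
    sample_rate: int = 16000,
    frame_ms: int = 30,
    threshold: float = 0.02,    # assumed RMS cutoff, tune per setup
    hangover_frames: int = 10,  # silent frames tolerated inside one segment
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of regions judged to be speech."""
    frame_len = sample_rate * frame_ms // 1000
    segments: List[Tuple[int, int]] = []
    start = None
    last_voiced_end = 0
    silent_run = 0
    for offset in range(0, len(samples), frame_len):
        frame = samples[offset:offset + frame_len]
        if frame_rms(frame) >= threshold:
            if start is None:
                start = offset  # a new segment opens on the first loud frame
            last_voiced_end = offset + len(frame)
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run > hangover_frames:
                # Enough silence: close the segment at the last loud frame.
                segments.append((start, last_voiced_end))
                start = None
    if start is not None:
        segments.append((start, last_voiced_end))
    return segments
```

Each returned segment could then be cut out and passed to Whisper on its own, so the model only ever sees audio that probably contains speech.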
- “Thank you” or “Thanks for watching”.
- “Sorry”.
- Subtitle attribution, usually including a domain name.
Post-processing tends to be necessary for short speech.
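A minimal sketch of what that post-processing could look like, in Python. The phrase set comes from the list above; the 2-second cutoff for "short", the domain-name pattern and the function names are assumptions for illustration, and the filter will also discard a genuine short "thank you", so it is a trade-off rather than a fix.

```python
import re
from typing import Optional

# Phrases from the list above; extend as new hallucinations show up.
HALLUCINATED_PHRASES = {"thank you", "thanks for watching", "sorry"}

# Assumed heuristic: subtitle attributions usually mention a domain name.
DOMAIN_PATTERN = re.compile(r"\b[\w-]+\.(?:com|org|net|tv)\b", re.IGNORECASE)


def normalize(text: str) -> str:
    """Lowercase and strip punctuation so 'Thank you.' matches 'thank you'."""
    return re.sub(r"[^\w\s]", "", text).strip().lower()


def clean_transcript(
    text: str, duration_s: float, short_clip_s: float = 2.0
) -> Optional[str]:
    """Return the trimmed text, or None if it looks like a hallucination."""
    stripped = text.strip()
    if not stripped:
        return None
    if duration_s <= short_clip_s:
        # Only short clips are filtered; longer speech is left untouched.
        if normalize(stripped) in HALLUCINATED_PHRASES:
            return None
        if DOMAIN_PATTERN.search(stripped):
            return None
    return stripped
```

Only exact matches are dropped, so a longer sentence that merely starts with "Thank you" passes through unchanged.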
Stream avatar
I have ideas of using it for a stream/live camera avatar of sorts, “Sheep Zhing.” Speech-to-text-to-speech, essentially. This had rather humorous results but was terrible for clear communication.
- huwprosser/web-whisper - Python backend.
- Model runs hot and occasionally locks up.
- Long payloads get rejected by the server.
- Considering creating a separate Node.js/Express implementation that invokes a Whisper CLI tool instead.
For whisper.cpp, this command might be sufficient.
whisper-cli.exe -m ./models/tiny.en.bin -np -nt speech.wav -sns
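To show the "invoke a CLI tool" approach, here is a small Python sketch that wraps the command above; the same shape should carry over to Node.js with `child_process`. The function names and the 60-second timeout are assumptions, and the sketch assumes that with `-np` and `-nt` the tool prints only the transcript text to stdout, which I have not verified for every version.

```python
import subprocess
from typing import List


def build_command(
    wav_path: str,
    model_path: str = "./models/tiny.en.bin",
    binary: str = "whisper-cli.exe",
) -> List[str]:
    """Assemble the same arguments as the command above."""
    return [binary, "-m", model_path, "-np", "-nt", wav_path, "-sns"]


def transcribe(wav_path: str, timeout_s: float = 60.0) -> str:
    """Run the CLI once and return whatever it printed to stdout."""
    result = subprocess.run(
        build_command(wav_path),
        capture_output=True,  # collect stdout/stderr instead of printing
        text=True,            # decode output as str
        check=True,           # raise CalledProcessError on a non-zero exit
        timeout=timeout_s,    # raise TimeoutExpired if the process hangs
    )
    return result.stdout.strip()
```

Running one short-lived process per clip means a locked-up model can be killed by the timeout instead of taking the whole server down with it.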