Major difference with whisper.cpp? #109
Comments
Thanks for the report @bjnortier :) I just tested: 1-) whisper.cpp + We will investigate whether an aggressive hallucination guardrail or some input audio processing is causing this discrepancy for this particular sample and report back. In general, we do not anticipate any quality drop when moving from whisper.cpp to WhisperKit, because the evaluation results on librispeech are almost identical and whisper.cpp has higher error rates on earnings22 (eval results are published here; whisper.cpp is not yet tabulated, but the eval files are in this repo).
Yes, definitely a beam search thing, thanks for this great example. The model wants to predict no speech in the beginning because the audio is quiet, and stops early as @atiorh mentioned. You can get around this somewhat by forcing the first tokens, i.e. telling the model there is definitely speech here, by using the decoding option
I managed to improve it a bit by scaling the loudness and a simple band pass filter for human speech frequency ranges using this file and this command
Will look into what we can clean up on the audio side by default, but beam search will require some reworking with the model; CoreML has some limitations that prevented us from implementing this originally.
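For anyone who wants to try the loudness scaling and band-pass idea mentioned above, here is a minimal sketch in plain Python. It is not the file or command referenced in this thread, and it is not what WhisperKit does internally. The function names, the 300 to 3400 Hz band, the single biquad stage (coefficients from the widely used RBJ "Audio EQ Cookbook" band-pass form) and the 0.9 peak target are all assumptions chosen for illustration.

```python
import math


def band_pass(samples, sample_rate, low_hz=300.0, high_hz=3400.0):
    """Single biquad band-pass (RBJ cookbook form, 0 dB peak gain).

    The band edges are an assumed "speech" range, not values from the thread.
    """
    f0 = math.sqrt(low_hz * high_hz)      # geometric center frequency
    q = f0 / (high_hz - low_hz)           # Q from the desired bandwidth
    w0 = 2.0 * math.pi * f0 / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    # Coefficients normalized by a0; b1 is zero for this filter type.
    b0, b2 = alpha / a0, -alpha / a0
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0

    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in samples:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def normalize_peak(samples, target_peak=0.9):
    """Scale the signal so that its largest absolute sample equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)              # silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def preprocess(samples, sample_rate):
    """Filter first, then scale, so out-of-band noise does not set the gain."""
    return normalize_peak(band_pass(samples, sample_rate))
```

Samples are assumed to be floats in the range -1.0 to 1.0 (for example, 16 kHz mono audio decoded elsewhere). A single biquad has gentle slopes, so a real pipeline would likely cascade several stages or use a dedicated DSP library.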
Ok interesting, thanks for the info!
If you notice more discrepancies, please keep flagging
Hi
I've been working with whisper.cpp and am the creator of the Hello Transcribe app: https://apps.apple.com/za/app/hello-transcribe/id6443919768
I'm just having a look at WhisperKit today and I'm trying the following sample: https://drive.google.com/file/d/1VycwWoXsgsWrmzfG2yZ8N4tWnPEUu-HO/view?usp=share_link
It's a low-quality microphone recording from the Lord of the Rings movie.
With the small Whisper model on an M1 Mac and using the WhisperKit CLI I get no output:
With the TestFlight app some output but low quality:
And with whisper.cpp:
As you can see the whisper.cpp result is WAY better. Is there a set of settings to use to get a similar result? I'm just using default options for both.