
Major difference with whisper.cpp? #109

Open
bjnortier opened this issue Apr 5, 2024 · 4 comments
Labels: feature (New feature or request), needs model updates (Requires associated model change)


@bjnortier

Hi

I've been working with whisper.cpp and am the creator of the Hello Transcribe app: https://apps.apple.com/za/app/hello-transcribe/id6443919768

I'm just having a look at WhisperKit today, and I'm trying the following sample: https://drive.google.com/file/d/1VycwWoXsgsWrmzfG2yZ8N4tWnPEUu-HO/view?usp=share_link

It's a low-quality microphone recording from the Lord of the Rings movie.

With the small Whisper model on an M1 Mac and using the WhisperKit CLI I get no output:

$ swift run whisperkit-cli transcribe --model small --verbose --audio-path ~/development/whisper/whisper.cpp/samples/aragorn.wav
...
[WhisperKit] -------------------------------
[WhisperKit] Model Load Time:     1.23 seconds
[WhisperKit] Inference Duration:  0.74 seconds
[WhisperKit] - Decoding Loop:     0.73 seconds
[WhisperKit] Time to first token: 0.34 seconds
[WhisperKit] Total Tokens:        9
[WhisperKit] Tokens per Second:   4.10 tok/s
[WhisperKit] Real Time Factor:    0.01
[WhisperKit] Fallbacks:           0.0
[WhisperKit] [0.00 --> 30.00] <|startoftranscript|><|nocaptions|><|endoftext|>
[WhisperKit] [30.00 --> 60.00] <|startoftranscript|><|nocaptions|><|endoftext|>
[WhisperKit] [60.00 --> 62.04] <|startoftranscript|><|nocaptions|><|endoftext|>


Transcription: 


With the TestFlight app I get some output, but it's low quality:

[Screenshot: TestFlight app transcription output, 2024-04-05 11:18:45]

And with whisper.cpp:

main: processing 'samples/aragorn.wav' (992576 samples, 62.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:19.380]   I see in your eyes, the same fear that would take the heart of me.
[00:00:19.380 --> 00:00:27.020]   But they may come when the courage of men fails, when we forsake our friends and break
[00:00:27.020 --> 00:00:31.720]   all bonds of fellowship, but it is not this day.
[00:00:31.720 --> 00:00:44.480]   And our walls and shattered shields when the age of men comes crashing down, but it is not this day, this day we fight.
[00:00:44.480 --> 00:01:01.020]   By all that you hold dear, this good earth, I bid you stand at the West!
[00:01:01.020 --> 00:01:02.020]   (door opens)


whisper_print_timings:     load time =   202.22 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =    29.92 ms
whisper_print_timings:   sample time =   155.11 ms /   609 runs (    0.25 ms per run)
whisper_print_timings:   encode time =   531.06 ms /     4 runs (  132.77 ms per run)
whisper_print_timings:   decode time =    10.11 ms /     2 runs (    5.05 ms per run)
whisper_print_timings:   batchd time =  1427.02 ms /   591 runs (    2.41 ms per run)
whisper_print_timings:   prompt time =    30.85 ms /   134 runs (    0.23 ms per run)
whisper_print_timings:    total time =  2391.29 ms

As you can see, the whisper.cpp result is WAY better. Is there a set of settings I can use to get a similar result? I'm just using the default options for both.

@atiorh (Contributor) commented Apr 5, 2024

Thanks for the report @bjnortier :) I just tested:

1) whisper.cpp with --beam-size 1: produces the first sentence from the result you shared and then stops early
2) whisper.cpp with --beam-size 5: almost the same result as you reported above
3) WhisperKit (which always uses a beam size of 1): halts early

We will investigate whether an aggressive hallucination guardrail or some input audio processing is causing this discrepancy for this particular sample and report back.

In general, we do not anticipate any quality drop when moving from whisper.cpp to WhisperKit, because the evaluation results on librispeech are almost identical and whisper.cpp has higher error rates on earnings22. (Eval results are published here; whisper.cpp is not yet tabulated, but its eval files are in this repo.)

@ZachNagengast (Contributor) commented Apr 5, 2024

Yes, this is definitely a beam search thing; thanks for this great example. The model wants to predict no speech at the beginning because the audio is quiet, and it stops early as @atiorh mentioned. You can get around this somewhat by forcing the first tokens (i.e., telling the model there is definitely speech here) with the decoding option --use-prefill-prompt, but there is still some missing speech:

(crowd shouting) And our walls and shattered shields when the Egypt man comes crashing down. But it is not this day. This day we fight. By all that you hold dear, this good earth, I bid you stand, hand over west. (shouting)
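
For reference, the same options can be set through WhisperKit's Swift API. Below is a minimal sketch, assuming the DecodingOptions property names mirror the CLI flags above (--use-prefill-prompt, --word-timestamps, --temperature); the audio path is hypothetical and exact signatures can vary between WhisperKit releases:

```swift
import WhisperKit

// Sketch only: property names mirror the CLI flags shown above and may
// differ slightly between WhisperKit releases.
let pipe = try await WhisperKit(model: "small")

var options = DecodingOptions()
options.usePrefillPrompt = true   // force the initial tokens, i.e. assume speech is present
options.wordTimestamps = true     // equivalent to --word-timestamps
options.temperature = 0.2         // equivalent to --temperature=0.2

let result = try await pipe.transcribe(
    audioPath: "/path/to/aragorn.wav",   // hypothetical local path
    decodeOptions: options
)
// The return shape (a single result vs. an array of segments) differs across
// versions; inspect `result` for the transcribed text and timings.
print(result)
```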

I managed to improve it a bit by scaling the loudness and applying a simple band-pass filter for the human speech frequency range, using this file and this command:
swift run whisperkit-cli transcribe --model small --audio-path ~/Downloads/aragorn_loudest_filtered.wav --verbose --word-timestamps --use-prefill-prompt --temperature=0.2

Hold your breath! Hold your breath! Back to Gondor! Abroja! My brothers! I see in your eyes the same fear that would take the heart of me. But they may come when the courage of men fails, when we forsake our friends and break all bonds of fellowship. But it is not this day. An hour of wolves and shattered shields when the age of men comes crashing down. But it is not this day. This day we fight. By all that you hold dear, this good earth I bid you stand, head for the west. (drill whirring)
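
For illustration only, here is a rough Swift sketch of that kind of preprocessing (peak-normalizing the gain and band-limiting to roughly the speech range with simple first-order filters). The cutoff frequencies and the normalization target are assumptions for this example, not the values used to produce the file above:

```swift
// Sketch of "scale the loudness and apply a simple band-pass filter":
// peak-normalize, then cascade a first-order high-pass and low-pass.
// The ~80 Hz and ~8 kHz cutoffs are illustrative assumptions.
func preprocess(_ samples: [Float], sampleRate: Float = 16_000) -> [Float] {
    guard samples.count > 1 else { return samples }

    // 1. Scale loudness: normalize the peak to just below full scale.
    let peak = samples.map(abs).max() ?? 0
    let gain: Float = peak > 0 ? 0.95 / peak : 1
    var x = samples.map { $0 * gain }

    let dt = 1 / sampleRate

    // 2. First-order high-pass (~80 Hz) to remove low-frequency rumble.
    let hpRC = 1 / (2 * Float.pi * 80)
    let hpAlpha = hpRC / (hpRC + dt)
    var hp = [Float](repeating: 0, count: x.count)
    for i in 1..<x.count {
        hp[i] = hpAlpha * (hp[i - 1] + x[i] - x[i - 1])
    }
    x = hp

    // 3. First-order low-pass (~8 kHz) to attenuate hiss above the speech band.
    let lpRC = 1 / (2 * Float.pi * 8_000)
    let lpAlpha = dt / (lpRC + dt)
    var lp = [Float](repeating: 0, count: x.count)
    lp[0] = lpAlpha * x[0]
    for i in 1..<x.count {
        lp[i] = lp[i - 1] + lpAlpha * (x[i] - lp[i - 1])
    }
    return lp
}
```

You would still need to decode the WAV into float samples, run this, and write the processed audio back out before passing it to the CLI.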

We will look into what we can clean up on the audio side by default, but beam search will require some reworking of the model; CoreML has some limitations that prevented us from implementing this originally.

@ZachNagengast added the "needs model updates" and "feature" labels on Apr 5, 2024
@bjnortier (Author) commented:

Ok interesting, thanks for the info!

@atiorh (Contributor) commented Apr 7, 2024

If you notice more discrepancies, please keep flagging them.
