0xA0 is causing gtts-cli to send EOF. #353

Open
medanisjbara opened this issue May 27, 2022 · 2 comments

Comments

@medanisjbara

medanisjbara commented May 27, 2022

Prerequisites

  • [*] Did you make sure a similar issue didn't exist?
  • [*] Did you update gTTS to the latest? (pip install --upgrade gTTS)

Current Behaviour (steps to reproduce)

gtts-cli mostly ignores 0xA0 in the input text. In certain situations (such as the provided example), however, it produces "Error: 200 (OK) from TTS API. Probable cause: No audio stream in response. Unsupported language 'en'" along with EOF. The message seems to go to stderr without an actual Python error being raised.

$ gtts-cli -f test -o test.mp3

working_test.txt
non_working_test.txt
The files contain 0xA0, which I assumed would make them binary files, but the file command says otherwise.

$ file non_working_test.txt
non_working_test.txt: Unicode text, UTF-8 text

gtts-cli didn't complain about non-UTF-8 characters, and using iconv to remove non-UTF-8 characters doesn't change anything.
Running iconv -f utf-8 -t utf-8 -c test does nothing to the file.
Some web pages use that character in the middle of the text, and most text editors show it as a space. This is frustrating for the user, who has almost no clue what causes the error or what to do about it.
I can't blame the creator of the page either: after searching online, it seems that 0xA0 is part of the windows-1252 encoding, so if he wrote his blog in Microsoft Word, there's a good chance it was introduced there.
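One possible explanation for the file and iconv results above, offered as an assumption rather than a confirmed diagnosis: if the character is U+00A0 (NO-BREAK SPACE), it is stored in UTF-8 as the two bytes C2 A0, which is valid UTF-8, so neither tool has anything to reject. The sketch below shows the encoding and a hypothetical helper, find_nbsp, for locating the character in a string.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE, byte 0xA0 in windows-1252 and Latin-1


def find_nbsp(text: str) -> list[int]:
    """Return the character offsets of every U+00A0 in text."""
    return [i for i, ch in enumerate(text) if ch == NBSP]


sample = "hello\u00a0world"

# In UTF-8 the character is the two-byte sequence C2 A0, which is valid UTF-8.
print(sample.encode("utf-8"))

# The single byte 0xA0 decodes to the same character under windows-1252.
print(b"\xa0".decode("windows-1252") == NBSP)

# Offsets of the no-break spaces in the sample.
print(find_nbsp(sample))
```

Running find_nbsp on the contents of a failing input file would show whether, and where, the character occurs.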

Expected Behaviour

gtts-cli should ignore that character and continue reading regardless of how and where it is present.

Context

I am writing a simple bash script that reads aloud the user's clipboard or a webpage associated with the url in the user's clipboard.
I have personally been using the command w3m "$(xclip -o)" | gtts-cli -f - | mpv - for over a year to boost productivity when reading, with some variations such as less $pdf_file_or_epub_file | gtts-cli -f - | mpv - and so on.
The script basically does the same (it is still very basic and under development).
I came across some web pages that caused that error to occur. After some investigation, I found out that the character 0xA0 is what causes the problem.
So I created this issue and made a small workaround that uses bbe to replace the bad character with nothing (and then iconv for cleanup, since bbe was messing up a couple of things).
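A workaround along the same lines can be sketched in Python instead of bbe. This is a minimal sketch, assuming the offending character reaches gtts-cli as the UTF-8 sequence C2 A0; filter_stream is a hypothetical helper, not part of gTTS. It substitutes a plain space rather than deleting the character, so the words on either side do not run together.

```python
import io

NBSP_UTF8 = b"\xc2\xa0"  # UTF-8 encoding of U+00A0 (assumed to be the culprit)


def filter_stream(src, dst, replacement=b" "):
    """Copy the bytes of src to dst, replacing each UTF-8 no-break space."""
    dst.write(src.read().replace(NBSP_UTF8, replacement))


# Demonstration on in-memory binary streams.
src = io.BytesIO("read\u00a0this aloud".encode("utf-8"))
dst = io.BytesIO()
filter_stream(src, dst)
print(dst.getvalue())
```

In a pipeline, the same function could be called as filter_stream(sys.stdin.buffer, sys.stdout.buffer) from a small script placed between w3m and gtts-cli.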

Environment

$ gtts-cli --version
gtts-cli, version 2.2.4

$ python --version
Python 3.9.12

$ uname -a
Linux Laptop 5.17.3-tkg-pds #1 TKG SMP PREEMPT Sat Apr 16 06:53:55 CET 2022 x86_64 Intel(R) Celeron(R) N4000 CPU @ 1.10GHz GenuineIntel GNU/Linux
  • OS: Gentoo/Linux x86_64
@medanisjbara
Author

I assume this isn't gtts-cli's fault, since there's no actual Python error, so the problem is probably with the Google text-to-speech engine. Still, the behavior itself is confusing, so I hope a fix will be applied.

@pndurette
Owner

@medanisjbara Thanks a lot for this well documented behaviour!

Hmm, so it's a windows-1252 character. I wonder if there's anything gTTS should (or shouldn't) do about this, like applying some filtering. I'll have to take a look with debugging on.
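As an illustration of what such filtering might look like, here is a sketch under assumptions, not gTTS's actual behaviour. The hypothetical helper normalize_spaces maps every character in the Unicode "space separator" category (Zs), which includes U+00A0, to an ordinary space before the text is tokenized. Whether this avoids the API error has not been verified here.

```python
import unicodedata


def normalize_spaces(text: str) -> str:
    """Map every Unicode space separator (category Zs) to U+0020.

    Newlines and tabs are control characters (category Cc), so they
    pass through unchanged.
    """
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in text
    )


# U+00A0 (no-break space) and U+2009 (thin space) both become plain spaces.
print(repr(normalize_spaces("a\u00a0b\u2009c d\ne")))
```

Filtering by category rather than by a single code point would also cover other unusual spaces that web pages sometimes contain.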
