encoding. of course it is encoding... #4

Open
RMHogervorst opened this issue Apr 12, 2018 · 6 comments

@RMHogervorst

It would be very nice if the text parsing defaulted to UTF-8, because some of the text I get back is garbled. An example from the 1001 Nights:

Generous Dealing of Yahya Son of KhÃ\u0081Lid with A Man Who Forged A Letter in His Name.

should be

Generous Dealing of Yahya Son of KhÁLid with A Man Who Forged A Letter in His Name.
@RMHogervorst
Author

  The Kingâ\u0080\u0099s Daughter and the Ape

should be

The King’s Daughter and the Ape.
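
Both garbled titles are consistent with UTF-8 bytes being decoded one byte per character, as Latin-1 does. Below is a minimal base-R sketch of that failure mode, not a description of what pubcrawl does internally. The helper name `misread_as_latin1` is hypothetical and exists only for illustration.

```r
# Hypothetical helper for illustration (not part of pubcrawl).
# Latin-1 maps byte value n to code point U+00nn, so feeding the raw
# UTF-8 byte values to intToUtf8() reproduces a Latin-1 misreading.
misread_as_latin1 <- function(x) {
  utf8_bytes <- charToRaw(enc2utf8(x))
  intToUtf8(as.integer(utf8_bytes))
}

# U+2019 (right single quotation mark) is the bytes e2 80 99 in UTF-8,
# so it turns into the three characters U+00E2, U+0080, U+0099.
garbled_quote <- misread_as_latin1("\u2019")
print(utf8ToInt(garbled_quote))   # 226 128 153

# U+00C1 (capital A with acute) is the bytes c3 81 in UTF-8,
# so it turns into the two characters U+00C3, U+0081.
garbled_a <- misread_as_latin1("\u00c1")
print(utf8ToInt(garbled_a))       # 195 129
```

The code points printed here match the `â\u0080\u0099` and `Ã\u0081` sequences in the two titles above.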

@RMHogervorst
Author

This is the file that doesn't work (I had to zip it, because GitHub doesn't accept .epub attachments):

arab.zip

@RMHogervorst
Author

I extracted a few parts, and the HTML files inside are encoded correctly; that is, each one declares its charset:

<meta charset="utf-8" />  

So I guess it could read that tag, or default to UTF-8.
In https://github.com/hrbrmstr/pubcrawl/blob/master/R/clean-text.R#L5:

if (!inherits(doc, "html_document")) doc <- xml2::read_html(doc)

read_html() might need the encoding argument (it defaults to "").
If I read the HTML file in directly with rvest::html_text(xml2::read_html("file.html")), it already defaults to UTF-8. So perhaps there is implicit recoding when xslt::xml_xslt is applied to the data?
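
The "read that tag, or default to UTF-8" idea can be sketched in base R as follows. This is an illustration under assumptions, not code from pubcrawl: `detect_charset` is a hypothetical helper, and the regular expression only handles the common `<meta charset=...>` and `<meta ... content="...; charset=...">` forms.

```r
# Hypothetical helper (not part of pubcrawl or xml2): return the charset
# declared in an HTML string's <meta> tag, or a default if none is found.
detect_charset <- function(html, default = "UTF-8") {
  pattern <- "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_-]+)"
  m <- regmatches(
    html,
    regexec(pattern, html, ignore.case = TRUE, perl = TRUE)
  )[[1]]
  # m holds the full match followed by the captured charset name;
  # it is empty when the pattern does not match.
  if (length(m) < 2) default else toupper(m[2])
}

html5_style <- "<html><head><meta charset=\"utf-8\" /></head></html>"
html4_style <- paste0(
  "<meta http-equiv=\"Content-Type\" ",
  "content=\"text/html; charset=ISO-8859-1\">"
)
no_meta <- "<html><body>no declaration</body></html>"

print(detect_charset(html5_style))  # "UTF-8"
print(detect_charset(html4_style))  # "ISO-8859-1"
print(detect_charset(no_meta))      # "UTF-8"
```

The detected name could then be handed to the `encoding` argument of `xml2::read_html()` mentioned above, instead of leaving it at `""`.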

@RMHogervorst
Author

Nope, that's not it (xml2::read_html(doc) would also always default to UTF-8).

@hrbrmstr
Owner

So, the default was UTF-8, but I added a pass-through encoding parameter wherever I could, and it still looks as though you're going to have to post-process to handle Latin1 or cp1252 (etc.) encodings. For example:

x <- epub_to_text("~/Downloads/b97b.epub", "Latin1")

z <- x$content[1] # just to make it easier to debug in my session

substr(z, 1, 1000) # I added the hard line breaks

[1] "The Book of The Thousand Nights and a Night: a plain and literal translation of the Arabian Nights Entertainments. Translated and annotated by Richard F. Burton; illustrated by Albert Letchford\n    Contents\n      Top\n\tEditorâ\u0080\u0099s Note to this Web 
Edition\n\tDedications to the Original Ten Volumes\n\tThe Translatorâ\u0080\u0099s Foreword.\n\tThe Book of The Thousand Nights and a 
Night\n\tTale of the Trader and the Jinni.\n\tThe First Shaykhâ\u0080\u0099s Story.\n\tThe Second Shaykhâ\u0080\u0099s Story.\n\tThe 
Third Shaykhâ\u0080\u0099s Story.\n\tThe Fisherman and the Jinni.\n\tThe Tale of the Wazir and the Sage Duban.\n\tKing Sindibad and 
his Falcon.\n\tThe Tale of the Husband and the Parrot.\n\tThe Tale of the Prince and the Ogress.\n\tThe Tale of the Ensorcelled 
Prince.\n\tThe Porter and the Three Ladies of Baghdad.\n\tThe First Kalandarâ\u0080\u0099s Tale.\n\tThe Second Kalandarâ\u0080\u0099s 
Tale.\n\tThe Tale of the Envier and the Envied.\n\tThe Third Kalandarâ\u0080\u0099s Tale.\n\tThe Eldest Ladyâ\u0080\u0099s 
Tale.\n\tTale of the Portress.\n\tThe Tale of the Three Apples\n\tTale of Nur Al-Din and his S"

In theory, it should have dealt with ^^ properly, since the encoding is (honest!) passed in all the way through, and I even do a final iconv() to `encoding` on the column.

But, if you do (this text is Latin1 btw):

substr(iconv(z, "", to="Latin1"), 1, 1000)

[1] "The Book of The Thousand Nights and a Night: a plain and literal translation of the Arabian Nights Entertainments. Translated 
and annotated by Richard F. Burton; illustrated by Albert Letchford\n    Contents\n      Top\n\tEditor’s Note to this Web 
Edition\n\tDedications to the Original Ten Volumes\n\tThe Translator’s Foreword.\n\tThe Book of The Thousand Nights and a 
Night\n\tTale of the Trader and the Jinni.\n\tThe First Shaykh’s Story.\n\tThe Second Shaykh’s Story.\n\tThe Third Shaykh’s 
Story.\n\tThe Fisherman and the Jinni.\n\tThe Tale of the Wazir and the Sage Duban.\n\tKing Sindibad and his Falcon.\n\tThe Tale of 
the Husband and the Parrot.\n\tThe Tale of the Prince and the Ogress.\n\tThe Tale of the Ensorcelled Prince.\n\tThe Porter and the 
Three Ladies of Baghdad.\n\tThe First Kalandar’s Tale.\n\tThe Second Kalandar’s Tale.\n\tThe Tale of the Envier and the 
Envied.\n\tThe Third Kalandar’s Tale.\n\tThe Eldest Lady’s Tale.\n\tTale of the Portress.\n\tThe Tale of the Three Apples\n\tTale of 
Nur Al-Din and his Son.\n\tThe Hunchback"

it works.
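
A vectorised version of this after-the-fact repair might look like the sketch below. It is an assumption-laden illustration, not part of pubcrawl: `fix_double_encoded` is a hypothetical helper, and it assumes the platform's iconv maps Latin-1 to bytes one-to-one, including the 0x80 to 0x9F range.

```r
# Hypothetical helper (not part of pubcrawl): undo text where UTF-8 bytes
# were decoded as Latin-1. Elements that do not fit the pattern are
# returned unchanged.
fix_double_encoded <- function(x) {
  x <- enc2utf8(x)
  # Map each character back to the single byte it was misread from.
  # iconv() returns NA for strings with characters outside Latin-1.
  bytes <- iconv(x, from = "UTF-8", to = "latin1")
  # Declare those bytes to be what they originally were: UTF-8.
  Encoding(bytes) <- "UTF-8"
  # Keep the repair only where it produced valid UTF-8.
  ok <- !is.na(bytes) & validUTF8(bytes)
  x[ok] <- bytes[ok]
  x
}

# "King" followed by U+00E2 U+0080 U+0099 and "s", as in the output above.
garbled <- intToUtf8(c(75L, 105L, 110L, 103L, 226L, 128L, 153L, 115L))
titles <- c(garbled, "plain ascii", "caf\u00e9", "already\u2019s fine")
repaired <- fix_double_encoded(titles)
print(repaired)
```

Correctly encoded strings such as the last two inputs pass through untouched, because converting them to Latin-1 either fails or yields bytes that are not valid UTF-8. A string that really was meant to contain a sequence like `Ã©` would be altered, so this is only safe on text known to be double-encoded.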

I'll keep this open since I'd like to provide robust support in the long run, but at least the iconv() call should work ex post facto for the edge cases.

@hrbrmstr
Owner

(just saw your extended comments)

aye, i even pass encoding along to it and ensure it's a raw vector when processing and still no-go.

something (IMO) "weird" is happening either as a result of read_html() OR in tibble-land causing some issues but iconv() will work ex post facto.

@hrbrmstr hrbrmstr self-assigned this Apr 13, 2018
@hrbrmstr hrbrmstr added the monitoring Watching this issue for potential follow-up later label Apr 13, 2018