Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing improvements for #31 #32

Open
wants to merge 32 commits into
base: master
Choose a base branch
from

Conversation

SL-Gundam
Copy link
Contributor

This fully fixes Variant5 #31

Also added an option for ConverterExtra to disable adding the CSS class after a tag. I do need ConverterExtra for table support so this CSS class was an annoying addition for me

Let me know what you think

@SL-Gundam
Copy link
Contributor Author

SL-Gundam commented Feb 4, 2018

Here an explanation behind the commits that fix some handling issues
cee262f - Fixes the handling of the font tag for Variant5 in #31
d0da9e0 - Fixes a cause for extra line endings that should not be present for various examples in #31

@tzi
Copy link
Member

tzi commented Feb 4, 2018

I'm pretty OK to merge the following feature if you add associated test cases:

  • Adjust fixInlineElementSpacing to not trigger for emptyTags
  • Allow to disable adding the CSS class after the tag

I'm not sure about the numeric arguments, because in the HTML spec you can have string attribute value without quotes.
So, It seems to be more a bug about the attribute value parsing than to allow numeric attribute value.

What do you think?

Cheers,
Thomas.

@tzi
Copy link
Member

tzi commented Feb 4, 2018

I separated 2 of your commits about inline space fixing in #33, because it's valuable! 🎉

@SL-Gundam
Copy link
Contributor Author

SL-Gundam commented Feb 4, 2018

Regarding the test cases: I am planning on adding those but before i spend time on that i wanted to at least have your opinion about the work in progress.
I'm still working on the other examples in #31 so it might be i'll work on them first and add the test cases after that.

Regarding attribute values:
Variant2 (#31) will have an unquoted string attribute value so your right that the current fix is not enough

@SL-Gundam
Copy link
Contributor Author

SL-Gundam commented Feb 7, 2018

Variant 4 (#31) is now fixed

6f40bcc - Ignore the last / for url and title comparison (Variant5)
56f4424 - Changed numeric attribute check cee262f to quoted and unquoted
ea771c3 - Variant4 had "*** This message was automatically generated by Microsoft Dynamics CRM ***" escaped as "\*\\*\* This message was automatically generated by Microsoft Dynamics CRM \*\**". This fixes that

Bacause of ea771c3 the test cases are failing. Do you have objections to this commit? if not i will adjust the test cases as required. Otherwise i would like your input on how to handle this

@SL-Gundam
Copy link
Contributor Author

e024197 - Cleans up redundant spaces using rtrim. Since a lot of tests are expecting these spaces, they are currently failing

@SL-Gundam
Copy link
Contributor Author

SL-Gundam commented Feb 15, 2018

That should fix all test cases for the previously proposed changes

@SL-Gundam
Copy link
Contributor Author

SL-Gundam commented Feb 17, 2018

Some improvement for handling Variant1 in #31

7a0f6a1 - Markdownify only works properly with line feed EOLs so any incoming HTML is now converted
e4a1f9b - After a closing p tag the line breaks need to be flushed before text is outputted
a72f108 - Variant1 uses div tags around various lines. Div tags only take up one empty line so decreased this here to improve the layout of the from, to, sent and subject lines.
75cf897 - After a closing p tag a ltrim was not performed. This caused any text after a closing p tag to have a space in front of it. Since a ltrim was performed after br tags, i added these as well for p tags

Variant1 is not perfect but good enough and further improvements might effect other situations undesirably

@SL-Gundam
Copy link
Contributor Author

552ba1b - Ignores Office namespace tag <o:p xmlns:o="#unknown"></o:p>.
I could not find any documentation on other possible office namespace tags. So further tags will have to be added when we encounter them

@SL-Gundam
Copy link
Contributor Author

0bea029 - I had an issue where plaintext content would be parsed for markdown with undesired results. This addition allows me to pull the regular expressions out of Markdownify and apply only the escaping on my content.
475af00 - The Header escaping did not work as desired if the header wasn't on the first line of the text. This modifies that so that it works properly
dc77382 - This escapes any header markdown with =
27131ac - Trivial change but escapes the regular expressions properly

@SL-Gundam
Copy link
Contributor Author

SL-Gundam commented Mar 3, 2018

Some more improvements

f6a3290 - If an unquoted attribute ended the tag, the tag closing character would not be properly detected without this
9efd59c - Some trivial PSR coding standards and add &nbsp; conversion to normal space
fc34c27 - More (|) character is now escaped

*
* @return array escapeInText
*/
public function getescapeInText()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: should this be getEscapeInText?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the variables use a construction where the first word is not capped like here https://github.com/Elephant418/Markdownify/blob/master/src/Converter.php#L169 and here https://github.com/Elephant418/Markdownify/blob/master/src/Converter.php#L185

I kept the function name to adhere to the same name as the variable being retrieved. @tzi so far has not commented on this. But i think he will go over the code when I've finished all of my "improvements" and added the associated test cases per his request

So the short question is. Should the name of the function be 100% exactly the same as the variable being retrieved? or are there guidelines which decide the function name?

@SL-Gundam
Copy link
Contributor Author

@tzi How to best handle the following code if encountered in html

<![if !supportLists]>
<![endif]>

It's comparable to the <o:p></o:p> but that one at least started and ended with the same tag
Would it be best to make it part of $emptyTags? like

'![if'
'![endif]'

@SL-Gundam
Copy link
Contributor Author

xmlns can have numbers in them apparently (Outlook 2010)

xmlns:udcp2p="http://schemas.microsoft.com/data/udc/parttopart"

@SL-Gundam
Copy link
Contributor Author

I had a situation where an html email contained an empty table tag (microsoft onedrive email). This makes sure that this does not result in an error: b9b3f41

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants