WIP: HTML API: Add an XML serializer. #7408

Draft
wants to merge 15 commits into
base: trunk

dmsnell
Member

@dmsnell dmsnell commented Sep 21, 2024

Trac ticket: Core-62091
Built from #7331

Provides a mechanism to serialize an HTML fragment to the XML syntax. YOU PROBABLY SHOULDN'T USE THIS!!!!

REMEMBER that so-called "XHTML" served without a path ending in .xml or without the Content-Type: application/xhtml+xml HTTP header will render as HTML, and ONE SHOULD NOT SERVE XML/XHTML as HTML!!!

php > var_dump( ( WP_HTML_Processor::create_fragment( '<p>an <img> is worth &AElig thousand words' ) )->serialize_to_xml() );
string(43) "<p>an <img /> is worth Æ thousand words</p>"
php > var_dump( ( WP_HTML_Processor::create_fragment( '<svg><foreignObject><p>Test<svg><text>Smile</text></p></foreignObject><p>test' ) )->serialize_to_xml() );
string(200) "<svg xmlns="http://www.w3.org/2000/svg"><foreignObject xmlns="http://www.w3.org/1999/xhtml"><p>Test<svg xmlns="http://www.w3.org/2000/svg"><text>Smile</text></svg></p></foreignObject></svg><p>test</p>"
php > var_dump( ( WP_HTML_Processor::create_full_parser( '<svg><foreignObject><p>Test<svg><text>Smile</text></p></foreignObject><p>test' ) )->serialize_to_xml() );
string(315) "<?xml version="1.0" encoding="UTF-8" ?>
<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body><svg xmlns="http://www.w3.org/2000/svg"><foreignObject xmlns="http://www.w3.org/1999/xhtml"><p>Test<svg xmlns="http://www.w3.org/2000/svg"><text>Smile</text></svg></p></foreignObject></svg><p>test</p></body></html>"

Extremely rare cases when it's appropriate to use this

  • Exporting HTML content into an Atom feed without escaping it. HTML may/ought to be escaped, like <content type="html">&lt;p&gt;yay&lt;/p&gt;</content>, but if the document can be serialized to XML it can be embedded directly, as in <content type="xhtml" xmlns="http://www.w3.org/1999/xhtml"><p>yay</p></content>.
  • When attempting to directly embed HTML content into any other XML document without escaping it.

HTML generally cannot be expressed in XML, and according to the HTML specification, "Using the XML syntax is not recommended." Prefer escaping the HTML to avoid corruption and data loss.
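
A minimal sketch of the escaping route recommended above, using only PHP's built-in `htmlspecialchars()`. The helper name `atom_html_content()` is hypothetical; it is not part of this PR or of WordPress.

```php
<?php
/**
 * Wraps an HTML fragment in an Atom <content type="html"> element.
 * Hypothetical helper for illustration only; not a WordPress function.
 */
function atom_html_content( string $html ): string {
	// ENT_XML1 escapes for an XML context; ENT_QUOTES also escapes both quote styles.
	$escaped = htmlspecialchars( $html, ENT_XML1 | ENT_QUOTES, 'UTF-8' );

	return '<content type="html">' . $escaped . '</content>';
}

echo atom_html_content( '<p>yay</p>' ), "\n";
// <content type="html">&lt;p&gt;yay&lt;/p&gt;</content>
```

Because the markup travels as escaped text, the feed stays well-formed XML no matter how the HTML inside was written.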

dmsnell and others added 15 commits September 11, 2024 09:37
The HTML Processor understands HTML regardless of how it's written, but
many other functions are unable to do so. There are all sorts of syntax
peculiarities and semantics that would be helpful to eliminate using the
knowledge contained in the HTML Processor.

This patch introduces `WP_HTML_Processor::normalize( $html )` as a
method which takes a fragment of HTML as input and then returns a
serialized version of the input, "cleaning it up" by balancing all
tags, providing all missing optional tags, re-encoding all text,
removing all duplicate attributes, and double-quote-escaping all
attribute values.

Core-62036

If code later in the processing pipeline adds unquoted attributes
and doesn't add the requisite space following that, then another
parser might find that the solidus is part of the attribute value
instead of serving as a self-closing flag.

Co-authored-by: Weston Ruter <[email protected]>
Co-authored-by: Weston Ruter <[email protected]>
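
The solidus problem described in the commit message above can be sketched as follows. This is a simplified illustration, not the HTML Processor's code: it assumes only the HTML tokenizer rule that an unquoted attribute value ends at ASCII whitespace or `>`. The helper `read_unquoted_value()` and the sample markup are hypothetical.

```php
<?php
/**
 * Reads an unquoted attribute value starting at $offset.
 * Simplified sketch: the value runs until ASCII whitespace or ">",
 * so a "/" directly after the value is treated as part of it.
 */
function read_unquoted_value( string $html, int $offset ): string {
	$length = strcspn( $html, " \t\n\f\r>", $offset );

	return substr( $html, $offset, $length );
}

$without_space = '<img src=cat.png/>';
$with_space    = '<img src=cat.png />';
$offset        = strlen( '<img src=' );

var_dump( read_unquoted_value( $without_space, $offset ) ); // string(8) "cat.png/"
var_dump( read_unquoted_value( $with_space, $offset ) );    // string(7) "cat.png"
```

The space before the solidus is what keeps `/` out of the attribute value.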

github-actions bot commented Sep 21, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, siliconforks.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@hubot hubot deleted the html-api/normalize-to-xml branch September 21, 2024 00:53

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@siliconforks

php > var_dump( ( WP_HTML_Processor::create_fragment( '<svg><foreignObject><p>Test<svg><text>Smile</text></p></foreignObject><p>test' ) )->serialize_to_xml() );
string(200) "<svg xmlns="http://www.w3.org/2000/svg"><foreignObject xmlns="http://www.w3.org/1999/xhtml"><p>Test<svg xmlns="http://www.w3.org/2000/svg"><text>Smile</text></svg></p></foreignObject></svg><p>test</p>"
php > var_dump( ( WP_HTML_Processor::create_full_parser( '<svg><foreignObject><p>Test<svg><text>Smile</text></p></foreignObject><p>test' ) )->serialize_to_xml() );
string(315) "<?xml version="1.0" encoding="UTF-8" ?>
<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body><svg xmlns="http://www.w3.org/2000/svg"><foreignObject xmlns="http://www.w3.org/1999/xhtml"><p>Test<svg xmlns="http://www.w3.org/2000/svg"><text>Smile</text></svg></p></foreignObject></svg><p>test</p></body></html>"

Are the above examples actually right? Is the xmlns="http://www.w3.org/1999/xhtml" supposed to be on the foreignObject element like that?

Compare the above code to the example here:

https://developer.mozilla.org/en-US/docs/Web/SVG/Element/foreignObject

  • Exporting HTML content into an Atom feed without escaping it. HTML may/ought to be escaped, like <content type="html">&lt;p&gt;yay&lt;/p&gt;</content>, but if the document can be serialized to XML it can be embedded directly, as in <content type="xhtml" xmlns="http://www.w3.org/1999/xhtml"><p>yay</p></content>.

The above Atom example has basically the same issue - is the xmlns="http://www.w3.org/1999/xhtml" supposed to be on the content element?

Compare to the example here:

https://en.wikipedia.org/wiki/Atom_(web_standard)#Example_of_an_Atom_1.0_feed
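
For reference, the Atom specification (RFC 4287) requires that `type="xhtml"` content be a single XHTML `div`, which is where the XHTML namespace is declared in the linked example. A minimal sketch of building that shape with PHP's bundled DOM extension; the variable names are illustrative and nothing here is part of this PR:

```php
<?php
$atom_ns  = 'http://www.w3.org/2005/Atom';
$xhtml_ns = 'http://www.w3.org/1999/xhtml';

$doc = new DOMDocument( '1.0', 'UTF-8' );

// <content> itself stays in the Atom namespace.
$content = $doc->createElementNS( $atom_ns, 'content' );
$content->setAttribute( 'type', 'xhtml' );
$doc->appendChild( $content );

// The XHTML namespace belongs to a wrapping <div>, not to <content>.
$div = $doc->createElementNS( $xhtml_ns, 'div' );
$content->appendChild( $div );

$p = $doc->createElementNS( $xhtml_ns, 'p' );
$p->appendChild( $doc->createTextNode( 'yay' ) );
$div->appendChild( $p );

// Serialize only the <content> subtree.
echo $doc->saveXML( $content ), "\n";
```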

@dmsnell
Member Author

dmsnell commented Sep 21, 2024

Thanks @siliconforks.

You're right, in that the new default namespace applies to the foreignObject itself, which isn't correct. This PR is a big WIP though - honestly I would be just as happy if it always raised an exception 🙃

But I'm still exploring and trying to understand what needs to occur and how it can be done in order to transform as safely as possible. I'll add WIP to the title.
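
A small check of the namespace issue discussed above, using PHP's bundled DOM extension as the XML parser. It is only a sketch: the second string reflects my reading of the MDN example (namespace declared on the HTML child) and is not output from this PR; the helper `first_child_namespace()` is hypothetical.

```php
<?php
$svg_ns   = 'http://www.w3.org/2000/svg';
$xhtml_ns = 'http://www.w3.org/1999/xhtml';

// Declaration on <foreignObject>: the element itself leaves the SVG namespace.
$misplaced = '<svg xmlns="' . $svg_ns . '"><foreignObject xmlns="' . $xhtml_ns . '"><p>Test</p></foreignObject></svg>';

// Declaration on the HTML child: <foreignObject> remains an SVG element.
$corrected = '<svg xmlns="' . $svg_ns . '"><foreignObject><p xmlns="' . $xhtml_ns . '">Test</p></foreignObject></svg>';

/** Returns the namespace URI of the root element's first child node. */
function first_child_namespace( string $xml ): ?string {
	$doc = new DOMDocument();
	$doc->loadXML( $xml );

	return $doc->documentElement->firstChild->namespaceURI;
}

var_dump( first_child_namespace( $misplaced ) === $xhtml_ns ); // bool(true)
var_dump( first_child_namespace( $corrected ) === $svg_ns );   // bool(true)
```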

@dmsnell dmsnell changed the title HTML API: Add an XML serializer. WIP: HTML API: Add an XML serializer. Sep 21, 2024
@dmsnell dmsnell marked this pull request as draft September 21, 2024 23:21