Skip to content

Define processing instruction attributes#1454

Open
foolip wants to merge 16 commits into
mainfrom
foolip/pi-attributes
Open

Define processing instruction attributes#1454
foolip wants to merge 16 commits into
mainfrom
foolip/pi-attributes

Conversation

@foolip
Copy link
Copy Markdown
Member

@foolip foolip commented Mar 4, 2026

(See WHATWG Working Mode: Changes for more details.)


Preview | Diff

@foolip foolip mentioned this pull request Mar 4, 2026
5 tasks
Comment thread dom.bs
Comment thread dom.bs
<pre class=idl>
[Exposed=Window]
interface ProcessingInstruction : CharacterData {
constructor(DOMString target, optional DOMString data = "");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially thought this would be a bit ugly compared to using IDL records or some such so you can set the attributes directly, but I think overall it's probably okay if we have to keep around the dual-data-structure anyway.

Comment thread dom.bs Outdated
Comment thread dom.bs Outdated
Comment thread dom.bs Outdated
</div>

<div algorithm>
<p>To <dfn>update attributes from data</dfn> for a {{ProcessingInstruction}} <a for=/>node</a> <var>pi</var>:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure this parser is compatible with how xsl-stylesheet is parsed today?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it was based on how this works in Chromium:

https://source.chromium.org/chromium/chromium/src/+/main:third_party/blink/renderer/core/dom/processing_instruction.cc;l=120-146;drc=bc971aee7d328d1e886ba76adc540805ea480981

In WebKit it is similar:

https://github.com/WebKit/WebKit/blob/9af8ffff919a1a1478b8a43cef821adb49a4e6a1/Source/WebCore/dom/ProcessingInstruction.cpp#L99
https://github.com/WebKit/WebKit/blob/26d7991c939a41bbab5d01263900598b67220208/Source/WebCore/xml/parser/XMLDocumentParserLibxml2.cpp#L1556-L1574

In Gecko it's a little bit different, every attribute lookup uses an attribute parser and I can't tell if it exactly matches XML or not:
https://github.com/mozilla-firefox/firefox/blob/22d04b52b0eb8d9fa11bf8ede5ccc0243a07c5ba/dom/xml/XMLStylesheetProcessingInstruction.cpp#L92
https://github.com/mozilla-firefox/firefox/blob/afe51e48c9deecccbaf64a74d6f207456433df98/dom/xml/ProcessingInstruction.cpp#L77
https://github.com/mozilla-firefox/firefox/blob/22d04b52b0eb8d9fa11bf8ede5ccc0243a07c5ba/dom/base/nsContentUtils.cpp#L1842-L1943

The difference from what's already implemented is that we'd use the HTML parser in HTML documents, instead of unconditionally using the XML parser.

For compat, this difference means that existing content that creates a <?xml-stylesheet href="..."?> PI and moves it to an HTML document would take the HTML parser path instead of XML. For that to make a difference, I think that content would need to be invalid XML attribute syntax (like <?xml-stylesheet href='...'?>) and depends on it being broken (doing nothing) when being moved to HTML.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main difference is what happens if pi.data is set after the PI is moved to an HTML document (or constructed with createProcessingInstruction). In the new scenario invalid XML yet valid HTML can be used in that data for the xml-stylesheet attributes.

@foolip
Copy link
Copy Markdown
Member Author

foolip commented Mar 5, 2026

The approach here is to parse attributes from the PI data, and to keep data and attributes in sync whenever one of them changes. Alternatives considered:

Attribute mode: Treat data as legacy and switch to "attribute mode" whenever attributes are mutated, replacing data with the empty string and optionally making it readonly. Serialization would need to change to use the attributes in this mode. Not pursued because setAttribute() would have the non-obvious side effect of making future data mutations throw or (alternatively) have no effect on serialization. This option is only attractive if it could be the default, but I can't imagine that's web compatible.

Just data: Parse data whenever getting an attribute. This seems close to how Gecko handled attributes for <?xml-stylesheet?>. But updating data in setAttribute() requires a parsing all attributes. There would be no observable difference to the current keep-everything-in-sync approach.

Attributes for HTML, data for XML: Since PIs are new in HTML, we could have "attribute mode" in HTML and keep data behavior intact for XML. But this would make the APIs significantly different in HTML and XML documents, and it would leave <?xml-stylesheet?> unexplained.

chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this pull request Mar 5, 2026
Add setAttribute/getAttribute/hasAttribute/removeAttribute etc. to
ProcessingInstruction. They act similarly to attributes on elements,
however they don't produce attribute mutation records.

See spec PR: whatwg/dom#1454

Bug: 431374376
Change-Id: I659e772f0fb11a6ffe2fe3876cf61ee0f6ad042a
Comment thread dom.bs
<li><p>Append U+0022 (") to <var>data</var>.

<li><p>Append the result of
<a href="https://html.spec.whatwg.org/#escapingString">escaping a string</a> given
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because https://html.spec.whatwg.org/#escapingString isn't exported and obviously this needs to be fixed. I'm just not sure yet if the escaping should be the same for HTML and XML, or if it should be split like parsing.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Producing &nbsp;, which isn't in https://www.w3.org/TR/xml-stylesheet/#NT-PredefEntityRef seems like a bad idea in the XML case at least.

brave-builds pushed a commit to brave/chromium that referenced this pull request Mar 5, 2026
Add setAttribute/getAttribute/hasAttribute/removeAttribute etc. to
ProcessingInstruction. They act similarly to attributes on elements,
however they don't produce attribute mutation records.

See spec PR: whatwg/dom#1454

Bug: 431374376
Change-Id: I659e772f0fb11a6ffe2fe3876cf61ee0f6ad042a
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/7607900
Reviewed-by: Philip Jägenstedt <foolip@chromium.org>
Commit-Queue: Noam Rosenthal <nrosenthal@google.com>
Cr-Commit-Position: refs/heads/main@{#1594810}
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this pull request Mar 5, 2026
Add setAttribute/getAttribute/hasAttribute/removeAttribute etc. to
ProcessingInstruction. They act similarly to attributes on elements,
however they don't produce attribute mutation records.

See spec PR: whatwg/dom#1454

Bug: 431374376
Change-Id: I659e772f0fb11a6ffe2fe3876cf61ee0f6ad042a
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/7607900
Reviewed-by: Philip Jägenstedt <foolip@chromium.org>
Commit-Queue: Noam Rosenthal <nrosenthal@google.com>
Cr-Commit-Position: refs/heads/main@{#1594810}
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this pull request Mar 5, 2026
Add setAttribute/getAttribute/hasAttribute/removeAttribute etc. to
ProcessingInstruction. They act similarly to attributes on elements,
however they don't produce attribute mutation records.

See spec PR: whatwg/dom#1454

Bug: 431374376
Change-Id: I659e772f0fb11a6ffe2fe3876cf61ee0f6ad042a
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/7607900
Reviewed-by: Philip Jägenstedt <foolip@chromium.org>
Commit-Queue: Noam Rosenthal <nrosenthal@google.com>
Cr-Commit-Position: refs/heads/main@{#1594810}
Comment thread dom.bs Outdated
"{{InvalidCharacterError!!exception}}" {{DOMException}}. <!-- Gecko does this. -->

<li>Return a new {{ProcessingInstruction}}
<li>Let <var>pi</var> be a new {{ProcessingInstruction}}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's fix this algorithm to use <li><p> as well.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But even better. We should make this algorithm invoke "initialize a ProcessingInstruction node" that essentially does what this algorithm does except for allocating a new node. Then we can reuse it for the constructor and keep them synchronized indefinitely.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in the style of event's "initialize" which required disambiguation invocations of that.

Not setting target and data on creation does leave the two undefined until "initialize" is called since https://dom.spec.whatwg.org/#concept-pi-target and https://dom.spec.whatwg.org/#concept-cd-data don't have defaults. Saying that they default to the empty string would be weird in the case of target since that cannot be the empty string after creation+initialization.

If you think this is ugly we could instead make it a "validate target and data" algorithm that only has the exception-throwing bits.

Comment thread dom.bs Outdated
Comment thread dom.bs Outdated
"<code>&gt;&lt;/attrs&gt;</code>".

<li><p>Let <var>fragment</var> be the result of invoking the
<a>fragment parsing algorithm steps</a> with <var>context</var> and <var>markup</var>.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's great that the parsing rules depend on the context document. I also don't see how this is handling errors in the XML case.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XML errors would result in a "SyntaxError" exception per https://html.spec.whatwg.org/#xml-fragment-parsing-algorithm, and thus leave the PI without any attributes.

If you don't want this to depend on the document, we could use the HTML parser unconditionally, but that would mean that <?xml-stylesheet href=style.css?> starts working in XML, and might force non-browsers that process XML documents with such PIs to parse those attribute as HTML.

I think using a different parser depending on the document is the most consistent with existing APIs like innerHTML, and avoids affecting existing content as much as possible.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is not a context where we can rethrow the exception in so we'd have to handle it. But I don't think we need to use the XML parser.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see what you're saying.

If we don't use the XML parser, do you mean always use the HTML parser, or something else?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that, but it always lowercases attributes so I don’t think we can without an additional flag or some such. Not sure what would be best.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the option that would be most consistent with elements is that we don't have a constructor at all and switch the parser based on the node document.

If we do want a parser (seems nice) then we could also consider an "XML parser-created" flag that we set and use the XML parser only in that case.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's similar to document.createElement() case-folding in HTML but not in XML. Whether to support a constructor or not impacts DX a bit, but imo doesn't need to affect the logic for HTML vs XML parser.

Using XML parser in XML docs and HTML parser in HTML docs seems reasonable to me, if we want to keep xml-stylesheet behavior as-is. I assume some folks would be upset if browsers start to use an HTML parser to parse the xml-stylesheet pseudo-attributes in XML documents.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we are extending what processing instructions can do so I think we should do that consistently across the syntaxes. Having subtle differences depending on how you got your processing instruction or what document it is currently in is not great, even if there is some precedent with other nodes.

I don't think people will be upset that browsers support some new ways to spell xml-stylesheet pseudo-attributes. XSLT removal is the significant change there.

And document.createElement() is rather terrible for its lowercasing as well by the way. That behavior has costed us all a lot of time trying to get it right and nail it down.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm. Yeah I guess it would work to always use the HTML parser, even for xml-stylesheet in XML documents. There can still be a document conformance rule that xml-stylesheet PIs in XML documents conform to the XML syntax.

Without XSLT support, xml-stylesheet will only be useful for CSS, which is probably less interesting for non-browser XML tools.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we only use the HTML parser, we can consider starting the parser in the right state and stopping parsing when it would switch to the data state.

shubhamg13 pushed a commit to shubhamg13/wpt that referenced this pull request Mar 6, 2026
Add setAttribute/getAttribute/hasAttribute/removeAttribute etc. to
ProcessingInstruction. They act similarly to attributes on elements,
however they don't produce attribute mutation records.

See spec PR: whatwg/dom#1454

Bug: 431374376
Change-Id: I659e772f0fb11a6ffe2fe3876cf61ee0f6ad042a
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/7607900
Reviewed-by: Philip Jägenstedt <foolip@chromium.org>
Commit-Queue: Noam Rosenthal <nrosenthal@google.com>
Cr-Commit-Position: refs/heads/main@{#1594810}
Comment thread dom.bs Outdated
@foolip
Copy link
Copy Markdown
Member Author

foolip commented Mar 10, 2026

We need to decide which parser is used to parse attributes from data, and I think these are all of the options:

  1. HTML or XML parser depending on the node document, similar to innerHTML. This was the initial state of this PR, via fragment parsing algorithm steps
  2. HTML parser everywhere except for <?xml-stylesheet?>, which would use the XML parser.
  3. Always use the HTML parser
  4. Always use the XML parser
  5. A bespoke parser with whatever properties we'd like, possibly being more strict for <?xml-stylesheet?>.

Option 2 seems like the most plausible to me right now, if we decide to make <?xml-stylesheet x:href="" href="..."?> no longer work.

@annevk WDYT?

@annevk
Copy link
Copy Markdown
Member

annevk commented Mar 11, 2026

I'd prefer 4-5 over 1-2 personally. And between 3-5 it's between 3 and 5. It will have to be somewhat custom I think whatever we pick because of casing and :.

I don't think the differences of 1 have served us well for other features. I don't see a real need to carry them over. 2 seems very weird.

I don't think we should care about the exact requirements of https://www.w3.org/TR/xml-stylesheet/ but it does expect x:href="" href="" to work. I think it would be somewhat odd to regress on that as that level of strictness is weird.

@foolip
Copy link
Copy Markdown
Member Author

foolip commented Mar 11, 2026

Thanks @annevk!

Always using the HTML parser is straightforward in spec and implementation, the only thing that's not ideal about is that the more lenient parsing of <?xml-namespace?> may affect non-browser tooling. Basically what @zcorpan said, "I assume some folks would be upset if browsers start to use an HTML parser to parse the xml-stylesheet pseudo-attributes in XML documents."

With a bespoke parser, that's mostly straightforward with "collect a sequence of code points", but is there a way to decode character entities in attribute values that doesn't duplicate all of the tokenizer states character entities? Or could we do something simpler that ends up diverging slightly from how regular attribute values are parsed?

@annevk
Copy link
Copy Markdown
Member

annevk commented Mar 11, 2026

So the concrete worry is that people creating dedicated XML content today would start relying on more leniency in browsers and this would have downstream repercussions? I think that's an acceptable risk. We have evolved XML (with <template> and <script> for instance) and will continue to do so (with new processing instructions). Tooling will adapt as needed.

If we can't reuse the HTML parser directly or want to create our own, following what https://w3c.github.io/webvtt/#webvtt-cue-text-tokenizer did seems somewhat reasonable, although it would be better with a proper hook in HTML.

@foolip
Copy link
Copy Markdown
Member Author

foolip commented Mar 12, 2026

I've changed this to always use the HTML parser now. The constraints of wanting to support <?marker name=x> (without quotes) and not wanting the APIs to depend on context only leaves option 3 and 5 from #1454 (comment), and reusing the HTML parser is the simplest of the two.

On lowercasing, I think this behavior might still be surprising:

pi = new ProcessingInstruction('target', 'FOO=v1');
pi.getAttribute('foo') == 'v1'; // normal in HTML, no surprise
pi.setAttribute('FOO', 'v2');
pi.getAttribute('foo') == 'v1' && pi.getAttribute('FOO') == 'v2'; // unlike HTML, but OK
pi.data += ' ';
pi.getAttribute('foo') == 'v1' && !pi.hasAttribute('FOO'); // collapsed after roundtrip, surprise?

We could (1) live with it (2) lowercase in all APIs or (3) use a bespoke parser that preserves attribute case. Any of these would be OK for me.

If we stick with the HMTL parser, there's also a layering question remaining. Both "Parse HTML from a string" and "escaping a string" would need to be exported in HTML for this to work as-is, but I think it would be cleaner to add and export "parse attributes" and "serialize attributes" to HTML for this purpose.

@noamr
Copy link
Copy Markdown
Contributor

noamr commented Mar 12, 2026

I've changed this to always use the HTML parser now. The constraints of wanting to support <?marker name=x> (without quotes) and not wanting the APIs to depend on context only leaves option 3 and 5 from #1454 (comment), and reusing the HTML parser is the simplest of the two.

On lowercasing, I think this behavior might still be surprising:

pi = new ProcessingInstruction('target', 'FOO=v1');
pi.getAttribute('foo') == 'v1'; // normal in HTML, no surprise
pi.setAttribute('FOO', 'v2');
pi.getAttribute('foo') == 'v1' && pi.getAttribute('FOO') == 'v2'; // unlike HTML, but OK
pi.data += ' ';
pi.getAttribute('foo') == 'v1' && !pi.hasAttribute('FOO'); // collapsed after roundtrip, surprise?

We could (1) live with it (2) lowercase in all APIs or (3) use a bespoke parser that preserves attribute case. Any of these would be OK for me.

If we stick with the HMTL parser, there's also a layering question remaining. Both "Parse HTML from a string" and "escaping a string" would need to be exported in HTML for this to work as-is, but I think it would be cleaner to add and export "parse attributes" and "serialize attributes" to HTML for this purpose.

Always lowercasing in the PI attribute API makes sense to me.

@annevk
Copy link
Copy Markdown
Member

annevk commented Mar 12, 2026

Either everything needs to canonicalize to lowercase or we pass a flag to the HTML parser that tells it to preserve case (and don't do anything with casing in the API of course). It feels odd for it to lowercase, but I could live with it and it would certainly be easier. Whether it's surprising probably depends on the kind of web developer.

@foolip
Copy link
Copy Markdown
Member Author

foolip commented Mar 12, 2026

I agree that the sensible options are to either lowercase everywhere or preserve case everywhere. I've gone with the former for now, but could live with the parser change too.

lando-worker Bot pushed a commit to mozilla-firefox/firefox that referenced this pull request Mar 12, 2026
…tructions, a=testonly

Automatic update from web-platform-tests
Support attribute API for processing instructions

Add setAttribute/getAttribute/hasAttribute/removeAttribute etc. to
ProcessingInstruction. They act similarly to attributes on elements,
however they don't produce attribute mutation records.

See spec PR: whatwg/dom#1454

Bug: 431374376
Change-Id: I659e772f0fb11a6ffe2fe3876cf61ee0f6ad042a
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/7607900
Reviewed-by: Philip Jägenstedt <foolip@chromium.org>
Commit-Queue: Noam Rosenthal <nrosenthal@google.com>
Cr-Commit-Position: refs/heads/main@{#1594810}

--

wpt-commits: 57fc0105def90d26b060cbf58232ddd09fe622c7
wpt-pr: 58294
@WebReflection
Copy link
Copy Markdown

WebReflection commented Mar 12, 2026

It feels odd for it to lowercase,

that's what any other attribute on HTML space does though, right?

document.body.setAttribute('WUT', 123);
document.body.getAttribute('wut'); // "123"

edit I am not saying for developers it's OK to abuse lower-casing, I am just saying for "some" developer that breaks expectations

@zcorpan
Copy link
Copy Markdown
Member

zcorpan commented Mar 12, 2026

I'm OK with using the HTML parser and lowercasing in the API.

@foolip
Copy link
Copy Markdown
Member Author

foolip commented Apr 24, 2026

I looked through what remains to be done here, and I think its's just to export "Parse HTML from a string" and "escaping a string" in HTML or add appropriate hooks to invert the layering. If anyone knows of more, please comment.

Comment thread dom.bs

<ol>
<li><p>If <var>name</var> is not a <a>valid attribute local name</a>, then <a>throw</a> an
"{{InvalidCharacterError!!exception}}" {{DOMException}}.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for other maps (e.g., DOMTokenList) we validate this for each accessor. Should we do that here as well?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be less surprising if we align with element attributes? (throw only on setters)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to match how setAttribute() works on element more than DOMTokenList, but can you explain what validation you're thinking of? DOMTokenList's contains() doesn't do any validation, and the new pi.hasAttribute() also doesn't.

My intention is that no invalid data be able to be set on the map, so that the getters don't need any validation.

Copy link
Copy Markdown
Member

@annevk annevk May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can consider this resolved. It seems good to be consistent with elements.

Comment thread dom.bs Outdated
Comment thread dom.bs Outdated
Comment thread dom.bs Outdated
Comment thread dom.bs
<li><p>Let <var>attributes</var> be <a>this</a>'s <a for=ProcessingInstruction>attribute map</a>.

<li>
<p>If <var>attributes</var>[<var>name</var>] does not <a for=set>exist</a>:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the wrong exist. You want for=map.

Comment thread dom.bs
<li><p><a for=ProcessingInstruction>Initialize</a> <a>this</a> with <var>target</var> and
<var>data</var>.

<li><p><a>Update attributes from data</a> for <a>this</a>.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is redundant with Initialize above.

Comment thread dom.bs
<li><p>Let <var>html</var> be the concatenation of "<code>&lt;html </code>",
<var>pi</var>'s <a for=CharacterData>data</a>, and "<code>&gt;</code>".

<li><p><a href="https://html.spec.whatwg.org/#parse-html-from-a-string">Parse HTML from a string</a>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried that we'll forget about fixing this.

Comment thread dom.bs
</div>

<div algorithm>
<p>To <dfn export for=ProcessingInstruction id=concept-pi-initialize>initialize</dfn> a
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<p>To <dfn export for=ProcessingInstruction id=concept-pi-initialize>initialize</dfn> a
<p>To <dfn export for=ProcessingInstruction>initialize</dfn> a

Comment thread dom.bs
</div>

<div algorithm>
<p>To <dfn export for=ProcessingInstruction id=concept-pi-initialize>initialize</dfn> a
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need exporting? If HTML also needs to create processing instructions we should probably have a dedicated create, I think.

Comment thread dom.bs

<p>{{ProcessingInstruction}} <a for=/>nodes</a> have an associated
<dfn export id=concept-pi-attribute-map for=ProcessingInstruction>attribute map</dfn>, which is a
<a for=/>map</a>, initially empty.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should move these two associated concepts to before the constructor.

Comment thread dom.bs
<p>If <var>attributes</var>[<var>name</var>] does not <a for=set>exist</a>:

<ol>
<li><p>If <var>force</var> is not given or is true, set <var>attributes</var>[<var>name</var>]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use <a for=map>set</a> here as well.

Comment thread dom.bs
<li><p>Let <var>document</var> be a new {{Document}} node whose <a for=Document>type</a> is
"<code>html</code>".

<li><p>Let <var>html</var> be the concatenation of "<code>&lt;html </code>",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<a for=string>concatenation</a>

Comment thread dom.bs
given <var>document</var> and <var>html</var>.

<li><p><a for=list>For each</a> <var>attribute</var> of <var>document</var>'s
<a>document element</a>'s <a for=Element>attribute list</a>, set <var>pi</var>'s
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<a>document element</a>'s <a for=Element>attribute list</a>, set <var>pi</var>'s
<a>document element</a>'s <a for=Element>attribute list</a>: set <var>pi</var>'s

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I guess we should link set here as well? Might be good to do another audit.

@hsivonen
Copy link
Copy Markdown
Member

hsivonen commented May 12, 2026

So far, when we've made DOM APIs change behavior based on the HTMLness flag on the Document, it has caused issues.

I'm well aware that concerns about XML are unfashionable, but I think it's bad to risk creating a browser-XML dialect that would be incompatible with other XML just to be consistent with the rest of HTML parsing on the HTML side.

That is, I'm uneasy about broadering the pseudo-attribute syntax from https://www.w3.org/TR/xml-stylesheet/ on the XML side and I'm uneasy about making the HTML and XML sides differ, which logically results in considering the use of https://www.w3.org/TR/xml-stylesheet/ syntax on the HTML side as well.

How bad would that be? WebKit and Blink seem to delegate to an XML tokenizer. Gecko has a special-purpose tokenizer.

Looking at Google HTML/CSS Style Guide and use cases, it seems to me that using HTML attribute tokenization instead of https://www.w3.org/TR/xml-stylesheet/ isn't really something that's needed from use cases at hand and would add support for things that are frowned upon anyway these days when writing HTML, so this seems to be a matter of "for consistency" with the rest of HTML.

If we believe that we need to re-use HTML attribute syntax "for consistency" with the full entity set and everything in HTML, I think it would be then less bad to make the behavior depend on HTMLness flag of the Document in order not to introduce a browser dialect of xml-stylesheet on the XML side than to introduce the use the HTML tokenizer to xml-stylesheet in XML, which would introduce a browser dialect of XML (for CSS even if XSLT goes away).

As for @annevk 's remark about WebVTT: In Gecko, the WebVTT tokenizer is implemented in JS and calls into HTML parsing in an expensive way to support the full HTML entity set. Reusing the WebVTT tokenizer on the spec level would not result in any implementation ease in Gecko.

Supporting the full HTML entity set and the way semicolon omission works is both in spec and implementation the hardest part. WebVTT delegates to HTML. If we want to support the full HTML entity set, we should delegate to the HTML parser outright (like the PR currently does) instead of doing the WebVTT thing of supposedly defining a simplified thing but then jumping into the hard part of the more complex thing both spec and implementation-wise anyway.

I think it's sad to have that complexity "for consistency". Perhaps "for consistency" is the right thing, but I still find it sad to have the complexity, especially for something that's "Do not use" in the Google style guide and spec-wise a backward-compat feature in the sense that we're not extending the entity set.

@noamr
Copy link
Copy Markdown
Contributor

noamr commented May 14, 2026

So far, when we've made DOM APIs change behavior based on the HTMLness flag on the Document, it has caused issues.

I'm well aware that concerns about XML are unfashionable, but I think it's bad to risk creating a browser-XML dialect that would be incompatible with other XML just to be consistent with the rest of HTML parsing on the HTML side.

That is, I'm uneasy about broadering the pseudo-attribute syntax from https://www.w3.org/TR/xml-stylesheet/ on the XML side and I'm uneasy about making the HTML and XML sides differ, which logically results in considering the use of https://www.w3.org/TR/xml-stylesheet/ syntax on the HTML side as well.

How bad would that be? WebKit and Blink seem to delegate to an XML tokenizer. Gecko has a special-purpose tokenizer.

Looking at Google HTML/CSS Style Guide and use cases, it seems to me that using HTML attribute tokenization instead of https://www.w3.org/TR/xml-stylesheet/ isn't really something that's needed from use cases at hand and would add support for things that are frowned upon anyway these days when writing HTML, so this seems to be a matter of "for consistency" with the rest of HTML.

Do you mean things like <?start name=foo>, which would be HTML-compliant but not XML-compliant?
I think it's more than consistency... it's keeping the HTML fault tolerance.

If we believe that we need to re-use HTML attribute syntax "for consistency" with the full entity set and everything in HTML, I think it would be then less bad to make the behavior depend on HTMLness flag of the Document in order not to introduce a browser dialect of xml-stylesheet on the XML side than to introduce the use the HTML tokenizer to xml-stylesheet in XML, which would introduce a browser dialect of XML (for CSS even if XSLT goes away).

I'm fine with that and that was the original PR. @annevk opposed this though?

As for @annevk 's remark about WebVTT: In Gecko, the WebVTT tokenizer is implemented in JS and calls into HTML parsing in an expensive way to support the full HTML entity set. Reusing the WebVTT tokenizer on the spec level would not result in any implementation ease in Gecko.

Supporting the full HTML entity set and the way semicolon omission works is both in spec and implementation the hardest part. WebVTT delegates to HTML. If we want to support the full HTML entity set, we should delegate to the HTML parser outright (like the PR currently does) instead of doing the WebVTT thing of supposedly defining a simplified thing but then jumping into the hard part of the more complex thing both spec and implementation-wise anyway.

I think it's sad to have that complexity "for consistency". Perhaps "for consistency" is the right thing, but I still find it sad to have the complexity, especially for something that's "Do not use" in the Google style guide and spec-wise a backward-compat feature in the sense that we're not extending the entity set.

I think people would find it surprising that parsing pseudo-attributes would work different from parsing attributes.
I personally prefer to use the document-flag thing.
But all that being said, this whole thing is not something I feel strongly about. Developers would adapt to whatever we decide here.
If you and @annevk can reach a consensus on either of theh 3 options, I'm OK to align.

@annevk
Copy link
Copy Markdown
Member

annevk commented May 14, 2026

See #1454 (comment) for my opinion on the various options.

@noamr
Copy link
Copy Markdown
Contributor

noamr commented May 18, 2026

So far, when we've made DOM APIs change behavior based on the HTMLness flag on the Document, it has caused issues.

I'm well aware that concerns about XML are unfashionable, but I think it's bad to risk creating a browser-XML dialect that would be incompatible with other XML just to be consistent with the rest of HTML parsing on the HTML side.

That is, I'm uneasy about broadering the pseudo-attribute syntax from https://www.w3.org/TR/xml-stylesheet/ on the XML side and I'm uneasy about making the HTML and XML sides differ, which logically results in considering the use of https://www.w3.org/TR/xml-stylesheet/ syntax on the HTML side as well.

How bad would that be? WebKit and Blink seem to delegate to an XML tokenizer. Gecko has a special-purpose tokenizer.

Looking at Google HTML/CSS Style Guide and use cases, it seems to me that using HTML attribute tokenization instead of https://www.w3.org/TR/xml-stylesheet/ isn't really something that's needed from use cases at hand and would add support for things that are frowned upon anyway these days when writing HTML, so this seems to be a matter of "for consistency" with the rest of HTML.

If we believe that we need to re-use HTML attribute syntax "for consistency" with the full entity set and everything in HTML, I think it would be then less bad to make the behavior depend on HTMLness flag of the Document in order not to introduce a browser dialect of xml-stylesheet on the XML side than to introduce the use the HTML tokenizer to xml-stylesheet in XML, which would introduce a browser dialect of XML (for CSS even if XSLT goes away).

As for @annevk 's remark about WebVTT: In Gecko, the WebVTT tokenizer is implemented in JS and calls into HTML parsing in an expensive way to support the full HTML entity set. Reusing the WebVTT tokenizer on the spec level would not result in any implementation ease in Gecko.

Supporting the full HTML entity set and the way semicolon omission works is both in spec and implementation the hardest part. WebVTT delegates to HTML. If we want to support the full HTML entity set, we should delegate to the HTML parser outright (like the PR currently does) instead of doing the WebVTT thing of supposedly defining a simplified thing but then jumping into the hard part of the more complex thing both spec and implementation-wise anyway.

I think it's sad to have that complexity "for consistency". Perhaps "for consistency" is the right thing, but I still find it sad to have the complexity, especially for something that's "Do not use" in the Google style guide and spec-wise a backward-compat feature in the sense that we're not extending the entity set.

I've discussed this with @foolip and @tunetheweb and I think the best course of action is to go back to using the XML parser here also for HTML (option 4 here. This is the strictest options and we can loosen things if there is a pushback later from web developers, which I think there won't be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

6 participants