Henri Sivonen: Bogo-XML Declaration Returns to Gecko

Tuesday, 01 June 2021, 17:02

Firefox 89 was released today. This release (again!) honors a character encoding declaration made via syntax that looks like an XML declaration used in text/html (if there are no other character encoding declarations).

Before HTML parsing was specified, Internet Explorer did not support declaring the encoding of a text/html document using the XML declaration syntax. However, Gecko, WebKit, and Presto did. Unfortunately, I didn’t realize that they did.

When Hixie specified HTML parsing, consistent with IE, he didn’t make the spec sensitive to the XML declaration syntax in any particular way. I am unable to locate any discussion in the WHATWG mailing list archives about whether an encoding declaration made using the XML declaration syntax in text/html should be honored when processing text/html.

When I implemented the specified HTML parsing algorithm in Gecko, I also implemented the internal encoding declaration handling per specification. As a side effect, in Firefox 4, I removed Gecko’s support for the XML declaration syntax for declaring the character encoding in text/html. I don’t recall this having been a knowingly-made decision: The rewrite just did strictly what the spec said.

When WebKit and Presto implemented the specified HTML parsing algorithm, they only implemented the tokenization and tree building parts and kept their old ways for handling character encoding declarations. That is, they continued to honor the XML declaration syntax for declaring the character encoding in text/html. I don’t recall the developers of either engine raising this as a spec issue back then.

The closest the issue came to being raised as a spec issue was for the wrong reason, which made people push back instead of fixing the spec.

When Blink forked, it inherited WebKit’s behavior. When Microsoft switched from EdgeHTML to Blink, Gecko became the only actively-developed major engine not to support the XML declaration syntax for declaring the character encoding in text/html. Since unlabeled UTF-8 is not automatically detected, this became a Web compatibility issue with pages that declare UTF-8 but only using the XML declaration syntax (i.e. without a BOM, a meta, or HTTP-layer declaration as well).

And that’s why support for declaring the character encoding via the XML declaration syntax came to the HTML spec and back to Gecko.
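
To make the "no other character encoding declarations" condition concrete, here is a minimal Python sketch of the fallback order described above: BOM, then HTTP-layer declaration, then meta, and only then the XML declaration syntax. The function name declared_encoding, the regular expressions, and the 1024-byte window are assumptions made for illustration. Real engines follow the byte-level prescan in the HTML spec, which is considerably more detailed than this.

```python
import re
from typing import Optional

# Hypothetical, simplified patterns for illustration only. They are much
# looser than the byte-level prescan that browsers actually implement.
_META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE
)
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._:-]+)["\']'
)


def declared_encoding(body: bytes, http_charset: Optional[str] = None) -> Optional[str]:
    """Return the declared encoding label in lowercase, or None if none is found."""
    # 1. A byte order mark takes precedence over every other declaration.
    if body.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if body.startswith(b"\xff\xfe"):
        return "utf-16le"
    if body.startswith(b"\xfe\xff"):
        return "utf-16be"

    # 2. HTTP-layer declaration (charset parameter of Content-Type).
    if http_charset:
        return http_charset.lower()

    # Only look at the start of the document (assumed window size).
    head = body[:1024]

    # 3. A meta declaration in the markup.
    match = _META_CHARSET.search(head)
    if match:
        return match.group(1).decode("ascii").lower()

    # 4. Last resort: syntax that looks like an XML declaration at the
    #    very start of the text/html document.
    match = _XML_DECL.match(head)
    if match:
        return match.group(1).decode("ascii").lower()

    return None
```

With this sketch, a page whose only declaration is <?xml version="1.0" encoding="UTF-8"?> resolves to "utf-8", while the same page served with an HTTP charset or containing a meta declaration resolves to that other declaration instead.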

What Can We Learn?

When the majority of engines has a behavior, we should assume that content is authored with the expectation that that behavior exists, and we can’t rely on assuming that all content is tested with the engine that doesn’t have the behavior even if that engine has the majority market share.
(In general, the HTML parsing algorithm upheld IE behaviors a bit too much. I regret that I didn’t push for non-IE behavior in tokenization when a less-than sign is encountered inside a tag token.)
Instead of just trusting the spec, also check what other engines do.
If you aren’t willing to implement what the spec says, you should raise the issue in the standardization forum.
If an issue is raised for a bad reason, pay attention to whether there is an adjacent issue that needs fixing for a good reason.
“We comply with the spec” is unlikely to be a winning response to a long-standing Web compatibility bug.

https://hsivonen.fi/xml-decl/