> The Sanitizer API is a proposed new browser API to bring a safe and easy-to-use capability to sanitize HTML into the web platform [and] is currently being incubated in the Sanitizer API WICG, with the goal of bringing this to the WHATWG.
This would replace the need to sanitize user-entered content with libraries like DOMPurify by having the capability built into the browser.
The proposed specification has additional information: https://github.com/WICG/sanitizer-api/
The theory is that the parse->serialize->parse round-trip is not idempotent and that sanitization is element context-dependent, so having a pure string->string function opens a new class of vulnerabilities. Having a stateful setHTML() function defined on elements means the HTML context-specific rules for tables, SVG, MathML etc. are baked in, and eliminates double-parsing errors.
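To make the difference concrete, here is a rough sketch of the two approaches. The DOMPurify call is real; the `setHTML()` option shape (and placeholder names like `untrustedInput` and `tableCell`) are assumptions based on drafts of the proposal, which is still being incubated and may change.

```js
// String -> string sanitization: the sanitizer parses and re-serializes the input,
// and the "clean" string is then parsed *again* when assigned, this time in whatever
// context the target element provides (table, SVG, MathML, ...).
const clean = DOMPurify.sanitize(untrustedInput); // parse #1 + serialize
tableCell.innerHTML = clean;                      // parse #2, now in table context

// Proposed Sanitizer API: sanitization operates on the parsed tree in the context of
// the target element, so there is no second parse to disagree with the first.
// (The `sanitizer` / `elements` option names come from draft specs and may change.)
tableCell.setHTML(untrustedInput, { sanitizer: { elements: ['b', 'i', 'a'] } });
```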
Are mXSS errors actually that common?
Seriously, we got CSP before setHTML()? WTF!
CSP is nasty. It removes essential functionality to mitigate possible security flaws, ignoring the developer's intent. CSP is like taping your mouth shut to lose weight... but you still sit through three meals a day... basically smashing the food against your face.
No, the reason is that the problem is underspecified and unsatisfiable.
The whole notion of HTML "sanitization" is the ultimate "just do what I mean". It's the customer who cannot articulate what they need. It's «Hey, how about if there were some sort of `import "nobugs"`?»
"HTML sanitization" is never going to be solved because it's not solvable.
There's no getting around knowing whether any arbitrary string is legitimate markup from a trusted source or some untrusted input that needs to be treated like text. This is a hard requirement. (And if you already have this information, then the necessary tools have been available for years, decades even: `innerHTML` and `textContent`. Or if you don't like the latter, it's trivial to write your own `escapeText` subroutine that's correct, well-formed, and sound.) No new DOMPurify alternative or native API baked into the browser is going to change this, ever.
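For what it's worth, a minimal `escapeText` sketch along those lines might look like this (the function name and the exact character set are just the usual suspects, not any particular library's API):

```js
// Treat untrusted input strictly as text by escaping the characters to which HTML
// assigns syntactic meaning, so the browser never interprets the string as markup.
function escapeText(s) {
  return String(s)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}
```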
I think maybe a better API would be to add an unsafe HTML tag, so it would look something like:
<unsafe>
all unsafe code here
</unsafe>
Then, if browsers do indeed support it, it would work even without JavaScript. But in any case, you really should be validating everything server-side.
Well, this is clearly wrong, isn't it? You need a whitelist of elements, not a blacklist. That lesson is at least two decades old.
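For example, an allowlist-based call with DOMPurify has roughly this shape (the `ALLOWED_TAGS` / `ALLOWED_ATTR` options are real; the particular lists below are only an illustration):

```js
// Allowlist-based sanitization: anything not explicitly permitted is dropped.
const clean = DOMPurify.sanitize(dirty, {
  ALLOWED_TAGS: ['p', 'b', 'i', 'em', 'strong', 'a', 'ul', 'ol', 'li'],
  ALLOWED_ATTR: ['href', 'title'],
});
```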
You don’t want developers trying to rely on client-only sanitization for user input that gets submitted to the server. Sanitizing while rendering user-facing UI makes sense.
This is why people should really use XHTML, the strict XML dialect of HTML, in order to avoid these nasty parsing surprises. It has the predictable behavior that you want.
In XHTML, the code does exactly what it says it does. If you write <table><a></a></table> like the example on the mXSS page, then you get a table element and an anchor child. As another example, if you write <table><td>xyz</td></table>, that's exactly what you get, and there are no implicit <tbody> or <tr> inserted inside.
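A quick, illustrative way to see the difference in a browser console is to parse the same markup with DOMParser as HTML and as XML:

```js
const markup = '<table><a></a></table>';

// HTML parsing applies fix-up rules: the <a> is foster-parented out of the table.
const htmlDoc = new DOMParser().parseFromString(markup, 'text/html');
console.log(htmlDoc.querySelector('a').parentNode.nodeName); // "BODY", not "TABLE"

// XML parsing takes the markup literally: the <a> stays a child of the <table>.
const xhtml = '<table xmlns="http://www.w3.org/1999/xhtml"><a></a></table>';
const xmlDoc = new DOMParser().parseFromString(xhtml, 'application/xhtml+xml');
console.log(xmlDoc.querySelector('a').parentNode.nodeName); // "table"
```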
It's just wild to keep watching the world double down, decade after decade, on HTML and all of its strange parsing behavior. Furthermore, HTML's syntax is a unique snowflake, whereas XML is a standardized language that just so happens to be used in SVG, MathML, Atom, and other standards - no need to relearn the syntax every single time.