Escaping Behavior of Attribute Values in the Nokogiri HTML5 Parser #3516

tohosaku · 2025-05-26T21:07:58Z

It seems that when Nokogiri parses HTML5, it does not escape the < character, but it does escape &.

Other parsers/sanitizers appear to escape all of <, >, and &. Is Nokogiri's current behavior intentional?

Also, is there a way to use Nokogiri’s HTML5 parser to get the output <a href="&lt;&amp;">hello</a> from the input <a href="<&">hello</a>?

#! /usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri"
end

require "nokogiri"
require "cgi"

# sample 1
xml = '<a href="<&">hello</a>'
doc1 = Nokogiri::HTML5.fragment(xml)
doc1.children.each do |node|
  node.each do |key, val|
    node[key] = CGI.escapeHTML val
  end
end
p doc1.to_html #=> "<a href=\"&amp;lt;&amp;amp;\">hello</a>"

# sample 2
xml = '<a href="<&">hello</a>'
doc2 = Nokogiri::HTML5.fragment(xml)

p doc2.to_html #=> "<a href=\"<&amp;\">hello</a>"

Answered by stevecheckoway

stevecheckoway · 2025-05-26T22:19:01Z

Maintainer

I think you're right about this. Looking at the standard, it seems < and > should always be turned into &lt; and &gt;. That doesn't match my recollection of the standard: I recall that only being true outside of attributes. And indeed, the Nokogiri code matches my recollection.

    } else if (!attr && ch == '<') {
      replacement = "&lt;";
    } else if (!attr && ch == '>') {
      replacement = "&gt;";
    } else {

It looks like this change to the standard was introduced last week. It should be easy enough to change Nokogiri's behavior here.
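
As a rough illustration of the two behaviors being compared, here is a pure-Ruby model of the serializer's string escaping. It is a sketch, not Nokogiri's implementation: the method name `escape_html_string` and the `escape_angle_brackets_in_attrs` flag are hypothetical, and it assumes that `&` and U+00A0 are always escaped while `"` is escaped only inside attribute values.

```ruby
# Hypothetical helper, not part of Nokogiri: models the escaping rule only.
# With escape_angle_brackets_in_attrs: false it mirrors the `!attr &&` checks
# quoted above; with true it mirrors the behavior after removing them.
def escape_html_string(text, attribute_mode:, escape_angle_brackets_in_attrs: false)
  text.gsub(/[&\u00A0"<>]/) do |ch|
    case ch
    when "&"      then "&amp;"
    when "\u00A0" then "&nbsp;"
    when '"'      then attribute_mode ? "&quot;" : ch
    when "<", ">"
      if !attribute_mode || escape_angle_brackets_in_attrs
        ch == "<" ? "&lt;" : "&gt;"
      else
        ch # left as-is inside attribute values under the old rule
      end
    end
  end
end

# Attribute value from the question, old rule:
puts escape_html_string("<&", attribute_mode: true)
# prints: <&amp;

# Same value with angle brackets escaped in attributes too:
puts escape_html_string("<&", attribute_mode: true, escape_angle_brackets_in_attrs: true)
# prints: &lt;&amp;
```

The first call reproduces the `<&amp;` seen in sample 2 of the question; the second produces the `&lt;&amp;` form.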

stevecheckoway · 2025-05-26T22:36:57Z

Maintainer

The obvious patch (remove !attr && ) fails several html5lib-test tests which explicitly test for the old behavior.

  1) Failure:
TestHtml5Serialize#test_serializing_html_innerHTML_7 [test/html5/test_serialize.rb:509]:
Expected: "<a b=\"<\"></a>"
  Actual: "<a b=\"&lt;\"></a>"

  2) Failure:
TestHtml5Serialize#test_serializing_html_outerHTML_7 [test/html5/test_serialize.rb:513]:
--- expected
+++ actual
@@ -1 +1 @@
-"<span><a b=\"<\"></a></span>"
+"<span><a b=\"&lt;\"></a></span>"


  3) Failure:
TestHtml5Serialize#test_serializing_html_innerHTML_8 [test/html5/test_serialize.rb:509]:
Expected: "<a b=\">\"></a>"
  Actual: "<a b=\"&gt;\"></a>"

  4) Failure:
TestHtml5Serialize#test_serializing_html_outerHTML_8 [test/html5/test_serialize.rb:513]:
--- expected
+++ actual
@@ -1 +1 @@
-"<span><a b=\">\"></a></span>"
+"<span><a b=\"&gt;\"></a></span>"


6603 runs, 113380 assertions, 4 failures, 0 errors, 23 skips

2 replies

flavorjones May 28, 2025
Maintainer

Yeah, I saw that change go by in my feed and immediately was like, "ugh, this is going to be a pain in the ass" because lots of applications have tests (particularly sanitization-related tests) that assert on this behavior.

I'm also annoyed that nobody has updated the html5lib test suite for this. When we have something ready to go let's submit a patch to https://github.com/html5lib/html5lib-tests
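
For application tests that assert on serialized strings, one way to stay stable across both behaviors is to compare decoded attribute values instead of raw markup. The sketch below uses only the standard library's `CGI.unescapeHTML`; the helper name `same_attribute_value?` is hypothetical. In a real suite, re-parsing the output and comparing `node["href"]` would serve the same purpose.

```ruby
require "cgi"

# Hypothetical test helper: two serialized attribute values are treated as
# equivalent when they decode to the same string.
def same_attribute_value?(serialized_a, serialized_b)
  CGI.unescapeHTML(serialized_a) == CGI.unescapeHTML(serialized_b)
end

old_style = "<&amp;"     # angle bracket left unescaped in the attribute
new_style = "&lt;&amp;"  # angle bracket escaped in the attribute

puts same_attribute_value?(old_style, new_style)
# prints: true
```

Both forms decode to the same two characters, `<&`, so an assertion written this way would not care which serialization the installed Nokogiri version emits.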

flavorjones May 28, 2025
Maintainer

(I may have time to work on this over the weekend)

tohosaku · 2025-05-26T23:10:22Z

Author

Thank you for your quick response!
