Escaping Behavior of Attribute Values in the Nokogiri HTML5 Parser #3516
-
It seems that when Nokogiri parses HTML5, it does not escape the `<` character in attribute values, but it does escape `&`. Other parsers/sanitizers appear to escape all of `<`, `>`, and `&`. Is Nokogiri's current behavior intentional? Also, is there a way to use Nokogiri's HTML5 parser to get output in which `<` is escaped too?

```ruby
#! /usr/bin/env ruby
require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri"
end

require "nokogiri"
require "cgi"

# sample 1: escape every attribute value by hand before serializing
xml = '<a href="<&">hello</a>'
doc1 = Nokogiri::HTML5.fragment(xml)
doc1.children.each do |node|
  node.each do |key, val|
    node[key] = CGI.escapeHTML val
  end
end
p doc1.to_html #=> "<a href=\"&amp;lt;&amp;amp;\">hello</a>"

# sample 2: default serialization
xml = '<a href="<&">hello</a>'
doc2 = Nokogiri::HTML5.fragment(xml)
p doc2.to_html #=> "<a href=\"<&amp;\">hello</a>"
```
-
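I think you're right about this. Looking at the standard, it seems `<` and `>` should always be turned into `&lt;` and `&gt;`. That doesn't match my recollection of the standard: I recall that only being true outside of attributes. And indeed, the Nokogiri code matches my recollection:

```c
} else if (!attr && ch == '<') {
  replacement = "&lt;";
} else if (!attr && ch == '>') {
  replacement = "&gt;";
} else {
```

It looks like this change was introduced last week. It should be easy enough to change Nokogiri's behavior here.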
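To make the difference concrete, here is a minimal plain-Ruby sketch of the escaping decision discussed above. It is not Nokogiri's code: `escape_for_html` and its `always_escape_angles` flag are hypothetical names introduced for illustration, and the handling of `&`, the non-breaking space, and `"` reflects my reading of the standard's string-escaping rules rather than anything stated in this thread.

```ruby
# Hypothetical helper, not part of Nokogiri: models the serializer's
# per-character escaping decision in plain Ruby.
#
# attr:                 true when escaping an attribute value
# always_escape_angles: true models the rule described above, where
#                       "<" and ">" are escaped inside attributes too
def escape_for_html(text, attr:, always_escape_angles: false)
  text.gsub(/[&\u00A0<>"]/) do |ch|
    case ch
    when "&"      then "&amp;"
    when "\u00A0" then "&nbsp;"
    when '"'      then attr ? "&quot;" : ch
    when "<"      then (!attr || always_escape_angles) ? "&lt;" : ch
    when ">"      then (!attr || always_escape_angles) ? "&gt;" : ch
    end
  end
end

# Behavior matching the `!attr &&` checks: "<" survives in attributes.
p escape_for_html("<&", attr: true)                             #=> "<&amp;"
# Behavior with the angle brackets always escaped.
p escape_for_html("<&", attr: true, always_escape_angles: true) #=> "&lt;&amp;"
# Text content is escaped the same way under both rules.
p escape_for_html("<&", attr: false)                            #=> "&lt;&amp;"
```

With the flag off, the sketch reproduces the output reported in sample 2; with it on, the attribute value comes out the way the other sanitizers mentioned in the question produce it.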
-
The obvious patch (remove
-
Thank you for your quick response!