module Nokogiri::HTML5
# Charset sniffing is a complex and controversial topic that understandably isn't done _by
# default_ by the Ruby Net::HTTP library. This being said, it is a very real problem for
# consumers of HTML as the default for HTML is iso-8859-1, most "good" producers use utf-8, and
# the Gumbo parser *only* supports utf-8.
#
# Accordingly, Nokogiri::HTML4::Document.parse provides limited encoding detection. Following
# this lead, Nokogiri::HTML5 attempts to do likewise, while attempting to more closely follow
# the HTML5 standard.
#
# http://bugs.ruby-lang.org/issues/2567
# http://www.w3.org/TR/html5/syntax.html#determining-the-character-encoding
#
def reencode(body, content_type = nil)
  if body.encoding == Encoding::ASCII_8BIT
    encoding = nil

    # look for a Byte Order Mark (BOM)
    initial_bytes = body[0..2].bytes
    if initial_bytes[0..2] == [0xEF, 0xBB, 0xBF]
      encoding = Encoding::UTF_8
    elsif initial_bytes[0..1] == [0xFE, 0xFF]
      encoding = Encoding::UTF_16BE
    elsif initial_bytes[0..1] == [0xFF, 0xFE]
      encoding = Encoding::UTF_16LE
    end

    # look for a charset in the Content-Type header value
    if content_type
      encoding ||= content_type[/charset=["']?(.*?)($|["';\s])/i, 1]
    end

    # look for a charset in a meta tag in the first 1024 bytes
    unless encoding
      data = body[0..1023].gsub(/<!--.*?(-->|\Z)/m, "")
      data.scan(/<meta.*?>/im).each do |meta|
        encoding ||= meta[/charset=["']?([^>]*?)($|["'\s>])/im, 1]
      end
    end

    # if all else fails, default to the official default encoding for HTML
    encoding ||= Encoding::ISO_8859_1

    # change the encoding to match the detected or inferred encoding
    body = body.dup
    begin
      body.force_encoding(encoding)
    rescue ArgumentError
      # unknown or unsupported charset name: fall back to the HTML default
      body.force_encoding(Encoding::ISO_8859_1)
    end
  end

  body.encode(Encoding::UTF_8)
end
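
# Illustration only, not part of Nokogiri: a minimal standalone sketch of three building
# blocks that reencode above relies on (the BOM check, charset extraction from a Content-Type
# value, and relabel-then-transcode). The sample strings and the local variable names below
# are assumptions made up for this example; only core Ruby String and Encoding calls are used.

# 1. A UTF-8 BOM is the byte sequence EF BB BF at the very start of the input.
sample = "\xEF\xBB\xBFcaf\xC3\xA9".b # String#b returns an ASCII-8BIT (binary) copy
has_utf8_bom = sample[0..2].bytes == [0xEF, 0xBB, 0xBF] # => true

# 2. String#[] with a regexp and a capture index returns the captured text, or nil
#    when the pattern does not match.
header = "text/html; charset=ISO-8859-1"
charset = header[/charset=["']?(.*?)($|["';\s])/i, 1] # => "ISO-8859-1"

# 3. force_encoding only relabels the existing bytes; encode then transcodes them to UTF-8.
latin1 = "caf\xE9".b # 0xE9 is "é" in ISO-8859-1
utf8 = latin1.dup.force_encoding(charset).encode(Encoding::UTF_8) # => "café"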