class HTML::Selector
See www.w3.org/TR/css3-selectors/
matches any element whose identifier consists of one or more digits.
selector = HTML::Selector.new “#?”, /^d+$/
For example:
values are converted to strings.
The substitution value may be a string or a regular expression. All other
next value in the argument list following the CSS expression.
A substitution takes the form of a question mark (?
) and uses the
You can use substitution with identifiers, class names and element values.
=== Substitution Values
Matches all paragraphs that do not have the class .post
.
p:not(.post)
not another using :not
. For example:
And you can always select an element that matches one set of rules but
Selects the first four paragraphs, ignoring all others.
div p:nth-of-type(-n+4)
ignoring all other elements.
Selects the fourth paragraph in the div
, counting only paragraphs, and
div p:nth-of-type(4)
other elements, since those are also counted.
Selects the fourth paragraph in the div
, but not if the div
contains
div p:nth-child(4)
Selects every second row in the table starting with the first one.
table tr:nth-child(odd)
For example:
figure out.
But after reading the examples and trying a few combinations, it’s easy to
tricky and the CSS specification doesn’t do a much better job explaining it.
As you can see, :nth-child<tt> pseudo class and its variant can get quite<br><br>match the simple selector.<br>* <tt>:not(selector)
– Match the element only if the element does not
only elements of its type.
* :nth-last-of-type(..)
– As above, but counts from the last child and
* :nth-last-child(..)
– As above, but counts from the last child.
* :nth-of-type(..)
– As above, but only counts elements of its type.
fourth). Same as :nth-child(2n+2)
.
* :nth-child(even)
– Match element in the even position (i.e. second,
Same as :nth-child(2n+1)
.
* :nth-child(odd)
– Match element in the odd position (i.e. first, third).
elements of its parent element.
in each group of a
child elements, up to the first b
child
* :nth-child(-an+b)
– Match the element if it is the first child (element)
in each group of a
child elements of its parent element.
* :nth-child(an+b)
– Match the element if it is the b-th child (element)
of its parent element. The value b
specifies its index, starting with 1.
* :nth-child(b)
– Match the element if it is the b-th child (element)
of its parent element of its type.
* :last-of-type
– Match the element if it is the last child (element)
of its parent element.
* :last-child
– Match the element if it is the last child (element)
of its parent element of its type.
* :first-of-type
– Match the element if it is the first child (element)
of its parent element.
* :first-child
– Match the element if it is the first child (element)
of its parent element and its type.
* :only-of-type
– Match the element if it is the only child (element)
of its parent element.
* :only-child
– Match the element if it is the only child (element)
as its text content (ignoring leading and trailing whitespace).
* :content(string)
– Match the element only if it has string
and no text content.
* :empty
– Match the element only if it has no child elements,
(no parent element).
* :root
– Match the element only if it is the root element
elements in a given position:
Pseudo classes were introduced in CSS 3. They are most often used to select
=== Pseudo classes
the first element, the #match method may return more than one match.
Since children and sibling selectors may match more than one element given
or against the second expression.
* expr1, expr2
– Match any element against the first expression,
that comes after an element that matches the first expression.
* expr1 ~ expr2
– Match any element against the second expression
if it immediately follows an element that matches the first expression.
* expr1 + expr2
– Match any element against the second expression
if it is the child of an element that matches the first expression.
* expr1 > expr2
– Match any element against the second expression
if it has some parent element that matches the first expression.
* expr1 expr2
– Match any element against the second expression
Complex selectors use a combination of expressions to match elements:
=== Alternatives, siblings, children
[class~=my_class]
.my_class
and so do the following two selectors:
[id=my_id]
#my_id
For example, the following two selectors match the same element:
word.
* name|=word
– The attribute value must start with specified
word (space separated).
* name~=word
– The attribute value must contain the specified
specified value.
* name*=value
– The attribute value must contain the
specified value.
* name$=value
– The attribute value must end with the
specified value.
* name^=value
– The attribute value must start with the
name and value.
* name=value
– The element must have an attribute with that
* name
– The element must have an attribute with that name.
Several operators are supported for matching attributes:
=== Attribute Values
<form method=“post” action=“/logout”>
but will not match the element:
<form class=“login form” method=“post” action=“/login”>
This selector will match the following element:/login
.
It must also have an attribute called action
with the value
It may have other classes, but the class login
is required to match.
The matched element must be of type form
and have the class login
.
selector = HTML::Selector.new “form.login”
For example:
Space separation is used for descendant selectors.
negation in any order. Do not separate these parts with spaces!
followed by identifier, class names, attributes, pseudo classes and
When using a combination of the above, the element name comes first
negation expression.
* :not(expr)
– Match an element that does not match the
such as :nth-child
and :empty
.
* :pseudo-class
– Match an element based on a pseudo class,
attribute and value. (More operators are supported see below)
* [attr=value]
– Match an element that has the specified
* [attr]
– Match an element that has the specified attribute.
class names if more than one specified.
* .class
– Match an element based on its class name, allid
attribute). For example, #page
.
* #id
– Match an element based on its identifier (the
to match any element.
For example, p
to match a paragraph. You can use *
* name
– Match an element based on its name (tag name).
Selectors can match elements using any of the following criteria:
=== Expressions
end
puts “Found text field with name #{match.attributes}”
matches.each do |match|
matches = selector.select(element)
selector = HTML::Selector.new “input”
For example:
if no match is found
This method returns an array of all matching elements, an empty array
one element and going through all children in depth-first order.
Use the #select method to select all matching elements starting with
=== Selecting Elements
end
puts “Element is a login form”
if selector.match(element)
For example:
match found.
the method returns an array with all matched elements, of nil
if no
or nil
if the element does not match. For complex selectors (see below)
For simple selectors, the method returns an array with that element,
Use the #match method to determine if an element matches the selector.
=== Matching Elementslogin
and an attribute action
with the value /login
.
creates a new selector that matches any form
element with the class
selector = HTML::Selector.new “form.login”
For example:
HTML elements.
The Selector
class uses CSS selector expressions to match and select
Selects HTML elements using CSS 2 selectors.
def attribute_match(equality, value)
Create a regular expression to match an attribute value based
def attribute_match(equality, value) regexp = value.is_a?(Regexp) ? value : Regexp.escape(value.to_s) case equality when "=" then # Match the attribute value in full Regexp.new("^#{regexp}$") when "~=" then # Match a space-separated word within the attribute value Regexp.new("(^|\s)#{regexp}($|\s)") when "^=" # Match the beginning of the attribute value Regexp.new("^#{regexp}") when "$=" # Match the end of the attribute value Regexp.new("#{regexp}$") when "*=" # Match substring of the attribute value regexp.is_a?(Regexp) ? regexp : Regexp.new(regexp) when "|=" then # Match the first space-separated item of the attribute value Regexp.new("^#{regexp}($|\s)") else raise InvalidSelectorError, "Invalid operation/value" unless value.empty? # Match all attributes values (existence check) // end end
def for_class(cls)
Selector.for_class(cls) => selector
:call-seq:
def for_class(cls) self.new([".?", cls]) end
def for_id(id)
Selector.for_id(id) => selector
:call-seq:
def for_id(id) self.new(["#?", id]) end
def initialize(selector, *values)
are used for value substitution.
The first argument is the selector expression. All other arguments
Creates a new selector from a CSS 2 selector expression.
Selector.new(string, [values ...]) => selector
:call-seq:
def initialize(selector, *values) raise ArgumentError, "CSS expression cannot be empty" if selector.empty? @source = "" values = values[0] if values.size == 1 && values[0].is_a?(Array) # We need a copy to determine if we failed to parse, and also # preserve the original pass by-ref statement. statement = selector.strip.dup # Create a simple selector, along with negation. simple_selector(statement, values).each { |name, value| instance_variable_set("@#{name}", value) } @alternates = [] @depends = nil # Alternative selector. if statement.sub!(/^\s*,\s*/, "") second = Selector.new(statement, values) @alternates << second # If there are alternate selectors, we group them in the top selector. if alternates = second.instance_variable_get(:@alternates) second.instance_variable_set(:@alternates, []) @alternates.concat alternates end @source << " , " << second.to_s # Sibling selector: create a dependency into second selector that will # match element immediately following this one. elsif statement.sub!(/^\s*\+\s*/, "") second = next_selector(statement, values) @depends = lambda do |element, first| if element = next_element(element) second.match(element, first) end end @source << " + " << second.to_s # Adjacent selector: create a dependency into second selector that will # match all elements following this one. elsif statement.sub!(/^\s*~\s*/, "") second = next_selector(statement, values) @depends = lambda do |element, first| matches = [] while element = next_element(element) if subset = second.match(element, first) if first && !subset.empty? matches << subset.first break else matches.concat subset end end end matches.empty? ? nil : matches end @source << " ~ " << second.to_s # Child selector: create a dependency into second selector that will # match a child element of this one. elsif statement.sub!(/^\s*>\s*/, "") second = next_selector(statement, values) @depends = lambda do |element, first| matches = [] element.children.each do |child| if child.tag? && subset = second.match(child, first) if first && !subset.empty? matches << subset.first break else matches.concat subset end end end matches.empty? ? nil : matches end @source << " > " << second.to_s # Descendant selector: create a dependency into second selector that # will match all descendant elements of this one. Note, elsif statement =~ /^\s+\S+/ && statement != selector second = next_selector(statement, values) @depends = lambda do |element, first| matches = [] stack = element.children.reverse while node = stack.pop next unless node.tag? if subset = second.match(node, first) if first && !subset.empty? matches << subset.first break else matches.concat subset end elsif children = node.children stack.concat children.reverse end end matches.empty? ? nil : matches end @source << " " << second.to_s else # The last selector is where we check that we parsed # all the parts. unless statement.empty? || statement.strip.empty? raise ArgumentError, "Invalid selector: #{statement}" end end end
def match(element, first_only = false)
puts "Element is a login form"
if selector.match(element)
For example:
Use +first_only=true+ if you are only interested in the first element.
found.
returns an array with all matching elements, nil if no match is
For a complex selector (sibling and descendant) this method
element if the element matches, nil otherwise.
For a simple selector this method returns an array with the
Matches an element against the selector.
match(element, first?) => array or nil
:call-seq:
def match(element, first_only = false) # Match element if no element name or element name same as element name if matched = (!@tag_name || @tag_name == element.name) # No match if one of the attribute matches failed for attr in @attributes if element.attributes[attr[0]] !~ attr[1] matched = false break end end end # Pseudo class matches (nth-child, empty, etc). if matched for pseudo in @pseudo unless pseudo.call(element) matched = false break end end end # Negation. Same rules as above, but we fail if a match is made. if matched && @negation for negation in @negation if negation[:tag_name] == element.name matched = false else for attr in negation[:attributes] if element.attributes[attr[0]] =~ attr[1] matched = false break end end end if matched for pseudo in negation[:pseudo] if pseudo.call(element) matched = false break end end end break unless matched end end # If element matched but depends on another element (child, # sibling, etc), apply the dependent matches instead. if matched && @depends matches = @depends.call(element, first_only) else matches = matched ? [element] : nil end # If this selector is part of the group, try all the alternative # selectors (unless first_only). if !first_only || !matches @alternates.each do |alternate| break if matches && first_only if subset = alternate.match(element, first_only) if matches matches.concat subset else matches = subset end end end end matches end
def next_element(element, name = nil)
With the +name+ argument, returns the next element with that name,
Return the next element after this one. Skips sibling text nodes.
def next_element(element, name = nil) if siblings = element.parent.children found = false siblings.each do |node| if node.equal?(element) found = true elsif found && node.tag? return node if (name.nil? || node.name == name) end end end nil end
def next_selector(statement, values)
separators (alternate) and apply them to the selector group of the
for reuse. The only logic deals with the need to detect comma
This method is called from four places, so it helps to put it here
eventually, and array of substitution values.
Passes the remainder of the statement that will be reduced to zero
Called to create a dependent selector (sibling, descendant, etc).
def next_selector(statement, values) second = Selector.new(statement, values) # If there are alternate selectors, we group them in the top selector. if alternates = second.instance_variable_get(:@alternates) second.instance_variable_set(:@alternates, []) @alternates.concat alternates end second end
def nth_child(a, b, of_type, reverse)
* +of_type+ -- True to test only elements of this type (of-type).
* +b+ -- Value of b part.
* +a+ -- Value of a part.
pseudo class, given the following arguments:
Returns a lambda that can match an element against the nth-child
def nth_child(a, b, of_type, reverse) # a = 0 means select at index b, if b = 0 nothing selected return lambda { |element| false } if a == 0 && b == 0 # a < 0 and b < 0 will never match against an index return lambda { |element| false } if a < 0 && b < 0 b = a + b + 1 if b < 0 # b < 0 just picks last element from each group b -= 1 unless b == 0 # b == 0 is same as b == 1, otherwise zero based lambda do |element| # Element must be inside parent element. return false unless element.parent && element.parent.tag? index = 0 # Get siblings, reverse if counting from last. siblings = element.parent.children siblings = siblings.reverse if reverse # Match element name if of-type, otherwise ignore name. name = of_type ? element.name : nil found = false for child in siblings # Skip text nodes/comments. if child.tag? && (name == nil || child.name == name) if a == 0 # Shortcut when a == 0 no need to go past count if index == b found = child.equal?(element) break end elsif a < 0 # Only look for first b elements break if index > b if child.equal?(element) found = (index % a) == 0 break end else # Otherwise, break if child found and count == an+b if child.equal?(element) found = (index % a) == b break end end index += 1 end end found end end
def only_child(of_type)
Creates a only child lambda. Pass +of-type+ to only look at
def only_child(of_type) lambda do |element| # Element must be inside parent element. return false unless element.parent && element.parent.tag? name = of_type ? element.name : nil other = false for child in element.parent.children # Skip text nodes/comments. if child.tag? && (name == nil || child.name == name) unless child.equal?(element) other = true break end end end !other end end
def select(root)
puts "Found text field with name #{match.attributes['name']}"
matches.each do |match|
matches = selector.select(element)
selector = HTML::Selector.new "input[type=text]"
For example:
itself.
The root node may be any element in the document, or the document
Returns an empty array if no match is found.
with one node and traversing through all children depth-first.
Selects and returns an array with all matching elements, beginning
select(root) => array
:call-seq:
def select(root) matches = [] stack = [root] while node = stack.pop if node.tag? && subset = match(node, false) subset.each do |match| matches << match unless matches.any? { |item| item.equal?(match) } end elsif children = node.children stack.concat children.reverse end end matches end
def select_first(root)
Similar to #select but returns the first matching element. Returns +nil+
def select_first(root) stack = [root] while node = stack.pop if node.tag? && subset = match(node, true) return subset.first if !subset.empty? elsif children = node.children stack.concat children.reverse end end nil end
def simple_selector(statement, values, can_negate = true)
negation. Called a second time with false since negation
Called the first time with +can_negate+ true to allow
+pseudo+ (classes) and +negation+.
Returns a hash with the values +tag_name+, +attributes+,
substitution values.
Creates a simple selector given the statement and array of
def simple_selector(statement, values, can_negate = true) tag_name = nil attributes = [] pseudo = [] negation = [] # Element name. (Note that in negation, this can come at # any order, but for simplicity we allow if only first). statement.sub!(/^(\*|[[:alpha:]][\w\-]*)/) do |match| match.strip! tag_name = match.downcase unless match == "*" @source << match "" # Remove end # Get identifier, class, attribute name, pseudo or negation. while true # Element identifier. next if statement.sub!(/^#(\?|[\w\-]+)/) do |match| id = $1 if id == "?" id = values.shift end @source << "##{id}" id = Regexp.new("^#{Regexp.escape(id.to_s)}$") unless id.is_a?(Regexp) attributes << ["id", id] "" # Remove end # Class name. next if statement.sub!(/^\.([\w\-]+)/) do |match| class_name = $1 @source << ".#{class_name}" class_name = Regexp.new("(^|\s)#{Regexp.escape(class_name)}($|\s)") unless class_name.is_a?(Regexp) attributes << ["class", class_name] "" # Remove end # Attribute value. next if statement.sub!(/^\[\s*([[:alpha:]][\w\-:]*)\s*((?:[~|^$*])?=)?\s*('[^']*'|"[^*]"|[^\]]*)\s*\]/) do |match| name, equality, value = $1, $2, $3 if value == "?" value = values.shift else # Handle single and double quotes. value.strip! if (value[0] == ?" || value[0] == ?') && value[0] == value[-1] value = value[1..-2] end end @source << "[#{name}#{equality}'#{value}']" attributes << [name.downcase.strip, attribute_match(equality, value)] "" # Remove end # Root element only. next if statement.sub!(/^:root/) do |match| pseudo << lambda do |element| element.parent.nil? || !element.parent.tag? end @source << ":root" "" # Remove end # Nth-child including last and of-type. next if statement.sub!(/^:nth-(last-)?(child|of-type)\((odd|even|(\d+|\?)|(-?\d*|\?)?n([+\-]\d+|\?)?)\)/) do |match| reverse = $1 == "last-" of_type = $2 == "of-type" @source << ":nth-#{$1}#{$2}(" case $3 when "odd" pseudo << nth_child(2, 1, of_type, reverse) @source << "odd)" when "even" pseudo << nth_child(2, 2, of_type, reverse) @source << "even)" when /^(\d+|\?)$/ # b only b = ($1 == "?" ? values.shift : $1).to_i pseudo << nth_child(0, b, of_type, reverse) @source << "#{b})" when /^(-?\d*|\?)?n([+\-]\d+|\?)?$/ a = ($1 == "?" ? values.shift : $1 == "" ? 1 : $1 == "-" ? -1 : $1).to_i b = ($2 == "?" ? values.shift : $2).to_i pseudo << nth_child(a, b, of_type, reverse) @source << (b >= 0 ? "#{a}n+#{b})" : "#{a}n#{b})") else raise ArgumentError, "Invalid nth-child #{match}" end "" # Remove end # First/last child (of type). next if statement.sub!(/^:(first|last)-(child|of-type)/) do |match| reverse = $1 == "last" of_type = $2 == "of-type" pseudo << nth_child(0, 1, of_type, reverse) @source << ":#{$1}-#{$2}" "" # Remove end # Only child (of type). next if statement.sub!(/^:only-(child|of-type)/) do |match| of_type = $1 == "of-type" pseudo << only_child(of_type) @source << ":only-#{$1}" "" # Remove end # Empty: no child elements or meaningful content (whitespaces # are ignored). next if statement.sub!(/^:empty/) do |match| pseudo << lambda do |element| empty = true for child in element.children if child.tag? || !child.content.strip.empty? empty = false break end end empty end @source << ":empty" "" # Remove end # Content: match the text content of the element, stripping # leading and trailing spaces. next if statement.sub!(/^:content\(\s*(\?|'[^']*'|"[^"]*"|[^)]*)\s*\)/) do |match| content = $1 if content == "?" content = values.shift elsif (content[0] == ?" || content[0] == ?') && content[0] == content[-1] content = content[1..-2] end @source << ":content('#{content}')" content = Regexp.new("^#{Regexp.escape(content.to_s)}$") unless content.is_a?(Regexp) pseudo << lambda do |element| text = "" for child in element.children unless child.tag? text << child.content end end text.strip =~ content end "" # Remove end # Negation. Create another simple selector to handle it. if statement.sub!(/^:not\(\s*/, "") raise ArgumentError, "Double negatives are not missing feature" unless can_negate @source << ":not(" negation << simple_selector(statement, values, false) raise ArgumentError, "Negation not closed" unless statement.sub!(/^\s*\)/, "") @source << ")" next end # No match: moving on. break end # Return hash. The keys are mapped to instance variables. {:tag_name=>tag_name, :attributes=>attributes, :pseudo=>pseudo, :negation=>negation} end
def to_s #:nodoc:
def to_s #:nodoc: @source end