class HexaPDF::Content::Processor
This class is used for processing content operators extracted from a content stream.

== General Information

When a content stream is read, operators and their operands are extracted. After extraction,
these operators are usually processed with a Processor instance that ensures that the needed
setup (like modifying the graphics state) is done before further processing.

== How Processing Works

The operator implementations (see the Operator module) are called first and they ensure that
the processing state is consistent. For example, operators that modify the graphics state
actually modify the #graphics_state object. However, operator implementations are used only
for this task, so they are very specific and normally don’t need to be changed.

After that, methods corresponding to the operator names are invoked on the processor object (if
they exist). Each PDF operator name is mapped to a more readable message name via the
OPERATOR_MESSAGE_NAME_MAP constant. For example, the operator ‘q’ is mapped to
‘save_graphics_state’.

The task of these methods is to do something useful with the content itself; they don’t need
to concern themselves with ensuring the consistency of the processing state. For example, the
processor could use the processing state to extract the text or to paint the content on a
canvas.

For inline images only the ‘BI’ operator, mapped to ‘inline_image’, is used. Although the
operators ‘ID’ and ‘EI’ also exist for inline images, they are not used because they are consumed
while parsing inline images and do not reflect separate operators.

== Text Processing

Two utility methods, #decode_text and #decode_text_with_positioning, are provided for extracting
text. Both can be invoked directly from the ‘show_text’ and ‘show_text_with_positioning’
methods.
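The two-step dispatch described under "How Processing Works" can be sketched in plain Ruby. This is a simplified illustration and not the library's code: MiniProcessor, TextCollector and the three-entry MESSAGE_NAMES map are hypothetical stand-ins for the processor, a subclass of it, and OPERATOR_MESSAGE_NAME_MAP, and the processing state is reduced to a nesting counter.

```ruby
# Hypothetical, simplified stand-in for the processor; not the library's implementation.
class MiniProcessor
  # Illustrative subset of an operator-to-message-name mapping.
  MESSAGE_NAMES = {
    q: :save_graphics_state,
    Q: :restore_graphics_state,
    Tj: :show_text,
  }.freeze

  attr_reader :depth

  def initialize
    @depth = 0
    # Step 1 handlers: their only job is to keep the processing state consistent.
    @operators = {
      q: ->(processor) { processor.change_depth(+1) },
      Q: ->(processor) { processor.change_depth(-1) },
    }
  end

  def change_depth(delta)
    @depth += delta
  end

  def process(operator, operands = [])
    # Step 1: the operator implementation updates the state, if one exists.
    @operators[operator].call(self, *operands) if @operators.key?(operator)
    # Step 2: the mapped message is sent only if the object responds to it.
    msg = MESSAGE_NAMES[operator]
    send(msg, *operands) if msg && respond_to?(msg, true)
  end
end

# A subclass implements only the messages it cares about.
class TextCollector < MiniProcessor
  attr_reader :log

  def initialize
    super
    @log = []
  end

  private

  # Records the text together with the state that was current when it was shown.
  def show_text(str)
    @log << [depth, str]
  end
end

collector = TextCollector.new
[[:q, []], [:Tj, ["Hello"]], [:Q, []], [:Tj, ["World"]], [:BT, []]].each do |op, operands|
  collector.process(op, operands)
end
p collector.log
```

In this sketch the subclass never touches the counter itself. It only reads the state that the operator handlers have already updated, which mirrors the division of labour described above.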
def decode_horizontal_text(array)
Decodes the given array containing text and positioning information while assuming that the
writing direction is horizontal.
def decode_horizontal_text(array)
  font = graphics_state.font
  scaled_char_space = graphics_state.scaled_character_spacing
  scaled_word_space = (font.word_spacing_applicable? ? graphics_state.scaled_word_spacing : 0)
  scaled_font_size = graphics_state.scaled_font_size

  below_baseline = font.bounding_box[1] * scaled_font_size / \
    graphics_state.scaled_horizontal_scaling + graphics_state.text_rise
  above_baseline = font.bounding_box[3] * scaled_font_size / \
    graphics_state.scaled_horizontal_scaling + graphics_state.text_rise

  text = CompositeBox.new
  array.each do |item|
    if item.kind_of?(Numeric)
      graphics_state.tm.translate(-item * scaled_font_size, 0)
    else
      font.decode(item).each do |code_point|
        char = font.to_utf8(code_point)
        width = font.width(code_point) * scaled_font_size + scaled_char_space + \
          (code_point == 32 ? scaled_word_space : 0)
        matrix = graphics_state.ctm.dup.premultiply(*graphics_state.tm)
        fragment = GlyphBox.new(code_point, char,
                                *matrix.evaluate(0, below_baseline),
                                *matrix.evaluate(width, below_baseline),
                                *matrix.evaluate(0, above_baseline))
        text << fragment
        graphics_state.tm.translate(width, 0)
      end
    end
  end

  text.freeze
end
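The way the text matrix advances in the method above can be followed with plain numbers. The sketch below is a simplified illustration under stated assumptions: all matrices are the identity, horizontal scaling is ignored, the scaled font size is taken to be the font size divided by 1000, and each byte is one code point. The WIDTHS table and the layout method are hypothetical and not part of the library.

```ruby
# Hypothetical glyph widths in 1/1000 of a text space unit, keyed by code point.
WIDTHS = {72 => 722, 105 => 278, 32 => 278}.freeze

# Returns [char, x_start, x_end] triples for a TJ-style array.
# Assumes identity matrices and no horizontal scaling (simplifications for illustration).
def layout(array, font_size:, char_space: 0.0, word_space: 0.0)
  scaled_font_size = font_size / 1000.0
  x = 0.0
  boxes = []
  array.each do |item|
    if item.kind_of?(Numeric)
      # A positive adjustment moves the next glyph to the left, a negative one to the right.
      x -= item * scaled_font_size
    else
      item.each_byte do |code_point|
        # Word spacing is added only for the space character (code point 32).
        width = WIDTHS.fetch(code_point) * scaled_font_size + char_space +
                (code_point == 32 ? word_space : 0)
        boxes << [code_point.chr, x, x + width]
        x += width
      end
    end
  end
  boxes
end

boxes = layout(["H", -100, "i"], font_size: 10)
boxes.each { |char, x0, x1| puts format("%s %.2f..%.2f", char, x0, x1) }
```

With a font size of 10, the adjustment of -100 shifts the following glyph one unit to the right, so the second box starts one unit after the first one ends.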
def decode_text(data)
Decodes the given text object and returns it as a UTF-8 string.

The argument may either be a simple text string (+Tj+ operator) or an array that contains
text strings together with positioning information (+TJ+ operator).
def decode_text(data)
  if data.kind_of?(Array)
    data = data.each_with_object(''.b) {|obj, result| result << obj if obj.kind_of?(String) }
  end
  font = graphics_state.font
  font.decode(data).map {|code_point| font.to_utf8(code_point) }.join
end
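To see what this decoding does to a +TJ+ style array, the same logic can be run against a stand-in font. This is a minimal sketch: FakeFont is hypothetical and treats every byte as one code point, which real fonts do not necessarily do, and the font is passed in as an argument instead of being read from the graphics state.

```ruby
# Hypothetical single-byte font: every byte is one code point.
class FakeFont
  # Splits the raw string into code points (here simply its bytes).
  def decode(data)
    data.bytes
  end

  # Converts one code point into a UTF-8 string.
  def to_utf8(code_point)
    [code_point].pack("U")
  end
end

# Same logic as the method above, with the font passed in explicitly.
def decode_text(data, font)
  if data.kind_of?(Array)
    # Keep only the string parts; numeric positioning adjustments are dropped.
    data = data.each_with_object(''.b) { |obj, result| result << obj if obj.kind_of?(String) }
  end
  font.decode(data).map { |code_point| font.to_utf8(code_point) }.join
end

font = FakeFont.new
plain = decode_text("Hello", font)
positioned = decode_text(["He", -20, "l", 15.5, "lo"], font)
puts plain
puts positioned
```

Because the numeric entries are discarded, both calls yield the same string. A gap that is expressed only through a positioning adjustment therefore does not show up as a space in the result.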
def decode_text_with_positioning(data)
Decodes the given text object and returns it as a CompositeBox object.

The argument may either be a simple text string (+Tj+ operator) or an array that contains
text strings together with positioning information (+TJ+ operator).

For each glyph a GlyphBox object is computed. For horizontal fonts the width is
predetermined but not the height. The latter is chosen to be the height and offset of the
font’s bounding box.
def decode_text_with_positioning(data)
  data = Array(data)
  if graphics_state.font.writing_mode == :horizontal
    decode_horizontal_text(data)
  else
    decode_vertical_text(data)
  end
end
def decode_vertical_text(_data)
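Decodes the given array containing text and positioning information while assuming that the
writing direction is vertical.

Vertical writing is not yet implemented; calling this method raises an error.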
def decode_vertical_text(_data)
  raise "Not yet implemented"
end
def initialize(resources = nil)
Initializes a new processor that uses the resources PDF dictionary for resolving resources
while processing operators.

It is not mandatory to set the resources dictionary on initialization, but it needs to be set
before processing operators that rely on it.
def initialize(resources = nil)
  @operators = Operator::DEFAULT_OPERATORS.dup
  @graphics_state = GraphicsState.new
  @graphics_object = :none
  @original_resources = nil
  self.resources = resources
end
def paint_xobject(name)
Provides a default implementation for the ‘Do’ operator.

It checks whether the XObject is a Form XObject and, if so, processes the contents of the Form
XObject.
def paint_xobject(name)
  xobject = resources.xobject(name)
  return unless xobject[:Subtype] == :Form

  res = resources
  graphics_state.save

  graphics_state.ctm.premultiply(*xobject[:Matrix]) if xobject.key?(:Matrix)
  xobject.process_contents(self, original_resources: @original_resources)

  graphics_state.restore
  self.resources = res
end
def process(operator, operands = [])
Processes the operator with the given operands.

The operator is first processed with an operator implementation (if any) to ensure a correct
processing state. Then the corresponding method on this object is invoked, if it exists.
def process(operator, operands = [])
  @operators[operator].invoke(self, *operands) if @operators.key?(operator)
  msg = OPERATOR_MESSAGE_NAME_MAP[operator]
  send(msg, *operands) if msg && respond_to?(msg, true)
end
def resources=(res)
Sets the resources dictionary used during processing.

The first time resources are set, they are also stored as the "original" resources. This is
needed because form XObjects don’t need to have a resources dictionary and can use the page’s
resources dictionary instead.
def resources=(res)
  @original_resources = res if @original_resources.nil?
  @resources = res
end
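The set-once behaviour of the original resources can be observed in isolation. In the sketch below, ResourceHolder is a hypothetical class that copies only the setter and initializer logic shown in this documentation; the reader for the original resources is added purely for illustration, and plain hashes stand in for resources dictionaries.

```ruby
# Hypothetical holder that mimics the set-once behaviour of the original resources.
class ResourceHolder
  # The original_resources reader exists only for this illustration.
  attr_reader :resources, :original_resources

  def initialize(resources = nil)
    @original_resources = nil
    self.resources = resources
  end

  def resources=(res)
    # Only the first non-nil assignment is remembered as the original.
    @original_resources = res if @original_resources.nil?
    @resources = res
  end
end

holder = ResourceHolder.new
holder.resources = {name: :page}
holder.resources = {name: :form}
p holder.resources
p holder.original_resources
```

Creating the holder without resources leaves the original unset, so the first later assignment (the page's resources in this sketch) becomes the original, and the subsequent assignment changes only the current resources.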