class HexaPDF::Content::Parser

This class knows how to correctly parse a content stream.

== Overview

A content stream is mostly just a stream of PDF objects. However, there is one exception:
inline images.

Since inline images don’t follow the normal PDF object parsing rules, they need to be
handled specially and this is the reason for this class. Therefore only the BI operator is
ever called for inline images because the ID and EI operators are handled by the parser.

To parse some contents the #parse method needs to be called with the contents to be parsed
and a Processor object which is used for processing the parsed operators.
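
As a rough illustration of the processor side of this contract, the sketch below defines a hypothetical OperatorLogger class whose only obligation is to respond to process(operator, params), which is the single call #parse makes in the source listed on this page. The class name is an assumption, and the object is driven by hand here rather than through HexaPDF, so the operators and operands are purely illustrative.

```ruby
# Hypothetical stand-in for a processor object: the parser only ever
# calls process(operator, params) on it.
class OperatorLogger
  attr_reader :seen

  def initialize
    @seen = []
  end

  def process(operator, params)
    # Store a copy because the caller may reuse the params array.
    @seen << [operator, params.dup]
  end
end

logger = OperatorLogger.new
params = [10, 20]
logger.process(:m, params) # roughly what happens after reading "10 20 m"
params.clear               # the parse loop clears the array afterwards
params << 30 << 40
logger.process(:l, params)

p logger.seen
```

Any object with a compatible process method should work in the same position, so a full Processor subclass is only one possible choice.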

def self.parse(contents, processor = nil, &block)

Creates a new Parser object and calls #parse.

def self.parse(contents, processor = nil, &block)
  new.parse(contents, processor, &block)
end

def parse(contents, processor = nil, &block) #:yields: object, params

Parses the contents and calls the processor object or the given block for each parsed
operator.

If a full-blown Processor is not needed (e.g. because the graphics state doesn't need to be
maintained), one can use the block form to handle the parsed objects and their parameters.

Note: The parameters array is reused for each processed operator, so duplicate it if
necessary.

:yields: object, params
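
The note about the reused parameters array is easy to overlook, so here is a minimal, self-contained sketch of the pitfall. The toy_parse method is a hypothetical stand-in that splits on whitespace and treats numbers as operands; it is not HexaPDF's tokenizer. It only mirrors two details from the source listed on this page: the block is made to answer to process, and one params array is cleared and reused after each operator.

```ruby
# Simplified stand-in for the parse loop (assumption: whitespace-separated
# tokens, numbers are operands, everything else is an operator).
def toy_parse(contents, &block)
  # Same adaptation as in #parse: let the block respond to #process.
  block.singleton_class.send(:alias_method, :process, :call)
  params = []
  contents.split.each do |token|
    if token.match?(/\A-?\d+(\.\d+)?\z/)
      params << (token.include?('.') ? token.to_f : token.to_i)
    else
      block.process(token.to_sym, params)
      params.clear # the same array is reused for the next operator
    end
  end
end

kept_reference = []
kept_copy = []
toy_parse("10 20 m 30 40 l S") do |operator, params|
  kept_reference << [operator, params]  # keeps the shared array itself
  kept_copy << [operator, params.dup]   # keeps a snapshot of the operands
end

p kept_reference # every entry refers to the same, by now empty, array
p kept_copy      # operands are preserved per operator
```

Storing params without duplicating it leaves every stored entry pointing at one array that has since been cleared, which is why the snapshot variant is usually the one to use when operands are kept beyond the block call.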

def parse(contents, processor = nil, &block) #:yields: object, params
  raise ArgumentError, "Argument processor or block is needed" if processor.nil? && block.nil?
  if processor.nil?
    block.singleton_class.send(:alias_method, :process, :call)
    processor = block
  end
  tokenizer = Tokenizer.new(contents, raise_on_eos: true)
  params = []
  loop do
    obj = tokenizer.next_object(allow_keyword: true)
    if obj.kind_of?(Tokenizer::Token)
      if obj == 'BI'
        params = parse_inline_image(tokenizer)
      end
      processor.process(obj.to_sym, params)
      params.clear
    else
      params << obj
    end
  end
end

def parse_inline_image(tokenizer)

Parses the inline image at the current position.

def parse_inline_image(tokenizer)
  # BI has already been read, so read the image dictionary
  dict = {}
  while (key = tokenizer.next_object(allow_keyword: true) rescue Tokenizer::NO_MORE_TOKENS)
    if key == 'ID'
      break
    elsif key == Tokenizer::NO_MORE_TOKENS
      raise HexaPDF::Error, "EOS while trying to read dictionary key for inline image"
    elsif !key.kind_of?(Symbol)
      raise HexaPDF::Error, "Inline image dictionary keys must be PDF name objects"
    end
    value = tokenizer.next_object rescue Tokenizer::NO_MORE_TOKENS
    if value == Tokenizer::NO_MORE_TOKENS
      raise HexaPDF::Error, "EOS while trying to read dictionary value for inline image"
    end
    dict[key] = value
  end
  # one whitespace character after ID
  tokenizer.next_byte
  real_end_found = false
  image_data = ''.b
  # find the EI operator and handle EI appearing inside the image data
  until real_end_found
    data = tokenizer.scan_until(/(?=EI(?:[#{Tokenizer::WHITESPACE}]|\z))/o)
    if data.nil?
      raise HexaPDF::Error, "End inline image marker EI not found"
    end
    image_data << data
    tokenizer.pos += 2
    last_pos = tokenizer.pos
    # Check if we found EI inside of the image data
    count = 0
    while count < MAX_TOKEN_CHECK
      token = tokenizer.next_object(allow_keyword: true) rescue Tokenizer::NO_MORE_TOKENS
      if token == Tokenizer::NO_MORE_TOKENS
        count += MAX_TOKEN_CHECK
      elsif token.kind_of?(Tokenizer::Token) &&
          !Processor::OPERATOR_MESSAGE_NAME_MAP.key?(token.to_sym)
        break #  invalid token
      end
      count += 1
    end
    if count >= MAX_TOKEN_CHECK
      real_end_found = true
    else
      image_data << "EI"
    end
    tokenizer.pos = last_pos
  end
  [dict, image_data]
end
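
The search for the real EI marker can be sketched in isolation with Ruby's StringScanner. This is a simplified illustration under stated assumptions, not the method above: KNOWN_OPERATORS is a small illustrative subset, MAX_CHECK is a hypothetical look-ahead limit, and the plausibility check treats only numbers and known operators as valid, whereas the method above accepts any PDF object and rejects only unknown keywords.

```ruby
require 'strscan'

KNOWN_OPERATORS = %w[q Q cm Do S m l].freeze # illustrative subset only
MAX_CHECK = 4                                # hypothetical look-ahead limit

# Returns the image data up to the first EI that is followed by
# plausible content stream tokens; leaves the scanner right after that EI.
def toy_inline_image_data(scanner)
  image_data = ''.b
  loop do
    # The lookahead stops *before* an EI followed by whitespace or end of input.
    chunk = scanner.scan_until(/(?=EI(?:\s|\z))/)
    raise "End inline image marker EI not found" if chunk.nil?
    image_data << chunk
    scanner.pos += 2 # skip the candidate EI

    following = scanner.rest.split.first(MAX_CHECK)
    plausible = following.all? do |token|
      token.match?(/\A-?\d/) || KNOWN_OPERATORS.include?(token)
    end
    return image_data if plausible

    image_data << "EI" # the EI was part of the image data, keep scanning
  end
end

scanner = StringScanner.new("abEI xyzEI\nQ 1 0 0 1 0 0 cm")
data = toy_inline_image_data(scanner)
p data
p scanner.rest
```

In this input the first EI is followed by the token xyzEI, which is neither a number nor a known operator, so it is treated as image data and the scan continues to the second EI.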

Namespace

HexaPDF::Content

Class Methods

:: parse

Defined in

lib/hexapdf/content/parser.rb
