class HexaPDF::Content::Parser
This class knows how to correctly parse a content stream.

== Overview

A content stream is mostly just a stream of PDF objects. However, there is one exception:
inline images.

Since inline images don’t follow the normal PDF object parsing rules, they need to be
handled specially and this is the reason for this class. Therefore only the BI operator is
ever called for inline images because the ID and EI operators are handled by the parser.

To parse some contents the #parse method needs to be called with the contents to be parsed
and a Processor object which is used for processing the parsed operators.
def self.parse(contents, processor = nil, &block)
def self.parse(contents, processor = nil, &block)
  new.parse(contents, processor, &block)
end
def parse(contents, processor = nil, &block) #:yields: object, params
Parses the contents and calls the processor object or the given block for each parsed
operator.

If a full-blown Processor is not needed (e.g. because the graphics state doesn't need to be
maintained), one can use the block form to handle the parsed objects and their parameters.

Note: The parameters array is reused for each processed operator, so duplicate it if
necessary.
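The note about the reused parameters array can be illustrated with a small, self-contained sketch. It does not use HexaPDF; `each_operator` and `collect_operators` are hypothetical stand-ins that only mimic the reuse-and-clear behaviour visible in the method's source, and symbols stand in for operator tokens.

```ruby
# Hypothetical stand-in for the parse loop: a single params array is
# filled with operands, yielded with the operator, and then cleared.
def each_operator(tokens)
  params = []
  tokens.each do |token|
    if token.is_a?(Symbol) # symbols play the role of operator tokens here
      yield(token, params)
      params.clear
    else
      params << token
    end
  end
end

# Collects [operator, params] pairs, either keeping a reference to the
# shared array (copy: false) or a snapshot of it (copy: true).
def collect_operators(tokens, copy:)
  result = []
  each_operator(tokens) do |operator, params|
    result << [operator, copy ? params.dup : params]
  end
  result
end

tokens = [1, 0, 0, 1, 50, 50, :cm, 0.5, :g]
p collect_operators(tokens, copy: true)
p collect_operators(tokens, copy: false)
```

With `copy: false` every stored pair points at the same array, which is empty once the loop has finished; with `copy: true` each pair keeps its own operands.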
def parse(contents, processor = nil, &block) #:yields: object, params
  raise ArgumentError, "Argument processor or block is needed" if processor.nil? && block.nil?
  if processor.nil?
    block.singleton_class.send(:alias_method, :process, :call)
    processor = block
  end

  tokenizer = Tokenizer.new(contents, raise_on_eos: true)
  params = []
  loop do
    obj = tokenizer.next_object(allow_keyword: true)
    if obj.kind_of?(Tokenizer::Token)
      if obj == 'BI'
        params = parse_inline_image(tokenizer)
      end
      processor.process(obj.to_sym, params)
      params.clear
    else
      params << obj
    end
  end
end
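The block form works because the block is given a `process` method on its singleton class, so it can be called like a processor object. A minimal sketch of that technique in isolation follows; `as_processor` is a hypothetical helper name, not part of HexaPDF.

```ruby
# Hypothetical helper: aliases #call as #process on this one callable only,
# so it can be passed where an object responding to
# process(operator, params) is expected.
def as_processor(callable)
  callable.singleton_class.send(:alias_method, :process, :call)
  callable
end

seen = []
handler = as_processor(proc { |operator, params| seen << [operator, params.dup] })
handler.process(:re, [0, 0, 10, 10])
p seen
```

Because the alias lives on the singleton class, other Proc objects are not affected.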
def parse_inline_image(tokenizer)
def parse_inline_image(tokenizer)
  # BI has already been read, so read the image dictionary
  dict = {}
  while (key = tokenizer.next_object(allow_keyword: true) rescue Tokenizer::NO_MORE_TOKENS)
    if key == 'ID'
      break
    elsif key == Tokenizer::NO_MORE_TOKENS
      raise HexaPDF::Error, "EOS while trying to read dictionary key for inline image"
    elsif !key.kind_of?(Symbol)
      raise HexaPDF::Error, "Inline image dictionary keys must be PDF name objects"
    end
    value = tokenizer.next_object rescue Tokenizer::NO_MORE_TOKENS
    if value == Tokenizer::NO_MORE_TOKENS
      raise HexaPDF::Error, "EOS while trying to read dictionary value for inline image"
    end
    dict[key] = value
  end

  # one whitespace character after ID
  tokenizer.next_byte

  real_end_found = false
  image_data = ''.b

  # find the EI operator and handle EI appearing inside the image data
  until real_end_found
    data = tokenizer.scan_until(/(?=EI(?:[#{Tokenizer::WHITESPACE}]|\z))/o)
    if data.nil?
      raise HexaPDF::Error, "End inline image marker EI not found"
    end
    image_data << data
    tokenizer.pos += 2
    last_pos = tokenizer.pos

    # Check if we found EI inside of the image data
    count = 0
    while count < MAX_TOKEN_CHECK
      token = tokenizer.next_object(allow_keyword: true) rescue Tokenizer::NO_MORE_TOKENS
      if token == Tokenizer::NO_MORE_TOKENS
        count += MAX_TOKEN_CHECK
      elsif token.kind_of?(Tokenizer::Token) &&
          !Processor::OPERATOR_MESSAGE_NAME_MAP.key?(token.to_sym)
        break # invalid token
      end
      count += 1
    end

    if count >= MAX_TOKEN_CHECK
      real_end_found = true
    else
      image_data << "EI"
    end
    tokenizer.pos = last_pos
  end

  [dict, image_data]
end
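The first step of the EI search, scanning up to an EI that is followed by whitespace or the end of the input, can be sketched with Ruby's standard StringScanner. This is a simplified illustration, not the library's code: `PDF_WHITESPACE` and `read_until_ei` are hypothetical names, and the whitespace set is assumed to match the one the tokenizer uses.

```ruby
require 'strscan'

# Assumed set of PDF whitespace characters (stand-in for Tokenizer::WHITESPACE).
PDF_WHITESPACE = " \n\r\0\t\f"

# Returns [image_data, rest]: the bytes before the first EI that is followed
# by whitespace or end of input, and whatever comes after that EI marker.
def read_until_ei(data)
  scanner = StringScanner.new(data.b)
  # The lookahead keeps EI out of the returned data.
  image_data = scanner.scan_until(/(?=EI(?:[#{PDF_WHITESPACE}]|\z))/)
  raise ArgumentError, "End inline image marker EI not found" if image_data.nil?
  scanner.pos += 2 # skip the EI marker itself
  [image_data, scanner.rest]
end

p read_until_ei("abcEIxyz EI Q")
```

An EI directly followed by a non-whitespace byte is treated as image data. The sketch stops there; the method above additionally reads up to MAX_TOKEN_CHECK following tokens and, if one of them is not a known operator, treats the EI as part of the image data and continues the search.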