class PDF::Reader::Content

    Walks the PDF file and calls the appropriate callback methods when something of interest is
    found.

    The callback methods should exist on the receiver object passed into the constructor. Whenever
    some content is found that will trigger a callback, the receiver is checked to see if the callback
    is defined.

    If it is defined it will be called. If not, processing will continue.

    = Available Callbacks

    The following callbacks are available and should be methods defined on your receiver class. Only
    implement the ones you need; the rest will be ignored.

    Some callbacks will include parameters, which will be passed in as an array. For callbacks that supply no
    parameters, or where you don’t need them, the *params argument can be left off. Some example callback
    method definitions are:

      def begin_document
      def end_page
      def show_text(string, *params)
      def fill_stroke(*params)

    You should be able to infer the basic command the callback is reporting based on the name. For
    further experimentation, define the callback with just a *params parameter, then print out the
    contents of the array using something like:

      puts params.inspect

    == Text Callbacks

    All text passed into these callbacks will be encoded as UTF-8. Depending on where (and when) the
    PDF was generated, there’s a good chance the text is NOT stored as UTF-8 internally, so be careful
    when comparing strings returned from PDF::Reader (in unit tests, for example). The
    string may not be byte-by-byte identical with the string that was originally written to the PDF.

    - end_text_object
    - move_to_start_of_next_line
    - set_character_spacing
    - move_text_position
    - move_text_position_and_set_leading
    - set_text_font_and_size
    - show_text
    - show_text_with_positioning
    - set_text_leading
    - set_text_matrix_and_text_line_matrix
    - set_text_rendering_mode
    - set_text_rise
    - set_word_spacing
    - set_horizontal_text_scaling
    - move_to_next_line_and_show_text
    - set_spacing_next_line_show_text

    == Graphics Callbacks

    - close_fill_stroke
    - fill_stroke
    - close_fill_stroke_with_even_odd
    - fill_stroke_with_even_odd
    - begin_marked_content_with_pl
    - begin_inline_image
    - begin_marked_content
    - begin_text_object
    - append_curved_segment
    - concatenate_matrix
    - set_stroke_color_space
    - set_nonstroke_color_space
    - set_line_dash
    - set_glyph_width
    - set_glyph_width_and_bounding_box
    - invoke_xobject
    - define_marked_content_with_pl
    - end_inline_image
    - end_marked_content
    - fill_path_with_nonzero
    - fill_path_with_even_odd
    - set_gray_for_stroking
    - set_gray_for_nonstroking
    - set_graphics_state_parameters
    - close_subpath
    - set_flatness_tolerance
    - begin_inline_image_data
    - set_line_join_style
    - set_line_cap_style
    - set_cmyk_color_for_stroking
    - set_cmyk_color_for_nonstroking
    - append_line
    - begin_new_subpath
    - set_miter_limit
    - define_marked_content_point
    - end_path
    - save_graphics_state
    - restore_graphics_state
    - append_rectangle
    - set_rgb_color_for_stroking
    - set_rgb_color_for_nonstroking
    - set_color_rendering_intent
    - close_and_stroke_path
    - stroke_path
    - set_color_for_stroking
    - set_color_for_nonstroking
    - set_color_for_stroking_and_special
    - set_color_for_nonstroking_and_special
    - paint_area_with_shading_pattern
    - append_curved_segment_initial_point_replicated
    - set_line_width
    - set_clipping_path_with_nonzero
    - set_clipping_path_with_even_odd
    - append_curved_segment_final_point_replicated

    == Misc Callbacks

    - begin_compatibility_section
    - end_compatibility_section
    - begin_document
    - end_document
    - begin_page_container
    - end_page_container
    - begin_page
    - end_page

    == Resource Callbacks

    Each page and page_container can contain a range of resources required for the page,
    including things like fonts and images. The following callbacks may appear
    after begin_page_container and begin_page if the relevant resources exist
    on a page:

    - resource_procset
    - resource_xobject
    - resource_extgstate
    - resource_colorspace
    - resource_pattern
    - resource_font

    In most cases, these callbacks associate a name with each resource, allowing it
    to be referred to by name in the page content. For example, an XObject can hold an image.
    If it gets mapped to the name “IM1”, then it can be placed on the page using
    invoke_xobject “IM1”.
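To illustrate the receiver pattern documented for this class, here is a minimal sketch of a receiver that implements only three callbacks and collects the text of each page. The class name `PageTextReceiver` is hypothetical, and the callbacks are invoked by hand to keep the example self-contained; in real use they would be called while a file is being walked.

```ruby
# Hypothetical receiver: implements only the callbacks it cares about.
class PageTextReceiver
  attr_reader :pages

  def initialize
    @pages = []
  end

  # begin_page supplies the page dictionary; it is not needed here,
  # so the arguments are soaked up with *params.
  def begin_page(*params)
    @pages << String.new
  end

  # show_text receives the string first, followed by any extra arguments.
  def show_text(string, *params)
    @pages.last << string
  end

  # Callbacks that need no parameters can be defined without arguments.
  def end_page
    @pages.last.strip!
  end
end

# Simulate the calls that a page with two text operators would trigger.
receiver = PageTextReceiver.new
receiver.begin_page({ "Type" => "Page" })
receiver.show_text("Hello, ")
receiver.show_text("world ")
receiver.end_page

puts receiver.pages.inspect
```

Any callback the receiver does not define, such as `set_text_font_and_size`, would simply be skipped.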

def callback (name, params=[])

Calls the callback method given by name on the receiver object, passing params as its arguments. If the receiver does not define the method, nothing happens.
###############################################################################
def callback (name, params=[])
  @receiver.send(name, *params) if @receiver.respond_to?(name)
end
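The dispatch rule used by `callback` can be tried in isolation. The sketch below copies the one-line body into a hypothetical `Dispatcher` class and pairs it with a hypothetical `QuietReceiver` that defines a single callback; neither class is part of the library.

```ruby
# Hypothetical stand-in for the dispatching side.
class Dispatcher
  def initialize(receiver)
    @receiver = receiver
  end

  # Same shape as the method above: send only if the receiver defines it.
  def callback(name, params = [])
    @receiver.send(name, *params) if @receiver.respond_to?(name)
  end
end

# Hypothetical receiver that only knows about begin_document.
class QuietReceiver
  attr_reader :seen

  def initialize
    @seen = []
  end

  def begin_document(root)
    @seen << [:begin_document, root]
  end
end

receiver   = QuietReceiver.new
dispatcher = Dispatcher.new(receiver)

dispatcher.callback(:begin_document, [{ "Type" => "Catalog" }])
skipped = dispatcher.callback(:end_document) # not defined: no error is raised

puts receiver.seen.inspect
puts skipped.inspect
```

Two details follow from this design. The params array is splatted, so each element arrives as a separate argument. And because `respond_to?` ignores private methods by default, callbacks need to be public to be seen.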

def content_stream (instructions)

Reads a PDF content stream and calls the appropriate callback method for each operator
it contains.
###############################################################################
def content_stream (instructions)
  @buffer = Buffer.new(StringIO.new(instructions))
  @parser = Parser.new(@buffer, @xref)
  @params = [] if @params.nil?
  until @buffer.eof?
    loop do
      token = @parser.parse_token(OPERATORS)
      if token.kind_of?(Token) && OPERATORS.has_key?(token)
        @current_font = @params.first if OPERATORS[token] == :set_text_font_and_size
        # handle special cases in response to certain operators
        if OPERATORS[token].to_s.include?("show_text") && @fonts[@current_font]
          # convert any text to utf-8
          @params = @fonts[@current_font].to_utf8(@params)
        elsif token == "ID"
          # inline image data, first convert the current params into a more familiar hash
          map = {}
          @params.each_slice(2) do |a|
            map[a.first] = a.last
          end
          @params = [map]
          # read the raw image data from the buffer without tokenising
          @params << @buffer.read_until("EI")
        end
        callback(OPERATORS[token], @params)
        @params.clear
        break
      end
      @params << token
    end
  end
rescue EOFError
  # the buffer ran out of data while reading; there is nothing left to process
end
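The core of the loop above is that operands accumulate until an operator token arrives, at which point the operator's callback is reported with the collected operands. The sketch below shows that idea with a deliberately naive whitespace tokeniser. `SAMPLE_OPERATORS` and `each_instruction` are hypothetical names, the table is a small subset, and tokens stay as raw strings rather than parsed values.

```ruby
# Illustrative subset of an operator table (hypothetical constant name).
SAMPLE_OPERATORS = {
  "BT" => :begin_text_object,
  "ET" => :end_text_object,
  "Tf" => :set_text_font_and_size,
  "Td" => :move_text_position,
  "Tj" => :show_text
}.freeze

# Naive tokeniser: splits on whitespace, so it cannot cope with strings
# that contain spaces, arrays, dictionaries or inline image data.
def each_instruction(stream)
  params = []
  stream.split.each do |token|
    if SAMPLE_OPERATORS.key?(token)
      # an operator ends the instruction: report it with its operands
      yield SAMPLE_OPERATORS[token], params.dup
      params.clear
    else
      params << token
    end
  end
end

calls = []
each_instruction("BT /F1 12 Tf 72 720 Td (Hello) Tj ET") do |name, params|
  calls << [name, params]
end
calls.each { |name, params| puts "#{name} #{params.inspect}" }

# The inline image branch pairs up its operands into a hash in the same way.
pairs = ["W", 8, "H", 8, "BPC", 1]
image_dict = {}
pairs.each_slice(2) { |key, value| image_dict[key] = value }
puts image_dict.inspect
```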

def document (root)

Begin processing the document
###############################################################################
def document (root)
  callback(:begin_document, [root])
  walk_pages(@xref.object(root['Pages']))
  callback(:end_document)
end

def initialize (receiver, xref)

Create a new PDF::Reader::Content object to process the contents of a PDF file
- receiver - an object containing the required callback methods
- xref - a PDF::Reader::Xref object that contains references to all the objects in a PDF file
###############################################################################
def initialize (receiver, xref)
  @receiver = receiver
  @xref     = xref
  @fonts ||= {}
end

def resolve_references(obj)

Recursively convert any PDF::Reader::Reference objects into the real objects they point to
###############################################################################
def resolve_references(obj)
  case obj
  when PDF::Reader::Reference then resolve_references(@xref.object(obj))
  when Hash                   then obj.each { |key,val| obj[key] = resolve_references(val) }
  when Array                  then obj.collect { |item| resolve_references(item) }
  else
    obj
  end
end
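The recursion in `resolve_references` can be illustrated without the library. In this sketch a reference is a hypothetical `SketchReference` struct holding an object id, and the cross-reference table is a plain Hash called `OBJECTS`; both are assumptions made for the example.

```ruby
# Hypothetical stand-ins: a reference is just an id, and the
# cross-reference table is a Hash from id to object.
SketchReference = Struct.new(:id)

OBJECTS = {
  1 => { "Font" => SketchReference.new(2) },
  2 => { "F1" => SketchReference.new(3) },
  3 => { "BaseFont" => "Helvetica", "Widths" => [SketchReference.new(4), 600] },
  4 => 500
}.freeze

# Follows references until only plain values, hashes and arrays remain.
def resolve(obj, objects)
  case obj
  when SketchReference then resolve(objects.fetch(obj.id), objects)
  when Hash            then obj.transform_values { |val| resolve(val, objects) }
  when Array           then obj.map { |item| resolve(item, objects) }
  else obj
  end
end

resolved = resolve(SketchReference.new(1), OBJECTS)
puts resolved.inspect
```

Unlike the method above, this sketch builds new hashes instead of updating them in place. Neither version guards against reference cycles, so a structure that points back to itself would recurse without end.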

def walk_pages (page)

Walk over all pages in the PDF file, calling the appropriate callbacks for each page and all
its content
###############################################################################
def walk_pages (page)
  
  if page['Resources']
    res = page['Resources']
    page.delete('Resources')
  end
  # extract page content
  if page['Type'] == "Pages"
    callback(:begin_page_container, [page])
    walk_resources(@xref.object(res)) if res
    page['Kids'].each {|child| walk_pages(@xref.object(child))}
    callback(:end_page_container)
  elsif page['Type'] == "Page"
    callback(:begin_page, [page])
    walk_resources(@xref.object(res)) if res
    @page = page
    @params = []
    page['Contents'].to_a.each do |cstream|
      obj, stream = @xref.object(cstream)
      content_stream(stream)
    end if page.has_key?('Contents') and page['Contents']
    callback(:end_page)
  end
end
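The order in which the page callbacks fire follows directly from the recursion above. The sketch below records that order for a small nested tree. It is a simplification: kids are inline hashes rather than references, and the resource and content stream steps are left out. The method name `walk` is hypothetical.

```ruby
# Records callback names in the order a nested page tree triggers them.
def walk(node, log)
  case node["Type"]
  when "Pages"
    log << :begin_page_container
    node["Kids"].each { |kid| walk(kid, log) }
    log << :end_page_container
  when "Page"
    log << :begin_page
    log << :end_page
  end
  log
end

tree = {
  "Type" => "Pages",
  "Kids" => [
    { "Type" => "Page" },
    { "Type" => "Pages", "Kids" => [{ "Type" => "Page" }] }
  ]
}

order = walk(tree, [])
puts order.inspect
```

Each container's begin and end callbacks bracket everything beneath it, so a receiver can keep a stack of containers if it needs to know how deeply a page is nested.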

def walk_resources(resources)

Walk a resources dictionary, calling the matching resource callback for each ProcSet, XObject, ExtGState, ColorSpace, Pattern and Font entry it contains

###############################################################################
def walk_resources(resources)
  resources = resolve_references(resources)
  
  # extract any procset information
  if resources['ProcSet']
    callback(:resource_procset, resources['ProcSet'])
  end
  # extract any xobject information
  if resources['XObject']
    @xref.object(resources['XObject']).each do |name, val|
      obj, stream = @xref.object(val)
      callback(:resource_xobject, [name, obj, stream])
    end
  end
  # extract any extgstate information
  if resources['ExtGState']
    @xref.object(resources['ExtGState']).each do |name, val|
      callback(:resource_extgstate, [name, @xref.object(val)])
    end
  end
  # extract any colorspace information
  if resources['ColorSpace']
    @xref.object(resources['ColorSpace']).each do |name, val|
      callback(:resource_colorspace, [name, @xref.object(val)])
    end
  end
  # extract any pattern information
  if resources['Pattern']
    @xref.object(resources['Pattern']).each do |name, val|
      callback(:resource_pattern, [name, @xref.object(val)])
    end
  end
  # extract any font information
  if resources['Font']
    @xref.object(resources['Font']).each do |label, desc|
      desc = @xref.object(desc)
      @fonts[label] = PDF::Reader::Font.new
      @fonts[label].label = label
      @fonts[label].subtype = desc['Subtype'] if desc['Subtype']
      @fonts[label].basefont = desc['BaseFont'] if desc['BaseFont']
      @fonts[label].encoding = PDF::Reader::Encoding.factory(@xref.object(desc['Encoding']))
      @fonts[label].descendantfonts = desc['DescendantFonts'] if desc['DescendantFonts']
      if desc['ToUnicode']
        obj, cmap = @xref.object(desc['ToUnicode'])
        
        # this stream is a cmap
        begin
          @fonts[label].tounicode = PDF::Reader::CMap.new(cmap)
        rescue
          # if the CMap fails to parse, don't worry too much. Means we can't translate the text properly
        end
      end
      callback(:resource_font, [label, @fonts[label]])
    end
  end
end
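A receiver can use the resource callbacks to remember resources by name and match them up with later page content. The sketch below does this for XObjects. The class `XObjectTracker` is hypothetical, the argument order for `resource_xobject` follows the call in `walk_resources` (name, object, stream), and names are plain strings here as an assumption.

```ruby
# Hypothetical receiver that remembers XObjects by name so that a later
# invoke_xobject call can be matched to the resource it refers to.
class XObjectTracker
  attr_reader :placed

  def initialize
    @xobjects = {}
    @placed   = []
  end

  # Reported once per entry in the XObject resource dictionary.
  def resource_xobject(name, obj, stream)
    @xobjects[name] = obj
  end

  # Reported when the page content places an XObject by name.
  def invoke_xobject(name, *params)
    obj = @xobjects[name]
    @placed << [name, obj["Subtype"]] if obj
  end
end

tracker = XObjectTracker.new
tracker.resource_xobject("IM1", { "Subtype" => "Image" }, "\x00\x01\x02")
tracker.invoke_xobject("IM1")
tracker.invoke_xobject("IM2") # unknown name: ignored

puts tracker.placed.inspect
```

This works because the resource callbacks for a page are reported before its content stream is processed, so the lookup table is filled by the time `invoke_xobject` arrives.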