class PDF::Reader::Content
Walks the PDF file and calls the appropriate callback methods when something of interest is
found.

The callback methods should exist on the receiver object passed into the constructor. Whenever
some content is found that will trigger a callback, the receiver is checked to see if the callback
is defined.

If it is defined it will be called. If not, processing will continue.

= Available Callbacks

The following callbacks are available and should be methods defined on your receiver class. Only
implement the ones you need - the rest will be ignored.

Some callbacks will include parameters which will be passed in as an array. For callbacks that supply no
parameters, or where you don’t need them, the *params argument can be left off. Some example callback
method definitions are:

  def begin_document
  def end_page
  def show_text(string, *params)
  def fill_stroke(*params)

You should be able to infer the basic command the callback is reporting based on the name. For
further experimentation, define the callback with just a *params parameter, then print out the
contents of the array using something like:

  puts params.inspect

== Text Callbacks

All text passed into these callbacks will be encoded as UTF-8. Depending on where (and when) the
PDF was generated, there’s a good chance the text is NOT stored as UTF-8 internally, so be careful
when comparing strings returned from PDF::Reader (in unit tests, for example). The
string may not be byte-for-byte identical to the string that was originally written to the PDF.

- end_text_object
- move_to_start_of_next_line
- set_character_spacing
- move_text_position
- move_text_position_and_set_leading
- set_text_font_and_size
- show_text
- show_text_with_positioning
- set_text_leading
- set_text_matrix_and_text_line_matrix
- set_text_rendering_mode
- set_text_rise
- set_word_spacing
- set_horizontal_text_scaling
- move_to_next_line_and_show_text
- set_spacing_next_line_show_text

== Graphics Callbacks

- close_fill_stroke
- fill_stroke
- close_fill_stroke_with_even_odd
- fill_stroke_with_even_odd
- begin_marked_content_with_pl
- begin_inline_image
- begin_marked_content
- begin_text_object
- append_curved_segment
- concatenate_matrix
- set_stroke_color_space
- set_nonstroke_color_space
- set_line_dash
- set_glyph_width
- set_glyph_width_and_bounding_box
- invoke_xobject
- define_marked_content_with_pl
- end_inline_image
- end_marked_content
- fill_path_with_nonzero
- fill_path_with_even_odd
- set_gray_for_stroking
- set_gray_for_nonstroking
- set_graphics_state_parameters
- close_subpath
- set_flatness_tolerance
- begin_inline_image_data
- set_line_join_style
- set_line_cap_style
- set_cmyk_color_for_stroking
- set_cmyk_color_for_nonstroking
- append_line
- begin_new_subpath
- set_miter_limit
- define_marked_content_point
- end_path
- save_graphics_state
- restore_graphics_state
- append_rectangle
- set_rgb_color_for_stroking
- set_rgb_color_for_nonstroking
- set_color_rendering_intent
- close_and_stroke_path
- stroke_path
- set_color_for_stroking
- set_color_for_nonstroking
- set_color_for_stroking_and_special
- set_color_for_nonstroking_and_special
- paint_area_with_shading_pattern
- append_curved_segment_initial_point_replicated
- set_line_width
- set_clipping_path_with_nonzero
- set_clipping_path_with_even_odd
- append_curved_segment_final_point_replicated

== Misc Callbacks

- begin_compatibility_section
- end_compatibility_section
- begin_document
- end_document
- begin_page_container
- end_page_container
- begin_page
- end_page

== Resource Callbacks

Each page and page_container can contain a range of resources required for the page,
including things like fonts and images. The following callbacks may appear
after begin_page_container and begin_page if the relevant resources exist
on a page:

- resource_procset
- resource_xobject
- resource_extgstate
- resource_colorspace
- resource_pattern
- resource_font

In most cases, these callbacks associate a name with each resource, allowing it
to be referred to by name in the page content. For example, an XObject can hold an image.
If it gets mapped to the name “IM1”, then it can be placed on the page using
invoke_xobject “IM1”.
###############################################################################
def callback (name, params=[])
###############################################################################
def callback (name, params=[])
  @receiver.send(name, *params) if @receiver.respond_to?(name)
end
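To make the dispatch rule concrete, here is a minimal sketch of a receiver that defines only two callbacks, driven by a small stand-in for the method above. The class names PageTextReceiver and MiniDispatcher are hypothetical and are not part of PDF::Reader; only the respond_to?/send pattern is taken from the source.

```ruby
# Hypothetical receiver: only the callbacks it defines are ever invoked.
class PageTextReceiver
  attr_reader :events

  def initialize
    @events = []
  end

  # Receives the single element of the params array as its argument.
  def begin_document(root)
    @events << [:begin_document, root]
  end

  # Extra operands, if any, are absorbed by *params.
  def show_text(string, *params)
    @events << [:show_text, string]
  end
end

# Stand-in for the dispatch logic only; this is not the library class.
class MiniDispatcher
  def initialize(receiver)
    @receiver = receiver
  end

  def callback(name, params = [])
    @receiver.send(name, *params) if @receiver.respond_to?(name)
  end
end

receiver = PageTextReceiver.new
dispatcher = MiniDispatcher.new(receiver)
dispatcher.callback(:begin_document, [{ "Type" => "Catalog" }])
dispatcher.callback(:show_text, ["Hello"])
dispatcher.callback(:end_page) # not defined on the receiver, so it is skipped
```

Because the params array is splatted, each element arrives as a separate method argument, which is why a receiver method can name the arguments it cares about and collect the rest with *params.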
def content_stream (instructions)
Reads a PDF content stream and calls all the appropriate callback methods for the operators
it contains.
###############################################################################
def content_stream (instructions)
  @buffer = Buffer.new(StringIO.new(instructions))
  @parser = Parser.new(@buffer, @xref)
  @params = [] if @params.nil?

  until @buffer.eof?
    loop do
      token = @parser.parse_token(OPERATORS)

      if token.kind_of?(Token) and OPERATORS.has_key?(token)
        @current_font = @params.first if OPERATORS[token] == :set_text_font_and_size

        # handle special cases in response to certain operators
        if OPERATORS[token].to_s.include?("show_text") && @fonts[@current_font]
          # convert any text to utf-8
          @params = @fonts[@current_font].to_utf8(@params)
        elsif token == "ID"
          # inline image data, first convert the current params into a more familiar hash
          map = {}
          @params.each_slice(2) do |a|
            map[a.first] = a.last
          end
          @params = [map]
          # read the raw image data from the buffer without tokenising
          @params << @buffer.read_until("EI")
        end
        callback(OPERATORS[token], @params)
        @params.clear
        break
      end
      @params << token
    end
  end
rescue EOFError => e
end
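The core of the loop above is that operands accumulate until an operator token arrives, at which point the operator's callback is fired with the collected operands and the list is cleared. The following is a rough sketch of that grouping on an already tokenised list. SKETCH_OPERATORS and group_instructions are hypothetical names, and the four-entry table is an assumed subset of the real operator table.

```ruby
# Hypothetical subset of the operator table; the real OPERATORS constant is larger.
SKETCH_OPERATORS = {
  "BT" => :begin_text_object,
  "ET" => :end_text_object,
  "Tf" => :set_text_font_and_size,
  "Tj" => :show_text
}.freeze

# Collect operands until an operator arrives, then record [callback_name, operands].
# Simplification: tokens are plain values here, so an operand string equal to an
# operator name would be misread. The real parser distinguishes Token objects.
def group_instructions(tokens)
  params = []
  instructions = []
  tokens.each do |token|
    if SKETCH_OPERATORS.key?(token)
      instructions << [SKETCH_OPERATORS[token], params.dup]
      params.clear
    else
      params << token
    end
  end
  instructions
end

tokens = ["BT", "F1", 12, "Tf", "Hello", "Tj", "ET"]
instructions = group_instructions(tokens)
# instructions is:
# [[:begin_text_object, []],
#  [:set_text_font_and_size, ["F1", 12]],
#  [:show_text, ["Hello"]],
#  [:end_text_object, []]]
```

The sketch leaves out the two special cases handled in the real method: converting show_text operands to UTF-8 through the current font, and reading raw inline image data after an "ID" operator.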
def document (root)
###############################################################################
def document (root)
  callback(:begin_document, [root])
  walk_pages(@xref.object(root['Pages']))
  callback(:end_document)
end
def initialize (receiver, xref)
Create a new PDF::Reader::Content object to process the contents of a PDF file.
- receiver - an object containing the required callback methods
- xref - the cross-reference object used to look up objects in the PDF file
###############################################################################
def initialize (receiver, xref)
  @receiver = receiver
  @xref = xref
  @fonts ||= {}
end
def resolve_references(obj)
###############################################################################
def resolve_references(obj)
  case obj
  when PDF::Reader::Reference then
    resolve_references(@xref.object(obj))
  when Hash then
    obj.each { |key,val| obj[key] = resolve_references(val) }
  when Array then
    obj.collect { |item| resolve_references(item) }
  else
    obj
  end
end
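As an illustration of the recursive resolution above, the sketch below replaces every reference in a nested structure with the object it points to. SketchRef, SKETCH_OBJECTS and resolve_sketch are hypothetical stand-ins for PDF::Reader::Reference, the xref table and the method itself. Unlike the source, this variant builds a new hash instead of updating the existing one in place.

```ruby
# Hypothetical stand-in for PDF::Reader::Reference: just wraps an object id.
SketchRef = Struct.new(:id)

# Hypothetical stand-in for the xref table: object id => object.
SKETCH_OBJECTS = {
  1 => { "Font" => SketchRef.new(2) },
  2 => ["F1", SketchRef.new(3)],
  3 => "Helvetica"
}.freeze

# Recursively swap references for the objects they point to.
def resolve_sketch(obj, objects)
  case obj
  when SketchRef
    resolve_sketch(objects.fetch(obj.id), objects)
  when Hash
    obj.each_with_object({}) { |(key, val), out| out[key] = resolve_sketch(val, objects) }
  when Array
    obj.map { |item| resolve_sketch(item, objects) }
  else
    obj
  end
end

resolved = resolve_sketch(SketchRef.new(1), SKETCH_OBJECTS)
# resolved is {"Font" => ["F1", "Helvetica"]}
```

Neither the sketch nor the original guards against circular references, so a structure that refers back to itself would recurse without end.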
def walk_pages (page)
Walk over all pages in the PDF file, calling the appropriate callbacks for each page and all
its content.
###############################################################################
def walk_pages (page)
  if page['Resources']
    res = page['Resources']
    page.delete('Resources')
  end

  # extract page content
  if page['Type'] == "Pages"
    callback(:begin_page_container, [page])
    walk_resources(@xref.object(res)) if res
    page['Kids'].each {|child| walk_pages(@xref.object(child))}
    callback(:end_page_container)
  elsif page['Type'] == "Page"
    callback(:begin_page, [page])
    walk_resources(@xref.object(res)) if res
    @page = page
    @params = []
    page['Contents'].to_a.each do |cstream|
      obj, stream = @xref.object(cstream)
      content_stream(stream)
    end if page.has_key?('Contents') and page['Contents']
    callback(:end_page)
  end
end
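The order in which page callbacks fire follows directly from this depth-first walk: a container's begin and end callbacks bracket the callbacks of everything beneath it. The sketch below records that order for a small tree of plain hashes. The method name walk_sketch is hypothetical, and resources and content streams are deliberately left out.

```ruby
# Record the callback order produced by a depth-first walk of a page tree.
# Nodes are plain hashes here; the real method looks each child up via the xref.
def walk_sketch(node, events = [])
  case node["Type"]
  when "Pages"
    events << :begin_page_container
    node["Kids"].each { |kid| walk_sketch(kid, events) }
    events << :end_page_container
  when "Page"
    events << :begin_page
    events << :end_page
  end
  events
end

tree = {
  "Type" => "Pages",
  "Kids" => [
    { "Type" => "Page" },
    { "Type" => "Pages", "Kids" => [{ "Type" => "Page" }] }
  ]
}

order = walk_sketch(tree)
# order is:
# [:begin_page_container,
#  :begin_page, :end_page,
#  :begin_page_container, :begin_page, :end_page, :end_page_container,
#  :end_page_container]
```

A receiver that needs to know how deeply a page is nested can therefore keep a counter that it increments in begin_page_container and decrements in end_page_container.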
def walk_resources(resources)
def walk_resources(resources)
  resources = resolve_references(resources)

  # extract any procset information
  if resources['ProcSet']
    callback(:resource_procset, resources['ProcSet'])
  end

  # extract any xobject information
  if resources['XObject']
    @xref.object(resources['XObject']).each do |name, val|
      obj, stream = @xref.object(val)
      callback(:resource_xobject, [name, obj, stream])
    end
  end

  # extract any extgstate information
  if resources['ExtGState']
    @xref.object(resources['ExtGState']).each do |name, val|
      callback(:resource_extgstate, [name, @xref.object(val)])
    end
  end

  # extract any colorspace information
  if resources['ColorSpace']
    @xref.object(resources['ColorSpace']).each do |name, val|
      callback(:resource_colorspace, [name, @xref.object(val)])
    end
  end

  # extract any pattern information
  if resources['Pattern']
    @xref.object(resources['Pattern']).each do |name, val|
      callback(:resource_pattern, [name, @xref.object(val)])
    end
  end

  # extract any font information
  if resources['Font']
    @xref.object(resources['Font']).each do |label, desc|
      desc = @xref.object(desc)
      @fonts[label] = PDF::Reader::Font.new
      @fonts[label].label = label
      @fonts[label].subtype = desc['Subtype'] if desc['Subtype']
      @fonts[label].basefont = desc['BaseFont'] if desc['BaseFont']
      @fonts[label].encoding = PDF::Reader::Encoding.factory(@xref.object(desc['Encoding']))
      @fonts[label].descendantfonts = desc['DescendantFonts'] if desc['DescendantFonts']
      if desc['ToUnicode']
        obj, cmap = @xref.object(desc['ToUnicode'])

        # this stream is a cmap
        begin
          @fonts[label].tounicode = PDF::Reader::CMap.new(cmap)
        rescue
          # if the CMap fails to parse, don't worry too much. Means we can't translate the text properly
        end
      end
      callback(:resource_font, [label, @fonts[label]])
    end
  end
end
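Since resource callbacks arrive before the page content that uses them, a receiver can remember each resource by name and look it up when the content refers to it. The sketch below does this for XObjects. XObjectTracker is a hypothetical class; the resource_xobject signature is assumed from the [name, obj, stream] array built above, and invoke_xobject is assumed to receive the resource name.

```ruby
# Hypothetical receiver that remembers XObjects by name and looks them up when drawn.
class XObjectTracker
  attr_reader :placed

  def initialize
    @xobjects = {}
    @placed = []
  end

  # Assumed signature, matching the [name, obj, stream] array built in walk_resources.
  def resource_xobject(name, obj, stream)
    @xobjects[name] = { :dict => obj, :data => stream }
  end

  # Assumed to receive the resource name used in the page content.
  def invoke_xobject(name)
    @placed << @xobjects[name][:dict]["Subtype"] if @xobjects.key?(name)
  end
end

tracker = XObjectTracker.new
# Called directly here; in practice the resource and content walks would drive these.
tracker.resource_xobject("IM1", { "Subtype" => "Image", "Width" => 16 }, "raw bytes")
tracker.invoke_xobject("IM1")
tracker.invoke_xobject("IM2") # unknown name: ignored by this sketch
```

The same pattern should carry over to the other named resources, for example keeping the font objects passed to resource_font in a hash keyed by label.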