class Ferret::Index::Index
This is a simplified interface to the index. See the TUTORIAL for more
information on how to use this class.
def add_document(doc, analyzer = nil)
Adds a document to this index, using the provided analyzer instead of
the local analyzer if provided. If the document contains more than
IndexWriter::MAX_FIELD_LENGTH terms for a given field, the remainder are
discarded.

There are three ways to add a document to the index.

To add a document you can simply add a string or an array of strings.
This will store all the strings in the "" (ie empty string) field
(unless you specify the default_field when you create the index).

  index << "This is a new document to be indexed"
  index << ["And here", "is another", "new document", "to be indexed"]

But these are pretty simple documents. If this is all you want to index
you could probably just use SimpleSearch. So let's give our documents
some fields;

  index << {:title => "Programming Ruby", :content => "blah blah blah"}
  index << {:title => "Programming Ruby", :content => "yada yada yada"}

Or if you are indexing data stored in a database, you'll probably want
to store the id;

  index << {:id => row.id, :title => row.title, :date => row.date}
def add_document(doc, analyzer = nil)
  @dir.synchronize do
    ensure_writer_open()
    if doc.is_a?(String) or doc.is_a?(Array)
      doc = {@default_input_field => doc}
    end

    # delete existing documents with the same key
    if @key
      if @key.is_a?(Array)
        query = @key.inject(BooleanQuery.new()) do |bq, field|
          bq.add_query(TermQuery.new(field, doc[field].to_s), :must)
          bq
        end
        query_delete(query)
      else
        id = doc[@key].to_s
        if id
          @writer.delete(@key, id)
        end
      end
    end
    ensure_writer_open()

    if analyzer
      old_analyzer = @writer.analyzer
      @writer.analyzer = analyzer
      @writer.add_document(doc)
      @writer.analyzer = old_analyzer
    else
      @writer.add_document(doc)
    end

    flush() if @auto_flush
  end
end
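The three input shapes described above are all reduced to a Hash of fields before indexing. A minimal sketch of that normalization step, assuming the default +:id+ input field described in the constructor options (the helper name is hypothetical, not part of the Ferret API):

```ruby
# Wrap a bare String or Array in a Hash keyed by the default input field,
# mirroring the first step of add_document.
def normalize_input(doc, default_input_field = :id)
  if doc.is_a?(String) || doc.is_a?(Array)
    { default_input_field => doc }
  else
    doc # already a Hash of fields
  end
end

normalize_input("This is a new document to be indexed")
# => {:id => "This is a new document to be indexed"}
normalize_input(:title => "Programming Ruby", :content => "blah blah blah")
# => {:title => "Programming Ruby", :content => "blah blah blah"}
```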
def add_indexes(indexes)
Merges all segments from an index or an array of indexes into this
index. You can pass a single Index::Index, Index::Reader,
Store::Directory or an array of any single one of these.

This may be used to parallelize batch indexing. A large document
collection can be broken into sub-collections. Each sub-collection can
be indexed in parallel, on a different thread, process or machine and
perhaps all in memory. The complete index can then be created by
merging sub-collection indexes with this method.
def add_indexes(indexes)
  @dir.synchronize do
    ensure_writer_open()
    indexes = [indexes].flatten   # make sure we have an array
    return if indexes.size == 0   # nothing to do
    if indexes[0].is_a?(Index)
      indexes.delete(self) # don't merge with self
      indexes = indexes.map {|index| index.reader }
    elsif indexes[0].is_a?(Ferret::Store::Directory)
      indexes.delete(@dir) # don't merge with self
      indexes = indexes.map {|dir| IndexReader.new(dir) }
    elsif indexes[0].is_a?(IndexReader)
      indexes.delete(@reader) # don't merge with self
    else
      raise ArgumentError, "Unknown index type when trying to merge indexes"
    end
    ensure_writer_open
    @writer.add_readers(indexes)
  end
end
def batch_delete(docs)
If +docs+ is a Hash or an Array then a batch delete will be performed.
If +docs+ is an Array then it will be considered an array of +id+'s. If
it is a Hash, then its keys will be used instead as the Array of
document +id+'s. If the +id+ is an Integer then it is considered a
Ferret document number and the corresponding document will be deleted.
If the +id+ is a String or a Symbol then the +id+ will be considered a
term and the documents that contain that term in the +:id_field+ will
be deleted.

docs:: An Array of docs to be deleted, or a Hash (in which case the keys
       are used as the Array of document +id+'s).
def batch_delete(docs)
  docs = docs.keys if docs.is_a?(Hash)
  raise ArgumentError, "must pass Array or Hash" unless docs.is_a? Array
  ids = []
  terms = []
  docs.each do |doc|
    case doc
    when String  then terms << doc
    when Symbol  then terms << doc.to_s
    when Integer then ids << doc
    else
      raise ArgumentError, "Cannot delete for arg of type #{doc.class}"
    end
  end
  if ids.size > 0
    ensure_reader_open
    ids.each {|id| @reader.delete(id)}
  end
  if terms.size > 0
    ensure_writer_open()
    @writer.delete(@id_field, terms)
  end
  return self
end
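The mixed batch is first split into internal document numbers and +:id_field+ terms, as a sketch (a standalone helper for illustration; the name is not part of the Ferret API):

```ruby
# Split a batch of deletions into (document numbers, id terms), mirroring
# the dispatch performed inside batch_delete.
def partition_batch(docs)
  docs = docs.keys if docs.is_a?(Hash)
  ids, terms = [], []
  docs.each do |doc|
    case doc
    when Integer then ids << doc        # internal Ferret document numbers
    when String  then terms << doc      # terms looked up in the :id_field
    when Symbol  then terms << doc.to_s
    else raise ArgumentError, "Cannot delete for arg of type #{doc.class}"
    end
  end
  [ids, terms]
end

partition_batch([1, 2, "doc-a", :'doc-b'])
# => [[1, 2], ["doc-a", "doc-b"]]
```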
def batch_update(docs)
Batch updates the documents in an index. You can pass either a Hash or
an Array.

=== Array (recommended)

If you pass an Array then each value needs to be a Document or a Hash
and each of those documents must have an +:id_field+ which will be used
to delete the old document that this document is replacing.

=== Hash

If you pass a Hash then the keys of the Hash will be considered the
+id+'s and the values will be the new documents to replace the old ones
with. If the +id+ is an Integer then it is considered a Ferret document
number and the corresponding document will be deleted. If the +id+ is a
String or a Symbol then the +id+ will be considered a term and the
documents that contain that term in the +:id_field+ will be deleted.

Note: No error will be raised if the document does not currently
exist. A new document will simply be created.

== Examples

  # will replace the documents with the +id+'s id:133 and id:253
  @index.batch_update({
    '133' => {:id => '133', :content => 'yada yada yada'},
    '253' => {:id => '253', :content => 'bla bla bal'}
  })

  # will replace the documents with the Ferret Document numbers 2 and 92
  @index.batch_update({
    2 => {:id => '133', :content => 'yada yada yada'},
    92 => {:id => '253', :content => 'bla bla bal'}
  })

  # will replace the documents with the +id+'s id:133 and id:253
  # this is recommended as it guarantees no duplicate keys
  @index.batch_update([
    {:id => '133', :content => 'yada yada yada'},
    {:id => '253', :content => 'bla bla bal'}
  ])
def batch_update(docs)
  @dir.synchronize do
    ids = values = nil
    case docs
    when Array
      ids = docs.collect{|doc| doc[@id_field].to_s}
      if ids.include?(nil)
        raise ArgumentError, "all documents must have an #{@id_field} " +
                             "field when doing a batch update"
      end
    when Hash
      ids = docs.keys
      docs = docs.values
    else
      raise ArgumentError, "must pass Hash or Array, not #{docs.class}"
    end
    batch_delete(ids)
    ensure_writer_open()
    docs.each {|new_doc| @writer << new_doc }
    flush()
  end
end
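Both accepted input shapes reduce to the same (ids, docs) pair before the delete-then-re-add cycle. A sketch of that reduction, assuming the default +:id+ field (the helper name is illustrative, not part of the Ferret API):

```ruby
# Reduce the two accepted batch_update shapes to a common (ids, docs) pair.
def ids_and_docs(docs, id_field = :id)
  case docs
  when Array
    ids = docs.map {|doc| doc[id_field] }
    if ids.include?(nil)
      raise ArgumentError, "all documents must have an #{id_field} field"
    end
    [ids.map {|id| id.to_s }, docs]
  when Hash
    [docs.keys, docs.values]
  else
    raise ArgumentError, "must pass Hash or Array, not #{docs.class}"
  end
end
```

The Array form is the one recommended above because the +:id_field+ values inside the documents themselves are the keys, so duplicates cannot creep in.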
def close
def close
  @dir.synchronize do
    if not @open
      raise(StandardError, "tried to close an already closed directory")
    end
    @searcher.close() if @searcher
    @reader.close() if @reader
    @writer.close() if @writer
    @dir.close() if @close_dir

    @open = false
  end
end
def close_all()
def close_all()
  @dir.synchronize do
    @searcher.close if @searcher
    @reader.close if @reader
    @writer.close if @writer
    @reader = nil
    @searcher = nil
    @writer = nil
  end
end
def delete(arg)
Deletes a document/documents from the index. The method for determining
the document to delete depends on the type of the argument passed.

If +arg+ is an Integer then delete the document based on the internal
document number. Will raise an error if the document does not exist.

If +arg+ is a String then search for the documents with +arg+ in the
+id+ field. The +id+ field is either :id or whatever you set the
+:id_field+ parameter to when you create the Index object. Will fail
quietly if no document exists.

If +arg+ is a Hash or an Array then a batch delete will be performed.
If +arg+ is an Array then it will be considered an array of +id+'s. If
it is a Hash, then its keys will be used instead as the Array of
document +id+'s. If the +id+ is an Integer then it is considered a
Ferret document number and the corresponding document will be deleted.
If the +id+ is a String or a Symbol then the +id+ will be considered a
term and the documents that contain that term in the +:id_field+ will be
deleted.
def delete(arg)
  @dir.synchronize do
    if arg.is_a?(String) or arg.is_a?(Symbol)
      ensure_writer_open()
      @writer.delete(@id_field, arg.to_s)
    elsif arg.is_a?(Integer)
      ensure_reader_open()
      cnt = @reader.delete(arg)
    elsif arg.is_a?(Hash) or arg.is_a?(Array)
      batch_delete(arg)
    else
      raise ArgumentError, "Cannot delete for arg of type #{arg.class}"
    end
    flush() if @auto_flush
  end
  return self
end
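The type-based dispatch described above can be sketched as a standalone function that reports which deletion strategy would be used (illustrative only; the real method performs the deletion):

```ruby
# Map an argument type to the deletion strategy delete would choose.
def delete_strategy(arg)
  case arg
  when String, Symbol then :delete_by_term        # term lookup in :id_field
  when Integer        then :delete_by_doc_number  # internal document number
  when Hash, Array    then :batch_delete
  else raise ArgumentError, "Cannot delete for arg of type #{arg.class}"
  end
end

delete_strategy("doc-1")  # => :delete_by_term
delete_strategy(42)       # => :delete_by_doc_number
```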
def deleted?(n)
def deleted?(n)
  @dir.synchronize do
    ensure_reader_open()
    return @reader.deleted?(n)
  end
end
def do_process_query(query)
def do_process_query(query)
  if query.is_a?(String)
    if @qp.nil?
      @qp = Ferret::QueryParser.new(@options)
    end
    # we need to set this every time, in case a new field has been added
    @qp.fields =
      @reader.fields unless options[:all_fields] || options[:fields]
    @qp.tokenized_fields =
      @reader.tokenized_fields unless options[:tokenized_fields]
    query = @qp.parse(query)
  end
  return query
end
def do_search(query, options)
def do_search(query, options)
  ensure_searcher_open()
  query = do_process_query(query)
  return @searcher.search(query, options)
end
def doc(*arg)
Retrieves a document/documents from the index. The method for retrieval
depends on the type of the argument passed.

If +arg+ is an Integer then return the document based on the internal
document number.

If +arg+ is a Range, then return the documents within the range based on
internal document number.

If +arg+ is a String then search for the first document with +arg+ in
the +id+ field. The +id+ field is either :id or whatever you set the
+:id_field+ parameter to when you create the Index object.
def doc(*arg)
  @dir.synchronize do
    id = arg[0]
    if id.kind_of?(String) or id.kind_of?(Symbol)
      ensure_reader_open()
      term_doc_enum = @reader.term_docs_for(@id_field, id.to_s)
      return term_doc_enum.next? ? @reader[term_doc_enum.doc] : nil
    else
      ensure_reader_open(false)
      return @reader[*arg]
    end
  end
end
def each
Iterates through all documents in the index. This method preloads the
documents so you don't need to call #load on the document to load all the
fields.
def each
  @dir.synchronize do
    ensure_reader_open
    (0...@reader.max_doc).each do |i|
      yield @reader[i].load unless @reader.deleted?(i)
    end
  end
end
def ensure_reader_open(get_latest = true)
def ensure_reader_open(get_latest = true)
  raise "tried to use a closed index" if not @open
  if @reader
    if get_latest
      latest = false
      begin
        latest = @reader.latest?
      rescue Lock::LockError => le
        sleep(@options[:lock_retry_time]) # sleep for 2 seconds and try again
        latest = @reader.latest?
      end
      if not latest
        @searcher.close if @searcher
        @reader.close
        return @reader = IndexReader.new(@dir)
      end
    end
  else
    if @writer
      @writer.close
      @writer = nil
    end
    return @reader = IndexReader.new(@dir)
  end
  return false
end
def ensure_searcher_open()
def ensure_searcher_open()
  raise "tried to use a closed index" if not @open
  if ensure_reader_open() or not @searcher
    @searcher = Searcher.new(@reader)
  end
end
def ensure_writer_open()
def ensure_writer_open()
  raise "tried to use a closed index" if not @open
  return if @writer
  if @reader
    @searcher.close if @searcher
    @reader.close
    @reader = nil
    @searcher = nil
  end
  @writer = IndexWriter.new(@options)
end
def explain(query, doc)
Returns an Explanation that describes how +doc+ scored against
+query+.

This is intended to be used in developing Similarity implementations,
and, for good performance, should not be displayed with every hit.
Computing an explanation is as expensive as executing the query over the
entire index.
def explain(query, doc)
  @dir.synchronize do
    ensure_searcher_open()
    query = do_process_query(query)
    return @searcher.explain(query, doc)
  end
end
def field_infos
Returns the field_infos object so that you can add new fields to the
index.
def field_infos
  @dir.synchronize do
    ensure_writer_open()
    return @writer.field_infos
  end
end
def flush()
Flushes all writes to the index. This will not optimize the index but it
will make sure that all writes are written to it.

NOTE: this is not necessary if you are only using this class. All writes
will automatically flush when you perform an operation that reads the
index.
def flush()
  @dir.synchronize do
    if @reader
      if @searcher
        @searcher.close
        @searcher = nil
      end
      @reader.commit
    elsif @writer
      @writer.close
      @writer = nil
    end
  end
end
def has_deletions?()
Returns true if any documents have been deleted since the index was last
flushed.
def has_deletions?()
  @dir.synchronize do
    ensure_reader_open()
    return @reader.has_deletions?
  end
end
def highlight(query, doc_id, options = {})
Returns an array of strings with the matches highlighted. The +query+ can
be either a query String or a Ferret::Search::Query object. The doc_id is
the id of the document you want to highlight (usually returned by the
search methods). There are also a number of options you can pass;

=== Options

field::          Default: @options[:default_field]. The default_field
                 is the field that is usually highlighted but you can
                 specify which field you want to highlight here. If
                 you want to highlight multiple fields then you will
                 need to call this method multiple times.
excerpt_length:: Default: 150. Length of excerpt to show. Highlighted
                 terms will be in the centre of the excerpt. Set to
                 :all to highlight the entire field.
num_excerpts::   Default: 2. Number of excerpts to return.
pre_tag::        Default: "<b>". Tag to place to the left of the
                 match. You'll probably want to change this to a
                 "<span>" tag with a class. Try "\033[36m" for use in
                 a terminal.
post_tag::       Default: "</b>". This tag should close the
                 +:pre_tag+. Try tag "\033[m" in the terminal.
ellipsis::       Default: "...". This is the string that is appended
                 at the beginning and end of excerpts (unless the
                 excerpt hits the start or end of the field).
                 Alternatively you may want to use the HTML entity
                 &#8230;.
def highlight(query, doc_id, options = {})
  @dir.synchronize do
    ensure_searcher_open()
    @searcher.highlight(do_process_query(query),
                        doc_id,
                        options[:field] || @options[:default_field],
                        options)
  end
end
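A toy model of the +:pre_tag+/+:post_tag+ options (this regexp-based helper is purely illustrative; the real highlighter matches analyzed query terms against the index and extracts excerpts):

```ruby
# Wrap every case-insensitive occurrence of term in pre/post tags.
def toy_highlight(text, term, pre_tag: "<b>", post_tag: "</b>")
  text.gsub(/#{Regexp.escape(term)}/i) { |match| "#{pre_tag}#{match}#{post_tag}" }
end

toy_highlight("Programming Ruby is about Ruby", "ruby")
# => "Programming <b>Ruby</b> is about <b>Ruby</b>"
```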
def initialize(options = {}, &block)
If you create an Index without any options, it'll simply create an index
in memory. But this class is highly configurable and every option that
you can supply to IndexWriter and QueryParser, you can also set here.
Please look at the options for the constructors to these classes.

=== Options

See;

* QueryParser
* IndexWriter

default_input_field:: Default: "id". This specifies the default field
                      that will be used when you add a simple string
                      to the index using #add_document or <<.
id_field::            Default: "id". This field is used as the field to
                      search when doing searches on a term. For
                      example, if you do a lookup by term "cat", ie
                      index["cat"], this will be the field that is
                      searched.
key::                 Default: nil. Expert: This should only be used
                      if you really know what you are doing. Basically
                      you can set a field or an array of fields to be
                      the key for the index. So if you add a document
                      with the same key as an existing document, the
                      existing document will be replaced by the new
                      object. Using a multiple field key will slow
                      down indexing so it should not be done if
                      performance is a concern. A single field key (or
                      id) should be fine however. Also, you must make
                      sure that your key/keys are either untokenized
                      or that they are not broken up by the analyzer.
auto_flush::          Default: false. Set this option to true if you
                      want the index automatically flushed every time
                      you do a write (includes delete) to the index.
                      This is useful if you have multiple processes
                      accessing the index and you don't want lock
                      errors. Setting :auto_flush to true has a huge
                      performance impact so don't use it if you are
                      concerned about performance. In that case you
                      should think about setting up a DRb indexing
                      service.
lock_retry_time::     Default: 2 seconds. This parameter specifies how
                      long to wait before retrying to obtain the
                      commit lock when detecting if the IndexReader is
                      at the latest version.
close_dir::           Default: false. If you explicitly pass a
                      Directory object to this class and you want
                      Index to close it when it is closed itself then
                      set this to true.
use_typed_range_query:: Default: true. Use TypedRangeQuery instead of
                      the standard RangeQuery when parsing
                      range queries. This is useful if you have number
                      fields which you want to perform range queries
                      on. You won't need to pad or normalize the data
                      in the field in any way to get correct results.
                      However, performance will be a lot slower for
                      large indexes, hence the default.

== Examples

  index = Index::Index.new(:analyzer => WhiteSpaceAnalyzer.new())

  index = Index::Index.new(:path => '/path/to/index',
                           :create_if_missing => false,
                           :auto_flush => true)

  index = Index::Index.new(:dir => directory,
                           :default_slop => 2,
                           :handle_parse_errors => false)

You can also pass a block if you like. The index will be yielded and
closed at the end of the block. For example;

  Ferret::I.new() do |index|
    # do stuff with index. Most of your actions will be cached.
  end
def initialize(options = {}, &block)
  super()

  if options[:key]
    @key = options[:key]
    if @key.is_a?(Array)
      @key.flatten.map {|k| k.to_s.intern}
    end
  else
    @key = nil
  end

  if (fi = options[:field_infos]).is_a?(String)
    options[:field_infos] = FieldInfos.load(fi)
  end

  @close_dir = options[:close_dir]
  if options[:dir].is_a?(String)
    options[:path] = options[:dir]
  end
  if options[:path]
    @close_dir = true
    begin
      @dir = FSDirectory.new(options[:path], options[:create])
    rescue IOError => io
      @dir = FSDirectory.new(options[:path],
                             options[:create_if_missing] != false)
    end
  elsif options[:dir]
    @dir = options[:dir]
  else
    options[:create] = true # this should always be true for a new RAMDir
    @close_dir = true
    @dir = RAMDirectory.new
  end

  @dir.extend(MonitorMixin) unless @dir.kind_of? MonitorMixin
  options[:dir] = @dir
  options[:lock_retry_time] ||= 2
  @options = options
  if (!@dir.exists?("segments")) || options[:create]
    IndexWriter.new(options).close
  end
  options[:analyzer] ||= Ferret::Analysis::StandardAnalyzer.new
  if options[:use_typed_range_query].nil?
    options[:use_typed_range_query] = true
  end

  @searcher = nil
  @writer = nil
  @reader = nil

  @options.delete(:create) # only create the first time if at all
  @auto_flush = @options[:auto_flush] || false
  if (@options[:id_field].nil? and @key.is_a?(Symbol))
    @id_field = @key
  else
    @id_field = @options[:id_field] || :id
  end
  @default_field = (@options[:default_field] ||= :*)
  @default_input_field = options[:default_input_field] || @id_field

  if @default_input_field.respond_to?(:intern)
    @default_input_field = @default_input_field.intern
  end
  @open = true
  @qp = nil
  if block
    yield self
    self.close
  end
end
def optimize()
Optimizes the index. This should only be called when the index will no
longer be updated very often, but will be read a lot.
def optimize()
  @dir.synchronize do
    ensure_writer_open()
    @writer.optimize()
    @writer.close()
    @writer = nil
  end
end
def persist(directory, create = true)
This is a simple utility method for saving an in memory or RAM index to
the file system. The same thing can be achieved by using the
Index::Index#add_indexes method and you will have more options when
creating the new index, however this is a simple way to turn a RAM index
into a file system index.

directory:: This can either be a Store::Directory object or a String
            representing the path to the directory where you would
            like to store the index.
create::    True if you'd like to create the directory if it doesn't
            exist or copy over an existing directory. False if you'd
            like to merge with the existing directory. This defaults to
            true.
def persist(directory, create = true)
  synchronize do
    close_all()
    old_dir = @dir
    if directory.is_a?(String)
      @dir = FSDirectory.new(directory, create)
    elsif directory.is_a?(Ferret::Store::Directory)
      @dir = directory
    end
    @dir.extend(MonitorMixin) unless @dir.kind_of? MonitorMixin
    @options[:dir] = @dir
    @options[:create_if_missing] = true
    add_indexes([old_dir])
  end
end
def process_query(query)
def process_query(query)
  @dir.synchronize do
    ensure_searcher_open()
    return do_process_query(query)
  end
end
def query_delete(query)
Delete all documents returned by the query.

query:: The query to find documents you wish to delete. Can either be a
        string (in which case it is parsed by the standard query parser)
        or an actual query object.
def query_delete(query)
  @dir.synchronize do
    ensure_writer_open()
    ensure_searcher_open()
    query = do_process_query(query)
    @searcher.search_each(query, :limit => :all) do |doc, score|
      @reader.delete(doc)
    end
    flush() if @auto_flush
  end
end
def query_update(query, new_val)
#=> {:id => "28", :title => "My Oh My", :artist => "David Gray"}
index["28"]
#=> {:id => "26", :title => "Babylon", :artist => "David Gray"}
index["26"]
index.query_update('artist:"David Grey"', {:artist => "David Gray"})
# correct
index << {:id => "29", :title => "My Oh My", :artist => "David Grey"}
index << {:id => "26", :title => "Babylon", :artist => "David Grey"}
=== Example
if they exist.
That is, the old fields are replaced by values in the new hash
case, all fields in the hash are merged into the old hash.
the default field is updated, or it can be a hash, in which
new_val:: The values we are updating. This can be a string in which case
parser) or an actual query object.
a string (in which case it is parsed by the standard query
query:: The query to find documents you wish to update. Can either be
Update all the documents returned by the query.
def query_update(query, new_val)
  @dir.synchronize do
    ensure_writer_open()
    ensure_searcher_open()
    docs_to_add = []
    query = do_process_query(query)
    @searcher.search_each(query, :limit => :all) do |id, score|
      document = @searcher[id].load
      if new_val.is_a?(Hash)
        document.merge!(new_val)
      else # new_val is a String or Symbol
        document[@default_input_field] = new_val.to_s
      end
      docs_to_add << document
      @reader.delete(id)
    end
    ensure_writer_open()
    docs_to_add.each {|doc| @writer << doc }
    flush() if @auto_flush
  end
end
def reader
Get the reader for this index.
def reader
  ensure_reader_open()
  return @reader
end
def scan(query, options = {})
Run a query through the Searcher on the index, ignoring scoring and
starting at +:start_doc+ and stopping when +:limit+ matches have been
found. It returns an array of the matching document numbers.

There is a big performance advantage when using this search method on a
very large index when there are potentially thousands of matching
documents and you only want say 50 of them. The other search methods need
to look at every single match to decide which one has the highest score.
This search method just needs to find +:limit+ number of matches before
it returns.

=== Options

start_doc:: Default: 0. The start document to start the search from.
            NOTE very carefully that this is not the same as the
            +:offset+ parameter used in the other search methods
            which refers to the offset in the result-set. This is the
            document to start the scan from. So if you are scanning
            through the index in increments of 50 documents at a time
            you need to use the last matched doc in the previous
            search to start your next search. See the example below.
limit::     Default: 50. This is the number of results you want
            returned, also called the page size. Set +:limit+ to
            +:all+ to return all results.

TODO: add option to return loaded documents instead

=== Example

  start_doc = 0
  begin
    results = @searcher.scan(query, :start_doc => start_doc)
    yield results # or do something with them
    start_doc = results.last
    # start_doc will be nil now if results is empty, ie no more matches
  end while start_doc
def scan(query, options = {})
  @dir.synchronize do
    ensure_searcher_open()
    query = do_process_query(query)
    @searcher.scan(query, options)
  end
end
def search(query, options = {})
Run a query through the Searcher on the index. A TopDocs object is
returned with the relevant results. The +query+ is a Query object or a
query string that can be parsed by the Ferret::QueryParser. Here are
the options;

=== Options

offset::      Default: 0. The offset of the start of the section of the
              result-set to return. This is used for paging through
              results. Let's say you have a page size of 10. If you
              don't find the result you want among the first 10 results
              then set +:offset+ to 10 and look at the next 10 results,
              then 20 and so on.
limit::       Default: 10. This is the number of results you want
              returned, also called the page size. Set +:limit+ to
              +:all+ to return all results.
sort::        A Sort object or sort string describing how the field
              should be sorted. A sort string is made up of field names
              which cannot contain spaces and the word "DESC" if you
              want the field reversed, all separated by commas. For
              example; "rating DESC, author, title". Note that Ferret
              will try to determine a field's type by looking at the
              first term in the index and seeing if it can be parsed as
              an integer or a float. Keep this in mind as you may need
              to specify a field's type to sort it correctly. For more
              on this, see the documentation for SortField.
filter::      a Filter object to filter the search results with
filter_proc:: a filter Proc is a Proc which takes the doc_id, the score
              and the Searcher object as its parameters and returns a
              Boolean value specifying whether the result should be
              included in the result set.
def search(query, options = {})
  @dir.synchronize do
    return do_search(query, options)
  end
end
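The +:offset+/+:limit+ paging semantics can be illustrated on a plain Array standing in for a ranked result-set (hypothetical data; the real method returns a TopDocs object):

```ruby
ranked_hits = (0...45).to_a  # 45 hypothetical document numbers, best first

page_size = 10
page1 = ranked_hits[0, page_size]   # :offset => 0,  :limit => 10
page2 = ranked_hits[10, page_size]  # :offset => 10, :limit => 10
last  = ranked_hits[40, page_size]  # short final page of 5 hits
```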
def search_each(query, options = {}) # :yield: doc, score
Run a query through the Searcher on the index, yielding each matching
document and its score. The +query+ is a Query object or a query string
that can be validly parsed by the Ferret::QueryParser. The
Searcher#search_each method yields the internal document id (used to
reference documents in the Searcher object like this;
+searcher[doc_id]+) and the search score for that document. It is
possible for the score to be greater than 1.0 for some queries when
taking boosts into account. This method will also normalize scores to
the range 0.0..1.0 when the max-score is greater than 1.0. Here are the
options;

=== Options

offset::      Default: 0. The offset of the start of the section of the
              result-set to return. This is used for paging through
              results. Let's say you have a page size of 10. If you
              don't find the result you want among the first 10 results
              then set +:offset+ to 10 and look at the next 10 results,
              then 20 and so on.
limit::       Default: 10. This is the number of results you want
              returned, also called the page size. Set +:limit+ to
              +:all+ to return all results.
sort::        A Sort object or sort string describing how the field
              should be sorted. A sort string is made up of field names
              which cannot contain spaces and the word "DESC" if you
              want the field reversed, all separated by commas. For
              example; "rating DESC, author, title". Note that Ferret
              will try to determine a field's type by looking at the
              first term in the index and seeing if it can be parsed as
              an integer or a float. Keep this in mind as you may need
              to specify a field's type to sort it correctly. For more
              on this, see the documentation for SortField.
filter::      a Filter object to filter the search results with
filter_proc:: a filter Proc is a Proc which takes the doc_id, the score
              and the Searcher object as its parameters and returns a
              Boolean value specifying whether the result should be
              included in the result set.

returns:: The total number of hits.

=== Example

  index.search_each(query, options = {}) do |doc, score|
    puts "hit document number #{doc} with a score of #{score}"
  end
def search_each(query, options = {}) # :yield: doc, score
  @dir.synchronize do
    ensure_searcher_open()
    query = do_process_query(query)
    @searcher.search_each(query, options) do |doc, score|
      yield doc, score
    end
  end
end
def searcher
Get the searcher for this index.
def searcher
  ensure_searcher_open()
  return @searcher
end
def size()
def size()
  @dir.synchronize do
    ensure_reader_open()
    return @reader.num_docs()
  end
end
def term_vector(id, field)
Retrieves the term_vector for a document. The document can be referenced
by either a string id to match the id field or an integer corresponding
to Ferret's document number.
def term_vector(id, field)
  @dir.synchronize do
    ensure_reader_open()
    if id.kind_of?(String) or id.kind_of?(Symbol)
      term_doc_enum = @reader.term_docs_for(@id_field, id.to_s)
      if term_doc_enum.next?
        id = term_doc_enum.doc
      else
        return nil
      end
    end
    return @reader.term_vector(id, field)
  end
end
def to_s
def to_s
  buf = ""
  (0...(size)).each do |i|
    buf << self[i].to_s + "\n" if not deleted?(i)
  end
  buf
end
def update(id, new_doc)
Update the document referenced by the document number +id+ if +id+ is an
Integer, or all of the documents which have the term +id+ if +id+ is a
term.

For batch updates of a set of documents, see #batch_update, which
performs better.

id::      The number of the document to update. Can also be a string
          representing the value in the +id+ field. Also consider using
          the :key attribute.
new_doc:: The new document to replace the old document with.
def update(id, new_doc)
  @dir.synchronize do
    ensure_writer_open()
    delete(id)
    if id.is_a?(String) or id.is_a?(Symbol)
      @writer.commit
    else
      ensure_writer_open()
    end
    @writer << new_doc
    flush() if @auto_flush
  end
end
def writer
Get the writer for this index.
def writer
  ensure_writer_open()
  return @writer
end