= Sunspot Index Queue

This gem provides support for asynchronously updating a Solr index with the sunspot gem.

== Why asynchronous

=== If Solr is down, your application won't (necessarily) be down

Since Solr runs as a separate part of your application infrastructure, there is always a chance that it isn't working right while the rest of your application is working fine. By queueing up changes for Solr when your models change, the parts of your application that write data can stay up even during a Solr outage. This could be critical if it keeps your staff working or your customers placing orders.

=== Better consistency when something goes wrong

If your application stores data in a relational database, there's always the chance that a record update could succeed while the transaction it was in fails. This can result in inconsistent data between your search index and your database.

If you use the tactic of batching Solr updates to commit them at once, you could have a problem if an exception is encountered prior to the batch being committed.

Queueing the updates in the same datastore that is used for persisting your models can prevent these sorts of inconsistencies from happening.

=== Spread out load peaks

If you get a particularly large spike of updates to your indexed models, you could be taxing your Solr server with lots of single-document updates. This could lead to downstream performance issues in your application. By queueing the Solr updates, they will be sent to the server in larger batches, which provides better performance. Furthermore, if Solr gets backed up updating the index, the slowdown will be isolated to the background job processing the queue.

This library uses a dedicated work queue for processing Solr requests instead of building on delayed_job or another background-processing library, so that requests to Solr can be batched efficiently. This can yield orders of magnitude better performance than indexing and committing documents one at a time.

== Usage

To use asynchronous indexing, you'll have to set up three things.

=== Session

Your application should be initialized with a Sunspot::IndexQueue::SessionProxy session. This proxy sends queries directly to Solr but sends updates to a queue for processing later.

This will set up a session proxy using the default Solr session:

  Sunspot.session = Sunspot::IndexQueue::SessionProxy.new

If you have custom configuration settings for your Solr session or need to wrap it with additional proxies, you can pass it to the constructor. For example, if you have a master/slave setup running on specific ports:

  master_session = Sunspot::Session.new{|config| config.solr.url = 'http://master.solr.example.com/solr'}
  slave_session = Sunspot::Session.new{|config| config.solr.url = 'http://slave01.solr.example.com/solr'}
  master_slave_session = Sunspot::SessionProxy::MasterSlaveSessionProxy.new(master_session, slave_session)
  queue = Sunspot::IndexQueue.new(:session => master_slave_session)
  Sunspot.session = Sunspot::IndexQueue::SessionProxy.new(queue)

=== Queue Implementation

The queue component is designed to be modular so that you can plug in a datastore that fits your application architecture. You can set the implementation to one of the included implementations:

  # Queue implementation backed by ActiveRecord
  Sunspot::IndexQueue::Entry.implementation = :active_record

  # Queue implementation backed by DataMapper
  Sunspot::IndexQueue::Entry.implementation = :data_mapper

  # Queue implementation backed by MongoDB
  Sunspot::IndexQueue::Entry.implementation = :mongo

  # Queue implementation backed by Redis
  Sunspot::IndexQueue::Entry.implementation = :redis

  # You can also provide your own queue implementation
  Sunspot::IndexQueue::Entry.implementation = MyQueueImplementation

You'll need to make sure you have the data structures set up properly for the implementation you choose. See the documentation for the implementations for more details:

* Sunspot::IndexQueue::Entry::ActiveRecordImpl
* Sunspot::IndexQueue::Entry::DataMapperImpl
* Sunspot::IndexQueue::Entry::MongoImpl
* Sunspot::IndexQueue::Entry::RedisImpl

Note that as of version 1.1.0 the data structure for the ActiveRecord and DataMapper implementations assumes the primary key on the indexed records is an integer. This is done since it is the most common case and far more efficient than using a string index. If your records use a primary key that is not an integer, you'll need to add an additional migration to change the +record_id+ column type.
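
For example, here is a minimal migration sketch for the ActiveRecord implementation. The table name +sunspot_index_queue_entries+ is an assumption, not taken from the gem; check the table created by your queue migration before running anything like this.

  # Illustrative only: assumes the queue table is named sunspot_index_queue_entries.
  class ChangeIndexQueueRecordIdToString < ActiveRecord::Migration
    def self.up
      change_column :sunspot_index_queue_entries, :record_id, :string
    end

    def self.down
      change_column :sunspot_index_queue_entries, :record_id, :integer
    end
  end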

=== Process The Queue

To process the queue:

  queue = Sunspot::IndexQueue.new
  queue.process

This will process all entries currently in the queue. Of course, you'll probably want to wrap this up in some sort of daemon process. Here is a sample daemon script you could run for a Rails application. You'll need to customize it if your setup is more complicated.

  #!/usr/bin/env ruby

  require 'rubygems'
  gem 'daemons-mikehale'
  require 'daemons'

  # Assumes the script is located in a subdirectory of the Rails root directory
  rails_root = File.expand_path(File.join(File.dirname(__FILE__), '..'))

  Daemons.run_proc(File.basename($0), :dir_mode => :normal, :dir => File.join(rails_root, 'log'), :force_kill_wait => 30) do
    require File.join(rails_root, 'config', 'environment')

    # Use the default queue settings.
    queue = Sunspot::IndexQueue.new

    # Don't want the daemon to fill up the log with SQL queries in development mode
    Rails.logger.level = Logger::INFO if Rails.logger.level == Logger::DEBUG

    loop do
      begin
        queue.process
        sleep(2)
      rescue Exception => e
        # Make sure we can exit the loop on a shutdown
        raise e if e.is_a?(SystemExit) || e.is_a?(Interrupt)
        # If Solr isn't responding, wait a while to give it time to get back up
        if e.is_a?(Sunspot::IndexQueue::SolrNotResponding)
          sleep(30)
        else
          Rails.logger.error(e)
        end
      end
    end
  end
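
If you'd rather not run a long-lived daemon, the same processing can be kicked off from a scheduled task instead. The rake task below is a minimal sketch (the task name and file location are illustrative, not part of the gem): it processes whatever is currently in the queue and exits, so you could run it from cron at whatever interval fits your indexing latency requirements.

  # lib/tasks/index_queue.rake (illustrative)
  namespace :sunspot do
    desc "Process all pending entries in the Sunspot index queue and exit"
    task :process_index_queue => :environment do
      queue = Sunspot::IndexQueue.new
      queue.process
    end
  end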

The logic in the queue is designed to allow concurrent processing by multiple processes; however, some documents may end up being submitted to Solr more than once. Running several processes can speed things up when you need to index a large data set, but forking too many processes to handle your queue will result in more documents being processed multiple times.

== Features

=== Multiple Queues

If you have multiple models segmented into multiple Solr indexes, you can set up multiple queues. They will share the same persistence backend, but can be configured with different Solr sessions.

  # Index all your content in one index
  content_queue = Sunspot::IndexQueue.new(:session => content_session, :class_names => [BlogPost, Review])
  content_session_proxy = Sunspot::IndexQueue::SessionProxy.new(content_queue)

  # And all your products in another
  product_queue = Sunspot::IndexQueue.new(:session => product_session, :class_names => Product)
  product_session_proxy = Sunspot::IndexQueue::SessionProxy.new(product_queue)

=== Priority

When you have updates coming from multiple sources, it is often useful to set a priority on when they should be processed so that a backlog is less noticeable. For example, in an application whose models are updated both by human interaction and by an automated feed, you probably want the human-updated items to be indexed first; that way, when there is a backlog, your human customers won't notice it. You can control the priority of updates inside a block with the Sunspot::IndexQueue.set_priority method.
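
A minimal sketch of the block form, assuming set_priority takes a numeric priority and a block; the priority value and the update being made are illustrative:

  # Entries queued inside the block are recorded with the given priority.
  # The value 10 and the update call are assumptions for illustration.
  Sunspot::IndexQueue.set_priority(10) do
    blog_post.update_attributes(params[:blog_post])
  end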

=== Recoverability

When an entry in the queue cannot be sent to Solr, it will automatically be rescheduled to be tried again later. The delay is controlled by the +retry_interval+ setting on IndexQueue (defaults to 1 minute). Each time an entry is tried and fails, the interval is increased again (i.e. wait 1 minute after the first try, 2 minutes after the second, 3 minutes after the third, etc.).
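
For example, a minimal sketch, assuming +retry_interval+ can be assigned on the queue object and is measured in seconds (verify both against the IndexQueue documentation for your version):

  # Assumption for illustration: retry failed entries after 5 minutes instead of 1.
  queue = Sunspot::IndexQueue.new
  queue.retry_interval = 300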

Error messages and stack traces are stored with the queue entries so they can be debugged.

The exception to this is that if Solr is down altogether, the queue will stop processing and entries will be restored to be tried again immediately. This error is not logged on the entries but is instead raised as a Sunspot::IndexQueue::SolrNotResponding exception.
