= Sunspot Index Queue
This gem provides support for asynchronously updating a Solr index with the sunspot gem.
== Why asynchronous
=== If Solr is down, your application won't (necessarily) be down
Since Solr runs as a different part of your application infrastructure, there is always the chance that it isn't working right while the rest of your application is working fine. By queueing up changes for Solr when your models change, the update parts of your application can remain up even during a Solr outage. This could be critical if it keeps your staff working or your customer placing orders.
=== Better consistency when something goes wrong
If your application stores data in a relational database, there's always the chance that a record update could succeed while the transaction it was in fails. This can result in inconsistent data between your search index and your database.
If you use the tactic of batching Solr updates to commit them at once, you could have a problem if an exception is encountered prior to the batch being committed.
Queueing the updates in the same datastore that is used for persisting your models can prevent these sorts of inconsistencies from happening.
=== Spread out load peaks
If you get a particularly large spike of updates to your indexed models, you could be taxing your Solr server with lots of single document updates. This could lead to downstream performance issues in your application. By queueing the Solr updates, they will be sent to the server in larger batches providing better performance. Furthermore, if Solr gets backed up updating the index, the slow down will be isolated to the background job processing the queue.
This library uses a dedicated work queue for processing Solr requests instead of building off of delayed_job or another background processing library in order to take advantage of efficiently batching requests to Solr. This can give orders of magnitude better performance over indexing and committing individual documents.
== Usage
To use asynchronous indexing, you'll have to set up three things.
=== Session
Your application should be initialized with a Sunspot::IndexQueue::SessionProxy session. This will send queries immediately to Solr queries but send updates to a queue for processing later.
This will set up a session proxy using the default Solr session:
Sunspot.session = Sunspot::IndexQueue::SessionProxy.new
If you have custom configuration settings for your Solr session or need to wrap it with additional proxies, you can pass it to the constructor. For example, if you have a master/slave setup running on specific ports:
master_session = Sunspot::Session.new{|config| config.solr.url = 'http://master.solr.example.com/solr'}
slave_session = Sunspot::Session.new{|config| config.solr.url = 'http://slave01.solr.example.com/solr'}
master_slave_session = Sunspot::SessionProxy::MasterSlaveSessionProxy(master_session, slave_session)
queue = Sunspot::IndexQueue.new(master_slave_session)
Sunspot.session = Sunspot::IndexQueue::SessionProxy.new(queue)
=== Queue Implementation
The queue component is designed to be modular so that you can plugin a datastore that fits with your application architecture. To set the implementation, you can set it to one of the included implementations:
Queue implementation backed by ActiveRecord
Sunspot::IndexQueue::Entry.implementation = :active_record
Queue implementation backed by DataMapper
Sunspot::IndexQueue::Entry.implementation = :data_mapper
Queue implementation backed by MongoDB
Sunspot::IndexQueue::Entry.implementation = :mongo
Queue implementation backed by Redis
Sunspot::IndexQueue::Entry.implementation = :redis
You can also provide your own queue implementation
Sunspot::IndexQueue::Entry.implementation = MyQueueImplementation
You'll need to make sure you have the data structures set up properly for the implementation you choose. See the documentation for the implementations for more details
- Sunspot::IndexQueue::Entry::ActiveRecordImpl
- Sunspot::IndexQueue::Entry::DataMapperImpl
- Sunspot::IndexQueue::Entry::MongoImpl
- Sunspot::IndexQueue::Entry::RedisImpl
Note that as of version 1.1.0 the data structure for the ActiveRecord and the DataMapper implementations assumes the primary key on the indexed records is an integer. This is done since it is the usual case and far more efficient that using a string index. If your records use a primary key that is not an integer, you'll need to add an additional migration to change the +record_id+ column type.
=== Process The Queue
To process the queue:
queue = Sunspot::IndexQueue.new
queue.process
This will process all entries currently in the queue. Of course, you'll probably want to wrap this up in some sort of daemon process. Here is a sample daemon script you could run for a Rails application. You'll need to customize it if your setup is more complicated.
#!/usr/bin/env ruby
require 'rubygems'
gem 'daemons-mikehale'
require 'daemons'
Assumes the scrip is located in a subdirectory of the Rails root directory
rails_root = File.expand_path(File.join(File.dirname(FILE), '..'))
Daemons.run_proc(File.basename($0), :dir_mode => :normal, :dir => File.join(rails_root, 'log'), :force_kill_wait => 30) do
require File.join(rails_root, 'config', 'environment')
# Use the default queue settings.
queue = Sunspot::IndexQueue.new
# Don't want the daemon to fill up the log with SQL queries in development mode
Rails.logger.level = Logger::INFO if Rails.logger.level == Logger::DEBUG
loop do
begin
queue.process
sleep(2)
rescue Exception => e
# Make sure we can exit the loop on a shutdown
raise e if e.is_a?(SystemExit) || e.is_a?(Interrupt)
# If Solr isn't responding, wait a while to give it time to get back up
if e.is_a?(Sunspot::IndexQueue::SolrNotResponding)
sleep(30)
else
Rails.logger.error(e)
end
end
end
end
The logic in the queue is designed to allow concurrent processing of the queue by multiple processes, however, some documents may end up getting submitted to Solr multiple times. This can speed up the processing of a large number of documents if you need to index a large data set. Forking too many processes to handle your queue, however, will result in more documents being processed multiple times.
== Features
=== Multiple Queues
If you have multiple models segmented into multiple Solr indexes, you can set up multiple queues. They will share the same persistence backend, but can be configured with different Solr sessions.
Index all your content in one index
content_queue = Sunspot::IndexQueue.new(:session => content_session, :class_names => [BlogPost, Review])
content_session_proxy = Sunspot::IndexQueue::SessionProxy.new(content_queue)
And all your products in another
product_queue = Sunspot::IndexQueue.new(:session => product_session, :class_names => Product)
product_session_proxy = Sunspot::IndexQueue::SessionProxy.new(product_queue)
=== Priority
When you have updates coming from multiple sources, it is often good to set a priority on when they should get processed so that when there is a backlog, it becomes less noticeable. For example, in an application that has models updated both by human interaction and by an automated feed, you probably want the human updated items to be indexed first. That way when their is a backlog, your human customers won't notice it. You can control the priority of updates inside a block with the Sunspot::IndexQueue.set_priority method.
=== Recoverability
When an entry in the queue cannot be sent to Solr, it will be automatically rescheduled to be tried again later. The amount of time is controlled by the +retry_interval+ setting on IndexQueue (defaults to 1 minute). Every time it is tried and fails, the interval will be increased yet again (i.e. first try wait 1 minute, second try wait 2 minutes, third try wait 3 minutes, etc.).
Error messages and stack traces are stored with the queue entries so the can be debugged.
sThe exception to this is that if Solr is down altogether, the queue will stop processing and entries will be restored to try again immediately. This error is not logged on the entries but is rather thrown as a Sunspot::IndexQueue::SolrNotResponding exception.