Rails background job queue tips part 1

Many Rails apps have some kind of job queue. While the "out of the box" experience with Rails and Sidekiq keeps getting better over time, here's a collection of ways we've improved on the basics at Chargify to keep our background job queues running smoothly and efficiently.

Note that Sidekiq Pro and Enterprise have since implemented some of these features, and there are plenty of unofficial gems out there with similar functionality.

Single parent job class

class SidekiqJob  # Parent class
  include Sidekiq::Worker
  ...
end

class MyJob < SidekiqJob
  ...
end

We decided to make every job class inherit from a single parent class that we control.

This made it easy to augment and override Sidekiq behavior with our own, especially now that we have around 70 different job types.
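
To give a feel for what that looks like, here is a minimal sketch (not our exact code) of the parent class owning Sidekiq's perform entry point, so that subclasses implement a run method and cross-cutting concerns like timing live in one place:

# Minimal sketch only; the run hook mirrors the examples below, and the
# timing log line is purely illustrative.
class SidekiqJob
  include Sidekiq::Worker

  def perform(args = {})
    @args = args
    started_at = Time.now
    run  # subclasses implement run instead of perform
  ensure
    Sidekiq.logger.info("#{self.class.name} took #{(Time.now - started_at).round(2)}s")
  end
end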

Nominate a primary job argument

class UpdateUserJob < SidekiqJob
  job_arg :user  # Primary job argument

  def run
    UserUpdater.new(user)
  end
end

We also decided that every job class should nominate one argument as its primary piece of data, which is almost always a database ID. This has several benefits:

  • Convenience methods are auto-generated, saving time in this common pattern.
    For example, if the primary job arg is user_id, then a user() method gets auto-created which does the User.find for you (see the sketch below).
  • Reporting and logging code can be made aware of this key piece of data.
  • It comes in handy later for creating mutex keys for job locking.
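
As a purely hypothetical sketch, a job_arg macro along these lines would generate that finder, assuming the parent class stashes the incoming argument hash in @args as in the earlier sketch:

# Hypothetical sketch of a job_arg macro. Assumes job_arg :user corresponds
# to an args key of 'user_id'.
class SidekiqJob
  def self.job_arg(name)
    define_method(name) do
      model = name.to_s.classify.constantize   # :user => User
      @primary_record ||= model.find(@args["#{name}_id"])
    end
  end
end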

De facto job time limit

Our developers have agreed to keep job run times to around 10 seconds as a maximum. If a job is going to take longer than this, we prefer to split it into multiple smaller jobs.

This has enabled us to have reasonable Sidekiq restart and shutdown windows, to make deploying new code go smoothly.
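
In practice that usually means a cheap fan-out job that enqueues many short jobs. The class names below are made up, but the shape is typical:

# Illustrative fan-out only; the class names are invented. Instead of one job
# looping over every record, enqueue one short job per record.
class RecalculateAllStatementsJob < SidekiqJob
  def run
    Subscription.in_batches(of: 100) do |batch|
      batch.ids.each do |id|
        RecalculateStatementJob.perform_async('subscription_id' => id)
      end
    end
  end
end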

Dynamic Queues

We have only about half a dozen Sidekiq queues, each representing a priority level (e.g. high, medium, low, lowest). Normally each job class encodes its queue in code.

But our job admin UI lets us override this dynamically! We did this by having the enqueue() code do a quick Redis lookup, so an override affects all jobs enqueued from that point on.

class SidekiqJob
  def self.queue
    # Check Redis for a dynamic override first, then the queue declared via
    # sidekiq_options, then fall back to the default queue.
    Sidekiq.redis { |r| r.hget('sidekiq.overridequeues', name) } ||
      sidekiq_options_hash['queue'] ||
      'default'
  end
end
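
The write side is just as small. Our admin UI effectively does something like the following (the helper names here are illustrative):

class SidekiqJob
  # Sketch of the write side: set or clear an override for a job class,
  # keyed by class name in the same Redis hash used above.
  def self.override_queue(queue_name)
    Sidekiq.redis { |r| r.hset('sidekiq.overridequeues', name, queue_name) }
  end

  def self.clear_queue_override
    Sidekiq.redis { |r| r.hdel('sidekiq.overridequeues', name) }
  end
end

So something like WebhookJob.override_queue('lowest') demotes every WebhookJob enqueued from then on, until the override is cleared.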

For existing enqueued jobs, our admin UI also lets us move them between queues as required.

This has come in really handy for ops when dealing with unexpected load. Let's say a new feature was released whose jobs have a longer-than-expected runtime and are clogging up the queues. We can demote that job type to the lowest priority with a few mouse clicks until we have time to fix the code.

Self re-prioritizing

class WebhookJob < SidekiqJob
  job_lock(
    # when busy, re-enqueue in 60 seconds on the lowest-priority queue
    on_busy: Proc.new { enqueue 60.seconds, on_queue: 'lowest' }
  )
  ...
end

The dynamic queues architecture has had the additional benefit of letting jobs control their own priority.

We have some job types that encounter "soft failures"; when they re-enqueue themselves for a later attempt, they dynamically move themselves to a lower priority.

This stops retries against broken upstream services from choking up our queues while more important jobs languish.
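
A sketch of what that looks like in a job class (SoftFailure and deliver_webhook! are made-up names; the enqueue call mirrors the job_lock example above):

# Illustrative only: demote ourselves on a soft failure.
class WebhookJob < SidekiqJob
  def run
    deliver_webhook!
  rescue SoftFailure
    # the upstream endpoint is struggling; try again in a minute, at the back of the line
    enqueue 60.seconds, on_queue: 'lowest'
  end
end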

The Paused queue

For those scary times when the lowest priority isn't low enough, we have an extra queue with no workers attached to it. Jobs enqueued to the paused queue simply do not run.

It is a holding area for job types we simply don't want to run for the time being. Once the issue with them has been resolved, they can be moved back to their normal queue again.
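
Moving them back can be done with Sidekiq's standard queue API, roughly like this ('WebhookJob' and 'default' are placeholders):

# Sketch: push each paused entry of the fixed job type back onto its normal
# queue, then remove it from the paused queue.
require 'sidekiq/api'

Sidekiq::Queue.new('paused').each do |entry|
  next unless entry.klass == 'WebhookJob'
  Sidekiq::Client.push(entry.item.merge('queue' => 'default'))
  entry.delete
end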

Deleting jobs with "Killset"

The traditional way to delete jobs from Sidekiq isn't great. You have to find the job in the particular queue it lives in, or in the scheduled set if it's been delayed. You have to kill it before it runs, and if it gets re-enqueued you have to start all over again to find its new jid.

Ideally you never need to delete a job, but in practice, unforeseen coding mistakes or runaway CPU and memory consumption can leave you stuck with a bad job that needs to be stopped, and a runaway job is a nightmare situation.

class SidekiqJob
  def enqueue(interval = 0, jobargs = {})
    return if in_killset?
    ...
  end

  def perform(args = {})
    return if in_killset?
    ...
  end

  # The killset is a hash in Redis, keyed by jid; any jid present in it
  # never gets to run.
  def in_killset?
    Sidekiq.redis { |r| r.hget('sidekiq.killset', jid) }
  end
end

The killset is a concept we came up with to make this easier. It is a set of jids, stored in Redis, that we want stopped from running.

The parent job class contains wrapper code that checks its own jid against the killset, both before running, and before retrying after an exception.

This has given ops a mechanism to stop any job in the system from bringing everything crashing down, without needing the usual Sidekiq console gymnastics.
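
Adding a job to the killset is then just another small Redis write, something like this (the helper name is illustrative):

# Sketch of the ops side: record a jid in the killset hash; the value stored
# is only informational.
def kill_job(jid)
  Sidekiq.redis { |r| r.hset('sidekiq.killset', jid, Time.now.to_s) }
end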

Coming next in part 2

That's it for now. In follow-up posts, we'll discuss our job locking and job logging in detail.