Rails background job queue tips part 2

Here are a few more interesting aspects of our background job system. Make sure to check out part 1 first if you haven't already.


Callbacks

We've created several callbacks for job classes to use:

  • on_busy when a job lock could not be acquired within the timeout. This is used to re-queue the job after a short delay.

  • on_fail when there's an exception.

  • on_dead when the system has given up retrying a failing job. For example, this can be used to log an exception only on the final attempt instead of on every single attempt.

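Here's a minimal, self-contained sketch of how these hooks might behave. The real SidekiqJob base class isn't shown in this post, so the registration mechanism and names like perform_with_retries are assumptions for illustration:

```ruby
LOG = []

# Sketch of a base class offering on_fail/on_dead hooks (assumed API).
class SketchJob
  MAX_RETRIES = 3

  class << self
    attr_reader :fail_handler, :dead_handler

    # Register a block to run on every failed attempt.
    def on_fail(&block)
      @fail_handler = block
    end

    # Register a block to run only once retries are exhausted.
    def on_dead(&block)
      @dead_handler = block
    end

    # Simulate the retry loop a job runner would perform.
    def perform_with_retries
      attempts = 0
      begin
        attempts += 1
        new.run
      rescue StandardError => e
        fail_handler&.call(e)
        retry if attempts < MAX_RETRIES
        dead_handler&.call(e) # final attempt: give up
      end
    end
  end
end

class FlakyJob < SketchJob
  on_fail { |e| LOG << "failed: #{e.message}" } # fires on every attempt
  on_dead { |e| LOG << "dead: #{e.message}" }   # fires once, at the end

  def run
    raise 'boom'
  end
end

FlakyJob.perform_with_retries
# LOG now holds three "failed" entries and one "dead" entry
```

This is exactly the pattern the on_dead bullet describes: the noisy per-attempt handler can stay quiet while the dead handler logs once.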
Log context

class ExportJob < SidekiqJob
  job_arg :export

  def run
    log_context export.attributes.slice('user_id', 'site_id', 'parts_complete')
    # ... do the export work ...
  end
end

The log_context method adds data to an internal hash. It can be called any number of times to add more data. This has a double purpose:

  1. For log lines, this adds useful debug info.
  2. The hash is added to our exception notifier's context feature as well.
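The post doesn't show log_context's implementation, but one plausible shape is a memoized hash that successive calls merge into, which both the logger and the exception notifier can later read. The names below are illustrative:

```ruby
# Assumed implementation sketch: accumulate context in one hash.
module LogContext
  def log_context(data = nil)
    @log_context ||= {}
    @log_context.merge!(data) if data
    @log_context
  end
end

class ExportLikeJob
  include LogContext

  def run
    log_context('user_id' => 42, 'site_id' => 7)
    log_context('parts_complete' => 3) # later calls add more keys
    log_context
  end
end

ExportLikeJob.new.run
# => {"user_id"=>42, "site_id"=>7, "parts_complete"=>3}
```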

Measuring lag

We wanted to know: What's the time lag between when a job was supposed to run, and when it actually ran? In other words, the time it spent sitting in a queue.

A piece of Sidekiq middleware can help record this for every job. Sidekiq helpfully records a created_at field internally to show when a job was enqueued.

class JobLagLogger
  def call(_worker, msg, _queue)
    intended_run_at = msg['scheduled_at'] || msg['created_at']
    job_lag = Time.now.to_f - intended_run_at.to_f
    # record/log job_lag here, then run the job
    yield
  end
end

Sidekiq.configure_server { |config| config.server_middleware { |chain| chain.add JobLagLogger } }

To make things a bit harder, we also wanted to know the lag for scheduled jobs. The scheduling information didn't appear to be accessible to the middleware for a scheduled job that was being run, so the solution there was to add it back into the job data as a custom piece of data, scheduled_at, and fall back to the standard created_at for non-scheduled jobs.
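One way to sketch that fix is a Sidekiq client middleware that stamps the payload at enqueue time. Sidekiq keeps a scheduled job's run time under the 'at' key in the client-side payload; the class name and exact approach here are assumptions, not the post's actual code:

```ruby
# Copies the schedule time into a custom scheduled_at field so server
# middleware can still see it when the job finally runs.
class ScheduledAtStamper
  def call(_worker_class, msg, _queue, _redis_pool = nil)
    msg['scheduled_at'] = msg['at'] if msg['at']
    yield
  end
end

# Standalone demonstration (no Sidekiq required):
msg = { 'created_at' => 100.0, 'at' => 160.0 }
ScheduledAtStamper.new.call(nil, msg, 'default') {}
msg['scheduled_at'] # => 160.0
```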

JSON logging

This started out very simple and has grown to become a powerful debugging tool.

We decided to standardize on JSON as a log format to make parsing and aggregating easier. We log every job when it gets enqueued, and again when it runs. It's been great for removing any potential mystery about "what happened, and when?".

As you'd expect, we log the job arguments and the timestamps (enqueued_at and scheduled_at). Here are some of the other things we log:

  • Job lag and run duration
  • Exception class and message, if any
  • Retry number
  • Customer account the job is related to
  • Job ID of re-enqueued jobs (when retrying)
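Put together, an enqueue-time event might serialize like this. The field names below are illustrative, not the post's actual schema:

```ruby
require 'json'
require 'time'

# Build one JSON log line per job event; every field is a plain
# scalar or array so downstream aggregation stays simple.
def job_log_line(job_class:, args:, enqueued_at:, retry_count: 0, account_id: nil)
  {
    'event'       => 'job.enqueued',
    'class'       => job_class,
    'args'        => args,
    'enqueued_at' => enqueued_at.utc.iso8601,
    'retry'       => retry_count,
    'account_id'  => account_id
  }.to_json
end

line = job_log_line(job_class: 'ExportJob', args: [123],
                    enqueued_at: Time.at(0), account_id: 55)
# line is a single parseable JSON string, one per log event
```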

Audit logging

This is not simple line-based logging, but actual database records (which we happen to store in Elasticsearch).

Piggybacking on the JSON log file logging above, this adds to the Audit Log feature we already have in our application, which lets us view an audit trail of what operations were committed (by users, admins, or the system itself) on important pieces of data.

Background jobs that took action on a central piece of data can be seen right next to web and API users of the application, providing a powerful single timeline view.


Metrics

This isn't specific to job queues! We implemented Statsd-based counting of many facets of jobs, even things that are already logged. For example:

  • duration and lag
  • waits to acquire a lock
  • timeouts
  • retries
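For reference, the StatsD wire format behind those counters and timers is just a short text line sent over UDP; the metric names below are made up, and a real client such as statsd-ruby would handle the sending:

```ruby
# Format StatsD counter and timing packets by hand (illustrative).
def statsd_count(metric)
  "#{metric}:1|c"
end

def statsd_timing(metric, ms)
  "#{metric}:#{ms}|ms"
end

statsd_count('jobs.export.retries')        # => "jobs.export.retries:1|c"
statsd_timing('jobs.export.duration', 412) # => "jobs.export.duration:412|ms"
```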

Coming next in part 3

In the next part we'll explain our job locking system.