Rails background job queue tips part 2
Here are a few more interesting aspects of our background job system. Make sure to check out part 1 first if you haven't already.
Callbacks
We've created several callbacks for job classes to use:
- `on_busy` — called when a job lock could not be acquired within the timeout. We use this to re-queue the job after a short delay.
- `on_fail` — called when the job raises an exception.
- `on_dead` — called when we've given up retrying a failing job. For example, it can be used to log an exception only on the final attempt instead of on every single attempt.
Log context
```ruby
class ExportJob < SidekiqJob
  job_arg :export

  def run
    log_context export.attributes.slice('user_id', 'site_id', 'parts_complete')
    # ...
  end
end
```
The `log_context` method adds data to an internal hash. It can be called any number of times to add more data. This serves a double purpose:
- For log lines, this adds useful debug info.
- The hash is added to our exception notifier's context feature as well.
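Internally, `log_context` could look something like this. The method name comes from the post; the body is our assumption about one simple way to implement it:

```ruby
# Sketch of an accumulating log-context hash (assumed implementation).
module LogContext
  # Merge data into the job's context; repeated calls accumulate keys.
  def log_context(data)
    context.merge!(data)
  end

  # The accumulated hash, handed to both the logger and the
  # exception notifier's context on failure.
  def context
    @log_context ||= {}
  end
end
```

Each log line emitted for the job would then serialize `context` alongside its own fields, and the same hash would be attached to any exception report.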
Measuring lag
We wanted to know: What's the time lag between when a job was supposed to run, and when it actually ran? In other words, the time it spent sitting in a queue.
A piece of Sidekiq server middleware can record this for every job. Sidekiq helpfully records a `created_at` field internally to show when a job was enqueued.
```ruby
class JobLagLogger
  def call(_worker, msg, _queue)
    intended_run_at = msg['scheduled_at'] || msg['created_at']
    job_lag = Time.now.to_f - intended_run_at.to_f
    # Record the lag; here via the Rails logger, but any sink works.
    Rails.logger.info "job_lag=#{job_lag.round(3)} class=#{msg['class']}"
    yield
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware { |chain| chain.add JobLagLogger }
end
```
To make things a bit harder, we also wanted to know the lag for scheduled jobs. The scheduling information didn't appear to be accessible to the middleware for a scheduled job that was being run, so the solution was to add it back into the job data as a custom field, `scheduled_at`, and fall back to the standard `created_at` for non-scheduled jobs.
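One way to stamp the schedule time back into the payload is a Sidekiq client middleware. The `scheduled_at` field name comes from the post; the middleware itself is an illustrative sketch, assuming the client-side payload still carries the scheduled run time under Sidekiq's `at` key at enqueue time:

```ruby
# Assumed sketch: copy the client-side 'at' timestamp into a custom
# 'scheduled_at' field so the server middleware can still see it
# when the job eventually runs.
class ScheduledAtStamper
  def call(_worker_class, job, _queue, _redis_pool = nil)
    job['scheduled_at'] = job['at'] if job['at']
    yield
  end
end
```

It would be registered via `config.client_middleware { |chain| chain.add ScheduledAtStamper }` inside `Sidekiq.configure_client`.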
JSON logging
This started out very simple and has grown to become a powerful debugging tool.
We decided to standardize on JSON as a log format to make parsing and aggregating easier. We log every job when it gets enqueued, and again when it runs. It's been great for removing any potential mystery about "what happened, and when?".
As you'd expect, we log the job arguments and the timestamps (`enqueued_at` and `scheduled_at`). Here are some of the other things we log:
- Job lag and run duration
- Exception class and message, if any
- Retry number
- Customer account the job is related to
- Job ID of re-enqueued jobs (when retrying)
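An illustrative shape for one such JSON log line is below. The exact field names are our assumption based on the list above, not the real schema:

```ruby
require 'json'
require 'time'

# Build one structured log line per job event (field names assumed).
def job_log_line(event:, job_class:, args:, retry_count: 0, duration: nil, error: nil)
  line = {
    'event'       => event,              # "enqueued" or "performed"
    'job'         => job_class,
    'args'        => args,
    'time'        => Time.now.utc.iso8601,
    'retry_count' => retry_count
  }
  line['duration'] = duration if duration
  line['error'] = { 'class' => error.class.name, 'message' => error.message } if error
  JSON.generate(line)
end
```

A line like this is trivially parsed back into a hash by any log aggregator, which is the whole point of standardizing on JSON.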
Audit logging
This is not simple line-based logging, but actual database records (which we happen to store in Elasticsearch).
Piggybacking on the JSON log file logging above, this adds to the Audit Log feature we already have in our application, which lets us view an audit trail of what operations were committed (by users, admins, or the system itself) on important pieces of data.
Background jobs that took action on a central piece of data can be seen right next to web and API users of the application, providing a powerful single time-line view.
Statsd
This isn't specific to job queues! We implemented Statsd-based counting of many facets of jobs, even things that are already logged. For example:
- duration and lag
- waits to acquire a lock
- timeouts
- retries
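Emitting these counters and timers might look like the sketch below. The metric names are our assumptions; `name:value|type` is the standard StatsD wire format, and a real setup would hand each packet to a UDP socket or a statsd client gem:

```ruby
# Assumed sketch of StatsD emission around job runs.
class JobStats
  def initialize(sink)
    @sink = sink # anything responding to #call with a packet string
  end

  # Counter packet, e.g. "jobs.retry:1|c"
  def increment(name)
    @sink.call("jobs.#{name}:1|c")
  end

  # Timer packet in milliseconds, e.g. "jobs.duration:13|ms"
  def timing(name, ms)
    @sink.call("jobs.#{name}:#{ms.round}|ms")
  end
end

# Around each job run one might record, for example:
#   stats.timing('duration', elapsed_ms)
#   stats.timing('lag', lag_ms)
#   stats.increment('lock_timeout')  # when a lock wait times out
#   stats.increment('retry')         # on each retry
```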
Coming next in part 3
In the next part we'll explain our job locking system.