Rails background job queue tips part 3

In this final part, we'll explain the locking techniques used in Chargify's background job system.

Commonly you need only one instance of a job to run at a time, no matter how many are waiting. We refer to this as job locking: a running job excludes other jobs of the same type, or with the same arguments, from running at the same time.

We expanded this into a more general pattern for limiting concurrency. A similar feature has since been offered in the paid Sidekiq Enterprise version.

Load problems caused by excessive concurrency

It's pretty easy to overload systems when using Sidekiq. If you have, say, 50 job worker threads, you don't always want to be doing 50 of the same thing at once.

One example we encountered was sending 50 webhooks to a single HTTP endpoint at once. That level of concurrency can overload the remote web server, causing extensive retries and eventual failures.

Another example was export jobs: running 50 slow SQL queries with multiple joins at once could overload the database server and cause slow-downs for other users.

Slot-based limiting: Maximum of n jobs at once

So we developed this feature to limit how many jobs can run in parallel. Now we can scale up the number of workers without worrying about overloading anything.

How it works

For a given type of job we allocate a number of slots, let's say 10 for example. While running, a job occupies a slot. If no slots are available, the job tries again later.

The lockwait queue is where the bulk of these jobs wait. These are invisible to Sidekiq, so they aren't picked up to run. This avoids the inefficiency of a "thundering herd" of jobs trying to occupy the same small number of slots.
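The lockwait queue itself isn't shown in this post. As a minimal sketch (the key name and helper are hypothetical, assuming the same Redis wrapper style used in the snippets below), it could be a plain Redis list holding just enough of each job's payload to re-enqueue it later:

def lockwait_key
  "lockwait:#{lock_id}" # hypothetical key naming, one list per lock
end

def push_to_lockwait_queue
  # park the job: store just enough to re-enqueue it once a slot frees up
  Redis.lpush(lockwait_key, JSON.dump('class' => self.class.name, 'args' => args))
end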

Slot implementation

The slots are implemented using a sorted set (zset) in Redis (https://redis.io/commands#sorted_set). A slot is "occupied" if there's a zset entry for that slot number, and the entry's score is a timestamp (used for timeout purposes).

def available_slots
  (0...slot_limit).to_a.map(&:to_s) - occupied_slots
end

def occupied_slots
  # the members of the sorted set are the slot numbers currently in use
  Redis.zscan(lock_id, 0, count: 1000).last.map(&:first)
end

A slot is "occupied" if there's a zset entry of the slot number. When idle, occupied_slots would return a simple list: ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"].

def slot_limit
  10
end

def lock_id
  "lock:webhooks:#{args['customer_id']}"
end

The lock_id is a dynamic string naming the lock we're acquiring, and its format varies depending on what we need. In this example, we're saying webhook jobs have a limit of 10 per customer_id, which is one of the job arguments.
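Other jobs can choose a different granularity. For example (purely illustrative, assuming a hypothetical site_id argument), an export job might be capped per site instead:

def lock_id
  "lock:exports:site_#{args['site_id']}" # hypothetical: cap exports per site rather than per customer
end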

def occupy_available_lock_slot
  available_slots.shuffle.each do |slot|
    return slot if Redis.zadd(lock_id, Time.current.to_i, slot, nx: true)
  end
  nil # every slot was taken before we could claim one
end

def leave_lock_slot(slot)
  Redis.zrem(lock_id, slot)
end

With the "not exist" mode (nx: true), the zadd redis command returns true if it added the item to the set, and false if the item was already in the set. Because redis commands are atomic, this is a true indication of whether the current job was the first to acquire the lock.

def perform
  with_lock_slot do
    run # do job's work
  end
end

def with_lock_slot
  occupied = occupy_available_lock_slot
  yield if occupied
ensure
  leave_lock_slot(occupied) if occupied
  process_lockwait_queue
end

When no slot can be occupied, the job doesn't run right away. The on_busy_proc callback, discussed in part 1, allows each job to decide whether to reschedule and how far into the future.
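As a rough sketch of how that "no free slot" path could be wired up (the names here are illustrative, not our exact code):

def handle_no_free_slot
  if on_busy_proc
    # let the job decide its own retry behaviour, as described in part 1
    on_busy_proc.call(self)
  else
    # otherwise park it on the lockwait queue, invisible to Sidekiq workers
    push_to_lockwait_queue
  end
end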

def process_lockwait_queue
  pop_at_most = available_slots.count

  pop_at_most.times do
    break unless pop_off_lockwait_queue(lock_id)
  end
end

After all the work is done and the lock slot is released, the final step is to process the remaining jobs by "popping" them off the lockwait queue and onto the normal Sidekiq queue, making them eligible to run.

Note: It isn't shown in this simplified code, but we do some extra work to count (in Redis) how many jobs have been popped but not yet run. This avoids the inefficiency caused by the race condition of multiple workers each counting the same number of available slots.
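For completeness, a hypothetical pop_off_lockwait_queue matching the push sketch earlier could hand the oldest waiting job back to Sidekiq:

def pop_off_lockwait_queue(lock_id)
  payload = Redis.rpop("lockwait:#{lock_id}") # oldest job first; hypothetical key naming
  return false unless payload

  job = JSON.parse(payload)
  # re-enqueue onto the normal Sidekiq queue so any worker can pick it up
  Sidekiq::Client.push('class' => job['class'], 'args' => job['args'])
  true
end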

Dynamic slot limit

It's also not shown above, but we actually store the numeric slot_limit in Redis, so it's another thing we can control dynamically in our job admin UI.

This also means we can temporarily bring slot-limited jobs to a complete halt by setting this to zero.
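A minimal version of that lookup might look like this (the Redis key name is hypothetical, and the default is arbitrary):

def slot_limit
  # a stored value wins; fall back to a default when nothing is configured.
  # setting it to "0" empties available_slots and halts these jobs entirely.
  (Redis.get("lockslot_limit:#{lock_id}") || 10).to_i
end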

Slot limiting: A better fit than rate limiting

After implementing this system, we saw immediate improvements in the webhooks department. We had a customer with 50% of their webhooks failing over a 24-hour period because their web server would randomly get slow and stop responding. Slot limiting fixed it completely, without penalizing anyone with a low rate limit.

The benefit of this over traditional "N per minute" rate limiting is more efficient use of resources: jobs can proceed as fast as they like within the concurrency limit. Whenever a job's work finishes quickly, another can start immediately, with no time penalty.

Slot limiting has since proved useful for the less critical, maintenance-type jobs. Data migrations and back-filling of new database fields can proceed at a reasonable rate without causing excessive load.

Jobs which communicate with small web services can be limited to match the exact capacity of the remote web server. The same applies to third-party APIs where you don't want to hammer your SaaS vendor with HTTP calls.