Rails background job queue tips part 3
In this final part, we'll explain the locking techniques in Chargify's background job system.
Commonly you need only one instance of a job to run at a time, no matter how many are waiting to run. We refer to this as job locking: a running job excludes other jobs of the same type, or with the same arguments, from running at the same time.
We expanded this into a more general pattern for limiting concurrency, a feature that has since been offered by the paid Sidekiq Enterprise version.
Load problems caused by excessive concurrency
It's pretty easy to overload systems when using Sidekiq. If you have, say, 50 job worker threads, you don't always want to be doing 50 of the same thing at once.
One example we encountered was sending 50 webhooks to a single HTTP endpoint at once. That level of concurrency can overload the remote web server, forcing extensive retries and eventually failures.
Another example was export jobs: running 50 slow SQL queries with multiple joins at once could overload the database server and cause slow-downs for other users.
Slot-based limiting: Maximum of n jobs at once
So we developed this feature to limit how many jobs can run in parallel. Now we can scale up the number of workers without worrying about overloading anything.
How it works
For a given type of job we allocate a number of slots, let's say 10 for example. While running, a job occupies a slot. If no slots are available, it tries again later.
The lockwait queue is where the bulk of these jobs wait. Jobs on it are invisible to Sidekiq, so they aren't picked up to run. This avoids the inefficiency of a "thundering herd" of jobs all trying to occupy the same small number of slots.
Slot Implementation
The slots are implemented using a sorted set (zset) in Redis (https://redis.io/commands#sorted_set). A slot is "occupied" if there's a zset entry for the slot number, and the entry's sorting score is used as a timestamp (for timeout purposes).
# slot numbers that are currently free
def available_slots
  (0...slot_limit).to_a.map(&:to_s) - occupied_slots
end

# members of the zset, i.e. slot numbers held by running jobs
def occupied_slots
  Redis.zscan(lock_id, 0, count: 1000).last.map(&:first)
end
When nothing is running, occupied_slots returns an empty list and available_slots returns every slot: ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"].
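The timeout mentioned above isn't shown in the simplified code, but because each entry's score is the time the slot was occupied, stale slots left behind by dead jobs can be expired. A minimal sketch (the 15-minute timeout here is a made-up number, not from our system):

# Hypothetical cleanup: drop any slot whose score (occupation time) is older
# than the timeout, on the assumption that its job died without releasing it.
def release_stale_slots(timeout = 15 * 60)
  Redis.zremrangebyscore(lock_id, 0, Time.current.to_i - timeout)
end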
def slot_limit
  10
end

def lock_id
  "lock:webhooks:#{args['customer_id']}"
end
The lock_id is a dynamic string which names the lock we're acquiring. The specification varies depending on what we need. In this example, we're saying webhook jobs have a limit of 10 per customer_id, which is one of the job arguments.
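To illustrate how the specification might vary (this example is hypothetical, not from our codebase), an export job could instead lock per site with a much smaller limit:

# Hypothetical: export jobs limited to 2 concurrent runs per site_id
def lock_id
  "lock:exports:#{args['site_id']}"
end

def slot_limit
  2
end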
def occupy_available_lock_slot
  # shuffle so concurrent workers don't all race for the same slot number
  available_slots.shuffle.each do |slot|
    return slot if Redis.zadd(lock_id, Time.current.to_i, slot, nx: true)
  end
  nil # no slot could be acquired
end
def leave_lock_slot(slot)
  Redis.zrem(lock_id, slot)
end
With the "not exist" mode (nx: true), the ZADD Redis command returns true if it added the item to the set, and false if the item was already there. Because Redis commands are atomic, this is a reliable indication of whether the current job was the one that acquired the slot.
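As a quick illustration of that behaviour (the key and slot number here are just examples):

Redis.zadd("lock:webhooks:42", Time.current.to_i, "3", nx: true) # => true  (we got slot 3)
Redis.zadd("lock:webhooks:42", Time.current.to_i, "3", nx: true) # => false (already occupied)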
def perform
  with_lock_slot do
    run # do the job's work
  end
end
def with_lock_slot
  occupied = occupy_available_lock_slot
  yield if occupied
ensure
  leave_lock_slot(occupied) if occupied
  process_lockwait_queue
end
When no slot can be occupied, the job doesn't run straight away. The on_busy_proc callback, discussed in part 1, allows each job to decide whether to reschedule and how far into the future.
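The busy branch isn't shown in the simplified code above. A hedged sketch of what it might look like, assuming the lockwait queue is a Redis list keyed by the lock_id and that on_busy_proc returns a delay in seconds (or nil to park the job on the lockwait queue):

# Hypothetical handling when occupy_available_lock_slot returns nil.
def handle_busy
  delay = on_busy_proc && on_busy_proc.call(self)
  if delay
    # reschedule the job to try again later
    self.class.perform_in(delay, args)
  else
    # park it on the lockwait list, invisible to Sidekiq until popped
    Redis.rpush("lockwait:#{lock_id}", args.to_json)
  end
end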
def process_lockwait_queue
  pop_at_most = available_slots.count
  pop_at_most.times do
    break unless pop_off_lockwait_queue(lock_id)
  end
end
After all the work is done and the lock slot is released, the final step is to process the waiting jobs by "popping" them off the lockwait queue and onto the normal Sidekiq queue, making them eligible to run.
Note: It isn't shown in this simplified code, but we do some extra work to count (in Redis) how many jobs have been popped but not yet run. This avoids the inefficiency caused by the race condition of multiple workers counting the same number of available slots.
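pop_off_lockwait_queue isn't shown either. A rough sketch, assuming jobs were parked on a per-lock Redis list as JSON arguments (and leaving out the popped-jobs counter described in the note above):

# Hypothetical: move one waiting job from the lockwait list back onto the
# normal Sidekiq queue so it becomes eligible to run.
def pop_off_lockwait_queue(lock_id)
  payload = Redis.lpop("lockwait:#{lock_id}")
  return false unless payload

  Sidekiq::Client.push('class' => self.class, 'args' => [JSON.parse(payload)])
  true
end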
Dynamic slot limit
It's also not shown above, but we actually store the numeric slot_limit in Redis, so this is another thing we can control dynamically in our job admin UI.
This also means we can temporarily bring slot-limited jobs to a complete halt by setting this to zero.
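A sketch of what that looks like (the Redis key name here is hypothetical):

# Hypothetical: slot limit stored in Redis so it can be tuned from the admin
# UI, falling back to a default of 10 when nothing has been set.
def slot_limit
  (Redis.get("lockslotlimit:#{lock_id}") || 10).to_i
end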
Slot limiting: A better fit than rate limiting
After implementing this system, we saw immediate improvements in the webhooks department. One customer had 50% of their webhooks failing over a 24-hour period because their web server would randomly get slow and stop responding. Slot limiting completely fixed it without penalizing anyone with a low rate limit.
The benefit of this over traditional "N per minute" rate limiting is more efficient use of resources: jobs can proceed as fast as they like within the concurrency limit. When a job's work finishes quickly, more jobs can run immediately, with no time penalty.
Slot limiting has since proved useful for the less critical, maintenance-type jobs. Data migrations and back-filling of new database fields can proceed at a reasonable rate without causing excessive load.
Jobs which communicate with small web services can be limited down to the exact capacity of the remote web server. The same applies to third-party APIs where you don't want to hammer your SaaS vendor with HTTP calls.