Redis is an in-memory database that has become the default choice for background job queues in Rails applications, because it scales easily to large numbers of background job workers.

The Chargify application was no exception, having started out years ago with the SQL-backed delayed_job, moving shortly thereafter to Resque and finally to Sidekiq.

We love Redis, but we love Uptime

We love Redis for its simplicity and the ease of getting it up and running. We love its performance: it can easily handle millions of jobs per day, even on a fairly small AWS instance. But we also love uptime!

Getting high availability when you have a dependency on Redis can be challenging:

Out of the box, Redis clustering still causes blips of downtime during fail-overs, potentially with lost data.

Our solution is a hybrid approach, maximizing uptime by planning for small amounts of downtime.

We try to keep Redis up with clustering, but ensure the application degrades gracefully without Redis present.

Let's go through a few techniques that make it possible:

No disappearing data with BufferedJob

We wrote BufferedJob, an ActiveRecord model backed by a database table that stores a job class and its arguments (serialized in Sidekiq's JSON format).

class SidekiqJob
  def enqueue(interval = 0, jobargs = {})
    ...
  rescue StandardError
    BufferedJob.create!(job_class: self.class, args: [@args])
  end
end

The enqueue method on the Sidekiq job class is wrapped with an exception handler that serializes the job to the SQL database if the enqueue fails for any reason.

If the Redis server is unreachable, jobs get stored in this SQL table until a cron job comes along every few minutes to flush the table back into Redis. If that re-enqueue succeeds, the row is removed from the table.
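As a sketch of that flush step (with ActiveRecord and Sidekiq stubbed out; the method name `flush_buffered_jobs` and the row shape are invented here for illustration, not our exact code), the core idea is: attempt the re-enqueue row by row, and remove a row only when its enqueue succeeds:

```ruby
# Illustrative sketch of the periodic flush. In the real app this runs as a
# cron task over BufferedJob rows; here rows are plain hashes and the
# enqueue call is injected, so the pattern stands on its own.
def flush_buffered_jobs(rows, enqueuer)
  flushed = []
  rows.each do |row|
    begin
      enqueuer.call(row[:job_class], row[:args]) # e.g. a Sidekiq push
      flushed << row                             # stands in for row.destroy
    rescue StandardError
      # Redis still unreachable: keep the row for the next cron run.
    end
  end
  rows - flushed
end
```

Rows whose enqueue fails survive to the next run, so nothing is lost if Redis is still down when the cron job fires.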

This technique is not really new; it mimics the way we already treated other network services, like our email server: assume it is potentially unreachable, and retry later.

The big win is the overall effect of removing the dependency on Redis in critical code sections that enqueue jobs, such as end users submitting forms.

Because it only kicks in for brief periods, and there's only a single cronjob picking up the jobs, it doesn't present a high load on the SQL database.

Resilience with SafeRedis

But in addition to job queues, we use Redis for other features, e.g. shared in-memory caching.

We created a module called SafeRedis that wraps the Redis class, proxying all access to Redis, but with two important changes:

  1. Exceptions are captured and logged, but not re-raised. Instead, nil is returned.

  2. The network timeout for talking to the Redis server is reduced to only 2 seconds, and failures are cached for 15 seconds (i.e. no further attempts to use Redis are made for 15s).

It is a good pattern to have these fall-back code paths (with test coverage!) to degrade gracefully when a valid response is not received from Redis. How to handle a nil response is up to the individual call site (e.g. cached values can be recalculated).
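A minimal sketch of the idea (our real module wraps the redis gem directly; here the client is injected, and the class name and constants are illustrative) looks like this:

```ruby
require "logger"

# Illustrative sketch of the SafeRedis pattern: proxy every call to the
# underlying client, swallow failures, and skip Redis for a short window
# after a failure. The clock is injected to make the window testable.
class SafeRedisWrapper
  FAILURE_TTL = 15 # seconds to skip Redis after a failure

  def initialize(client, logger: Logger.new($stderr), clock: -> { Time.now })
    @client = client
    @logger = logger
    @clock = clock
    @failed_at = nil
  end

  # Proxy any method to the client. On error: log, remember the failure
  # time, and return nil instead of raising.
  def method_missing(name, *args, &block)
    return nil if circuit_open?
    @client.public_send(name, *args, &block)
  rescue StandardError => e
    @failed_at = @clock.call
    @logger.warn("SafeRedis: #{e.class}: #{e.message}")
    nil
  end

  def respond_to_missing?(name, include_private = false)
    @client.respond_to?(name, include_private) || super
  end

  private

  # After a failure, avoid Redis entirely for FAILURE_TTL seconds.
  def circuit_open?
    @failed_at && (@clock.call - @failed_at) < FAILURE_TTL
  end
end
```

Call sites then treat nil as a miss, e.g. `safe.get(key) || recompute(key)`. The 2-second network timeout belongs on the underlying client itself (e.g. the redis gem's timeout option).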

This has the overall effect of the application staying up, and being reasonably responsive, in the event of Redis being down. Our development team is committed to using SafeRedis everywhere possible.

Hands-free fail-overs with HAProxy

Although Redis can run as a primary-secondary cluster, it leaves it up to the client to talk to the correct primary server. So we put HAProxy 1.7 between our Redis clients and servers to automate this.

backend redis-chargify
  option tcp-check
  option tcpka

  tcp-check send info\ replication\r\n
  tcp-check expect string role:master
  timeout check 1s
  timeout server 86400s

  server redis1 redis1:6379 maxconn 1000 check inter 1s rise 2 fall 5 on-marked-down shutdown-sessions

I configured it with tcp-check to ping the Redis servers every second via the "info replication" command.

It then directs client traffic only to the primary, by searching the response for the string "role:master".

During testing, I also found the "on-marked-down shutdown-sessions" option to be essential. It forces the Ruby Redis client to reconnect after a fail-over event, removing the risk of having open transactions split between two servers.
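With HAProxy in front, the application's client configuration stays trivial: it points at the proxy rather than at any individual Redis node (the host name below is illustrative, not our actual setup):

```ruby
require "redis"

# The redis gem connects to HAProxy, which routes to the current primary.
# The low timeout pairs with the SafeRedis failure handling described above.
redis = Redis.new(host: "redis-haproxy.internal", port: 6379, timeout: 2)
```

This is effectively a config fragment: the fail-over logic lives entirely in HAProxy, so the client never needs to know which node is primary.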

Fail-overs ... for fun?

I've heard developers worrying about how to achieve the "perfect" database cluster that they can treat as a black box that never fails or needs maintenance.

In reality, downtime due to planned maintenance is more common than unplanned failures. And the networks we build on occasionally have nasty surprises, such as partial network partitions, which the supposedly magical unicorn clusters don't actually handle too well.

The three techniques I presented above are individually helpful, but are especially good together:

Now, Redis downtime is tested and repeatable. We can perform upgrades on the Redis servers whenever we need to, without it being a major scary event that might require manual "clean up" or cause unknown side-effects.

Because of this, I joked to a coworker that I can perform fail-overs "for fun". But I don't condone this practice on production servers!

- @andy_snow