At Chargify, we lean heavily on Elasticsearch for data storage as well as search, but upgrading Elasticsearch major versions requires a full cluster restart.

We came up with a plan for upgrading with no noticeable downtime, although the cluster was placed into "read only" mode for some time.

Pre-upgrade codebase compatibility

Before making any other moves, we ensured that all of our usage of Elasticsearch was compatible with both v2 and v5. This meant the codebase would not need to be changed or redeployed as part of the upgrade, and it enabled seamless testing on a staging system.

Here's an outline of some changes we needed:

  • Ruby gem upgrades: The appropriate gem versions track the server version, so for Elasticsearch server v5, use 5.x gems.
  • Unlimited aggregation buckets are no longer supported (via size: 0) - unlimited anything is rarely a good practice, but v5 now enforces this. We went through and placed specific limits on all aggregations.
  • Stored id fields: We had defined some id mappings without a type, because it defaulted to "string". For some reason, this caused problems after the index was upgraded to v5. The symptom was that "plucking" a single field (via _source: [fieldname]) did not always work. The fix was simply to upgrade the mapping with an explicit type: 'string'.
  • Unsupported filter queries: These need to be rewritten; the simplest way is to make sure filter: { terms: { ... is wrapped in an outer query: { bool: { ....
  • Change old scan queries to use scroll instead. To keep unsorted scrolls efficient, use "sort": "_doc".
  • Change missing queries to a negated exists instead.
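
To illustrate the last three rewrites, here are sketches as curl calls. The index name, field names, and endpoint are made-up examples, not our actual setup; the request bodies follow the v5 query DSL.

```shell
# Top-level filter queries were removed in v5. The old
#   { "filter": { "terms": { "status": ["active"] } } }
# becomes a filter clause inside an outer bool query:
curl -s -XPOST 'http://127.0.0.1:9200/myindex/_search' -d '{
  "query": { "bool": { "filter": { "terms": { "status": ["active"] } } } }
}'

# The old scan search_type becomes a scroll sorted on _doc,
# which skips scoring and keeps unsorted scrolls efficient:
curl -s -XPOST 'http://127.0.0.1:9200/myindex/_search?scroll=1m' -d '{
  "sort": ["_doc"],
  "size": 500
}'

# A v2 "missing" query becomes a negated "exists" in v5:
curl -s -XPOST 'http://127.0.0.1:9200/myindex/_search' -d '{
  "query": { "bool": { "must_not": { "exists": { "field": "deleted_at" } } } }
}'
```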

Supporting a read-only cluster

We made our application code degrade gracefully if it could read but not write to Elasticsearch.

This is a major advantage when it comes to doing a full cluster restart. Read-only mode enables a seamless failover from the old cluster to the new one without loss of data.

A couple of techniques that enabled this for us:

  • Any time a document fails to be created, save it to a SQL table. A cron job comes along later to pick these up and create them in Elasticsearch.
  • As much as possible, only writing to Elasticsearch from background jobs, which can then be paused for short periods without causing issues.
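
One way to exercise this graceful degradation on a staging system is to flip the read-only block on a single index and watch how the application behaves (the index name here is illustrative):

```shell
# Put one index into read-only mode; writes to it will be rejected
curl -s -XPUT 'http://127.0.0.1:9200/myindex/_settings' -d '{
  "index": { "blocks": { "read_only": true } }
}'

# ...exercise the application; failed writes should land in the SQL table...

# Clear the block again
curl -s -XPUT 'http://127.0.0.1:9200/myindex/_settings' -d '{
  "index": { "blocks": { "read_only": false } }
}'
```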

Reindexing

Even if you are already running Elasticsearch v2, only indexes that were created in v2 can be upgraded to v5.

We had to make sure all indexes in the cluster that were created in v1 were reindexed. The process for that is described in a previous post.
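
This isn't the exact procedure from that post, but as one option: if you are on v2.3 or later, the _reindex API can copy an old index into a freshly created one (both index names below are illustrative):

```shell
# Create the destination index with current mappings first, then
# copy every document from the v1-era index into it:
curl -s -XPOST 'http://127.0.0.1:9200/_reindex' -d '{
  "source": { "index": "myindex_v1" },
  "dest":   { "index": "myindex_v2" }
}'
```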

Dedicated disk volumes

Elasticsearch supports "in place" upgrades of the data on disk. Our upgrade plans revolved around this via the snapshots feature of AWS EBS.

The key here is storing the Elasticsearch data on a dedicated EBS volume mounted separately from Linux root, at /var/lib/elasticsearch.

In this way, the OS and Elasticsearch software can be upgraded without touching the data, and the old data volume can be cloned and re-used.
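
As a sketch of that layout (the device name /dev/xvdf is an assumption and will vary by instance and volume type):

```shell
# Format the dedicated EBS volume once
# (skip this step for volumes cloned from snapshots - they already have data)
sudo mkfs.ext4 /dev/xvdf

# Mount it where Elasticsearch keeps its data
sudo mkdir -p /var/lib/elasticsearch
sudo mount /dev/xvdf /var/lib/elasticsearch
sudo chown elasticsearch:elasticsearch /var/lib/elasticsearch

# Persist the mount across reboots
echo '/dev/xvdf /var/lib/elasticsearch ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab
```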

Automation and Practicing

With preparation out of the way, the actual upgrade can proceed. Below is a rough guide to how we performed it.

To increase confidence, we practiced several times, first on a staging system, and then all the lead-up steps on the live production system (without actually doing the failover).

Automating as much as possible (shell scripts that make API calls are perfectly adequate) made the whole process repeatable and reliable.
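
As an example of the level of automation we mean, a minimal script along these lines is enough to make the lead-up steps repeatable (the host variable and function names are ours for illustration):

```shell
#!/bin/sh
set -e
ES='http://127.0.0.1:9200'

# Set or clear the read-only block on all indexes; $1 is true or false
set_read_only() {
  curl -s -XPUT "$ES/_all/_settings" \
    -d "{\"index\":{\"blocks\":{\"read_only\":$1}}}"
}

# Report the cluster-wide document count, for confirming inserts have stopped
doc_count() {
  curl -s "$ES/_count" | grep -o '"count":[0-9]*'
}

set_read_only true
curl -s -XPOST "$ES/_flush/synced"
doc_count   # run repeatedly and confirm the number stops changing
```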

Upgrade Day steps

  • Disable shard allocation. Use persistent settings (not transient) to make sure it is still disabled after the cluster restarts.
  • Stop background workers, cron jobs, and any other non-essential application code from running.
  • Set read-only mode and perform a synced flush at the same time.
    curl -XPUT http://127.0.0.1:9200/_all/_settings -d'{"index":{"blocks":{"read_only":true}}}'
    curl http://127.0.0.1:9200/_flush/synced
  • Check that documents are no longer being inserted. (e.g. watch the cluster document count until it stops changing)
  • Take EBS snapshots of all the data volumes. Because the cluster is in read-only mode, the snapshots capture a consistent database state.
  • At this point, if you want to just do a test of spinning up the new cluster, you can clear read only mode and re-enable shard allocation.
  • Create new volumes from the snapshots and spin up the new cluster instances with them attached.
  • Note: Use different EC2 discovery tags, or even an actual firewall, to prevent the new cluster having any contact with the old cluster.
  • Elasticsearch will take some time to boot up and go from red to yellow status, recovering the primary shards. Doing a test run beforehand will show you how much time to expect.
  • Re-enable shard allocation at this point; replication will start, and the cluster will go from yellow to green status when it finishes.
  • Once it is yellow, set read-only mode and begin testing: point one client at it at first, then your entire app. Verify complete operation while still in read-only mode. At this point, you can still roll back if you encounter a major issue.
  • Clear read-only mode: Now there is no going back! You can shut down the old cluster.
    curl -XPUT http://127.0.0.1:9200/_all/_settings -d'{"index":{"blocks":{"read_only":false}}}'
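
The shard allocation setting referred to above can be toggled like this; it goes in the persistent cluster settings, so it survives the restart:

```shell
# Disable allocation before shutting down the old cluster
curl -s -XPUT 'http://127.0.0.1:9200/_cluster/settings' -d '{
  "persistent": { "cluster.routing.allocation.enable": "none" }
}'

# Re-enable once the new cluster's primary shards have recovered
curl -s -XPUT 'http://127.0.0.1:9200/_cluster/settings' -d '{
  "persistent": { "cluster.routing.allocation.enable": "all" }
}'
```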

Additional Notes

  • The EBS initialization process means the new volumes are slower than normal for some time after the new cluster is created. Keep this in mind when evaluating the performance of the new cluster.
  • The S3-based snapshot repository has a different format in Elasticsearch v5. The first time you boot a v5 cluster against it, it will "upgrade" the repository to the new format. The new cluster can see the old v2 snapshots, but any snapshots taken in v2 after that point will not be visible in v5.
    Consider not configuring the S3 repository in the new cluster until after you have switched over to it. This is also something that needs fixing if you decide to roll back.