Keeping Datadog In (Version) Control

I recently migrated our system monitoring and analytics from the TICK stack to Datadog. One thing I really wanted to get better at in this transition was keeping more of our config under version control. This post gives a quick, high-level view of the tools we're using to do that.

At a very high level, we keep all of our config in one git repository, appropriately called datadog_config. We use 2 different tools (I'll cover those later), so this repo has a config file for each of them. We have a simple Rakefile that pulls our Datadog API credentials out of AWS Parameter Store and builds the tool config files for you. Then there is a directory for monitors and a directory for objects (dashboards/screenboards).
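Our Rakefile is Ruby and isn't reproduced here, but the idea is small enough to sketch. The Python below is a rough equivalent under some loud assumptions: the parameter names (/datadog/api_key, /datadog/app_key), the template, and the function names are all made up for illustration. It reads the two keys from Parameter Store through boto3 and fills them into a config template.

```python
from string import Template

# Hypothetical template for the DogPush config; the real file also holds
# the teams and rule_files sections shown later in this post.
DOGPUSH_TEMPLATE = Template(
    "---\n"
    "datadog:\n"
    "  api_key: $api_key\n"
    "  app_key: $app_key\n"
)

# Hypothetical Parameter Store names, used only for this example.
PARAMETER_NAMES = {
    "api_key": "/datadog/api_key",
    "app_key": "/datadog/app_key",
}


def fetch_from_parameter_store(name):
    """Read one decrypted value from AWS Systems Manager Parameter Store."""
    import boto3  # imported lazily so the rest of the sketch runs without AWS

    ssm = boto3.client("ssm")
    response = ssm.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]


def render_config(template, fetch=fetch_from_parameter_store):
    """Fill the template with credentials obtained through `fetch`."""
    values = {key: fetch(name) for key, name in PARAMETER_NAMES.items()}
    return template.substitute(values)
```

Passing the fetch function in as an argument keeps the credentials out of the repository and makes it easy to try the rendering locally with a fake fetcher before pointing it at AWS.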

There are 2 distinct parts of Datadog config that I wanted to keep in version control (hence the 2 different tools):

  1. Monitors
  2. Everything else (timeboards and screenboards) - this is the folder called objects

Monitors

To keep monitors in version control, we're using DogPush. There's an issue when using it on macOS, so we use a slightly tweaked version.

What DogPush gets us that's really great is the ability to set default notifications based on what "team" the monitor belongs to. We're using "team" to mean level. So for a low-level monitor we'll just send an email, as opposed to paging everyone for a high-level monitor. Datadog doesn't give you an easy way in the UI to change the recipients for multiple monitors at once, but DogPush lets you do that.

So our config for DogPush looks like:

---
datadog:
  api_key: *********************
  app_key: *********************
teams:
  low:
    notifications:
      CRITICAL: "@our_alert_email_address@chargify.com"
      WARNING: "@our_alert_email_address@chargify.com"
  medium:
    notifications:
      CRITICAL: "@our_alert_email_address@chargify.com @slack-our-notice-channel"
      WARNING: "@our_alert_email_address@chargify.com @slack-our-notice-channel"
  high:
    notifications:
      CRITICAL: "@our_alert_email_address@chargify.com @slack-our-notice-channel @opsgenie-OpsGenie"
      WARNING: "@our_alert_email_address@chargify.com @slack-our-notice-channel @opsgenie-OpsGenie"
rule_files:
- 01_low/*.yaml
- 02_medium/*.yaml
- 03_high/*.yaml

So what that config gets us is:

  • dedicated levels (low, medium, high) that each have a default list of notification recipients
  • a list of directories (01_low, 02_medium, 03_high) where DogPush looks for any/all monitor definitions

Then an example of a monitor definition we'd have in 01_low is:

team: low

alerts:
  - multi: true
    name: "Disk usage is at {{value}} % for {{device.name}}  on {{host.name}}"
    options:
      escalation_message: ''
      evaluation_delay: null
      include_tags: true
      locked: true
      new_host_delay: 300
      no_data_timeframe: null
      notify_no_data: false
      renotify_interval: 0
      require_full_window: true
      thresholds: {critical: 95.0, critical_recovery: 80.0, warning: 85.0, warning_recovery: 80.0}
      timeout_h: 0
    query: avg(last_5m):avg:system.disk.in_use{*} by {host,device} * 100 > 95
    tags: ['*']
    type: query alert

So the line team: low tells DogPush which notification recipients you'd like for this monitor. The directories themselves don't determine anything; they're just nice for organization.
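To make that lookup concrete, here is a minimal sketch of how a rule file's team maps to recipients. This is not DogPush's actual code; the recipients_for function is made up for illustration, and the config and rule file are written as plain Python dicts so the example doesn't need a YAML parser.

```python
EMAIL = "@our_alert_email_address@chargify.com"
SLACK = "@slack-our-notice-channel"
PAGER = "@opsgenie-OpsGenie"

# The teams section of the config above, written as a dict.
TEAMS = {
    "low": {"notifications": {"CRITICAL": EMAIL, "WARNING": EMAIL}},
    "medium": {
        "notifications": {
            "CRITICAL": f"{EMAIL} {SLACK}",
            "WARNING": f"{EMAIL} {SLACK}",
        }
    },
    "high": {
        "notifications": {
            "CRITICAL": f"{EMAIL} {SLACK} {PAGER}",
            "WARNING": f"{EMAIL} {SLACK} {PAGER}",
        }
    },
}


def recipients_for(rule_file, teams, severity):
    """Return the default recipients for a rule file at the given severity.

    Only the file's `team` key is consulted; the directory the file lives
    in plays no part in the lookup.
    """
    team = rule_file["team"]
    if team not in teams:
        raise ValueError(f"unknown team {team!r}")
    return teams[team]["notifications"].get(severity, "").split()
```

With the disk usage monitor above, recipients_for({"team": "low", "alerts": []}, TEAMS, "CRITICAL") gives just the email address. Changing team: low to team: high in that file is all it takes to start paging through OpsGenie as well.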

The usual flow, however, for getting one of these definitions is:

  • Create/edit a monitor in the Datadog UI until you get it the way you want it
  • run dogpush diff
  • copy/paste the part of the diff that corresponds with the monitor you've been working with into the right file/directory for the level you want.
  • run dogpush push to update the monitor with the right notification recipients for the level.

What could be better/Future plans

  • Automatically check/poll/alert when monitors change on Datadog; right now everything is manual.
  • Be able to dynamically set the level/team based on alert conditions (e.g. create 2 monitors on Datadog, one for staging and one for production, and make the staging one low-level and the production one high-level)
  • Be able to mute monitors, especially if we could automatically do it while running routine maintenance.

Everything Else (Dashboards + Screenboards)

So anything on Datadog that isn't a monitor, we keep under version control using Shopify's doggy, although again we're using our modified version. Basically, the Shopify version makes assumptions about their environment/tools, and we took those out to make it work with ours.

The way we use doggy:

  • Create/edit dashboard/screenboard in the Datadog UI
  • run doggy pull [ID]
  • Make any changes if needed
  • Regardless of whether changes have been made, run doggy push to make sure everything you have in version control is right on Datadog and marked as managed with the 🐶 emoji. (FYI, don't delete that 🐶, or doggy won't know to manage that dashboard)

The single greatest thing this tool has gotten us is the ability to easily change a lot of queries at once. This is a common scenario that doggy has made a lot easier:

  • Install a new integration in Datadog (MySQL for example)
  • That integration comes with a pre-built dashboard where all the 20 or so graphs are scoped to a $scope template variable that only allows you to look at the dashboard on a host by host basis
  • Clone that dashboard and set up the actual template variable(s) you want (we normally go for $env)
  • run doggy pull [ID]
  • Open the corresponding .json definition file and search-and-replace $scope with your own template variable ($env)

That whole process takes maybe 3 minutes, and is so much easier than clicking into each graph and changing the query one by one.
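We do the search-and-replace by hand in an editor, but it can also be scripted. The helper below is a small sketch, with a made-up function name, that swaps one template variable for another in the text of a definition file. It only touches $-prefixed references and leaves longer names that merely start with the same letters alone.

```python
import re


def replace_template_variable(definition_text, old, new):
    """Replace every `$old` reference with `$new` in a dashboard definition.

    Longer variable names that merely start with `old` (for example
    `$scope_id` when `old` is "scope") are left untouched.
    """
    pattern = re.compile(re.escape("$" + old) + r"(?![A-Za-z0-9_])")
    # A function replacement avoids any special handling of the new name.
    return pattern.sub(lambda _match: "$" + new, definition_text)
```

To use it, read the .json file that doggy pull wrote, pass its contents through replace_template_variable(text, "scope", "env"), write the result back, and then run doggy push as usual.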

What could be better/Future plans

  • Organization: currently it's just a list of .json files where the filename is the ID of the dashboard
  • Automate the process of checking/alerting when a managed dashboard on Datadog changes
  • Make doggy sync
  • Make use of doggy mute with our monitors not managed by doggy and/or port the muting to DogPush.

Conclusion

These 2 small tools make it a lot easier to keep a handle on the most critical parts of our Datadog infrastructure (monitors and dashboards). With a couple of tweaks, they've cut down on the time it takes to make changes to multiple items in Datadog. There's still a lot of room to make the tools more useful and automated.