Keeping Datadog In (Version) Control
I recently migrated our system monitoring and analytics from the TICK stack to Datadog. One of the things I really wanted to get better about in this transition was keeping more of our config version controlled somehow. This post will just give a quick, high-level view of the tools we're using to do that.
At a very high level, we keep all of our config in one git repository, appropriately called datadog_config. We use 2 different tools (I'll cover them later), so this repo has a config file for each of those tools. We have a simple Rakefile that pulls our Datadog API credentials out of AWS' Parameter Store and builds the tool config files for you. Then there is a directory for monitors and a directory for objects (dashboards/screenboards).
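The Rakefile itself isn't shown in this post, so the following is only a rough Python sketch of what that credential step could look like, not a copy of our Ruby code. The parameter names (/datadog/api_key, /datadog/app_key) and both function names are hypothetical; the only real API used is boto3's get_parameter call on the SSM client.

```python
from typing import Callable

# Hypothetical Parameter Store names; the real ones are not given in this post.
API_KEY_PARAM = "/datadog/api_key"
APP_KEY_PARAM = "/datadog/app_key"


def fetch_from_parameter_store(name: str) -> str:
    """Read one decrypted value from AWS SSM Parameter Store (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without AWS access

    ssm = boto3.client("ssm")
    response = ssm.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]


def render_dogpush_credentials(fetch: Callable[[str], str]) -> str:
    """Build the `datadog:` section of a DogPush config from fetched credentials."""
    api_key = fetch(API_KEY_PARAM)
    app_key = fetch(APP_KEY_PARAM)
    return (
        "---\n"
        "datadog:\n"
        f"  api_key: {api_key}\n"
        f"  app_key: {app_key}\n"
    )


if __name__ == "__main__":
    # Stand-in fetcher so the example runs offline;
    # pass fetch_from_parameter_store instead for real use.
    fake_store = {API_KEY_PARAM: "example-api-key", APP_KEY_PARAM: "example-app-key"}
    print(render_dogpush_credentials(fake_store.__getitem__))
```

Passing the fetcher in as an argument keeps the secrets out of the repository and makes the rendering step easy to try without touching AWS.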
There are 2 distinct parts of Datadog config that I wanted to keep in version control (hence the 2 different tools):
- Monitors
- Everything else (timeboards and screenboards) - this is the folder called objects
Monitors
To keep monitors in version control, we're using DogPush. There's an issue when using it on macOS, so we use a slightly tweaked version.
What DogPush gets us that's really great is the ability to set default notifications based on which "team" a monitor belongs to. We're using "team" to mean level, so for a low-level monitor we'll just send an email, as opposed to paging everyone for a high-level monitor. Datadog doesn't give you an easy way in the UI to change the recipients for multiple monitors at once, but DogPush lets you do that.
So our config for DogPush looks like:
---
datadog:
  api_key: *********************
  app_key: *********************
teams:
  low:
    notifications:
      CRITICAL: "@our_alert_email_address@chargify.com"
      WARNING: "@our_alert_email_address@chargify.com"
  medium:
    notifications:
      CRITICAL: "@our_alert_email_address@chargify.com @slack-our-notice-channel"
      WARNING: "@our_alert_email_address@chargify.com @slack-our-notice-channel"
  high:
    notifications:
      CRITICAL: "@our_alert_email_address@chargify.com @slack-our-notice-channel @opsgenie-OpsGenie"
      WARNING: "@our_alert_email_address@chargify.com @slack-our-notice-channel @opsgenie-OpsGenie"
rule_files:
  - 01_low/*.yaml
  - 02_medium/*.yaml
  - 03_high/*.yaml
So what that config gets us is:
- dedicated levels (low, medium, high) that each have a default list of notification recipients
- an instruction for DogPush to look for any/all monitor definitions in the 01_low, 02_medium, and 03_high directories
Then an example of a monitor definition we'd have in 01_low is:
team: low
alerts:
- multi: true
  name: "Disk usage is at {{value}} % for {{device.name}} on {{host.name}}"
  options:
    escalation_message: ''
    evaluation_delay: null
    include_tags: true
    locked: true
    new_host_delay: 300
    no_data_timeframe: null
    notify_no_data: false
    renotify_interval: 0
    require_full_window: true
    thresholds: {critical: 95.0, critical_recovery: 80.0, warning: 85.0, warning_recovery: 80.0}
    timeout_h: 0
  query: avg(last_5m):avg:system.disk.in_use{*} by {host,device} * 100 > 95
  tags: ['*']
  type: query alert
So the line team: low tells DogPush which notification recipients you'd like for this monitor. The directories themselves don't determine anything; they're just nice for organization.
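To make the link between a rule file's team and its recipients concrete, here is a minimal Python sketch of the idea. It is not DogPush's actual implementation: the TEAMS dict simply mirrors the teams section of the config above, and the function name recipients_for is made up for illustration.

```python
from typing import Dict, List

# Mirrors the `teams` section of the DogPush config shown earlier.
EMAIL = "@our_alert_email_address@chargify.com"
SLACK = "@slack-our-notice-channel"
PAGER = "@opsgenie-OpsGenie"

TEAMS: Dict[str, Dict[str, Dict[str, str]]] = {
    "low": {"notifications": {"CRITICAL": EMAIL, "WARNING": EMAIL}},
    "medium": {
        "notifications": {
            "CRITICAL": f"{EMAIL} {SLACK}",
            "WARNING": f"{EMAIL} {SLACK}",
        }
    },
    "high": {
        "notifications": {
            "CRITICAL": f"{EMAIL} {SLACK} {PAGER}",
            "WARNING": f"{EMAIL} {SLACK} {PAGER}",
        }
    },
}


def recipients_for(rule_file: dict, severity: str) -> List[str]:
    """Return the default recipients for a rule file's team at a given severity."""
    team = rule_file["team"]
    if team not in TEAMS:
        raise KeyError(f"unknown team: {team!r}")
    # A severity with no configured notification yields an empty list.
    return TEAMS[team]["notifications"].get(severity, "").split()


if __name__ == "__main__":
    # Illustrative rule file; only the `team` key matters for the lookup.
    disk_rules = {"team": "low", "alerts": [{"type": "query alert"}]}
    print(recipients_for(disk_rules, "CRITICAL"))
```

Because every monitor only names its team, changing who gets notified for a whole level means editing one entry in the config rather than touching each monitor.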
The usual flow for getting one of these definitions, however, is:
- Create/edit a monitor in the Datadog UI until you get it the way you want it
- Run dogpush diff
- Copy/paste the part of the diff that corresponds to the monitor you've been working with into the right file/directory for the level you want
- Run dogpush push to update the monitor with the right notification recipients for the level
What could be better/Future plans
- Automatically check/poll/alert when monitors change on Datadog; right now everything is manual.
- Be able to dynamically set the level/team based on alert conditions (e.g. create 2 monitors on Datadog, one for staging and one for production, and make the staging one low-level and the production one high-level)
- Be able to mute monitors, especially if we could automatically do it while running routine maintenance.
Everything Else (Dashboards + Screenboards)
Anything on Datadog that isn't a monitor, we keep under version control using Shopify's doggy, although again we're using our own modified version. Basically, the Shopify version makes assumptions about their environment/tools, and we took those out to make it work with ours.
The way we use doggy:
- Create/edit a dashboard/screenboard in the Datadog UI
- Run doggy pull [ID]
- Make any changes if needed
- Regardless of whether changes have been made, run doggy push to make sure everything you have in version control matches what's on Datadog and is marked as managed with the 🐶 emoji. (FYI, don't delete that 🐶, or doggy won't know to manage that dashboard)
The single greatest thing this tool has gotten us is the ability to easily change a lot of queries at once. This is a common scenario that doggy has made a lot easier:
- Install a new integration in Datadog (MySQL for example)
- That integration comes with a pre-built dashboard where all the 20 or so graphs are scoped to a $scope template variable that only allows you to look at the dashboard on a host-by-host basis
- Clone that dashboard and set up the actual template variable(s) you want (we normally go for $env)
- Run doggy pull [ID]
- Open the corresponding .json definition file and search-and-replace $scope with your own template variable ($env)
- Run doggy push to send the updated definition back to Datadog
That whole process takes maybe 3 minutes, and is so much easier than clicking into each graph and changing the query one by one.
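If you'd rather script the search-and-replace than do it in an editor, something like the following Python sketch would work. It is an illustration only, not part of doggy: the function name and the example path are hypothetical, and it assumes the pulled definition is a plain JSON file as described above.

```python
import json
import re
from pathlib import Path


def retarget_template_variable(path: Path, old: str = "scope", new: str = "env") -> int:
    """Replace every `$old` template variable with `$new` in a dashboard JSON file.

    Returns the number of replacements made.
    """
    text = path.read_text(encoding="utf-8")
    # Match `$scope` exactly, but not a longer name such as `$scope_other`.
    pattern = re.compile(r"\$" + re.escape(old) + r"\b")
    updated, count = pattern.subn("$" + new, text)
    json.loads(updated)  # fail loudly if the result is not valid JSON
    path.write_text(updated, encoding="utf-8")
    return count


if __name__ == "__main__":
    # Hypothetical path; point this at the file that doggy pull created.
    definition = Path("objects/12345.json")
    if definition.exists():
        print(f"replaced {retarget_template_variable(definition)} occurrence(s)")
```

The word-boundary check is there so that a different variable whose name merely starts with the same letters is left alone, and parsing the result before writing it back guards against saving a broken definition.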
What could be better/Future plans
- Organizing; currently it's just a list of .json files where the filename is the ID of the dashboard
- Automate the process of checking/alerting when a managed dashboard on Datadog changes
- Make doggy sync
- Make use of doggy mute with our monitors not managed by doggy and/or port the muting to DogPush.
Conclusion
These 2 small tools make it a lot easier to keep a handle on the most critical parts of our Datadog infrastructure (monitors and dashboards). With a couple of tweaks, they've cut down on the time it takes to make changes to multiple items in Datadog. There's still a lot of room to make the tools more useful and automated.