Services
What is a service?
We created Cabot because we wanted a way of monitoring services, not just machines.
Cabot allows you to monitor logical services - running on one machine or 100 nodes - as logical units. Instead of (or rather, as well as) monitoring your Hadoop cluster’s disk usage, you can make sure you get alerts when the response time for your ElasticSearch-based search API crosses a threshold, or when a particular database on your Redis server grows past a certain size.
Statuses
Available statuses
Services have four possible overall statuses which feed into alert frequency:
Passing
- all’s good. If a service has previously been marked asError
orCritical
you will get a notification to say that the service is back to normal.Warning
- things you want to be alerted to, but probably not disturbed for.Error
- alerts possible via all channels other than phone call, and repeated every 2 hours.Critical
- this is for when the production database crashes at 2am. Phone alerts and other notifications to duty officers repeated every ten minutes.
It’s easier to understand what happens when a service changes from its Start status
to End status
by looking at the following table.
- A
notification
is either:- An email
- A Hipchat message that does not contain @mentions (which will cause pop-up desktop notifications, emails and/or SMS notifications to be sent)
- An
alert
is:- A Hipchat message with @mentions
- An SMS
- A
telephone alert
is a phone call, not just an SMS
End status | |||||
---|---|---|---|---|---|
Passing | Warning | Error | Critical | ||
Start status | Passing | [n/a] | Notification | Alert | Telephone alert or alert |
Warning | Notification (back to normal) | Notification repeats every 2 hours | Alert | Telephone alert or alert | |
Error | Alert (back to normal) | Notification (of warning) | Alert repeats every 2 hours (notification every 10 mins) | Telephone alert or alert | |
Critical | Alert (back to normal) | Non-alerting notification | Alert | Telephone alert or alert |
Calculating service status
Cabot sets the status of a Service to the status of its most important (Critical > Error > Warning) failing check. (You can configure this by changing the Importance
of a check.)
Cabot will ignore checks that are disabled (by unticking the Active
box on the check configuration page) in calculating service status.
Disabling alerts
You can disable alerts for a service by editing it and unticking Alerts enabled
.
We have thought of implementing a mechanism for acknowledging alerts. Currently we respond by jumping into Hipchat and using that as a war room for coordinating response, but a more formal escalation and acknowledgement procedure may be appropriate for larger organisations. Suggestions and pull requests are welcome.
Adding a service
Adding a service is straightforward - just click New ▾ and then Service.
If you have already configured Checks that you want to make part of the service, you can select them in the Status checks field.
You can also choose which channels you would like to receive notifications for this alert by, select users who should receive alerts/notifications, or disable alerts entirely.
Sharing debugging information
Because one of Cabot’s most important features is helping consolidate information and metrics that tell us what is broken, it’s the first place that we look to work out how to fix it.
Rather than build a wiki, we embed private Hackpads onto each Service overview page.
Add a Hackpad
To add a Hackpad to a service you have to enter its alphanumeric ID on the Edit Service page:
Then it will be visible - and editable - from the corresponding Service overview page.