Cabot - monitor and alert : Services

What is a service?

We created Cabot because we wanted a way of monitoring services, not just machines.

Cabot allows you to monitor logical services - running on one machine or 100 nodes - as logical units. Instead of (or rather, as well as) monitoring your Hadoop cluster’s disk usage, you can make sure you get alerts when the response time for your ElasticSearch-based search API crosses a threshold, or when a particular database on your Redis server grows past a certain size.

Statuses

Available statuses

A service has one of four overall statuses, which determine how and how often alerts are sent:

- Passing - all’s good. If a service has previously been marked as Error or Critical, you will get a notification to say that the service is back to normal.
- Warning - things you want to be alerted to, but probably not disturbed for.
- Error - alerts are sent via all channels except phone call and repeat every 2 hours.
- Critical - this is for when the production database crashes at 2am. Phone alerts and other notifications go to duty officers and repeat every 10 minutes.
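
The repeat intervals in the list above can be sketched as a small lookup. This is only an illustration, not Cabot's actual code: the names `REPEAT_EVERY` and `should_repeat` are hypothetical, and only the two intervals stated in the list are included.

```python
from datetime import datetime, timedelta

# Repeat intervals stated in the status list (hypothetical names, not Cabot's API).
REPEAT_EVERY = {
    "ERROR": timedelta(hours=2),
    "CRITICAL": timedelta(minutes=10),
}


def should_repeat(status, last_sent, now):
    """Return True if an alert for `status` is due to be repeated."""
    interval = REPEAT_EVERY.get(status)
    if interval is None:
        # No repeat interval is given in the list for this status.
        return False
    return now - last_sent >= interval


last = datetime(2024, 1, 1, 2, 0)
print(should_repeat("CRITICAL", last, last + timedelta(minutes=10)))  # True
print(should_repeat("ERROR", last, last + timedelta(minutes=10)))     # False
```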

It’s easier to understand what happens when a service changes from its Start status to End status by looking at the following table.

A notification is either:
- An email
- A Hipchat message without @mentions (an @mention causes pop-up desktop notifications, emails and/or SMS notifications to be sent)

An alert is either:
- A Hipchat message with @mentions
- An SMS

A telephone alert is a phone call, not just an SMS.

| Start status | End: Passing | End: Warning | End: Error | End: Critical |
| --- | --- | --- | --- | --- |
| Passing | [n/a] | Notification | Alert | Telephone alert or alert |
| Warning | Notification (back to normal) | Notification repeats every 2 hours | Alert | Telephone alert or alert |
| Error | Alert (back to normal) | Notification (of warning) | Alert repeats every 2 hours (notification every 10 mins) | Telephone alert or alert |
| Critical | Alert (back to normal) | Non-alerting notification | Alert | Telephone alert or alert |
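
The table above can be read as a lookup from (start status, end status) to the kind of message sent. The sketch below encodes it that way; `STATUSES`, `MATRIX` and `message_kind` are hypothetical names chosen for illustration and are not part of Cabot itself.

```python
# Column order for the end status, matching the table.
STATUSES = ["PASSING", "WARNING", "ERROR", "CRITICAL"]

PHONE = "telephone alert or alert"

# One row per start status; each entry is the message kind for the
# corresponding end status. None stands for the [n/a] cell.
MATRIX = {
    "PASSING":  [None,           "notification", "alert", PHONE],
    "WARNING":  ["notification", "notification", "alert", PHONE],
    "ERROR":    ["alert",        "notification", "alert", PHONE],
    "CRITICAL": ["alert",        "notification", "alert", PHONE],
}


def message_kind(start, end):
    """Look up what is sent when a service moves from `start` to `end`."""
    return MATRIX[start][STATUSES.index(end)]


print(message_kind("ERROR", "PASSING"))    # alert
print(message_kind("PASSING", "WARNING"))  # notification
print(message_kind("PASSING", "PASSING"))  # None
```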

Calculating service status

Cabot sets the status of a Service to the status of its most important (Critical > Error > Warning) failing check. (You can configure this by changing the Importance of a check.)

When calculating service status, Cabot ignores checks that have been disabled by unticking the Active box on the check configuration page.
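
As a rough illustration of the rule described in this section, the sketch below picks the most important failing check and skips disabled ones. The `Check` class, its field names and `service_status` are assumptions made for the example; they are not Cabot's real models.

```python
from dataclasses import dataclass

# Severity order taken from the text: Critical > Error > Warning.
SEVERITY = {"WARNING": 1, "ERROR": 2, "CRITICAL": 3}


@dataclass
class Check:
    name: str
    importance: str      # "WARNING", "ERROR" or "CRITICAL"
    passing: bool        # result of the most recent run
    active: bool = True  # mirrors the Active box; False means disabled


def service_status(checks):
    """Return the overall status of a service from its checks."""
    # Disabled checks are ignored; only active, failing checks count.
    failing = [c for c in checks if c.active and not c.passing]
    if not failing:
        return "PASSING"
    return max(failing, key=lambda c: SEVERITY[c.importance]).importance


checks = [
    Check("disk usage", "WARNING", passing=False),
    Check("search API latency", "ERROR", passing=False),
    Check("database reachable", "CRITICAL", passing=False, active=False),
]
print(service_status(checks))  # ERROR
```

Here the failing Critical check is disabled, so the service takes the status of the next most important failing check, Error.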

Disabling alerts

You can disable alerts for a service by editing it and unticking Alerts enabled.

We have considered implementing a mechanism for acknowledging alerts. Currently we respond by jumping into Hipchat and using it as a war room for coordinating the response, but a more formal escalation and acknowledgement procedure may be appropriate for larger organisations. Suggestions and pull requests are welcome.

Adding a service

Adding a service is straightforward - just click New ▾ and then Service.

If you have already configured Checks that you want to make part of the service, you can select them in the Status checks field.

You can also choose the channels through which notifications for this service are sent, select the users who should receive alerts and notifications, or disable alerts entirely.

One of Cabot’s most important features is consolidating the information and metrics that tell us what is broken, so it is also the first place we look to work out how to fix it.

Rather than build a wiki, we embed private Hackpads in each Service overview page.

Add a Hackpad

To add a Hackpad to a service you have to enter its alphanumeric ID on the Edit Service page:

[Image: Add Hackpad id]

Then it will be visible - and editable - from the corresponding Service overview page.

[Image: View Hackpad]
