Caduceus
What is Caduceus?
Caduceus is that long stick with the intertwined snakes that Hermes used to carry around.
It is also a service that will notify you if your scheduled tasks/cronjobs fail to run.
Motivation
You know how you set all these cronjobs to run, and added fancy error reporting and things, only to realize too late that this doesn't help you at all when the server has been down for a month and nobody noticed?
Caduceus won't let this happen again.
Rather than trigger on failure, Caduceus triggers on absence of success.
Services have to actively check in (by visiting a URL), and, if they don't, Caduceus notifies you by email that the task has failed.
If the service starts working again, Caduceus will notify you of that as well.
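The "absence of success" idea can be sketched in a few lines of Python. This is only an illustration of the principle, not Caduceus's actual code: the `Alert` class, its fields, and the `check_in` and `evaluate` helpers are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    """Hypothetical model of one monitored task (not Caduceus's real schema)."""
    name: str
    every: float          # maximum allowed seconds between check-ins
    last_checkin: float   # timestamp of the most recent check-in
    failed: bool = False  # whether the alert is currently in the failed state


def check_in(alert: Alert, now: float) -> Optional[str]:
    """Record a check-in; return a recovery message if the alert had failed."""
    alert.last_checkin = now
    if alert.failed:
        alert.failed = False
        return f"{alert.name} is working again"
    return None


def evaluate(alert: Alert, now: float) -> Optional[str]:
    """Return a failure message when the task first goes silent for too long."""
    if not alert.failed and now - alert.last_checkin > alert.every:
        alert.failed = True
        return f"{alert.name} has not checked in"
    return None
```

Note that nothing here depends on the task reporting an error: a server that is completely down simply stops calling `check_in`, and `evaluate` notices the silence.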
Installation
To install Caduceus, you can just get it from PyPI:
pip install caduceus
Alternatively, you can pull the Docker image:
docker pull registry.gitlab.com/stavros/caduceus:latest
Usage
To run Caduceus, you need to configure it.
This is done by placing a file called caduceus.toml in the directory you want to run Caduceus in.
That directory is where the Caduceus SQLite database will be created.
If you installed Caduceus from the repo or with pip, just run it:
caduceus
It will load the configuration from the file, create its database and start running on http://localhost:5000/.
To run it via Docker:
docker run -v "$(pwd)":/caduceus registry.gitlab.com/stavros/caduceus:latest
Configuration
Here's a sample configuration file (which is also available as caduceus.toml.example in the repository):
[config]
secret_key = "somelongkey"
[alerting.email]
recipient_emails = [ "notifyme@example.com", "otherdev@example.com" ]
from_addr = "caduceus@example.com"
hostname = "example.com"
port = 25
username = "myuser"
password = "mypassword"
encryption = "none"
[alerting.telegram]
apikey = "#############:####################"
chat_id = "99999999"
[alerts]
default_channels = [ "telegram" ]
[alerts.cron]
every = "1h"
channels = [ "email" ]
[alerts.backups]
every = "1d"
channels = [ "email", "telegram" ]
recipient_emails = [ "thirdemail@example.com" ]
[alerts.alwaysfail]
every = "1s"
notify_every = "1m"
The above config defines three services, cron, backups and alwaysfail.
cron needs to check in every hour, backups needs to check in every day, and alwaysfail needs to check in every second.
That's why it was called that.
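To make the interval strings concrete, here is a rough sketch of how values like "1h" or "1d" could be turned into seconds. It is an assumption, not Caduceus's own parser: `parse_interval` is a hypothetical helper, and it only understands the four suffixes that appear in the sample config (s, m, h, d).

```python
import re

# Seconds per unit; limited to the suffixes seen in the sample config.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_interval(text: str) -> int:
    """Convert a string such as "1h" into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unrecognised interval: {text!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]
```

With this reading, cron has 3600 seconds between check-ins, backups has 86400, and alwaysfail has just 1.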
However, as notifying you every second would get spammy, notify_every is set to one minute, so Caduceus will only notify you once a minute, even though the alert will be considered failed if it doesn't check in once per second.
You will get an initial notification right when the failure is detected (there is a 10-second notification resolution) and then one every minute after that.
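The interaction between every, notify_every and the 10-second resolution can be sketched as a small simulation. This is a guess at the behaviour described above, not the real implementation: `should_notify` and `notification_times` are hypothetical helpers, and the 10-second polling loop is an assumption about what "notification resolution" means.

```python
def should_notify(now, last_checkin, every, notify_every, last_notified):
    """Decide whether a failure notification is due (all values in seconds).

    last_notified is None if nothing has been sent since the last check-in.
    """
    if now - last_checkin <= every:
        return False  # the task is still within its allowed interval
    if last_notified is None:
        return True   # first notification, sent as soon as failure is seen
    return now - last_notified >= notify_every


def notification_times(every, notify_every, poll=10, until=140):
    """Simulate a task whose only check-in was at t=0; return when alerts fire."""
    sent = []
    last_notified = None
    for now in range(0, until, poll):
        if should_notify(now, 0, every, notify_every, last_notified):
            sent.append(now)
            last_notified = now
    return sent


# alwaysfail: every = 1 second, notify_every = 60 seconds
print(notification_times(every=1, notify_every=60))  # [10, 70, 130]
```

Under these assumptions the first notification goes out at the first poll after the deadline, and the rest follow one minute apart.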
Always leave a bit of leeway in your tasks, to account for running time.
If a task starts at midnight one day and runs for an hour, it'll check in at 1am.
If the next day it runs for 61 minutes, it will check in more than a day later, so you'll get a "failed" email.
To avoid that, add a buffer of 10% or so to your alerts.
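The arithmetic behind that advice, as a small sketch. `padded_interval` and `is_late` are illustrative helpers rather than part of Caduceus, and the 10% default simply mirrors the suggestion above.

```python
def padded_interval(expected_seconds: float, buffer: float = 0.10) -> float:
    """Return the check-in interval with a safety margin added."""
    return expected_seconds * (1 + buffer)


def is_late(gap_seconds: float, every_seconds: float) -> bool:
    """True if the gap between two check-ins exceeds the allowed interval."""
    return gap_seconds > every_seconds


DAY = 24 * 60 * 60

# Day one finishes after 60 minutes, day two after 61 minutes,
# so the two check-ins are one day and one minute apart.
gap = DAY + 60

print(is_late(gap, DAY))                   # True: a strict one-day alert fails
print(is_late(gap, padded_interval(DAY)))  # False: the 10% buffer absorbs it
```

For a daily task, a 10% buffer works out to roughly 26.4 hours between check-ins.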
Checking in
Checking in is done by retrieving a URL on the server.
The URL for checking in and resetting the alert timer is /reset/<alert name>/.
For example, to check in to backups if you haven't specified a secret_key (and if Caduceus is running on example.com), you'd simply do:
curl http://example.com/reset/backups/
If you did specify a secret key, just include it:
curl "http://example.com/reset/backups/?key=<your secret_key>"
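If your task is driven from Python rather than the shell, the same check-in can be done with the standard library. The following is a minimal sketch, assuming the /reset/<alert name>/ URL and the key query parameter described above; `reset_url` and `run_and_check_in` are hypothetical helper names, not part of Caduceus.

```python
import subprocess
import urllib.parse
import urllib.request


def reset_url(base_url: str, alert_name: str, secret_key: str = "") -> str:
    """Build the /reset/<alert name>/ URL, adding ?key=... when a key is set."""
    name = urllib.parse.quote(alert_name, safe="")
    url = base_url.rstrip("/") + "/reset/" + name + "/"
    if secret_key:
        url += "?" + urllib.parse.urlencode({"key": secret_key})
    return url


def run_and_check_in(command, base_url, alert_name, secret_key=""):
    """Run the task and check in only if it exited successfully."""
    result = subprocess.run(command)
    if result.returncode == 0:
        url = reset_url(base_url, alert_name, secret_key)
        # A plain GET is enough; the response body is not needed.
        with urllib.request.urlopen(url, timeout=10):
            pass
    return result.returncode
```

Checking in only on a zero exit code means that a task that runs but fails is treated the same as one that never ran: the timer is not reset, and the alert fires once the interval passes.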
If your alert is set up for, say, one hour, and your task does not check in, you will get an email one hour after its last checkin, saying "your task has not checked in".
If it still doesn't check in, you'll get another email an hour after that, then another an hour later, and so on, until it does, at which point you'll get an email saying that the job is now fine.