Steve Kemp's Homepage
This is a simple and reliable proof-of-concept distributed monitoring system.

What does it mean for a monitoring system to be distributed? It means that multiple hosts run the service tests, and an alert is only generated if two or more of them see an outage. The intention is that alerts fire only when a service is really unavailable, so you'll see no false alarms caused by a single monitoring host having a flaky network connection.

The simplest way to run the service is to install the Debian package: this will add the dmonitor system user and set up things like init scripts and cron jobs. The more manual approach is to:
Once you've installed the system on a number of hosts you should tighten security by ensuring that only the monitoring hosts can talk to each other on the control port, 2929.

Create the file /etc/dmonitor/nodes.txt containing the public hostnames or addresses of the hosts running the monitoring software. When a node detects a failure it will contact its peers to determine whether the failure is local or genuine. Remember, again, that you should firewall port 2929 away from the outside world.

The next step is to configure the hosts and services to be tested. To run ping and ssh tests on the host www.example.com you'd run this:

    mkdir -p /etc/dmonitor/hosts.d/www.example.com/
    touch /etc/dmonitor/hosts.d/www.example.com/ping
    touch /etc/dmonitor/hosts.d/www.example.com/ssh

Finally, if you wish to email foo@example.com and root@example.com on alerts, run:

    touch /etc/dmonitor/alert.d/foo@example.com
    touch /etc/dmonitor/alert.d/root@example.com

If you don't configure alerting email addresses you'll receive no notices!

The monitoring is actually conducted by a series of plugins, located in the directory /etc/dmonitor/plugins. For the test "ping" there should be an executable named check_ping. Each plugin is called with the hostname to test as its single argument and is expected to output OK\n when all is good, and FAIL\n on failure. If a plugin takes longer than 10 seconds to execute it will be killed and a result of "TIMEOUT" will be inferred, which is equivalent to "FAIL".
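The plugin contract described above can be exercised by hand before a plugin is deployed. What follows is a minimal sketch of a test harness, not dmonitor's own runner: the function name run_plugin and its optional third argument are hypothetical, it assumes timeout(1) from GNU coreutils is installed, and it assumes that any output other than OK should count as a failure.

```shell
#!/bin/sh
# Hand-rolled harness for trying out a plugin; NOT part of dmonitor itself.
#
# run_plugin PLUGIN HOST [LIMIT]
#   PLUGIN  path to an executable, e.g. /etc/dmonitor/plugins/check_ping
#   HOST    hostname handed to the plugin as its single argument
#   LIMIT   seconds before the plugin is killed (defaults to 10, as in the text)
run_plugin() {
    plugin=$1
    host=$2
    limit=${3:-10}

    # GNU timeout(1) exits with status 124 when it had to kill the command
    # for running too long. The "|| status=$?" form keeps the function
    # usable even when the calling script runs under "set -e".
    status=0
    output=$(timeout "$limit" "$plugin" "$host" 2>/dev/null) || status=$?

    if [ "$status" -eq 124 ]; then
        echo "TIMEOUT"          # per the text, equivalent to FAIL
        return 1
    fi

    # Command substitution strips the trailing newline, so "OK\n" becomes "OK".
    # Assumption: anything other than OK is reported as a failure.
    case $output in
        OK) echo "OK";   return 0 ;;
        *)  echo "FAIL"; return 1 ;;
    esac
}
```

Calling run_plugin /etc/dmonitor/plugins/check_ping www.example.com would then print one of OK, FAIL or TIMEOUT for that host, which makes it easy to confirm that a new plugin prints exactly what is expected and finishes within the time limit.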