Are you already hosting your own mail or web server, and do you enjoy the flexibility, control and freedom self-hosting gives you? Besides personal advantages like better privacy and the power to customize everything, you can also offer your services to other people. Even though there are plenty of budget hosting companies, many customers are willing to pay for better support or the comfort of having you around for questions.
But what makes the difference between a hobbyist server you run for yourself and a professional hosting business you can charge for? I would argue that the key difference is reliability. If your personal blog or email goes offline for a day because you don't have time to fix it, you are the only one who suffers. When you're responsible for a company's email or webshop, on the other hand, they will lose real money and probably pick a different service soon.
In this article I will introduce the advanced features of monit and its bigger brother, m/monit. While there are definitely newer and maybe more advanced monitoring solutions for different use cases, monit hits a sweet spot in terms of functionality for small and medium-sized server operations. I will assume you have already set up monit and use it to monitor the most basic system properties and a few processes. If not, just work through one of the many introductory articles on monit and then come back to this page.
Web services: the sum of their working parts
For users, the websites and services they consume on a daily basis are abstract things that magically descend from a fluffy cloud, sometimes called the internet. As system administrators we know that web services are the sum of many interconnected programs running on ordinary computers. As long as all these programs do their job, they go mostly unnoticed, but once any of them breaks, the whole service becomes unusable. As admins it's our job to keep everything running smoothly 99.9% of the time or more.
Before starting to configure your local monit agents, you should make a list of services and resources required to run your web services. It's a good idea to start at the 'bottom' and work your way to the top:
- CPU and memory are available
- all physical hard drives in your RAID are working
- partitions are mounted
- enough space is available on partitions
- critical system files are available and have the right permissions
- all processes are running and respond to requests
- website is available with correct content
- scheduled jobs are successfully executed
Once you have a rough idea of what can (and will) go wrong, you can start adding appropriate rules to monit.
System checks monitor the whole machine rather than an individual process. In many cases they will alert you to a problem even if it did not occur in a resource you are explicitly monitoring. A fairly complete system check looks like this:
check system hermes
    if loadavg (5min) > 3 then alert
    if cpu usage > 80% for 3 cycles then alert
    if memory > 90% for 3 cycles then alert
    if swap > 60% then alert
This will raise an alert whenever the load average or CPU usage is dangerously high, or the system is running out of memory or swap space. Instead of alert you can also use other actions, for example exec to run a script of your own.
Once you know the system has enough CPU cycles and memory, the next layer is storage: the partitions that hold websites, emails or databases. It's good practice to have a small system partition and a bigger data partition. That way your server will stay responsive even after users fill up the data partition.
check filesystem root with path /
    if space usage > 90 % then alert
    if inode usage > 50% then alert

check filesystem srv_raid with path /dev/md3
    start program = "/sbin/mdadm -A md3"
    if space usage > 90 % then alert
    if inode usage > 50% then alert
The first block will check the root partition. The second block checks a RAID. If the RAID is not currently active, it will be assembled by monit. For this to work md3 should be specified in /etc/mdadm.conf.
To make sure the partitions are mounted properly, use the directory statement. If the directory does not exist, monit runs the start program, which mounts the partition:
check directory srv with path /srv/www
    start program = "/bin/mount -L srv"
The directory statement can also be used to monitor backups. If your backup script ever stops writing to its target directory, you will be notified.
check directory zeus-backups path /srv/backups/zeus
    if timestamp > 12 hours then alert

check directory athena-backups path /srv/backups/athena
    if timestamp > 12 hours then alert
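A timestamp test only shows that something was written recently, not that the job finished successfully. One way to cover the "scheduled jobs are successfully executed" item from the list at the top is monit's check program statement, which runs a command and tests its exit status. The following is a minimal sketch: the script path /usr/local/bin/verify-backup.sh is hypothetical and stands for any script of yours that exits with a non-zero status when the latest backup is unusable.

```
# Hypothetical verification script; it must exit non-zero on failure.
check program zeus-backup-verify with path "/usr/local/bin/verify-backup.sh /srv/backups/zeus" with timeout 120 seconds
    if status != 0 then alert
```

The timeout of 120 seconds is an assumption as well. Pick a value comfortably above the normal runtime of your script.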
Check network devices
In some cases you will also need to keep an eye on other network devices, like routers, access points or switches. Monit can do this as well. If the Wi-Fi ever goes offline, you will know before your clients do:
check host ap1 with address 192.168.1.30
    if failed icmp type echo count 5 with timeout 15 seconds for 3 cycles then alert

check host sw2 with address 192.168.1.35
    if failed icmp type echo count 5 with timeout 15 seconds for 3 cycles then alert
Check configuration files
An unexpected change to a critical configuration file can be a sign that your server has been compromised, so it makes sense to monitor these files.
check file sshd_config with path "/etc/ssh/sshd_config"
    if changed checksum then alert
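The checklist at the top also asks that critical files have the right permissions. Monit can test mode and ownership alongside the checksum. The variant below is a sketch of the same check with those tests added; the mode 644 and the owner root are assumptions about a typical installation, so compare them with what your distribution ships before using them.

```
check file sshd_config with path "/etc/ssh/sshd_config"
    if changed checksum then alert
    # Assumed mode and ownership; adjust to your system's defaults.
    if failed permission 644 then alert
    if failed uid root then alert
    if failed gid root then alert
```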
Checking processes is at the very heart of monit. This includes MySQL, Apache, SSHD and many others. Let's start with Apache:
check process apache with pidfile /var/run/apache2.pid
    depends on mysql
    depends on memcache
    start program = "/etc/init.d/apache2 start"
    stop program = "/usr/bin/killall -9 apache2"
    if cpu > 60% for 2 cycles then alert
    if cpu > 80% for 5 cycles then restart
    if totalmem > 300.0 MB for 5 cycles then restart
    if children > 80 for 5 cycles then restart
    if loadavg(5min) greater than 10 for 8 cycles then restart
This will check CPU usage, memory, the number of child processes and load. If your web server starts to hog resources or its process disappears, monit will automatically restart it. Note that I'm using killall -9 to stop it, because in some extreme cases the init script won't work.
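The resource rules in this block do not test whether Apache actually answers requests. A connection test can close that gap. The lines below are a sketch of what could be added inside the apache check, assuming Apache listens on port 80 of the local machine; the timeout and the restart limit are illustrative values, not recommendations.

```
    # Add inside "check process apache":
    # restart if no valid HTTP response arrives on port 80
    if failed host 127.0.0.1 port 80 protocol http with timeout 15 seconds then restart
    # stop trying after repeated restarts instead of looping forever
    if 5 restarts within 5 cycles then timeout
```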
For MySQL and Memcache the checks are a bit simpler:
check process mysql with pidfile /var/run/mysqld/mysqld.pid
    start program = "/etc/init.d/mysql start"
    stop program = "/etc/init.d/mysql stop"
    if failed port 3306 then restart
    if 10 restarts within 15 cycles then timeout

check process memcache with pidfile /var/run/memcached.pid
    start program = "/etc/init.d/memcached start"
    stop program = "/etc/init.d/memcached stop"
    if failed port 11211 protocol memcache then restart
    if 3 restarts within 5 cycles then timeout
SSHD will rarely fail, but still needs checking:
check process sshd with pidfile /var/run/sshd.pid
    start program = "/etc/init.d/ssh start"
    stop program = "/etc/init.d/ssh stop"
    if failed port 22 protocol ssh for 2 cycles then restart
    if 5 restarts within 10 cycles then timeout
These are the configurations I use most often. For other services you can generally find templates in the official Configuration examples.
Check remote hosts and services
There are cases where an essential service is not fully under your control. Think of a corporate website or an external SMTP server. If your own services (or users) depend on those, you can still monitor them with a local monit instance:
check host eos with address srv1.google.com
    if failed url http://mail.google.com with timeout 20 seconds for 2 cycles then alert
    if failed url http://www.google.com content == "Google Search" with timeout 20 seconds for 2 cycles then alert
This check will alert you not only if the site goes offline, but also if a database error renders the site dysfunctional or hackers deface it, because in both cases the expected content is missing from the page.
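The external SMTP server mentioned earlier can be watched in the same way, using a protocol test instead of a URL test. A minimal sketch, in which the check name mailrelay and the address smtp.example.com are placeholders for your provider's relay:

```
# Placeholder name and address; replace with your actual mail relay.
check host mailrelay with address smtp.example.com
    if failed port 25 protocol smtp with timeout 20 seconds for 2 cycles then alert
```

With the smtp protocol test monit does more than open the port: it also expects a proper SMTP greeting, so a relay that accepts connections but is otherwise broken should still trigger the alert.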
Check file contents or command output
Monitoring the content of files is another great monit feature. I use it to keep track of my RAID and even of remote ZFS systems.
This first rule looks into /proc/mdstat and will alert you if one of the drives is offline. You could achieve the same with mdadm directly, but I prefer to have all my monitoring in one system.
check file mdstat with path /proc/mdstat
    # "_U" matches a failed member such as [_U], "U_" one such as [U_]
    if match "_U" then alert
    if match "U_" then alert
The rule below looks at the contents of /tmp/FreeNAS_move.txt. This file is created by a cron job that simply saves the output of a remote command. If the command reports a problem with the ZFS array, or the file has not been updated for two days, you will be alerted by monit.
check file FreeNas_status with path /tmp/FreeNAS_move.txt
    if timestamp > 48 hours then alert
    if match "moved folder" then alert
    if match "DEGRADED" then alert
    if match "FAULTED" then alert
    if match "OFFLINE" then alert
    if match "UNAVAIL" then alert
    if match "REMOVED" then alert
Big brother: m/monit
No write-up on monit would be complete without mentioning m/monit. This handy but commercial program collects the output of multiple monit instances in one place. If you have multiple servers to manage, it might be worth the investment.
To configure your local monit instance for m/monit, add this statement to monitrc, replacing the user name, password and host with those of your own m/monit installation:
set mmonit https://report:firstname.lastname@example.org/collector
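Two further statements are commonly placed next to set mmonit. The sketch below rests on assumptions that you should check against the documentation of your versions: the event queue directory, the slot count, the address 192.0.2.10 for the m/monit host and the credentials are all placeholders.

```
# Buffer events on disk while m/monit is unreachable
# (directory and slot count are assumptions).
set eventqueue basedir /var/monit slots 1000

# monit's built-in web server, which m/monit connects to
# when you trigger service actions from its interface.
set httpd port 2812
    allow 192.0.2.10        # placeholder: address of the m/monit host
    allow admin:changeme    # placeholder credentials
```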
Conclusion and further resources
If you are looking to bring your server game to the next level, the keys are reliability and scalability. Monit can greatly help you with both.
I recommend the following sites, while diving deeper into the topic:
Disclaimer: I'm not affiliated with the provider of m/monit in any way and don't get a commission from them.