Advanced monit: Keep track of daemons, websites, RAIDs and partitions

Introduction

Are you already hosting your own mail or web server, and do you enjoy the flexibility, control and freedom self-hosting gives you? Besides personal advantages like better privacy and the power to customize everything, you can also offer your services to other people. Even though there is a large number of budget hosting companies, many customers are willing to pay for better support or the comfort of having you around for questions.

But what makes the difference between a hobbyist server operation for yourself and a professional hosting business you can charge for? I would argue that the key difference is reliability. If your personal blog or email goes offline for a day because you don’t have time to fix it, it’s only you who will suffer. On the other hand, when you’re responsible for a company’s email or web shop, they will lose real money and probably pick a different service soon.

Prerequisites

In this article I will introduce the advanced features of monit and its bigger brother, m/monit. While there are definitely newer and maybe more advanced monitoring solutions for different use cases, monit hits a sweet spot in terms of functionality for small and medium-sized server operations. I will assume you have already set up monit and use it to monitor the most basic system properties and a few processes. If not, just work through one of the many introductory articles on monit and then come back to this page.

Web services: the sum of their working parts

For users, the websites and services they consume on a daily basis are abstract things that magically descend from a fluffy cloud, sometimes called the internet. As system administrators we know that web services are the sum of many interconnected programs running on ordinary computers. As long as all these programs do their job, they go mostly unnoticed, but once any of them breaks, the whole service is rendered unusable. As admins it’s our job to keep everything running smoothly 99.9% or more of the time.

Before starting to configure your local monit agents, you should make a list of services and resources required to run your web services. It’s a good idea to start at the ‘bottom’ and work your way to the top:

  • CPU and memory are available
  • all physical hard drives in your RAID are working
  • partitions are mounted
  • enough space is available on partitions
  • critical system files are available and have the right permissions
  • all processes are running and respond to requests
  • website is available with correct content
  • scheduled jobs are successfully executed

Once you have a rough idea of what can (and will) go wrong, you can start adding appropriate rules to monit.

System checks

This class of checks monitors the whole system rather than an individual process. In many cases it will alert you to a problem even if it didn’t occur in a resource you are explicitly monitoring. A fairly complete system check looks like this:

check system hermes
  if loadavg (5min) > 3 then alert
  if cpu usage > 80% for 3 cycles then alert
  if memory > 90% for 3 cycles then alert
  if swap > 60% then alert

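This raises an alert whenever the load or CPU usage is dangerously high or the system is running out of memory and swap space. Instead of alert you could also use another action such as exec. An action like restart only makes sense for checks that define a start and stop program, like the process checks further down.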
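
As a minimal sketch of the exec action, the line below runs a command when the load stays high. It belongs inside the check system block above. The script path /usr/local/bin/notify-oncall.sh is a hypothetical placeholder, not something monit ships, and the threshold is only an example.

```
  # hypothetical script: replace with your own notification or cleanup command
  if loadavg (5min) > 6 for 3 cycles then exec "/usr/local/bin/notify-oncall.sh"
```

Requiring several cycles keeps a short spike from triggering the script.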

Filesystem checks

Once your system has enough CPU cycles and memory, it makes sense to add some data partitions to hold websites, emails or databases. It’s good practice to have a small system partition and a bigger data partition. That way your server will stay responsive, even after users fill up the data partition.

check filesystem root with path /
  if space usage > 90% then alert
  if inode usage > 50% then alert

check filesystem srv_raid with path /dev/md3
  start program = "/sbin/mdadm -A md3"
  if space usage > 90% then alert
  if inode usage > 50% then alert

The first block checks the root partition. The second block checks a RAID. If the RAID is not currently active, monit will assemble it. For this to work, md3 must be defined in /etc/mdadm.conf (on Debian-based systems, /etc/mdadm/mdadm.conf).

To make sure the partitions are mounted properly, use the directory statement:

check directory srv with path /srv/www
  start program = "/bin/mount -L srv"

The directory statement can also be used to monitor backups. If your backup script ever breaks down, the timestamp of the backup directory stops changing and you will be notified.

check directory zeus-backups with path /srv/backups/zeus
  if timestamp > 12 hours then alert

check directory athena-backups with path /srv/backups/athena
  if timestamp > 12 hours then alert
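
Backups are only one kind of scheduled job. For jobs that don't leave a fresh directory behind, a check program rule can run a small script and look at its exit status. This is a sketch under assumptions: the script /usr/local/bin/check_nightly_job.sh is hypothetical and is expected to exit with a non-zero status when the last run of the job failed. It also needs a monit version that supports check program.

```
# hypothetical helper script: exits non-zero if the last job run failed
check program nightly_job with path "/usr/local/bin/check_nightly_job.sh" with timeout 60 seconds
  if status != 0 then alert
```

Monit runs the script on its normal polling cycle, so the script should be cheap and only inspect the result of the job, not run the job itself.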

Check network devices

In some cases you will also need to take care of other network devices, like routers, access points or switches. Monit can do this as well. If the Wi-Fi ever goes offline, you will know before your clients do:

check host ap1 with address 192.168.1.30
  if failed icmp type echo count 5 with timeout 15 seconds for 3 cycles then alert

check host sw2 with address 192.168.1.35
  if failed icmp type echo count 5 with timeout 15 seconds for 3 cycles then alert

Check configuration files

If your server is ever hacked, you want to know as soon as possible, so it makes sense to monitor critical configuration files for changes.

check file sshd_config with path "/etc/ssh/sshd_config"
  if changed checksum then alert
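
The list at the beginning also mentions permissions. As a sketch, the same check file block can be extended with the lines below. The expected mode 644 and the owner root:root are assumptions based on common defaults, so compare them with what your distribution actually ships.

```
  # assumed defaults: mode 644, owned by root:root; adjust to your system
  if failed permission 644 then alert
  if failed uid root then alert
  if failed gid root then alert
```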

Process checks

Checking processes is at the very heart of monit. This includes MySQL, Apache, SSHD and many others. Let’s start with Apache:

check process apache with pidfile /var/run/apache2.pid
  depends on mysql
  depends on memcache
  start "/etc/init.d/apache2 start"
  stop  "/usr/bin/killall -9 apache2"
  if cpu > 60% for 2 cycles then alert
  if cpu > 80% for 5 cycles then restart
  if totalmem > 300.0 MB for 5 cycles then restart
  if children > 80 for 5 cycles then restart
  if loadavg(5min) greater than 10 for 8 cycles then restart

This checks CPU usage, memory, the number of child processes and the load. If Apache runs away with any of these for the given number of cycles, which is usually what makes a web server unresponsive, monit will automatically restart it. Note that I’m using killall -9 to stop it, because in some extreme cases the init script won’t work.
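
The rules above only look at resource usage. To test whether Apache actually answers requests, a connection test can be added to the same check process block. This is a sketch that assumes Apache listens for plain HTTP on port 80 of the local machine; the timeout and cycle counts are example values.

```
  # assumes plain HTTP on local port 80; use your own host and port if different
  if failed host 127.0.0.1 port 80 protocol http with timeout 10 seconds for 2 cycles then restart
  # stop retrying if restarts do not help, so monit does not loop forever
  if 3 restarts within 5 cycles then timeout
```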

For MySQL and Memcache the checks are a bit simpler:

check process mysql with pidfile /var/run/mysqld/mysqld.pid
  start program = "/etc/init.d/mysql start"
  stop program = "/etc/init.d/mysql stop"
  if failed port 3306 then restart
  if 10 restarts within 15 cycles then timeout

check process memcache with pidfile /var/run/memcached.pid
  start program = "/etc/init.d/memcached start"
  stop program = "/etc/init.d/memcached stop"
  if failed port 11211 protocol memcache then restart
  if 3 restarts within 5 cycles then timeout

SSHD will rarely fail, but still needs checking:

check process sshd with pidfile /var/run/sshd.pid
  start program = "/etc/init.d/ssh start"
  stop program = "/etc/init.d/ssh stop"
  if failed port 22 protocol ssh for 2 cycles then restart
  if 5 restarts within 10 cycles then timeout

These are the configurations I use most often. For other services you can generally find templates in the official Configuration examples.

Check remote hosts and services

There are cases where an essential service is not fully under your control. Think of a corporate website or an external SMTP server. If your own services (or users) depend on those, you can still monitor them with a local monit instance:

check host eos with address srv1.google.com
  if failed url http://mail.google.com
    with timeout 20 seconds for 2 cycles then alert

  if failed url http://www.google.com
    content == "Google Search"
    with timeout 20 seconds for 2 cycles then alert

This check will alert you not only if the site goes offline, but also if a database error renders the site dysfunctional or hackers deface it, because the expected content is then missing from the page.
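
An external SMTP server can be watched in the same way with a port test. A minimal sketch, where the check name upstream_smtp and the address smtp.example.com are placeholders for your own relay:

```
# placeholder name and address: point this at the mail relay you depend on
check host upstream_smtp with address smtp.example.com
  if failed port 25 protocol smtp with timeout 20 seconds for 2 cycles then alert
```

If your services hand off mail on a different port, such as 587, test that port instead.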

Check file contents or command output

Monitoring the content of files is another great monit feature. I use it to keep track of my RAID and even remote ZFS systems.

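This first rule looks into /proc/mdstat and will alert you if one of the drives is offline. A healthy array is listed as [UU]; a missing member shows up as an underscore, for example [U_] or [_U], so the rule matches both orders. You could achieve the same with mdadm directly, but I prefer to have all my monitoring in one system.

check file mdstat with path /proc/mdstat
  if match "_U" then alert
  if match "U_" then alert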

The rule below looks at the contents of /tmp/FreeNAS_move.txt. This file is created by a cron job that simply saves the output of a remote command. If the command reports a problem with the ZFS array, or the file has not been updated for two days, you will be alerted by monit.

check file FreeNas_status with path /tmp/FreeNAS_move.txt
  if timestamp > 48 hours then alert
  if match "moved folder" then alert
  if match "DEGRADED" then alert
  if match "FAULTED" then alert
  if match "OFFLINE" then alert
  if match "UNAVAIL" then alert
  if match "REMOVED" then alert

Big brother: m/monit

No write-up on monit would be complete without mentioning m/monit. This handy but commercial program collects the status of multiple monit instances in one place. If you have multiple servers to manage, it might be worth the investment.

To configure your local monit instance for m/monit, add this statement to monitrc:

set mmonit https://report:password@mymmonit.mydomain.com/collector
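
If the m/monit server itself is unreachable for a while, monit can buffer events on disk and deliver them later. The following is a minimal sketch of such an event queue; the directory /var/lib/monit/events and the slot count are assumptions, so pick values that fit your system.

```
# assumed location and size: events are stored here until they can be delivered
set eventqueue
  basedir /var/lib/monit/events
  slots 100
```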

Conclusion and further resources

If you are looking to bring your server game to the next level, the keys are reliability and scalability. Monit can greatly help you with both.

I recommend the following sites for diving deeper into the topic:

Monit documentation

More configuration examples

Disclaimer: I’m not affiliated with the provider of m/monit in any way and don’t get a commission from them.