OK to disable, or set really high, notification thresholds?
I have website monitoring that alerts us when sites are down. The sites are never affected; they are always up. Customers have never complained in all the years this has been happening.
So I am wondering if it's OK to increase the limits even more (I have already increased them a lot), or even disable them. What would others suggest? I don't want all these alerts when they do not affect the running of the websites. I would want alerts if usage got so high that websites would be affected, like too slow or down, but I don't know what those limits would be. You'd think 100%, but they all work fine at 140% or whatever.
Here is a typical couple:
Your Linode, has exceeded the notification threshold (95) for CPU Usage by averaging 127.1% for the last 2 hours. The dashboard for this specific Linode is located at:
Your Linode, has exceeded the notification threshold (10) for outbound traffic rate by averaging 34.41 Mb/s for the last 2 hours.
Seeing as the hosted websites still work fine when these limits are passed, I wonder if I should increase them even more. I don't know which limits to be concerned about though, as the sites all still work, and it could be my backups, which run in the early hours of the morning, so nothing to be concerned about. It would be good if I could put a schedule on the alerts.
If I disable them, my website monitoring system will still alert me when websites are down, which is the key issue for me.
4 Replies
A few things come to mind:
1) Disable the notifications and just take the risk that if something happens then you won't know about it immediately.
2) Don't disable them at all (and make them a bit more sensitive too) but use better filtering at your end (email client, etc). For example, delete emails that come during your backup event/schedule, but allow them at other times.
3) Use a different kind of monitoring method, that allows more fine-grained configuration of your alerts (and monitor what you want, instead of the generic ones), by using a monitoring system like Nagios, Icinga2, etc.
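Option 2 could be sketched roughly like this in Python. The backup window (02:00 to 05:00) and the function names are assumptions made up for illustration; how you get the alert's timestamp out of your mail client is left out.

```python
from datetime import datetime, time

# Hypothetical backup window; change this to match your own backup schedule.
BACKUP_START = time(2, 0)
BACKUP_END = time(5, 0)


def in_backup_window(ts: datetime,
                     start: time = BACKUP_START,
                     end: time = BACKUP_END) -> bool:
    """Return True if ts falls inside the window [start, end)."""
    t = ts.time()
    if start <= end:
        # Window within a single day, e.g. 02:00 to 05:00.
        return start <= t < end
    # Window that crosses midnight, e.g. 23:00 to 01:00.
    return t >= start or t < end


def should_keep_alert(ts: datetime) -> bool:
    """Keep alerts that arrive outside the backup window, drop the rest."""
    return not in_backup_window(ts)
```

An alert received at 03:00 would be dropped and one received at midday would be kept. The catch is that a real problem during the backup window is silenced as well.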
Assume you have 2 processors: 127% is really just about 64% of the total capacity (127 / 2). You can safely lift your values if this is within your acceptable tolerances.
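To put a reported figure on a 0 to 100 scale, divide it by the number of cores. A small Python sketch, assuming the reported CPU percentage is summed across cores (so 2 cores max out at 200%); the function names are made up for illustration:

```python
def normalized_cpu_percent(reported_percent: float, cores: int) -> float:
    """Convert a summed-across-cores CPU figure to a 0-100 scale."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return reported_percent / cores


def threshold_for_cores(per_core_percent: float, cores: int) -> float:
    """Threshold to set so the alert fires at the given average per-core load."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return per_core_percent * cores


# The alert quoted above: 127.1% reported, assuming 2 cores.
print(normalized_cpu_percent(127.1, 2))

# Threshold that would fire at an average of 90% per core on 2 cores.
print(threshold_for_cores(90, 2))
```

On that assumption, the current threshold of 95 fires when the machine is less than half loaded, which would explain why the sites keep working when it is exceeded.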
@IfThenElse I do have Nagios, but for some reason it stopped emailing me alerts and I haven't managed to fix it yet; I can't find the issue! If I get that working, it would give me more control over monitoring and alerts, so I shall have another look at getting it working.
Thanks both