I don't want to go off on a rant here, but...
Let's get this out of the way
Metrics != Monitoring
Metrics and monitoring are intimately related, which means that unfortunately they are often mistaken for one another, but they really serve different purposes. Metrics are the historical record and show how we got here. Monitoring is the status of everything at this moment.
Monitoring should be telling you what is happening in the environment right now - is everything healthy? Are there warnings or errors for any service? Has disk space utilization for a server crossed a threshold?
Metrics are collections of datapoints over time and enable you to see trends and (hopefully) anomalies in that data. I say “hopefully” for anomalies not because I’m hoping to see them, but because I hope that the data collected is complete enough and displayed appropriately to make them visible.
Metrics are good
Metrics are indispensable for recognizing changes or trends in the environment over time and making predictions about the future for things like scaling and capacity management. They can also be critical in determining failure points in the past when researching what caused an issue, for showing areas where more monitoring is needed, or for adjusting thresholds in monitoring.
They also make for nice graphs, which make the trends easier to comprehend.
On the other hand...
They’re not so good at telling you what is happening at this very moment. Data is streamed into the metric collection systems and for various reasons the streams can be delayed or even drop datapoints. Ideally this never happens, of course, but loss of a datapoint or two is almost never critical to seeing the overall trends.
It’s desirable, critical even, to collect as much metric data from the environment as possible. More data means more accurate trends and more information for diagnosis of issues. The desire to see that data in graph form, however, can backfire as the graph will be rendered useless when overwhelmed with too much data.
So if all the data is going into the metrics systems, what is monitoring good for?
As I said earlier, monitoring lives in the moment, and is concerned with the health of the environment right now. Think of it as being the EKG machine for our systems. Sure, the EKG has a couple of graphs there for a quick look at the last few minutes, but the real value in it is showing what the patient’s vital signs are right now, and alerting everyone when those vital signs exceed certain thresholds or, god forbid, flatline.
Like the EKG, a monitoring system’s real value is giving a window into the current vital signs of the environment and calling a Code Blue when something goes wrong. It doesn’t need a wall of displays showing graphs; it surfaces the important issues as they happen.
So what does that mean?
Just to drive that nail home, monitoring does not replace metrics, nor do metrics replace monitoring. They are fraternal twins, not identical. Monitoring, in fact, relies on metrics and should be generating metrics of its own. But metric data is only part of what a monitoring system needs. Not every issue can be found by watching a stream of data over time, and some things that require attention don’t lend themselves to these nice streams of datapoints.
Monitoring is driven by events and by measurements.
Your ideas are intriguing to me and I wish to subscribe to your newsletter
At this point I’m hoping you’re thinking to yourself that this is making sense, and that you want to know more about monitoring and what it means for us.
In the simplest terms, monitoring is both proactive and reactive.
It’s proactive in the health checks that it does. These can take a number of forms and are initiated by the monitoring server making remote checks, by monitoring agents running scheduled checks and reporting back to the server, or by monitoring agents performing checks as directed by the server. If you can determine an up/down or passing/failing status for a process, system, or piece of hardware, or take a measurement with a threshold that shouldn’t be crossed, you can monitor for it, as long as there’s a way to query that data. Ping checks are a simple example of an up/down check - the server pings a system and sends an alert if it doesn’t respond. Threshold measurements would be things like disk space utilization, memory utilization, etc.
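Here is a minimal Python sketch of both kinds of check, with a few assumptions. The function names and the 80/90 percent thresholds are made up for illustration. The up/down check also uses a TCP connect as a stand-in for a real ICMP ping, since raw ICMP usually needs elevated privileges.

```python
import shutil
import socket


def classify(value, warn, crit):
    """Map a measurement onto a status using two thresholds."""
    if value >= crit:
        return "CRITICAL"
    if value >= warn:
        return "WARNING"
    return "OK"


def check_disk(path="/", warn_pct=80.0, crit_pct=90.0):
    """Threshold check: percentage of disk space used at `path`."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total
    return classify(used_pct, warn_pct, crit_pct), used_pct


def check_tcp(host, port, timeout=2.0):
    """Up/down check: can we open a TCP connection to host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "UP"
    except OSError:
        return "DOWN"
```

A scheduler on the server or an agent would run these on an interval and turn anything other than OK or UP into an event.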
Reactive events are initiated from outside the monitoring system itself. These are typically not ongoing health checks or threshold monitors, but a response to an error condition somewhere that is then captured by the monitoring system and generates an alert. This could be a function in an application that sends an event to the monitoring system when certain conditions are met (or not met). It could be a process watching an application log and sending an event when an error is found in the log. It could even be a manually generated event.
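The log-watching case might look something like the sketch below. The `Event` shape, the `scan_log_lines` name, and the ERROR/FATAL pattern are all assumptions for illustration. A real watcher would tail the file and hand each event to whatever intake the monitoring system provides.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern; real logs will need their own.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL)\b")


@dataclass
class Event:
    source: str    # where the event came from, e.g. a log file name
    severity: str  # the keyword that matched
    message: str   # the offending log line


def scan_log_lines(lines, source):
    """Yield an Event for every line that matches the error pattern."""
    for line in lines:
        match = ERROR_PATTERN.search(line)
        if match:
            yield Event(source=source, severity=match.group(1),
                        message=line.strip())


if __name__ == "__main__":
    sample = ["INFO started", "ERROR db timeout", "FATAL out of memory"]
    for event in scan_log_lines(sample, "app.log"):
        print(event)
```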
The central purpose of a monitoring system is to manage events and generate alerts. It doesn’t matter how well you monitor anything if you never send an alert when something is broken. Ideally everything is humming along perfectly and you never get an alert, but if something does go wrong we want to know about it first - if a customer notifies us about an issue that we’re monitoring for before monitoring has alerted us, then we’ve failed.
But alerting should not be done with a shotgun approach. Too many alerts are just as bad as too few. Over-alerting means that people are getting alerts for things which are not critical, are not really issues, or are not issues for which action needs to be taken. This leads to alert fatigue and increases the likelihood that alerts will be missed or ignored.
Any time an alert is sent it must be specific, concrete, and actionable.
Specific - the alert must be for a specific process, threshold, condition, or event.
Concrete - the alert must be well-defined and refer to an actual problem.
Actionable - most of all, if someone is alerted, and more specifically if someone is paged, that alert must be actionable. If there is nothing for the on-call person to do in response to the page, it should not have been sent. A page is a call to action and must be treated as such. Paging for informational purposes only is counter-productive and deadly to morale.
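One way to hold alerts to those three rules is to check them before anything gets sent. This is only a sketch under my own assumptions: the `Alert` fields, the `route` function, and the idea of using a runbook link as the test for "actionable" are illustrative, not a feature of any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    target: str     # specific: the process, threshold, condition, or event
    condition: str  # concrete: the actual problem that was observed
    runbook: str    # actionable: what the on-call person should do


def route(alert):
    """Decide what to do with an alert: 'page', 'log', or 'reject'."""
    if not alert.target or not alert.condition:
        # Not specific or not concrete: this is a badly defined alert.
        return "reject"
    if not alert.runbook:
        # Nothing for a human to do, so record it but do not page.
        return "log"
    return "page"
```

With this, `route(Alert("db01 /var", "disk 95% full", "runbooks/disk-full"))` pages someone, while the same alert with an empty runbook is only logged.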
And all alerting must be a function of the monitoring system, a central master where all alerts are tracked and managed. When you get paged you should know exactly where to go to ACK the alert and get the necessary details. Anything outside of the monitoring server that wants to send alerts should do so via the monitoring system.
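To make the "one place to ACK" idea concrete, here is a toy in-memory registry. The `AlertRegistry` class and its method names are hypothetical. A real system would persist this state and handle notification, but the shape is the same: every alert gets an ID in one place, and acknowledgement happens there.

```python
import itertools


class AlertRegistry:
    """Toy central store: every alert is raised and acknowledged here."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._alerts = {}

    def raise_alert(self, source, summary):
        """Record a new alert from any source and return its ID."""
        alert_id = next(self._ids)
        self._alerts[alert_id] = {
            "source": source,
            "summary": summary,
            "acked_by": None,
        }
        return alert_id

    def ack(self, alert_id, who):
        """Mark an alert as acknowledged by `who`."""
        self._alerts[alert_id]["acked_by"] = who

    def unacked(self):
        """IDs of alerts nobody has acknowledged yet."""
        return [i for i, a in self._alerts.items() if a["acked_by"] is None]
```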
One ring to rule them all
The central monitoring system is then the single-pane-of-glass view into all monitoring events and alerts, but it absolutely must not be a single point of failure. The monitoring system must have a high-availability configuration so that it is available at all times, and must be scaled to handle the peak event load without faltering. This is our view into the environment. If it fails, we’re blind.
As the view into all events and alerting, the dashboard must also be connected to other related systems. Events should link to the relevant metrics, documentation and tickets wherever they're captured.