An Introduction To SolarWinds Best Practices – Custom Alerts
Even outside of the IT world, the first thing that many people think of when they hear the term ‘monitoring’ is alerts. Whether it be a flashing light, a siren, or in the case of SolarWinds®, an email in your inbox, alerts are an important part of a monitoring solution that tells you one thing: “There is something urgent you need to do.”
Notice that key word there: ‘DO’. When I work with customers they often tell me the same thing: “We want visibility over everything”, which may absolutely be true, but can often lead to massive scope creep within the monitoring solution. The better way to think about your monitoring, especially as you are getting started, is “What do I need to do?”
Let’s look at a quick example: When a server goes down you know that you need to bring it back online. How would you do this?
You could:
- Wait to see if it restarts on its own.
- If it’s physical, you can go and physically turn it off and on.
- If it’s virtual you can go to the hypervisor and reset it that way.
All of these are actions that can start to influence how you think your monitoring system should be configured. For each of these situations, let’s look at some things that could make your life a little easier:
- You can add a delay before the alert fires so that any devices that simply restart are ignored.
- If it’s physical, you can include details on the server’s location within the alert email, to help the engineer with their troubleshooting. These can be added to nodes via the use of custom properties.
- If it’s virtual, you can specify which hypervisor is hosting the machine, and even give a link to the management URL for quick access.
As you can see here, every decision we’ve made in SolarWinds is to bolster the activities that the humans on one end of this process need to carry out.
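To make the idea concrete, here is a minimal Python sketch of an alert message that changes depending on what the engineer will need to do. It is not SolarWinds code: the `Node` class, the `build_alert_body` function and the property names `Location`, `Hypervisor` and `ManagementUrl` are all hypothetical, chosen only to mirror the physical and virtual cases above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a monitored node and its custom properties."""
    name: str
    is_virtual: bool
    custom_properties: dict = field(default_factory=dict)


def build_alert_body(node: Node) -> str:
    """Return email text that tells the engineer what to do next."""
    lines = [f"{node.name} is down."]
    props = node.custom_properties
    if node.is_virtual:
        # Virtual: point the engineer at the hypervisor hosting the machine.
        lines.append(f"Hosted on hypervisor: {props.get('Hypervisor', 'unknown')}")
        if "ManagementUrl" in props:
            lines.append(f"Reset it here: {props['ManagementUrl']}")
    else:
        # Physical: say where the server is so someone can go and power-cycle it.
        lines.append(f"Physical location: {props.get('Location', 'unknown')}")
    return "\n".join(lines)


# Example data is invented purely for illustration.
physical = Node("FILESRV01", False, {"Location": "Rack 12, Comms Room B"})
virtual = Node(
    "WEB01",
    True,
    {"Hypervisor": "ESX-HOST-03", "ManagementUrl": "https://vcenter.example.com"},
)

print(build_alert_body(physical))
print(build_alert_body(virtual))
```

The point of the sketch is the branching: the same "node down" event produces a different message depending on the action the recipient has to take.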
The Condition:
The ‘condition’ is broken down further into the ‘Context’, ‘Scope’ and the ‘Trigger’.
The Context: This is the type of device that is going to trigger the alert, and can only be a single type. This could include ‘Node’, ‘Interface’, ‘Group’, or any other type of element. This will change what information you can use in the trigger and actions later on, so is very important.
The Scope: This section allows you to filter to a subset of devices that are in the context. For example, only Cisco nodes, or only nodes in a certain site. Not only does this make it easier to avoid alert noise, but also has less of a performance hit when you have a large platform.
The Trigger: This section defines exactly what causes the alert to fire. This could be because the status changed to ‘Down’ or ‘Warning’ or because a specific event fired. Again, remember that these conditions shouldn’t only be what triggers the alert to fire, but they should be conditions that would cause a human to have to get involved.
The Action: This section defines the response that is sent out when the trigger fires. This could be an email, which is most common, an SMS message, or a message into a Microsoft Teams channel. The key thing here is that it should contain all the information required to help the human carry out their response to this alert.
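One way to picture how these four parts fit together is as a small data structure. The Python below is only a conceptual sketch, not how SolarWinds stores or evaluates alerts: the `AlertDefinition` class, its field names and the dictionary keys (`type`, `vendor`, `status`, `name`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AlertDefinition:
    context: str                     # a single object type, e.g. "Node"
    scope: Callable[[dict], bool]    # which objects of that type to consider
    trigger: Callable[[dict], bool]  # what causes the alert to fire
    action: Callable[[dict], str]    # what gets sent to the human

    def evaluate(self, objects: list[dict]) -> list[str]:
        """Return one message per object that is in context, in scope and triggered."""
        messages = []
        for obj in objects:
            if obj.get("type") != self.context:
                continue  # wrong kind of object for this alert
            if not self.scope(obj):
                continue  # filtered out before the trigger is even checked
            if self.trigger(obj):
                messages.append(self.action(obj))
        return messages


# Hypothetical alert: Cisco nodes only, firing when the status is Down.
cisco_down = AlertDefinition(
    context="Node",
    scope=lambda o: o.get("vendor") == "Cisco",
    trigger=lambda o: o.get("status") == "Down",
    action=lambda o: f"Cisco node {o['name']} is Down",
)

inventory = [
    {"type": "Node", "name": "core-sw-01", "vendor": "Cisco", "status": "Down"},
    {"type": "Node", "name": "fw-01", "vendor": "Palo Alto", "status": "Down"},
    {"type": "Interface", "name": "Gi0/1", "vendor": "Cisco", "status": "Down"},
]

print(cisco_down.evaluate(inventory))
```

Only the Cisco node produces a message: the firewall is outside the scope, and the interface is outside the context.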
With that covered, let’s have a look at some out-of-the-box alerts. Below is the Trigger Condition page for this out-of-the-box alert. The first thing you’ll notice is that you can’t edit it, as system alerts are locked for editing. One of the first things you should do when you purchase SolarWinds is to ‘Duplicate and Edit’ any system alerts so that they are editable, and disable the originals.
Let’s point out the problems, going through each numbered section:
- The Context: No problems here.
- The Scope: Here you can see there is no scope defined, so this alert will fire for all monitored nodes. If this seems fine, ask yourself: would your actions when a Domain Controller goes down be the same as when a firewall goes down? Because of this, your alerts should have a clear scope to help with troubleshooting.
- The Trigger: While ‘Status is equal to down’ is fine, we have to consider that we often see devices go down for a short period due to network blips or restarts and these don’t necessarily warrant a human action. What we can do is add a delay to the trigger, using the section at the very bottom. This way it will only alert us if the device goes down and STAYS down.
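The logic behind a trigger delay can be sketched in a few lines of Python. This is an illustration of the idea only, not SolarWinds’ implementation: the `should_fire` function, the sample format and the five-minute delay are all assumptions.

```python
from datetime import datetime, timedelta


def should_fire(samples, delay, now):
    """Return True only if the node has been continuously Down for at least `delay`.

    `samples` is a list of (timestamp, status) tuples ordered oldest to newest.
    """
    down_since = None
    for timestamp, status in samples:
        if status == "Down":
            if down_since is None:
                down_since = timestamp  # start of the current outage
        else:
            down_since = None  # the node recovered, so reset the clock
    return down_since is not None and now - down_since >= delay


t0 = datetime(2024, 1, 1, 9, 0)
minute = timedelta(minutes=1)
delay = timedelta(minutes=5)

# A short blip: down for one poll, then back up again.
blip = [(t0, "Up"), (t0 + minute, "Down"), (t0 + 2 * minute, "Up")]

# A real outage: goes down and stays down.
outage = [
    (t0, "Up"),
    (t0 + minute, "Down"),
    (t0 + 2 * minute, "Down"),
    (t0 + 6 * minute, "Down"),
]

print(should_fire(blip, delay, t0 + 10 * minute))
print(should_fire(outage[:3], delay, t0 + 2 * minute))
print(should_fire(outage, delay, t0 + 6 * minute))
```

The blip never fires, and the outage fires only once it has lasted the full five minutes, which is the behaviour you want from a delayed trigger.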
Let’s have a look at a better Condition from one of the alerts in the Prosperon Best Practices. You can see that this is a very specific alert, designed to tell our network engineers when an active Cisco device has high CPU usage that has persisted for longer than 10 minutes:
The Action:
Next up, we need to talk about alert actions, as they are the part of the alert that is actually going to TELL you what is going on and, hopefully, how to respond. There are many types of alert actions you can choose, but we are going to focus on the most common, which is an email. Most of these tips, however, will be relevant for any information alert.
First, let’s look at an example of a poor, out-of-the-box alert. As you can see, it does have some information such as the node name, the status of the node and a link to the node itself but it has many problems. First, we have no concept about what this device is. Sure, there is a chance that you just recognize the name of the node and can identify it that way, but especially for people who are newer to the organisation that won’t be the case. Second, we can see that there is a duplication of information across the subject and body, with no increase in detail. Finally, there is no real way to visually distinguish this alert from the many others you may be receiving on a daily basis.
Webinar On-Demand: SolarWinds Best Practices - Out-of-the-Box Vs. Custom Alerts
Conclusion
Tackling the issue of alerting can be a daunting task, as you must find and tread the line between receiving too many emails and missing something that requires an immediate response. The first step will always be starting small, with a small set of alerts that cover your most critical nodes. By using the scope, you can easily specify these devices to make sure that you don’t begin receiving alert noise immediately. Following this, work out the exact criteria that will trigger each alert exactly when you need to see it, avoiding false positives as much as possible. Finally, making sure your trigger actions contain all the correct information will help speed up the resolution.
If you have any further questions about alerting, please don’t hesitate to reach out to us here at Prosperon for a conversation with one of our SolarWinds Experts!
Marlie Fancourt
Presales Manager
Marlie is the SolarWinds Presales Manager at Prosperon Networks. As a Pre Sales Engineer, Marlie demonstrates the benefits of SolarWinds technology and helps customers understand the value SolarWinds brings to organisations.