Automate Remediation for High CPU/Memory Processes and Services on Windows
We have all had an application or service that runs away with itself and consumes all of the compute resource on a server, where you know that simply restarting the service or killing the process will restore operation. For some, this is the precursor IT fix before rebooting the machine, as that invariably gets the right result. If you have a slow-running machine or application and you happen to be on the desktop of that machine, you are going to look at Task Manager to see what is consuming all of the compute resource.
If you are using a monitoring solution, such as SolarWinds Server and Application Monitor (SAM), you are no doubt monitoring those key servers. Now the big question: do you know with absolute certainty (notice how this is in bold?!) that if your server continuously consumes more than, say, 95% of its CPU or memory, stopping the process with the highest resource usage will resolve the issue? This is a big if. In case you have not already picked up on it, this post gives you a method within Orion itself to automate that operation, but you must be 100% certain before applying the capability outlined here.
If you are at that point, then having Orion identify the issue and fix it, so that all you need to see is the alert notification confirming what it did, would make you kinda happy, right?
OK, I have given enough of a warning about configuring this function in your Orion, so let's get to it.
First of all you need to be monitoring your servers within Orion SAM, which will provide CPU and memory utilisation for the whole system.
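Before building the alert, it can help to preview which monitored nodes currently sit above the threshold you have in mind. The sketch below only composes a SWQL query against Orion.Nodes; the function name Get-HighMemoryNodeQuery and the 95% default are illustrative assumptions, not part of the original setup. With a live connection from the Orion SDK, the resulting string would be passed to Get-SwisData.

[code]# Hypothetical helper: builds a SWQL query listing nodes at or above a memory threshold.
function Get-HighMemoryNodeQuery {
    param(
        [ValidateRange(1, 100)]
        [int]$ThresholdPercent = 95
    )
    # A trailing + continues the expression onto the next line
    "SELECT NodeID, Caption, CPULoad, PercentMemoryUsed " +
    "FROM Orion.Nodes " +
    "WHERE PercentMemoryUsed >= $ThresholdPercent " +
    "ORDER BY PercentMemoryUsed DESC"
}

$query = Get-HighMemoryNodeQuery -ThresholdPercent 95
$query
# With a connection opened via Connect-Swis, something like:
#   Get-SwisData $swis $query[/code]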
Now let’s look at these actions in detail:
Action 1: Execute program : SolarWinds.APM.RealTimeProcessPoller.exe
This action is of type “Execute external program”. It runs the out-of-the-box SolarWinds® tool “SolarWinds.APM.RealTimeProcessPoller.exe” to poll in real time for the top 5 processes sorted by “Physical Memory” and insert this information into the “Alert Notes” area of the triggered alert object. This is the actual command run by the action:
[code]APM\SolarWinds.APM.RealTimeProcessPoller.exe -n=${NodeID} -alert=${N=Alerting;M=AlertDefID} -count=5 -sort=PhysicalMemory -timeout=120[/code]
The parameters available for this SolarWinds tool are listed here; further details can be found in the related SolarWinds article.
-n | ID of the node (NodeID) to poll. |
-count | The number of processes to show. |
-sort | The criteria used to select top processes, for example PhysicalMemory as in the command above. |
-timeout | Timeout for polling, in seconds. |
-alert | The AlertDefID of the associated triggered alert. If this argument is provided, the alert notes are updated with the results from polling. |
-activeObject | The ActiveObject property of the associated triggered alert. If this argument is not provided, NodeID is used. |
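To make the relationship between these parameters and the command line concrete, here is a minimal sketch that assembles the argument string from the flags in the table. The function New-ProcessPollerArguments and its parameter names are hypothetical; only the flags themselves (-n, -alert, -count, -sort, -timeout) come from the command shown earlier.

[code]# Hypothetical helper: composes the argument string for the process poller.
function New-ProcessPollerArguments {
    param(
        [Parameter(Mandatory = $true)]
        [string]$NodeId,
        [string]$AlertDefId,
        [int]$Count = 5,
        [string]$Sort = 'PhysicalMemory',
        [int]$TimeoutSeconds = 120
    )
    $parts = @("-n=$NodeId")
    # -alert is optional; without it the alert notes are not updated
    if ($AlertDefId) { $parts += "-alert=$AlertDefId" }
    $parts += "-count=$Count", "-sort=$Sort", "-timeout=$TimeoutSeconds"
    $parts -join ' '
}

# Single quotes keep the Orion macros literal so Orion can substitute them later
$pollerArgs = New-ProcessPollerArguments -NodeId '${NodeID}' -AlertDefId '${N=Alerting;M=AlertDefID}'
$pollerArgs[/code]

With the defaults above, the result matches the arguments used in the alert action, which makes it easy to try a different -count or -timeout without retyping the whole line.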
Action 2: Send an Email
This e-mail notification action is executed 2 minutes after the first action, to give the first action enough time to pull the top processes and update the Alert Notes.
Email Subject: Memory load on ${Caption} is currently ${PercentMemoryUsed}
Email Message:
The Memory on ${Caption} is currently running at ${PercentMemoryUsed}. The top 5 processes running at the time of this poll are listed below:
[code]${N=Alerting;M=Notes}
For more information click the link below.
${NodeDetailsURL}[/code]
These settings send the email notification to the relevant IT contact, reporting the high memory usage on the server along with the list of the top 5 processes consuming the most memory. You may wish to stop here: with this information in the external notification and visible on the Alert Details page in Orion, you may have enough to take action manually.
If you want to automate, then buckle up..
Action 3: Execute program : KillWindowsProcess.ps1
This action is of type “Execute external program”. It runs a custom PowerShell script that reads the top process from the alert notes and kills that process on the server that triggered the alert. This is the actual command run by the action:
[code]C:\Windows\SysWOW64\WindowsPowerShell\v1.0\powershell.exe "C:\Webinar_Scripts\KillWindowsProcess.ps1" "'${N=SwisEntity;M=Caption}'" "'${N=Alerting;M=AlertObjectID}'"[/code]
As shown in this command, the custom script “KillWindowsProcess.ps1” lives in the folder “C:\Webinar_Scripts” on the primary SolarWinds server. The script takes two parameters: the first is the server name and the second is the alert object ID.
The action should be configured to run under a domain account that can connect to the destination server and kill processes on it.
The “KillWindowsProcess.ps1” script connects to the SolarWinds API to pull the top process information from the alert note, then checks whether that process is in the protected process list. Defining a protected process list is recommended to prevent critical processes from being killed. We do NOT want to render a server inoperable! If the process is not in the protected list, the script tries to kill it on the destination server. The script writes a daily log file to the “C:\Webinar_Scripts\logs” folder, which can be reviewed to diagnose faults and audit the actions performed. Here is the content of the script:
[code]# Logging details
$WORKING_FOLDER = "C:\Webinar_Scripts\"
$LOGFILE = $WORKING_FOLDER + "logs\action_trigger_log_" + $(get-date -f yyyy-MM-dd) + ".txt"

# List of protected processes that should not be terminated
# (the SQL Server engine runs as sqlservr.exe)
$PROTECTED_PROCESSES = @('sqlservr.exe', 'WmiPrvSE.exe', 'svchost.exe')

# Variables
$targetProcessName = ""
$targetHostName = $args[0]
$AlertObjectID = $args[1]
$time = get-date
$message = "$Time - Started action for Hostname: $targetHostName and Alert Object ID: $AlertObjectID"
$message >> $LOGFILE

# SolarWinds connection properties
# Connect-Swis and Get-SwisData come from the Orion SDK, which must be loaded in this session
$hostname = "SOLARWINDS_HOSTNAME"
$username = "SOLARWINDS_API_USER"
$password = "SOLARWINDS_API_PASSWORD"
$swis = Connect-Swis -Hostname $hostname -username $username -password $password

# Get the alert note for the provided alert object
$alertQuery = "SELECT TOP 1 AlertNote FROM Orion.AlertObjects WHERE AlertObjectID=" + $AlertObjectID
$alertData = Get-SwisData $swis $alertQuery
$noteLine = $alertData.Split([Environment]::NewLine,[System.StringSplitOptions]::RemoveEmptyEntries)
$noteLineCount = $noteLine.Count

# Parse the alert note: row 0 is the tab-separated header, row 1 is the top process
if ($noteLineCount -gt 1)
{
    $headRow = $noteLine[0]
    $topRow = $noteLine[1]
    $headItems = $headRow.Split("`t",[System.StringSplitOptions]::RemoveEmptyEntries)
    $headerItem = $headItems[0]
    if ($headerItem -eq 'Name')
    {
        $topProcItems = $topRow.Split("`t",[System.StringSplitOptions]::RemoveEmptyEntries)
        $targetProcessName = $topProcItems[0]
    }
}

if (($targetProcessName -ne "") -and ($targetProcessName -ne $null))
{
    $time = get-date
    $message = "$Time - Identified top process is : $targetProcessName for Alert Object ID: $AlertObjectID"
    $message >> $LOGFILE

    # Check the top process against the protected list
    $IsProcessProtected = 0
    forEach ($p in $PROTECTED_PROCESSES)
    {
        if ($targetProcessName -eq $p)
        {
            $IsProcessProtected = 1
        }
    }

    if ($IsProcessProtected -eq 0)
    {
        # Find every instance of the process on the remote server via WMI
        $Processes = Get-WmiObject -Class Win32_Process -ComputerName $targetHostName -Filter "name='$targetProcessName'"
        $foundProcessCount = $Processes.Count
        $foundProcessName = $Processes.Name
        # A single match may not expose Count, so also compare the name
        if (($foundProcessCount -gt 0) -or ($foundProcessName -eq $targetProcessName))
        {
            foreach ($process in $Processes)
            {
                $returnval = $process.terminate()
                $processid = $process.handle
                $time = get-date
                if ($returnval.returnvalue -eq 0)
                {
                    $message = "$Time - The process $targetProcessName `($processid`) terminated successfully"
                }
                else
                {
                    $message = "$Time - The process $targetProcessName `($processid`) termination has some problems"
                }
                $message >> $LOGFILE
            }
        }
        else
        {
            $time = get-date
            $message = "$Time - The process $targetProcessName cannot be found on Hostname: $targetHostName"
            $message >> $LOGFILE
        }
    }
    else
    {
        $time = get-date
        $message = "$Time - The process $targetProcessName is in the protected list, it will NOT be terminated"
        $message >> $LOGFILE
    }
}
else
{
    $time = get-date
    $message = "$Time - Unable to find target process for Alert Object ID: $AlertObjectID"
    $message >> $LOGFILE
}

$time = get-date
$message = "$Time - Completed action for Hostname: $targetHostName and Alert Object ID: $AlertObjectID"
$message >> $LOGFILE[/code]
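Since this script kills processes, it is worth exercising the parsing and protected-list logic on its own before letting it loose on a server. The sketch below is one way to do that offline. The function names Get-TopProcessName and Test-ProtectedProcess are hypothetical, and the sample alert note is invented for illustration: the only thing taken from the script is that the note is tab-separated with a header row whose first column is Name.

[code]# Hypothetical helper: returns the first column of the first data row, or $null.
function Get-TopProcessName {
    param([string]$AlertNote)
    if ([string]::IsNullOrEmpty($AlertNote)) { return $null }
    # Split on CR/LF characters and drop blank lines
    $lines = @($AlertNote -split "[\r\n]+" | Where-Object { $_.Trim() -ne '' })
    if ($lines.Count -lt 2) { return $null }
    # Only trust the note if the header row starts with 'Name'
    $header = @($lines[0] -split "`t")[0].Trim()
    if ($header -ne 'Name') { return $null }
    return @($lines[1] -split "`t")[0].Trim()
}

# Hypothetical helper: -contains compares strings case-insensitively by default.
function Test-ProtectedProcess {
    param([string]$ProcessName, [string[]]$ProtectedList)
    return ($ProtectedList -contains $ProcessName)
}

# Invented sample note; the real column layout after 'Name' may differ
$sampleNote = "Name`tPID`tPhysical Memory`r`nnotepad.exe`t4321`t512 MB`r`nsvchost.exe`t880`t200 MB"
$protected = 'WmiPrvSE.exe', 'svchost.exe'

$top = Get-TopProcessName -AlertNote $sampleNote
$top                                                   # notepad.exe
Test-ProtectedProcess -ProcessName $top -ProtectedList $protected           # False
Test-ProtectedProcess -ProcessName 'SVCHOST.EXE' -ProtectedList $protected  # True[/code]

Returning $null for anything that does not look like the expected note means a malformed or empty alert note results in no process being selected, rather than an arbitrary one.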
This is an example of using the Alerting capabilities to take remediation action to resolve a problem, which I hope you find useful, but please use with care!
Training Course: SolarWinds Training Courses
Mark Roberts
Technical Director