Automate Remediation for High CPU/Memory Processes and Services on Windows

23 Jun 2020 | Systems Management

We have all had an application or service that runs away with itself and consumes all of the compute resource on a server, where you know that simply restarting the service or killing the process will restore operation. For some, this is the precursor IT fix before rebooting the machine, as that invariably gets the right result. If you have a slow-running machine or application and you happen to be on the desktop of that machine, you are going to look at Task Manager to see what is consuming all of the compute resource.

If you are using a monitoring solution, such as SolarWinds Server and Application Monitor (SAM), you are no doubt monitoring those key servers. Now comes the big question: do you know with absolute certainty (and I do mean absolute) that if your server continuously consumes more than, say, 95% of a resource, stopping the process with the highest resource usage will resolve the issue? This is a big if. In case you have not already picked it up, this post gives you a method within Orion itself to automate that operation, but you must be at that 100% certainty to apply the capability outlined here.

If you are at this place, then having Orion identify the issue and fix it, so that all you need to see is the alert notification confirming it did so, would make you kinda happy, right?

OK, I have given enough of a warning about configuring this function in your Orion, so let's get to it.

First of all, you need to be monitoring your servers within Orion SAM, which will provide CPU and memory utilisation for the whole system.

[Image: Percent Memory Used]

You can see from the chart above that the server “PROS_DEV-01” has started to run above 90% memory utilisation. Our alert definition uses this to identify that we are in a known condition for which the automated remediation, killing the process causing the problem, should be applied.

[Image: Trigger Condition]

As the screenshot above shows, our alert trigger condition checks whether a server exceeds the critical-level threshold value, which can be managed globally or at node level using the “Edit Node” page, as shown below:

[Image: Application Connection Setting]

Within our alert definition, we have a number of actions: capture the top processes causing high memory usage and include them in the external notification, then run our automatic remediation to kill the top process causing the problem.

[Image: Trigger Actions]

Now let’s look at these actions in detail:

Action 1: Execute program: SolarWinds.APM.RealTimeProcessPoller.exe

This action is of type “Execute external program” and runs the SolarWinds® out-of-the-box tool “SolarWinds.APM.RealTimeProcessPoller.exe” to poll in real time for the top 5 processes sorted by “Physical Memory” and insert this information into the “Alert Notes” area of the triggered alert object. This is the actual command run by the action:

APM\SolarWinds.APM.RealTimeProcessPoller.exe -n=${NodeID} -alert=${N=Alerting;M=AlertDefID} -count=5 -sort=PhysicalMemory -timeout=120

The parameters available for this SolarWinds tool are listed below; further details can be found in the related SolarWinds article.

-n             ID of the node (NodeID) to poll.
-count         The number of processes to show.
-sort          The criterion used to select the top processes:
  • CPU – Processor time. This is the default value if the command line argument is not specified.
  • PhysicalMemory – Process physical memory.
  • VirtualMemory – Process virtual memory.
  • DiskIO – Process disk I/O per second.
-timeout       Timeout for polling, in seconds.
-alert         The AlertDefID of the associated triggered alert. If this argument is provided, the alert notes are updated with the results from polling.
-activeObject  The ActiveObject property of the associated triggered alert. If this argument is not provided, NodeID is used.
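
If you want to try different parameter combinations before pasting them into the alert action, a small helper can assemble the argument string for you. This is only a sketch: New-PollerArguments is a hypothetical function of my own naming, not something shipped with Orion, and the defaults for -count and -timeout are simply the values used in the command above rather than documented defaults of the tool.

```powershell
# Sketch only: New-PollerArguments is a hypothetical helper, not part of Orion.
# It assembles the argument string for SolarWinds.APM.RealTimeProcessPoller.exe
# from the parameters described in this post.
function New-PollerArguments
{
    param(
        [Parameter(Mandatory = $true)][string]$NodeId,
        [string]$AlertDefId,
        # Only the sort criteria listed in this post are accepted
        [ValidateSet('CPU', 'PhysicalMemory', 'VirtualMemory', 'DiskIO')]
        [string]$Sort = 'CPU',
        [int]$Count = 5,       # assumption: value taken from the example command
        [int]$Timeout = 120    # assumption: value taken from the example command
    )

    $parts = @("-n=$NodeId")
    if ($AlertDefId) { $parts += "-alert=$AlertDefId" }
    $parts += "-count=$Count"
    $parts += "-sort=$Sort"
    $parts += "-timeout=$Timeout"

    return ($parts -join ' ')
}

# Single quotes keep the Orion macros literal so Orion can substitute them later
New-PollerArguments -NodeId '${NodeID}' -AlertDefId '${N=Alerting;M=AlertDefID}' -Sort PhysicalMemory
```

The ValidateSet attribute rejects a mistyped sort criterion when you call the helper, instead of leaving you to find out when the alert fires.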

Action 2: Send an Email

This email notification action is executed 2 minutes after the first action, giving the first action enough time to pull the top processes and update the Alert Notes.

Email Subject: Memory load on ${Caption} is currently ${PercentMemoryUsed}

Email Message:

The Memory on ${Caption} is currently running at ${PercentMemoryUsed}. The top 5 processes running at the time of this poll are listed below:

${N=Alerting;M=Notes}
For more information click the link below.
${NodeDetailsURL}

These parameters allow us to send the email notification to the relevant IT contact, informing them of the high memory usage on a server along with the list of the top 5 processes consuming the most memory. You may wish to stop here: with this information in the external notification and visible on the Alert Details page in Orion, you may have enough to take action manually.

If you want to automate, then buckle up..

Action 3: Execute program: KillWindowsProcess.ps1

This action is of type “Execute external program” and runs a custom PowerShell script that gets the top process information from the alert notes and kills that process on the server that triggered the alert. This is the actual command run by the action:

C:\Windows\SysWOW64\WindowsPowerShell\v1.0\powershell.exe "C:\Webinar_Scripts\KillWindowsProcess.ps1" "'${N=SwisEntity;M=Caption}'" "'${N=Alerting;M=AlertObjectID}'"

As shown in this command, we placed the custom script “KillWindowsProcess.ps1” in the folder “C:\Webinar_Scripts” on the primary SolarWinds server. The script takes two parameters: the first is the server name and the second is the alert object ID.
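
Because the script reads the alert object ID from $args[1] and concatenates it straight into a query, you may want to check that it really is a number first. The following is a minimal sketch of one way to do that. ConvertTo-AlertObjectId is a hypothetical helper name of mine, not part of the script in this post.

```powershell
# Sketch only: ConvertTo-AlertObjectId is a hypothetical helper, not part of the original script.
function ConvertTo-AlertObjectId
{
    param([string]$Value)

    # Remove surrounding whitespace and any stray quote characters from the command line
    $clean = $Value.Trim() -replace "^['""]+|['""]+$", ''

    $id = 0
    if ([int]::TryParse($clean, [ref]$id) -and $id -gt 0)
    {
        return $id
    }

    throw "Alert Object ID '$Value' is not a positive integer"
}
```

It could then be used as $AlertObjectID = ConvertTo-AlertObjectId $args[1], so the script stops early if the macro was not substituted as expected.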

The action should be configured to run with a domain account that can connect to the destination server and kill processes on it:

[Image: Configure Action - Execute An External Program]

The “KillWindowsProcess.ps1” script connects to the SolarWinds API to pull the top process information from the alert note, then checks whether the process found is in the protected process list. Defining a protected process list is recommended to prevent critical processes from being killed. We do NOT want to render a server inoperable! If the process is not in the protected list, the script tries to kill that process on the destination server. The script writes a log file to the “C:\Webinar_Scripts\logs” folder for each day, which can be reviewed to diagnose faults in the actions performed. Here is the content of the script:

#Logging Details
$WORKING_FOLDER = "C:\Webinar_Scripts\"
$LOGFILE = $WORKING_FOLDER + "logs\action_trigger_log_" + $(get-date -f yyyy-MM-dd) + ".txt"

# List of protected processes that should not be terminated
# (the SQL Server engine executable is named sqlservr.exe)
$PROTECTED_PROCESSES =  @('sqlservr.exe','WmiPrvSE.exe', 'svchost.exe')

# Variables
$targetProcessName = ""
$targetHostName = $args[0]
$AlertObjectID = $args[1]

$time = get-date
$message = "$Time - Started action for Hostname: $targetHostName and Alert Object ID: $AlertObjectID"
$message >> $LOGFILE

# SolarWinds Connection Properties
$hostname = "SOLARWINDS_HOSTNAME"
$username = "SOLARWINDS_API_USER"
$password = "SOLARWINDS_API_PASSWORD"

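# Connect-Swis comes from the Orion SDK; load the SwisPowerShell module if the cmdlet is not already available
if (-not (Get-Command Connect-Swis -ErrorAction SilentlyContinue))
{
    Import-Module SwisPowerShell
}

$swis = Connect-Swis -Hostname $hostname -username $username -password $password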

# Get Alert Note for the provided Alert Object
$alertQuery = "SELECT TOP 1 AlertNote FROM Orion.AlertObjects WHERE AlertObjectID=" + $AlertObjectID

$alertData = Get-SwisData $swis $alertQuery

# Split the note into non-empty lines; an alert with no note yields an empty list
$noteLine = @()
if ($alertData)
{
    $noteLine = $alertData.Split([Environment]::NewLine,[System.StringSplitOptions]::RemoveEmptyEntries)
}

$noteLineCount = $noteLine.Count

# Parse Alert Note to get Top Process Information
if ($noteLineCount -gt 1)
{
    $headRow = $noteLine[0]
    $topRow = $noteLine[1]

    $headItems = $headRow.Split("`t",[System.StringSplitOptions]::RemoveEmptyEntries)

    $headerItem = $headItems[0]

    if($headerItem -eq 'Name')
    {
        $topProcItems = $topRow.Split("`t",[System.StringSplitOptions]::RemoveEmptyEntries)
        $targetProcessName = $topProcItems[0]

    }

}

if (($targetProcessName -ne "") -and ($targetProcessName -ne $null))
{
    $time = get-date
    $message = "$Time - Identified top process is : $targetProcessName for Alert Object ID: $AlertObjectID"
    $message >> $LOGFILE

    $IsProcessProtected = 0 

    forEach($p in $PROTECTED_PROCESSES)
    {
        if($targetProcessName -eq $p)
        {
            $IsProcessProtected = 1
        }
    }
    if($IsProcessProtected -eq 0)
    {

        $Processes = Get-WmiObject -Class Win32_Process -ComputerName $targetHostName -Filter "name='$targetProcessName'"

        $foundProcessCount = $Processes.Count
        $foundProcessName = $Processes.Name

        if (($foundProcessCount -gt 0) -or ($foundProcessName -eq $targetProcessName))
        {
            foreach ($process in $Processes)
            {
                $returnval = $process.terminate()
                $processid = $process.handle

                if($returnval.returnvalue -eq 0)
                {
                    #write-host "The process $targetProcessName `($processid`) terminated successfully"
                    $time = get-date
                    $message = "$Time - The process $targetProcessName `($processid`) terminated successfully"
                    $message >> $LOGFILE
                }
                else
                {
                    #write-host "The process $targetProcessName `($processid`) termination has some problems"
                    $time = get-date
                    $message = "$Time - The process $targetProcessName `($processid`) termination has some problems"
                    $message >> $LOGFILE
                }
            }
        }
        else
        {
            $time = get-date
            $message = "$Time - The process $targetProcessName can not be found on Hostname: $targetHostName "
            $message >> $LOGFILE
        }
    }
    else
    {
        $time = get-date
        $message = "$Time - The process $targetProcessName is in the protected list, it will NOT be terminated"
        $message >> $LOGFILE
    }
}
else
{
    #write-host "Unable to find target process for Alert Object ID: $AlertObjectID"
    $time = get-date
    $message = "$Time - Unable to find target process for Alert Object ID: $AlertObjectID"
    $message >> $LOGFILE
}

$time = get-date
$message = "$Time - Completed action for Hostname: $targetHostName and Alert Object ID: $AlertObjectID"
$message >> $LOGFILE
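
The alert-note parsing in the script relies on the note being tab-separated, with a header row whose first column is “Name” followed by one row per process. If you want to test that logic without a live alert, it can be pulled out into a function. This is a sketch under that same assumption; Get-TopProcessName is a hypothetical name, and the extra columns in the sample note are made up for illustration.

```powershell
# Sketch only: Get-TopProcessName is a hypothetical helper mirroring the parsing in the script.
# Assumption: the note is tab-separated, the first header column is 'Name',
# and the first data row is the top process.
function Get-TopProcessName
{
    param([string]$AlertNote)

    if ([string]::IsNullOrEmpty($AlertNote)) { return $null }

    # Split on CRLF or LF and drop blank lines
    $lines = @($AlertNote -split "\r?\n" | Where-Object { $_.Trim() -ne '' })
    if ($lines.Count -lt 2) { return $null }

    # The header row must start with 'Name', as the script expects
    $header = @($lines[0] -split "`t" | Where-Object { $_ -ne '' })
    if ($header.Count -eq 0 -or $header[0].Trim() -ne 'Name') { return $null }

    # The first column of the first data row is the process name
    $firstRow = @($lines[1] -split "`t" | Where-Object { $_ -ne '' })
    if ($firstRow.Count -eq 0) { return $null }

    return $firstRow[0].Trim()
}

# Made-up sample note; the real column set may differ
$sample = "Name`tPID`tPhysical Memory`r`nnotepad.exe`t4242`t1024`r`nchrome.exe`t5150`t512"
Get-TopProcessName $sample   # notepad.exe
```

Returning $null for anything unexpected means the caller can log “unable to find target process” and do nothing, which is the safe outcome for an automated kill.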

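The protected-list check in the script loops over the list and sets a flag. The same check can be expressed with the -contains operator, which compares strings case-insensitively by default. This is a minimal sketch; Test-ProtectedProcess is a hypothetical name and the list shown is only an example.

```powershell
# Sketch only: Test-ProtectedProcess is a hypothetical helper, not part of the original script.
function Test-ProtectedProcess
{
    param(
        [Parameter(Mandatory = $true)][string]$ProcessName,
        [Parameter(Mandatory = $true)][string[]]$ProtectedList
    )

    # -contains is case-insensitive by default, so 'SVCHOST.EXE' matches 'svchost.exe'
    return ($ProtectedList -contains $ProcessName.Trim())
}

# Example list only; use your own protected processes
$protected = @('WmiPrvSE.exe', 'svchost.exe')
Test-ProtectedProcess -ProcessName 'SVCHOST.EXE' -ProtectedList $protected   # True
Test-ProtectedProcess -ProcessName 'notepad.exe' -ProtectedList $protected   # False
```
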
This is an example of using the Alerting capabilities to take remediation action to resolve a problem, which I hope you find useful, but please use with care!

Training Course: SolarWinds Training Courses

Mark Roberts

Technical Director

Mark Roberts is the Technical Director at Prosperon Networks and a SolarWinds MVP. Mark has been helping customers meet their monitoring needs with SolarWinds IT Management Solutions for over 14 years.
