Esxi Health Monitor: Stale Check in Alerts

This forum supports the ESX Host Health Monitor plugin. When posting, include screenshots of the issue and any script and command logs listed in the probe consoles.
TPriest@rocketit.com
Posts: 7
Joined: Tue Jul 17, 2018 12:53 pm

Post by TPriest@rocketit.com »

Good Morning Everyone,

Our ESXi Health Monitor has been acting a little strange lately. About once a week, we receive a stale check-in alert from all of our hosts for every client we have this set up on. After the alert, we get bombarded with actual alerts that we did not receive until later. Thankfully, after the first round of stale check-ins, we mitigated the issues detected on the hosts, but now I'm a little unsure whether the plugin is working as it should, especially since we seem to be getting the stale check-in alerts often. Do you guys have any advice? Thanks so much for your help!

Cubert
Posts: 2430
Joined: Tue Dec 29, 2015 7:57 pm

Re: Esxi Health Monitor: Stale Check in Alerts

Post by Cubert »

The stale check-in alert is a monitor that reads SQL data to see when the last probe of data happened. If it goes off, it believes that data for a given host has not updated recently (within 24 hours). So you can either look into why the probe didn't respond when it ran whenever an alarm hits, or adjust the monitor so it looks at a 28 to 36 hour window, giving the probe more chances to get fresh data.

Make sure your probes are running every 4 to 8 hours, so if any one probe fails there will be two more tries before a stale alarm is sent.
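
As a rough illustration of the stale check described above, here is a minimal Python sketch. It is not the monitor's actual query; the function name `is_stale` and the `window_hours` parameter are hypothetical, and the timestamps are made-up examples.

```python
from datetime import datetime, timedelta

def is_stale(last_probe: datetime, now: datetime, window_hours: int = 24) -> bool:
    """Return True when the newest probe data is older than the window."""
    return now - last_probe > timedelta(hours=window_hours)

# Hypothetical case: last good data at 8:00 AM, checked at 10:00 AM the next day (26 hours).
last_probe = datetime(2018, 8, 1, 8, 0)
now = datetime(2018, 8, 2, 10, 0)

print(is_stale(last_probe, now))                   # True: 26 hours exceeds the 24-hour window
print(is_stale(last_probe, now, window_hours=36))  # False: a wider window tolerates the gap
```

Widening the window only delays the alarm; it does not explain why a probe run was missed.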

Re: Esxi Health Monitor: Stale Check in Alerts

Post by TPriest@rocketit.com »

Thanks Cubert! I had my Scan Cycle set to 4 hours; I'm going to bump that to 8 hours instead just to be safe. If I see anything else I will update the thread. Thanks so much!

Re: Esxi Health Monitor: Stale Check in Alerts

Post by Cubert »

Just make sure that probes are running on agents regularly.

jasonhand
Posts: 5
Joined: Tue Jul 17, 2018 1:43 pm

Re: Esxi Health Monitor: Stale Check in Alerts

Post by jasonhand »

Shannon, this is Jason Hand. I did more research, and it looks like when we see this, it happens to all of our ESX probe servers, and the logs show that the ESX health monitor maintenance script never ran on any of them. It is as if the script was never scheduled after the last reported run. Any idea how to get it running again? We have it set to probe every 8 hours, and as you can see from the script log on one of our probe servers, it hasn't run since 8/29/18 at midnight, so nothing for more than 39 hours.
Attachment: LTClient_2018-08-30_15-35-08.png

Re: Esxi Health Monitor: Stale Check in Alerts

Post by Cubert »

The probes get a script scheduled by the ISync service (a Labtech service operated by the DBagent) based on the probe's schedule (1 to 24 hours). The ISync service runs every 6 minutes and checks to see if there is anything our code wants to run. Our code says to look at the probe scan setting to see when we should run. If the probe says every 8 hours and ISync runs at 8 AM, we schedule scripts; then when ISync runs at 4 PM, we again say we need to schedule scripts, and so on.

This exchange of data between our code and the ISync service is logged on the LT host under c:\program files\labtech\logs\plugin-plugin name.txt and plugin-plugin name-error.txt

Looking at these files should show you when ISync asks for data and when we say it's time to give data (list of probes to schedule the probe script against and the script to schedule).

You should see in logs that we are scheduling script against probe ID XYZ.

If this is not happening, or if there is an error present, post that here.

Otherwise it may be time to reboot the LT host or restart the DBagent.

Re: Esxi Health Monitor: Stale Check in Alerts

Post by jasonhand »

Shannon,

I have attached the plugin_vmwareesxhealthmonitor.txt file and the plugin_vmwareesxhealthmonitor_errors.txt file as a single .zip file since I couldn't attach .txt files.

Can you tell what is going on? Sometimes it will go for 2 or 3 days with no polling, and then all of a sudden all of the monitors trip and we see a success for all of the probes, and then a few days later the cycle starts over.
Attachment: RITESXHealthFiles.zip

Re: Esxi Health Monitor: Stale Check in Alerts

Post by Cubert »

OK, I see issues with your DBagent and ISync.

Here is a snippet of your logs:
LTAgent v120.385 - 8/12/2018 1:00:59 AM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Scanner thread starting:::
LTAgent v120.385 - 8/12/2018 1:00:59 AM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor info HOUR = 1 Interval = 8:::
LTAgent v120.385 - 8/12/2018 2:05:25 AM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Scanner thread starting:::
LTAgent v120.385 - 8/12/2018 2:05:25 AM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor info HOUR = 2 Interval = 8:::
LTAgent v120.385 - 8/12/2018 4:02:57 AM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Scanner thread starting:::
LTAgent v120.385 - 8/12/2018 4:02:57 AM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor info HOUR = 4 Interval = 8:::
LTAgent v120.385 - 8/12/2018 7:05:16 AM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Scanner thread starting:::
LTAgent v120.385 - 8/12/2018 7:05:16 AM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor info HOUR = 7 Interval = 8:::
LTAgent v120.385 - 8/12/2018 8:01:14 AM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Scanner thread starting:::
LTAgent v120.385 - 8/12/2018 8:01:14 AM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor info HOUR = 8 Interval = 8:::
LTAgent v120.385 - 8/12/2018 8:01:14 AM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Maintenance Script Starting:::
LTAgent v120.385 - 8/12/2018 8:01:14 AM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Maintenance Script Starting:::
LTAgent v120.385 - 8/12/2018 8:01:15 AM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Maintenance cleaning up stail records:::
LTAgent v120.385 - 8/12/2018 11:05:03 AM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Scanner thread starting:::
LTAgent v120.385 - 8/12/2018 11:05:03 AM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor info HOUR = 11 Interval = 8:::
LTAgent v120.385 - 8/12/2018 12:02:58 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Scanner thread starting:::
LTAgent v120.385 - 8/12/2018 12:02:58 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor info HOUR = 12 Interval = 8:::
LTAgent v120.385 - 8/12/2018 4:00:55 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Scanner thread starting:::
LTAgent v120.385 - 8/12/2018 4:00:55 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor info HOUR = 16 Interval = 8:::
LTAgent v120.385 - 8/12/2018 4:00:56 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Maintenance Script Starting:::
LTAgent v120.385 - 8/12/2018 4:00:56 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Maintenance Script Starting:::
LTAgent v120.385 - 8/12/2018 4:00:56 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Maintenance cleaning up stail records:::
LTAgent v120.385 - 8/12/2018 5:01:39 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Scanner thread starting:::
LTAgent v120.385 - 8/12/2018 5:01:39 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor info HOUR = 17 Interval = 8:::
LTAgent v120.385 - 8/12/2018 6:01:21 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Scanner thread starting:::
LTAgent v120.385 - 8/12/2018 6:01:21 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor info HOUR = 18 Interval = 8:::
LTAgent v120.385 - 8/12/2018 7:02:49 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Scanner thread starting:::
LTAgent v120.385 - 8/12/2018 7:02:49 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor info HOUR = 19 Interval = 8:::
LTAgent v120.385 - 8/12/2018 8:03:38 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Scanner thread starting:::
LTAgent v120.385 - 8/12/2018 8:03:38 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor info HOUR = 20 Interval = 8:::
LTAgent v120.385 - 8/12/2018 9:03:34 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Scanner thread starting:::
LTAgent v120.385 - 8/12/2018 9:03:34 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor info HOUR = 21 Interval = 8:::
LTAgent v120.385 - 8/12/2018 10:04:50 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Scanner thread starting:::
LTAgent v120.385 - 8/12/2018 10:04:50 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor info HOUR = 22 Interval = 8:::

On August 12th the probes ran twice, once at 8 AM and again at 4 PM. Each entry shows what the run hour was and what the probe scan rate is set to.
LTAgent v120.385 - 8/12/2018 12:02:58 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor info HOUR = 12 Interval = 8:::
See -> HOUR = 12 Interval = 8, which means the current time is 12 PM and we are set to probe every 8 hours.


This shows when we are probing.
LTAgent v120.385 - 8/12/2018 8:01:14 AM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Maintenance Script Starting:::
Since we are set to probe every 8 hours starting at midnight, that would be 00:00 hours, 08:00 hours (8:00 AM), and 16:00 hours (4:00 PM).
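
To make the schedule arithmetic concrete, here is a small Python sketch. It assumes a probe is due whenever the hour, counted from midnight, is evenly divisible by the interval. That matches the 8 AM and 4 PM runs in the log, but it is a guess at the rule, not the plugin's actual code, and `probe_hours` is a hypothetical name.

```python
def probe_hours(interval: int) -> list[int]:
    """Hours of the day (0-23) on which a probe is due, counting from midnight."""
    # Assumption: a probe is due when the hour is a multiple of the interval.
    return [hour for hour in range(24) if hour % interval == 0]

print(probe_hours(8))  # [0, 8, 16]
print(probe_hours(4))  # [0, 4, 8, 12, 16, 20]
```

With an 8-hour interval there are only three eligible hours per day, so a single skipped hour removes a third of that day's probes.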

Now look at the full log. The DBagent "should" be firing off ISync every 6 minutes and relaunching at exactly the top of every hour. You should see the following for every hour of the day, every day:
LTAgent v120.385 - 8/12/2018 4:00:55 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor info HOUR = 16 Interval = 8:::
So from midnight to midnight you should see 24 entries like this, where "HOUR" is the hour it ran (0-23), and you are missing hours all throughout the logs.

So your problem is that with the interval set to 8 and ISync skipping anywhere from 1 to 12 hours a day, you get gaps in probes whenever ISync skips the "Hour" the probe should be running.
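
One way to spot the skipped hours without reading the log by eye is a short script like the Python sketch below. It assumes the log keeps the line format shown in the snippet above; the function name `missing_hours` and the regular expression are illustrative, not part of the plugin.

```python
import re

# Matches the date and the reported hour in "info HOUR = N Interval = M" entries.
LINE = re.compile(
    r"- (\d{1,2}/\d{1,2}/\d{4}) \d{1,2}:\d{2}:\d{2} [AP]M - .*HOUR = (\d+) Interval = \d+"
)

def missing_hours(log_text: str, day: str) -> list[int]:
    """Return the hours (0-23) with no 'HOUR = N' entry on the given M/D/YYYY date."""
    seen = {int(m.group(2)) for m in LINE.finditer(log_text) if m.group(1) == day}
    return [hour for hour in range(24) if hour not in seen]

# Three entries copied from the snippet above (8 AM, 11 AM, and 12 PM).
sample = """\
LTAgent v120.385 - 8/12/2018 8:01:14 AM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor info HOUR = 8 Interval = 8:::
LTAgent v120.385 - 8/12/2018 11:05:03 AM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor info HOUR = 11 Interval = 8:::
LTAgent v120.385 - 8/12/2018 12:02:58 PM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor info HOUR = 12 Interval = 8:::
"""

gaps = missing_hours(sample, "8/12/2018")
print([h for h in gaps if 8 <= h <= 12])  # [9, 10]
```

Running the same function over each plugin's log for the same date would show whether the gaps line up across plugins.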

So I would start by looking at the logs for all plugins being used there. See if there is any consistency between the logs and the missing hours.

See if multiple logs all have the same times missing. If so, you may have an LT host issue bigger than the plugin that is just showing up in the plugin.

Meanwhile, changing the probe interval to a smaller number should allow more chances per day for the probe to fall on an hour where ISync is working as it should.

Re: Esxi Health Monitor: Stale Check in Alerts

Post by Cubert »

Example of missing time:
LTAgent v120.385 - 8/12/2018 8:01:15 AM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Maintenance cleaning up stail records:::
LTAgent v120.385 - 8/12/2018 11:05:03 AM - Plugin VMWareESXHealthMonitor, Version=4.0.0.57, Culture=neutral, PublicKeyToken=null: P4L ESX HealthMonitor Scanner thread starting:::

Where are the 9 AM and 10 AM ISync runs? The logs are missing 2 entries. This is not a plugin issue but ISync and the DBagent not running the services.

Also, I see ISync running the probe at 11:05 and not 11:01 or earlier. We have a 6-minute test where we do not run if executed after the 6-minute mark. The reason for this is that ISync relaunches every 6 minutes, so we could be executed at 00:01 and again at 00:07 and actually run 2 probes in any given hour. So we test to see if we are "at the top of the hour".

So, theoretically, we could be seeing a very slow ISync service that is taking longer to get through all the tasks it has to do, and by the time it gets to our task the time has passed our limit.
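
The top-of-the-hour test can be sketched in Python roughly as follows. This is only an illustration of the idea under stated assumptions: the function name `at_top_of_hour` and the `max_minute` parameter are hypothetical, and the cutoff of 6 is taken from the description above rather than from the plugin source.

```python
from datetime import datetime

def at_top_of_hour(now: datetime, max_minute: int = 6) -> bool:
    """True when the call lands in the first minutes of the hour."""
    return now.minute <= max_minute

# ISync passes six minutes apart: only the first one falls inside the window.
print(at_top_of_hour(datetime(2018, 8, 12, 8, 1)))  # True
print(at_top_of_hour(datetime(2018, 8, 12, 8, 7)))  # False

# A slow pass that first reaches the plugin at 8:09 is skipped, so that hour is missed.
print(at_top_of_hour(datetime(2018, 8, 12, 8, 9)))  # False
```

The guard prevents double runs within an hour, at the cost of skipping the hour entirely when ISync arrives late.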

I can look at adding a log entry prior to the test, which will log every 6 minutes (on every execution, no matter the time) so you can see when ISync gets around to running services. This would at least show when ISync executes our plugin code.

Re: Esxi Health Monitor: Stale Check in Alerts

Post by Cubert »

We have made some adjustments to ISync in the update just posted

4.0.0.58 -> https://delivery.shopifyapps.com/-/97de ... e0a8ced8b1

Install this update, restart the DBagent, and monitor the logs. We now log whether we are at the TOTH or not.

TOTH = Top Of The Hour

We also adjusted a setting where we said in code

Code:

If "Minute" > 6 then skip 


That should actually have been 7, not 6. We want to execute between 00 and 06, which is the first run of ISync each hour.
